Translation Tribulations: embedded objects

May 18, 2014

memoQ 2014: a first look

I couldn't make it to memoQfest this year - the first one in Budapest that I have missed since the event began in 2009. But the first family visit since that same year took priority, so my exposure to the upcoming memoQ 2014 version was strictly second hand until today.

I wasn't too happy with things I heard on the Yahoogroups user list. In fact, when I read one message describing how the new transcription feature for bitmap graphics in some files required the Project Manager version, I was quite annoyed. The reality - a whole month before the official release - is very good for both freelancers and corporate outsourcers, and I think by the time this version makes its official debut in June there will be many good reasons to smile. I'm frankly amazed at how much Kilgray seems to be getting its act together and balancing the needs of users at all levels.

This afternoon I downloaded the first test release (alpha??) of memoQ 2014, installed it and began to take a cautious tour. My first impression was that it looked the same. And then, bit by bit, subtle and excellent small differences began to emerge. I looked for and found major new features I had heard about and discovered many interesting things not mentioned along the way.

The grammar checking feature seems to be implemented in a sensible way, though it actually doesn't work at all right now for me. But I can see where it's headed, and it is going in a good direction.

I had a quick look at the new plug-ins, particularly TaaS, and made notes about testing the potential for teamwork. What I have seen of TaaS for its much-advertised terminology extraction is a huge disappointment, and those who have followed my comments on Twitter will know I have nothing good to say about this EU boondoggle, but I see potential for other possibilities that nobody has really talked about, and if my instinct is right, this could be really useful. But I will need to invest a lot of testing time for the approach I have in mind.

The Project home view has gotten even more impossibly cluttered with the addition of "People", a rather sensible reworking of role assignments that even in the Translator Pro version clearly acknowledges that most freelance translators are not, in fact, 'islands' in their work.

This will surely make the small screen (netbook) usage problems worse if Kilgray does not redesign the view a bit, but in every other respect I see this as a significant improvement of project workflow, emphasizing the relationships between project participants in a better way.

One little bit that I stumbled across was the new way of handling the export of unfinished translations. This is a nice way of recognizing the frequent pressure in some projects to export incomplete stages of work.

I have had ways of dealing with this need for years in memoQ, but this new approach will make things simpler and obvious for all users.

There is a nice little feature for tracking time too:

This will facilitate record keeping for some jobs involving time charges.

The feature I have looked at in some depth so far, which makes me very happy, is Kilgray's very sophisticated handling of embedded objects and graphics, which sets new standards in many ways. I think there is still a key feature missing to make it the equal of OmegaT for handling charts with data stored as XML in the MS Office file (though I have not had time to check this yet), but what I have seen so far goes way beyond similar features I have seen in STAR Transit and Déjà Vu X2.

Embedded objects and images are imported as separate files from within the media and embeddings folders of the Microsoft Office file. I see a few potential problems with the current way of displaying a file and its objects and media. I've had projects with multiple files having embedded Excel spreadsheets, PowerPoint slides and other objects as well as any number of pictures needing to be localized. One recent project had 59 spreadsheets embedded in a DOCX file. Without an accordion or tree structure to collapse the subordinate structure view and show the embedded content again, the overview will be lost quickly. But this is a very good start. Note how the main file includes a count of the segments in the subordinate objects and graphics. (And take note of the new progress bar with different colors for different process stages like translation and proofreading.)

Bitmap texts can be recorded with a new transcription feature, which is also compatible with voice recognition. I dictated my German source texts with Dragon NaturallySpeaking set to German, then switched to English for the translation. And of course the bitmap transcriptions are included in the word counts of the Statistics functions and the translations are written to the translation memory. I believe this is utterly unique in translation environment tools. Fluency has a transcription module too, of course, but its purpose and application are very different.

At present, the exported translations with translated objects will look like they are not done, because the difficult refresh problem has not been solved by Kilgray. Each translated spreadsheet, slide, etc. will need to be opened in the document before the translation will become visible. This is much easier using the macro I published two years ago, and I am certain that by release time or soon thereafter Kilgray will find an elegant way of dealing with this difficulty. Atril, as I recall, handles the same problem by distributing macros.

In the recent Kilgray blog post on the six reasons to upgrade to memoQ 2014, the only overlap with the above points is the image localization. Peter Reynolds talks instead about other good stuff, such as the long-awaited project templates and Language Terminal. There are so many nice things ahead with this upgrade that we'll all just have to take it slowly, one bit at a time.

Of course the usual precautions for any new software version apply. The new version can be installed in parallel to your current version, and it can be tested while you continue to do the bulk of your work in the older, stable version. Typically it takes a few months for any new version to get the kinks out, but this allows plenty of time for planning the transition and preparing to take full advantage of the new features relevant to you. Migration is also not a trivial matter in many cases, but this time around there may be a little more help with that. More on that another time!

Jul 10, 2013

Coping with objects and graphics to translate in Microsoft Office documents

About a year ago, I published a series of posts describing a simple way to get at the objects and graphics embedded in Microsoft Office documents, such as Microsoft Word DOCX documents or PowerPoint PPTX presentations. These investigations were inspired by a series of jobs where I had to cope with up to 60 embedded Excel tables in a Microsoft Word document. The four related posts are:

The post titles may differ a little from the text in the links here, which is updated for a little more clarity.

I've also added two short videos to my YouTube channel which illustrate how to remove embedded objects from a DOCX for translating separately from the Microsoft Word document and how to put them back afterward.

Here's how to extract the embeddings folder from the DOCX file:

And here is how to put the translated embedded objects into the DOCX file and refresh the view of the embedded objects in your translation:

These and other videos I've produced recently are part of a new effort to develop integrated courses for self-instruction and review with software tools used by many of us. These courses use the Moodle platform and offer text, screenshots, audio, video and data files such as examples of file formats to translate, backups of memoQ practice projects to restore on your local computer for training, configuration resources for memoQ, useful macros to support work with many translation environment (CAT) tools and a host of other resources and learning links.

Oct 21, 2012

Put OCR in Your Business Model

This article originally appeared on an online translators portal four years ago and was long overdue for removal there. Here is an update.

*****

Optical character recognition (OCR) software is discussed often online and at translators' events, usually in the context of how to deal with PDF files. Hector Calabia, Peter Linton and others have made useful technical contributions on this subject in articles and forums and at various conferences. However, it is useful to consider OCR software in a broader translation business context. Document conversion is often very useful for translation purposes and greatly facilitates automated quality checks of the draft, for example, but OCR can also generate additional income for your business and reduce quotation risk.

OCR for translation
There are a number of programs available for this purpose, and which one is best for your purposes may depend on the language combinations you deal with and other factors. For years now I have used Abbyy FineReader, because years ago it gave the best test results for the particular set of European languages one of our clients offered. It is also relatively inexpensive (I paid about 100 euros for FineReader 11) and easy to use.

Many OCR conversions of TIFF, JPEG and PDF documents which I receive from agencies are difficult to use for translation purposes and require significant modification - if they can be used at all. Problems are particularly likely where TM tools are to be used or target texts differ significantly in length (especially when they are longer). The best ways to avoid these problems are to:

avoid automatic settings for OCR conversions; use zone definitions instead
avoid saving the converted texts with full formatting in most cases
use a suitable post-OCR workflow to clean up the converted document by joining broken sentences, removing superfluous characters, fixing conversion errors, etc.

If the idea of doing individual zone definitions on each page of a 100-page document is intimidating, take heart. Programs such as Abbyy FineReader often allow you to define layout templates, speeding up the work considerably. One translator I know became so skilled at the use of these OCR templates and was so good with his conversions that agencies hire him just to do high-quality OCR work for them. Which brings me to….

OCR as an income-generating activity for the translator or agency
Hardcopy, scanned documents, faxes and PDF documents generally require more work for translators than electronically editable documents and require different, sometimes more fallible quality control measures than a typical workflow for a translator using original electronic documents in a translation memory system. If no conversion is performed, it is more time-consuming to check terminology or use concordances during the translation, and it is also unfortunately too easy for eyes to skip over bits of text. Under time pressure this can lead to very serious problems. Even with conversion, the OCR text requires careful checking against the original document to identify and correct any errors introduced (and there will be some at times with even the best OCR software). So it is not at all unreasonable for a translator to charge a higher rate for dealing with hardcopy, scanned documents, faxes and PDF documents.

There are a number of ways to incorporate these higher charges into your business model. The two obvious ways are a premium (surcharged) word/line/page rate and hourly service charges. I usually offer both options to my clients, with the word/line rate surcharge representing the “fixed” rate and the hourly rate the “flexible” rate where I make a non-binding estimate and they may end up paying more or less according to the actual effort. For pure OCR conversion jobs where I am not doing the translating, I charge a typical proofreading rate or a bit more, because I go through the entire document and see that it is correctly formatted for translation work and that obvious errors are fixed (i.e. basic spellcheck, etc.).

Sometimes I hear that “the client doesn’t want to pay for that”. Well, that’s OK, too. The client has the option of doing the work and doing it right and saving me the effort. The recognition that there is additional effort involved and that this effort should be compensated is important. But usually there is a way to sugar-coat the "bitter" cost pill, and this is where your marketing savvy comes into play. Some win-win arguments you might present include:

the availability of an editable source text the client can use for future versions;
the ability to create TM resources using the OCR text (which can save time/money later);
potentially better quality assurance, especially with tight deadlines.

Returning a clean, nicely formatted OCR of the source document is often good "advertising". End clients may appreciate how this saves time and allows them to use the original text in a variety of ways (attorneys may like to quote arguments from the opposing side, and copy/paste beats retyping). Discriminating agencies may recognize your skill at creating documents that don’t go crazy when edited (because of screwy text boxes, bad font definitions and other format errors) and offer you more work. If your language pair is in low demand or is very competitive, this may be one more way of distinguishing yourself from the pack.

I got started doing OCR work and charging for it after suffering through the conversion of several long PDF documents by more manual methods. I finally wised up, bought FineReader and started to use it with most of the hardcopy, scanned documents, faxes and PDF documents I received simply because it enabled me to use my TM tools and do better quality checks. I started sending the cleaner-looking source texts converted with OCR along with the target text translations, and soon I started getting requests for paid OCR work. A number of my agency clients then began to buy OCR tools and use them with varying degrees of success. Even if they do all the conversion work, I still win if they do it right, because I save time for what I enjoy more – the translation.

OCR as tool for quotation
Some people I know still haven’t learned to do a high-quality OCR (or they don’t care to), but they still use the software effectively in a very important area of their business: quotation and risk limitation.

There are lots of good tools out there for text counting, which is important to many methods of costing and time planning in the translation business. Some people even still do it manually, which, though time consuming, is not a bad way of checking the numbers from an electronic estimate. A number of factors can result in text counts being too low - embedded objects, such as Excel tables or PowerPoint slides in a Microsoft Word document, or graphics with text - or even too high (as is the case with at least one CAT tool counting RTF and MS Word files). Keep using whichever method you prefer - I won't try to persuade you that any one approach is best. I use a number of methods myself.

When translating larger documents, however, or documents with a complex structure, it is often useful to have a “sanity check” for your text counts. On a number of occasions I have received translation jobs from agency clients where the text count was given as X words, where in fact there were quite a few more words embedded in Excel objects, bitmap graphics, Visio charts, etc. which had not been measured by the method used. In a few cases these clients had to take a loss on the job after giving a fixed price bid to the end client. Using OCR to check your estimates can prevent such an unfortunate scenario.

To do this, print the document (whatever it is) to a PDF file. Then run the PDF file through an OCR program with automatic settings (to save time – you don’t need to translate this OCR). Save the text and count it. There will probably be a bit more text due to headers or footers or perhaps garbage from graphics, but the results should be close to your other estimate. (You can always subtract an appropriate factor for the text count in headers and footers to improve your OCR estimate.) If there is a major deviation, this is a clear sign that you should take a much closer look at the document(s) before quoting the job.
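The comparison itself is simple arithmetic and can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the crude whitespace-based word count and the 10% threshold are made up for the example, and your counting tool will certainly count a little differently.

```python
import re


def count_words(text):
    """Count whitespace-separated tokens containing at least one letter or digit."""
    return sum(1 for token in text.split() if re.search(r"\w", token))


def count_deviation(estimate_words, ocr_words):
    """Relative difference of the OCR count from the original estimate."""
    if estimate_words <= 0:
        raise ValueError("estimate_words must be positive")
    return (ocr_words - estimate_words) / estimate_words


def needs_closer_look(estimate_words, ocr_words,
                      header_footer_words=0, threshold=0.10):
    """Flag a job when the OCR count, after subtracting an allowance for
    headers and footers, deviates from the estimate by more than the
    threshold (10% here, an arbitrary assumption)."""
    adjusted = ocr_words - header_footer_words
    return abs(count_deviation(estimate_words, adjusted)) > threshold


# Example: the agency quoted 1000 words, the OCR text has 1400.
# needs_closer_look(1000, 1400) flags the job for a closer look.
```

A large positive deviation is the case to worry about most, since it suggests text in objects or graphics that the first count missed.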

Searchable scanned documents
Another use I have found for OCR in recent years is creating searchable "text-on-image" documents from scanned PDFs, TIFF files and other bitmap formats. Although I have used these searchable PDFs mostly for reference while I work (searching for bits of text while viewing the original, unadulterated context) and supplied them to clients on only a few occasions, the potential for an additional value-added service is fairly obvious in this case.

Conclusion
OCR software is an essential tool for the work of many translators today, even more so than CAT software in many cases. Not just a tool for recovering “lost” electronic documents or making legacy typed material more accessible for translation work, it also offers possibilities for generating additional projects and income, differentiating one’s services and reducing risks when quoting large jobs. Key features of whatever OCR you choose should include the ability to select text areas for conversion and to determine their sequence in the converted text (using user-defined zones). Various options for saving the converted text (full page format, limited text formatting and no formatting) are also very helpful. Most important of all, though, is a good quality-checking workflow for your OCR documents (possibly including formatting) to avoid difficulties in the translation process and ensure that your work has a polished, professional appearance.

OCR software is another good tool for improving your visibility with clients and making your work processes easier in an age when many archiving and ERP systems are focused on the retention of PDF documents or TIFFs and even actively discourage saving original formats. The major providers of this software often have free, functional demonstration versions to use before making a purchase decision. Try several options and choose the best one for you. You won’t be sorry.

Aug 29, 2012

Replacing bitmap graphics in MS Office 2007/2010

A few weeks ago I published a series of posts about different aspects of embedded objects in DOCX, XLSX and PPTX files and how to translate them a bit more conveniently. When I described this procedure to a colleague this afternoon, he thought it was a solution for convenient substitution of bitmap graphics in these files as well. Well, sort of, but not quite.

If you rename the extension of a DOCX file to ZIP and open the ZIP file using Windows Explorer (in order not to mess up the compression), you will see a folder named word.

Inside this word folder are other folders of interest:

The embeddings folder has the objects such as Excel tables or PowerPoint slides described in previous posts. The bitmap graphics or pictures are in the media folder, however:

The view inside the media folder above shows one bitmap graphic (the JPEG file) and various other files with images of the embedded objects (an equation, an Excel table and a PowerPoint slide). Only the bitmap files are of real interest. If other graphic files localized for the target language are named the same as the original files in the media folder and substituted there, when the ZIP file is renamed to have its original extension, the substituted graphics will appear in the document the next time it is opened.

This way, for example, screen shots for an entire file can be substituted quickly. One could, of course, do this by a number of other means, but this way is fairly convenient and could probably be automated without much ado if your organization needs to make such substitutions a lot.
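Such an automation might look something like the following rough Python sketch. It is not a tested production tool: it writes a new copy of the DOCX with Python's standard zipfile module rather than editing the archive in Windows Explorer, so the result should be checked in Word before it goes anywhere. The function name substitute_media and the folder of localized graphics are hypothetical.

```python
import zipfile
from pathlib import Path


def substitute_media(source_docx, target_docx, localized_dir):
    """Write a copy of source_docx to target_docx (a different path) in which
    each file in word/media is replaced by a same-named file from
    localized_dir, if one exists. Returns the names of the replaced files."""
    localized = {p.name: p for p in Path(localized_dir).iterdir() if p.is_file()}
    replaced = []
    with zipfile.ZipFile(source_docx) as src, \
            zipfile.ZipFile(target_docx, "w", zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            data = src.read(info)
            name = info.filename.rsplit("/", 1)[-1]
            if info.filename.startswith("word/media/") and name in localized:
                data = localized[name].read_bytes()
                replaced.append(name)
            # Entry names and order stay the same; only the bytes change.
            dst.writestr(info.filename, data)
    return replaced


# Example with made-up paths:
# substitute_media("manual_de.docx", "manual_en.docx", "screenshots_en")
```

Because the original file is only read and never modified, a failed attempt costs nothing but the copy.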

Addendum: I was curious about those other files in the media folder - the ones with the views of the embedded objects. So I deleted them to see if they would re-generate when the document is opened. Instead, this message was displayed at the location of each object:

Double-clicking the "broken" object display opened the object and restored the view. So clearly, refreshing object views involves updating the content of the media folder.

Aug 4, 2012

Coping with embedded "BIN" objects in MS Office documents

When I published a procedure for getting at embedded objects in Microsoft Office documents, I mentioned that older documents in MS Office 2003 formats could be saved as Office 2007/2010 equivalents in order to access the embedded objects via Windows Explorer after renaming the extension to ZIP. What I failed to mention is that older format embedded objects are stored with a BIN extension, not the proper extension of the application with which they are associated. The icon above, for example, is for an embedded PowerPoint 2003 slide.

There are a few ways of dealing with this. If you know what the object should be, just re-name the extension to fit (PPT in this case). Or if you are importing to a CAT tool, specify the proper filter for the *.bin file. Here's an example for memoQ 6:

The number at the end of the file name before the extension indicates the order of the objects in the document, which may be helpful in identifying the new extension to use. If you want to put the translated objects back in the embeddings folder, remember to change the extensions of the older objects back to BIN.
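When it is not obvious what a BIN object is, its contents can offer a hint. The sketch below is a heuristic in Python, nothing more: it assumes the BIN file is an OLE compound file and that the main stream name of a binary Office format (WordDocument, Workbook or PowerPoint Document) appears in it as UTF-16 text. The function name guess_extension is made up for this example, and for other object types it will return None or possibly a wrong guess.

```python
from pathlib import Path

# Signature at the start of an OLE compound file.
OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")

# Main stream names of the binary Office formats, mapped to an extension.
STREAM_HINTS = [
    ("PowerPoint Document", ".ppt"),
    ("WordDocument", ".doc"),
    ("Workbook", ".xls"),
]


def guess_extension(data):
    """Return a likely extension for the bytes of an embedded BIN object,
    or None if no hint is found. Heuristic only: it searches the raw bytes
    for known stream names instead of parsing the file structure."""
    if not data.startswith(OLE_MAGIC):
        return None
    for stream_name, extension in STREAM_HINTS:
        if stream_name.encode("utf-16-le") in data:
            return extension
    return None


def guess_extension_for_file(path):
    """Convenience wrapper that reads the BIN file from disk."""
    return guess_extension(Path(path).read_bytes())


# Example with a made-up file name:
# guess_extension_for_file("embeddings/oleObject1.bin")
```

If the guess is wrong, renaming the file back to BIN and trying another extension or filter does no harm.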

Examining embedded objects in Microsoft Word

Recently I described a method for translating embedded objects in Microsoft Office documents. The final step in that method requires these objects to be refreshed by opening them manually or using a corresponding macro.

The macro below is intended for inspecting all the embedded objects in a Microsoft Word document. When run, it opens each of these objects if possible, regardless of type, and leaves the corresponding editing window open. This allows last minute changes to be made conveniently before the objects are saved and refreshed in the document view. A similar approach can be used for objects embedded in Excel or PowerPoint, though the references are a little different.

This macro could also be used to test quickly whether a large document has embedded objects to be dealt with. Sometimes it's hard to recognize these. It does not, however, open bitmaps inserted as pictures.

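Sub openEmbeddedObjects()
   ' Opens each embedded object among the inline shapes of the active
   ' document for editing and leaves its editing window open.
   Dim longShapeCount As Long
   Dim i As Long
   ' Skip shapes that cannot be opened (e.g. pictures) instead of stopping.
   On Error Resume Next
   Application.ScreenUpdating = False
   longShapeCount = ActiveDocument.InlineShapes.Count
   If longShapeCount > 0 Then
      For i = 1 To longShapeCount
        ActiveDocument.InlineShapes(i).OLEFormat.Edit
      Next i
   End If
   Application.ScreenUpdating = True
   On Error GoTo 0
End Sub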

Jul 27, 2012

Translating embedded objects in Microsoft Office documents

Yesterday a colleague sent me a note to say he had been searching my blog for information about translating compound Microsoft Office documents (that is, documents with embedded objects) in memoQ and couldn't find any. I presume he was referring to the article about how often one CAT tool is not enough - combined workflows with other tools can frequently help solve many tricky translation problems, and DVX2 or STAR TRANSIT are definitely useful options for preparing compound Microsoft Office documents for translation in memoQ. Some time ago I recommended using STAR TRANSIT as a pre-processing tool to one of my agency friends, and he carried out a very large, complex project successfully using memoQ's excellent integration features for STAR TRANSIT projects.

There is, of course, another simple way to translate the embedded objects in a Microsoft Office document that does not involve purchasing other software licenses. I don't usually talk about it, because there are a few limitations, and until recently I had not figured out how to avoid corrupting the files when I tried to do things the "easy" way. This approach is not limited to memoQ and will actually work with most CAT tools - so SDL Trados Studio users can do this as well, for example.

It is useful to know that the Microsoft Office 2007/2010 file formats (DOCX, PPTX, XLSX) are really just ZIP files containing XML and a bunch of other stuff. That stuff includes a folder with the embedded objects in formats that can be dealt with directly.

If you have an older, binary MS Office document (DOC, PPT, XLS) with embedded objects, convert it to a 2007/2010 format.

If you rename the file extension DOCX, PPTX or XLSX to ZIP and unpack the ZIP file, inside the folder you will find a folder called "embeddings". The files in that folder can be copied elsewhere and usually handled directly in your CAT tool. But problems usually arise when you put them back, rezip the folder and change back to the original extension. The compression gets screwed up, and the Microsoft Office file is corrupted and won't open.

The only reliable method I have found for avoiding this is to use the Windows Explorer (under Windows 7) to open the ZIP file:

Here's what the "guts" of one DOCX file with a bunch of embedded Excel tables looks like:

Inside the word folder you'll find the embeddings folder:

The contents of the embeddings folder look like this:

Simply copy the embeddings folder somewhere safe, translate its contents, then copy them back to the ZIP file using Windows Explorer. Then rename the ZIP extension to the original extension for the file.

If you open the file and look at it, you'll get a shock. When you see all the objects in their original language, you might think something went wrong. Nothing bad has happened; you merely need to refresh the objects. This can be done by opening each briefly to edit or using a macro to open each object and close it again quickly. In a job with dozens of embedded objects in a long file, this macro is a helpful shortcut.

Given how easily accessible this embedded content actually is, one has to wonder why other major CAT tool providers like SDL and Kilgray have failed to offer the option of importing embedded content in their filters up to now. Let's hope they do soon. In the meantime, this workaround should enable many people to deal with this complex and irritating file format challenge.

Here's a summary of the procedure once again:

Rename the *.???x file to *.zip
Under Windows 7, right-click on the ZIP file and open it using the Windows Explorer. Using ZIP tools of any kind risks corruption by changing the compression ratios.
Find the embeddings folder inside the ZIP structure. Copy this elsewhere and use it as the source for translation. It will contain all the embedded objects as single files.
Copy the translated content back into the embeddings folder in the ZIP structure.
Rename the ZIP file to its original extension.
Open the file and refresh each embedded object (which will initially appear not to have been translated) by right-clicking and opening it from the context menu or running a macro to do that.
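For those who prefer a script for the extraction step, here is a minimal Python sketch using the standard zipfile module. It only reads the Office file, so it does not touch the compression of the original, and no renaming to ZIP is needed. The function name extract_embeddings and the output folder are assumptions for illustration; putting the translated files back is still best done as described in the steps above.

```python
import zipfile
from pathlib import Path, PurePosixPath


def extract_embeddings(package_path, out_dir):
    """Copy every file stored under an 'embeddings' folder of an Office
    package (DOCX, PPTX, XLSX) into out_dir and return the saved paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    with zipfile.ZipFile(package_path) as package:
        for info in package.infolist():
            parts = PurePosixPath(info.filename).parts
            # Matches e.g. word/embeddings/..., ppt/embeddings/..., xl/embeddings/...
            if info.is_dir() or "embeddings" not in parts[:-1]:
                continue
            target = out_dir / parts[-1]
            target.write_bytes(package.read(info))
            saved.append(target)
    return saved


# Example with made-up paths:
# for path in extract_embeddings("report.docx", "report_embeddings"):
#     print(path.name)
```

An empty result is also a quick way to confirm that a file has no embedded objects to worry about.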
