Jul 27, 2012

Translating embedded objects in Microsoft Office documents

Yesterday a colleague sent me a note to say he had been searching my blog for information about translating compound Microsoft Office documents (that is, documents with embedded objects) in memoQ and couldn't find any. I presume he was referring to the article about how often one CAT tool is not enough: combined workflows with other tools can frequently help solve many tricky translation problems, and DVX2 or STAR TRANSIT are definitely useful options for preparing compound Microsoft Office documents for translation in memoQ. Some time ago I recommended STAR TRANSIT as a pre-processing tool to one of my agency friends, and he successfully carried out a very large, complex project using memoQ's excellent integration features for STAR TRANSIT projects.

There is, of course, another simple way to translate the embedded objects in a Microsoft Office document that does not involve purchasing other software licenses. I don't usually talk about it, because there are a few limitations, and until recently I had not figured out how to avoid corrupting the files when I tried to do things the "easy" way. This approach is not limited to memoQ and will actually work with most CAT tools - so SDL Trados Studio users can do this as well, for example.

It is useful to know that the Microsoft Office 2007/2010 file formats (DOCX, PPTX, XLSX) are really just ZIP files containing XML and a bunch of other stuff. That stuff includes a folder with the embedded objects in formats that can be dealt with directly.
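
To see this for yourself without renaming anything, here is a minimal Python sketch that lists the embedded objects in an Office file. It assumes only that the file is a valid ZIP package and that the objects sit in a folder named embeddings; the function name `list_embedded_objects` is made up for illustration.

```python
import zipfile

def list_embedded_objects(office_path):
    """Return the archive names of all files stored in an 'embeddings' folder."""
    with zipfile.ZipFile(office_path) as package:
        return [
            name
            for name in package.namelist()
            # Entries look like 'word/embeddings/<file>'; skip folder entries.
            if "/embeddings/" in name and not name.endswith("/")
        ]
```

Python's zipfile module looks at the file content rather than the extension, so something like `list_embedded_objects("report.docx")` (a hypothetical file name) works on the DOCX directly.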

If you have an older, binary MS Office document (DOC, PPT, XLS) with embedded objects, convert it to a 2007/2010 format.

If you change the file extension from DOCX, PPTX or XLSX to ZIP and unpack the ZIP file, you will find a folder called "embeddings" inside one of the unpacked folders (in the word folder in the case of a DOCX file). The files in that folder can be copied elsewhere and usually handled directly in your CAT tool. But problems usually arise when you put them back, rezip the folder and change back to the original extension. The compression gets screwed up, and the Microsoft Office file is corrupted and won't open.

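The only reliable method I have found for avoiding this is to use the Windows Explorer (under Windows 7) to open the ZIP file: right-click the ZIP file and open it with Windows Explorer.

Here's what the "guts" of one DOCX file with a bunch of embedded Excel tables looks like:

[Screenshot: top-level contents of the DOCX package]

Inside the word folder you'll find the embeddings folder:

[Screenshot: contents of the word folder]

The contents of the embeddings folder look like this:

[Screenshot: contents of the embeddings folder, one file per embedded object]
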
Simply copy the embeddings folder somewhere safe, translate its contents, then copy them back to the ZIP file using Windows Explorer. Then rename the ZIP extension to the original extension for the file.
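
For those who would rather script the copy-out step than click through Explorer, here is a rough Python sketch. It only reads the package, so it cannot damage the original. The function name `extract_embeddings` and the flat target folder are assumptions made for illustration, not part of any tool mentioned above.

```python
import zipfile
from pathlib import Path, PurePosixPath

def extract_embeddings(office_path, target_dir):
    """Copy every file from the package's 'embeddings' folder into target_dir."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(office_path) as package:
        for info in package.infolist():
            if "/embeddings/" not in info.filename or info.is_dir():
                continue
            # Keep only the file name, e.g. 'word/embeddings/x.xlsx' -> 'x.xlsx'.
            out_path = target / PurePosixPath(info.filename).name
            out_path.write_bytes(package.read(info))
            extracted.append(out_path)
    return extracted
```

The files keep their original names, which matters later: they have to go back into the package under exactly the same names.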

If you open the file and look at it, you'll get a shock. When you see all the objects in their original language, you might think something went wrong. Nothing bad has happened; you merely need to refresh the objects. This can be done by opening each briefly to edit or using a macro to open each object and close it again quickly. In a job with dozens of embedded objects in a long file, this macro is a helpful shortcut.

Given how easily accessible this embedded content actually is, one has to wonder why other major CAT tool providers like SDL and Kilgray have failed to offer the option of importing embedded content in their filters up to now. Let's hope they do soon. In the meantime, this workaround should enable many people to deal with this complex and irritating file format challenge.

Here's a summary of the procedure once again:
  1. Rename the *.???x file to *.zip 
  2. Under Windows 7, right-click on the ZIP file and open it using the Windows Explorer. Using ZIP tools of any kind risks corruption by changing the compression ratios. 
  3. Find the embeddings folder inside the ZIP structure. Copy this elsewhere and use it as the source for translation. It will contain all the embedded objects as single files. 
  4. Copy the translated content back into the embeddings folder in the ZIP structure.
  5. Rename the ZIP file to its original extension. 
  6. Open the file and refresh each embedded object (which will initially appear not to have been translated) by right-clicking and opening it from the context menu or running a macro to do that.
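
Steps 3 to 5 could in principle also be scripted. The Python sketch below is an alternative to the Explorer route, not the method described above, and whether Office accepts the result is an assumption you should check on a copy of your file. It writes a new package next to the original, keeps every entry's name, order and compression method, and swaps in translated files with matching names. The function name `replace_embeddings` and all paths are hypothetical.

```python
import zipfile
from pathlib import Path, PurePosixPath

def replace_embeddings(source_path, translated_dir, output_path):
    """Write a copy of the package with translated embedded files swapped in."""
    translated = {p.name: p for p in Path(translated_dir).iterdir() if p.is_file()}
    replaced = []
    with zipfile.ZipFile(source_path) as src, \
         zipfile.ZipFile(output_path, "w") as dst:
        for info in src.infolist():
            data = src.read(info)
            name = PurePosixPath(info.filename).name
            if "/embeddings/" in info.filename and name in translated:
                data = translated[name].read_bytes()
                replaced.append(info.filename)
            # Same entry name, same order, same compression method as before.
            dst.writestr(info.filename, data, compress_type=info.compress_type)
    return replaced
```

A call might look like `replace_embeddings("report.docx", "translated", "report_translated.docx")`. The original file is never modified, so if the result won't open you have lost nothing. Step 6, refreshing the objects, still has to be done in the Office application itself.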


6 comments:

  1. Kevin, good morning,

    One important thing to remember is that for Excel files each embedding usually contains everything that was in the workbook at the time of pasting into Word. For instance, if you paste a chart into Word, the file inside the DOCX archive will contain the entire workbook that contained that chart. Special attention is needed to find the actual content to translate in the file you obtain this way. So sometimes it is easier to open the embedded object in Word, copy and paste into Excel, and translate the new Excel workbook.

  2. @Stas: True... depending on how the Excel files got into the other document and what their content is, there may be quite a bit which is not of interest to the translation. But tools such as memoQ offer fairly straightforward ways of choosing the content subset in an individual file and excluding all else. So in a worst case one would do a quick search for the visible content, identify the cell ranges for what needs to be translated and specify these. Copying and pasting techniques often involve format trouble.
    Another thing I like about accessing the embeddings folder is that the objects are numbered sequentially as far as I can tell. Before working out this method, I opened and saved individual embedded objects, then re-inserted them later after translation, replacing the original objects. But once when doing a financial report with over 30 embedded tables in the Word document I overlooked a few tables (easy to do when the author intersperses embedded Excel spreadsheets with real Word tables) and my numbering scheme for the file names got a bit confused. This cost time and nerves when the company's CEO and I were phoning each other after midnight putting it all back together again. Trust me, this method here can save a lot of grief and keep a better overview when things get really complex.
    Of course having filters that will just import all the content in one go might be better, but then if the content for individual objects is not identifiable in Star Transit or DVX (I don't know if this is the case), how do you quickly identify and exclude the content you don't want to translate? Probably by resorting to this method :-)

    1. Hello, Kevin. Well, I guess there is no need for any special filter to exclude non-translatable embedded content. The Office Open XML specification must specify what embedded content is actually visible in the document. Of course, this leaves a whole number of other issues like cell references to be taken care of, but it should be possible for translation environment tool vendors to implement this. I think support for embedded content will be a major milestone, like the recent SDL support for the Word Revisions feature in exported documents. So we'll have to wait at least one major version, i.e. a couple of years :-), at least with the desktop environments. Maybe the online translation environments will be quicker this time...

  3. Thank you Kevin. I'll be referring to this often - at least until that happy day when the tool providers get it together to include this very basic and necessary function as a standard filter! Hope does spring eternal... :-)

  4. I just wanted to say that I still refer to this post and find it fantastically useful--particularly the advice to refresh the embedded files in order to see them in the translated language, since that is the step I'm most likely to forget. So glad you posted this. Thank you!

  5. I just found this post and it saved my day. Thank you.

