Translation Tribulations: fixing MS Office documents

Jul 14, 2014

My translated document won't export! A 3-step preventive solution

A file export failure in memoQ

How many times have I heard that? Experienced it myself in various CAT tools? No idea. Lots.

About three years ago I published an article on the "pseudotranslation" feature of memoQ, which had been introduced in version 5. This was a feature I had made good use of years before in Passolo (before SDL did its Pac-Man thing with the good company) to determine whether all text in software to be localized was accessible to the translation environment. In that article and on many other occasions, I have discussed the idea of roundtrip testing files to be sure they can be translated and then transformed afterward back into the desired formats. Often I just refer to this as "roundtripping".

Roundtripping is very simple. Anyone can do it, and it generally takes about a minute, sometimes a bit more, often less. And it more or less guarantees that your plans to translate a file and get a technically usable result will succeed. Roundtripping can be done with any respectable CAT tool and possibly with some of the ones that aren't.

Here's how it goes:

1. Import the document(s) you intend to translate into your translation environment (Wordfast Pro, OmegaT, SDL Trados Studio, memoQ, etc.).
2. Copy the entire source text exactly - including all tag structures - to the fields for target text. (This is actually a pain in the ass in OmegaT currently because even its developers don't understand how that somewhat hidden, idiotically command-line based function works. But for everyone else pretty much it's a piece of cake.)
3. Export the target text document (which of course is exactly the same as the source text) from the translation environment and ensure that it opens properly in its relevant application.

If there are problems in Step 3, then your source document is either corrupted (very likely) or the working environment screwed up the document on import. If the file type involved is one you work on regularly, you can be pretty sure that the problem lies with the original document and has nothing at all to do with your translation tool.
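For tools without a convenient "copy source to target" command, Step 2 can sometimes be done outside the tool. The following is only a minimal sketch, and it rests on an assumption not made anywhere above: that your environment can export and re-import its bilingual working file as plain XLIFF 1.2 with simple trans-unit elements. The function name copy_source_to_target is my own invention, and files that use seg-source or tool-specific extensions would need more care.

```python
import copy
import xml.etree.ElementTree as ET

# Namespace of XLIFF 1.2 (assumption: the bilingual file uses this version).
XLIFF_NS = "urn:oasis:names:tc:xliff:document:1.2"


def copy_source_to_target(xliff_text):
    """Return XLIFF text in which every <target> is an exact copy of its <source>."""
    ET.register_namespace("", XLIFF_NS)  # keep the default namespace on output
    root = ET.fromstring(xliff_text)
    for unit in root.iter(f"{{{XLIFF_NS}}}trans-unit"):
        source = unit.find(f"{{{XLIFF_NS}}}source")
        if source is None:
            continue
        # Remove any existing target so the result mirrors the source exactly.
        for old_target in unit.findall(f"{{{XLIFF_NS}}}target"):
            unit.remove(old_target)
        # A deep copy carries the text and all inline tag elements along.
        target = copy.deepcopy(source)
        target.tag = f"{{{XLIFF_NS}}}target"
        target.attrib.clear()  # drop source-only attributes such as xml:lang
        unit.insert(list(unit).index(source) + 1, target)
    return ET.tostring(root, encoding="unicode")
```

After re-importing the modified file, carry on with Step 3 as usual. The export from the translation environment is still the real test; the script only saves the copying.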

Corrupted documents occur with some frequency when PDFs are converted to editable formats such as RTF, DOC or DOCX, particularly by persons without a proper understanding of the best procedures for doing so. Even top-end tools like OmniPage or ABBYY FineReader sometimes create documents with hidden flaws in their file structure, which might open in a word processor but which go to Hell once imported in a CAT tool. Table structures used to be particularly vulnerable to corruption and probably still are.

So a smart outsourcer or translator roundtrips files before the actual translation starts to avoid last-minute panics and missed deadlines. It's fast, free insurance.

But what can you do if the corrupt file is all you have?

Sometimes you can do nothing except go back to the client and ask for a new file. But I have also noticed rather often that corruption can be avoided by zipping file attachments to e-mail, and it seems that the corruption of unzipped files often occurs when these files are downloaded from the mail server. So if you can, try downloading a fresh copy of the attached file from your e-mail server.

In the case of Microsoft Office files (for Word, Excel and PowerPoint), re-saving the file in a different format causes its structure to be reworked by the application and often repaired. Sometimes that corrupt DOCX file can simply be re-saved as DOCX and all will be well with a roundtrip in your CAT tool, but if that doesn't help, saving the DOCX as RTF and then re-saving that RTF file once again as DOCX will effect the necessary "repairs" to the file and ensure that you can get a usable result.
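Before trying the re-save trick, it can be worth checking whether the file is damaged at the package level. DOCX, XLSX and PPTX files are ZIP archives of XML parts, so Python's standard library can test the archive checksums and see whether each XML part parses. This is only a rough sketch; check_ooxml_package is a name I made up, and a clean result does not prove the file will roundtrip, since the hidden structural flaws that upset CAT tools can sit in perfectly well-formed XML.

```python
import zipfile
import xml.etree.ElementTree as ET


def check_ooxml_package(path):
    """Return a list of problems found in a DOCX/XLSX/PPTX file (empty if none found)."""
    if not zipfile.is_zipfile(path):
        return ["not a ZIP package (damaged, or an older format such as DOC or RTF)"]
    problems = []
    try:
        with zipfile.ZipFile(path) as package:
            # testzip() reads every member and returns the first one with a bad checksum.
            bad_member = package.testzip()
            if bad_member is not None:
                return [f"checksum error in {bad_member}"]
            for name in package.namelist():
                if name.endswith((".xml", ".rels")):
                    try:
                        ET.fromstring(package.read(name))
                    except ET.ParseError as err:
                        problems.append(f"{name} is not well-formed XML: {err}")
    except zipfile.BadZipFile as err:
        problems.append(f"damaged ZIP structure: {err}")
    return problems
```

A call such as check_ooxml_package("contract.docx") (the file name is just an example) that reports a checksum error points to damage in transit, which is the case where fetching the attachment again or asking for a zipped copy is most likely to help.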

Why not just translate that RTF file if it's OK? I prefer not to, because if there are tags present, these may be represented differently in some working environments (such as memoQ, which shows very different tagging in RTF and DOCX), and this messes up my matching a bit and obscures the tag function as well, forcing me to look at a printout or PDF too often to see what the markup is about.

Get it right the first time

The files I work on often have unusual abbreviations which affect segmentation (and require me to update my rules), or I join and split segments while I work on a complex patent or legal pleading in order to make the work go better. This takes time. And if I discover at the end of a three-day job with 10,000+ words that my translation will not export to a target file, then I can look forward to a lot of extra time recreating my desired segmentation, especially if I was lazy and did not update the segmentation rules. While features like memoQ's "TM-driven segmentation" can overcome this somewhat, there are limits, and those limits are exceeded in cases where I might join 7 or more segments because the source language segmentation rules were seriously suboptimal.

So take a minute. Or two. And roundtrip those documents before you start translating or send the job out for someone else to do!
