Jul 14, 2014

My translated document won't export! A 3-step preventive solution

A file export failure in memoQ
How many times have I heard that? Experienced it myself in various CAT tools? No idea. Lots.

About three years ago I published an article on the "pseudotranslation" feature of memoQ, which had been introduced in version 5. This was a feature I had made good use of years before in Passolo (before SDL did its Pac-Man thing with the good company) to determine whether all text in software to be localized was accessible to the translation environment. In that article and on many other occasions, I have discussed the idea of roundtrip testing files to be sure they can be translated and then transformed afterward back into the desired formats. Often I just refer to this as "roundtripping".

Roundtripping is very simple. Anyone can do it, and it generally takes about a minute, sometimes a bit more, often less. And it more or less guarantees that your plans to translate a file and get a technically usable result will succeed. Roundtripping can be done with any respectable CAT tool and possibly with some of the ones that aren't.

Here's how it goes:
  1. Import the document(s) you intend to translate into your translation environment (Wordfast Pro, OmegaT, SDL Trados Studio, memoQ, etc.).
  2. Copy the entire source text exactly - including all tag structures - to the fields for target text. (This is actually a pain in the ass in OmegaT currently because even its developers don't understand how that somewhat hidden, idiotically command-line based function works. But for pretty much everyone else it's a piece of cake.)
  3. Export the target text document (which of course is exactly the same as the source text) from the translation environment and ensure that it opens properly in its relevant application.
If there are problems in Step 3, then your source document is either corrupted (very likely) or the working environment screwed up the document on import. If the file type involved is one you work on regularly, you can be pretty sure that the problem lies with the original document and has nothing at all to do with your translation tool.
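
For DOCX files, part of the Step 3 check can be roughed out in a script. The sketch below is only an illustration and not a feature of any CAT tool: the function name `check_docx` is made up, and it relies on the fact that a DOCX file is a ZIP container of XML parts. It tests the container for CRC errors and confirms that each XML part parses. A file that passes can still fail to open in Word, so this does not replace actually opening the exported document.

```python
import zipfile
from xml.etree import ElementTree


def check_docx(path):
    """Return a list of problems found in a DOCX file (an empty list means none were found).

    Hypothetical helper for illustration; it checks only the ZIP container
    and XML well-formedness, not whether Word will accept the document.
    """
    if not zipfile.is_zipfile(path):
        return ["not a valid ZIP container"]

    problems = []
    with zipfile.ZipFile(path) as archive:
        # testzip() returns the name of the first member with a bad CRC, or None.
        bad_member = archive.testzip()
        if bad_member is not None:
            return [f"CRC error in {bad_member}"]

        names = archive.namelist()
        if "word/document.xml" not in names:
            problems.append("word/document.xml is missing")

        # Every XML part (and relationship file) should at least be well-formed.
        for name in names:
            if name.endswith(".xml") or name.endswith(".rels"):
                try:
                    ElementTree.fromstring(archive.read(name))
                except ElementTree.ParseError as error:
                    problems.append(f"{name}: {error}")
    return problems
```

Running `check_docx` on the file exported in Step 3 and getting an empty list back is a reasonable first sign that the roundtrip worked; anything else is a reason to look more closely before the translation starts.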

Corrupted documents occur with some frequency when PDFs are converted to editable formats such as RTF, DOC or DOCX, particularly by persons without a proper understanding of the best procedures for doing so. Even top-end tools like OmniPage or ABBYY FineReader sometimes create documents with hidden flaws in their file structure, which might open in a word processor but which go to Hell once imported into a CAT tool. Table structures used to be particularly vulnerable to corruption and probably still are.

So a smart outsourcer or translator roundtrips files before the actual translation starts to avoid last-minute panics and missed deadlines. It's fast, free insurance.

But what can you do if the corrupt file is all you have?
Sometimes nothing except go back to the client and ask for a new file. But I have also noticed rather often that corruption can be avoided by zipping file attachments to e-mail, and it seems that the corruption of unprotected files often occurs when these files are downloaded from the mail server. So if you can, try another copy of the attached file off your e-mail server.
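
For anyone who prefers to script the zipping, here is a minimal sketch using Python's standard `zipfile` module. The function names `zip_for_email` and `archive_is_intact` are made up for this example. A useful side effect of zipping is that the archive stores a CRC-32 checksum for each file, so the recipient can detect damage before importing anything into a CAT tool.

```python
import zipfile
from pathlib import Path


def zip_for_email(paths, archive_path):
    """Pack the given files into one ZIP archive (hypothetical helper)."""
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for path in paths:
            # Store only the file name, not the sender's full folder structure.
            archive.write(path, arcname=Path(path).name)
    return archive_path


def archive_is_intact(archive_path):
    """Return True if the archive opens and every member passes its CRC check."""
    if not zipfile.is_zipfile(archive_path):
        return False
    try:
        with zipfile.ZipFile(archive_path) as archive:
            # testzip() returns None when no member has a bad checksum.
            return archive.testzip() is None
    except zipfile.BadZipFile:
        return False
```

If `archive_is_intact` returns False for a downloaded attachment, that is a good cue to fetch another copy from the mail server or ask the client to resend it.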

In the case of Microsoft Office files (for Word, Excel and PowerPoint), re-saving the file in a different format causes its structure to be reworked by the application and often repaired. Sometimes that corrupt DOCX file can simply be re-saved as DOCX and all will be well with a roundtrip in your CAT tool, but if that doesn't help, saving the DOCX as RTF and then re-saving that RTF file once again as DOCX will effect the necessary "repairs" to the file and ensure that you can get a usable result.
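
The DOCX to RTF to DOCX detour can also be scripted, though not with Word itself in this sketch. As a stand-in it assumes LibreOffice is installed and that its `soffice` command-line converter is on the path, so the result may differ from what a re-save in Microsoft Word would produce. The function names and folder layout are made up for illustration.

```python
import subprocess
from pathlib import Path


def convert_command(source, target_format, outdir, soffice="soffice"):
    """Build a headless LibreOffice conversion command (assumes soffice is installed)."""
    return [soffice, "--headless", "--convert-to", target_format,
            "--outdir", str(outdir), str(source)]


def resave_via_rtf(docx_path, workdir, soffice="soffice"):
    """Convert DOCX to RTF and back to DOCX; return the path of the re-saved DOCX.

    Separate output folders keep the original file from being overwritten.
    """
    docx_path = Path(docx_path)
    rtf_dir = Path(workdir) / "rtf"
    out_dir = Path(workdir) / "resaved"
    rtf_dir.mkdir(parents=True, exist_ok=True)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Step 1: DOCX -> RTF
    subprocess.run(convert_command(docx_path, "rtf", rtf_dir, soffice), check=True)
    rtf_path = rtf_dir / (docx_path.stem + ".rtf")

    # Step 2: RTF -> DOCX
    subprocess.run(convert_command(rtf_path, "docx", out_dir, soffice), check=True)
    return out_dir / (docx_path.stem + ".docx")
```

Whatever tool does the re-saving, the re-saved file still needs its own roundtrip in the CAT tool before you trust it.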

Why not just translate that RTF file if it's OK? I prefer not to, because if there are tags present, these may be represented differently in some working environments (such as memoQ, which shows very different tagging in RTF and DOCX), and this messes up my matching a bit and obscures the tag function as well, forcing me to look at a printout or PDF too often to see what the markup is about. 

Get it right the first time
The files I work on often have unusual abbreviations which affect segmentation (and require me to update my rules), or I join and split segments while I work on a complex patent or legal pleading in order to make the work go better. This takes time. And if I discover at the end of a three-day job with 10,000+ words that my translation will not export to a target file, then I can look forward to a lot of extra time recreating my desired segmentation, especially if I was lazy and did not update the segmentation rules. While features like memoQ's "TM-driven segmentation" can overcome this somewhat, there are limits, and those limits are exceeded in cases where I might join 7 or more segments because the source language segmentation rules were seriously suboptimal.

So take a minute. Or two. And roundtrip those documents before you start translating or send the job out for someone else to do!

6 comments:

  1. >This is actually a pain in the ass in OmegaT

    Only if you need a TMX with the pseudo translation. If all you need is the translated document, there's actually nothing to do: just use Project > Create Translated Documents and you get the target document.

    1. I second this. I have rarely ever found any issue in getting the translated file.
      If you can see the text in OmegaT, you will get the translated file. It's as simple as that.

    2. Perhaps not, Mulyadi. I'll take the recent file with trouble and try to put it through and see what happens. Unfortunately, there are always surprises with all of these tools, even in areas where experience can give one great confidence.

  2. Great advice Kevin!

    I wrote a similar article with some advice on how to solve some common structural errors in Word files that prevent exporting the translated version. It is a bit SDL Studio-centric, but the part about how to handle the Word files before importing them should apply globally (and apologies for the self-plugging; really not my intention).

    I still find it hard to understand why this is not part of the project preparation process (what needs to be done manually will, more often than not, not be done). I guess that infesting the tool with MpT plugins and other nonsense is more important than a very basic - yet extremely important - check that should be performed as part of the project creation stage rather than being discovered at its end.

    1. Self-plug all you want, Shai - your content is always worthwhile, and if you don't, others will anyway :-) I agree that more integrity checks of documents are needed in the working environments. I think Tom Imhoff (in Hamburg, localix.biz) did something like this for the SDL Trados Studio integration he created for the SaaS project management solution OTM from LSP.net. I don't recall all the details, however, but I do seem to remember the solution architect talking about various validity checks.

    2. In Studio if you have the Professional Version it's easy enough to add this so it's part of the Project Prep. But for the Freelance Edition you can't customise the project workflow using the UI. So you do need to run two preparation steps and use your memory a little!

