Feb 23, 2014

Cleaning up a crappy OCR job for translation

It's a sad fact in the professional work of translators that a lack of understanding of how to deal effectively with various PDF formats causes enormous loss of productivity and results which are not really fit for purpose. The aggressive insistence of many colleagues possessed of a dangerous Halbwissen on using half-baked methods and inappropriate tools contributes to the problem, but, bowing to the wisdom about arguing with fools, I now mostly sit back with a bemused and amused smile and watch the tribulations of those who believe in salvation by PDF import filters and cheap or free OCR. "TANSTAAFL" is as true as it ever was.

Just before the weekend I got an inquiry from an agency client I rather like. Nice people, good attitude, but struggling sometimes trying to find their way with technology despite some in-country "expert" training. This inquiry looked a bit like ripe fish at first glance. The smell got stronger after I was told that, because the corporate end client had converted the PDF of their annual report and begun to edit the mess (and comment on it heavily too) in the OCR file, this would be all there was to work with. It was a thoroughly appetizing sight when imported into a translation environment:

There are so many issues in that tossed salad of translation terror that I don't even know where to start describing them.

The screenshot above was in memoQ. How does it look in SDL Trados Studio? Often just as messy. In this case, this was the result in an older version of Studio:

SDL Trados Studio choked and refused to import the file!

I do have the latest version of SDL Trados Studio 2014, but unfortunately it's on a system that does not yet have Microsoft Office, because I refuse to bow to Microsoft's insistence that I must buy a Portuguese version of that software. No MS Office, no file import in this case with SDL Trados Studio. memoQ fortunately has not needed MS Office to import its old file formats since the release of memoQ 6.0.

Ugly OCR trash like this file is all too common at this time of year, and as I am busy compiling the syllabus for the workshop I want to do on better living with well-used technology for legal and financial translators, I felt obliged to take this one on as a teaching example. It's actually not as bad as it looks. On the other hand, the best approach may not always be obvious, and the best solution for one document may not apply as well or at all to another.

My first approach was to use Dave Turner's CodeZapper macros. This isn't as straightforward as it used to be since I downgraded from Microsoft Office 2003 to later versions; for some reason the toolbar refuses to stay loaded between work sessions, and there's no way I can keep track of all the abbreviations for macros on it.

I can't deal with anything more complicated than clicking the "CZL" option for "Code Zapper lite", which did a rather decent job on the heavy mess above:

But all was not quite as well as it seemed:

Text in the header and footer remained trashed, and the heavy use of comments and tabbed lists meant that there were plenty of legitimate tags to deal with which were just too confusing with the DVX-like mess of memoQ's default import and display for an RTF file.

So I went for a kinder, gentler approach. I changed my import filter settings in memoQ:

There is actually seldom any good reason to import an RTF or DOC file into memoQ using the default filter settings. And marking those two little checkboxes at the bottom often accomplishes much of what CodeZapper does. Sometimes less. A bit more in this case.

The header and footer texts were absolutely clean. Don't let the extra tags in this sample fool you: overall, there were fewer than in the code-zapped file. Now there are still a number of issues to be seen in the screenshot above, including paragraph breaks in the middle of a sentence and awful manual hyphenation (many instances of that in the whole text) and joys like badly placed comments and links which mess up the text and prevent term identification by the software:

Source editing features of memoQ (F2) enable issues like the two above to be dealt with easily:

After a bit of repair like this in the memoQ environment (where it is really much, much easier to fix the problems of bad comment and link placement), I copied the entire source text to the target to enable me to export a cleaner source text file. I then opened this file in Microsoft Word and used various search and replace operations to fix the bad hyphenation and other problems like excess spaces. Replacing the hyphens had to be done occurrence-by-occurrence, because the style of writing in German meant that there were many legitimate instances of hyphens followed by spaces.
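The occurrence-by-occurrence review described above can be partly supported by a script that only lists suspicious hyphens and leaves the decision to a human. What follows is a minimal sketch in Python, not what I actually used in Word: the function names, the conjunction list and the "lowercase word after the hyphen" heuristic are all assumptions for illustration, and they will miss some cases and flag some legitimate ones.

```python
import re

# Hypothetical heuristic: words that typically follow a legitimate German
# suspended hyphen, as in "Ein- und Ausgang". Extend as needed.
CONJUNCTIONS = {"und", "oder", "bzw", "sowie", "als", "bis"}

# A word fragment, a hyphen, whitespace, then a fragment starting in lowercase.
CANDIDATE = re.compile(r"(\w+)-\s+([a-zäöüß]\w*)")


def find_hyphen_candidates(text):
    """Return (offset, matched text, suggested repair) for likely OCR hyphenation.

    Nothing is replaced automatically; the list is meant for manual review.
    """
    hits = []
    for m in CANDIDATE.finditer(text):
        if m.group(2).lower() in CONJUNCTIONS:
            continue  # probably a legitimate suspended hyphen
        hits.append((m.start(), m.group(0), m.group(1) + m.group(2)))
    return hits


def collapse_spaces(text):
    """Reduce runs of two or more spaces to a single space."""
    return re.sub(r" {2,}", " ", text)


if __name__ == "__main__":
    sample = "Die Geschäfts- entwicklung und Ein- und Ausgang im Jahres- abschluss"
    for offset, found, suggestion in find_hyphen_candidates(sample):
        print(offset, repr(found), "->", suggestion)
```

A capitalized continuation (a noun after a hyphen, for example) is deliberately not flagged here, so a pass like this would complement the manual search and replace rather than substitute for it.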

After all was done, the "before and after" looked like this:



The remaining tags were all legitimate formatting tags for comments, hyperlinks, tabs after section numbering, etc. These do, of course, require attention and add complexity to the work still, so they must be included in the charges for the job. memoQ makes this calculation particularly simple by allowing weighting factors to be specified in the analysis. These are the settings I typically use for a German source text:

I find this usually represents a fair minimum for the additional effort in translation and quality assurance that tags require. In this case, of course, time charges for the cleanup apply, but as you can probably guess from comparing the two analysis tables above, the customer is actually saving a lot of money by paying me to clean up the mess, and the results will be a lot more usable. My cleaned-up version of the source text will also be returned in case the authors intend to make more revisions in the source - this will save more time and money by avoiding redundant cleanup in that case.
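The arithmetic behind tag weighting is simple enough to sketch. The Python below is only an illustration of the idea that each inline tag counts as some fraction of a word in the billable total; the function name, the weight of 0.25 words per tag and the word and tag counts are all hypothetical and are not the settings or figures from this job.

```python
def weighted_word_count(words, tags, tag_weight_words=0.25):
    """Billable word count with each inline tag counted as a fraction of a word.

    tag_weight_words is a hypothetical value chosen for illustration only.
    """
    return words + tags * tag_weight_words


# Hypothetical counts: the same text before and after cleanup.
before = weighted_word_count(words=10_000, tags=4_000)
after = weighted_word_count(words=10_000, tags=400)
print(f"before cleanup: {before:.0f} weighted words")
print(f"after cleanup:  {after:.0f} weighted words")
```

With numbers like these, the difference between the two totals is what the customer avoids paying per-word rates on, which can then be set against the time charged for the cleanup.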


  1. Wow, you definitely deserve a good pat on the back for this.

    Karolina Karczmarek-Giel
    Office Assistant

  2. Wow, what a nightmare. I hope you charged triple for this. Thanks for describing your process, though; I learned a lot!

  3. Well, there is actually a better solution for all this, one which is usually an option, but which most translators or agency PMs are too afraid to try. Insist on the original file. It works like a charm.

    Today there was an "update" to this job. Or so it seemed. The agency called up in a panic: a new PDF had been received from the client and they wanted me to compare and make changes. Life's too short for such BS. I basically pointed out that this was going to be very expensive and that really, for all concerned, the best thing to do now was to ask the client nicely for the original InDesign file (which I think actually was not in existence before but clearly was now). Ten minutes later I had it, and the text was the cleanest and most trouble-free of any process yet. Everyone saves time and money. Except the translator, who now makes more money in the time spent. We all win.

    So honestly... all these damned converters and techniques are fine things in a real "emergency", but we do everyone involved a great service if we just dig our heels in and insist that original format files be produced wherever possible. This has been the best and most cost-effective solution for more projects than I can count, but in those cases where I am dealing with an agency, it takes longer to convince a PM to ask for the file than it takes for the customer to send it. Such timidity is in nobody's interest in cases like this.

    1. Oh yes... that update? It wasn't. Same text, just a different file format. But with the previous format chaos those involved could not determine this. One more reason to stick to original formats where possible. I suppose I could use the word "Medienbruch" to describe this problem too.

  4. Hi Kevin,
    You did a great cleaning job here.
    Re: MS Office language versions. I had the same issue a few weeks ago when I bought Home&Business 2013. I decided not to get the 365 version but the one-off purchase to download on my machine (although it's still click-to-run, not the full MSI installer). MS kept taking me to a page where I could only download the Spanish version, and after much coming and going I managed to reach an English version here:
    It's now running happily in English on my Win 8.1 OS (which originally came in Spanish and I was able to change to English by adding a language pack, fairly painlessly).

  5. very interesting ...
    well, a couple of years ago I found my 1st final customer, so I think I can add some insight to the matter

    I received an enormous and quite undigestible PDF, as usual, but considering that it was a "final customer" I tried to have the InDesign or PageMaker file instead

    then I tested it, finding it very digestible and easy to work with, then I did a mock automatic translation to let them consider the result, adding that using a more digestible file would have translated into lower costs for them

    well, can you believe it?

    they refused this method because:
    the graphics department was accustomed to managing PDFs, even if it meant a lot of troublesome conversions
    the IT department, idem
    the big boss was accustomed to reading PDFs

    the moral of the story is that nothing beats the psychological sclerosis!

    that is BTW a problem of mine too, considering how much time was needed to switch to memoQ from Trados, or to Office 2013 from Office XP ...

    1. I believe anything if it involves stupidity :-) Actually, if they like PDFs, you can have those generated free on Kilgray's Language Terminal server. I did several InDesign INDD translations this week (the integration with that new tab in memoQ is very nice), and each time I created target files, I logged in to the web interface of Language Terminal and retrieved the full ZIP package (instead of just the IDML available off the tab in the Translations window). It contains a PDF for proofreading purposes.

      So if these idiots want to make extra work for themselves, they can send you an InDesign file and you send back just a PDF and let them do an OCR and waste lots of time correcting and importing the text so they feel like they have lots of "work". And some day they'll probably be fired when the company realizes the waste.

  6. Last night I received the Mother of All Disastrous PDFs from someone - the most complicated survey form I have seen in years, with text boxes split between pages, even through the middle of words. Good luck with OCR for that: it is literally impossible with any techniques I know. The PDF page order does not even display the split boxes in a double page spread so I might use something like a screenshot OCR utility. This insane mess was created in InDesign. Here once again, the translation will be simple if the INDD file can be obtained.

    We can use techniques like I describe in this post to clean up a lot of messes. But it is much, much better not to step in these cow patties in the first place and to communicate with our clients about the formats with which we can do our best work. Whether we want to admit it or not, every bit of energy that goes into dealing with messed-up formats somehow subtracts from the energy remaining for quality translation.

  7. Great article! PDFs are probably among the worst when it comes to transferring information that needs to be worked on.

    As far as Office goes, one can download an ISO file of the relevant edition (i.e. Home&Business, Professional, etc.) and in the architecture (i.e. 32 or 64-bit) and language of one's choice from http://www.heidoc.net/joomla/technology-science/microsoft/73-office-2013-direct-download-links. The license should work just the same. MS is bluntly and rather crudely attempting to drive people to choose their subscription-based plan, and the end justifies the means.

    Another set of macros similar to CodeZapper can be found at http://www.translatortools.net/, but nothing beats having the original file format.

  8. Hi Kevin,

    Yes, the disappearing CodeZapper toolbar in Word 2013 is really annoying. If anyone reading this has solved this, please post your solution here in the comments! I was wondering if it might be possible to create a macro that would automate the steps to re-attach it…


  9. I think I figured it out. Just put ‘CodeZapper 2_9_4.dot’ in this folder: ‘C:\Users\usr\AppData\Roaming\Microsoft\Word\STARTUP’

    Word now starts with CodeZapper loaded!


