Apr 16, 2012

Another approach to OCR for translation

OCR is often a touchy subject for translators. There is unfortunately too little expertise in this area, though the practice of converting scanned text for translation is now quite common. And recent developments in tools such as ABBYY FineReader have catered to the worst of the idiocy I have seen, starting processes automatically which are best executed manually with greater care.

Too many people rely on automatic settings for OCR conversion and save the result in a format (usually an MS Word or RTF file) which more or less preserves the look of the original. The result may look pretty to the ignorant eye, but when the translator begins work, a host of problems may arise. In CAT tools, there are usually innumerable superfluous tags (which sometimes even CodeZapper cannot clean up), even embedded in the middle of words, which prevent matching with TM and glossary entries. Kiss consistency and quality control features goodbye in such cases. In older versions of Trados, font changes and other format trashing are common. Disappearing text in ill-defined text boxes and columns is often a problem, even for those who do not use translation environment tools.
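To make the tag problem concrete, here is a minimal Python sketch of why a tag in the middle of a word defeats a glossary lookup. The tag syntax (`<f1>...</f1>`), the sample segment and the one-entry glossary are all invented for illustration; real CAT tools represent inline tags in their own ways and match terms with more sophistication than a substring test.

```python
import re

# Hypothetical inline-tag syntax; real tools display and store tags differently.
TAG = re.compile(r"<[^>]+>")

def strip_tags(segment):
    """Remove inline tags so that words broken by them are joined again."""
    return TAG.sub("", segment)

def glossary_hits(segment, glossary):
    """Return the glossary terms that occur verbatim in the segment."""
    lowered = segment.lower()
    return [term for term in glossary if term.lower() in lowered]

# Invented example: font-change tags left in mid-word by a "pretty" OCR export.
segment = "The hy<f1>draulic</f1> cyl<f2>inder</f2> is sealed."
glossary = ["hydraulic cylinder"]

print(glossary_hits(segment, glossary))              # no match while tags split the words
print(glossary_hits(strip_tags(segment), glossary))  # match once the tags are gone
```

Stripping tags wholesale like this also throws away formatting you may want to keep, so it only illustrates the matching failure. It is not a recommended clean-up method.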

For these reasons and others, I have long been an advocate of manual zone definition and (where necessary) the use of templates to achieve the best conversion results, and I usually save the results as naked text or, at most, preserving some font formatting (in which case further adjustments by mass selection are usually necessary to ensure that body text is, for example, consistently 10 point and not 9 point or 10.5 point in some spots due to image distortions).

If the result of a translation will be given to a graphic artist for subsequent layout, you do in fact perform a good deed by avoiding the "save with layout" options for your OCR text. A document with a straight text flow is much easier to import into the layout environment (such as InDesign). Of course, where such things are to be done, it is often best to try to get the content in that environment's format in the first place, though in such cases, certain clean-up (of hyphenation, kerning and column breaks, for example) is necessary to avoid problems and tag checks are essential after translation.

However, when you work with OCR texts, no matter how they are created for translation, some errors are almost inevitable. When the OCR engine has a spellchecker and "intelligent" correction features, some of these errors may even be plausible, just wrong, so beware! It is vital to have a copy of the original scanned document as a printout or perhaps on a second working screen for reference. I have followed this approach for years. But when I have a good conversion, I might translate happily for a number of pages before encountering a whiskey tango foxtrot moment in which I must consult the original to see what the text really says. This was a real problem for me in multi-column patents with small type, and I often spent quite a bit of time looking in the scanned PDF for the relevant passage.

That was the case at least until one day, the light went on, and I realized that making a searchable PDF from the original scan could enable me to find the relevant text faster. If this searchable PDF is made from the same OCR process used to create the text to translate, then any errors will be the same, and by putting in the questionable text, you can go precisely to the right place in the document! My original post on this subject on another blog was primarily about making searchable PDFs for reference documents (to find terms and usage easily in documents not intended for translation), but I actually use this error-finding technique more often in my work lately.
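In practice the search box of your PDF viewer does this job. For anyone who would rather script the lookup, the idea can be sketched in a few lines of Python. The sketch assumes the text layer of the searchable PDF has already been extracted page by page with whatever tool you prefer; the `pages` dictionary, its sample content and the function names are hypothetical. The point it shows is that the garbled string from the translation file (here "rnodified" for "modified") is exactly what you search for, because the same OCR pass put the same error in both places.

```python
import re

def normalize(text):
    """Collapse whitespace and case so line breaks inside columns do not block a match."""
    return re.sub(r"\s+", " ", text).strip().lower()

def find_passage(pages, snippet, context=40):
    """Return (page_number, surrounding_text) for every page that contains the snippet."""
    needle = normalize(snippet)
    hits = []
    for number, text in sorted(pages.items()):
        haystack = normalize(text)
        pos = haystack.find(needle)
        if pos != -1:
            start = max(0, pos - context)
            hits.append((number, haystack[start:pos + len(needle) + context]))
    return hits

# Hypothetical per-page text taken from the searchable PDF's OCR layer.
pages = {
    1: "A valve assembly comprising\na housing and a piston.",
    2: "The piston is sealed by a rnodified\nO-ring seated in the groove.",
}

# Search for the questionable text exactly as it appears in the file being translated.
for page, excerpt in find_passage(pages, "rnodified O-ring"):
    print(page, excerpt)
```

Once you know the page, you can jump there in the scan and read what the original really says.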


  1. Hi Kevin & all,

    I usually do OCR the way you describe, with manual zone definition where needed and saving as plain text with embedded images to be reformatted as much as possible according to the original before loading into a CAT tool for translation.

    This can seem a somewhat long and painful process, but it is worth the effort, as the result is most of the time completely fit to be sent to the customer as it is, after you export it from MemoQ.

    But my trouble and question to you about this is more on the accounting than the technical side:

    1) In pages containing quite little text to be translated, but more pictures or tables full of figures or references, how are we supposed to count and value our work when preparing a quote?
    Should we charge extra for OCR + DTP, besides the translation at a per-word rate? Or should we charge a flat fee per page, whatever its content?

    2) The time spent on OCR + DTP is rarely valued by the customer, but is it by translators themselves? Are we actually aware of the time we spend on these (possibly unpaid) tasks / chores, and do we manage to get duly rewarded for them?

    3) And if we are lucky enough to get paid for them, and in some (most) cases where we are short on time, should we do this OCR + DTP work ourselves or outsource it to colleagues with more spare time at the moment? And if outsourcing, what would be a fair price per page to pay or to receive for this type of work, given the time it takes to do it right?

    Well, this topic has probably been tackled many times and in many other blogs, sorry if this is off-topic or repetition.

    Thanks for your insight on the matter,


  2. If your "multi-column patent with small type" is a published European patent, you can get an XML version from the European publication server:
    I guess it's the same as the file they did the publishing with, so the only OCR difficulties are those that the EPO did not spot.

