Aug 28, 2011

Using OCR to support translation processes

I recently noticed an element on the "wall" on my business Facebook page that I hadn't paid attention to before: questions. Since I've resumed research for a tutorial project I shelved several years ago when great uncertainty in the world of translation tools made my plans impractical, I thought I might try this feature and see what sort of feedback results. The question I posed was:
What sorts of project challenges do you wish you could handle with your translation environment tools (Trados, DVX, memoQ, OmegaT, whatever) that you cannot today for technical reasons or for lack of adequate explanations and examples?
It's only been up for a short time and the number of responses so far is modest, but some are quite interesting and may be revisited as blog posts or in other ways.

Some of the points raised so far involve the eternal topic of OCR for translation business. When my client Sansalone Technische Übersetzungen in Cologne first introduced me to the effective use of OCR for translation purposes many years ago, it was relatively rare in the world of commercial translation and too often incompetently performed, but I had been using that technology in one way or another for a decade already. But in many cases, "standard" procedures for optical character recognition are simply not well suited for our purposes, so I had a lot to learn.

Now some 9 years later, many of the LSPs and colleagues I know use OCR in some way, but unfortunately few do so efficiently or even usefully. Though sometimes there are TIFFs or JPEGs to be converted, most often these days the documents to be converted are obtained as some form of PDF. And there is enormous confusion and misrepresentation among translators as to what the various PDF formats are and how to deal with them.

I distinguish between two types of use for optical character recognition in my work: (1) quotation or count verification and (2) translation preparation. The former allows for considerable sloppiness, the latter for very little.

One of my LSP clients whom I introduced to OCR technology years ago uses it for one thing only: estimating text counts for quotation. Depending on the source, this may be very accurate or only a rough count (if there are serious contrast problems that can't be compensated for, for example). I do this not only with PDFs and bitmap files such as JPEG or TIFF, but also with large, complex documents in other formats. For example, if someone sends me a 200-page document in an MS Word format and I need to prepare a cost estimate for services quickly, I will often print it to PDF, then run the PDF through an OCR engine and save the results as plain text for counting. One cannot always rely on the counts from Microsoft Word itself or various translation tools for text counting. Embedded objects, even editable ones, are generally not included in the counts. I have seen RTF documents where half the tables are ordinary RTF tables (which are counted) and others which look the same are spreadsheet objects (not counted unless you are a Star Transit user AFAIK). And PowerPoint files can have embedded Excel or Visio objects or other uncountable elements. Printing to PDF and doing an OCR subsequently avoids this problem and enables one to ensure that nothing big was missed when counting by other means. This procedure also serves as an "early warning" if there is text to translate that is not extracted by the translation tools. It really sucks to "finish" a long job only to discover that several thousand words in tables and diagrams remain untranslated.
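The count check described above can be sketched in a few lines of Python. This is only a rough illustration: it assumes the OCR result has already been saved as plain text, and the function names and the definition of a "word" (any whitespace-separated token containing at least one letter or digit) are assumptions made for this example. Translation tools and word processors apply their own counting rules, so small differences are to be expected; only a large gap is a warning sign.

```python
import re


def count_words(text: str) -> int:
    """Count whitespace-separated tokens that contain a letter or digit.

    Tokens made up only of punctuation (stray dashes, bullets, OCR specks)
    are ignored so that they do not inflate the count.
    """
    return sum(1 for token in text.split() if re.search(r"[^\W_]", token))


def count_gap(ocr_text: str, tool_count: int) -> int:
    """Return how many more words the OCR text holds than the tool reported.

    A large positive value suggests text (embedded objects, diagrams)
    that the translation tool did not extract.
    """
    return count_words(ocr_text) - tool_count


if __name__ == "__main__":
    # Hypothetical example: plain text saved from the OCR program,
    # compared with a count reported by some other tool.
    sample = "Table 1 - Torque values for the M8 bolts"
    print(count_words(sample))
    print(count_gap(sample, tool_count=5))
```

Comparing the two figures before quoting takes seconds and shows quickly whether tables or diagrams have been left out of the tool's count.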

Using OCR to prepare translations is often straightforward, but there are a number of traps that people commonly fall into. Do not, under any circumstances, be seduced by the automatic conversion settings of any commercial OCR program nor by options to save with the "original formatting". This is nearly always a disaster when working with translation tools. Problems may include bizarre text changes, disappearing chunks of text due to text box sizing problems, a plague of tags and more. About six years ago I wrote some guidelines on the ways to save and work with OCR text; these are a bit dated, but generally valid. I would also suggest getting and learning to use Dave Turner's Code Zapper macros for MS Word; these can not only clean a lot of garbage out of troublesome OCR texts but also from MS Word or RTF documents that suffer from the tag plague for other reasons.

It is usually best to go through documents to be converted and manually set OCR zones and their properties (such as text, picture, orientation, inversion, etc.). The extra time spent doing this will be saved later when translating or making final edits to your work. And even with long documents (or especially with long documents), it is better to save the converted text with no formatting and then reapply formatting using defined styles. This approach usually also produces text that can be transferred to DTP programs such as InDesign with less trouble. Ignoring this particular bit of advice can lead to a lot of wasted time and grief.

Before you begin translating, it is also important to scan through the converted text and look for superfluous line breaks, excessive spaces and other formatting problems and fix these so that your translation segments will be as clean as possible. Such preparation at this stage is a very good investment of time. It may also be useful to run a source language spell check to catch errors in the text conversion.
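Part of this cleanup can be automated once the text has been saved without formatting. The following Python sketch is one possible approach, not a description of any particular OCR program's features: it assumes paragraphs are separated by blank lines in the saved text, and the function name and the `join_hyphens` option are invented for this example. Rejoining words split by a hyphen at a line end is switched off by default because it can also damage genuine hyphenated compounds.

```python
import re


def clean_ocr_text(text: str, join_hyphens: bool = False) -> str:
    """Remove superfluous line breaks and excess spaces from OCR plain text.

    Assumes blank lines mark paragraph boundaries; single line breaks
    inside a paragraph are treated as OCR layout residue.
    """
    # Normalise Windows and old Mac line endings.
    text = text.replace("\r\n", "\n").replace("\r", "\n")

    if join_hyphens:
        # "transla-\ntion" -> "translation" (risky for real compounds).
        text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)

    cleaned = []
    # Split on blank lines so paragraph breaks survive the cleanup.
    for paragraph in re.split(r"\n\s*\n", text):
        # Collapse line breaks, tabs and runs of spaces to a single space.
        paragraph = re.sub(r"\s+", " ", paragraph).strip()
        if paragraph:
            cleaned.append(paragraph)
    return "\n\n".join(cleaned)


if __name__ == "__main__":
    raw = "This is  a line\nbroken in the\nmiddle.\n\n\nNext   paragraph."
    print(clean_ocr_text(raw))
```

A script like this does not replace reading through the text; it only removes the most mechanical problems so that segmentation in the translation tool is cleaner.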

Another reason I like to invest a bit of extra effort in cleaning up an OCR source text is that it is often part of the delivery to a client who may have lost (or never had but may need) an original, editable document. This can be part of your presentation as a professional, a little service to set you apart from the competition. In some cases I have seen persons skilled in OCR offer their conversion services to busy translators or LSPs; I don't think the need for this type of service, if done right, has decreased in real terms, though I can't say what the actual demand is these days. I used to get a lot of requests for this service, though I discouraged inquiries for languages that I don't know and eventually steered most of those asking toward developing their own capabilities.

What are your experiences with OCR? Any tips for best practice with your favorite tools?


  1. Kevin, did you try Infix PDF Editor?

    I sometimes find it useful when I need to translate an editable PDF (created in some DTP software) for which there are no source files available and where the target file must look as close to the original as possible.

    This software helps to keep the original formatting that cannot otherwise be recreated in Word (tables, forms, captions, inclined text).

    It allows you to edit PDFs, and it also allows you to export XML and import the translated file. However, one needs to fiddle with the translated file to make it look right.


  2. Yes, I did, Mykhailo - two years ago. The review is here. The program is far too limited for most purposes and will only work with electronically generated PDFs in any case. It is useful for prepress touchups or perhaps small posters but not much else. I haven't tried the version that extracts content as XML for potential use with CAT tools, but I suspect it will not be without its issues for many layouts with sentences breaking across pages, target language expansion, etc. And of course it will be utterly useless for the many bitmap PDFs in circulation. You need OCR software for that. Moreover, could you really imagine trying to translate a 200 page PDF manual with that tool? If the XML export feature doesn't work well (or you have the cheaper version that doesn't offer it), it would be a nightmare of overwriting. I do, however, find it useful for editing and extracting some diagrams.

  3. Kevin, I represent a rather small LSP and we have an external service provider to convert our PDF files and other non-editable texts into Word files. And actually, this person is working more than full time; the demand is very high. What drives me crazy is that often the customer is capable of providing the source, such as an InDesign file, that we could easily process using CAT tools, but they will not pick up the phone to call their colleagues or whoever has access to the source file and will claim that the files are not within their reach instead. There is a considerable need to educate customers in this area, which would make life easier both for them and for us translators. Vaclav

  4. @Vaclav: I doubt there is anyone left in this profession who has not encountered the problem you describe. In my case, "education" starts with a heavy surcharge and polite hints about how costs might have been saved had the original files been available. In many cases, these "unavailable" files then materialize magically.

  5. Dear all,

    I tried this theory a couple of times (heavy surcharge) but I usually got a response "no need to lay out the final document, just plain text".

    And if you charge your clients the full word count, they do not care that you want this text to be processed in a CAT tool. Hey, most of them do not even know about the existence of CAT tools and the fact that a PDF is not the same as Word :).

    Kind regards,

  6. @Sebastijan: Not having been party to your exchanges with clients, I don't know how you have presented your proposal. I seldom find any resistance to PDF surcharges. But I don't talk about CAT tools and the like to any great extent. I simply say something like "it'll cost X in this format; if you can provide me with the original I can offer you a rate of X - Y%". I don't care what they understand or don't understand about PDF. I am doing the work; if I say that PDF involves significantly more time and effort even without formatting, then they can believe me and accept the proposal and its options, or go away and not bother me.

