Sep 25, 2011

Fixing a screwed-up OCR job

I was having a pleasant day doing remote tutoring of a friend who is a new memoQ user, talking about various file type issues and approaches to quotation. She mentioned a mutual acquaintance, Helen, who became so disgusted with the incompetent attempts of agencies and some direct clients to convert PDF documents and foist the results off on her as "Word documents ready to translate" (which, as anyone familiar with such things understands, they generally are not) that she has written into her general terms and conditions of business a special clause stipulating a surcharge for both PDFs and OCR documents. In my opinion, the surcharge for bad OCR should probably be double that for PDF, since a PDF at least gives me the basis for doing a proper OCR conversion myself.

Experienced translators are well aware of the horrors of bad OCR: documents that look like the original but undergo disconcerting font changes when translated with the classic Trados macros; text blocks that disappear because they are embedded in wrong-sized boxes; section breaks that disrupt the text; words that display in CAT tools with tags embedded in the middle, screwing up terminology lookups and TM matches; and more. I hope there is a special place in Hell for those who think that usual rates should apply to such time-wasting messes.

There are many remedies for these problems, such as Dave Turner's CodeZapper macros or the memoQ import option to ignore irrelevant tags in Word documents, but there really is no good substitute for doing the conversion right the first time. This is a skill I have taught to colleagues and clients on a number of occasions, because it saves everyone time and money.

After I ended my chat and tutorial with C, I went to work on a new project due tomorrow. I had been putting it off while working on some tutorials for next week, but I still had time to deal with it at a relaxed pace. Then I opened the document and realized that while I was distracted earlier this week, the project manager had sent me the horror of all horrible OCR jobs, an automatic conversion that violates every principle of good OCR practice. And it's Sunday. I'm screwed. No PDF to re-do the OCR my way.

Then I realized there are two possible solutions to address this problem which do not involve medium-range missiles.

I could, if the document had a lot of text boxes in bad sequence, print the file to PDF and do another OCR of that. But the text flow in this case is mostly in one block, so that's really not an issue. Instead, I followed another procedure, which gave me a raw file much like the one I create when doing a complex OCR job, and then quickly reformatted it into a usable source file that would not be polluted with tag trash and inconvenient text breaks. The steps were as follows:
  1. Save the bad OCR DOC file as plain text with the desired encoding.
  2. Open the plain text document in Microsoft Word or another full-featured text editor.
  3. Check the sequence of the text flow to be sure that it is correct and complete.
  4. Correct any "broken" sentences caused by line breaks in the wrong place in the OCR document. A bit of clever search and replace can usually be used to protect desired paragraph breaks before converting the unwanted ones into spaces to restore the messed-up sentences.
  5. Do any other formatting you want for page numbering, bold text, subtitle styles or whatever. 
The resulting file can be saved as RTF, DOC, DOCX or whatever you need and then used without trouble in your translation environment tool of choice. This saves a lot of time compared to what you would waste dealing with the poorly converted OCR document. But as part of a customer service philosophy that encourages partners to work together in the most efficient way possible and to respect each other's time and effort, appropriate surcharges for the time do apply.
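The search-and-replace trick in step 4 can be sketched in a few lines. This is a hypothetical illustration, not part of the original procedure, and it assumes a common pattern in plain-text OCR exports: real paragraph breaks appear as blank lines, while unwanted mid-sentence breaks appear as single line breaks. The same three-pass idea (protect, convert, restore) works equally well with Word's own find-and-replace dialog.

```python
import re

def repair_line_breaks(text: str) -> str:
    """Join OCR line breaks that split sentences, keeping real paragraph breaks.

    Assumes real paragraph breaks are blank lines and that single line
    breaks are unwanted leftovers from the OCR's hard line wrapping.
    """
    # 1. Protect the desired paragraph breaks (blank lines) with a placeholder.
    text = re.sub(r"\n\s*\n", "\u00a7PARA\u00a7", text)
    # 2. Convert the remaining unwanted line breaks into spaces.
    text = text.replace("\n", " ")
    # 3. Restore the protected paragraph breaks.
    text = text.replace("\u00a7PARA\u00a7", "\n\n")
    # 4. Collapse any doubled spaces left behind.
    return re.sub(r" {2,}", " ", text).strip()

broken = "This sentence was\nsplit by the OCR.\n\nA new paragraph."
print(repair_line_breaks(broken))
```

If the OCR output does not follow this pattern, a different "protection" cue is needed first, for example marking lines that end in sentence-final punctuation before converting the rest.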

Cleaning up my 18 pages of garbage from the PM took a bit over half an hour altogether, including some corrections of OCR errors. And if I want a clean source document after my translation, I simply correct any source errors I find in memoQ as I work and export a fixed source document later.


  1. Excellent advice, very applicable also to those documents that never went near OCR, but whose authors use a word processor as a mechanical typewriter (you know, a mix of tabs, spaces and hard returns instead of tables, a series of hard returns to force a page break, and similar).

  2. Hi Kevin, your advice is sound, but labor- and time-consuming. I always prefer to do the OCR anew: it's faster, and unlike whoever did the first one, I KNOW what I'm doing and how to avoid various glitches (or how to correct them). Of course, you also need the original, but my clients usually provide it along with the OCR file without an additional request.


  3. Kevin, have you by any chance seen a tool called "Infix Editor" by Iceni? It's from the same company that makes "Gemini", one of the several viable OCR options on the market. Infix is here: Ignoring the marketing-speak, Infix can be a cool addition to your zoo of PDF tools **if** the PDFs you get are relatively "clean", meaning: they are not password-protected, they use commonly available fonts like Arial, and the paragraphs/pages are not full and leave room for text expansion. If everything works as planned, roundtripping a PDF file through a translation process is actually easy, and you can use any XML-educated CAT tool on it.


Notice to spammers: your locations are being traced and fed to the recreational target list for my new line of chemical weapon drones :-)