Experienced translators are well aware of the horrors of bad OCR: documents that look like the original but undergo disconcerting font changes when translated with the classic Trados macros, text blocks that disappear because they are embedded in wrong-sized boxes, section breaks that disrupt the text, and words that display in CAT tools with tags embedded in the middle, screwing up terminology lookups, TM matches and more. I hope there is a special place in Hell for those who think that usual rates should apply to such time-wasting messes.
There are many remedies for these problems, such as Dave Turner's CodeZapper macros or the memoQ import option to ignore irrelevant tags in Word documents, but there really is no good substitute for doing the conversion right the first time. This is a skill I have taught to colleagues and clients on a number of occasions, because it saves everyone time and money.
After I ended my chat and tutorial with C, I went to work on a new project due tomorrow. I had been putting it off while working on some tutorials for next week, but I still had time for dealing with it at a relaxed pace. Then I opened the document and realized that while I was distracted earlier this week, the project manager had sent me the horror of all horrible OCR jobs, an automatic conversion that violates every principle of good OCR practice. And it's Sunday. I'm screwed. No PDF to re-do the OCR my way.
Then I realized there are two possible solutions to this problem that do not involve medium-range missiles.
If the document had a lot of text boxes in bad sequence, I could print the file to PDF and do another OCR of that. But the text flow in this case is mostly in one block, so that's really not an issue. Instead I followed another procedure, which gave me a raw file much like the one I create when doing a complex OCR. I then quickly reformatted it to get a usable source file that would not be polluted with tag trash and inconvenient text breaks. The steps were as follows:
- Save the bad OCR DOC file as plain text with the desired encoding.
- Open the plain text document in Microsoft Word or another full-featured text editor.
- Check the sequence of the text flow to be sure that it is correct and complete.
- Correct any "broken" sentences caused by line breaks in the wrong place in the OCR document. A bit of clever search and replace can usually be used to protect desired paragraph breaks before converting the unwanted ones into spaces to restore the messed-up sentences.
- Do any other formatting you want for page numbering, bold text, subtitle styles or whatever.
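The search-and-replace trick for repairing broken sentences can also be scripted. What follows is only a minimal sketch in Python, not the procedure I used in Word. It assumes that real paragraph breaks appear as blank lines in the saved text file and that the marker `@@PARA@@` (a hypothetical placeholder) does not occur anywhere in the text. If your OCR output marks paragraphs differently, the protection step needs a different pattern.

```python
import re

# Hypothetical marker, assumed not to occur anywhere in the document text.
PLACEHOLDER = "@@PARA@@"


def repair_line_breaks(text: str) -> str:
    """Join lines broken mid-sentence while keeping paragraph breaks.

    Assumption: a real paragraph break is one or more blank lines;
    every other line break is unwanted OCR residue.
    """
    # Normalize Windows and old Mac line endings, trim outer whitespace.
    text = text.replace("\r\n", "\n").replace("\r", "\n").strip()
    # 1. Protect the desired paragraph breaks (runs of blank lines).
    text = re.sub(r"\n(?:[ \t]*\n)+", PLACEHOLDER, text)
    # 2. Convert the remaining, unwanted line breaks into single spaces.
    text = re.sub(r"[ \t]*\n[ \t]*", " ", text)
    # 3. Restore the protected paragraph breaks.
    return text.replace(PLACEHOLDER, "\n\n")


if __name__ == "__main__":
    sample = (
        "This sentence was\nbroken by OCR.\n\n"
        "Second paragraph\nalso broken.\n"
    )
    print(repair_line_breaks(sample))
```

To run this on a real file, read the plain text with the same encoding chosen when saving it, for example `open(path, encoding="utf-8")`, and check the result by eye before importing it into a CAT tool. Words hyphenated at line ends are not handled here and would still need a manual pass.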
Cleaning up my 18 pages of garbage from the PM took a bit over half an hour altogether, including some corrections of OCR errors. If I want a clean source document after my translation, I simply correct any source errors I find in memoQ as I work and export a fixed source document later.