This article originally appeared on an online translators portal four years ago and was long overdue for removal there. Here is an update.
Optical character recognition (OCR) software is discussed often online and at translators' events, usually in the context of how to deal with PDF files. Hector Calabia, Peter Linton and others have made a useful technical contributions on this subject in articles and forums and at various conferences. However, it is useful to consider OCR software in a broader translation business context. Document conversion is often very useful for translation purposes and greatly facilitates automated quality checks of the draft, for example, but OCR can also generate additional income for your business and reduce quotation risk.
OCR for translation
There are a number of programs available for this purpose, and which one is best for your purposes may depend on the language combinations you deal with and other factors. For years now I have used Abbyy FineReader, because years ago it gave the best test results for the particular set of European languages one of our clients offered. It is also relatively inexpensive (I paid about 100 euros for FineReader 11) and easy to use.
Many OCR conversions of TIFF, JPEG and PDF documents which I receive from agencies are difficult to use for translation purposes and require significant modification - if they can be used at all. Particularly in cases where TM tools are to be used or target texts differ significantly in length (especially when they are longer) there may be problems. The best ways to avoid these problems are
- avoid automatic settings for OCR conversions; use zone definitions
- avoid saving the converted texts with full formatting in most cases
- use a suitable post-OCR workflow to clean up the converted document by joining broken sentences, removing superfluous characters, fixing conversion errors, etc.
If the idea of doing individual zone definitions on each page of a 100 page document is intimidating, take heart. Programs such as Abbyy FineReader often allow you to define layout templates, speeding up the work considerably. One translator I know became so skilled at the use of these OCR templates and was so good with his conversions that agencies hire him just to do high-quality OCR work for them. Which brings me to….
OCR as an income-generating activity for the translator or agency
Hardcopy, scanned documents, faxes and PDF documents generally require more work for translators than electronically editable documents and require different, sometimes more fallible quality control measures than a typical workflow for a translator using original electronic documents in a translation memory system. If no conversion is performed, it is more time-consuming to check terminology or use concordances during the translation, and it is also unfortunately too easy for eyes to skip over bits of text. Under time pressure this can lead to very serious problems. Even with conversion, the OCR text requires careful checking against the original document to identify and correct any errors introduced (and there will be some at times with even the best OCR software). So it is not at all unreasonable for a translator to charge a higher rate for dealing with hardcopy, scanned documents, faxes and PDF documents.
There are a number of ways to incorporate these higher charges into your business model. The two obvious ways are a premium (surcharged) word/line/page rate and hourly service charges. I usually offer both options to my clients, with the word/line rate surcharge representing the “fixed” rate and the hourly rate the “flexible” rate where I make an non-binding estimate and they may end up paying more or less according to the actual effort. For pure OCR conversion jobs where I am not doing the translating, I charge a typical proofreading rate or a bit more, because I go through the entire document and see that it is correctly formatted for translation work and that obvious errors are fixed (i.e. basic spellcheck, etc.).
Sometimes I hear that “the client doesn’t want to pay for that”. Well, that’s OK, too. The client has the option of doing the work and doing it right and saving me the effort. The recognition that there is additional effort involved and that this effort should be compensated is important. But usually there is a way to sugar-coat the "bitter" cost pill, and this is where your marketing savvy comes into play. Some win-win arguments you might present include:
- the availability of an editable source text the client can use for future versions;
- the ability to create TM resources using the OCR text (which can save time/money later);
- potentially better quality assurance, especially with tight deadlines.
Returning a clean, nicely formatted OCR of the source document is often good "advertising". End clients may appreciate how this saves time and allows them to use the original text in a variety of ways (attorneys may like to quote arguments from the opposing side, and copy/paste beats retyping). Discriminating agencies may recognize your skill at creating documents that don’t go crazy when edited (because of screwy text boxes, bad font definitions and other format errors) and offer you more work. If your language pair is in low demand or is very competitive, this may be one more way of distinguishing yourself from the pack.
I got started doing OCR work and charging for it after suffering through the conversion of several long PDF documents by more manual methods. I finally wised up, bought FineReader and started to use it with most of the hardcopy, scanned documents, faxes and PDF documents I received simply because it enabled me to use my TM tools and do better quality checks. I started sending the cleaner-looking source texts converted with OCR along with the target text translations, and soon I started getting requests for paid OCR work. A number of my agency clients then began to buy OCR tols and use them with varying degrees of success. Even if they do all the conversion work, I still win if they do it right, because I save time for what I enjoy more – the translation.
OCR as tool for quotation
Some people I know still haven’t learned to do a high-quality OCR (or
they don’t care to), but they still use the software effectively in a
very important area of their business: quotation and risk limitation.
There are lots of good tools out there for text counting, which is important to many methods of costing and time planning in the translation business. Some people even still do it manually, which, though time consuming, is not a bad way of checking the numbers from an electronic estimate. A number of factors can result in text counts being too low – embedded objects, such Excel tables or PowerPoint slides in a Microsoft Word documents
, or graphics with text - or even too high (as is the case with at least one CAT tool counting RTF and MS Word files). Keep using whichever method you prefer - I won't try to persuade you that any one approach is best. I use a number of methods myself.
When translating larger documents, however, or documents with a complex structure, it is often useful to have a “sanity check” for your text counts. On a number of occasions I have received translation jobs from agency clients where the text count was given a X words, where in fact there were quite a few more words embedded in Excel objects, bitmap graphics, Visio charts, etc. which had not been measured by the method used. In a few cases these clients had to take a loss on the job after giving a fixed price bid to the end client. Using OCR to check your estimates can prevent such an unfortunate scenario.
To do this, print the document (whatever it is) to a PDF file. Then run the PDF file through an OCR program with automatic settings (to save time – you don’t need to translate this OCR). Save the text and count it. There will probably be a bit more text due to headers or footers or perhaps garbage from graphics, but the results should be close to your other estimate. (You can always subtract an appropriate factor for the text count in headers and footers to improve your OCR estimate.) If there is a major deviation, this is a clear sign that you should take a much closer look at the document(s) before quoting the job.
Searchable scanned documents
Another use I have found for OCR in recent years is creating searchable "text-on-image" documents
from scanned PDFs, TIFF files and other bitmap formats. Although I have used these searchable PDFs mostly for reference while I work (searching for bits of text while viewing the original, unadulterated context) and supplied them to clients on only a few occasions, the potential for an additional value-added service is fairly obvious in this case.
OCR software is an essential tool for the work of many translators today, even more so than CAT software in many cases. Not just a tool for recovering “lost” electronic documents or making legacy typed material more accessible for translation work, it also offers possibilities for generating additional projects and income, differentiating one’s services and reducing risks when quoting large jobs. Key features of whatever OCR you choose should include the ability to select text areas for conversion and to determine their sequence in the converted text (using user-defined zones). Various options for saving the converted text (full page format, limited text formatting and no formatting) are also very helpful. Most important of all, though, is a good quality-checking workflow for your OCR documents (possibly including formatting) to avoid difficulties in the translation process and ensure that your work has a polished, professional appearance.
OCR software is another good tool for improving your visibility with clients and making your work processes easier in an age when many archiving and ERP systems are focused on the retention of PDF documents or TIFFs and even actively discourage saving original formats. The major providers of this software often have free, functional demonstration versions to use before making a purchase decision. Try several options and choose the best one for you. You won’t be sorry.