What sorts of project challenges do you wish you could handle with your translation environment tools (Trados, DVX, memoQ, OmegaT, whatever) that you cannot today for technical reasons or for lack of adequate explanations and examples? It's only been up for a short time and the number of responses so far is modest, but some are quite interesting and may be revisited as blog posts or in other ways.
Some of the points raised so far involve the eternal topic of OCR in the translation business. When my client Sansalone Technische Übersetzungen in Cologne first introduced me to the effective use of OCR for translation purposes many years ago, it was relatively rare in the world of commercial translation and too often incompetently performed. I had been using that technology in one way or another for a decade already, but in many cases "standard" procedures for optical character recognition are simply not well suited for our purposes, so I had a lot to learn.
Now some 9 years later, many of the LSPs and colleagues I know use OCR in some way, but unfortunately few do so efficiently or even usefully. Though sometimes there are TIFFs or JPEGs to be converted, most often these days the documents to be converted are obtained as some form of PDF. And there is enormous confusion and misrepresentation among translators as to what the various PDF formats are and how to deal with them.
I distinguish between two types of use for optical character recognition in my work: (1) quotation or count verification and (2) translation preparation. The former allows for considerable sloppiness, the latter for very little.
One of my LSP clients whom I introduced to OCR technology years ago uses it for one thing only: estimating text counts for quotation. Depending on the source, this may be very accurate or only a rough count (if there are serious contrast problems that can't be compensated for, for example). I do this not only with PDFs and bitmap files such as JPEG or TIFF, but also with large, complex documents in other formats. For example, if someone sends me a 200-page document in an MS Word format and I need to prepare a cost estimate for services quickly, I will often print it to PDF, then run the PDF through an OCR engine and save the results as plain text for counting. One cannot always rely on the counts from Microsoft Word itself or various translation tools for text counting. Embedded objects, even editable ones, are generally not included in the counts. I have seen RTF documents where half the tables are ordinary RTF tables (which are counted) and others which look the same are spreadsheet objects (not counted unless you are a Star Transit user AFAIK). And PowerPoint files can have embedded Excel or Visio objects or other uncountable elements. Printing to PDF and then running OCR avoids this problem and enables one to ensure that nothing big was missed when counting by other means. This procedure also serves as an "early warning" if there is text to translate that is not extracted by the translation tools. It really sucks to "finish" a long job only to discover that several thousand words in tables and diagrams remain untranslated.
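For those who like to script the counting step, here is a minimal Python sketch of the idea: count the words in the plain text saved from the OCR program and compare that figure with the count reported by Word or a translation tool. The function names and the 10% tolerance are my own assumptions for illustration, not part of any tool, and "word" here simply means a whitespace-separated token containing at least one letter or digit, which will not match every tool's counting rules.

```python
import re


def count_words(text: str) -> int:
    """Count whitespace-separated tokens that contain a letter or digit.

    Stray punctuation left behind by OCR (lone dashes, bullets) is skipped.
    """
    return sum(1 for token in text.split() if re.search(r"\w", token))


def text_probably_missed(ocr_count: int, tool_count: int,
                         tolerance: float = 0.10) -> bool:
    """Return True if the tool count falls short of the OCR count by more
    than the given fraction (hypothetical threshold; adjust to taste)."""
    if ocr_count == 0:
        return False
    shortfall = (ocr_count - tool_count) / ocr_count
    return shortfall > tolerance


if __name__ == "__main__":
    sample = "Pump housing - part no. 4711\nTighten all four bolts."
    print("OCR word count:", count_words(sample))
    print("Warning needed:", text_probably_missed(1000, 800))
```

A large shortfall in the tool's count is the "early warning" that embedded objects or other text may not have been extracted.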
Using OCR to prepare translations is often straightforward, but there are a number of traps that people commonly fall into. Do not, under any circumstances, be seduced by the automatic conversion settings of any commercial OCR program or by options to save with the "original formatting". This is nearly always a disaster when working with translation tools. Problems may include bizarre text changes, disappearing chunks of text due to text box sizing problems, a plague of tags and more. About six years ago I wrote some guidelines on the ways to save and work with OCR text; these are a bit dated, but generally valid. I would also suggest getting and learning to use Dave Turner's Code Zapper macros for MS Word; these can clean a lot of garbage not only out of troublesome OCR texts but also out of MS Word or RTF documents that suffer from the tag plague for other reasons.
It is usually best to go through documents to be converted and manually set OCR zones and their properties (such as text, picture, orientation, inversion, etc.). The extra time spent doing this will be saved later when translating or making final edits to your work. And even with long documents (or especially with long documents), it is better to save the converted text with no formatting and then reapply formatting using defined styles. This approach usually also produces text that can be transferred to DTP programs such as InDesign with less trouble. Ignoring this particular bit of advice can lead to a lot of wasted time and grief.
Before you begin translating, it is also important to scan through the converted text and look for superfluous line breaks, excessive spaces and other formatting problems and fix these so that your translation segments will be as clean as possible. Such preparation at this stage is a very good investment of time. It may also be useful to run a source language spell check to catch errors in the text conversion.
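Part of this cleanup can be scripted when the OCR result has been saved as plain text. The following Python sketch is one possible approach under a stated assumption: that blank lines mark real paragraph boundaries and single line breaks inside a paragraph are superfluous. That assumption will not hold for every document (lists and tables need a manual look), and the function name is made up for this example. It does not try to rejoin words hyphenated at line ends, since that is risky without knowing the language.

```python
import re


def clean_ocr_text(text: str) -> str:
    """Remove superfluous line breaks and excess spaces from OCR plain text.

    Assumes blank lines separate paragraphs; returns one paragraph per line.
    """
    # Normalize Windows and old Mac line endings first.
    text = text.replace("\r\n", "\n").replace("\r", "\n")

    cleaned = []
    # One or more blank lines are treated as a paragraph boundary.
    for paragraph in re.split(r"\n\s*\n", text):
        # A single line break inside a paragraph becomes a space.
        paragraph = re.sub(r"\s*\n\s*", " ", paragraph)
        # Collapse runs of spaces, tabs and non-breaking spaces.
        paragraph = re.sub(r"[ \t\u00a0]+", " ", paragraph).strip()
        if paragraph:
            cleaned.append(paragraph)
    return "\n".join(cleaned)


if __name__ == "__main__":
    raw = "This is a line\nbroken in  two.\n\n\nNext   paragraph.\n"
    print(clean_ocr_text(raw))
```

The result still needs a read-through and a source language spell check; a script like this only takes care of the mechanical part.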
Another reason I like to invest a bit of extra effort in cleaning up an OCR source text is that it is often part of the delivery to a client who may have lost (or never had but may need) an original, editable document. This can be part of your presentation as a professional, a little service to set you apart from the competition. In some cases I have seen persons skilled in OCR offer their conversion services to busy translators or LSPs; I don't think the need for this type of service, if done right, has decreased in real terms, though I can't say what the actual demand is these days. I used to get a lot of requests for this service, though I discouraged inquiries for languages that I don't know and eventually steered most of those asking toward developing their own capabilities.
What are your experiences with OCR? Any tips for best practice with your favorite tools?