Oct 27, 2012

I dream of human-assisted machine translation

Reading in bed has a number of hazards, particularly for those who require some assistance to maintain their oxygen supply during sleeping hours and are prone to nodding off, book in hand. And woe to those whose choice of reading may cause uneasy stirrings of the mind, leading it down paths best untraveled. Since purchasing a Kindle just over a year ago I have had many unrestful nights as I read classics neglected in my youth, Dracula and Frankenstein being two of the happier examples. But little did I expect that reading a retrospective review of Paul Verhoeven's 1987 film Robocop would prove my final undoing....

They say that only two things are inevitable: death and taxes. They are wrong.

In the wake of the euro's collapse following the final meltdown of the Greek economy and its last government, German banks were faced with the problem of recouping their losses from people able to offer little more than a moderate supply of transplant organs for the ravenous Chinese market, and concerns among tour operators at the sudden increase of menus offering "long pig" indicated that international medical markets were in a state of oversupply and savory alternatives were too often the order of the day - and occasionally served as unasked-for appetizers.

Clearly, more sustainable engines of value were needed to drive the economy and extract returns from those who benefit from the strength and protection of the state. Fortunately, the persons closest to our enlightened, modern governments are ever active in the pursuit of advances in efficiency to enhance the return on the great investments made in the citizenry.

My life, or more precisely my economic utility after life, was saved by the wondrous results of a joint research project of governments, university medical centers and TAUS members determined to eliminate the exorbitant and unnecessary expense of human translation and post-editing. Although enormous headway had been made in convincing people of the progress of machine translation in its 70-year history, throughout that time the actual perfection of the results remained a maddening five years distant, the gap never quite closing as the faithful knew it should.

Some apostates among MT believers claimed that the technology could never achieve its full potential without the collaboration of qualified linguists, but these specialists remained stubborn for the most part in their resistance to using their brains for linguistic garbage recycling. Resistant, that is, until their brains became part of the recycling effort, offering value-added new "life" for the benefit of bottom lines.

A little-noticed provision of the 2014 tax reforms introduced by the Merkel government in Germany and emulated by its subjects and acolytes around the world was conditional relief based on the reversion to the state of corporeal assets after the cessation of autorespiratory processes. And those who noticed... well, the comforting comparison with reverse mortgages and the lesser associated asset risk usually put their minds at ease.

The first indication I had that something was different was when I reached for my Kindle and found that it was no longer beside me, that in fact I could discern no "side". After a few moments' confusion, my vision cleared, and I was presented with a screen of text:
Crucial for the occurrence of a merger is the cross section, the measure of the probability that the colliding nuclei react. Sufficiently large, the cross section is usually only when the two cores with high energy collide. Which is necessary to overcome the Coulomb barrier, the electrical repulsion between the positively charged nuclei. However, the cross section is also less impact energy due to the tunneling effect is not zero. Reach the nuclei at a distance of only about 10-15 m from each other, they bind together the strong interaction, the nuclei have merged.
WTF? A timer appeared above the text, accompanied by a message to begin postediting. Huh? I stared at the jumble of words with no little confusion, wondering what their purpose could be. Colliding nuclei? Merger? Something to do with nuclear fusion perhaps, but hard to tell from this gurgled mess. I continued to stare at the text as the timer counted down. The message to begin began to flash, slowly at first, then more intensely. The countdown reached zero, and the message changed: target not achieved.

As I felt the electric surge convulse my body, I realized that I had no mouth with which to scream. But every part of my not-present body screamed for me as nerve centers received their reinforcing, teaching stimulus.


TRY AGAIN

flashed briefly before my eyes, and the text reappeared.

My unseen fingers moved across an unfelt keyboard. The cross section is critical for determining the probability that colliding nuclei will react and fusion occur....



Oct 22, 2012

Another translation jobs portal? No thanks.

After I published my recent note on online proZtitution and the race to the bottom on commercial translation portals, I received a polite e-mail from someone who is working to build a better "jobs board" for translation projects. I've received quite a few messages like this in the past four years since I started this blog.

But I fail to see a compelling case for yet another intermediary site for job auctions or anything of that sort.

There is certainly good sense in professional associations with a vetted membership upgrading their directory sites and making them easier for potential clients to find and use, and for some time now I have felt that the leaders of major organizations like the ATA, ITI, IoL, SFT, BDÜ, etc. ought to link their directories and allow better international searches as a counterweight to the less than optimal listings one finds on PrAdZ and other sites.

There's even more sense in individual language service providers improving their online presence in ways that help prospects find them and recognize a good fit.

The largest commercial portal for translation job auctions has become increasingly irrelevant to all but the low-end commodity market that is probably better off with Gargled Translate given its expectations. The predominance of Indian, Chinese and Eastern European language sausage purveyors (LSPs) on ProZ has driven many serious brokers and service providers elsewhere; a decade ago, even five years ago, a number of interesting inquiries came through that channel, but today they inevitably find me via the BDÜ directory or other channels I more or less control.

So I wish all the aspiring intermediaries success with their planned sites and hope they can in fact make a difference in some useful way, but for the most part, translators should look to "home remedies" for curing what may ail their businesses.

Oct 21, 2012

Put OCR in Your Business Model

This article originally appeared on an online translators portal four years ago and was long overdue for removal there. Here is an update.

*****
Optical character recognition (OCR) software is discussed often online and at translators' events, usually in the context of how to deal with PDF files. Hector Calabia, Peter Linton and others have made useful technical contributions on this subject in articles and forums and at various conferences. However, it is useful to consider OCR software in a broader translation business context. Document conversion is often very useful for translation purposes and greatly facilitates automated quality checks of the draft, for example, but OCR can also generate additional income for your business and reduce quotation risk.
OCR for translation
There are a number of programs available for this purpose, and which one is best for your purposes may depend on the language combinations you deal with and other factors. For years now I have used Abbyy FineReader, because years ago it gave the best test results for the particular set of European languages one of our clients offered. It is also relatively inexpensive (I paid about 100 euros for FineReader 11) and easy to use.

Many OCR conversions of TIFF, JPEG and PDF documents which I receive from agencies are difficult to use for translation purposes and require significant modification - if they can be used at all. Particularly in cases where TM tools are to be used or target texts differ significantly in length (especially when they are longer) there may be problems. The best ways to avoid these problems are to:
  • avoid automatic settings for OCR conversions; use zone definitions instead
  • avoid saving the converted texts with full formatting in most cases
  • use a suitable post-OCR workflow to clean up the converted document by joining broken sentences, removing superfluous characters, fixing conversion errors, etc.
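As a rough illustration of the third point, here is a minimal Python sketch of a cleanup pass over plain-text OCR output. It is not part of FineReader or any other OCR product; the function name `clean_ocr_text` and the specific rules (treating page breaks as paragraph breaks, rejoining words hyphenated at line ends) are assumptions chosen for the example, and a real workflow would still need a human check against the original.

```python
import re


def clean_ocr_text(raw: str) -> str:
    """Join lines broken by OCR and strip common debris from plain-text output.

    Hypothetical helper for illustration only; the rules are deliberately simple.
    """
    # Normalize line endings so the later rules only have to deal with "\n".
    text = raw.replace("\r\n", "\n").replace("\r", "\n")
    # Treat a form feed (page break) as a paragraph break rather than losing it.
    text = text.replace("\x0c", "\n\n")
    # Remove remaining control characters, keeping tab and newline.
    text = re.sub(r"[\x00-\x08\x0b\x0e-\x1f]", "", text)
    # Rejoin words hyphenated across a line break: "scan-\nned" -> "scanned".
    # Naive assumption: genuine compounds split at the hyphen lose it too,
    # so the result needs a spellcheck pass.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)

    cleaned = []
    # Blank lines separate paragraphs; single newlines inside one are OCR breaks.
    for para in re.split(r"\n\s*\n", text):
        lines = [line.strip() for line in para.split("\n") if line.strip()]
        joined = re.sub(r"[ \t]{2,}", " ", " ".join(lines))
        if joined:
            cleaned.append(joined)
    return "\n\n".join(cleaned)


if __name__ == "__main__":
    sample = "The client sent a scan-\nned contract  that was\nbroken across lines."
    print(clean_ocr_text(sample))
```

Joining the broken lines this way gives a TM tool whole sentences to segment, which is the main reason for doing the cleanup before translation rather than after.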
If the idea of doing individual zone definitions on each page of a 100-page document is intimidating, take heart. Programs such as Abbyy FineReader often allow you to define layout templates, speeding up the work considerably. One translator I know became so skilled at the use of these OCR templates and was so good with his conversions that agencies hire him just to do high-quality OCR work for them. Which brings me to….

OCR as an income-generating activity for the translator or agency
Hardcopy, scanned documents, faxes and PDF documents generally require more work for translators than electronically editable documents and require different, sometimes more fallible quality control measures than a typical workflow for a translator using original electronic documents in a translation memory system. If no conversion is performed, it is more time-consuming to check terminology or use concordances during the translation, and it is also unfortunately too easy for eyes to skip over bits of text. Under time pressure this can lead to very serious problems. Even with conversion, the OCR text requires careful checking against the original document to identify and correct any errors introduced (and there will be some at times with even the best OCR software). So it is not at all unreasonable for a translator to charge a higher rate for dealing with hardcopy, scanned documents, faxes and PDF documents.

There are a number of ways to incorporate these higher charges into your business model. The two obvious ways are a premium (surcharged) word/line/page rate and hourly service charges. I usually offer both options to my clients, with the word/line rate surcharge representing the “fixed” rate and the hourly rate the “flexible” rate where I make a non-binding estimate and they may end up paying more or less according to the actual effort. For pure OCR conversion jobs where I am not doing the translating, I charge a typical proofreading rate or a bit more, because I go through the entire document and see that it is correctly formatted for translation work and that obvious errors are fixed (e.g. with a basic spellcheck).

Sometimes I hear that “the client doesn’t want to pay for that”. Well, that’s OK, too. The client has the option of doing the work and doing it right and saving me the effort. The recognition that there is additional effort involved and that this effort should be compensated is important. But usually there is a way to sugar-coat the "bitter" cost pill, and this is where your marketing savvy comes into play. Some win-win arguments you might present include:
  • the availability of an editable source text the client can use for future versions;
  • the ability to create TM resources using the OCR text (which can save time/money later);
  • potentially better quality assurance, especially with tight deadlines. 
Returning a clean, nicely formatted OCR of the source document is often good "advertising". End clients may appreciate how this saves time and allows them to use the original text in a variety of ways (attorneys may like to quote arguments from the opposing side, and copy/paste beats retyping). Discriminating agencies may recognize your skill at creating documents that don’t go crazy when edited (because of screwy text boxes, bad font definitions and other format errors) and offer you more work. If your language pair is in low demand or is very competitive, this may be one more way of distinguishing yourself from the pack.
I got started doing OCR work and charging for it after suffering through the conversion of several long PDF documents by more manual methods. I finally wised up, bought FineReader and started to use it with most of the hardcopy, scanned documents, faxes and PDF documents I received simply because it enabled me to use my TM tools and do better quality checks. I started sending the cleaner-looking source texts converted with OCR along with the target text translations, and soon I started getting requests for paid OCR work. A number of my agency clients then began to buy OCR tools and use them with varying degrees of success. Even if they do all the conversion work, I still win if they do it right, because I save time for what I enjoy more – the translation.

OCR as a tool for quotation
Some people I know still haven’t learned to do a high-quality OCR (or they don’t care to), but they still use the software effectively in a very important area of their business: quotation and risk limitation.

There are lots of good tools out there for text counting, which is important to many methods of costing and time planning in the translation business. Some people even still do it manually, which, though time-consuming, is not a bad way of checking the numbers from an electronic estimate. A number of factors can result in text counts being too low – embedded objects, such as Excel tables or PowerPoint slides in Microsoft Word documents, or graphics with text – or even too high (as is the case with at least one CAT tool counting RTF and MS Word files). Keep using whichever method you prefer - I won't try to persuade you that any one approach is best. I use a number of methods myself.

When translating larger documents, however, or documents with a complex structure, it is often useful to have a “sanity check” for your text counts. On a number of occasions I have received translation jobs from agency clients where the text count was given as X words, where in fact there were quite a few more words embedded in Excel objects, bitmap graphics, Visio charts, etc. which had not been measured by the method used. In a few cases these clients had to take a loss on the job after giving a fixed price bid to the end client. Using OCR to check your estimates can prevent such an unfortunate scenario.

To do this, print the document (whatever it is) to a PDF file. Then run the PDF file through an OCR program with automatic settings (to save time – you don’t need to translate this OCR). Save the text and count it. There will probably be a bit more text due to headers or footers or perhaps garbage from graphics, but the results should be close to your other estimate. (You can always subtract an appropriate factor for the text count in headers and footers to improve your OCR estimate.) If there is a major deviation, this is a clear sign that you should take a much closer look at the document(s) before quoting the job.
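For those who like to script such checks, the comparison might be sketched in Python roughly as follows. This assumes the OCR text has already been saved as plain text; the names `count_words` and `check_estimate`, the simple word-matching rule and the 10% tolerance are all assumptions made for the illustration, not recommended values, and every counting tool will give slightly different numbers.

```python
import re


def count_words(text: str) -> int:
    """Count words, treating internal apostrophes and hyphens as part of a word."""
    return len(re.findall(r"\w+(?:['’-]\w+)*", text))


def check_estimate(quoted_words: int, ocr_text: str,
                   header_footer_words: int = 0, tolerance: float = 0.10):
    """Compare a quoted word count with a count taken from OCR text.

    Returns (adjusted OCR count, relative deviation, needs_closer_look).
    The tolerance is an arbitrary illustrative threshold.
    """
    if quoted_words <= 0:
        raise ValueError("quoted_words must be positive")
    # Subtract an allowance for repeated headers and footers picked up by the OCR.
    ocr_words = max(count_words(ocr_text) - header_footer_words, 0)
    deviation = (ocr_words - quoted_words) / quoted_words
    return ocr_words, deviation, abs(deviation) > tolerance


if __name__ == "__main__":
    ocr_text = "word " * 1300  # stand-in for the saved OCR output
    ocr_words, deviation, flag = check_estimate(1000, ocr_text, header_footer_words=50)
    print(f"OCR count {ocr_words}, deviation {deviation:.0%}, check document: {flag}")
```

A positive deviation well above the tolerance suggests text hidden in embedded objects or graphics that the original count missed; a small one is just the expected noise from headers, footers and stray characters.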

Searchable scanned documents
Another use I have found for OCR in recent years is creating searchable "text-on-image" documents from scanned PDFs, TIFF files and other bitmap formats. Although I have used these searchable PDFs mostly for reference while I work (searching for bits of text while viewing the original, unadulterated context) and supplied them to clients on only a few occasions, the potential for an additional value-added service is fairly obvious in this case.

Conclusion
OCR software is an essential tool for the work of many translators today, even more so than CAT software in many cases. Not just a tool for recovering “lost” electronic documents or making legacy typed material more accessible for translation work, it also offers possibilities for generating additional projects and income, differentiating one’s services and reducing risks when quoting large jobs. Key features of whatever OCR you choose should include the ability to select text areas for conversion and to determine their sequence in the converted text (using user-defined zones). Various options for saving the converted text (full page format, limited text formatting and no formatting) are also very helpful. Most important of all, though, is a good quality-checking workflow for your OCR documents (possibly including formatting) to avoid difficulties in the translation process and ensure that your work has a polished, professional appearance.

OCR software is another good tool for improving your visibility with clients and making your work processes easier in an age when many archiving and ERP systems are focused on the retention of PDF documents or TIFFs and even actively discourage saving original formats. The major providers of this software often have free, functional demonstration versions to use before making a purchase decision. Try several options and choose the best one for you. You won’t be sorry.