Feb 23, 2014

Cleaning up a crappy OCR job for translation

It's a sad fact in the professional work of translators that a lack of understanding of how to deal effectively with various PDF formats causes enormous loss of productivity and results which are not really fit for purpose. The aggressive insistence of many colleagues possessed of a dangerous Halbwissen on using half-baked methods and inappropriate tools contributes to the problem, but, bowing to the wisdom about arguing with fools, I now mostly sit back with a bemused and amused smile and watch the tribulations of those who believe in salvation by PDF import filters and cheap or free OCR. "TANSTAAFL" is as true as it ever was.

Just before the weekend I got an inquiry from an agency client I rather like. Nice people, good attitude, but sometimes struggling to find their way with technology despite some in-country "expert" training. This inquiry looked a bit like ripe fish at first glance. The smell got stronger when I was told that, because the corporate end client had converted the PDF for their annual report and begun to edit the mess (and comment it heavily too) in the OCR file, this would be all there was to work with. It was a thoroughly appetizing sight when imported into a translation environment:

There are so many issues in that tossed salad of translation terror that I don't even know where to start describing them.

The screenshot above was in memoQ. How does it look in SDL Trados Studio? Often just as messy. In this case, this was the result in an older version of Studio:

SDL Trados Studio choked and refused to import the file!

I do have the latest version of SDL Trados Studio 2014, but unfortunately it's on a system that does not yet have Microsoft Office, because I refuse to bow to Microsoft's insistence that I must buy a Portuguese version of that software. No MS Office, no file import in this case with SDL Trados Studio. memoQ fortunately has not needed MS Office to import its old file formats since the release of memoQ 6.0.

Ugly OCR trash like this file is all too common at this time of year, and as I am busy compiling the syllabus for the workshop I want to do on better living with well-used technology for legal and financial translators, I felt obliged to take this one on as a teaching example. It's actually not as bad as it looks. On the other hand, the best approach may not always be obvious, and the best solution for one document may not apply as well or at all to another.

My first approach was to use Dave Turner's CodeZapper macros. This isn't as straightforward as it used to be since I downgraded from Microsoft Office 2003 to later versions; for some reason the toolbar refuses to stay loaded between work sessions, and there's no way I can keep track of all the abbreviations for macros on it.

I can't deal with anything more complicated than clicking the "CZL" option for "Code Zapper lite", which did a rather decent job on the heavy mess above:

But all was not quite as well as it seemed:

Text in the header and footer remained trashed, and the heavy use of comments and tabbed lists meant that there were plenty of legitimate tags to deal with which were just too confusing with the DVX-like mess of memoQ's default import and display for an RTF file.

So I went for a kinder, gentler approach. I changed my import filter settings in memoQ:

There is actually seldom any good reason to import an RTF or DOC file into memoQ using the default filter settings. And marking those two little checkboxes at the bottom often accomplishes much of what CodeZapper does. Sometimes less. A bit more in this case.

The header and footer texts were absolutely clean. Don't let the extra tags in this sample fool you: overall, there were fewer than in the code-zapped file. Now there are still a number of issues to be seen in the screenshot above, including paragraph breaks in the middle of a sentence and awful manual hyphenation (many instances of that in the whole text) and joys like badly placed comments and links which mess up the text and prevent term identification by the software:

Source editing features of memoQ (F2) enable issues like the two above to be dealt with easily:

After a bit of repair like this in the memoQ environment (where it is really much, much easier to fix the problems of bad comment and link placement), I copied the entire source text to the target to enable me to export a cleaner source text file. I then opened this file in Microsoft Word and used various search and replace operations to fix the bad hyphenation and other problems like excess spaces. Replacing the hyphens had to be done occurrence-by-occurrence, because the style of writing in German meant that there were many legitimate instances of hyphens followed by spaces.
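For those who prefer to script part of this cleanup outside Word, the safe and unsafe operations can be separated. Here is a minimal Python sketch (the patterns and sample text are my own illustration, not part of the actual job): collapsing excess spaces is mechanical, while hyphen joins are only flagged for the human eye, for exactly the reason described above.

```python
import re

def clean_ocr_text(text: str) -> str:
    """Collapse runs of spaces left behind by the OCR process."""
    # Excess spaces can be removed safely without review.
    return re.sub(r" {2,}", " ", text)

def hyphen_candidates(text: str) -> list[str]:
    """Find 'word- word' patterns that may be OCR line-break hyphenation.
    These need human review: German style includes many legitimate
    hyphens followed by spaces, so automatic joining would corrupt the text."""
    return re.findall(r"\w+- \w+", text)

sample = "Überset-  zung mit  vielen   Leerzeichen"
print(clean_ocr_text(sample))  # "Überset- zung mit vielen Leerzeichen"
print(hyphen_candidates(clean_ocr_text(sample)))  # ["Überset- zung"]
```

The point of keeping the two functions apart is the same as in the manual workflow: one pass is safe to automate, the other is not.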

After all was done, the "before and after" looked like this:



The remaining tags were all legitimate formatting tags for comments, hyperlinks, tabs after section numbering, etc. These do, of course, require attention and add complexity to the work still, so they must be included in the charges for the job. memoQ makes this calculation particularly simple by allowing weighting factors to be specified in the analysis. These are the settings I typically use for a German source text:

I find this usually represents a fair minimum for the additional effort in translation and quality assurance that tags require. In this case, of course, time charges for the cleanup apply, but as you can probably guess from comparing the two analysis tables above, the customer is actually saving a lot of money by paying me to clean up the mess, and the results will be a lot more usable. My cleaned-up version of the source text will also be returned in case the authors intend to make more revisions in the source - this will save more time and money by avoiding redundant cleanup in that case.
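For anyone curious about the arithmetic behind tag weighting: the analysis simply adds a word-equivalent for each inline tag to the billable count. This hypothetical Python sketch shows the idea (the 0.25 factor is purely illustrative, not a recommendation of my actual settings):

```python
def weighted_words(word_count: int, tag_count: int,
                   words_per_tag: float = 0.25) -> float:
    """Add a word-equivalent for each inline tag to the billable word count.
    The weighting factor should reflect the real extra effort that tags
    cause in translation and quality assurance."""
    return word_count + tag_count * words_per_tag

# A 10,000-word report with 800 leftover formatting tags:
print(weighted_words(10_000, 800))  # 10200.0
```

Comparing this figure before and after cleanup makes the customer's savings from a properly prepared file easy to demonstrate.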

Feb 16, 2014


For years now, I've watched the German term Medienbruch and derivatives like medienbruchfrei seep their way into various, well, media. And the results are, by and large, linguistically discontinuous and/or broken :-)

The screenshot above shows a sampling from the "feast" of bilingual web alignment offered by Linguee, one which those who are familiar with the service realize must be enjoyed with some care to avoid the odd bits of broken glass and strychnine that may be found in its sometimes machine-translated, sometimes Bulgarian-inspired English. Some of the more thoughtful marketers in the translation profession may use the site as a source to research firms desperately in need of better translators rather than merely pan its buckets full of linguistic swamp sediment in desperation to find the few flakes of gold which might have settled there. Not to say that Linguee is not a valuable working tool, but its use requires a great deal of professional judgment in most cases, and often further research if one is on unfamiliar ground. It is a tool, and a fool with a tool remains a fool. Like the often criticized Wikipedia, it is often a good starting point but seldom the journey's end.

Why not just Google "media discontinuity", for example, and see how often the particular phrase occurs? Uh huh. You got 8,100 results, did you? Sounds pretty authoritative, right? Look again at where the sources are. Ah, you say, Kevin ought to have used Google's advanced search options to narrow it down. Have a look at the garbage on the first page again, and tell me with a straight face that better sources will somehow rise to the top of that like tasty cream.

At this point, a colleague less overcaffeinated might point out that I'm going about this all wrong and that I might do better searching some of the monolingual corpora available free online, such as the British National Corpus (which is great for making Americans believe that Brits really do say "at the weekend" most of the time) or the BYU Corpus of Global Web-based English, which allows those who master the arcana of its search rules to compare Indian weekends with those in Canada or Kenya. But these text collections, for all their value for a general understanding of common language use, are seldom satisfying for specialized use and neologisms.

A cooler head might point out the terrible bias in my approaches so far and encourage me to consider the context in which this word has occurred while percolating into my awareness in recent years. Hm. Well, usually when some German public authority decides to bring its information management practices up to 20th century standards. Since they usually follow well behind practice in English-speaking countries for this sort of thing (I still see German government texts that put "E-Mail" in quotes as if it were something new and exotic), it's a good bet that a corpus of texts from similar institutions in the UK or US might reveal some interesting, useful and even natively acceptable possibilities. (That's how I learned years ago that a usual and proper term in English for Vieraugenprinzip is "dual review", not the "four-eyes principle" that some so happily click their heels to. In the elementary schools I attended, that was the principle based on which bullies punched my friends with glasses in the face.)

Juliette Scott, who writes a rather nice blog mostly concerned with legal translation issues, has done some nifty work as part of her doctoral thesis using monolingual corpora to study patterns of use in target languages and suggest useful strategies for building collections of specialized text. Good, perhaps even obvious stuff which applies to any domain I can think of, though the practice of building focused corpora from carefully selected sources is still far too rare among those of us who use the label "translator" or "language professional". I certainly need to do that more often.
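As a very rough illustration of the focused-corpus idea, here is a hypothetical Python sketch that counts candidate term occurrences across a folder of collected plain-text sources (the directory and term names are invented for the example; a real corpus would of course be built from carefully vetted institutional texts):

```python
import re
from collections import Counter
from pathlib import Path

def term_frequencies(corpus_dir: str, terms: list[str]) -> Counter:
    """Count candidate term occurrences across a folder of plain-text
    files collected from carefully chosen sources. Case-insensitive,
    literal matching only - no stemming or lemmatization."""
    counts = Counter()
    for path in Path(corpus_dir).glob("*.txt"):
        text = path.read_text(encoding="utf-8", errors="ignore").lower()
        for term in terms:
            counts[term] += len(re.findall(re.escape(term.lower()), text))
    return counts

# e.g. term_frequencies("uk_gov_corpus",
#                       ["media discontinuity", "channel switching"])
```

Even a crude count like this, run over texts from institutions comparable to the one that wrote the source, says far more about natural usage than raw Google hit numbers.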

Ah, but if I have just a monolingual corpus to understand proper use of terms in the target language, how will I know what the right translation is? That's a hard one. One that perhaps I can't answer. At least not without the use of a BAT* tool. But if that's too hard, I know just the place to go, where "certified pros" compete for points on KudoZ and offer mutual reinforcement and enlightenment to those who run rapidly on their wheels to keep pace with changing expectations of quality.

I could of course take one of the accepted "solutions" from Linguee and feed the same sort of self-recycling GIGO loop that Google now admits to be an ever-growing part of its searched wisdom.

So what is your solution to the Medienbruch? Or are you part of the problem?

Feb 15, 2014

Indestructible Italian quality!

Language service providers like yours truly spend a lot of time talking about the high cost of crap quality. But that applies to most things, really. An old friend of mine who struggled at the start of his professional life paying his way with handyman work while creating beautiful custom furniture knew that he could afford only the finest tools though not always meat to go with the bread on his table.

I drink a lot of coffee, and I enjoy it in a variety of ways. As those who have visited my kitchen can attest, it's almost like a fetish with the various pots, presses and filters, each teasing different qualities of taste from a well-roasted bean. I thought by now I would have learned most of what I needed to know to ensure good results in the "coffee kitchen".

Distraction has taught me - or reminded me - of a few things lately. Three times now I have become involved in work or gone off shopping with friends and forgotten a moka pot on the fire. The first time, I returned to a house filled with toxic smoke from the burned plastic handles and top knob, and I was grateful the house had not burned down and the dogs were still alive. More recently I left my favorite Bialetti Brikka pot on the flame for several hours. Twice.

If anyone wants to say nasty things about Italian engineering, I will have to plead for the defense. In contrast to the complete destruction of the cheap moka pot after an hour, the Bialetti pot just needed a bit of scrubbing. Even the gasket was OK, which amazed me. The durability of the pot that cost me about €30 is so far beyond that of a €7 pot that to compare them is almost a crude joke. Oh yes, and I don't know any other manufacturer who offers such an excellent pressure valve for crema at that price.

Image from Alexandre Enkerli, Creative Commons Attribution-Share Alike 2.0 Generic license
There are obvious analogies to our work as translators, of course. But I don't need to insult anyone's intelligence by explaining them. I'll just go enjoy another shot of coffee from my indestructible Italian-engineered pot.

Feb 5, 2014

Cambridge conference - Getting Language Right!

One of the phrases best known from Monty Python is "And now for something completely different!" This could be applied well to the conference at Cambridge (UK) next month (March 24th), which one leading conference planner and industry insider noted does not feature the "usual suspects" :-)

The sameness of conferences in the translation profession is wearing. Same topics, same speakers most of the time, perhaps a change of venue. What I find most distressing is that most (or all) of the conferences I can recall off-hand are insiders talking to insiders. Chris Durban and others wisely advise participation in professional meetings and conferences related to one's specialties, but there are few venues where businessmen and other professionals gather to deal with issues of language that are important to good working translators, and the best opportunities for networking are generally at events where language and translation are definitely not the focus. One walks the trade fair floor for a day, mingles at event happy hours and pans the silt of interactions for the few nuggets that might be found with care.

The conference Communicating in Business – Getting Language Right features speakers involved in domestic, EU and international business, law and diplomacy, as well as translators working at high levels of public and private service with direct clients. All of them share a deep understanding of what goes missing when we waste too much time fussing over technologies and forget that these are merely the form: irreplaceable human knowledge and skill in communication are most often the content that matters, especially when one deals with the most important services, products and legal, social and political issues.

It's nice to see the rigid, repetitive mold of yet another Are You Ready for the Future of Translation Technology? event replaced by one that offers substance, nourishment and useful interactions for those who do not aspire to the compromised communication practices of the anonymized bulk market but instead concern themselves with carrying messages most effectively to those who want and need them.

I am a strong believer in good technology, properly applied, but private reports tell of a recent major translation project where all the processes worked perfectly: translation memory tools, terminology review, planned workflows and the rest delivered an absolutely flawless format and perfect consistency, and all of it perfectly unusable, because the most important element was left out: the real communication specialists, the expert linguists who know the subject matter to be translated and how best to express it for another culture in the language needed. That's what the Stridonium conference is about - sharing best practice and working together as true partners in communication to stay on top of the game.

Riccardo Schiaffino's About Translation blog has details of the conference, which can also be found on the Stridonium events pages.

Feb 3, 2014

Colors in memoQ lookup results - which termbase?

A subject that comes up time and again with experienced colleagues is the desire to distinguish more easily where matches come from in memoQ. Of course, clicking on a match in the Translation Results pane of memoQ's working grid provides additional information for each type of resource (the example of the LiveDocs match has different information than one would expect to see from a TM hit, a termbase match, a non-translatable or other kind of entry). But many want more obvious information in the working display and bilingual RTF exports to indicate the source of matches.

In the graphic at the left, segment matches from two different translation memories and two different LiveDocs corpora are shown. There is no visual clue to indicate the differences between these corpora and TMs. One would have to click on a particular entry and look at the meta-information at the bottom right to see which data collection the hit came from, the name of the document translated, when the translation unit was created, who wrote it, etc.

With termbases, however, the situation is now different. For those with excellent eyesight (I don't really qualify), there are subtle gradations of color to reflect termbase priority, with higher-priority termbases showing darker colors for their hits. This is clever and useful, but unfortunately it cannot, as far as I can tell, be customized in a meaningful way at this time. I might like to set a special match color for a termbase I want to take particular note of but which has a lower (and hard to distinguish) priority. Can you tell how many termbases are showing hits in the screenshot here? Look carefully.

I find the color cues used for matches in the memoQ working window quite helpful in most cases. Although these can be customized under Tools > Options > Appearance > Lookup results, I refuse to do so, because I use these color cues to explain things to other users sometimes, and I cause enough chaos telling them to use my personal customized keyboard shortcuts based on an old version of Déjà Vu, having long forgotten what the default keyboard shortcuts are. I also don't see an easy way to reset the default colors if I mess things up.

I'm grateful for the little bit of help that color differentiation in termbase results provides in recent versions of memoQ, and I hope that Kilgray takes this concept further. Similar gradations of color for TMs and LiveDocs would be helpful, and it would be very nice if custom colors could be assigned temporarily to particular resources of any type where some collections of data require special consideration. And then we need a simple way to reset those temporary color assignments.

If you agree with this or have other ideas for improving the accessibility of match result information, please write to Kilgray and express your thoughts. Too often users remain passive with their frustrations and thoughts about changes or additional features needed. Kilgray tends to be a very responsive solution provider, but if the user community does not express its needs clearly and consistently, it's not reasonable to expect that what we need will happen, and it's even less reasonable to be annoyed when it doesn't. In the five years I have used memoQ, things have often taken time to implement, but in that time the developers and product designers have usually given careful thought to matters and mostly exceeded my expectations when they do provide the solution.

Feb 2, 2014

Online workshop plans: memoQ for legal & financial translators

As readers of this blog know, I've spent a good part of the last year or more investigating some instructional practices for translators' continuing education and forming my own opinions about what works, what does not, and what might be improved. I've looked at different approaches to blogging, developed e-mail based tutorials for one company, acquired some familiarity with remote coaching options via Skype and TeamViewer, begun the production of short training videos on a YouTube channel, learned to use Moodle and other online courseware platforms, released a PDF e-book with short tutorial modules, supported the localization of some of the preceding things into Portuguese and... probably a few other things I don't remember at the moment. Somewhere in all of that I moved countries and translated a bit to pay bills and buy dog food.

Now I am considering working with a specialist translator to plan a flexible course on the use of memoQ in an optimal way - together with other technologies as required - to achieve better results in processes involved with legal and financial translation. The intent is in no way to teach anything about legal/financial translation as a subject, but rather how to organize the software and work processes to overcome frequent problems, satisfy particular customer requirements or achieve specific improvements in quality management for the translation work.

I have specific topics in mind based on my own work and questions directed to me from specialist colleagues in these areas, but I would like to have suggestions from others, particularly those who might be interested in involvement in such a course in some way. These suggestions can take any form and can be as simple as an observation regarding a difficulty you find in this area which you suspect might have a solution involving technology or working methods.

The delivery media planned are a combination of e-mail and "live" sessions of about an hour each week for small groups or individuals using Skype or TeamViewer, with these recorded and made available for viewing and/or download in a private Moodle course forum on my server. As it makes sense to do so, supplemental material will be provided as web pages, video clips, practice files for testing, memoQ light resources (such as stopword lists or auto-translation rules), PDF "handouts" from my new book edition and other sources.

I've set up an (experimental) mailing list - "Translate Solutions" - to discuss this and other topics related to continuing education for translation technology and education/training resources. You are welcome to join the discussion there with a subscription request to translate_solutions-subscribe (at)

The scheduling and detailed subject matter of the course will be announced as specific content requests are received and assessed.

So let the fun begin.

Feb 1, 2014

The fix is in for PDF charts

Over four years ago, I reviewed Iceni Infix after I began working with it. I'm not as strong a fan as some, because I generally have little enthusiasm for direct editing of PDFs and dealing with frequent problems such as missing unusual fonts and having to play the guess-my-optimum-font-substitution game, but I do find it useful in many situations. I found another one of those today.

A new client of a friend works with a horrible German program to produce reports full of charts. The main body of the text is written in Microsoft Word and is available as a reasonable DOCX file, but the charts are a problem, as they are available only in the specific, oddball tool or PDF format. Nobody wants to deal with that software, really. It is supported by no translation tools vendor I am aware of, and like another example of incompatible German software, Across, it enjoys the obscurity it deserves.

After thinking about the approach needed in this case, I realized that if the graphics could be isolated conveniently on pages, the XML export from the PDF document would contain only information from the graphics. After translation, the format could be touched up with Infix before making bitmap screenshots at an enlargement which would yield decent resolution when sized in the layout. Of course, in projects involving multiple languages the XML files could be used with great convenience.

Selecting and deleting the text on the pages with Iceni Infix is really a no-brainer. The time charge for such work will be quite reasonable. And exporting the XML or marked-up text to translate is also quite straightforward:

The exports can be handled in nearly any CAT tool, so TMs and terminology resources can be put to full use. Or you can edit in a simple, free tool like Notepad++ or an XML-savvy editor.

The screenshot above shows the XML in memoQ. No customization of the default filter is required. Reports from other users who have worked in a similar way indicate that OmegaT and other environments generally have few, if any, problems. In one case there was trouble re-integrating the graphics in a project that also had 50 pages of text, but there may have been other issues I am not aware of in that case.

With the content in the TM, if the chart data are made available in another format, the translations can be transferred quickly to that for even better results. The same approach can be used for a very wide variety of other electronically generated graphic formats (except some of the really insane ones I've seen where the text is broken up; I don't know if Iceni sanitizes such messes or not).

I think this is an approach which can benefit many of us in a variety of projects. It is not really suited for cases of bitmap graphics, but I have other approaches there in which Iceni Infix may also play a useful role and allow CAT integration. Licenses for the tool are quite reasonably priced, and the trial version (in Pro mode) is entirely suited for testing and learning this process.