Feb 23, 2014

Cleaning up a crappy OCR job for translation

It's a sad fact in the professional work of translators that a lack of understanding of how to deal effectively with various PDF formats causes enormous loss of productivity and results which are not really fit for purpose. The aggressive insistence of many colleagues possessed of a dangerous Halbwissen on using half-baked methods and inappropriate tools contributes to the problem, but, bowing to the wisdom about arguing with fools, I now mostly sit back with a bemused and amused smile and watch the tribulations of those who believe in salvation by PDF import filters and cheap or free OCR. "TANSTAAFL" is as true as it ever was.

Just before the weekend I got an inquiry from an agency client I rather like. Nice people, good attitude, but struggling sometimes trying to find their way with technology despite some in-country "expert" training. This inquiry looked a bit like ripe fish at first glance. The smell got stronger after I was told that, because the corporate end client had converted the PDF for their annual report and begun to edit the mess (and comment it heavily too) in the OCR file, this would be all there was to work with. It was a thoroughly appetizing sight when imported into a translation environment:


There are so many issues in that tossed salad of translation terror that I don't even know where to start describing them.

The screenshot above was in memoQ. How does it look in SDL Trados Studio? Often just as messy. In this case, this was the result in an older version of Studio:

SDL Trados Studio choked and refused to import the file!

I do have the latest version of SDL Trados Studio 2014, but unfortunately it's on a system that does not yet have Microsoft Office, because I refuse to bow to Microsoft's insistence that I must buy a Portuguese version of that software. No MS Office, no file import in this case with SDL Trados Studio. memoQ fortunately has not needed MS Office to import its old file formats since the release of memoQ 6.0.

Ugly OCR trash like this file is all too common at this time of year, and as I am busy compiling the syllabus for the workshop I want to do on better living with well-used technology for legal and financial translators, I felt obliged to take this one on as a teaching example. It's actually not as bad as it looks. On the other hand, the best approach may not always be obvious, and the best solution for one document may not apply as well or at all to another.

My first approach was to use Dave Turner's CodeZapper macros. This isn't as straightforward as it used to be since I downgraded from Microsoft Office 2003 to later versions; for some reason the toolbar refuses to stay loaded between work sessions, and there's no way I can keep track of all the abbreviations for macros on it.


I can't deal with anything more complicated than clicking the "CZL" option for "Code Zapper lite", which did a rather decent job on the heavy mess above:


But all was not quite as well as it seemed:


Text in the header and footer remained trashed, and the heavy use of comments and tabbed lists meant that there were plenty of legitimate tags to deal with which were just too confusing with the DVX-like mess of memoQ's default import and display for an RTF file.

So I went for a kinder, gentler approach. I changed my import filter settings in memoQ:


There is actually seldom any good reason to import an RTF or DOC file into memoQ using the default filter settings. And marking those two little checkboxes at the bottom often accomplishes much of what CodeZapper does. Sometimes less. A bit more in this case.


The header and footer texts were absolutely clean. Don't let the extra tags in this sample fool you: overall, there were fewer than in the code-zapped file. Now there are still a number of issues to be seen in the screenshot above, including paragraph breaks in the middle of a sentence and awful manual hyphenation (many instances of that in the whole text) and joys like badly placed comments and links which mess up the text and prevent term identification by the software:



Source editing features of memoQ (F2) enable issues like the two above to be dealt with easily:



After a bit of repair like this in the memoQ environment (where it is really much, much easier to fix the problems of bad comment and link placement), I copied the entire source text to the target to enable me to export a cleaner source text file. I then opened this file in Microsoft Word and used various search and replace operations to fix the bad hyphenation and other problems like excess spaces. Replacing the hyphens had to be done occurrence-by-occurrence, because the style of writing in German meant that there were many legitimate instances of hyphens followed by spaces.
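For anyone who would rather script the hunt than click through Word, here is a minimal sketch in Python of the same idea. It is not what I actually used; the function names, the regular expression and the short list of German conjunctions are all my own assumptions for illustration. It collapses excess spaces automatically but only lists the suspect hyphens for review, because hyphen-plus-space is often legitimate in German.

```python
import re

# A word fragment, a hyphen, whitespace, then a lowercase continuation:
# the typical trace of manual line-break hyphenation ("Geschäfts- bericht").
# The pattern is an assumption for illustration, not a complete rule.
CANDIDATE = re.compile(r"(\w+)-\s+([a-zäöüß]\w*)")

# Words that legitimately follow a suspended hyphen ("Ein- und Ausgaben").
# Hypothetical starter list; extend it for the text at hand.
CONJUNCTIONS = {"und", "oder", "bzw", "sowie"}


def collapse_spaces(text):
    """Replace runs of two or more spaces with a single space."""
    return re.sub(r" {2,}", " ", text)


def hyphenation_candidates(text):
    """Return (offset, matched text, suggested join) for each suspect hyphen.

    Nothing is replaced here; each hit still needs a human decision.
    """
    results = []
    for match in CANDIDATE.finditer(text):
        if match.group(2).lower() in CONJUNCTIONS:
            continue  # suspended hyphen, almost certainly intended
        results.append(
            (match.start(), match.group(0), match.group(1) + match.group(2))
        )
    return results


sample = "Der Geschäfts- bericht  zeigt Ein- und Ausgaben."
cleaned = collapse_spaces(sample)
for offset, found, suggestion in hyphenation_candidates(cleaned):
    print(offset, repr(found), "->", repr(suggestion))
```

The suggested join is only a proposal: a hit such as "Jahres- abschluss" should be joined, while a compound that really is written with a hyphen must be left alone, which is exactly why the replacement stays manual.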

After all was done, the "before and after" looked like this:

BEFORE

AFTER

The remaining tags were all legitimate formatting tags for comments, hyperlinks, tabs after section numbering, etc. These do, of course, require attention and add complexity to the work still, so they must be included in the charges for the job. memoQ makes this calculation particularly simple by allowing weighting factors to be specified in the analysis. These are the settings I typically use for a German source text:


I find this usually represents a fair minimum for the additional effort in translation and quality assurance that tags require. In this case, of course, time charges for the cleanup apply, but as you can probably guess from comparing the two analysis tables above, the customer is actually saving a lot of money by paying me to clean up the mess, and the results will be a lot more usable. My cleaned-up version of the source text will also be returned in case the authors intend to make more revisions in the source - this will save more time and money by avoiding redundant cleanup in that case.
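The arithmetic behind such a weighting is simple enough to sketch. The function below is a hypothetical illustration in Python, not memoQ's own calculation, and the weight of 0.25 words per tag is an assumed example value, not the setting from my screenshot.

```python
def weighted_word_count(words, tags, words_per_tag=0.25):
    """Return a billable word equivalent in which each tag counts as a
    fraction of a word. The default weight is an assumed example value."""
    return words + tags * words_per_tag


# Made-up figures for illustration only: the same text before and after cleanup.
before = weighted_word_count(words=10000, tags=8000)
after = weighted_word_count(words=10000, tags=1200)
print(before, after, before - after)
```

With any weight above zero, every junk tag removed in cleanup comes straight off the weighted total, which is where the saving for the customer comes from.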


Feb 16, 2014

Medienbruch.

For years now, I've watched the German term Medienbruch and derivatives like medienbruchfrei seep their way into various, well, media. And the results are, by and large, linguistically discontinuous and/or broken :-)


The screenshot above shows a sampling from the "feast" of bilingual web alignment offered by Linguee, one which those who are familiar with the service realize must be enjoyed with some care to avoid the odd bits of broken glass and strychnine that may be found in its sometimes machine-translated, sometimes Bulgarian-inspired English. Some of the more thoughtful marketers in the translation profession may use the site as a source to research firms desperately in need of better translators rather than merely pan its buckets full of linguistic swamp sediment in desperation to find the few flakes of gold which might have settled there. Not to say that Linguee is not a valuable working tool, but its use requires a great deal of professional judgment in most cases and often further research if one is on unfamiliar ground. It is a tool, and a fool with a tool remains a fool. Like the often criticized Wikipedia, it is often a good starting point but seldom the journey's end.

Why not just Google "media discontinuity", for example, and see how often the particular phrase occurs? Uh huh. You got 8,100 results, did you? Sounds pretty authoritative, right? Look again at where the sources are. Ah, you say, Kevin ought to have used Google's advanced search options to narrow it down. Have a look at the garbage on the first page again, and tell me with a straight face that better sources will somehow rise to the top of that like tasty cream.

At this point, a colleague less overcaffeinated might point out that I'm going about this all wrong and that I might do better searching some of the monolingual corpora available free online, such as the British National Corpus (which is great for making Americans believe that Brits really do say "at the weekend" most of the time) or the BYU Corpus of Global Web-based English, which allows those who master the arcana of its search rules to compare Indian weekends with those in Canada or Kenya. But these text collections, for all their value for a general understanding of common language use, are seldom satisfying for specialized use and neologisms.

A cooler head might point out the terrible bias in my approaches so far and encourage me to consider the context in which this word has occurred while percolating into my awareness in recent years. Hm. Well, usually when some German public authority decides to bring its information management practices up to 20th century standards. Since they usually follow well behind practice in English-speaking countries for this sort of thing (I still see German government texts that put "E-Mail" in quotes as if it were something new and exotic), it's a good bet that a corpus of texts from similar institutions in the UK or US might reveal some interesting and useful and even natively acceptable possibilities. (That's how I learned years ago that a usual and proper term in English for Vieraugenprinzip is "dual review", not the "four-eyes principle" that some so happily click their heels to. In the elementary schools I attended, that was the principle based on which bullies punched my friends with glasses in the face.)

Juliette Scott, who writes a rather nice blog mostly concerned with legal translation issues, has done some nice (NIFTY) work as part of her doctoral thesis using monolingual corpora to study patterns of use in target languages and suggest useful strategies for building collections of specialized text. Good, perhaps even obvious stuff which applies to any domain I can think of, though the practice of building focused corpora from carefully selected sources is still far too rare among those of us who use the label "translator" or "language professional". I certainly need to do that more often.
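To make the idea of checking candidate terms against a focused corpus a bit more concrete, here is a minimal sketch in Python. It is my own illustration and not Juliette Scott's method; the function name and the two sample sentences are invented, and a real corpus would be a carefully selected set of documents from comparable institutions.

```python
import re
from collections import Counter


def phrase_counts(texts, phrases):
    """Count case-insensitive whole-phrase occurrences of each candidate
    term across a collection of texts."""
    counts = Counter()
    for text in texts:
        lowered = text.lower()
        for phrase in phrases:
            pattern = r"\b" + re.escape(phrase.lower()) + r"\b"
            counts[phrase] += len(re.findall(pattern, lowered))
    return counts


# Invented sample sentences standing in for a hand-picked corpus.
corpus = [
    "Each change is subject to dual review.",
    "Dual review is required; the four-eyes principle is not mentioned elsewhere.",
]
print(phrase_counts(corpus, ["dual review", "four-eyes principle"]))
```

Raw counts like these only mean something if the sources were chosen with care; run against the first page of a web search, the same code just counts garbage.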

Ah, but if I have just a monolingual corpus to understand proper use of terms in the target language, how will I know what the right translation is? That's a hard one. One that perhaps I can't answer. At least not without the use of a BAT* tool. But if that's too hard, I know just the place to go: proZ.com, where "certified pros" compete for points on KudoZ and offer mutual reinforcement and enlightenment to those who run rapidly on their wheels to keep pace with changing expectations of quality.

I could of course take one of the accepted "solutions" from Linguee and feed the same sort of self-recycling GIGO loop that Google now admits to be an ever-growing part of its searched wisdom.

So what is your solution to the Medienbruch? Or are you part of the problem?

Feb 15, 2014

Indestructible Italian quality!

Language service providers like yours truly spend a lot of time talking about the high cost of crap quality. But that applies to most things, really. An old friend of mine who struggled at the start of his professional life paying his way with handyman work while creating beautiful custom furniture knew that he could afford only the finest tools though not always meat to go with the bread on his table.

I drink a lot of coffee, and I enjoy it in a variety of ways. As those who have visited my kitchen can attest, it's almost like a fetish with the various pots, presses and filters, each teasing different qualities of taste from a well-roasted bean. I thought by now I would have learned most of what I needed to know to ensure good results in the "coffee kitchen".

Distraction has taught me - or reminded me - of a few things lately. Three times now I have become involved in work or gone off shopping with friends and forgotten a moka pot on the fire. The first time, I returned to a house filled with toxic smoke from the burned plastic handles and top knob, and I was grateful the house had not burned down and the dogs were still alive. More recently I left my favorite Bialetti Brikka pot on the flame for several hours. Twice.

If anyone wants to say nasty things about Italian engineering, I will have to plead for the defense. In contrast to the complete destruction of the cheap moka pot after an hour, the Bialetti pot just needed a bit of scrubbing. Even the gasket was OK, which amazed me. The durability of the pots that cost me about €30 is so much beyond that of a €7 pot that to compare them is almost a crude joke. Oh yes, and I don't know any other manufacturer who offers such an excellent pressure valve for crema at that price.

Image from Alexandre Enkerli, Creative Commons Attribution-Share Alike 2.0 Generic license

There are obvious analogies to our work as translators, of course. But I don't need to insult anyone's intelligence by explaining them. I'll just go enjoy another shot of coffee from my indestructible Italian-engineered pot.