Aug 29, 2011

Loriot's contributions to learning German as a second language

Gabriele Zöttl recently posted a farewell to the great German comedian Bernhard Victor Christoph Carl von Bülow, known as Loriot to most of the world. She rightly described him as an inexhaustible source of inspiration for writers. And for many others I would say. What no one has mentioned, surprisingly, is Loriot's valuable, original contributions to the learning of the German language by foreigners. Here is an example of his inspiring facilitation of international communication:





Aug 28, 2011

Using OCR to support translation processes

I recently noticed an element on the "wall" on my business Facebook page that I hadn't paid attention to before: questions. Since I've resumed research for a tutorial project I shelved several years ago when great uncertainty in the world of translation tools made my plans impractical, I thought I might try this feature and see what sort of feedback results. The question I posed was
What sorts of project challenges do you wish you could handle with your translation environment tools (Trados, DVX, memoQ, OmegaT, whatever) that you cannot today for technical reasons or for lack of adequate explanations and examples?
It's only been up for a short time and the number of responses so far is modest, but some are quite interesting and may be revisited as blog posts or in other ways.

Some of the points raised so far involve the eternal topic of OCR for translation business. When my client Sansalone Technische Übersetzungen in Cologne first introduced me to the effective use of OCR for translation purposes many years ago, it was relatively rare in the world of commercial translation and too often incompetently performed, but I had been using that technology in one way or another for a decade already. But in many cases, "standard" procedures for optical character recognition are simply not well suited for our purposes, so I had a lot to learn.

Now some 9 years later, many of the LSPs and colleagues I know use OCR in some way, but unfortunately few do so efficiently or even usefully. Though sometimes there are TIFFs or JPEGs to be converted, most often these days the documents to be converted are obtained as some form of PDF. And there is enormous confusion and misrepresentation among translators as to what the various PDF formats are and how to deal with them.

I distinguish between two types of use for optical character recognition in my work: (1) quotation or count verification and (2) translation preparation. The former allows for considerable sloppiness, the latter for very little.

One of my LSP clients whom I introduced to OCR technology years ago uses it for one thing only: estimating text counts for quotation. Depending on the source, this may be very accurate or only a rough count (if there are serious contrast problems that can't be compensated for, for example). I do this not only with PDFs and bitmap files such as JPEG or TIFF, but also with large, complex documents in other formats. For example, if someone sends me a 200-page document in an MS Word format and I need to prepare a cost estimate for services quickly, I will often print it to PDF, then run the PDF through an OCR engine and save the results as plain text for counting. One cannot always rely on the counts from Microsoft Word itself or various translation tools for text counting. Embedded objects, even editable ones, are generally not included in the counts. I have seen RTF documents where half the tables are ordinary RTF tables (which are counted) and others which look the same are spreadsheet objects (not counted unless you are a Star Transit user AFAIK). And PowerPoint files can have embedded Excel or Visio objects or other uncountable elements. Printing to PDF and doing an OCR subsequently avoids this problem and enables one to ensure that nothing big was missed when counting by other means. This procedure also serves as an "early warning" if there is text to translate that is not extracted by the translation tools. It really sucks to "finish" a long job only to discover that several thousand words in tables and diagrams remain untranslated.
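As a rough illustration of this cross-check, here is a minimal Python sketch that counts the words in the plain text saved from OCR and compares the result with the count reported by a translation tool. The function names, the 5% tolerance and the sample numbers in the usage note are assumptions made up for the example; they do not come from any particular tool.

```python
import re


def count_words(text: str) -> int:
    """Count whitespace-separated tokens that contain at least one letter or digit."""
    return sum(1 for token in text.split() if re.search(r"\w", token))


def missing_text_warning(ocr_count: int, tool_count: int, tolerance: float = 0.05) -> bool:
    """Return True if the tool count falls short of the OCR count by more than the tolerance.

    The tolerance (5% by default) is an arbitrary value chosen for this sketch.
    """
    if ocr_count == 0:
        return False
    shortfall = (ocr_count - tool_count) / ocr_count
    return shortfall > tolerance
```

If the OCR text of a job came to 10,000 words and the translation tool reported only 7,000, `missing_text_warning(10000, 7000)` would flag the gap, which is a hint to go looking for embedded objects before quoting.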

Using OCR to prepare translations is often straightforward, but there are a number of traps that people commonly fall into. Do not, under any circumstances, be seduced by the automatic conversion settings of any commercial OCR program nor by options to save with the "original formatting". This is nearly always a disaster when working with translation tools. Problems may include bizarre text changes, disappearing chunks of text due to text box sizing problems, a plague of tags and more. About six years ago I wrote some guidelines on the ways to save and work with OCR text; these are a bit dated, but generally valid. I would also suggest getting and learning to use Dave Turner's Code Zapper macros for MS Word; these can clean a lot of garbage not only out of troublesome OCR texts but also out of MS Word or RTF documents that suffer from the tag plague for other reasons.

It is usually best to go through documents to be converted and manually set OCR zones and their properties (such as text, picture, orientation, inversion, etc.). The extra time spent doing this will be saved later when translating or making final edits to your work. And even with long documents (or especially with long documents), it is better to save the converted text with no formatting and then reapply formatting using defined styles. This approach usually also produces text that can be transferred to DTP programs such as InDesign with less trouble. Ignoring this particular bit of advice can lead to a lot of wasted time and grief.

Before you begin translating, it is also important to scan through the converted text and look for superfluous line breaks, excessive spaces and other formatting problems and fix these so that your translation segments will be as clean as possible. Such preparation at this stage is a very good investment of time. It may also be useful to run a source language spell check to catch errors in the text conversion.
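For plain text saved without formatting, part of this cleanup can be scripted. The following Python sketch is only one possible approach under simple assumptions: paragraphs are separated by blank lines, and any single line break inside a paragraph is superfluous. The function name is hypothetical, and the result still needs a visual check, because headings and list items separated by single line breaks would be joined as well.

```python
import re


def clean_ocr_text(text: str) -> str:
    """Normalize spacing and line breaks in plain text saved from OCR."""
    # Use a single line-break convention
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Collapse runs of spaces and tabs into one space
    text = re.sub(r"[ \t]+", " ", text)
    # Trim spaces directly before or after a line break
    text = re.sub(r" ?\n ?", "\n", text)
    # Reduce several blank lines to one blank line between paragraphs
    text = re.sub(r"\n{2,}", "\n\n", text)
    # Join lines that were broken inside a paragraph
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    return text.strip()
```

A source language spell check afterwards is still worthwhile, since this kind of script does nothing about misrecognized characters.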

Another reason I like to invest a bit of extra effort in cleaning up an OCR source text is that it is often part of the delivery to a client who may have lost (or never had but may need) an original, editable document. This can be part of your presentation as a professional, a little service to set you apart from the competition. In some cases I have seen persons skilled in OCR offer their conversion services to busy translators or LSPs; I don't think the need for this type of service, if done right, has decreased in real terms, though I can't say what the actual demand is these days. I used to get a lot of requests for this service, though I discouraged inquiries for languages that I don't know and eventually steered most of those asking toward developing their own capabilities.

What are your experiences with OCR? Any tips for best practice with your favorite tools?

Aug 27, 2011

Data exchange issues for translators & TAUS

Jost Zetzsche of the International Writers' Group has written an interesting short report on data exchange standards for translation content, how they affect individual translators, why we should care and why freelancers should be involved in the development of these standards. The report can be downloaded from the link here or from the TAUS website where you'll be asked to give contact information that you may or may not want to provide for a mailing list.

Jost's paper is a good overview which explains its points clearly for nontechnical audiences as well. He has a gift for that, which is why his Toolkit newsletter and Toolbox primer for translators are so enormously popular. In some way I find it a shame that this latest information is distributed via TAUS, an organization regarding which I hold no little suspicion, but it is important that we keep an eye on those who pursue their agenda, and some limited collaboration can in fact be a good way to do this.

Why am I suspicious of TAUS? Two things.

First, I find the organization's case for sharing all TMs rather weak. In certain areas like standard error messages for software, I can see the point. Without even considering copyright and privacy issues, I simply find the idea foolish for much of the work I do. I don't care to see most of the trash TMs that LSPs accumulate, fail to maintain properly and try to "leverage" on the backs of their translating teams. Why would I be interested in greater quantities of linguistic sewage? Even my own TM content gets stale after a while and needs an overhaul as language and usage evolve. Those who plan to leverage their 100% matches for the next decade or two should hope that their customers and users like the taste of cardboard and old shoes or they may find themselves with communication and image problems at some point.

The second problem I have with TAUS is the organization's silly, shameless shilling for machine translation. Horsefeather-stuffed essays like "Want to ride the machine translation tidal wave?" or the intimidation set piece "What options do translators really have?" with its Darwinistic principles for how translating monkeys must evolve should give TAUS very little credibility with those in the industry who do not exhibit disturbing characteristics presumed to be part of a lemming's DNA.

This is not to say that I do not support TAUS. I think all my customers' competitors should hang on their every word and pursue the full program of data leveraging and process automation, eliminating as much of the human element from translation as possible. In Germany, I fully support initiatives by the Arbeitsamt, which previously retrained displaced coal miners as occupational therapists, to offer career path alternatives to call centers by certifying the long-term unemployed as MT post-editors (part of the "evolution" touted by TAUS), perhaps even providing their services to industry in the form of the beloved One Euro Jobs in a tradition of slave labor anchored in the middle of the last century.

I am perfectly content to be sidelined by history here, to stand lonely on a high hill, a translation creationist foolishly resistant to the industry's evolution, and miss the thrill of the MT tidal wave as it washes away common sense and quality and leaves consumers and business people picking through the strewn detritus of meaning.

Where will you be? The surf is up!

Aug 26, 2011

Berlin translators' meeting on September 1

CHANGE!!!

Dear all,

By popular request, the translators' meeting on Thursday, September 1, 2011, from 8:00 p.m. will not take place at Via He Hai, but once again at:

                Brauhaus Südstern
                Hasenheide 69
                10967 Berlin (Kreuzberg)
                U-Bahn: Südstern
                www.brauhaus-suedstern.de

If the weather is nice, we will of course be out back on the terrace again.

Does anyone else have an idea? I am happy to take restaurant suggestions for October (cozy and close to the U-Bahn or S-Bahn).

Have fun on Thursday!
Andreas Linke


Outlook:
The meeting after that will take place as usual on the first Thursday of the month, namely on October 6, 2011.

Understanding SDL Trados Studio text counts

In my recent essay on format surcharging, I proposed among other things that the added complexity of texts laden with tags might be compensated by a surcharge related to the number of tags. I intend to revisit this idea later with additional analysis and perhaps a spreadsheet tool to make things a little easier to calculate for people with a wide range of translation environment tools.

However, I faced a small problem in evaluating this approach: although I do own a license for SDL Trados Studio 2009, I use it mostly for testing and implementing cross-tool workflows, and I do very little actual translation work in that environment, because it doesn't cut it for me ergonomically. I find many of the developments at SDL in the past few years positive on the whole, driven as they are by competition such as Kilgray, with an innovative capacity to keep any company who wants to continue being taken seriously focused. But the fact is, I just can't justify the effort to do a lot with the SDL product. So there are some rather appalling gaps in my knowledge of SDL technology, which various friends discover time and again as they ask me to help them sort out the eternal refrain of Trouble with Trados. Fortunately, SDL has an excellent representative on its team, Mr. Paul Filkin, who I believe has single-handedly done more to salvage the company's reputation and keep users loyal than anyone else in the firm in the past decade. God help SDL and users if he isn't around at some point. Sometimes he even convinces me with quiet, clear examples that Trados might have socially redeeming value.

So of course I turned to Paul with my embarrassed question about how to interpret a latter day Trados text count from Studio 2009, and he kindly obliged with the following excellent example:

The total word count for this document is 28 words:
 
In Studio it looks like this; I left the active segment as #4 to make it clearer when we get to the analysis… blue underlines are recognised placeables for my language pair:
 
The analysis is as follows:
 
So working by segment:

  • #1: 5 words
  • #2: 7 words, with one placeable that is counted in the word count
  • #3: 7 words, with two tags. The tags are also placeables and are not included in the word count
  • #4: 9 words, with four placeables, each of which is counted as a single word
So the conclusion would be:
  • all placeables are counted as words apart from tags
  • tags themselves can be identified for any manual adjustment to the overall rate for tag handling
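
The segment figures above can be checked with a few lines of Python. This is just the arithmetic from Paul's example, not anything produced by Studio itself; the tag counts of zero for segments 1, 2 and 4 are an assumption, since the example mentions tags only for segment 3.

```python
# (segment number, words counted, tags in the segment)
# Tag counts for segments 1, 2 and 4 are assumed to be zero.
segments = [(1, 5, 0), (2, 7, 0), (3, 7, 2), (4, 9, 0)]

total_words = sum(words for _, words, _ in segments)
total_tags = sum(tags for _, _, tags in segments)

print(total_words)  # 28, matching the stated total for the document
print(total_tags)   # 2, reported separately from the word count
```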

A good explanation, I think, though now I realize that I forgot to ask him whether the character counts include spaces, so that the typical line calculations popular in German-speaking countries can be performed without trouble, or whether the acrobatics of adding the word count to the character count, as we had to in Trados Classic, are still called for here. RTFM time here, I believe. Or... Paul? Help!

Aug 25, 2011

Kicking the habit

My daughter turned 18 yesterday, and I decided to celebrate the occasion in a healthy way by ending another unhealthy relationship: my ProZ membership, which was due for renewal that day. Over the past few years, it has become clear that the relationship with the self-designated leading e-portal for translators was an abusive one.

When Mr. Qaddafi, a junior military officer, staged his coup in Libya some 40+ years ago, things were better at first. Roads were built. Schools. Hospitals. Then the Road to Nowhere with its master plan in the Green Book.

When a Japanese translator and MIT graduate founded ProZ and built it with the help of skilled, enthusiastic professionals and others pleased to have a new communication platform with a bit more to offer than creaky CompuServe, there was a honeymoon period with competent moderation of forums, helpful exchanges of ideas and terminology advice and more. And then, when the Ring of RuleZ was forged and darkness fell on the virtual land, nontranslating nazgul redefined right and wrong and wrote the platform's end as a unique and useful place for professionals.

ProZ has always been an easy "fix" for me. A lot of good colleagues complain that in years of quoting posted jobs there they had no success. I have other beefs with ProZ - mostly bad management. As far as the posted projects are concerned, I probably got 70% or better of those for which I chose to quote and did so at quite reasonable rates. It's all a matter of knowing when to quote and how to game the system. And having qualifications that are in demand. If your tail gets in the way of your typing and you munch peanuts and bananas over the keyboard, even Jeff Gitomer probably can't help you upsell your services.

But those little fixes of money and ego boosts of yet another easy score have stood in the way of more serious development, and like someone with Stockholm Syndrome in a violent relationship, I've found excuses for years now to stick around ProZ long after Henry & Co. applied duct tape to my mouth and required individual approval of any forum contributions I might care to make. I'm in good company there - many of the best moderators from the past have been similarly gagged for a long time, and some have been subject to summary virtual execution. I am talking about some of the most ethical, honest, competent colleagues I know.

So it's time for a bit of methadone (free account status) on the transition to a better life in healthier company. I've found it already, actually, in professional associations like the German BDÜ and private, non-indexed forums with carefully screened professionals like Stridonium. Why would I need the ProZ needle in my arm, to put up with an organization that shows open contempt for paying customers with an opinion as it debases its content for wannabes and Asian sausage shop LSPs? Why continue to host a domain there when the costs are far higher than elsewhere and the service truly sucks? Let it go. Time to kick the habit.

The quick and the dead

Steve Jobs was in the news again today. As many have expected for years, the Resurrection Man has come to the point where his failing strength and perhaps other priorities have led him to cede control at Apple. It's been a good ride. I made a lot of money over the years buying Apple stock when the doomsayers at the Wall Street Journal predicted the company's imminent demise and the world watched as it soared again.

Last December, a man I know was supposed to die. He was crazy as a loon, still is, a brilliant researcher who sometimes couldn't tell his daughters from his deceased mother or his ex-wife, but he has an unshakeable core of sanity that once led him to cheerfully announce that the doctors who told him and his wife that their newborn daughter would die shortly were daft. Twenty-seven years later the verdict of the medical experts has yet to be enforced. Last December, with Death's doors wide open, he began to plan a trip for the spring. I forgot to ask if he took it, but this week he is expected to die again, and his daughter tells me he's planning another trip. Crazy indeed.

We tie ourselves in knots with worry or plans, predictions of The Coming of TM, The End of the Economy and whatnot, and things usually turn out quite differently on the longer journey. The plan is the hand on the tiller, but to navigate the currents safely, the hand must move with subtlety and good timing. And still in depths sounded well there may be rocks close to the surface.

Mr. Jobs had a good run, because he had clear vision that saw risk as the fertile soil from which real bounty can grow. In my farming days I saw years where the water rose and ruined my neighbor's fields; once I was cut off for ten days at the end of a cul-de-sac when the road washed out. And yet Mr. Jobs continued to plant his seeds and my former neighbors and others around the world still do, and the rain or the drought comes as it may.

There are a few certainties. Children grow until they do not. Friends, family, pets live until they do not. Will the customer accept your well-planned offer at a fair price? The outcome is perhaps more certain one way or another than next week's weather, and if you do not ask for what you need, it is nearly certain that you will get what you have asked for: nothing. The people of my host country, Germany, have insurance for everything, but for little that really matters, as all the myriad payments have no dividend of courage.

Aug 7, 2011

Format surcharging in translation

One of the strategies I pursued early on in my career as a commercial translator was to equip myself with tools able to handle a great variety of source formats and then learn (mostly by nail-biting troubleshooting for my projects and those of others) to cope with the exceptions, typical problems for a given format and the insoluble and unexpected disasters of files with Hotel California workflows in various CAT tools. At a time when a majority of translators in my language combination were perceived as incorrigible technophobes and many agencies struggled to deal with the technical intricacies of IT and data exchange in translation, this was a path to appallingly rapid business growth.

How times have changed. Or haven't. Translation environment tools have evolved more in the past decade than some industry critics will admit, making more formats reliably accessible and data exchange between environments less likely to trigger calls to the local suicide hotline. Lots and lots of translators now "have" Trados or some other tool, or the tools have them. But it's a tenuous relationship in the majority of cases. More tenuous than many realize until suddenly the simply formatted Word file won't "save as target" from the TagEditor horror chamber, or they get the idea of actually using "integrated" terminology features in some tools and learn a new definition of despair.

My Luddite friends are right to speak of the complexities that can lurk in even the "simplest" translation environments, but I believe that dealing with these complexities in simple, rational ways and sharing the information will go farther toward simplifying our lives and enhancing our professional status than desperately clinging to outmoded ways that will increasingly restrict the flow of business. However, as we adopt new ways, we must think more about these complexities, the real effort involved and how to offset this effort in simple, economic terms.

Take file formats, for example. Over the years I have heard many suggestions from colleagues for how to charge for different formats. Many of these seem rather arbitrary and not necessarily sensible to me, such as all the myriad ways people charge for the translation of PowerPoint slides. Work with PowerPoint can be simple and straightforward or a hideous nightmare requiring complex, creative combined strategies of pre-translation repair, filtering, dissection and reassembly and much more. Microsoft Word is seen as a simple format, but add a hundred footnotes, cross-references, formulae, "Word Art", embedded Excel and Visio objects, a rainbow of colors for coding and some massive, uncompressed images for good measure and you often face quite a challenge.

How do you deal with such complexity, plan for it in your schedule and charge it in a manner which is fair to the persons performing the service and those paying for it? The answer is not easy, but the typical response to the question - ignore it and charge "usual" rates or hocus pocus some percentage mark-up - is not very satisfactory.

Discrete, pre- or post-translation tasks such as OCR, format repairs, extraction and re-embedding of translatable objects or the transfer of these to separate documents for "key pair" translation are all fairly easy to handle in an acceptable, transparent way with hourly fees for the effort. When I deal with such matters, I occasionally provide the client with detailed work instructions for how to go about performing these tasks cleanly to "save" money with the caveat that if it isn't right, the work will be re-done and charged.

I have yet to come up with a standard way of coping with files that are simply so big that they choke the tools I use or tie up my resources for an hour while exporting a translated file. Here, technological aikido is usually the most effective strategy: at various stages in the past decade, for example, I have converted graphics-laden RTF or DOC files to HTML, TTX and now DOCX to minimize troubles and speed up processing. Once I have worked out a way of avoiding those big resource tie-ups (often at the cost of hours or days of thought and experimentation), I feel I don't have to consider the charge issue (but of course I'm really wrong). However, the risks of format failure are so great in my experience that "round trip" tests must be performed to ensure that once a translation has taken place the results can be transferred to their deliverable form without much ado. If I forget to do this under pressure, I very often regret it. Think of round trip workflow testing for the files you translate as a possibly life-saving pre-flight safety check. You might not die in a crash, but business relationships will.

One issue I have meditated on for a very long time and mentioned at intervals in translators' forums without finding a reasonable answer is that of markup tags. Many tools didn't even count them in the past; at one point Déjà Vu was the only one I was aware of that did. The best answers that colleagues seemed to offer for tag-laden documents, which inevitably require more work and frequently lead to stability problems, were to "charge more" or "run like Hell". Both good answers, really, but lacking in the quantitative rigor my background in science leads me to prefer.

The solution arrived somewhat unexpectedly with the beta version of memoQ 5. At first I thought that SDL Trados Studio 2009 offered no solution here, but with that tool the context in which you view the statistics is important. Look at the analysis under "Reports", not "Files". Any counting tool that reports tag frequency can be used to calculate this solution, if need be with a spreadsheet if the factors cannot be added to the tool's internal statistics for words or characters.



The solution is obvious and really wasn't far from the discussion which has taken place over the years: simple word or character weighting for the tags. However, it was not until I saw the fields in the new memoQ count statistics window that I really began to think about what those factors should be.

In SDL Studio 2009 the tag statistics are found in the analysis under "Reports" as mentioned and look like this (thank you to Paul Filkin for the technical update and the graphic):



I thought about my own experience with tags in files over the years and the actual extra effort of inserting them for formatting at the beginning and end of segments or somewhere inline. For reasons I won't try to explain in an over-long post, I figure that a single tag costs me the effort of about half a word, or given the average word length in my source language, about 3.5 characters. So I put in "0.5" words and "3.5" characters as the weight factors in memoQ, and my count statistics are increased to compensate for the additional effort involved.
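
For anyone who wants to do the same calculation outside the tool, for example with counts copied from an analysis report, the weighting is simple arithmetic. The Python sketch below is my own illustration and does not reproduce memoQ's internal calculation; the function name and the sample counts (1,000 words, 7,000 characters, 120 tags) are invented for the example, while the weights of 0.5 words and 3.5 characters per tag are the ones given above.

```python
def weighted_count(base_count: float, tag_count: int, weight_per_tag: float) -> float:
    """Add a per-tag weight to a word or character count."""
    return base_count + tag_count * weight_per_tag


# Invented sample figures: 1,000 words, 7,000 characters, 120 tags
print(weighted_count(1000, 120, 0.5))  # 1060.0 weighted words
print(weighted_count(7000, 120, 3.5))  # 7420.0 weighted characters
```

The same formula fits in a single spreadsheet cell, and the weight can be changed to whatever factor is agreed with the client.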

Now you may disagree on the appropriate factor, saying it should be more or perhaps less. That's OK. I consider this a matter open to negotiation with clients. The important thing for me is that we have a quantitative basis for discussion and negotiation which anyone may check. It's important that this and other issues relevant to project planning and compensation be brought out of the closet and discussed rationally. Not just to get "fair" compensation, but to educate those involved in the processes about the effort involved and to set more realistic project goals as well.

For some of the OCR trash that clients produce ineptly and try to foist off on translators as source files, this "tag penalty" may encourage better practice or at least offset the effort of using Dave Turner's CodeZapper and other methods to clean up the mess. (However, basic structural problems caused by automated settings in OCR tools will never be overcome this way.)

In any case, this is a technique which I hope will inspire discussion and study to find its best application in various environments. And I do hope to see widespread adoption of such options in modern translation environment tools to further offset the grief occasionally encountered in modern translation.




Aug 6, 2011

"The Future is here" and the end is near

On September 9th, the Dutch Association of Translation Agencies (ATA the Lesser) will be hosting a conference on the bubbly, bright future of MT post-editors and why all good translators should be eager to hop on that gravy train and ride it to the greatest challenge of their professional careers: turning pig shit into gold.

The conference keynote speech by industry prophet Renato Beninatto will undoubtedly be full of entertaining claims and predictions. Most of the presentations will be in English, hopefully machine translated to convey the real quality of that present future. Presentations will include a sales workshop by Renato, and Atril, SDL and Plunet will present their products, presumably with some MT-related spin. The rest of the workshop titles are clearly focused on MT editing processes.

Information on the conference program is available in Dutch and English. Those who read both pages will note the date discrepancy on early bird rates. I presume that the Dutch information citing August 10th is correct and someone simply botched the translation and editing of the English page.

Those who need to collect PE points to maintain their Dutch certifications will receive 5 points for attendance.

See you in the future!

August translators' meeting in Hohen Neuendorf


Dear colleagues,

The invitation to the translators' meeting comes a little earlier this time, because I will not be there at all. The meeting is on:

                Thursday, August 18, 2011, from 7:00 p.m.

The destination is:

                Ristorante Castagno
                Käthe-Kollwitz-Straße 58
                16540 Hohen Neuendorf
                S-Bahn S1: Hohen Neuendorf

From the S-Bahn exit on Schönfließer Straße, go left over the bridge as far as Puschkinallee. Turn into Puschkinallee, and shortly afterwards you reach the destination at the junction with Käthe-Kollwitz-Straße.


Regards
Andreas Linke

Preview:
The translators' meeting after that will take place as usual on the third Thursday of the month, on September 15, 2011.

Aug 2, 2011

Translators' meeting in Berlin-Kreuzberg


Dear all,

Here is the invitation to the next translators' meeting on:

                Thursday, August 4, 2011, from 8:00 p.m.

We are going once again to:

                Brauhaus Südstern
                Hasenheide 69
                10967 Berlin (Kreuzberg)
                U-Bahn: Südstern
                www.brauhaus-suedstern.de

With luck the weather will be nice, and we will be out back on the terrace enjoying the house-brewed beer.


See you Thursday!
Andreas Linke

Outlook:
The meeting after that will take place as usual on the first Thursday of the month, namely on September 1, 2011.