Aug 7, 2011

Format surcharging in translation

One of the strategies I pursued early on in my career as a commercial translator was to equip myself with tools able to handle a great variety of source formats and then learn (mostly by nail-biting troubleshooting for my projects and those of others) to cope with the exceptions, typical problems for a given format and the insoluble and unexpected disasters of files with Hotel California workflows in various CAT tools. At a time when a majority of translators in my language combination were perceived as incorrigible technophobes and many agencies struggled to deal with the technical intricacies of IT and data exchange in translation, this was a path to appallingly rapid business growth.

How times have changed. Or haven't. Translation environment tools have evolved more in the past decade than some industry critics will admit, making more formats reliably accessible and data exchange between environments less likely to trigger calls to the local suicide hotline. Lots and lots of translators now "have" Trados or some other tool, or the tools have them. But it's a tenuous relationship in the majority of cases. More tenuous than many realize until suddenly the simply formatted Word file won't "save as target" from the TagEditor horror chamber, or they get the idea of actually using "integrated" terminology features in some tools and learn a new definition of despair.

My Luddite friends are right to speak of the complexities that can lurk in even the "simplest" translation environments, but I believe that dealing with these complexities in simple, rational ways and sharing the information will go farther toward simplifying our lives and enhancing our professional status than desperately clinging to outmoded ways that will increasingly restrict the flow of business. However, as we adopt new ways, we must think more about these complexities, the real effort involved and how to offset this effort in simple, economic terms.

Take file formats, for example. Over the years I have heard many suggestions from colleagues for how to charge for different formats. Many of these seem rather arbitrary and not necessarily sensible to me, such as all the myriad ways people charge for the translation of PowerPoint slides. Work with PowerPoint can be simple and straightforward or a hideous nightmare requiring complex, creative combined strategies of pre-translation repair, filtering, dissection and reassembly and much more. Microsoft Word is seen as a simple format, but add a hundred footnotes, cross-references, formulae, "Word Art", embedded Excel and Visio objects, a rainbow of colors for coding and some massive, uncompressed images for good measure and you often face quite a challenge.

How do you deal with such complexity, plan for it in your schedule and charge for it in a manner that is fair to the persons performing the service and those paying for it? The answer is not easy, but the typical response to the question - ignore it and charge "usual" rates or hocus-pocus some percentage mark-up - is not very satisfactory.

Discrete, pre- or post-translation tasks such as OCR, format repairs, extraction and re-embedding of translatable objects or the transfer of these to separate documents for "key pair" translation are all fairly easy to handle in an acceptable, transparent way with hourly fees for the effort. When I deal with such matters, I occasionally provide the client with detailed work instructions for how to go about performing these tasks cleanly to "save" money with the caveat that if it isn't right, the work will be re-done and charged.

I have yet to come up with a standard way of coping with files that are simply so big that they choke the tools I use or tie up my resources for an hour while exporting a translated file. Here, technological aikido is usually the most effective strategy: at various stages in the past decade, for example, I have converted graphics-laden RTF or DOC files to HTML, TTX and now DOCX to minimize troubles and speed up processing. Once I have worked out a way of avoiding those big resource tie-ups (often at the cost of hours or days of thought and experimentation), I feel I don't have to consider the charge issue (but of course I'm really wrong). However, the risks of format failure are so great in my experience that "round trip" tests must be performed to ensure that once a translation has taken place the results can be transferred to their deliverable form without much ado. If I forget to do this under pressure, I very often regret it. Think of round trip workflow testing for the files you translate as a possibly life-saving pre-flight safety check. You might not die in a crash, but business relationships will.

One issue I have meditated on for a very long time and mentioned at intervals in translators' forums without finding a reasonable answer is that of markup tags. For a long time many tools didn't even count them; at one point Déjà Vu was the only one I was aware of that did. The best answers that colleagues seemed to offer for tag-laden documents, which inevitably require more work and frequently lead to stability problems, were to "charge more" or "run like Hell". Both good answers, really, but lacking in the quantitative rigor my background in science leads me to prefer.

The solution arrived somewhat unexpectedly with the beta version of memoQ 5. At first I thought that SDL Trados Studio 2009 offered no solution here, but with that tool the context in which you view the statistics is important. Look at the analysis under "Reports", not "Files". Any counting tool that reports tag frequency can be used to calculate this solution, with a spreadsheet if need be, should the factors not be addable to the tool's internal statistics for words or characters.

The solution is obvious and really wasn't far from the discussion which has taken place over the years: simple word or character weighting for the tags. However, it was not until I saw the fields in the new memoQ count statistics window that I really began to think about what those factors should be.

In SDL Studio 2009 the tag statistics are found in the analysis under "Reports" as mentioned and look like this (thank you to Paul Filkin for the technical update and the graphic):

I thought about my own experience with tags in files over the years and the actual extra effort of inserting them for formatting at the beginning and end of segments or somewhere inline. For reasons I won't try to explain in an over-long post, I figure that a single tag costs me the effort of about half a word, or given the average word length in my source language, about 3.5 characters. So I put in "0.5" words and "3.5" characters as the weight factors in memoQ, and my count statistics are increased to compensate for the additional effort involved.
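The weighting described above is simple enough to check by hand or in a spreadsheet. As a minimal sketch (the file counts below are invented for illustration; only the 0.5-word and 3.5-character factors come from the post), the adjustment looks like this:

```python
# Weighted count adjustment for markup tags: each inline tag is treated
# as a fraction of a word (or a few characters) of extra effort and
# added to the raw source counts before invoicing.

def weighted_counts(words, characters, tags,
                    tag_word_weight=0.5, tag_char_weight=3.5):
    """Return (adjusted word count, adjusted character count)."""
    return (words + tags * tag_word_weight,
            characters + tags * tag_char_weight)

# Hypothetical file: 1000 source words, 7000 characters, 120 inline tags.
adj_words, adj_chars = weighted_counts(1000, 7000, 120)
print(adj_words)  # 1060.0 billable "words"
print(adj_chars)  # 7420.0 billable characters
```

The weight factors are, of course, the negotiable part; the arithmetic itself is the same whether the tool applies it internally (as memoQ 5 does) or you do it afterward in a spreadsheet.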

Now you may disagree on the appropriate factor, saying it should be more or perhaps less. That's OK. I consider this a matter open to negotiation with clients. The important thing for me is that we have a quantitative basis for discussion and negotiation which anyone may check. It's important that this and other issues relevant to project planning and compensation be brought out of the closet and discussed rationally. Not just to get "fair" compensation, but to educate those involved in the processes about the effort involved and to set more realistic project goals as well.

For some of the OCR trash that clients produce ineptly and try to foist off on translators as source files, this "tag penalty" may encourage better practice or at least offset the effort of using Dave Turner's CodeZapper and other methods to clean up the mess. (However, basic structural problems caused by automated settings in OCR tools will never be overcome this way.)

In any case, this is a technique which I hope will inspire discussion and study to find its best application in various environments. And I do hope to see widespread adoption of such options in modern translation environment tools to further offset the grief occasionally encountered in modern translation.


  1. Hi Kevin, Studio 2009 (and 2011) counts tags... and placeables... by default. But we don't do the separate source and target counts.

  2. Paul, I thought 2011 might have something like that on the agenda, but when I look at the analysis windows in 2009 I see no indication of those statistics. Where do I look for them?

  3. @Paul: It's also interesting that you mention the separate source and target counts for these elements. I think that information can be useful, though its application will vary. Sometimes it's an indication of trouble (omitted tags that will be caught by a tag verification or other QA step in most good tools these days); more often in my texts it indicates a simplification of the formatting or the omission of garbage like tags for optional hyphens (which ought to be cleaned up before translation but too seldom are).

  4. Hi Kevin, I dropped you an email so I could include a screenshot... but I think I just realised where you are looking. There is an Analysis window at the bottom of the Projects screen, and this is just a quick overview. But if you select the Reports view from the left-hand menu you will see the full analysis reports. More information is available here.

  5. It is an interesting concept to report on both source and target. I guess you could analyse the finished file if you wanted this, but maybe interesting to have this as a further check as part of a report once you are finished. I don't see people asking for this though so perhaps the QA checks are enough?

  6. Paul, there's one simple reason to report target statistics in general in a counting tool, and that is the common practice in some countries or situations of charging jobs according to target text counts. This is not my preferred way to work, because I believe in fixed-price quotation wherever reasonable to allow the customer to plan budgets better, but in the case of work for German courts and certain public agencies target counting is mandated under the JVEG law. A lot of the German users I know actually charge for their work according to target line counts. So if they were to start charging for tags, maybe they would want to do so based on the target count, though I have personal doubts about this being a good approach.

    I'm also a little careful about what customers "ask for" or seem to be asking for. Sometimes they simply can't envision the usefulness or desirability of an innovation until it is available, and sometimes they ask indirectly, with different words, trying to solve a problem for which something might be the non-obvious solution. I saw something like this at the memoQfest a few years ago when the statement was made about a feature that "no one" was asking for it, and in the space of five minutes it became apparent that many were, but we all used different words for problems that people had difficulty describing clearly. When that feature was implemented a year later it was a great hit. You've probably seen cases like that many times over the years.


Notice to spammers: your locations are being traced and fed to the recreational target list for my new line of chemical weapon drones :-)