Jan 2, 2011

Dining on Tag Salad

Translation environment tools often make our lives easier, but there are cases where this is clearly not true. Files and formats with a lot of markup can pose particular problems. The screen shot above is just one (fairly mild) example of what one can encounter in a Microsoft Word file when the author has a psychedelic obsession with changing colors and fonts within a sentence (and, unfortunately, throughout a long document); more extreme examples can often be found with layout formats such as InDesign. (I have opted for an abbreviated tag view here; the full descriptive tags for the formatting would fill a page.) I once had a file in which I counted about 100 tags embedded in a single word. Three hundred words of easy text took a full day to translate. Files with a high density of tags are also highly prone to spacing errors and other problems, and it may be extremely difficult to identify these in anything but the final format.

The Devil is in the details. A typical CAT analysis of such a file, with the same content stored untagged in the TM, will show a high fuzzy match, and under the all-too-popular discount schemes for matches the translator will be paid only minimally for it. The reality, however, is that arranging these formatting tags can cost more time than the actual translation.

The text analysis of Déjà Vu X from Atril includes a count of tags, so in cases like this one could use that information to charge for the tags in some quantified way. However, most other tools do not offer such a capability, leaving us to consider the best approach to negotiating fair compensation for such a mess. Hourly charges come to mind in this particular case, though I seldom favor them for translation work.
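
For tools that do not report a tag count, charging for tags in some quantified way could be roughed out with a few lines of script. The sketch below is only an illustration, not how Déjà Vu X or any other tool performs its analysis. It assumes that the tags show up in exported text as numbered placeholders such as {1} or {27}, and the names `count_words_and_tags`, `billable_words` and the `tag_weight` of a quarter word per tag are all made up for the example.

```python
import re

# Assumption: tags appear as numbered placeholders such as {1} or {27}.
# Other tools use other notations, so this pattern would need adapting.
TAG_PATTERN = re.compile(r"\{\d+\}")


def count_words_and_tags(segment):
    """Return (word_count, tag_count) for one segment of text."""
    tag_count = len(TAG_PATTERN.findall(segment))
    # Replace each tag with a space so that a tag inside a word
    # does not glue two words together in the count.
    text_only = TAG_PATTERN.sub(" ", segment)
    return len(text_only.split()), tag_count


def billable_words(segments, tag_weight=0.25):
    """Add a fraction of a word (tag_weight, hypothetical) for every tag."""
    words = 0
    tags = 0
    for segment in segments:
        w, t = count_words_and_tags(segment)
        words += w
        tags += t
    return words + tag_weight * tags


if __name__ == "__main__":
    sample = ["{1}Dining{2} on {3}tag{4} salad"]
    print(count_words_and_tags(sample[0]))
    print(billable_words(sample))
```

What weight to give a tag is a matter for negotiation with the client; the point is only that the tag burden can be counted and put on the table.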

It is a great convenience for end clients to work in complex, native or tagged formats, but it is important to recognize the extra effort this may involve, discuss this with the client and make appropriate arrangements. What approach have you taken to this problem in the past?


  1. I have not tried it on many projects, but CodeZapper is quite OK: http://tech.groups.yahoo.com/group/dejavu-l/message/99694

  2. Hi Kevin,
    Thank you for pointing out this common problem. Recently, I received a ttx file to translate in which there were one or more tags before each word. I noticed, however, that the ttx had been created from a Word file, so I asked the PM to send me the Word file instead and used Workbench to translate it. So that one was easy to overcome, but most of the time I have found no alternative but to deal with the hundreds of tags and lose much-needed time (during translation, and then while proofreading the text in its final format to catch any spacing or formatting problems, as you pointed out).
    Have a wonderful Year!

  3. This is so true. I have often been surprised when a software project or heavily tagged text takes much longer than anticipated based on the word count. Tags are the culprits. We should all study the analysis carefully when the texts are heavily tagged and, ideally, charge more for this. The concordance search is also harder when lots of tags are involved.

  4. A distinction needs to be drawn between tags associated with deliberate formatting, and "rogue" tags.

    My own experience is that major problems are almost always caused by "rogue" tags. Authors with "a psychedelic obsession with changing colors and fonts within a sentence" aren't the norm.

    You might find www.omegat.org/en/howtos/docx.html interesting, especially "Points to note when using the .docx, .xlsx and .pptx formats in OmegaT" and the subsequent sections. This is an OmegaT (and MS) perspective on the issue but I imagine most CAT tools face similar issues.

    The idea of quantifying the tag burden and adjusting pricing accordingly, as Tess Whitty suggests, is interesting. Essentially, this is simply taking certain CAT tools' fancy quantification algorithms to their logical conclusion: if they remove, for example, numbers and proper nouns from the count with the argument that they don't have to be "translated", the tag burden could equally be added to the count with the argument that it's extra work for the translator.

    Personally though I refuse to go down this road, because if I'm going to apply differentiated pricing (rather than taking a swings-and-roundabouts approach), I want to decide for myself what elements mean more or less work for me and how much, rather than leave that decision to a software engineer.

  5. Hi Kevin, maybe talking with the client is an option. I frequently find that that solves it real quick-like :). However, if that is not an option, simply ignoring the formatting often goes a long way for me: for example, when clients send a PDF file and I send it through OCR. This will typically create tons of tags at sub-segment level. But they are mostly meaningless, for example in case of discrete hyphens, kerning, interline leads, etc. So in those cases, I simply highlight the entire paragraph or page or even text and reset all character-level formats. Or zap everything, as suggested above. The important thing to keep in mind, in my experience, is that there's no uniform rule, though.

    In the case of your example, I'm wondering what all those font changes are for, really...

  6. Catherine, the WB strategy is a good one and usually works; in fact the PM had suggested it in this case. However, due to factors that nobody understands clearly, the Workbench macros will not segment this content correctly. If there are three color changes in a sentence, each provokes a segment break, and "expanding" the segments didn't always work. Most other tools I worked with did not segment very well either. TagEditor did the best job, but it often failed to segment after periods, even if there were no tags nearby. These are the sort of files that make one want to call in an air strike.

    All the tags in these files are legitimate format tags. In other words, my fave solutions of using CodeZapper (for which there is a new version BTW, received last night from Dave Turner) or the memoQ DOCX filtering option are not relevant here.

    @ctaube: What are all those font changes for? Only the author's whiskey bottle knows for sure. They are quite deliberate, however. Some people think that each element to be emphasized on the page requires some unique format: a different font, a different color, a different size, bold, italic, underline, strikethrough, etc., etc. I saw a lot of that crap in the early days of the Mac when people without design training went bonkers with the formatting possibilities. Maybe this author has been in a coma since 1985 and was recently revived and sent straight to work without an explanation of how grownups use formatting these days.

  7. When the job is full of bold, italics, coloured fonts etc. (as in your example) you obviously can't just remove these tags, and so I charge an extra 30% for the job if the client insists on a CAT tool being used.
    It takes me at least 30% longer to do, so I think this is a fair solution.
    This is particularly an issue with ppt presentations when every other word is in red or simply huge. Different word order (adjectives that go in front of / after the noun) in the target language makes this extra hard.

  8. In a case like that, in Deja Vu at least, it might be worth first doing the translation on a file with all font coding removed (select the whole file and press control + space bar).
    Then, having finished the translation, import the original fully-formatted file and pretranslate.
    If you've added the key terms to your termbase as you go along, you'll often find that the formatting codes fall more or less into the right place automatically.

