
Apr 26, 2012

Twitterview: SDL Trados Studio, memoQ, DVX2 and PDF extraction

When I began using Twitter somewhat hesitantly three years ago, I never expected that it would eventually prove to be one of the most useful social media tools for gathering information of professional value. Much of this is serendipitous; I really never know what will come floating down the twitstream or where some of the conversations in it will go. Like the direct chat I had with a colleague in New Zealand about the features she likes best in the two main CAT tools she uses, SDL Trados Studio and memoQ.

We both really appreciate the TM-driven segmentation in memoQ and the superior leverage this offers. But to my surprise, she expressed a preference for SDL Trados Studio, particularly for the quality of its PDF text extractions from electronically generated files. This is not a feature I make heavy use of in either tool, though I have used it more often lately in memoQ for alignments in the LiveDocs module and found it generally satisfactory. Most of my work involving PDF files is with scanned documents - there one has no choice but to use a good OCR tool like OmniPage or ABBYY FineReader.

So I was quite intrigued by the claim that the quality of PDF text extraction in a CAT tool was "better" than that of standalone tools, especially because my experience is quite different. Further discussion (not shown in the graphic) revealed that what she actually meant was that the quality of the text extraction in the CAT tool usually beat the quality of text received from translation agencies that performed conversions. That is easy to explain, really. In my experience, most agencies are clueless about how to use conversion tools and too often rely on automated settings and save the results "with layout". Such output is very often utterly unsuited for work with translation environment tools, or it requires a lot of cleanup and code zapping.

For years I have recommended to agencies and colleagues that they spare themselves a lot of headaches by saving PDF conversions as plain text and adding any desired formatting later. Most people ignore that advice and suffer accordingly. So in a way, a CAT tool that extracts plain text encourages "best practice" for PDF translation, at least for those files it is actually able to handle.
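For anyone who wants to script this, the recommendation amounts to very little code. Here is a minimal sketch in Python using the pdfminer.six library; the file names are placeholders of my own, and this applies only to electronically generated PDFs, never to scans:

    # Extract plain text from an electronically generated PDF and save it
    # without any layout, so formatting can be added back later by hand.
    # Requires: pip install pdfminer.six
    from pdfminer.high_level import extract_text

    text = extract_text("source_document.pdf")  # placeholder file name

    with open("source_document.txt", "w", encoding="utf-8") as f:
        f.write(text)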

Encouraged by the Twitter exchange, I decided to do a few tests with files from recent projects. I took a PDF I had with various IFRS-related texts from EU publications. It appeared to extract quickly and cleanly in memoQ, giving me a translation grid full of nicely segmented text. SDL Trados Studio 2009 choked badly on it and extracted nothing. I was told that my colleague's extraction in SDL Trados Studio 2011 caused a project timeout, but the text itself was completely extracted and converted to DOCX format. This is useful, because unlike memoQ's extraction to plain text, it offers the possibility of adding or changing text formatting in the translation grid. Other extraction examples from SDL Trados Studio 2011 showed that text formatting was preserved.

A closer examination of the extracted texts revealed problems with both the memoQ and the Trados Studio extractions. The memoQ 5 PDF text extraction engine proved incapable of handling text in multiple columns properly: the paragraph order was all fouled up. The extraction with SDL Trados Studio contained a great number of superfluous spaces; whether this can be optimized somehow in the settings I do not know. The results of all the extraction tests are downloadable here in a 6 MB ZIP file. I've included the SDL Trados Studio extraction saved to plain text as well, for a better comparison of the text order and surplus space problems.
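The surplus spaces are at least the easiest of these defects to repair with a few regular expressions. This little Python sketch shows the sort of cleanup I mean; the patterns are my own illustration, not anything built into either tool:

    import re

    def clean_extracted_text(text: str) -> str:
        """Collapse the surplus whitespace typical of layout-preserving PDF extraction."""
        text = re.sub(r"[ \t]{2,}", " ", text)   # runs of spaces or tabs -> one space
        text = re.sub(r" +\n", "\n", text)       # spaces left dangling before line breaks
        text = re.sub(r"\n{3,}", "\n\n", text)   # piles of empty lines -> one blank line
        return text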

Overall, I am personally not very pleased with the results of the text extractions from PDF in either tool. The results from SDL Trados Studio are clearly better, and the other examples shared with me made it clear that this tool does a better job than many an untrained project manager armed with superior PDF conversion software. It is certainly much better than the solutions I see many translators using. But really, nothing beats good OCR software, an understanding of how to use it well and a proper workflow for producing a good TM and a target file fit for most purposes.

*****

Update 2012-05-22: I met colleague Victor Dewsbery at a recent gathering in Berlin, and he told me about his tests with the recently introduced PDF import feature of Atril's Déjà Vu X2 translation environment. He kindly offered to share his results (available for download here) and wrote:

Here is the result of the PDF>DVX2>RTF>ZIP process for your monster EU PDF file. Comments on the process and the result:
  • The steps involved were: 1. import the file into DVX2 as a PDF file; 2. mark all segments and copy source to target; 3. export the file as if it were a translated file (it comes out as an RTF file). The RTF file is 20 MB in size and zips to 3 MB.
  • Steps 1 and 3 took a long time, and DVX2 claimed to be not responding. For step 1 I just left it and it eventually came up with the goods. Step 3 exported the RTF file perfectly, even though DVX2 claimed that the export had not finished. I was able to open the RTF file (it was locked, but I simply renamed it), and this is the version which I enclose. Half an hour later DVX2 had still not ended the export process (and had to be closed via the Task Manager), although the exported file was in fact perfectly OK. The procedure worked more smoothly with a couple of smaller PDF files. Atril is working on streamlining the process and ironing out the glitches, especially the “not responding” messages.
  • The result actually looks very good to me. There are hardly any codes in the DVX2 project file (the import routine also integrates CodeZapper). I didn’t spot any mistakes in the sequence of the text. Indented sections with numbering seem to be formatted properly - i.e. with tabs and without any multiple spaces.
  • The top and bottom page boundaries in the exported file are too wide, so most pages run over and the document has over 900 pages instead of just under 500. Marking the whole document and dragging the header/footer spaces in Word seems to fix this fairly quickly.
  • I note that some headlines are made up of individual letters with spaces between them. This may be related to the German habit of using letter spacing (“Sperrschrift”) for emphasis as an alternative to bold type.
  • I found one instance where text was chopped up into a table on page 857 of the file.
  • There are occasional arbitrary jumps in type size and right/left page boundaries between sections.
On the strength of this sample, it would usually be OK to simply import the PDF file into DVX2, translate in the normal way, and then fix any formatting problems in the exported file.
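The letter-spaced headlines Victor mentions can usually be rejoined with a simple search and replace. Here is a rough Python heuristic for illustration only; it is crude and will also collapse any legitimate run of single letters, so the results need checking:

    import re

    # Three or more single letters separated by single spaces, e.g. "B e r i c h t"
    LETTERSPACED = re.compile(r"\b(?:[A-Za-zÄÖÜäöüß] ){3,}[A-Za-zÄÖÜäöüß]\b")

    def join_letterspaced(text: str) -> str:
        """Rejoin headlines set in Sperrschrift: 'B e r i c h t' -> 'Bericht'."""
        return LETTERSPACED.sub(lambda m: m.group(0).replace(" ", ""), text)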

Apr 21, 2012

TM Follies


A recent comment by Iwan Davies on Twitter, revealing a reviewer's rather odd notions of the requirements that the use of a translation memory tool imposes on translation, led me to reflect with a friend on some of the very strange and wrong ideas about such technology that persist in some minds. In that twitstream discussion, the reviewer's stupid notion that each segment in a translation must stand on its own without context provoked an interesting flurry of responses, ranging from the astute observation from @PaulAppleyard that "if you wanted to translate segments as 'standalone', then you would work in a random segment file, not a text that flows..." to some rather disturbing remarks from a few to the effect of "this is why I don't like to use such tools". Various people pointed out that modern translation environment tools such as SDL Trados Studio, OmegaT and memoQ make use of context in their translation memories to avoid the problems of more primitive systems, which, in the hands of translating monkeys, too often resulted in matches being used in very inappropriate ways.
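For anyone wondering what such "context" means in practice, here is a much-simplified Python sketch of the idea. It is my own illustration of the principle, not the actual data model of any of these tools: each TM entry remembers its neighboring source segments, and an exact match found in the same context outranks one found in a different context.

    from dataclasses import dataclass

    @dataclass
    class TMEntry:
        source: str     # source segment text
        target: str     # stored translation
        prev_src: str   # segment that preceded it when it was translated
        next_src: str   # segment that followed it

    def lookup(tm: list[TMEntry], source: str, prev_src: str, next_src: str):
        """Return (score, target) for the best hit: 101 for a context match,
        100 for an exact match found in some other context, None otherwise.
        A real tool would also score fuzzy matches below 100."""
        best = None
        for entry in tm:
            if entry.source != source:
                continue
            in_context = (entry.prev_src, entry.next_src) == (prev_src, next_src)
            score = 101 if in_context else 100
            if best is None or score > best[0]:
                best = (score, entry.target)
        return best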

The list I could compile of wrong-headed ideas about TMs is a long one, and I would probably only capture ten percent of the foolishness on a lucky day. A few highlights in my memory include:
  • A statement by an otherwise respected colleague some years ago that translators must not sacrifice potential "leverage" by combining segments to make sense in the translation. This included cases where someone inserts
    carriage
    returns and line
    breaks into the sentence to
    make it fit in some odd space. In a source language like German, where word order is often very different from that of a good English translation, this can quickly pollute a TM to the point of being worse than worthless. This in fact describes the real state of many "promiscuous" agency TMs that I have seen over the years. Fortunately, advanced features in modern translation environments, like memoQ's "TM-driven segmentation", encourage much better practice among smart service providers today.
  • The widespread notion that translation memory systems are only useful if one works on repetitive texts. I've got news for you: much of the repetitive stuff was outsourced to King Louie & Co. years ago. And yet I still find great value in working with good TMs. Why? A friend of mine summarized it nicely the other day when she talked about how she had spent two hours researching a very obscure term for roadworks equipment in a minor European language: "The next time this comes up, I can find it right away and see the context." Indeed. I am amazed sometimes at the obscure technical terminology that comes out of my personal TM with its 12-year record of my work. Sometimes that amazement is even positive. An hour invested in researching a term and saving it in a TM (or much better: a proper termbase with metadata including domain, source and examples of use) is probably several more hours saved over the next few years. At least.
  • The idea that a translation memory is a reliable source of terminology and obviates the need to create and maintain termbases or proper glossaries. Wrong, wrong, wrong. Particular offenders in this regard are agencies with their brothel-like practices of letting any number of translators screw the end customers' texts. Do a concordance search to find the right term in one of those TMs? Riiiiiiiiiight. Even agencies I've worked with for years who have made a real effort to keep TMs clean can't keep the terms in them on the straight and narrow. And using TMs to replace a real termbase, even a limited one, sacrifices the enormous potential benefits of automated terminology QA procedures offered by some modern translation environments.
  • King Louie & Co. as well as many other agencies in the race to the bottom of the quality barrel truly believe that once a good TM has been established by top translators, the second- or third-tier team can take over at lower cost and keep the customer happy. Well... at the moment, the lock on my Volvo's rear hatch is broken. I could get it fixed by a mechanic on Monday, or I could follow my neighbor's suggestion and just hold it shut with a bungee cord. And the next time a tail light cover gets broken, I could just tape some red or yellow plastic film over it. Replace the hubcap that flew off when I hit that pothole? Naw. But sooner or later, people will notice the difference and draw their own conclusions. Will those be good for business? Can a jobbing student equipped with a good TM really produce the quality of legal translation you can rely on before the court? Trägt er auch 'nen gold'nen Ring, der Affe bleibt.... (Though he may wear a golden ring, a monkey he remains....)
I have noted over the years that most of the best clients never ask about translation memories or the tools related to them, even though a good number of them are aware of the technology and many of them use it. But these are the clients who understand that monkeys who rely slavishly on CAT tools without the use of BAT* too often produce stale, stilted text unsuited for its communication purpose. At best. And all the king's machine translation engines won't change that.

Nonetheless, I believe there is great value for nearly all translators, even "creatives", in using advanced translation environment tools. But that value will not necessarily lie in the same methods or the same features. Calls to "throw away your TMs" with the introduction of advanced alignment technologies like Kilgray's LiveDocs in memoQ, which allows final edited versions of past documents to be incorporated quickly when new versions are to be translated, may be a bit premature, but they are often appropriate in my recent experience. And combinations of that with voice recognition technologies, term QA tools and other features offer a wealth of creative possibilities for taking the best and leaving the rest in our quest for better results and working conditions.


* brain-assisted translation

Apr 2, 2012

TM-driven segmentation in memoQ

One old feature of memoQ which continues to put cash in my pocket and make my work go faster is TM-driven segmentation. It is a pretranslation option. In theory, it combines and splits segments to improve matches from the TM; in reality it is biased toward combination, which is a good thing, as it emphasizes coherent text chunks.
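To make the principle concrete, here is a greatly simplified Python sketch of what such a pretranslation pass might do. This is my own illustration, not memoQ's actual algorithm, which also splits segments and scores fuzzy matches; the sketch only joins fragments greedily whenever the combined text matches a TM source exactly:

    def tm_driven_join(segments: list[str], tm_sources: set[str],
                       max_join: int = 6) -> list[str]:
        """Join up to max_join consecutive fragments whenever the combined
        text has an exact TM match; otherwise keep the fragment as it is."""
        result, i = [], 0
        while i < len(segments):
            for n in range(max_join, 1, -1):  # prefer the longest possible join
                if n <= len(segments) - i:
                    candidate = " ".join(segments[i:i + n])
                    if candidate in tm_sources:
                        result.append(candidate)
                        i += n
                        break
            else:  # no join matched: keep the single fragment
                result.append(segments[i])
                i += 1
        return result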

I recently completed a translation for my least favorite end client of an agency partner I rather like. I suppose the folks at this end client company are nice enough; most probably do not beat their dogs or their children. But the texts they send for translation are abusive in the extreme: Microsoft Word files generated by some sort of program on a host system, with a bizarre mix of colors and font changes (both type and size), as well as lots of superfluous line breaks and carriage returns. I presume the thought for the latter is to avoid overlapping graphics, but since text wrap is turned on for the graphics anyway, I don't see the point. What I do see is horrible German sentences horribly mutilated into as many as five or six chunks, but at least two or three most of the time. A real crime.

And did I mention that segments break at the color and font changes even for sentences which appear intact? No CAT guru has ever been able to figure that one out.

One such horror revisited me last week, and I put it off as long as I could. Finally I got to work, at a point where the deadline was very much in doubt, and as an afterthought I did something I usually forget about: I pretranslated the file, applying the TM-driven segmentation option, which is not taken into account in the file analysis. To my amazement, most of the file pretranslated with matches over 95%. When the remaining empty segments were examined and four or more parts joined to make a sentence, most were 99% matches. I had completely forgotten that I had translated this material a year ago. The agency was unaware of that as well, because they rely on traditional Trados methods for file analysis and processing. What I thought was going to be a very hard slog through about 500 horrible segments turned out to be a bit of tag tweaking and a few sentences of updating.

This is part of what my agency friends who have gone over to memoQ mean when they talk about improved leverage over time from legacy resources.

To demonstrate how this works, I took a bit of text on "technical terminology" from Wikipedia and prepared it as a text with coherent sentences and also as a text with lots of superfluous carriage returns like one might find with text copied from a PDF file, for example:

I translated the file with intact sentences in memoQ, then ran an analysis using the Operations > Statistics... function:

The file's segments looked like this in memoQ:

Then the file was pretranslated using the TM-driven segmentation option:


This was the result:

The exclamation marks indicate missing tags, which may cause problems. In cases like this I usually insert them at the end and clean up the spacing in the output target file. And if I send a TMX to someone I clean the crap tags out of it with a search and replace operation in a text editor.
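That search and replace amounts to something like the following Python sketch, which strips out the inline tag elements defined in the TMX specification. It is a crude approach, and an XML-aware tool would be safer, but it matches what I actually do in a text editor:

    import re

    # Inline tag pairs defined by TMX: <bpt>, <ept>, <ph>, <it> and <ut> wrap
    # only native formatting codes, so the pairs can go wholesale. <hi> is
    # left alone because it wraps real text.
    TAG_PAIR = re.compile(r"<(bpt|ept|ph|it|ut)\b[^>]*>.*?</\1>", re.DOTALL)

    def strip_inline_tags(tmx_text: str) -> str:
        """Remove inline formatting tags and their native-code content from a TMX file."""
        return TAG_PAIR.sub("", tmx_text)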

To satisfy my curiosity I then deleted the contents of the TM and made a new source file with a couple of broken segments:

Note the lesser quality of what will be going to the TM. This is the diet Trados users have enjoyed for a long time or, for that matter, what anyone who uses a CAT tool without the ability or knowledge to join segments may swallow routinely. After that was sent to the TM, I re-translated the file with intact sentences:

In Segment 1, a split was made, but the fragment was not pretranslated (even though it was in the TM as a "101%" match). In Segment 4, the sentence was not split but instead taken as a fuzzy match. The information pane at the right of the translation window shows the differences from the TM entry:

I am not disturbed by the more restrained matching when splits are involved. I consider it a good thing: a feature which encourages users to wean themselves off the bad practice of "translating" text that has been impossibly chopped up. Smart translators make frequent use of the segment joining and splitting functions in a good CAT tool, and memoQ rewards this habit particularly well.