Apr 2, 2012

TM-driven segmentation in memoQ

One old feature of memoQ which continues to put cash in my pocket and make my work go faster is TM-driven segmentation. It is a pretranslation option. In theory, it combines and splits segments to improve matches from the TM; in reality it is biased toward combination, which is a good thing, as it emphasizes coherent text chunks.
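For readers who think in code, the joining half of this idea can be sketched roughly as follows. This is not memoQ's actual algorithm, which is not documented here. It is a minimal illustration that assumes the TM is a plain dictionary of source and target sentences and uses Python's `difflib` as a stand-in for a real fuzzy-match score. The names `best_tm_match` and `join_by_tm`, the 95% threshold and the limit of six joined fragments are all hypothetical.

```python
from difflib import SequenceMatcher


def best_tm_match(text, tm):
    """Return (score, source, target) for the closest TM entry; score is 0..1."""
    best = (0.0, None, None)
    for source, target in tm.items():
        score = SequenceMatcher(None, text, source).ratio()
        if score > best[0]:
            best = (score, source, target)
    return best


def join_by_tm(fragments, tm, threshold=0.95, max_join=6):
    """Greedily join adjacent fragments when the joined text matches the TM.

    Returns a list of (segment_text, target_or_None, score) tuples.
    Assumption: the longest run of fragments that reaches the threshold wins.
    """
    result = []
    i = 0
    while i < len(fragments):
        # Default: keep the fragment alone and leave it untranslated.
        chosen = (fragments[i], None, 0.0, 1)
        # Try the longest join first, then shorter ones, down to one fragment.
        for n in range(min(max_join, len(fragments) - i), 0, -1):
            candidate = " ".join(fragments[i:i + n])
            score, _, target = best_tm_match(candidate, tm)
            if score >= threshold:
                chosen = (candidate, target, score, n)
                break
        result.append(chosen[:3])
        i += chosen[3]  # skip past the fragments that were consumed
    return result
```

Fed a sentence chopped into three pieces by stray line breaks, a routine like this would reassemble it and pick up the stored translation, while a fragment with no counterpart in the TM stays empty for the translator to handle.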

I recently completed a translation for my least favorite end client of an agency partner I rather like. I suppose the folks at this end client company are nice enough; most probably do not beat their dogs or their children. But the texts they send for translation are abusive in the extreme: Microsoft Word files generated by some sort of program on a host system, with a bizarre mix of colors and font changes (both type and size), as well as lots of superfluous line breaks and carriage returns. I presume the thought for the latter is to avoid overlapping graphics, but since text wrap is turned on for the graphics anyway, I don't see the point. What I do see is horrible German sentences horribly mutilated into as many as five or six chunks, but at least two or three most of the time. A real crime.

And did I mention that segments break at the color and font changes even for sentences which appear intact? No CAT guru has ever been able to figure that one out.

One such horror revisited me last week, and I put it off as long as I could. I finally got to work at a point where the deadline was very much in doubt, and as an afterthought I did something I usually forget about: I pretranslated the file. I applied the "TM-driven segmentation" feature, which is not considered in the file analysis. To my amazement, most of the file pretranslated with matches over 95%. When I examined the remaining empty segments and joined four or more parts to make a sentence, most were 99% matches. I had completely forgotten that I had translated this material a year ago. And the agency was unaware of that as well, because they rely on traditional Trados methods for file analysis and processing. What I thought was going to be a very hard slog through about 500 horrible segments turned out to be a bit of tag tweaking and a few sentences of updating the text.

This is part of what my agency friends who have gone over to memoQ mean when they talk about improved leverage over time from legacy resources.

To demonstrate how this works, I took a bit of text on "technical terminology" from Wikipedia and prepared it as a text with coherent sentences and also as a text with lots of superfluous carriage returns like one might find with text copied from a PDF file, for example:

I translated the file with intact sentences in memoQ, then ran an analysis using the Operations > Statistics... function:

The file's segments looked like this in memoQ:

Then the file was pretranslated using the TM-driven segmentation option:

This was the result:

The exclamation marks indicate missing tags, which may cause problems. In cases like this I usually insert them at the end and clean up the spacing in the output target file. And if I send a TMX to someone I clean the crap tags out of it with a search and replace operation in a text editor.
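The same kind of cleanup can be scripted instead of done by hand in a text editor. The following is only a sketch under stated assumptions: it treats the TMX inline elements `bpt`, `ept`, `ph`, `it` and `ut` as formatting codes to be dropped, keeps the text wrapped by `hi`, and works on the content of a single `<seg>` element. The function name `strip_inline_tags` is hypothetical, and which tags actually appear depends on the tool that exported the TMX, so test on a copy first.

```python
import re

# TMX inline elements that hold formatting codes rather than translatable text.
CODE_NAMES = r"(bpt|ept|ph|it|ut)"
EMPTY_CODE = re.compile(r"<" + CODE_NAMES + r"\b[^>]*/>")
PAIRED_CODE = re.compile(r"<" + CODE_NAMES + r"\b[^>]*>.*?</\1>", re.DOTALL)
# <hi> wraps real text, so only the wrapper is removed, not its content.
HI_WRAPPER = re.compile(r"</?hi\b[^>]*>")


def strip_inline_tags(seg_content):
    """Remove inline formatting codes from the content of one TMX <seg>."""
    # Self-closing codes go first so they are not mistaken for opening tags.
    text = EMPTY_CODE.sub("", seg_content)
    text = PAIRED_CODE.sub("", text)
    text = HI_WRAPPER.sub("", text)
    # Tidy up doubled spaces left behind where a code was removed.
    return re.sub(r"[ \t]{2,}", " ", text).strip()
```

A regular expression is a blunt instrument for XML, and a proper parser would be safer for large or unusual files, but for a quick cleanup before sending a TMX to someone this mirrors the search-and-replace approach.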

To satisfy my curiosity I then deleted the contents of the TM and made a new source file with a couple of broken segments:

Note the lesser quality of what will be going to the TM. This is the diet Trados users have enjoyed for a long time or, for that matter, what anyone who uses a CAT tool without the ability or knowledge to join segments may swallow routinely. After that was sent to the TM, I re-translated the file with intact sentences:

In Segment 1, a split was made, but no pretranslation was done of the fragment (even though it was in the TM as "101%"). In Segment 4 the sentence was not split but instead taken as a fuzzy match. The information pane at the right of the translation window shows the differences with the TM information:

I am not disturbed by the more restrained matching when splits are involved. I consider it a good thing, a feature which encourages users to wean themselves off the bad practice of "translating" text which has been impossibly chopped up. Smart translators use the functions for segment joining and splitting frequently in a good CAT tool, and memoQ rewards this habit particularly well.


  1. Once again you succeed in demonstrating a function I wasn't sure about in a clear and understandable way - very helpful and I'm sure it will prove very useful soon!

  2. Hello! I am a translator with over 12 years of experience. However I am a total beginner when it comes to CAT tools. I have used them because of customer demands, both MemoQ and Trados. I felt much more comfortable with MemoQ, but to be honest still find it all hard to understand.

    You must realize I am really a complete beginner, to the point where I still don't understand why CAT tools are useful or a good idea.

    So my question is: What should I read? Where should I go to learn more about the benefits of CAT tools, and even more important: how do you train someone like me, for whom words like Tags, matches, splits, TM, pretranslation, segmentation, etc... are still evoking very vague concepts...

    I am, however, very interested... so any advice will be considered useful and elicit eternal gratitude :)

  3. Hard question to answer, Mariana, without an overview of your work patterns. I have a very different view of the benefits of these tools than the common wisdom, but because I adapt tools such as these to individual needs rather than take a top-down view of purpose and function, I'll probably give a different answer to anyone who asks me.

    What to read? Read this blog, read my book, read whatever is relevant to the types of projects you do. Look for technology mentors in your local area or online who haven't got their heads up their own backsides with rigid engineering assumptions that have little to do with real people and their psychologies. Ignore most of the common wisdom and what experts tell you, and you'll probably be fine in the end.

