Oct 5, 2023

What's wrong with my segmentation (in translation)?

The fifth open office hours session for the self-guided online course "memoQuickies Resource Camp" discussed segmentation problems in documents imported into translation environments such as memoQ, Trados Studio, Phrase, CafeTran Espresso, etc., along with various ways to identify these problems so that they can be corrected.

Segmentation problems waste enormous amounts of time, and bad segmentation rules are a plague on the translation and localization service community. Unfortunately, nearly all the rules I have seen, for all working environments, simply suck sewage. memoQ's rules usually suck less, but still....

This week's talk presented, among other things, some methods for identifying segmentation trouble spots quickly and easily using special regular expressions that describe patterns commonly found in badly segmented text. A Regex Assistant library has also been provided (and will be updated during the course period) to help with all of this.
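To give a rough idea of what such trouble-spot patterns can look like, here is a minimal Python sketch. The three patterns, their names and the `flag_segments` helper are my own illustrations for this post, not the contents of the Regex Assistant library, and the abbreviation list is an arbitrary English sample that would need adapting for other languages.

```python
import re

# Illustrative patterns only (assumptions, not the Regex Assistant library).
SUSPECT_PATTERNS = {
    # A segment that begins with a lowercase letter often continues a
    # sentence that was split too early.
    "starts_lowercase": re.compile(r"^[a-zß-öø-ÿ]"),
    # A segment that ends in a known abbreviation was probably cut off there.
    "ends_with_abbreviation": re.compile(r"\b(?:e\.g|i\.e|No|Art|Dr|Mr|Mrs|vs|cf)\.$"),
    # Sentence-final punctuation followed by a capital letter inside a
    # segment hints at a break that was missed.
    "missed_break": re.compile(r"[.!?]\s+[A-ZÀ-ÖØ-Þ]"),
}


def flag_segments(segments):
    """Return (index, text, labels) for each segment matching a suspect pattern."""
    flagged = []
    for index, segment in enumerate(segments):
        text = segment.strip()
        labels = [name for name, pattern in SUSPECT_PATTERNS.items()
                  if pattern.search(text)]
        if labels:
            flagged.append((index, text, labels))
    return flagged


if __name__ == "__main__":
    sample = [
        "The goods listed in Art.",
        "5 of the Regulation are exempt.",
        "See the annex, e.g.",
        "the table on page 3.",
        "This segment is fine.",
        "First sentence. Second sentence.",
    ]
    for index, text, labels in flag_segments(sample):
        print(index, labels, text)
```

Patterns like these only flag candidates for review. A name such as "Dr. Smith" in the middle of a perfectly good segment will trip the last one, so expect some false positives.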

The video and related course pages will remain completely open to the public, with downloads available, at least through the end of 2023. After that the pages and resources may be taken down for updates and reorganization in other courses.

The video recording of the lecture "What's wrong with my segmentation?" can be accessed on YouTube (embedded below), or course participants can download it from the page that opens when they click the "segmentation rules" icon at the top of this article.


An important part of checking the performance of your segmentation rules and possibly improving them is to have a good sampling of test data. One of my favorite sources for this is the European Community archives at the DGT, where EU legislation and other important information is available in a parallel corpus of all the official languages of the Community.

I have downloaded part of the 2022 DGT distribution and prepared a number of monolingual and bilingual corpora (about 2.6 million words, approximately 150,000 TUs) in EU languages and translation pairs. Moreover, information on my method has been published so that others can reproduce it for the languages that interest them.
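For anyone who wants to experiment before looking at my published method, here is a minimal sketch of one way to pull a language pair out of a TMX file with Python's standard library. It is not the method I used. It assumes the data is available as TMX, and the tiny inline sample, the `extract_pairs` name and the language codes are all made up for illustration.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Made-up sample data in TMX layout; real files are far larger.
SAMPLE_TMX = """<tmx version="1.4">
  <header creationtool="example" creationtoolversion="1" segtype="sentence"
          o-tmf="none" adminlang="en" srclang="en" datatype="plaintext"/>
  <body>
    <tu>
      <tuv xml:lang="EN-GB"><seg>This Regulation shall enter into force.</seg></tuv>
      <tuv xml:lang="DE-DE"><seg>Diese Verordnung tritt in Kraft.</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="EN-GB"><seg>Done at Brussels.</seg></tuv>
      <tuv xml:lang="FR-FR"><seg>Fait à Bruxelles.</seg></tuv>
    </tu>
  </body>
</tmx>"""


def extract_pairs(tmx_text, src_lang, tgt_lang):
    """Return (source, target) tuples for TUs that contain both languages."""
    src_key, tgt_key = src_lang.lower(), tgt_lang.lower()
    pairs = []
    for tu in ET.fromstring(tmx_text).iter("tu"):
        segments = {}
        for tuv in tu.findall("tuv"):
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() also collects text nested inside inline tags.
                lang = (tuv.get(XML_LANG) or "").lower()
                segments[lang] = "".join(seg.itertext()).strip()
        if src_key in segments and tgt_key in segments:
            pairs.append((segments[src_key], segments[tgt_key]))
    return pairs


if __name__ == "__main__":
    for source, target in extract_pairs(SAMPLE_TMX, "EN-GB", "DE-DE"):
        print(source, "|", target)
```

Taking only the first element of each tuple gives a monolingual sample, which is usually all you need for testing the segmentation rules of one source language.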

