Translation Tribulations: segmentation


Oct 5, 2023

What's wrong with my segmentation (in translation)?

The fifth open office hours session for the self-guided online course "memoQuickies Resource Camp" discussed segmentation problems with documents imported to translation environments such as memoQ, Trados Studio, Phrase, CafeTran Espresso, etc. and various ways that these issues might be identified so that they can be corrected.

Segmentation problems waste enormous amounts of time, and bad segmentation rules are a plague on the translation and localization service community. Unfortunately, nearly all the rules I have seen, for all working environments, simply suck sewage. memoQ's rules usually suck less, but still....

This week's talk presented, among other things, some methods for identifying segmentation trouble spots quickly and easily with the use of special regular expressions describing common patterns followed by texts with troubled segmentation. And a Regex Assistant library has been provided (and will be updated during the course period) to help with all of this.
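To give a rough idea of what such pattern checks can look like, here is a minimal Python sketch. It is not taken from the Regex Assistant library; the two patterns and the function name `flag_segments` are my own assumptions, chosen to catch two common symptoms: a segment that starts with a lowercase letter, and a segment that starts with a digit right after a short token ending in a period (such as "para." or "no.").

```python
import re

# Hypothetical patterns for illustration only; they are not from any memoQ resource.
LOWERCASE_START = re.compile(r"^[a-zà-öø-ÿ]")                # segment begins in lowercase
DIGIT_START = re.compile(r"^\d")                             # segment begins with a digit
ABBREV_END = re.compile(r"\b[A-Za-zÀ-ÖØ-öø-ÿ]{1,4}\.$")      # short token + period at the end


def flag_segments(segments):
    """Return (index, reason) pairs for segments that look like bad splits."""
    flagged = []
    for i, seg in enumerate(segments):
        text = seg.strip()
        if LOWERCASE_START.match(text):
            flagged.append((i, "starts with a lowercase letter"))
        elif i > 0 and DIGIT_START.match(text) and ABBREV_END.search(segments[i - 1].strip()):
            flagged.append((i, "digit after a short token ending in a period"))
    return flagged


if __name__ == "__main__":
    sample = [
        "Payment is due under para.",
        "5 of the contract, i.e.",
        "within 30 days.",
        "This sentence is fine.",
    ]
    for index, reason in flag_segments(sample):
        print(index, reason)
```

Checks like these will produce false positives (list items often start in lowercase, for example), so they are best treated as a way to find places worth a look, not as proof of an error.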

The video and related course pages will remain completely open to the public, with downloads available, at least through the end of 2023. After that the pages and resources may be taken down for updates and reorganization in other courses.

The video recording of the lecture "What's wrong with my segmentation?" can be accessed on YouTube (embedded below) or course participants can access the page to download it by clicking the "segmentation rules" icon at the top of this article.

An important part of checking the performance of your segmentation rules and possibly improving them is to have a good sampling of test data. One of my favorite sources for this is the European Community archives at the DGT, where EU legislation and other important information is available in a parallel corpus of all the official languages of the Community.

I have downloaded part of the 2022 DGT distribution and prepared a number of monolingual and bilingual corpora (about 2.6 million words, approximately 150,000 TUs) in EU languages and translation pairs. Moreover, information on my method has been published so that others can reproduce it for the languages that interest them.

Sep 4, 2023

New online course: "memoQuickies Resource Camp"

Summer is almost over, but technically, "camping season" will continue in memoQ World until November 30th. Or maybe January 31st, depending on how you count.

Today, a three-month journey of exploration begins, covering six important kinds of resources to make work with the memoQ translation desktop and server environments more pleasant and efficient... and profitable. This self-guided online course will give participants full access to my 14 years of cumulative experience as a memoQ user translating, managing projects and developing hundreds of solutions with this world-leading productivity tool.

Click here or on the icon bar above to have a look at the course description and to see (and maybe download) some of the publicly available information and resources for better work in many language pairs.

The emphasis of teaching will shift to a new resource every two weeks (with auto-translation rules as the main topic for the first two weeks), but throughout the course, information will be added continuously to all topic sections as I trawl through, sort, upgrade and publish the best or most interesting stuff from my archives. And course participants have access to open virtual office hours each week on Thursdays and some other occasions, where any questions can be asked and special requests made.

A special enrollment discount of 40% is available for the first week (code: HALFOFFLAUNCH) until September 10th, but you can join at any time and work with any of the material posted, ask questions and receive feedback. Learning material and downloadable, ready-to-use and -adapt resources will continue to be added until the end of November, and the full course will remain online through January 2024. Enrollment fees and content are subject to change without notice.

Addendum 1: On Thursday afternoon, September 7th, 2023, a presentation was made to introduce the first course topic - "Auto-translation Rules for Everyone". The recording and slides can be found here.

Addendum 2: Payment options for groups and monthly budgets have been introduced now. These options enable teams, departments and organizations to obtain blocks of passes for their members to receive continuing professional education in translation workflow tools. The host site applies VAT and other taxes where relevant and generates appropriate invoices. All relevant information can be found at the bottom of the information and enrollment page.

Jun 4, 2019

Regular expressions in memoQ demystified - THE workshop!

Next week in Utrecht there will be a unique workshop to enhance your productivity with memoQ, as you learn how to develop rules for automated formatting and QA of patterned expressions, such as dates, currency expressions, unusual or custom text formats and more. THIS knowledge is one of those "secret weapons" that I deploy to help the most sophisticated financial and legal translators I know save countless hours of mind-numbing donkey work doing QA on things like legal references and expressions involving currency (such as EUR 3 million vs. €3m, etc.) or creating those references in the first place and inserting them in the translation with a simple keystroke.

The course instructor, Marek Pawelec, is one of my personal resources when I am in over my head on technical problems or when I need to be very sure that a client of mine gets the right help in time. He has a rare gift of taking subject matter which many find baffling and presenting it in a way that makes it accessible to most any educated adult.

Because of the scope of this subject matter and the importance of proper follow-up and support while learning it, the workshop will be held over two days - June 10 and 11 (Monday and Tuesday) - from 10 am to 4 pm each day, which will give plenty of time to learn the basics and move on to apply your new technical skills to common and not-so-common technical challenges in translation projects where memoQ is involved.

Trust me on this one: we are talking about critical process secrets to save massive amounts of time and do better work on things like annual reports, court briefs and more. Or creating projects for text formats that seem impossible to work with at first glance. THIS is where the money is in an increasingly competitive market.

Information to register now can be found on the Facebook event page for the workshop or on the relevant Regex Workshop page for the host, the All Round Translator education cooperative in the Netherlands.

Jan 22, 2014

memoQ cloud: a team server "on tap"

This afternoon, Kilgray CEO István Lengyel held one of the best webinars I've seen him do yet to describe the convenient new hosted server facilities known as memoQ cloud, which I reviewed recently.

In the webinar, he explained the company's evolution of thought for online computing and how concerns about security were finally resolved to create a more sustainable offering than the more support-intensive "honeymoon" server solution.

He made it clear how existing desktop licenses for the Project Manager and Translator Pro editions can be used in combination with concurrent access licenses (CALs) for the server, as well as how cloud services can be suspended for periods in which they are not needed, saving considerable costs for those with only occasional needs to work in a coordinated online team.

Backing up the server configuration can be done quickly and easily from a Language Terminal account, so if cloud service is dormant for more than three months (after which data are deleted from the server), everything can be restored quickly when needed.

The webinar also included a demonstration of the integrated translation in web browsers, memoQ WebTrans. This is one way of providing access to the server for others who do not have installed copies of memoQ, or of working on your server when using other computers. Of course this interface also works in web browsers under other operating systems, such as MacOS or Linux. (Click on the graphic below to get a full-sized view of the web translation interface.)

Access to Kilgray's premium terminology server qTerm and memoQ server APIs is also available for an additional subscription fee. Subscribed services can be changed at any time as your needs evolve.

In the webinar, István showed how in about the same time it takes to enjoy a cup of coffee, one can get a free Kilgray Language Terminal account and register with a credit card for a month's trial of the memoQ cloud server (with any services available) for just €1/$1. If you are trying out services which you will not want beyond the trial period (like the API, qTerm or extra licenses), these can be set to cancel at the end of the trial period to avoid unwanted charges.

The embedded video below is a 20-minute tour of how simple it is to set up and manage projects in memoQ cloud. Use the icon at the lower right of the video frame to watch this on your full screen.

This is a good overview of the process, although the licenses aren't explained very well, and the project type recommendation is bad advice in many cases, as I pointed out in my post on server projects on segmentation and projects with desktop documents. Everything else in the video is good, but it's often very important to allow segmentation to be changed or corrected, particularly if the segmentation rules used in the project do not cover abbreviations which may split sentences in very unfortunate ways. If you need to have instantaneous access to work from other team members by using online documents, then the segmentation will need to be checked very carefully and corrected before the project begins to avoid difficulties.

Those testing the memoQ cloud server or using desktop editions of memoQ may also want to check out various free configuration resources on Language Terminal. These include special QA profiles, AutoCorrect files, import filters that are not part of the shipping product and auto-translation rules for easier translation of number and date formats, etc. Language Terminal offers other facilities which may be of interest even to those who do not use memoQ, such as the free InDesign server, which can create PDF previews of InDesign documents (very useful for reviews before delivery) or convert InDesign files of any type to XLIFF for translation in many different environments.

UPDATE:
The memoQ cloud webinar is now available to watch on Kilgray's page for recorded webinars; it can be accessed directly here or viewed in the embedded video below.

Dec 2, 2013

Segmentation in memoQ server projects

Segmentation difficulties are often one of the most troublesome aspects of working with translation environment tools. Learning to configure segmentation rules correctly and applying that knowledge can save many hours of wasted time in alignments and translations and avoid filling translation memory resources with garbage from fractured translations of partial sentences with missing verbs, subjects and whatnot.

The usual alternative remedy for inadequately configured segmentation rules which lack the segmentation exceptions needed for abbreviations, for example, is to use the "join" function (Ctrl+J), and sometimes the split function (to manage very long, unwieldy clauses such as one might find in a patent text, and then join the parts again later).

There are situations where joining and splitting of segments is blocked. This is the case with any file which is part of a view, for example; the view must be deleted before segments can once again be joined or split. Segmentation changes are also not possible in a server project which has not been set up to allow them.

There are several options for documents available to project managers when setting up a memoQ server project. But to enable translators to correct unfavorable segmentation, there is really only one choice:

If Desktop documents (no web translation) is selected, then on the dialog page which follows, changes in segmentation can be enabled:

If a project manager does not configure a project to allow this, for example because a document is being split between multiple translators (which does not allow for segmentation changes for technical reasons), the full responsibility must be assumed by the project manager for any segmentation issues. The imported documents should be examined carefully, and if any problems are observed, the segmentation rules should be modified and the documents re-imported. Doing otherwise may unavoidably result in garbage being written to the project's translation memory.

This is a very important point for memoQ trainers to emphasize when they are teaching users of the memoQ server to set up projects. Segmentation topics should be covered thoroughly, and the potential for bad results should be understood clearly if translators are given badly segmented documents they cannot fix. Project managers should also be encouraged to avoid restricting translators' options in ways which are likely to harm the quality of the results and make parts of the translation unfit for later re-use.

A good rule of thumb is always to choose the desktop documents option for projects unless there are very urgent reasons not to do so. In this way, you will avoid upsetting your translators by forcing unmanageable, fractured sentence fragments on them, and you will be assured of better quality translation memory resources.

Nov 20, 2013

memoQuickie: fixing segmentation with goofy product and company names

The insistence by some companies on branding themselves and their products with names whose capitalization violates the usual rules of a language can cause segmentation difficulties for translators working with translation environment tools.

To correct this difficulty in memoQ, find the segmentation rules for the language in question and edit them.

On the Custom lists tab of the segmentation rule set, select the #cap# list and add the troublesome name (such as iPhone, iPad or memoQ). By adding the names to this list, you are essentially defining them as "capital letters"; if "memoQ" is in the list, then segmentation will also work for "memoQuickies". Then reimport the file using that rule set:
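The effect of the #cap# list can be illustrated with a small Python sketch. This is a simplified model and not memoQ's actual rule syntax or engine: I assume a single rule that breaks after sentence-final punctuation and whitespace whenever a "capital" follows, where the capitals are the letters A-Z plus whatever names are passed in the hypothetical `cap_list` parameter.

```python
import re


def segment_with_cap_list(text, cap_list=()):
    """Split text after ., ! or ? when the next token starts with a 'capital'.

    A 'capital' is any letter A-Z or any entry of cap_list (a stand-in for
    the idea behind a custom list of names treated as capitals).
    """
    alternatives = ["[A-Z]"] + [re.escape(name) for name in cap_list]
    capital = "|".join(alternatives)
    # Match the whitespace between the punctuation and the following "capital".
    boundary = re.compile(rf"(?<=[.!?])\s+(?=(?:{capital}))")
    return boundary.split(text)


if __name__ == "__main__":
    text = "The update is ready. iPhone users should install it. memoQuickies are short."
    print(segment_with_cap_list(text))                      # stays in one piece
    print(segment_with_cap_list(text, ["iPhone", "memoQ"]))  # breaks before both names
```

Because the entries are matched as prefixes of what follows the break, "memoQ" in the list also covers "memoQuickies", which mirrors the behavior described above.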

Sep 28, 2013

memoQ filters for static and dynamic views and navigation

The filtering functions for translation documents in memoQ are really cool. I'm not talking about the import filters for different types of documents, though most of these are rather good, and improvements are being made all the time. I mean the ways in which you can use filters to look at the content you are translating and sort and navigate it in different ways.

The filters can be used to create static views from one or more translation documents selected on the Documents tab of the Translations window of Project home using the Create view command.

I use this function a lot to create comment and feedback lists for clients or select some particular part of my content that I want to save and work on or share separately from the rest. Exported as bilingual RTF tables, the content can be corrected or questions answered in the comments column, and all the changes and commentary can be re-imported to update your project.

What I use even more often are the dynamic filters in the translation window. There are three main types: sorting filters in a dropdown list, source and target text filters in the fields above the related columns, and the dialog filter with its many options, which is invoked with the funnel icon (2):

Source and target text filtering of the segments can be made case-sensitive by marking the icon (1). Any number of filtering operations can be applied cumulatively in sequence, and filters applied with the source and target text fields can be cleared with the red X icon (4). To clear a sorting filter, you must select "No sorting" from the dropdown menu.

These view filters can be very helpful for translation and quality assurance. But what many do not realize is that memoQ also allows you to navigate through translation segments using filter criteria. This is done with Edit > Goto Next (Ctrl+G). The filter criteria to apply for navigation are chosen under Edit > Goto Next Settings (Ctrl+Shift+G).

This often has the advantage over view filtering that all the segments remain visible and you can see the context better. Examples of this are shown for navigating to commented segments and navigating through the many footnotes in a document to check their formatting in the short (3.5 minute) video tutorial below. It was prepared with the most recent build of memoQ 2013 (6.5.15) and shows the new "golden" bubble icon for commented segments. The video demonstrates (with footnotes) how tag type can be used as a filtering or segment navigation criterion. This might come in handy for an academic thesis or a legal document with many footnotes to check.

Use the icon at the bottom right of the video to toggle full screen mode for viewing; this makes it much easier to see the details of this somewhat fast-paced clip.

0:17 Static views on the View tab of the Translations window
0:30 Dynamic source and target text field filtering
0:40 Dialog filtering with the funnel icon (3rd cumulative filter)
0:52 Using filters to navigate: Goto Next (rationale & contrast with view filter)
1:20 Goto Next settings for navigating commented segments
1:42 Navigating footnotes in a translation document with Goto Next
1:55 Setting the navigation filter for a tag type
2:30 Getting rid of a static view to allow segment joining
3:23 Oops! I join something I shouldn't and split the segment again, hoping nobody will notice.

Subscribe to my free YouTube channel and I think you'll receive updates of new video tutorials I add (or at least it'll be easier to find them). I would also like to thank Ulrich Scheffler of LSP.net for providing me with a Camtasia license recently to support my teaching - it's much better than the free Open Source CamStudio I started working with months ago, though I can definitely recommend CamStudio as a good tool to get started with making demonstration or teaching videos.

Dec 4, 2012

Migrating memoQ segmentation rules to another language variant

As we work with memoQ, many of us optimize our segmentation rules bit by bit to improve the results when importing documents to translate or align. But what do you do when you've worked hard setting up good rules for your usual target language but one day find a need to apply them to another variant of the language? Take all the rules you've spent months fine-tuning for "German" and apply them to "German (Austria)", for example?

The current version of memoQ offers no obvious way to do this. But it's not that hard to "cheat" and save many hours of tedious and unnecessary work.

First, find the ruleset you want to reuse. Segmentation rules can be accessed in three ways:

Tools > Resource Console... > Segmentation rules > [language in the drop-down menu]
Tools > Options > Default resources > Segmentation rules > [language in the drop-down menu]
Project home > Settings > Segmentation rules

Click Export and save the MQRES file somewhere you can find it.

The segmentation rule sets are named like this: <language-variant>#NameOfRuleset.mqres

So for German, for example: ger#MyRuleset.mqres, ger-DE#MyRuleset.mqres, ger-AT#MyRuleset.mqres, etc.

Rename the ruleset appropriately, then open it in Notepad or another text editor:

Then adapt the information in the part marked red here:

Change the text between the language tags particularly. The final result will be something like this:

Save the file. Then go back to the segmentation rules in memoQ in one of the three places listed above.

Click Import new and select the file you edited.

Make whatever changes you like to the name and description. Click OK.

That’s all. If for some reason you forget to change the language codes in the file, you will get an error message (here I did that by choosing my original file for generic German and trying to import it to Austrian German):

This is not very helpful; it would be much better for memoQ to indicate that the language settings are wrong and perhaps offer to change them. But maybe that's on the Kilgray Roadmap for another day ;-)
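For anyone who has to do this for several variants, the renaming and editing steps described above could be scripted. The following Python sketch is only that, a sketch under assumptions: the file naming pattern comes from the description above, but the element name passed as `tag` and the UTF-8 encoding are placeholders of mine. Open your own exported file in a text editor first to see which tags actually hold the language code, and keep a copy of the original.

```python
from pathlib import Path
import re


def retarget_ruleset(src_path, old_code, new_code, tag="lang"):
    """Copy an exported ruleset file under a new language-variant prefix.

    tag is a hypothetical element name; replace it with whatever tag
    encloses the language code in your own exported file.
    Returns the new path and the number of language codes replaced.
    """
    src = Path(src_path)
    prefix, sep, rest = src.name.partition("#")
    if not sep or prefix != old_code:
        raise ValueError(f"expected a file name starting with '{old_code}#'")
    dst = src.with_name(f"{new_code}#{rest}")

    text = src.read_text(encoding="utf-8")  # assumption: the export is UTF-8
    open_tag, close_tag = re.escape(f"<{tag}>"), re.escape(f"</{tag}>")
    pattern = re.compile(rf"({open_tag}){re.escape(old_code)}({close_tag})")
    # Swap only the code between the given tags, leaving everything else alone.
    new_text, count = pattern.subn(lambda m: m.group(1) + new_code + m.group(2), text)

    dst.write_text(new_text, encoding="utf-8")
    return dst, count


if __name__ == "__main__":
    # Example call with the names used in this post; the tag name is a guess.
    new_file, replaced = retarget_ruleset("ger#MyRuleset.mqres", "ger", "ger-AT", tag="lang")
    print(new_file, replaced)
```

If the returned count is zero, the tag name was wrong for your file, and importing the copy would most likely give the same unhelpful error shown above.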

Nov 23, 2012

Happy Thanksgiving memoQ tutorials & book special

After fielding quite a few questions in the past week on segmentation problems with translation files and alignments as well as questions from a few colleagues and clients about ways to get better-looking output of the data stored in memoQ term bases (with SDL Trados MultiTerm), I prepared two longer tutorial scripts and distributed these with a segmentation practice file to registered subscribers to my ebook, memoQ 6 in Quick Steps. This is a little thank you and dividend for their early support of my first commercial publication effort for translator education.

To celebrate the national holiday in my native country as well as my own thanksgiving for all the ideas for best practice which my colleagues and clients continue to share, I have also set a "Thanksgiving Weekend Special" with a special rate for the ebook until the end of Sunday, California time. Registered purchasers will receive the book update for memoQ 6.2 in December as well as any further updates for a year.

Happy Thanksgiving, everyone!

Jun 16, 2012

memoQuickie: footnote, cross-reference & index entry segmentation in Microsoft Word files

If you have a Microsoft Word DOC file or RTF to translate, it is important to be aware of the different behaviors of the memoQ import filter options you can use. If there are footnotes, cross-references or index entries, it is far better to use the option to import the DOC or RTF file as DOCX.

The DOC file shown below has a footnote, a cross-reference and an index entry:

Adding it to a memoQ project with the default filter for Microsoft Word in memoQ 5

gives the following segmentation result:

Importing the same document with the DOCX option of the filter

yields much cleaner segmentation and better tags to work with:

Compare what some other programs do with this file:

WordFast Pro

DVX2 (DOC)

DVX2 (DOCX)

TagEditor salad (partial)

SDL Trados Studio 2009 segmentation

SDL Trados Studio 2011

There is room for improvement with most tools.

May 5, 2012

memoQuickie: fixing source segmentation from abbreviations

Do you see segmentation like the above in your projects? Annoying, right? This is easy to fix in memoQ.

Go to Tools > Options... > Default resources > Segmentation rules (in the row of icons):

Select the language (including sublanguage if relevant) and select the editable rule set, then click Edit.

On the tab for custom lists, add the offending abbreviations to the #abbr_short# list.

Re-import the document(s) on the Translations > Documents list of the Project home tab. The number of segments in my document was reduced from 197 to 134, because it was so laden with academic titles. Since I use versioning, any previously translated segments can be recovered quickly by Operations > X-Translate...

Sometimes I think that abbreviations I added aren't fixing the segmentation. In those cases I have usually switched to a sublanguage for which they were not entered.
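To show why an abbreviation list changes the segment count so much in a text full of titles, here is a small Python sketch. It is a toy model, not memoQ's segmentation engine: I assume a plain "break after ., ! or ? followed by whitespace" rule, and the `abbreviations` parameter stands in for the idea of a list like #abbr_short#.

```python
import re


def segment_with_abbreviations(text, abbreviations=()):
    """Split after ., ! or ? plus whitespace, unless the token is a listed abbreviation."""
    known = {a.lower() for a in abbreviations}
    segments, start = [], 0
    for match in re.finditer(r"[.!?]\s+", text):
        candidate = text[start:match.start() + 1]   # text up to and including the punctuation
        last_token = candidate.split()[-1].lower()
        if last_token in known:
            continue                                 # abbreviation: no break here
        segments.append(candidate.strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        segments.append(tail)
    return segments


if __name__ == "__main__":
    text = "Prof. Dr. Meier opened the session. Dr. Schulz replied."
    print(len(segment_with_abbreviations(text)))                     # fragments at every title
    print(segment_with_abbreviations(text, ["Prof.", "Dr."]))        # two clean sentences
```

In this example the same text goes from five fragments to two sentences once the two titles are listed, which is the same kind of reduction I saw in my document.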

Apr 26, 2012

Twitterview: SDL Trados Studio, memoQ, DVX2 and PDF extraction

When I began using Twitter somewhat hesitantly three years ago, I never expected that it would eventually prove to be one of the most useful social media tools for gathering information of professional value. Much of this is serendipitous; I really never know what will come floating down the twitstream or where some of the conversations in it will go. Like the direct chat I had with a colleague in New Zealand about features she liked best in the two main CAT tools she uses, SDL Trados Studio and memoQ.

We both really appreciate the TM-driven segmentation in memoQ and the superior leverage this offers. But to my surprise, she expressed a preference for SDL Trados Studio, particularly for the quality of its PDF text extractions from electronically generated files. This is not a feature I make heavy use of in either tool, though I have used it more often lately in memoQ for alignments in the LiveDocs module and found it generally satisfactory. Most of my work involving PDF files is with scanned documents - there one has no choice but to use a good OCR tool like OmniPage or ABBYY FineReader.

So I was quite intrigued that the quality of PDF extraction was "better" than from standalone tools. Especially because my experience is quite different. Further discussion (not shown in the graphic) revealed that what she actually meant was that the quality of the text extraction with the CAT tool usually beat the quality of text received from translation agencies who performed conversions. That is easy to explain, really. In my experience, most agencies are clueless about how to use conversion tools and too often use automated settings and save the results "with layout". This is very often utterly unsuited for work with translation environment tools or requires a lot of cleanup and code zapping.

For years I have recommended to agencies and colleagues that they spare themselves a lot of headaches by saving PDF conversions as plain text and adding any desired formatting later. Most people ignore that advice and suffer accordingly. So in a way, a CAT tool that does so encourages "best practice" for PDF translation for those files it is actually able to handle.

Encouraged by the Twitter exchange, I decided to do a few tests with files from recent projects. I took a PDF I had with various IFRS-related texts from EU publications. It appeared to extract quickly and cleanly in memoQ, giving me a translation grid full of nicely segmented text. SDL Trados Studio 2009 choked badly on it and extracted nothing. Her extraction in SDL Trados Studio 2011 caused a timeout with the project, I was told, but the text itself was completely extracted and converted to DOCX format. This is useful, because unlike the extraction to plain text in memoQ, this offers the possibility to add or change some text formatting in the translation grid. Other extraction examples from SDL Trados Studio 2011 showed that text formatting was preserved.

A closer examination of the extracted texts revealed some problems with both the memoQ and Trados Studio extractions. The memoQ 5 PDF text extraction engine proved incapable of handling text in multiple columns properly. The paragraph order was all fouled up. The extraction with SDL Trados Studio had a great number of superfluous spaces. Whether it is possible to optimize this in the settings somehow I do not know. The results of all the extraction tests are downloadable here in a 6 MB ZIP file. I've included the SDL Trados Studio extraction saved to plain text as well for a better comparison of the text order and surplus spaces problems.

Overall, I am personally not very pleased with the results of the text extractions from PDF in either tool. The results from SDL Trados Studio are clearly better, and other examples that were shared made it clear that this tool works better than many an untrained PM with better PDF conversion software. This is certainly much better than solutions I see many translators using. But really, nothing beats good OCR software, an understanding of how to use it well and a proper workflow to get a good TM and target file better fit for most purposes.

*****

Update 2012-05-22: I met colleague Victor Dewsbery at a recent gathering in Berlin, and he told me about his tests with the recently introduced PDF import feature of Atril's Déjà Vu X2 translation environment. He kindly offered to share his results (available for download here) and wrote:

Here is the result of the PDF>DVX2>RTF>ZIP process for your monster EU PDF file. Comments on the process and the result:

The steps involved were: 1. import the file into DVX2 as a PDF file; 2. mark all segments and copy source to target; 3. export the file as if it were a translated file (it comes out as an RTF file). The RTF file is 20 MB in size and zips to 3 MB.

Steps 1 and 3 took a long time, and DVX2 claimed to be not responding. For step 1 I just left it and it eventually came up with the goods. Step 3 exported the RTF file perfectly, even though DVX2 claimed that the export had not finished. I was able to open the RTF file (it was locked, but I simply renamed it), and this is the version which I enclose. Half an hour later DVX2 had still not ended the export process (and had to be closed via the Task Manager), although the exported file was in fact perfectly OK. The procedure worked more smoothly with a couple of smaller PDF files. Atril is working on streamlining the process and ironing out the glitches in the process, especially the “not responding” messages.

The result actually looks very good to me. There are hardly any codes in the DVX2 project file (the import routine also integrates CodeZapper). I didn’t spot any mistakes in the sequence of the text. Indented sections with numbering seem to be formatted properly - i.e. with tabs and without any multiple spaces.

The top and bottom page boundaries in the exported file are too wide, so most pages run over and the document has over 900 pages instead of just under 500. Marking the whole document and dragging the header/footer spaces in Word seems to fix this fairly quickly.

I note that some headlines are made up of individual letters with spaces between them. This may be related to the German habit of using letter spacing (“Sperrschrift”) for emphasis as an alternative to bold type.

I found one instance where text was chopped up into a table on page 857 of the file.

There are occasional arbitrary jumps in type size and right/left page boundaries between sections.

On the strength of this sample, it would usually be OK to simply import the PDF file into DVX2, translate in the normal way, and then fix any formatting problems in the exported file.

Apr 2, 2012

TM-driven segmentation in memoQ

One old feature of memoQ which continues to put cash in my pocket and make my work go faster is TM-driven segmentation. It is a pretranslation option. In theory, it combines and splits segments to improve matches from the TM; in reality it is biased toward combination, which is a good thing, as it emphasizes coherent text chunks.

I recently completed a translation for my least favorite end client of an agency partner I rather like. I suppose the folks at this end client company are nice enough; most probably do not beat their dogs or their children. But the texts they send for translation are abusive in the extreme: Microsoft Word files generated by some sort of program on a host system, with a bizarre mix of colors and font changes (both type and size), as well as lots of superfluous line breaks and carriage returns. I presume the thought for the latter is to avoid overlapping graphics, but since text wrap is turned on for the graphics anyway, I don't see the point. What I do see is horrible German sentences horribly mutilated into as many as five or six chunks, but at least two or three most of the time. A real crime.

And did I mention that segments break at the color and font changes even for sentences which appear intact? No CAT guru has ever been able to figure that one out.

One such horror revisited me last week, and I put it off as long as I could. Finally, I got to work at the point where the deadline was very much in doubt, and as an afterthought I did something I usually forget about: I pretranslated the file. I applied the "TM-driven segmentation" feature, which is not considered in the file analysis. To my amazement, most of the file pretranslated with matches over 95%. When the remaining empty segments were examined and 4 or more parts joined to make a sentence, most were 99% matches. I had completely forgotten that I had translated this material a year ago. And the agency was unaware of that as well, because they rely on traditional Trados methods for file analysis and processing. What I thought was going to be a very hard slog through about 500 horrible segments turned out to be a bit of tag tweaking and a few sentences of updating the text.

This is part of what my agency friends who have gone over to memoQ mean when they talk about improved leverage over time from legacy resources.

To demonstrate how this works, I took a bit of text on "technical terminology" from Wikipedia and prepared it as a text with coherent sentences and also as a text with lots of superfluous carriage returns like one might find with text copied from a PDF file, for example:

I translated the file with intact sentences in memoQ, then ran an analysis using the Operations > Statistics... function:

The file's segments looked like this in memoQ:

Then the file was pretranslated using the TM-driven segmentation option:

This was the result:

The exclamation marks indicate missing tags, which may cause problems. In cases like this I usually insert them at the end and clean up the spacing in the output target file. And if I send a TMX to someone I clean the crap tags out of it with a search and replace operation in a text editor.
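For anyone who prefers a script to a text editor, that sort of cleanup might look like the sketch below. It is not the exact search and replace used here. It assumes the tags to drop are the TMX 1.4b inline code elements `bpt`, `ept`, `it`, `ph` and `ut`, and the function name `strip_inline_tags` is invented for the example. Whatever sits inside those elements is discarded with them, so test on a copy of the TMX first.

```python
import re

# Inline elements that carry native formatting codes in TMX 1.4b.
# <hi> is deliberately left out because it wraps translatable text.
CODE_ELEMENTS = r"(?:bpt|ept|it|ph|ut)"

# Self-closing form, e.g. <ph x="1"/>
EMPTY = re.compile(rf"<{CODE_ELEMENTS}\b[^>]*/>")
# Paired form, e.g. <bpt i="1">&lt;b&gt;</bpt>
PAIRED = re.compile(rf"<({CODE_ELEMENTS})\b[^>]*>.*?</\1>", re.DOTALL)


def strip_inline_tags(tmx_text: str) -> str:
    """Remove inline code elements and their content from TMX text.

    Self-closing elements are removed first so that the paired pattern
    cannot mistake one of them for an opening tag.
    """
    without_empty = EMPTY.sub("", tmx_text)
    return PAIRED.sub("", without_empty)
```

The same two patterns should work in any text editor with regular expression search, applied in the same order with an empty replacement.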

To satisfy my curiosity I then deleted the contents of the TM and made a new source file with a couple of broken segments:

Note the lesser quality of what will be going to the TM. This is the diet Trados users have enjoyed for a long time or, for that matter, what anyone who uses a CAT tool without the ability or knowledge to join segments may swallow routinely. After that was sent to the TM, I re-translated the file with intact sentences:

In Segment 1, a split was made, but no pretranslation was done of the fragment (even though it was in the TM as "101%"). In Segment 4 the sentence was not split but instead taken as a fuzzy match. The information pane at the right of the translation window shows the differences with the TM information:

I am not disturbed by the more restrained matching when splits are involved. I consider it a good thing, a feature which encourages users to wean themselves off the bad practice of "translating" text which has been impossibly chopped up. Smart translators use the functions for segment joining and splitting frequently in a good CAT tool, and memoQ rewards this habit particularly well.

Jan 2, 2012

ODT files in translation environment tools

After an interesting afternoon with a friend who was a bit frustrated with the behavior of her translation assistance technology with an ODT (Open Office text) source file, I decided to have a look at how a variety of common tools handle this format. I created a small test file which contained some of the troublesome elements and saved it as *.odt for testing. The test file looked like this:

The ordered list was created using the numbering feature.

When the file was imported to OmegaT, the segmentation looked as follows:

Fairly clean, though the segmentation is a bit off due to the encoding of the space after the end of the sentence in the second block of text. Nine segments where there should have been ten.
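To see how a space is encoded in an ODT file, one can look inside it: an ODT is a ZIP archive whose text lives in `content.xml`, and the OpenDocument format writes runs of extra spaces as `text:s` elements rather than literal characters. The sketch below lists the paragraphs that contain such elements. That this is the encoding at issue in the test file is an assumption, and the function names are invented for the example.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace of the OpenDocument text elements (text:p, text:s, ...).
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"


def read_content_xml(odt_path_or_file) -> bytes:
    """Return the raw content.xml from an ODT file (path or file object)."""
    with zipfile.ZipFile(odt_path_or_file) as odt:
        return odt.read("content.xml")


def paragraphs_with_space_tags(content_xml: bytes):
    """Return (paragraph_text, number_of_text_s_elements) for every
    paragraph that contains at least one <text:s> element."""
    root = ET.fromstring(content_xml)
    s_tag = f"{{{TEXT_NS}}}s"
    hits = []
    for para in root.iter(f"{{{TEXT_NS}}}p"):
        tags = list(para.iter(s_tag))
        if tags:
            # itertext() skips the empty <text:s> elements themselves,
            # so the spaces they stand for are missing from the text.
            hits.append(("".join(para.itertext()), len(tags)))
    return hits
```

Calling `paragraphs_with_space_tags(read_content_xml("test.odt"))`, with `test.odt` as a placeholder file name, shows which paragraphs a filter would have to handle specially when it decides where a sentence ends.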

With memoQ, the result was:

Altogether there were a dozen segments after import. The part with the hyperlink was segmented incorrectly in three parts instead of one. However, memoQ did handle the space tag after "tool." correctly and start a new segment at "Here". One can, of course, use the segment joining function

to correct the segmentation until Kilgray gets around to fixing the segmentation on the hyperlink tag:

Update 9 January 2012: The developers at Kilgray have informed me now that this quirk in the ODT filter has been corrected and will be included in the next build released.

When I tried to test my SDL Trados Studio 2009 license, at first it refused to join the party:

Never a dull moment with SDL, as we all know. SDL Trados 2007 was in fact installed, but when I upgraded to Studio 2009, of course it trashed my 2007 installation. I had been too irritated to do anything about it for over half a year, since I don't use Trados for anything more than file preparation and compatibility testing anymore, and I was still able to do that for my projects with the damaged installation. However, when I discovered that the ODT file caused TagEditor to run and hide without even saying goodbye, I sighed deeply and wasted half an hour reinstalling SDL Trados 2007. At least I didn't have to go through that insane check-in/check-out license procedure online. I trusted in God and my Windows Registry entries, and the location of my license file was remembered, so all was well.

The second attempt at SDL Trados Studio 2009 was much better:

Same segmentation problem as OmegaT, and examining the tags reveals where the issue might be addressed in a tweak of the filter.

I haven't got the latest upgrade, but someone was kind enough to run my test file through SDL Trados Studio 2011, which appears to offer the best results for filtering ODT (the settings were slightly different, with the URL included, but that is also possible with some other tools):

SDL Trados TagEditor also worked after re-installation. The results were:

Oh dear. Well, it works, but if I still used TagEditor, I would run, not walk, to the much cleaner interface of OmegaT for this sort of thing if I didn't have the good sense to upgrade to Studio or something else commercial. Note the same segmentation issue and the need for filter modification.

Victor Dewsbery was kind enough to import my test file to the original Atril DVX and the newer DVX2 and send me the results:

DVX import of the test file

DVX2 import of the test file

I also tried to test SDLX, Wordfast Pro and Wordfast Anywhere. The first two tools don't support ODT. Wordfast Anywhere claims to, but went nowhere, with the following status message displayed in my browser for about half an hour before I gave up and went to lunch:

Of course I canceled. I had a blog post to write and a New Year to get on with. Anyone who wants to try the test file in another tool (to compare apples with apples) can get it here.
