Jan 2, 2012

ODT files in translation environment tools

After an interesting afternoon with a friend who was a bit frustrated with the behavior of her translation assistance technology with an ODT (Open Office text) source file, I decided to have a look at how a variety of common tools handle this format. I created a small test file which contained some of the troublesome elements and saved it as *.odt for testing. The test file looked like this:

The ordered list was created using the numbering feature.

When the file was imported to OmegaT, the segmentation looked as follows:

Fairly clean, though the segmentation is a bit off due to the encoding of the space after the end of the sentence in the second block of text. Nine segments where there should have been ten.

With memoQ, the result was:

Altogether there were a dozen segments after import. The part with the hyperlink was segmented incorrectly in three parts instead of one. However, memoQ did handle the space tag after "tool." correctly and start a new segment at "Here". Once can, of course, use the segment joining function to correct the segmentation until Kilgray gets around to fixing the segmentation on the hyperlink tag:

Update 9 January 2012: The developers at Kilgray have informed me now that this quirk in the ODT filter has been corrected and will be included in the next build released.

When I tried to test my SDL Trados Studio 2009 license, at first it refused to joint the party:

Never a dull moment with SDL as we all know. Of course SDL Trados 2007 was in fact installed, but when I upgraded to Studio 2009, of course it trashed my 2007 installation, and I had been too irritated to do anything about it for over half a year since I don't use Trados for anything more than file preparation and compatibility testing anymore, and I was still able to do that for my projects with the damaged installation. However, when I discovered that the ODT file caused TagEditor to run and hide without even saying goodbye, I sighed deeply and wasted half an hour reinstalling SDL Trados 2007. At least I didn't have to go through that insane check-in/check-out license procedure online. I trusted in God and my Windows Registry entries, and the location of my license file was remembered, so all was well.

The second attempt at SDL Trados Studio 2009 was much better:

Same segmentation problem as OmegaT, and examining the tags reveals where the issue might be addressed in a tweak of the filter.

I haven't got the latest upgrade, but someone was kind enough to run my test file through SDL Trados Studio 2011, which appears to offer the best results for filtering ODT (the settings were slightly different, with the URL included, but that is also possible with some other tools):


SDL Trados TagEditor also worked after re-installation. The results were:

Oh dear. Well, it works, but if I still used TagEditor, I would run, not walk, to the much cleaner interface of OmegaT for this sort of thing if I didn't have the good sense to upgrade to Studio or something else commercial. Note the same segmentation issue and the need for filter modification.

Victor Dewsbery was kind enough to import my test file to the original Atril DVX and the newer DVX2 and send me the results:
DVX import of the test file
DVX2 import of the test file.
I also tried to test SDLX, Wordfast Pro and Wordfast Anywhere. The first two tools don't support ODT. Wordfast Anywhere claims too, but went nowhere, with the following status message displayed in my browser for about half an hour before I gave up and went to lunch:

Of course I canceled. I had a blog post to write and a New Year to get on with. Anyone who wants to try the test file in another tool (to compare apples with apples) can get it here.

5 comments:

  1. "Nine segments where there should have been ten."
    You have to enable sentence level segmenting in the project properties dialog (best to do it while creating a project). I've got 10 segments in OmegaT with the same document.

    Also I don't have such a tag-mess, just 6 paired tags... In case you have created your document using MS Word, it has still issues regarding the implementation of ODF, so it's better to use an application with native ODF support.

    ReplyDelete
  2. Milos, sentence level segmentation is implemented. The problem here is the use of a tag to represent a space as you can see from the various screen shots. Also, I believe there are other potential problems if you use the "native" ODF support because of how ODF often comes into play in the translation environment. I got onto this particular test in the first place because Open Office was used to convert MS Word DOC files, and this often causes format difficulties. This led me to look at the differences between various tools in handling ODT. If you follow the workflow DOC > ODT > DOC because your tool can't support a DOC file, then you are best advised to take the path which will damage the structure of the original DOC the least. (Or better yet, skip Open Office entirely and just work with DOCX if you have access to an appropriate version of MS Office.)

    ReplyDelete
  3. I created an ODT file based on the formatting you show, then I imported it into DVX2. The result looks pretty clean. Of course this is only one scenario which may not apply to DOC files converted to ODT. I haven't tried it on DVX1.

    ReplyDelete
  4. Hi Kevin,

    Nice post as usual. Thought I would add Studio 2011 here too, just to complete the picture. We have support for *.ODP and *.ODS as well.

    http://youtu.be/1RurEF2fEXs

    I think this handles the file correctly as you wanted.

    Regards

    Paul

    ReplyDelete
  5. A couple of comments:

    Tagging in .odt files varies widely depending upon the application used to create or convert the .odt file (or .odp or .ods file, which are essentially the same format). There are wide differences between OpenOffice.org vs. LibreOffice, and even between different versions of LibreOffice. I opened your test file and closed it again in Abiword and LibreOffice and compared the resulting tag counts as displayed in OmegaT on my system. The results were:

    File as supplied by you: 24 tags
    After saving in Abiword: 16 tags
    After saving in LibreOffice v. 3.3.2: 34 tags

    That's both a very large number of tags, and a very wide variation, but neither is attributable to OmegaT as such.

    One consequence of this is that for comparison of tag display in different CAT tools, great care must be taken to use the same file. Even opening the file in another application and closing it again may be sufficient to reduce or increase the tag count in the CAT tool.

    Segmentation: in OmegaT, segmentation is determined by the segmentation rules, which in OmegaT are highly customizable. If you wanted to, you could probably adjust the segmentation rules to produce what you describe here as the "correct" segmentation, and possibly even to keep leading/trailing segment tags out of the displayed segment, thus reducing the tag count (though probably with a trade-off in format control).

    User customization like this is very consistent with the OmegaT philosophy. To say that OmegaT segments in a certain way is therefore a little like saying "Microsoft Word uses Ariel" because that happens to be the default font in Word. The ease of such customization (requiring knowledge of regex) is another matter of course, but that's OmegaT for you: a powerful tool in skilled hands.

    Keep up the good work,

    Marc P.

    ReplyDelete

Notice to spammers: your locations are being traced and fed to the recreational target list for my new line of chemical weapon drones :-)