Jan 8, 2012

United Nations General Assembly resolutions: 6-language parallel corpus

Thank you to colleague Christian Taube for pointing me to another interesting public dataset, the parallel corpus of UN General Assembly resolutions available in TMX format in English, Russian, French, Arabic, Chinese and Spanish. All six languages are in the same TMX file. Altogether there are over 72,000 entries. Not a big collection compared to the EU DGT data, but possibly of better quality and certainly useful for those whose translations relate to the subject matter and involve pairs drawn from these six languages.
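For anyone who wants to check entry counts like these independently of a CAT tool, here is a minimal Python sketch that streams through a TMX file and tallies translation units. The function names (`lang_of`, `tmx_stats`) are made up for illustration, and it assumes the file uses `xml:lang` on each `<tuv>` with codes whose first subtag is the language (for example `EN` or `en-US`); I have not verified how this particular corpus encodes its language codes. It also counts distinct source/target pairs, since duplicates are one plausible reason a tool might import fewer entries than it reads.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# ElementTree exposes the xml:lang attribute under this qualified name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def lang_of(tuv):
    """Return the primary language subtag of a <tuv> in lowercase ('EN-US' -> 'en')."""
    code = tuv.get(XML_LANG) or tuv.get("lang") or ""
    return code.replace("_", "-").split("-")[0].lower()


def tmx_stats(path, source_lang, target_lang):
    """Stream through a TMX file and count its translation units (TUs)."""
    total = 0
    per_lang = Counter()      # number of TUs that contain each language
    pair_total = 0            # TUs that contain both requested languages
    distinct_pairs = set()    # unique (source text, target text) combinations
    for _event, elem in ET.iterparse(path, events=("end",)):
        if elem.tag != "tu":
            continue
        total += 1
        texts = {}
        for tuv in elem.findall("tuv"):
            seg = tuv.find("seg")
            text = "".join(seg.itertext()) if seg is not None else ""
            # Keep only the first variant if a language appears twice in one TU.
            texts.setdefault(lang_of(tuv), text.strip())
        per_lang.update(texts.keys())
        if source_lang in texts and target_lang in texts:
            pair_total += 1
            distinct_pairs.add((texts[source_lang], texts[target_lang]))
        elem.clear()  # drop the segment text once it has been counted
    return {
        "total": total,
        "per_language": dict(per_lang),
        "pair_total": pair_total,
        "pair_distinct": len(distinct_pairs),
    }
```

Calling something like `tmx_stats("uncorpora.tmx", "en", "ru")` (the file name is a placeholder) returns a dictionary with the four counts. How any given CAT tool actually treats duplicates on import is not something this sketch can tell you.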

Importing into a CAT tool is a fairly simple matter: when the languages of interest are specified, the respective tool will extract the data desired. I tested this in memoQ (53 K entries imported) and SDL Trados Studio 2009 (48 K entries imported) with English and Russian, and the process was quite painless. I presume that the difference of about 20,000 between the translation units read and the number imported has to do with redundancies, and that memoQ and SDL Trados Studio interpret these in slightly different ways, but I don't really know. I also tested OmegaT, but unfortunately it chokes on the file when copied into the TM directory, with an error message indicating a memory problem.


To use this data in OmegaT, the desired language pair must first be extracted with another tool to create a bilingual TMX file (as opposed to the hexalingual one). I exported the data I had brought into memoQ to make an English/Russian TMX and it then worked fine in OmegaT.
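If no CAT tool is handy for the extraction, the same thing can be roughed out in a few lines of Python. The sketch below is only an illustration under some assumptions: the function names (`lang_of`, `extract_pair`) are invented here, language codes are matched on their first subtag only (so `EN` and `en-US` both count as `en`), and the output keeps the original header unchanged, including its `srclang` value. It streams the input rather than loading the whole file, and keeps only those translation units that contain both requested languages.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr

# ElementTree exposes the xml:lang attribute under this qualified name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def lang_of(tuv):
    """Return the primary language subtag of a <tuv> in lowercase ('EN-US' -> 'en')."""
    code = tuv.get(XML_LANG) or tuv.get("lang") or ""
    return code.replace("_", "-").split("-")[0].lower()


def extract_pair(src_path, dst_path, source_lang, target_lang):
    """Copy only the TUs holding both languages into a new TMX; return how many were kept."""
    wanted = {source_lang.lower(), target_lang.lower()}
    kept = 0
    with open(dst_path, "w", encoding="utf-8") as out:
        out.write('<?xml version="1.0" encoding="UTF-8"?>\n')
        for event, elem in ET.iterparse(src_path, events=("start", "end")):
            if event == "start":
                # Re-create the outer structure as soon as it is seen.
                if elem.tag == "tmx":
                    version = elem.get("version", "1.4")  # fallback is an assumption
                    out.write("<tmx version=%s>\n" % quoteattr(version))
                elif elem.tag == "body":
                    out.write("<body>\n")
                continue
            if elem.tag == "header":
                elem.tail = "\n"
                out.write(ET.tostring(elem, encoding="unicode"))
            elif elem.tag == "tu":
                # Drop the variants in languages we did not ask for.
                for tuv in elem.findall("tuv"):
                    if lang_of(tuv) not in wanted:
                        elem.remove(tuv)
                present = {lang_of(tuv) for tuv in elem.findall("tuv")}
                if present == wanted:
                    elem.tail = "\n"
                    out.write(ET.tostring(elem, encoding="unicode"))
                    kept += 1
                elem.clear()  # free the memory held by this unit
        out.write("</body>\n</tmx>\n")
    return kept
```

A call such as `extract_pair("uncorpora.tmx", "en-ru.tmx", "en", "ru")` (both file names are placeholders) would write the bilingual file and return the number of units kept. I would still check the result in a proper TMX editor before relying on it.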

The following suggestion was offered by
If you work on Windows, you can drag/drop the TMX file in Olifant (the "old"
.NET version: http://sourceforge.net/projects/okapi/files/)

That will show you the 74,070 entries in 6 languages in it.

Then do "File > Export", select the output file name, and then the 2 languages
you want in that new TMX file.

This tip is good for everyone, of course, not just OmegaT users. The Okapi tools are excellent and free for maintaining TMs when the tool providers have so far not bothered to offer decent facilities for maintaining the data in their software (not naming any names here, but you know who you are...).

3 comments:

  1. I just created an EN-RU project on my Mac, with nothing in /source/, and put the 184 MB uncorpora TMX file into /tm/.

    OmegaT opened the project without any issue after 5-6 seconds. Then I made a search on "France" (exact search/Case sensitive/in source/in translations/Search TMs) and found 819 segments in 6-7 seconds.

    Searching for "." ("period only" with Regular expressions/In source/Search TMs and Number of results=100000) gave me 74067 matches after about 2-3 minutes. And the coloring of the search string (source all in blue, since I was looking for "any character in source") took another 10 seconds. That shows that OmegaT "imported" all the segments without exception.


    You must understand that Java applications can use only as much memory as the Java Runtime Environment assigns them. The default depends on the platform, the Java version and a few other really obscure factors.

    What you did was run OmegaT with the default JRE settings, which are way below what is required to translate a reasonably sized project, not to mention a 184 MB TMX file.

    I don't know if the Windows install of OmegaT comes with any set value for OmegaT's assigned memory (and if it does, then it is too low), but if you assign it 1024 MB of RAM in the INI file that comes with the Windows package, you should be fine. In fact, 1024 MB of RAM is the default setting in the Mac package.

    I did test the TMX with the generic Java package, without assigning it any extra memory and indeed OmegaT did not accept the TM for lack of memory (the error message even displayed the amount of memory Java had assigned OmegaT as default value).

    I am guessing Windows native applications don't have that limitation, because Windows automatically assigns them all the memory they need (and even more with swap memory when necessary), so you may want to do your testing again after changing OmegaT's startup values in its INI file.

    Jean-Christophe

  2. Thank you for the tip, Jean-Christophe. The error message did, of course, indicate the memory problem as you pointed out, but in my experience with software, I have often found that there is more to the story than the error message, and sometimes the messages may be wrong or misleading. Although OmegaT will indeed work here with more memory allocated, there is more to the story somehow with the hexalingual file. As you noted, there are about 70,000 TUs. In my recent blog post comparing concordances, the same memory allocation for OmegaT was able to handle half a million TUs! So obviously, something about that hexalingual structure is stressing the program.

    That said, I would like to point out that for speed of lookup, OmegaT has come out at the top of all my tests with CAT tools. The only reason I haven't put that in hard numbers is that (1) I don't have a stopwatch and (2) I get bored and fall asleep waiting for some of the commercial packages to finish a search in something the size of the EU DGT data sets.

  3. Hi, Kevin, thanks for all the tips, and Jean-Christophe, too :)

    The hexalingual TM works fine on my computer, the same as the EU TMs. It's just slower, that's all.

    Saludos

    Jose

