Showing posts with label UN General Assembly resolutions. Show all posts

Jan 8, 2012

United Nations General Assembly resolutions: 6-language parallel corpus

Thank you to colleague Christian Taube for pointing me to another interesting public dataset, the parallel corpus of UN General Assembly resolutions available in TMX format in English, Russian, French, Arabic, Chinese and Spanish. All six languages are in the same TMX file. Altogether there are over 72,000 entries. Not a big collection compared to the EU DGT data, but possibly of better quality and certainly useful for those whose translations relate to the subject matter and involve pairs drawn from these six languages.
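
For anyone who wants to check the size of the file before importing it, here is a minimal Python sketch that counts the translation units and the language variants in a TMX file. It assumes a standard, namespace-free TMX layout (`tu` elements containing `tuv` elements that carry an `xml:lang` or older `lang` attribute). The function name `count_tmx` is my own, and I have not run this against the UN file itself, so the language codes it reports there may look different from the ones in the sample.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# ElementTree exposes the xml:lang attribute under this expanded name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def count_tmx(source):
    """Return (number of <tu> elements, Counter of lowercased language codes).

    `source` may be a file name or a binary file object.
    """
    tu_count = 0
    langs = Counter()
    # iterparse reads the file incrementally instead of in one go.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "tu":
            tu_count += 1
            for tuv in elem.iter("tuv"):
                # Fall back to the older "lang" attribute, then to "?".
                code = tuv.get(XML_LANG) or tuv.get("lang") or "?"
                langs[code.lower()] += 1
            # Drop the segment text of this unit once it has been counted.
            elem.clear()
    return tu_count, langs
```

Calling `count_tmx("corpus.tmx")` (a placeholder file name) returns the number of units together with a per-language tally, which gives a rough idea of how many units actually contain both languages of a given pair.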

Importing into a CAT tool is fairly simple: once the languages of interest are specified, the tool extracts the desired data. I tested this with English and Russian in memoQ (53 K entries imported) and SDL Trados Studio 2009 (48 K entries imported), and the process was quite painless. I presume the difference of about 20,000 between the translation units read and the number imported has to do with redundancies, which memoQ and SDL Trados Studio apparently interpret in slightly different ways, but I don't really know. I also tested OmegaT, but unfortunately it chokes on the file when copied into the TM directory.


To use this data in OmegaT, the desired language pair must first be extracted with another tool to create a bilingual TMX file (as opposed to the hexalingual one). I exported the data I had brought into memoQ to make an English/Russian TMX and it then worked fine in OmegaT.
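
If no CAT tool is at hand for the extraction, the same step can be scripted. The following Python sketch is only an illustration under a few assumptions: the TMX is namespace-free, the language codes begin with a two-letter code such as `en` or `ru` (possibly followed by a region, e.g. `EN-US`), and the whole file fits in memory. The names `extract_pair` and `_matches` are hypothetical helpers of mine, not part of any tool mentioned here, and I have not verified the result in OmegaT. Note that ElementTree drops any DOCTYPE line when writing the output.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under this expanded name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def _matches(code, wanted):
    """True if `code` is `wanted` or a regional variant of it (en, EN-US, en_GB)."""
    code = code.lower()
    return code == wanted or code.startswith((wanted + "-", wanted + "_"))


def extract_pair(source, target, src="en", tgt="ru"):
    """Write a bilingual TMX containing only the src/tgt variants.

    `source` and `target` may be file names or binary file objects.
    Units that lack either language are dropped. Returns the number kept.
    """
    tree = ET.parse(source)
    body = tree.getroot().find("body")
    kept = 0
    for tu in list(body.findall("tu")):
        has_src = has_tgt = False
        for tuv in list(tu.findall("tuv")):
            code = tuv.get(XML_LANG) or tuv.get("lang") or ""
            if _matches(code, src):
                has_src = True
            elif _matches(code, tgt):
                has_tgt = True
            else:
                tu.remove(tuv)  # discard the other languages
        if has_src and has_tgt:
            kept += 1
        else:
            body.remove(tu)  # incomplete pair, not useful as a TM entry
    tree.write(target, encoding="utf-8", xml_declaration=True)
    return kept
```

A call such as `extract_pair("corpus.tmx", "en-ru.tmx", "en", "ru")` (both file names are placeholders) would write the bilingual file and return the number of units it contains. The sketch does not remove duplicate segment pairs, so the count may well differ from what memoQ or Trados report after import.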

The following suggestion was offered by
If you work on Windows, you can drag/drop the TMX file in Olifant (the "old"
.NET version: http://sourceforge.net/projects/okapi/files/)

That will show you the 74,070 entries in 6 languages in it.

Then do "File > Export", select the output file name, and then the 2 languages
you want in that new TMX file.

This tip is good for everyone, of course, not just OmegaT users. The Okapi tools are excellent and free, and they are a good way to maintain TMs when the tool providers have so far not bothered to offer decent facilities for maintaining the data in their own software (not naming any names here, but you know who you are....)