Aug 19, 2015

Doing the deed with the DGT

Several years ago a legal translator in my circles began to use memoQ for her work, and I was asked to help with the migration of data from her old environment. When she was introduced to memoQ LiveDocs, she was delighted to learn that she could view the original document text or bitext of concordance hits for content saved in a LiveDocs corpus.

Because her work involved a lot of references to EU directives and other information sources from the EU, the parallel corpora from the European Commission's Directorate-General for Translation (DGT) had great value to her work. These are enormous bodies of data, totaling several million translation units and growing constantly. Many translators in the EU use this data, but its sheer bulk is a burden for many translation environments, and the lack of context often limits the value of information retrieved from these corpora when they are stored in translation memories.

So she decided to store the DGT data in LiveDocs. Because the DGT translation memories contain their data in sequential document order, the document context of any concordance hit can be viewed using the context menu in the memoQ concordance.

Thanks to the expansion of file types which can be included in LiveDocs since that time, it is easier than ever to import data from parallel corpora like the EU DGT and use these to support translation work. Using the LiveDocs approach, the extraction of a single large bilingual TMX file from the many zipped data collections is also completely unnecessary (in fact, the extreme quantity of data in those single files inevitably causes memory problems). To build reference corpora for concordancing or the construction of predictive typing resources such as Muses in memoQ, it is simply necessary to unpack the individual zip files into folders full of small TMX files and then import these folder structures into memoQ.
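If there are many zip archives, the unpacking step can be scripted. What follows is only a minimal sketch in Python, assuming the downloaded archives sit together in one folder. The function name `unpack_archives` and the folder layout (one subfolder per archive) are my own choices for illustration and are not part of memoQ or the DGT distribution.

```python
import zipfile
from pathlib import Path


def unpack_archives(source_dir, target_dir):
    """Unpack every .zip in source_dir into its own subfolder of target_dir.

    Only .tmx members are extracted. Returns a dict mapping each archive
    name to the number of TMX files taken from it.
    """
    source = Path(source_dir)
    target = Path(target_dir)
    counts = {}
    for archive in sorted(source.glob("*.zip")):
        # One subfolder per archive, named after the archive (hypothetical layout).
        out_dir = target / archive.stem
        out_dir.mkdir(parents=True, exist_ok=True)
        with zipfile.ZipFile(archive) as zf:
            members = [n for n in zf.namelist() if n.lower().endswith(".tmx")]
            zf.extractall(out_dir, members=members)
        counts[archive.name] = len(members)
    return counts
```

The resulting top-level folder can then be handed to the LiveDocs import in one go.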



Include only TMX files in the LiveDocs corpus import.


Selecting the desired languages extracts the bilingual data from the individual TMX files, which contain data in all the official EU languages. If a particular file does not contain the desired pairing, a corresponding message will be displayed. Don't worry about it.
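To make the idea of pulling one language pair out of a multilingual TMX file concrete, here is a rough Python sketch. It is not how memoQ does it internally; it only illustrates the principle. The function name `extract_pairs` is hypothetical, and I am assuming language codes such as `EN-GB` and `DE-DE` stored in either an `xml:lang` or a plain `lang` attribute on each `tuv` element, matched exactly apart from case.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under this qualified name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def extract_pairs(tmx_path, source_lang, target_lang):
    """Return (source, target) segment pairs for one language pair.

    Translation units lacking either language are skipped, so a file
    without the desired pairing simply yields an empty list.
    """
    src = source_lang.lower()
    tgt = target_lang.lower()
    pairs = []
    for tu in ET.parse(tmx_path).getroot().iter("tu"):
        segments = {}
        for tuv in tu.iter("tuv"):
            # Accept both xml:lang and the older plain lang attribute.
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                segments[lang] = "".join(seg.itertext())
        if src in segments and tgt in segments:
            pairs.append((segments[src], segments[tgt]))
    return pairs
```

An empty result here corresponds to the case where memoQ reports that a file does not contain the desired pairing.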


This approach of loading smaller TMX files into LiveDocs overcomes the memory problems which may occur with gigantic files. And once these smaller files are in a LiveDocs corpus, they can be selected en masse and exported to one or more translation memories.

In fact, this approach is a useful way around the current inability of memoQ translation memories to import more than one TMX file directly at a time. This may be helpful, for example, to OmegaT users who want to migrate their many TMX translation memories (one from each project!) when they start using memoQ.
