Showing posts with label AlignFactory. Show all posts

Sep 15, 2015

A quick trip to LiveDocs for EUR-Lex bilingual texts

Quite a number of friends and respected colleagues use EUR-Lex as a reference source for EU legislation. Being generally sensible people, some of them have backed away from the overfull slopbucket of bulk DGT data and built more selective corpora of the legislation which they actually need for their work.

However, the issue of how to get the data into a usable form with a minimum of effort has caused no little trouble at times. The various texts can be copied out or downloaded in the languages of interest and aligned, but depending on the quality of the alignment tool, the results are often unsatisfactory. I've been told that AlignFactory does a better job than most, but then the question of how best to deal with the HTML bitexts from AlignFactory remains.

memoQ LiveDocs is of course rather helpful for quick and sometimes dirty alignment, but if the synchronization of the texts is too many segments off, it is sometimes difficult to find the information one needs even when the (bilingual) document is opened from the context menu in a concordance window.

EUR-Lex offers bilingual or trilingual views of most documents in a web page. The alignments are often imperfect, but the synchronization is usually off by only one or two segments, so finding the right text in a document's context is not terribly difficult. These alignments are therefore usually quite adequate for use as references in a memoQ LiveDocs corpus. Here is a procedure one might follow to get the EUR-Lex data there.


The bilingual text of a view such as the one above can be selected by dragging the cursor to select the first part of the information, then scrolling to the bottom of the window and Shift+clicking to select all the text in both columns:


Copy this text, then paste it into Excel:


Then import the Excel file as a file for "translation" in a memoQ project with the right language settings. LiveDocs has quirks with data access when target language variants are specified and do not match, so I have created a "data conversion project" with generic language settings (DE + EN in my case, as opposed to my usual DE-DE + EN-US project settings) to ensure that data stored in LiveDocs can be accessed without trouble from any project. (This irritating issue of language variants in LiveDocs was introduced a few versions ago by Kilgray in an attempt to placate some large agencies, but it has caused enormous headaches for professional translators who work with multiple sublanguage settings. We hope that urgent attention will be given to this problem soon; until then, keep your LiveDocs language data settings generic to ensure trouble-free data access!)


When the Excel file is added to the Translations file list, there are two important changes to make in the import options. First, the filter must be changed from Microsoft Excel to "multilingual delimited text" (which also handles multilingual Excel files!). Second, the filter configuration must be changed to specify which data are in the columns of interest.


The screenshot above shows the import settings that were appropriate for the data I copied from EUR-Lex. Your settings will likely differ, but in each case the values need to be checked or set in the fields near the arrows ("Source language" particularly at the top and the three dropdown menus by the second arrow below).


Once the data are imported, some adjustments can be made by splitting or joining segments, but I don't think the effort is generally worth it, because in the cases I have seen, data are not far out of sync if they are mismatched, and the synchronization is usually corrected after a short interval.
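For anyone who wants to see where the two columns drift apart before deciding whether repairs are worth the trouble, a rough check can be run on the copied rows outside memoQ. The sketch below is only an illustration and not a memoQ feature: the function name `suspicious_rows` and the ratio thresholds are my own assumptions, and a length ratio is a crude signal that will flag some perfectly good pairs.

```python
def suspicious_rows(pairs, low=0.5, high=2.0):
    """Return the indices of (source, target) pairs that look misaligned.

    A pair is flagged when one side is empty or when the target/source
    length ratio falls outside [low, high]. The thresholds are arbitrary
    assumptions and should be tuned for the language pair.
    """
    flagged = []
    for i, (src, tgt) in enumerate(pairs):
        src, tgt = src.strip(), tgt.strip()
        if not src or not tgt:
            # An empty cell on one side usually marks a shifted row.
            flagged.append(i)
            continue
        ratio = len(tgt) / len(src)
        if ratio < low or ratio > high:
            flagged.append(i)
    return flagged
```

A cluster of flagged rows close together suggests a short stretch where the text is out of sync; isolated hits are more likely just short headings or numbered items.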

In the Translations list of the Project home, the bilingual text can be selected and added to a LiveDocs corpus using the menus or ribbons.


The screenshot below shows the worst location of badly synchronized data in the text I copied here:


This minor dislocation does not pose a significant barrier to finding the information I might need to read and understand when using this judgment as a reference. The document context is available from the context menu in the memoQ Concordance as well as the context menu of the entry appearing in the Translation results pane.

A similar data migration procedure can be implemented for most bilingual tables in HTML files, word processing files or other data sources by copying the data into Excel and using the multilingual delimited text filter.
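Where copying and pasting by hand becomes tedious, the same two-column extraction can be scripted. The following is a minimal sketch using only the Python standard library, under several assumptions: the bilingual text sits in ordinary, non-nested HTML table rows with exactly two cells each, and a tab-delimited text file is acceptable input for the multilingual delimited text filter. The names `TwoColumnTableParser`, `bilingual_rows` and `to_tsv` are hypothetical helpers invented for this example.

```python
import csv
import io
from html.parser import HTMLParser


class TwoColumnTableParser(HTMLParser):
    """Collect the text of each table row as a list of cell strings."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.rows = []      # finished rows
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            # Collapse runs of whitespace and line breaks inside the cell.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def bilingual_rows(html_text):
    """Return (source, target) tuples for every non-empty two-cell row."""
    parser = TwoColumnTableParser()
    parser.feed(html_text)
    parser.close()
    return [tuple(r) for r in parser.rows if len(r) == 2 and any(r)]


def to_tsv(rows):
    """Serialize the pairs as tab-delimited text, one pair per line."""
    buf = io.StringIO()
    csv.writer(buf, delimiter="\t", lineterminator="\n").writerows(rows)
    return buf.getvalue()
```

The output of `to_tsv` can be saved as a UTF-8 text file and imported in the same way as the Excel file, with the column assignments checked in the filter configuration. Pages that nest tables or spread one language across several cells would need a more careful parser than this one.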