Translation Tribulations: A quick trip to LiveDocs for EUR-Lex bilingual texts

Sep 15, 2015

Quite a number of friends and respected colleagues use EUR-Lex as a reference source for EU legislation. Being generally sensible people, some of them have backed away from the overfull slopbucket of bulk DGT data and built more selective corpora of the legislation which they actually need for their work.

However, the issue of how to get the data into a usable form with a minimum of effort has caused no little trouble at times. The various texts can be copied out or downloaded in the languages of interest and aligned, but depending on the quality of the alignment tool, the results are often unsatisfactory. I've been told that AlignFactory does a better job than most, but then the question of how best to deal with the HTML bitexts from AlignFactory remains.

memoQ LiveDocs is of course rather helpful for quick and sometimes dirty alignment, but if the synchronization of the texts is too many segments off, it is sometimes difficult to find the information one needs even when the (bilingual) document is opened from the context menu in a concordance window.

EUR-Lex offers bi- or tri-lingual views of most documents in a web page. The alignments are often imperfect, but the synchronization is usually off by only one or two segments, so finding the right text in a document's context is not terribly difficult. So these often imperfect alignments are usually quite adequate for use as references in a memoQ LiveDocs corpus. Here is a procedure one might follow to get the EUR-Lex data there.

The bilingual text of a view such as the one above can be selected by dragging the cursor to select the first part of the information, then scrolling to the bottom of the window and Shift+clicking to select all the text in both columns:

Copy this text, then paste it into Excel:

Then import the Excel file as a file for "translation" in a memoQ project with the right language settings. Because LiveDocs data access is quirky when target language variants are specified and do not match, I have created a "data conversion project" with generic language settings (DE + EN in my case as opposed to my usual DE-DE + EN-US project settings) to ensure that data stored in LiveDocs will be accessed without trouble from any project. (This irritating issue of language variants in LiveDocs was introduced a few versions ago by Kilgray in an attempt to placate some large agencies, but it has caused enormous headaches for professional translators who work with multiple sublanguage settings. We hope that urgent attention will be given to this problem soon, and until then, keep your LiveDocs language data settings generic to ensure trouble-free data access!)

When the Excel file is added to the Translations file list, there are two important changes to make in the import options. First, the filter must be changed from Microsoft Excel to "multilingual delimited text" (which also handles multilingual Excel files!). Second, the filter configuration must be "changed" to specify which data is in the columns of interest.

The screenshot above shows the import settings that were appropriate for the data I copied from EUR-Lex. Your settings will likely differ, but in each case the values need to be checked or set in the fields near the arrows ("Source language" particularly at the top and the three dropdown menus by the second arrow below).

Once the data are imported, some adjustments can be made by splitting or joining segments, but I don't think the effort is generally worth it, because in the cases I have seen, data are not far out of sync if they are mismatched, and the synchronization is usually corrected after a short interval.
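For anyone who does want to look for badly synchronized rows before importing, one rough heuristic is to compare the lengths of the two sides of each row. The Python sketch below is only an illustration under stated assumptions: the function name `suspect_rows` and the thresholds 0.5 and 2.0 are hypothetical, nothing here is a memoQ feature, and a length ratio is a crude signal that will both miss some mismatches and flag some good pairs.

```python
def suspect_rows(rows, low=0.5, high=2.0):
    """Return the indexes of (source, target) pairs that look misaligned.

    A pair is flagged when one side is empty or when the target/source
    length ratio falls outside the (assumed) range low..high.
    """
    flagged = []
    for index, (source, target) in enumerate(rows):
        # An empty cell on either side usually means a shifted row
        if not source.strip() or not target.strip():
            flagged.append(index)
            continue
        ratio = len(target) / len(source)
        if ratio < low or ratio > high:
            flagged.append(index)
    return flagged
```

The flagged indexes only say where to look in the spreadsheet; deciding whether a row is really out of sync still takes a human eye.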

In the Translations list of the Project home, the bilingual text can be selected and added to a LiveDocs corpus using the menus or ribbons.

The screenshot below shows the worst location of badly synchronized data in the text I copied here:

This minor dislocation does not pose a significant barrier to finding the information I might need to read and understand when using this judgment as a reference. The document context is available from the context menu in the memoQ Concordance as well as the context menu of the entry appearing in the Translation results pane.

A similar data migration procedure can be implemented for most bilingual tables in HTML files, word processing files or other data sources by copying the data into Excel and using the multilingual delimited text filter.
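For HTML sources, the copy-and-paste step could also be scripted. The following is a minimal Python sketch, not anything provided by memoQ or EUR-Lex: it assumes the bilingual text sits in an ordinary two-column HTML table with properly closed cells and no nested tables, and the names `TwoColumnTableParser` and `html_table_to_tsv` are made up for this illustration. It produces one tab-separated source/target pair per line.

```python
from html.parser import HTMLParser


class TwoColumnTableParser(HTMLParser):
    """Collect the text of each table row as a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            # Collapse whitespace so that each cell fits on a single line
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def html_table_to_tsv(html):
    """Return the two-column rows found in html as tab-separated lines."""
    parser = TwoColumnTableParser()
    parser.feed(html)
    parser.close()
    lines = []
    for row in parser.rows:
        # Keep only rows with exactly two cells and at least some text
        if len(row) == 2 and any(row):
            lines.append("\t".join(row) + "\n")
    return "".join(lines)
```

The result can be saved as a text file and opened in Excel, or tried directly with the multilingual delimited text filter; whether a given page really has this table layout needs to be checked case by case.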

8 comments:

Steven, September 16, 2015 6:58 AM
Good tip, thanks! I usually download the html versions of the texts and add these to LiveDocs.

Wasaty, September 16, 2015 9:44 AM
Simple and elegant. However, it's usually a good idea to check the file in Excel - you can easily fix minor problems with dislocations, which will make life easier after importing to LiveDocs.

And if you want to have sentence segmentation, save source and target columns in different files (but in the same columns, e.g. column A for both source and target) and then align with "Structural alignment" on.

Unknown, September 16, 2015 2:30 PM
Kevin,

It was nice meeting you in Bordeaux at the IAPTI conference.

I have AlignFactory Light and it does make TMs in XML and TMX formats. By the way, you can do the same with Logiterm, which is sold by the same company: Terminotix.

I have been using Logiterm and AlignFactory Light to create and manage my TMs for years. If the source and target documents are in the same format, the alignment (which takes a few seconds) is more or less perfect. Although there may be problems if the documents are converted PDFs or contain tables, graphics or footnotes, they can easily be dealt with before alignment, for example, by removing unnecessary carriage returns or deleting the offending elements.

I do a lot of financial translation and have used AlignFactory Light to align many EU directives and regulations dealing with the financial services industry.

The advantage of AlignFactory Light over Logiterm's alignment module is its TM editing functions.
