Sep 15, 2015

A quick trip to LiveDocs for EUR-Lex bilingual texts

Quite a number of friends and respected colleagues use EUR-Lex as a reference source for EU legislation. Being generally sensible people, some of them have backed away from the overfull slopbucket of bulk DGT data and built more selective corpora of the legislation which they actually need for their work.

However, the issue of how to get the data into a usable form with a minimum of effort has caused no little trouble at times. The various texts can be copied out or downloaded in the languages of interest and aligned, but depending on the quality of the alignment tool, the results are often unsatisfactory. I've been told that AlignFactory does a better job than most, but then the question of how best to deal with the HTML bitexts from AlignFactory remains.

memoQ LiveDocs is of course rather helpful for quick and sometimes dirty alignment, but if the synchronization of the texts is too many segments off, it is sometimes difficult to find the information one needs even when the (bilingual) document is opened from the context menu in a concordance window.

EUR-Lex offers bi- or tri-lingual views of most documents in a web page. The alignments are often imperfect, but the synchronization is usually off by only one or two segments, so finding the right text in a document's context is not terribly difficult. So these often imperfect alignments are usually quite adequate for use as references in a memoQ LiveDocs corpus. Here is a procedure one might follow to get the EUR-Lex data there.


The bilingual text of a view such as the one above can be selected by dragging the cursor to select the first part of the information, then scrolling to the bottom of the window and Shift+clicking to select all the text in both columns:


Copy this text, then paste it into Excel:


Then import the Excel file as a file for "translation" in a memoQ project with the right language settings. Because of quirks with data access in LiveDocs if the target language variants are specified and possibly not matched, I have created a "data conversion project" with generic language settings (DE + EN in my case as opposed to my usual DE-DE + EN-US project settings) to ensure that data stored in LiveDocs will be accessed without trouble from any project. (This irritating issue of language variants in LiveDocs was introduced a few versions ago by Kilgray in an attempt to placate some large agencies, but it has caused enormous headaches for professional translators who work with multiple sublanguage settings. We hope that urgent attention will be given to this problem soon, and until then, keep your LiveDocs language data settings generic to ensure trouble-free data access!)


When the Excel file is added to the Translations file list, there are two important changes to make in the import options. First, the filter must be changed from Microsoft Excel to "multilingual delimited text" (which also handles multilingual Excel files!). Second, the filter configuration must be "changed" to specify which data is in the columns of interest.


The screenshot above shows the import settings that were appropriate for the data I copied from EUR-Lex. Your settings will likely differ, but in each case the values need to be checked or set in the fields near the arrows ("Source language" particularly at the top and the three dropdown menus by the second arrow below).


Once the data are imported, some adjustments can be made by splitting or joining segments, but I don't think the effort is generally worth it, because in the cases I have seen, data are not far out of sync if they are mismatched, and the synchronization is usually corrected after a short interval.

In the Translations list of the Project home, the bilingual text can be selected and added to a LiveDocs corpus using the menus or ribbons.


The screenshot below shows the worst location of badly synchronized data in the text I copied here:


This minor dislocation does not pose a significant barrier to finding the information I might need to read and understand when using this judgment as a reference. The document context is available from the context menu in the memoQ Concordance as well as the context menu of the entry appearing in the Translation results pane.

A similar data migration procedure can be implemented for most bilingual tables in HTML files, word processing files or other data sources by copying the data into Excel and using the multilingual delimited text filter.
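
For bilingual tables saved as HTML, the copy-and-paste step could also be scripted. What follows is only a rough sketch under stated assumptions, not part of the procedure I used: it assumes a single, reasonably well-formed table with the source text in the first column and the target text in the second, and the names `TableRows` and `bitext_to_tsv` are made up for this illustration. It uses only the Python standard library and produces tab-delimited text, which can be saved as a UTF-8 file for import with the multilingual delimited text filter.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse line breaks, tabs and runs of spaces to single spaces
            # so that each cell stays on one line of the output.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None
            self._cell = None


def bitext_to_tsv(html, source_col=0, target_col=1):
    """Return the two chosen table columns as tab-delimited lines.

    Assumption: one table, source and target text in separate columns.
    Rows with too few cells are skipped.
    """
    parser = TableRows()
    parser.feed(html)
    parser.close()
    lines = []
    for row in parser.rows:
        if len(row) > max(source_col, target_col):
            lines.append(row[source_col] + "\t" + row[target_col])
    return "".join(line + "\n" for line in lines)
```

The result of `bitext_to_tsv(page_html)` would be written to a text file and imported like the Excel file above. Misaligned rows are carried over exactly as they appear in the table, so any corrections still have to be made by hand.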

8 comments:

  1. Good tip, thanks! I usually download the html versions of the texts and add these to LiveDocs.

    1. That's the approach that most seem to take, though as I noted some are unhappy with the alignment quality when they try to build bilingual references. Note Marek's suggestion below for fixing the misalignments in Excel.

      In general I think approaches like this with more focused, self-selected corpora are more useful than the bucket approach of the Big Data addicts. People seem to forget too often that a BAT tool beats a CAT tool nearly every time, and masses of surplus crap data are not useful for brain-assisted translation.

  2. Simple and elegant. However, it's usually a good idea to check the file in Excel - you can easily fix minor problems with dislocations, which will make life easier after importing to LiveDocs.

    And if you want to have sentence segmentation, save source and target columns in different files (but in the same columns, e.g. column A for both source and target) and then align with "Structural alignment" on.

    1. You're right, Marek - Excel is a better environment to fix those misalignments.

      For the sentence alignment, you might as well just save the two languages as monolingual HTML docs and align them, which is what the users who inspired this have been doing. But here memoQ apparently does an inadequate job much of the time, and AlignFactory was said to be better.

      I spent some time before I published this looking at aligning the Excel file with itself, but selecting different content ranges. It worked as long as I did not specify structural alignment (which bizarrely ended up aligning one language with itself!), but here I uncovered a real mess of bugs in Kilgray's implementation of the Excel filter for LiveDocs import. You can specify ranges by typing them, for example, but you cannot do so interactively in the Excel file like you can when importing a translation document. That is a serious flaw. In general, my recent work has pointed out a number of areas where LiveDocs filter implementations need a serious upgrade/overhaul. The absence of the multilingual delimited text filter there is puzzling, and preview formation seems to fail almost all the time, much worse than the frequent problems for translation files.

  3. Kevin,

    It was nice meeting you in Bordeaux at the IAPTI conference.

    I have AlignFactory Light and it does make TMs in XML and TMX formats. By the way, you can do the same with Logiterm, which is sold by the same company: Terminotix.

    I have been using Logiterm and AlignFactory Light to create and manage my TMs for years. If the source and target documents are in the same format, the alignment (which takes a few seconds) is more or less perfect. Although there may be problems if the documents are converted PDFs or contain tables, graphics or footnotes, they can easily be dealt with before alignment, for example, by removing unnecessary carriage returns or deleting the offending elements.

    I do a lot of financial translation and have used AlignFactory Light to align many EU directives and regulations dealing with the financial services industry.

    The advantage of AlignFactory Light over Logiterm's alignment module is its TM editing functions.

    1. Good information, Charles, thank you. I don't know why my colleague is generating HTML bitexts if TMX is available. If you are a memoQ user, you are better off loading the TMX from the alignment in LiveDocs rather than sacrificing the context by feeding it to a TM, as I pointed out in another blog post recently.

    2. See also:

      https://dl.dropboxusercontent.com/u/6802597/Screenshots/LF-aligner-auto-downloading-and-aligning-with-CELEX-numbers.png
      +
      https://dl.dropboxusercontent.com/u/6802597/Screenshots/LF-aligner-CELEX-continued.png

      It seems there might not be any need to go through Excel, etc.

    3. @Charles: Doesn't LogiTerm have the same Alignment Editor as AlignFactory though?

