Nov 30, 2012

The Translator's Serenity Prayer

God, give me grace to accept with serenity
        the fools who cannot be helped,
        Courage to charge the things
        which should be charged,
        and the Wisdom to distinguish
        clients from time wasters.

Living one page at a time,
        enjoying one sentence at a time,
        accepting Trados as a pathway to grief,
        taking, as Jerome would,
        this awful text as it is,
        not as I would have it,
        to shape as I would have it,
        trusting that I can make all things right,
        if I render not with MT,
        so the client may be happy with this translation,
        and I supremely happy with payment in net 30 days.

Nov 26, 2012

Coming to terms with MetaTexis and memoQ


I had never dealt with MetaTexis before last night. I had heard the name mentioned, but I have neither the time nor good reason to concern myself with all the myriad technical environments for translation out there except to hope fervently that some day their makers will all grow up and learn to exchange data with each other's tools properly.

But when a colleague mentioned that she was having a few issues trying to migrate her 40,000+ entry termbase from MetaTexis to memoQ, I was intrigued, so I asked her to send me the data in various export formats. I got TBX and some sort of delimited text. Although memoQ currently can't do anything with TBX, I tried reading it into a few other tools (memSource and SDL Trados MultiTerm) to see if I could use them in the migration, but even after tweaking the file a bit I could not get the other tools to digest that alleged standard format, so I gave up and decided to attack the delimited text export.

First stop, Microsoft Excel import:

The data were semicolon-delimited. So far so good. The text qualifier was a quote mark.
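For anyone who would rather script this step than click through Excel's import wizard, a rough Python sketch of the same read looks like this. The file name and encoding are placeholders for whatever the actual export happens to be.

import csv

# Placeholder file name and encoding; adjust to match the actual MetaTexis export.
with open("metatexis_export.txt", encoding="utf-8", newline="") as f:
    reader = csv.reader(f, delimiter=";", quotechar='"')   # semicolon-delimited, quote mark as text qualifier
    header = next(reader)          # column names such as Source_Text, Translation_Text ...
    records = list(reader)

print(len(header), "columns,", len(records), "records")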

When I looked at the data in Excel, it was quickly apparent that there were a number of corrupted records; I assume the export routine had a few hiccups. I also saw that the identifiers for the languages were badly scrambled in places. I don't know how much of this problem might have been inattention by the "data owner", but any hope of separating data by sublanguages went out the window. Fine. Just "French" and "English" then.

There were quite a few data columns to deal with, so I had a look at the headers and did some sorts to see which fields were actually used, how they were used and if they were really important to preserve. In the end, only six fields really mattered:
  • Source_Text
  • Source_Notes
  • Source_Definition
  • Translation_Text
  • Translation_Notes
  • Translation_Definition
I looked a bit more closely and realized that the data owner had used the Notes and Definitions fields interchangeably over the years. Where there was an entry in one, there was none in the other. So using a concatenation formula in Excel, I merged the text from those columns into a single definition field for each language. Then I renamed the headers to make them work for a memoQ terminology import (see the sketch after the list below):
  • French
  • French_Def
  • English
  • English_Def
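Scripted rather than done with a formula, that merge and renaming might look something like the sketch below. It leans on the same assumption the data justified, namely that only one of the Notes/Definition fields is ever filled in for a given language; the file name is again just a placeholder.

import csv

def cell(row, key):
    # Return a trimmed cell value, tolerating missing or empty fields.
    return (row.get(key) or "").strip()

with open("metatexis_export.txt", encoding="utf-8", newline="") as f:
    reader = csv.DictReader(f, delimiter=";", quotechar='"')
    rows = [{
        "French":      cell(r, "Source_Text"),
        # Takes the Notes text if present, otherwise the Definition text.
        "French_Def":  cell(r, "Source_Notes") or cell(r, "Source_Definition"),
        "English":     cell(r, "Translation_Text"),
        "English_Def": cell(r, "Translation_Notes") or cell(r, "Translation_Definition"),
    } for r in reader]

print(len(rows), "term records with memoQ-friendly field names")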
Then I saved as Unicode text and imported into memoQ, right? Maybe in an ideal world or with an ideal data set, but not in this case.

There were funky things to deal with. Synonyms. Non-standard entities. And lots of crap: corrupted record structures that even messed up selection.

I got out my electronic scalpel and performed about an hour's worth of "data surgery", sorting and deleting to clean up most of the messy, useless stuff. Then I replaced the entity code &#34., which I had figured out stood for a quote mark, with an actual quote mark. And the delimiter for synonyms, &#59., was replaced by semicolons.
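Done in a script instead of with Excel's find & replace, that entity clean-up is just a pair of substitutions on every cell. The sketch below assumes the rows from the earlier sketch; the trailing period on the entity codes is how they actually appeared in the export, presumably because the semicolon was already spoken for as the field delimiter.

def fix_entities(row):
    # Replace MetaTexis' period-terminated entity codes with the real characters.
    for key, value in row.items():
        if value:
            row[key] = value.replace("&#34.", '"').replace("&#59.", ";")
    return row

# rows = [fix_entities(r) for r in rows]   # applied to the rows read in the sketch above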

In the course of testing, I discovered that some term records had been duplicated in the definition fields. No idea how that happened, but some of these had synonyms, and this really messed up later steps, so I copied those columns off to the side, replaced the semicolons with a placeholder character, and then copied the modified definition columns back. This step was only necessary because of problems in about 0.3% of the data.
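In script form, that protective detour is just a targeted substitution restricted to the definition fields. The placeholder character below is my arbitrary choice and assumed not to occur anywhere in the data.

PLACEHOLDER = "\u00a6"   # the broken bar character, assumed not to occur anywhere in the data

def mask_definition_semicolons(row):
    # Protect the definitions so that a later split on ";" only touches the synonym lists.
    for col in ("French_Def", "English_Def"):
        if row.get(col):
            row[col] = row[col].replace(";", PLACEHOLDER)
    return row

# rows = [mask_definition_semicolons(r) for r in rows]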

Then it was time to decompose the synonym lists in the term columns. I cut and pasted the target term column to the last data column and saved a copy of the sheet as Unicode text. Then I opened it again from within Excel and specified two delimiters, tab and semicolon:

Then I sorted the data by the separate target term columns to figure out the maximum number of synonyms in the data. There were 21 English synonyms in one case! I gave each of the target term columns the header "English".

Then I cut and pasted the source term (French) column so it was after the last target term column. To keep the data from getting screwed up, I had put a placeholder in as a substitute for the semicolons separating the synonyms before I saved the data as Unicode text the last time. So now I changed those placeholders back to semicolons, saved as Unicode text again and re-opened the file, specifying two delimiters as above. Then I repeated the sort procedure to figure out how many French synonyms there might be. One entry had 15 synonyms. I named all the source term columns "French" and saved everything as Unicode text.
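The whole split-and-pad dance reduces to a few lines of code if you take that route instead: find the longest synonym list, pad every row to that width, and write a header that simply repeats the language name once per column (21 times for English and 15 for French in this data set). That repetition is what lets memoQ pull the synonyms back into a single entry on import. A sketch, again assuming the rows from the earlier sketches:

import csv

def max_synonyms(rows, col):
    # Length of the longest semicolon-separated synonym list in a column.
    return max(len((r.get(col) or "").split(";")) for r in rows)

def split_synonyms(value, width):
    # Split one synonym list and pad it so every row has the same number of columns.
    parts = [p.strip() for p in (value or "").split(";")]
    return parts + [""] * (width - len(parts))

def write_for_memoq(rows, path="memoq_terms.txt"):
    # Tab-delimited output with repeated "French" and "English" headers,
    # the same layout the Excel gymnastics produced.
    fr_width = max_synonyms(rows, "French")
    en_width = max_synonyms(rows, "English")
    header = ["French"] * fr_width + ["French_Def"] + ["English"] * en_width + ["English_Def"]
    with open(path, "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f, delimiter="\t")
        writer.writerow(header)
        for r in rows:
            writer.writerow(split_synonyms(r.get("French"), fr_width) + [r.get("French_Def") or ""]
                            + split_synonyms(r.get("English"), en_width) + [r.get("English_Def") or ""])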

With the four column names all set to memoQ defaults for the languages involved, an import into a new termbase worked flawlessly with the defaults. Over 43,000 term entries came in cleanly with their synonym groupings in the entries preserved. The definitions (which were really just explanatory notes of various kinds) were associated with all the terms of their respective languages.

I expect that most cases of data migration from MetaTexis will not require as many tricks as I had to use to clean up the dirty data in this instance. But even such a "worst case" scenario worked out rather well, enabling the translator to test and use her old data in the new working environment.

Score another one for interoperability. Sort of.

Nov 24, 2012

Combining sublanguages in memoQ terminology

The termbase structure of memoQ is limited with respect to its information fields; users looking for or wanting to add fields for where a term was found or for its sublanguage (as a term property) will be disappointed. The term source has no field of its own and memoQ currently does not allow new fields to be added; the sublanguage is handled at the term level (which makes sense, I suppose, if one does adaptation projects from one sublanguage to another). But a memoQ termbase is quite flexible when it comes to adding languages and sublanguages; it can include any number of these.

But this flexibility can cause some problems in term management for busy translators in some pairs. It's nice that UK English terminology will display in my project with US English set as the target language. But when I want to export a simple list of all the English and all the German words in a termbase, for example, and there are synonyms in the entries as well, the resulting delimited output is very confusing for many people. And if you decide that you want to move all the terms to "generic" English, things can get really messy.

I faced exactly that problem just last week. I had been maintaining the termbase for an end client for about a year, starting with a rather large in-house terminology I was given in an Excel file. I imported it to memoQ as generic German and started working with the target language set to generic English. After a while the customer pointed out that UK English was their "standard". I swallowed hard and set my template project to UK English, pointing out that they would still be getting English with a rather American character from me no matter what I did with the spellchecker. Then later, when I found out about the bug in SDL Trados that makes XLIFF files difficult to import if the sublanguages are not specified, I set the source language of that customer's project to German (Germany).

After that they opened a subsidiary in the US and suddenly my native language variant was the new "standard". Ha, ha. I now had a termbase with five languages: two flavors of German and three flavors of English. Consolidation time!

But how does one do this? memoQ itself is unhelpful in this regard; it barely offers any management features for terminology, much less tricks like merging sublanguages.

It's not that hard, really, and the steps depend on what data you actually keep in the termbase. If you never enter definitions for your terms, things are pretty easy.

To start consolidating your terms by combining the sublanguages, first export a fully specified delimited text file. That's a "CSV" file with all the defaults.

Let's have a look at how that file is structured:

If you open the exported data in a spreadsheet, the first group of columns, 12 (A through L) in my case, contains the concept-level fields. memoQ refers to a concept in the termbase structure as an "entry". Each entry can have any number of languages, with each sublanguage counting as an "independent" language. Related sublanguages are grouped together.

A given language or sublanguage can have any number of terms, which are treated as synonyms. Why are they synonyms? Because in the term structure they share the same definition. If you have three language variants for English like I did, you can have three definitions. The definition fields are the first problem source.

Each term in a language or sublanguage has three fields: the actual term (a word or phrase), an example (which is what you wrote in the "usage" field), and "term info", which is a bunch of other metadata for capitalization rules, matching, gender, forbidden status, part of speech and so on.


If there are no definitions worth keeping in the extra variants of the language you want to combine, just delete all the extra definition fields. Leave the first one. In my case last week, I left English_Def and deleted English_United_Kingdom_Def and English_United_States_Def. Then I renamed all the English_United_Kingdom and English_United_States columns to English. I made similar changes for the German variants. Then I saved the file and imported it to a new termbase in memoQ. Problem solved. All three English types were combined as "English" and the German variants as "German". Done.
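For those who would rather not do the renaming by hand, the same no-definitions case can be sketched in a few lines of Python. The UK and US column names are the ones from my export; the German (Germany) prefix and the delimiter and encoding are my assumptions, so check them against what the export dialog actually produced.

import csv

# Extra definition columns to discard and sublanguage prefixes to fold into the main language.
# The German_Germany prefix is an assumption about the export's naming, not something I checked.
DROP_DEFS = {"English_United_Kingdom_Def", "English_United_States_Def", "German_Germany_Def"}
PREFIXES = {"English_United_Kingdom": "English",
            "English_United_States": "English",
            "German_Germany": "German"}

def fold(name):
    # Rename a sublanguage column (and its _Example, _Term_Info companions) to the main language.
    for old, new in PREFIXES.items():
        if name == old or name.startswith(old + "_"):
            return new + name[len(old):]
    return name

with open("termbase_export.csv", encoding="utf-8", newline="") as f:
    table = list(csv.reader(f, delimiter=";", quotechar='"'))   # adjust delimiter to match the export

header = table[0]
keep = [i for i, name in enumerate(header) if name not in DROP_DEFS]

with open("termbase_combined.csv", "w", encoding="utf-8", newline="") as out:
    writer = csv.writer(out, delimiter=";", quotechar='"')
    writer.writerow([fold(header[i]) for i in keep])
    for row in table[1:]:
        writer.writerow([row[i] if i < len(row) else "" for i in keep])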

If I have definitions that I want to keep, things get a little more complicated if I want to avoid losing data. My quick and dirty solution was to create a temporary column for each major language in which I combined all the definitions for that language's variants. I did this with Excel's concatenation function, something like this:

=CONCATENATE(N2, IF(U2 = "","", " / "),U2,IF(Y2 = "",""," / "),Y2)

In my case N was the definition column for English, U the definitions for UK English and Y for US English.
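Outside Excel, the same merge can be written so that empty cells are simply skipped and no stray separators creep in. Here is a minimal sketch, with the three arguments standing for the values of columns N, U and Y and the example strings obviously made up.

def merge_definitions(*cells, separator=" / "):
    # Join only the non-empty definition cells so empty variants leave no stray separators.
    return separator.join(c.strip() for c in cells if c and c.strip())

# Example with one row's generic, UK and US English definitions:
print(merge_definitions("", "UK note", "US note"))   # -> "UK note / US note"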

I renamed my merge columns as English_Def and German_Def, deleted the names of the original definition columns, and then saved the data as Unicode text (UTF-8, the memoQ default, to avoid potential problems with character mapping when the data are re-imported to memoQ). After the import, a quick look at my test data confirmed that no data were lost and all the language variants were combined as one major language:


Obviously a little editing is still desirable: entry #3 shows some duplication because two English variants contained the same term, and my sloppy conditional statement also left a few leading slashes in cases where the first definition column was empty and a later one for another variant of the same language was not. But that's not a big deal; one should always have a careful look at the data after doing something like this.