Nov 24, 2012

Combining sublanguages in memoQ terminology

The termbase structure of memoQ is limited with respect to its information fields; users looking for or wanting to add fields for where a term was found or for its sublanguage (as a term property) will be disappointed. The term source has no field of its own and memoQ currently does not allow new fields to be added; the sublanguage is handled at the term level (which makes sense, I suppose, if one does adaptation projects from one sublanguage to another). But a memoQ termbase is quite flexible when it comes to adding languages and sublanguages; it can include any number of these.

But this flexibility can cause some problems in term management for busy translators in some pairs. It's nice that UK English terminology will display in my project with US English set as the target language. But when I want to export a simple list of all English and all German words in a termbase, for example, and there are synonyms in the entries as well, the resulting delimited output is very confusing for many people. And if you decide that you want to move all the terms to "generic" English, things can get really messy.

I faced exactly that problem just last week. I had been maintaining the termbase for an end client for about a year, starting with a rather large in-house terminology I was given in an Excel file. I imported it to memoQ as generic German and started working with the target language set to generic English. After a while the customer pointed out that UK English was their "standard". I swallowed hard and set my template project to UK English, pointing out that they would still be getting English with a rather American character from me no matter what I did with the spellchecker. Then later when I found out about the bug in SDL Trados where XLIFF files are difficult to import if the sublanguages are not specified, I set the source language of that customer's project to German (Germany).

After that they opened a subsidiary in the US and suddenly my native language variant was the new "standard". Ha, ha. I now had a termbase with five languages: two flavors of German and three flavors of English. Consolidation time!

But how does one do this? memoQ itself is unhelpful in this regard; it barely offers any management features for terminology, much less tricks like merging sublanguages.

It's not that hard, really, and the steps depend on what data you actually keep in the termbase. If you never enter definitions for your terms things are pretty easy.

To start consolidating your terms by combining the sublanguages, first export a fully specified delimited text file. That's a "CSV" file with all the defaults.

Let's have a look at how that file is structured:

If you open the exported data in a spreadsheet, the first group of columns, 12 (A-L) in my case, contains the concept-level fields. memoQ refers to a concept in the termbase structure as an "entry". Each entry can have any number of languages, with each sublanguage counting as an "independent" language. Related sublanguages are grouped.

A given language or sublanguage can have any number of terms, which are treated as synonyms. Why are they synonyms? Because in the term structure they share the same definition. If you have three language variants for English like I did, you can have three definitions. The definition fields are the first problem source.

Each term in a language or sublanguage has three fields: the actual term (word, phrase), an example (which is what you wrote in the "usage" field), and "term info", which is a bunch of other metadata for capitalization rules, matching, gender, forbidden status, part of speech, etc.
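One way to picture the structure described above is as nested records. The sketch below is only an illustration: the class and field names are invented here and do not correspond to any memoQ API or to the actual export column names.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    """One term; all terms in the same language section are synonyms."""
    text: str            # the actual word or phrase
    example: str = ""    # what was typed in the "usage" field
    term_info: str = ""  # other metadata (matching, gender, forbidden, ...)


@dataclass
class LanguageSection:
    """One language or sublanguage inside an entry."""
    definition: str = ""  # shared by every term in this section
    terms: list[Term] = field(default_factory=list)


@dataclass
class Entry:
    """One concept ("entry" in memoQ's wording)."""
    entry_fields: dict[str, str] = field(default_factory=dict)
    # Keyed by language name; each sublanguage counts as its own language.
    languages: dict[str, LanguageSection] = field(default_factory=dict)


# Hypothetical entry with generic English plus a UK English variant.
entry = Entry(languages={
    "English": LanguageSection("a visual property", [Term("color")]),
    "English_United_Kingdom": LanguageSection("", [Term("colour")]),
})
```

Seen this way, three English variants mean three separate definition slots in one entry, which is why the definition columns need attention when merging.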

If there are no definitions worth keeping in the extra variants of the language you want to combine, just delete all the extra definition fields. Leave the first one. In my case last week, I left English_Def and deleted English_United_Kingdom_Def and English_United_States_Def. Then I renamed all the English_United_Kingdom and English_United_States columns to English. I made similar changes for the German variants. Then I saved the file and imported it to a new termbase in memoQ. Problem solved. All three English types were combined as "English" and the German variants as "German". Done.
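For anyone who would rather script this than rename columns by hand, here is a minimal sketch of the same header edit. It assumes the export has already been read into a list of rows (header first) and that variant columns are named with the variant as a prefix, as in English_United_Kingdom_Def; the mapping, the sample header and the function names are hypothetical, so check them against your own export first. It throws away the variant definitions, so it only fits the case where those are not worth keeping.

```python
def consolidate_header(header, variant_to_major):
    """Rename variant columns to the major language and drop the
    variants' definition columns.

    Returns the new header and the indexes of the columns that are kept.
    """
    new_header, keep = [], []
    for index, name in enumerate(header):
        new_name = name
        for variant, major in variant_to_major.items():
            if name == variant or name.startswith(variant + "_"):
                if name == variant + "_Def":
                    new_name = None  # extra definition column: drop it
                else:
                    new_name = major + name[len(variant):]
                break
        if new_name is not None:
            new_header.append(new_name)
            keep.append(index)
    return new_header, keep


def consolidate_rows(rows, variant_to_major):
    """Apply the header change to every row of the export."""
    rows = iter(rows)
    new_header, keep = consolidate_header(next(rows), variant_to_major)
    result = [new_header]
    for row in rows:
        # Pad short rows so a missing trailing cell does not raise an error.
        result.append([row[i] if i < len(row) else "" for i in keep])
    return result


# Hypothetical mapping and data, for illustration only.
VARIANTS = {
    "English_United_Kingdom": "English",
    "English_United_States": "English",
    "German_Germany": "German",
}
sample = [
    ["Entry_ID", "English", "English_Def", "English_United_Kingdom",
     "English_United_Kingdom_Def", "German", "German_Def", "German_Germany",
     "German_Germany_Def"],
    ["1", "color", "a hue", "colour", "UK def", "Farbe", "", "Farbe", ""],
]
merged = consolidate_rows(sample, VARIANTS)
```

The rows can be read and written with Python's csv module, using whatever delimiter and encoding the export was saved with. The repeated "English" and "German" column names in the result are intentional, since that is what the manual renaming produces too.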

If I have definitions that I want to keep, things get a little more complicated if I want to avoid losing data. My quick and dirty solution was to create a temporary column for each major language in which I combine all the definitions for the language's variants. I do this by using Excel's concatenation function, something like this:

=CONCATENATE(N2, IF(U2 = "","", " / "),U2,IF(Y2 = "",""," / "),Y2)

In my case N was the definition column for English, U the definitions for UK English and Y for US English.

I renamed my merge columns as English_Def and German_Def, deleted the names of the original definition columns, and then saved the data as Unicode text (UTF-8, the memoQ default, to avoid potential problems with mapping characters on re-import of the data to memoQ). After import, a quick look at my test data confirmed that no data was lost and all the language variants were combined as one major language.

Obviously a little editing is still desirable - entry #3 shows some duplication because two English variants contained the same term; my sloppy conditional statement also left a few leading slashes for cases where the first definition column was empty and a later one for another variant of the same language was not. But that's not a big deal; one should always have a careful look at data after doing something like this.
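The leading slashes can be avoided by joining only the definitions that are actually filled in. The following is a minimal sketch outside Excel, not the method used above; the function name is made up, and dropping exact repeats is an extra step that the formula does not perform.

```python
def merge_definitions(*definitions, separator=" / "):
    """Join the non-empty definitions of a language's variants.

    Empty cells are skipped, so no leading or doubled separators appear.
    Exact repeats are kept only once (an addition to the Excel formula).
    """
    merged = []
    for definition in definitions:
        text = definition.strip()
        if text and text not in merged:
            merged.append(text)
    return separator.join(merged)


# Generic English empty, UK and US filled in: no leading " / ".
example = merge_definitions("", "UK definition", "US definition")
```

Called once per row with the cells from the three definition columns (N, U and Y in my sheet), this gives the value for the merged English_Def column.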
