Nov 26, 2012

Coming to terms with MetaTexis and memoQ


I had never dealt with MetaTexis before last night. I had heard the name mentioned, but I have neither the time nor good reason to concern myself with all the myriad technical environments for translation out there except to hope fervently that some day their makers will all grow up and learn to exchange data with each other's tools properly.

But when a colleague mentioned that she was having a few issues trying to migrate her 40,000+ entry termbase from MetaTexis to memoQ, I was intrigued, so I asked her to send me the data in various export formats. I got TBX and some sort of delimited text. Although memoQ currently can't do anything with TBX, I tried reading it into a few other tools (memSource and SDL Trados MultiTerm) to see if I could use them in the migration, but even after tweaking the file a bit I could not get the other tools to digest that alleged standard format, so I gave up and decided to attack the delimited text export.

First stop, Microsoft Excel import:

The data were semicolon-delimited. So far so good. The text qualifier was a quote mark.
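For anyone who would rather script this step than click through the Excel import wizard, here is a minimal sketch using Python's standard csv module. It assumes the export has a header row and uses the double quote as text qualifier; the function name and the sample data are made up for illustration.

```python
import csv
import io

def read_export(text):
    """Parse semicolon-delimited text whose fields may be wrapped in double quotes."""
    # newline="" leaves line-ending handling to the csv module.
    reader = csv.DictReader(io.StringIO(text, newline=""), delimiter=";", quotechar='"')
    return list(reader)

# Hypothetical two-column sample; the real export had many more columns.
sample = 'Source_Text;Translation_Text\n"chat; matou";cat\n'
rows = read_export(sample)
```

The quoted field keeps its embedded semicolon instead of being split into two columns, which is the whole point of the text qualifier.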

When I looked at the data in Excel, it was quickly apparent that there were a number of corrupted records; I assume the export routine had a few hiccups. I also saw that the identifiers for the languages were badly scrambled in places. I don't know how much of this problem might have been inattention by the "data owner", but any hope of separating data by sublanguages went out the window. Fine. Just "French" and "English" then.

There were quite a few data columns to deal with, so I had a look at the headers and did some sorts to see which fields were actually used, how they were used and if they were really important to preserve. In the end, only six fields really mattered:
  • Source_Text
  • Source_Notes
  • Source_Definition
  • Translation_Text
  • Translation_Notes
  • Translation_Definition
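The "which fields are actually used" check can also be done without sorting, by counting the non-empty cells in each column. This is only a sketch: the rows are assumed to be dictionaries keyed by column header, and the Status column in the example is hypothetical.

```python
from collections import Counter

def count_filled(rows):
    """Count, per column, how many rows hold a non-blank value."""
    counts = Counter()
    for row in rows:
        for field, value in row.items():
            if value and value.strip():
                counts[field] += 1
    return counts

# Hypothetical rows; "Status" stands in for a column nobody ever filled.
example_rows = [
    {"Source_Text": "chat", "Source_Notes": "", "Status": "  "},
    {"Source_Text": "chien", "Source_Notes": "animal", "Status": None},
]
counts = count_filled(example_rows)
```

Columns whose count is zero or close to it are candidates for dropping.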
I looked a bit more closely and realized that the data owner had used the Notes and Definitions fields interchangeably over the years. Where there was an entry in one, there was none in the other. So using a concatenation formula in Excel, I merged the text from those columns into a single definition field for each language. Then I renamed the headers to make them work for a memoQ terminology import:
  • French
  • French_Def
  • English
  • English_Def
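The merge and rename can be sketched in Python as well. This is not the Excel formula I used, just an equivalent under the same assumption: French is the source language and, for any given record, only one of the Notes and Definition fields normally has content. The function names are invented for the example.

```python
def combine(notes, definition):
    """Join whichever of the two fields has content into one string."""
    parts = [p.strip() for p in (notes, definition) if p and p.strip()]
    return " ".join(parts)

def merge_row(row):
    """Map the six MetaTexis fields onto the four headers used for the memoQ import."""
    return {
        "French": row["Source_Text"],
        "French_Def": combine(row["Source_Notes"], row["Source_Definition"]),
        "English": row["Translation_Text"],
        "English_Def": combine(row["Translation_Notes"], row["Translation_Definition"]),
    }

# Hypothetical record with a note on one side and a definition on the other.
merged = merge_row({
    "Source_Text": "chat", "Source_Notes": "animal domestique", "Source_Definition": "",
    "Translation_Text": "cat", "Translation_Notes": "", "Translation_Definition": "pet",
})
```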
Then I saved as Unicode text and imported into memoQ, right? Maybe in an ideal world or with an ideal data set, but not in this case.

There were funky things to deal with. Synonyms. Non-standard entities. And lots of crap, corrupted record structures that even messed up selection.

I got out my electronic scalpel and performed about an hour's worth of "data surgery", sorting and deleting to clean up most of the messy, useless stuff. Then I replaced the entity code &#34., which I had figured out stood for a single quote mark, with that character. The delimiter for synonyms, &#59., was replaced by a semicolon.
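The two replacements are simple enough to script. The sketch below takes the mapping straight from what I found in this particular export, including the trailing period on each code; other exports may well use different codes, so treat the table as an assumption to verify against your own data.

```python
# Mapping observed in this one export; not a general MetaTexis rule.
REPLACEMENTS = {
    "&#34.": "'",  # read here as a single quote mark
    "&#59.": ";",  # synonym delimiter
}

def clean_entities(text):
    """Replace the non-standard entity codes with plain characters."""
    for code, char in REPLACEMENTS.items():
        text = text.replace(code, char)
    return text

cleaned = clean_entities("l&#34.eau&#59.flotte")
```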

In the course of testing, I discovered that some term records had been duplicated in the definition fields. No idea how that happened, but some of these had synonyms, and this really messed up later steps, so I copied those columns to another column and replaced the semicolons with a placeholder character, then copied the modified definitions column back. This step was only necessary because of problems in about 0.3% of the data.

Then it was time to decompose the definitions. I cut and pasted the target term column to the last data column and saved a copy of the sheet as Unicode text. Then I opened it again from within Excel and specified two delimiters, tab and semicolon, so that each synonym landed in its own column.

Then I sorted the data by the separate target term columns to figure out the maximum number of synonyms in the data. There were 21 English synonyms in one case! I named each of the target term column headers "English".
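Finding the largest number of synonyms does not really need a sort. A rough Python equivalent, assuming each term cell holds its synonyms separated by semicolons (the function names and sample rows are illustrative only):

```python
def split_synonyms(cell, sep=";"):
    """Split a term cell into individual synonyms, dropping blanks."""
    return [term.strip() for term in cell.split(sep) if term.strip()]

def max_synonyms(rows, column):
    """Return the largest number of synonyms found in one cell of the given column."""
    return max((len(split_synonyms(row[column])) for row in rows), default=0)

# Hypothetical rows for illustration.
term_rows = [
    {"English": "cat; tomcat; puss"},
    {"English": "dog"},
]
widest = max_synonyms(term_rows, "English")
```

The result tells you how many identically named term columns the import file needs for that language.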

Then I cut and pasted the source term (French) column so it was after the last target term column. To keep the data from getting screwed up, I had put in a placeholder as a substitute for the semicolons separating the French synonyms before I last saved the data as Unicode text. So now I changed those placeholders to semicolons again, saved as Unicode text again and re-opened the file specifying two delimiters as above. Then I repeated the sort procedure to figure out how many French synonyms there might be. One entry had 15 synonyms. I named all the source term columns "French" and saved everything as Unicode text.

With the four column names all set to memoQ defaults for the languages involved, an import into a new termbase worked flawlessly with the defaults. Over 43,000 term entries came in cleanly with their synonym groupings in the entries preserved. The definitions (which were really just explanatory notes of various kinds) were associated with all the terms of their respective languages.
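To show the shape of the file that went into memoQ, here is a sketch that builds the table with one repeated "English" or "French" header per synonym column and writes it as tab-delimited text. The column order is my own choice for the example, and the assumption that Excel's "Unicode text" corresponds to UTF-16 with a byte order mark should be checked against your Excel version.

```python
import csv

def split_terms(cell):
    """Split a semicolon-separated term cell into a list of terms."""
    return [term.strip() for term in cell.split(";") if term.strip()]

def pad(terms, width):
    """Pad a term list with empty cells so every row has the same width."""
    return terms + [""] * (width - len(terms))

def build_table(entries, max_fr, max_en):
    """Return a header row plus one row per entry, one synonym per column."""
    header = ["French_Def", "English_Def"] + ["English"] * max_en + ["French"] * max_fr
    table = [header]
    for entry in entries:
        english = pad(split_terms(entry["English"]), max_en)
        french = pad(split_terms(entry["French"]), max_fr)
        table.append([entry["French_Def"], entry["English_Def"]] + english + french)
    return table

def write_unicode_text(table, path):
    """Write the table as tab-delimited UTF-16 text (assumed Excel-compatible)."""
    with open(path, "w", encoding="utf-16", newline="") as handle:
        csv.writer(handle, delimiter="\t").writerows(table)

# Hypothetical entry for illustration.
table = build_table(
    [{"French": "chat; matou", "French_Def": "animal", "English": "cat", "English_Def": ""}],
    max_fr=2,
    max_en=2,
)
```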

I expect that most cases of data migration from MetaTexis will not require as many tricks as I had to use to clean up the dirty data in this instance. But even such a "worst case" scenario worked out rather well, enabling the translator to test and use her old data in the new working environment.

Score another one for interoperability. Sort of.

5 comments:

  1. Kevin, could you please e-mail me the tbx file (unless the contents are confidential) at vaclav.balacek@gmail.com? I'd like to try and see why the import into MemSource did not work.

  2. I'll have to ask the data owner.

  3. Why bother with splitting synonyms into separate columns? You could define semicolon as the symbol for alternatives during import.

    1. Why, Marek? Because I never noticed that option until you pointed it out! And on 2 hours' sleep I wasn't likely to do so. Your suggestion makes the whole process much more straightforward! Please finish your book on memoQ soon - I am really looking forward to all I will learn from it!

  4. I do what I can. At least the term bases chapter is already finished :)

