Jun 7, 2013

Understanding fuzzy term matching in memoQ 2013

Perhaps the most interesting and potentially useful feature for me in the recently released memoQ 2013 is fuzzy term matching. I have wanted something like this for several years, and several efforts at harmonizing terminology in a large, collaborative project last year made it clear that this might be very helpful in identifying deviations from agreed terminology in cases where that terminology appears as part of a compound word (as it sometimes tends to do in German, my source language).

So when I finally downloaded the latest version of memoQ last weekend and began testing, fuzzy terminology was the second thing I looked at (after the current comment mess). My initial tests left me very, very confused. Each example I created gave different results, and it was not easy to discern a pattern from examining just a few terms. The explanation of why this feature works as it does can be difficult to follow, at least as far as I am able to explain it, so many readers may be better off reading my conclusions in the next paragraph and skipping everything below it (except maybe the last graphic).

Fuzzy term matching in memoQ 2013 is a real improvement for terminology matching and quality assurance involving terms, at least for my language pair. This is not an easy challenge that the developers have taken on, but some useful results have been achieved and no harm has been done to previous functionality. And I expect that this feature will be the subject of further refinement and improvement for other languages as users make the case for these.

My first quick test of the feature involved a verb, the German word for "to wait" ("warten"). I put it into a test termbase and then imported a translation text consisting of various sentences that used forms of the verb. I noticed that there was a term hit for "warte", but nothing for "gewartet". After adding "warte" to the termbase for fuzzy matching, there was still no match for "gewartet", although it contained that character sequence.

Then I tried another example with "Gesetz" (law). There I seemed to hit the jackpot. There were hits with Unweltgesetz (a typo, but typical of many source texts I see), Umweltgesetze, Gesetzentwurf and Umweltgesetzentwurf with blue background highlighting of the character sequence matching the termbase entry.

A third term produced more confusion: with "Ausführung" in the termbase, there was no match for "Farbausführungen", but there was a match for "Farbausführungsbeispiel". Clearly this is not a simple matching function.

A question to Kilgray Support brought an answer that explained the match behavior. The current implementation of fuzzy term matching in memoQ uses a combination of rules which depend on the index, the "edit distance" (calculated differences between the entry string and the characters in the term to match) and, depending on the language, character maps and a threshold length for possible compound words.

German, it seems, is a privileged language, the only one for which compound word recognition rules are currently active. Apparently five characters are the minimum to be recognized as a word, so the "Farb-" in "Farbausführungen" wasn't enough, but "-beispiel" triggered the compound recognition rule that caused "-ausführung-" to be matched in the middle of "Farbausführungsbeispiel". I imagine that compound matching would be useful for Dutch and some other languages, and the developer suggested that expanding coverage to other languages as needed would not be difficult.

Character mapping - defined equivalence between letters - is implemented for German, Hungarian, Italian and Spanish to allow matching in cases where letters may change with plural formation, for example. Thus the German word "Bratapfel" in the termbase would yield a hit for the plural form "Bratäpfel".
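One way to picture character mapping is as a folding step applied to both the termbase entry and the word in the text before they are compared. The short Python sketch below is only an illustration of that idea: the mapping table (ä/ö/ü folded to a/o/u), the case folding and the function names are my assumptions, not the character maps memoQ actually uses.

```python
# Hypothetical character map: treat umlauted vowels as equivalent to their
# base vowels. The real maps used by memoQ are not documented in this post.
UMLAUT_MAP = str.maketrans({"ä": "a", "ö": "o", "ü": "u"})


def fold(word: str) -> str:
    """Lower-case the word and apply the (assumed) character map."""
    return word.lower().translate(UMLAUT_MAP)


def same_after_mapping(entry: str, word: str) -> bool:
    """True if entry and word are identical once mapped letters are folded."""
    return fold(entry) == fold(word)


print(same_after_mapping("Bratapfel", "Bratäpfel"))
print(same_after_mapping("Bratapfel", "Bratbirne"))
```

With a table like this, "Bratapfel" and "Bratäpfel" fold to the same string, which is the effect described above for plural forms.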

Edit distance is calculated by dividing the number of deviating characters by the number of total characters in the fuzzy term entry. A match is currently assumed if the edit distance is 0.2 or less. The term "warte" differs from the six-letter entry "warten" by one character; 1/6 is less than 0.2, so memoQ 2013 reports a term match. In the example of "Farbausführungen" above, the calculated edit distance is 6/10 (because six letters - four on the left, two on the right - are added to the term entry), and because this is larger than 0.2, no match is indicated. If, on the other hand, "Farbtonausführungen" occurs, a match will be found, because the added segment "Farbton-" meets the requirement of five or more letters for a compound word.
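To check my understanding of the two rules, here is a small Python model of them. It is a sketch under several assumptions of mine, not Kilgray's implementation: I count "deviating characters" with a plain Levenshtein distance, ignore case, and treat the compound rule as "the entry occurs inside the word and at least five characters are left over on one side". The function names and the returned labels are made up for the illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete from a
                           cur[j - 1] + 1,             # insert into a
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]


def fuzzy_term_match(entry, word, threshold=0.2, min_part=5):
    """Return which (assumed) rule matches, or None if neither does."""
    e, w = entry.lower(), word.lower()
    # Rule 1: deviating characters divided by the entry length.
    if levenshtein(e, w) / len(e) <= threshold:
        return "edit distance"
    # Rule 2 (German only, per the post): entry embedded in a compound
    # whose remaining part on one side is long enough to count as a word.
    pos = w.find(e)
    if pos != -1:
        left, right = w[:pos], w[pos + len(e):]
        if len(left) >= min_part or len(right) >= min_part:
            return "compound"
    return None


for entry, word in [("warten", "warte"), ("warte", "gewartet"),
                    ("Ausführung", "Farbausführungen"),
                    ("Ausführung", "Farbtonausführungen"),
                    ("Gesetz", "Umweltgesetze")]:
    print(entry, word, fuzzy_term_match(entry, word))
```

Under these assumptions the model agrees with the hits and misses I saw in my tests: "warte" matches "warten" by edit distance, "gewartet" and "Farbausführungen" match nothing, and the longer compounds are caught by the compound rule.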

How relevant can this feature be to your language? What changes might be required to the matching behavior to obtain useful results in your source language(s)? Your feedback to Kilgray's support team and feedback from others working with your languages are the best way to help improve the usefulness of fuzzy term matching for your language. So speak up.

If you find this feature useful and want fuzzy term matching as the default for new entries in a termbase, this can be set in the properties for the termbase under New term defaults... for termbases created in memoQ 2013. Older termbases will also display this option, but it won't actually work in practice. To use this new feature with old term collections, these will need to be migrated to a memoQ 2013 termbase.




7 comments:

  1. Thanks for sharing your first experiences, Kevin. I haven't tried 2013 and was wondering if you might have any idea about/if the term fuzzy matching affects the automated terminology checks. In my mind, including fuzzies in the check could create tons of false positives, but hey, those guys are smart, maybe they figured it out?

  2. >> wondering if you might have any idea about/if the term
    >> fuzzy matching affects the automated terminology checks

    Not quite sure I follow you on that one, José. I assume you are referring to QA checks for terminology; these will show results for whatever would be expected to match. I can't think of a case off-hand where I would be worried about a large number of false positives. However... your question got me to thinking how I want to use fuzzy matching in QA, so I went back to my test project and did some more checking. And it looked to me like fuzzy term matching had broken the term QA function in memoQ! A few more tests showed that this wasn't quite the case, but it looks like some adjustment is needed. QA violations that involve a term matched by the edit distance rule are indicated, but not those which fall under the compound word rule (which is currently only activated for German). So "Egesetz" would trigger a QA warning if the translation for "Gesetz" were missing, but not "Gesetzgebung", although that term would have been caught using the old "50% prefix" rule. In any case, this is bad for my source language (German). My old default setting (50% prefix matching) found a lot of term inconsistencies for me, and I was counting on the new fuzzy term matching to do the same for terms occurring somewhere other than the beginning of a compound word.

    1. Just received information from the Kilgray developer responsible for this function that the compound word matching has been deliberately disabled for QA purposes, because with their test database they felt an unacceptable number of false positives were reported. I have some doubts that this would be the case with some of the data I've dealt with, but perhaps I'm mistaken. Many of the continuing problems I saw in the terminology of a series of projects last year were because of the inability to deal with compound words that did not fit the 50% prefix rule. I think I would prefer to enable/disable QA fuzzy matching of compound words as a user option.

  3. Hi Kevin,

    I am wondering why the KB article (http://kb.kilgray.com/article/AA-00468/0/How-do-I-get-the-fuzzy-matching-for-existing-term-bases.html#addComment) hasn't yet been changed to warn people against the dangers of using Excel to migrate their old TBs to the new format to allow for fuzzy matching to work. I'm not sure, but I suspect that people are going to run the risk of messing up quite a few special characters unless they use a decent UTF-8-aware text editor to replace all instances of 'HalfPrefix' with 'Prefix'.

    Michael

    1. Dangers? For quite a while, everyone associated with Kilgray that has discussed using Excel has emphasized the importance of saving the CSV file as Unicode text to avoid character mapping problems. By now I would say that anyone who tunes that advice out deserves to repeat the work or at least pay a consultant like me a small fortune to repeat it for them ;-)

      Thanks for the knowledgebase link.

  4. Hmm. Although I agree that you and I should know better, there will be many not-so-technical users trying to migrate their termbases, and Kilgray ought to at least mention the problem in their official help resources for their sakes, if only to save themselves a few support requests.

  5. Two caveats about bugs I have turned up in testing the fuzzy terminology feature. Both of these have been reported to Kilgray Support and will presumably be fixed at some point:

    (1) Wildcards added in the term extraction module are treated as ordinary characters when written to termbases with "Fuzzy" term entries as their default.

    (2) In the memoQ term editor, if you select a "custom" matching term (one with wildcards) and change it to "fuzzy", the program offers to remove the wildcards, and if you decline, the term remains "custom". With bulk selected terms (by Shift-clicking or Ctrl-clicking to select multiple terms), all terms will be changed to "fuzzy" and wildcards are left in, causing the same problem identified in (1).

