Showing posts with label edit distance. Show all posts
Showing posts with label edit distance. Show all posts

Jun 7, 2013

Understanding fuzzy term matching in memoQ 2013

Perhaps the most interesting and potentially useful feature for me in the recently released memoQ 2013 is fuzzy term matching. I have wanted something like this for several years, and several efforts at harmonizing terminology in a large, collaborative project last year made it clear that this might be very helpful in identifying deviations from agreed terminology in cases where that terminology appears as part of a compound word (as it sometimes tends to do in German, my source language).

So when I finally downloaded the latest version of memoQ last weekend and began testing, fuzzy terminology was the second thing I looked at (after the current comment mess). My initial tests left me very, very confused. Each example I created gave different results, and it was not easy to discern a pattern from examining just a few terms. The explanation of why this feature works as it does can be difficult to follow, at least as far as I am able to explain it, so many readers may be better off to read my conclusions in the next paragraph and skip everything below it (except maybe the last graphic).

Fuzzy term matching in memoQ 2013 is a real  improvement for terminology matching and quality assurance involving terms, at least for my language pair. This is not an easy challenge that the developers have taken on, but some useful results have been achieved and no harm has been done to previous functionality. And I expect that this feature will be the subject of further refinement and improvement for other languages as users make the case for these.

My first quick test of the feature involved a verb, the German word for "to wait" ("warten"). I put it into a test termbase and then imported a translation text consisting of various sentences that used forms of the verb. I noticed that there was a term hit for "warte", but nothing for "gewartet". After adding "warte" to the termbase for fuzzy matching, there was still no match for "gewartet", although it contained that character sequence.

Then I tried another example with "Gesetz" (law). There I seemed to hit the jackpot. There were hits with Unweltgesetz (a typo, but typical of many source texts I see), Umweltgesetze, Gesetzentwurf and Umweltgesetzentwurf with blue background highlighting of the character sequence matching the termbase entry.

A third term produced more confusion: with "Ausführung" in the termbase, there was no match for "Farbausführungungen", but there was a match for "Farbausführungsbeispiel". Clearly this is not a simple matching function.

A question to Kilgray Support brought an answer that explained the match behavior. The current implementation of fuzzy term matching in memoQ uses a combination of rules which depend on the index, the "edit distance" (calculated differences between the entry string and the characters in the term to match) and, depending on the language, character maps and a threshold length for possible compound words.

German, it seems, is a privileged language, the only one for which compound word recognition rules are currently active. Apparently five characters are the minimum to be recognized as a word, so the "Farb-" in "Farbanwendungen" wasn't enough, but "-beispiel" triggered the compound recognition rule that caused "-anwendung-" to be matched in the middle of "Farbanwendungsbeispiel". I imagine that compound matching would be useful for Dutch and some other languages, and the developer suggested that expanding coverage to other languages as needed would not be difficult.

Character mapping - defined equivalence between letters - is implemented for German, Hungarian, Italian and Spanish to allow matching in cases where letters may change with plural formation, for example. Thus the German word "Bratapfel" in the termbase would yield a hit for the plural form "Bratäpfel".

Edit distance is calculated by dividing the number of deviating characters by the number of total characters in the fuzzy term entry. A match is currently assumed if the edit distance is 0.2 or less. The term "warte" differs from the six-letter entry "warten" by one character; 1/6 is less than 0.2, so memoQ 2013 reports a term match. In the example of "Farbausführungen" above, the calculated edit is 6/10 (because 6 letters - four on the left, two on the right - are added to the term entry), and because this is larger than 0.2 no match is indicated. If, on the other hand, "Farbtonausführungen" occurs, a match will be found, because the added segment "Farbton-" meets the requirement of five or more letters for a compound word.

How relevant can this feature be to your language? What changes might be required to the matching behavior to obtain useful results in your source language(s)? Your feedback to Kilgray's support team and feedback from others working with your languages are the best way to help improve the usefulness of fuzzy term matching for your language. So speak up.

If you find this feature useful and want fuzzy term matching as the default for new entries in a termbase, this can be set in the properties for the termbase under New term defaults... for termbases created in memoQ 2013. Older termbases will also display this option, but it won't actually work in practice. To use this new feature with old term collections, these will need to be migrated to a memoQ 2013 termbase.