Jan 7, 2012

Understanding memoQ's term extraction stopword codes

Recently I shared a link to a small stopword list for a minor language, which I had set up as a memoQ resource for a friend, and another translator questioned why I had coded the stopwords as I did. My answer was truthful: no good reason. I had simply copied the practice in Kilgray's default files for other languages. As I looked further into discussions of term extraction and stopwords on the memoQ Yahoogroups list, I realized that I was not the only one who had a hard time getting a clear picture of how things actually work. So I decided to learn by experiment.
First I created a stopword list of nonsense words covering every possible coding combination. A memoQ stopword list is a text file with an XML header and an *.mqres extension, with a structure that looks like this:
<memoqresource resourcetype="Stopwords" version="1.0">
  <resource>
    <guid>2b077cde-8c10-4ee1-86db-14eb42f010cc</guid>
    <filename>KSL_test-stopwords_EN.mqres</filename>
    <name>KSL_test-stopwords-EN</name>
    <description>For testing only</description>
    <language>eng</language>
  </resource>
</memoqresource>
gak    111
unga   101
munga  011
kunga  110
fra    000
blu    100
bly    001
bla    010
The entries in the stopword list (here the nonsense words gak through bla) are each followed by a tab and a three-digit binary code. The first digit controls whether a phrase is excluded from the list of candidates if it begins with this entry (Kilgray calls this "blocks as first"). The second digit controls whether a phrase is excluded if the entry occurs within it, neither at the beginning nor at the end ("blocks inside"). The third digit controls whether a phrase is excluded if the entry occurs at its end ("blocks as last").
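For anyone who wants to inspect such a list outside memoQ, here is a minimal Python sketch of reading the entry lines. The names StopwordRule and parse_entries are hypothetical and not part of memoQ. The sketch assumes that every line starting with "<" belongs to the XML header and that the code is the last whitespace-separated field on a line.

```python
from typing import NamedTuple


class StopwordRule(NamedTuple):
    """The three flags of one stopword entry (hypothetical helper type)."""
    blocks_first: bool
    blocks_inside: bool
    blocks_last: bool


def parse_entries(lines):
    """Turn 'word<TAB>code' lines into a {word: StopwordRule} dictionary."""
    rules = {}
    for line in lines:
        line = line.strip()
        # Assumption: blank lines and XML header lines carry no entries.
        if not line or line.startswith("<"):
            continue
        # Split on the last run of whitespace, so a tab or spaces both work.
        word, code = line.rsplit(None, 1)
        if len(code) != 3 or set(code) - {"0", "1"}:
            raise ValueError(f"not a three-digit binary code: {code!r}")
        # A digit of "1" sets the corresponding flag.
        rules[word] = StopwordRule(*(digit == "1" for digit in code))
    return rules
```

Called on the lines of the file above, this would return eight rules, one for each nonsense word.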

A "1" means yes, "0" means no. So "011" means the entry is
  • allowed at the start of a phrase,
  • not allowed inside a phrase,
  • not allowed at the end of a phrase.
By the same logic kunga, coded "110", will cause a phrase to be excluded if it occurs at the start of or inside the phrase, but not at the end. Phrases ending with kunga might appear in the list of candidates.
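The rule can be written down as a short Python sketch. This is only a model of the behavior described here, not memoQ's actual implementation. The function name is_blocked is hypothetical, matching is assumed to be case-insensitive, and only phrases of two or more words are considered.

```python
def is_blocked(phrase, codes):
    """Return True if any stopword in the phrase blocks it at its position.

    codes maps a stopword to its three-digit string:
    digit 0 = blocks as first, digit 1 = blocks inside, digit 2 = blocks as last.
    """
    words = phrase.lower().split()  # assumption: case-insensitive matching
    last = len(words) - 1
    for i, word in enumerate(words):
        code = codes.get(word)
        if code is None:
            continue  # not a stopword
        if i == 0 and code[0] == "1":
            return True
        if i == last and code[2] == "1":
            return True
        if 0 < i < last and code[1] == "1":
            return True
    return False


codes = {"munga": "011", "kunga": "110"}
print(is_blocked("munga the lazy dog", codes))  # munga is allowed at the start
print(is_blocked("over the lazy kunga", codes))  # kunga is allowed at the end
print(is_blocked("over kunga lazy dog", codes))  # kunga blocks inside
```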

My test file contained the sentence
The quick brown fox jumped over the lazy dog 
repeated four times in each of three blocks for every test stopword, with the stopword substituted at the beginning, inside and at the end of "over the lazy dog". For unga, for example:
The quick brown fox jumped unga the lazy dog. The quick brown fox jumped unga the lazy dog. The quick brown fox jumped unga the lazy dog. The quick brown fox jumped unga the lazy dog.
The quick brown fox jumped over unga lazy dog. The quick brown fox jumped over unga lazy dog. The quick brown fox jumped over unga lazy dog. The quick brown fox jumped over unga lazy dog.
The quick brown fox jumped over the lazy unga. The quick brown fox jumped over the lazy unga. The quick brown fox jumped over the lazy unga. The quick brown fox jumped over the lazy unga.
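A test text of this kind can be generated rather than typed. The following Python sketch reproduces the three blocks for one stopword. The function name make_test_blocks and its repeats parameter are hypothetical, chosen only for illustration.

```python
PHRASE = ["over", "the", "lazy", "dog"]


def make_test_blocks(stopword, repeats=4):
    """Build three blocks with the stopword at the start, inside and end."""
    blocks = []
    for position in (0, 1, 3):  # beginning, inside, end of the phrase
        words = list(PHRASE)
        words[position] = stopword
        sentence = "The quick brown fox jumped " + " ".join(words) + "."
        blocks.append(" ".join([sentence] * repeats))
    return blocks


for block in make_test_blocks("unga"):
    print(block)
```

Looping over all eight nonsense words would yield the complete test document.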
After the term extraction, the following four-word phrases from the text chunk of interest were found with the stopwords:
fra The lazy dog
bly The lazy dog
bla The lazy dog
munga The lazy dog

over unga lazy dog
over fra lazy dog
over blu lazy dog
over bly lazy dog

over The lazy kunga
over The lazy fra
over The lazy blu
over The lazy bla
All these occurrences follow the defined rules, as you can see from the stopword list above; gak, coded "111", appears in none of them. None of the stopwords occurred singly in the extraction candidates, of course. So entering "000" as the code for a stopword will exclude that stopword on its own but not any phrase containing it.
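The comparison between codes and results can also be checked mechanically. In this Python sketch the codes and the observed stopwords are copied from the lists above, and the function name predicted_survivors is hypothetical. A stopword should survive in a position exactly when the corresponding digit of its code is "0".

```python
CODES = {
    "gak": "111", "unga": "101", "munga": "011", "kunga": "110",
    "fra": "000", "blu": "100", "bly": "001", "bla": "010",
}

# Stopwords that appeared in extracted phrases, grouped by position.
OBSERVED = {
    "first": {"fra", "bly", "bla", "munga"},
    "inside": {"unga", "fra", "blu", "bly"},
    "last": {"kunga", "fra", "blu", "bla"},
}


def predicted_survivors(codes):
    """For each position, collect the stopwords whose digit there is '0'."""
    positions = ("first", "inside", "last")
    return {
        pos: {word for word, code in codes.items() if code[i] == "0"}
        for i, pos in enumerate(positions)
    }


print(predicted_survivors(CODES) == OBSERVED)  # True
```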

How is this relevant in practice? In English, for example, words like in, the and first are uninteresting by themselves and belong in a stopword list. But a phrase containing them, like "in the first instance", might indeed be of interest. In cases like that, the appropriate code for these stopwords might be "001" or "101" (allowing them inside a phrase in both cases, and at the beginning as well in the first case). These are matters of judgment that will differ for each language. One user commented that he finds it more useful to be very restrictive in the extraction ("111") and add phrases during the actual translation, and I am inclined to follow this practice as well. Where one discovers exceptions, the stopword rules can always be edited in various places in memoQ.
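One way to reason about such choices is to start from "111" and relax only the digits that a wanted phrase requires. The Python sketch below does this. It is an illustration of the coding rules described in this post, not a memoQ feature, and the function name strictest_codes is hypothetical.

```python
def strictest_codes(phrases, stopwords):
    """Start every stopword at '111' and clear only the digits needed
    to let the given phrases through."""
    flags = {word: [1, 1, 1] for word in stopwords}
    for phrase in phrases:
        words = phrase.lower().split()  # assumption: case-insensitive matching
        last = len(words) - 1
        for i, word in enumerate(words):
            if word not in flags:
                continue
            if i == 0:
                flags[word][0] = 0  # must not block as first
            if i == last:
                flags[word][2] = 0  # must not block as last
            if 0 < i < last:
                flags[word][1] = 0  # must not block inside
    return {word: "".join(map(str, f)) for word, f in flags.items()}


print(strictest_codes(["in the first instance"], ["in", "the", "first"]))
# {'in': '011', 'the': '101', 'first': '101'}
```

For this one phrase the sketch arrives at "011" for in, which is stricter than "001" because it still blocks in inside a phrase. Whether the looser code is preferable is again a matter of judgment.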


