Apr 11, 2014

memoQ: stopwords for term extraction

The recent Kilgray blog post about the "terminology as a service" (TaaS) project reminded me of the considerable unfinished business with the term extraction module introduced three years ago with memoQ 5.0. It's a very useful feature that I apply frequently to my projects, and my prediction years ago that it would not replace SDL's MultiTerm Extract in my workflows was wrong. Overall it proved to be more convenient, and after the shock of discovering that the defective logic of MultiTerm Extract created new German "words" that neither existed nor were in my text sources, I dumped that dodgy tool and stuck to memoQ's extractor. But sometimes its rough edges are irritating, and I wish Kilgray would finally pick up the ball that was dropped after a great start in the game.

One of the major weaknesses (aside from never remembering my changes to the options or my preferred settings for extractions) is the management of stopword lists.

Overall, Kilgray's approach to stopwords is reasonably sophisticated; some time ago I published a rather incomprehensible post in which I demonstrated how the "stopword codes" - those three-digit binary numbers which appear with stopwords - control whether a word can appear at the start, the end or the middle of a phrase even if it is excluded as a single word. These codes are quite useful in some cases. However, they also complicate the use of stopword data for most users.

memoQ includes only a few stopword lists for a few languages in its shipping configuration - German, English, French and Italian I think. Not even all the user interface languages are included. That's rather sad, because there are quite a few public domain stopword lists available on the Internet. However, most memoQ users have absolutely no idea how to incorporate these in memoQ, and Kilgray offers no features or information that I am aware of to facilitate this process.

I was reminded of this problem when I was asked to discuss terminology mining with master's students at my local university in Portugal. I thought it might be nice for the students to be able to make use of the stopword lists they could find on the Internet for their target languages (Spanish and Portuguese for that group). These lists are typically just text files of single words. When I built my own master stopword list for German a few years ago, I gathered half a dozen or more large, mostly redundant lists for a start. Then I carried out the following steps:
  1. Combine all the stopword lists for a language from various sources into one big text file.
  2. Open that text file in a spreadsheet program such as Microsoft Excel.
  3. Sort the list and use the integrated function to eliminate duplicates.
  4. Fill the number "111" in the second column of the spreadsheet. (This will completely exclude the term from phrases as well; if you want to make individual exceptions according to the scheme I described in an earlier blog post, you can do so now or at any time later after the list has been imported to memoQ.)
  5. Save the data as tab-delimited Unicode text.
  6. Open the text file and paste in this XML header, adapting the list-specific parts (such as the file name, name, description and language code) to your particular list:

    <MemoQResource ResourceType="Stopwords" Version="1.0">
      <Resource>
        <Guid>dc7006ad-8db8-4724-b22d-7acfd600fd9f</Guid>
        <FileName>ger#KSL_stopwords-DE.mqres</FileName>
        <Name>KSL_stopwords-DE</Name>
        <Description>Combined lists from many sources</Description>
        <Language>ger</Language>
      </Resource>
    </MemoQResource>


  7. Save the file, change the file extension to MQRES, and import the file as a stopword list in the memoQ Resource Console.
Afterward, the stopword list you imported will be available when you start a term mining session using Operations > Extract Terms...
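
For anyone who would rather script steps 1 to 7 than click through a spreadsheet, here is a minimal Python sketch of the same procedure. It is not anything Kilgray provides, and several details are my assumptions rather than documented facts: the function names are made up for illustration, the input lists are assumed to be UTF-8 text files with one word per line, a fresh GUID is generated for each list (the example header above uses a fixed one), and the output is written as UTF-16 because that is what Excel produces for "Unicode text". Adjust the encoding if memoQ rejects the file.

```python
from pathlib import Path
from xml.sax.saxutils import escape
import uuid

# XML header modeled on the example shown in step 6.
HEADER_TEMPLATE = """<MemoQResource ResourceType="Stopwords" Version="1.0">
  <Resource>
    <Guid>{guid}</Guid>
    <FileName>{lang}#{name}.mqres</FileName>
    <Name>{name}</Name>
    <Description>{description}</Description>
    <Language>{lang}</Language>
  </Resource>
</MemoQResource>
"""


def merge_stopwords(word_lists):
    """Steps 1-3: combine lists, drop blanks and exact duplicates, sort."""
    words = set()
    for word_list in word_lists:
        for word in word_list:
            word = word.strip()
            if word:
                words.add(word)
    # Plain code-point sort; not locale-aware, and case is preserved.
    return sorted(words)


def build_mqres(words, name, lang, description, code="111", guid=None):
    """Steps 4-6: add the stopword code column and prepend the XML header."""
    header = HEADER_TEMPLATE.format(
        guid=guid or str(uuid.uuid4()),  # assumption: any fresh GUID will do
        lang=escape(lang),
        name=escape(name),
        description=escape(description),
    )
    body = "".join(f"{word}\t{code}\n" for word in words)
    return header + body


def write_mqres(sources, out_path, name, lang, description, encoding="utf-16"):
    """Step 7: read the source lists and save the result as an MQRES file."""
    lists = [Path(p).read_text(encoding="utf-8").splitlines() for p in sources]
    text = build_mqres(merge_stopwords(lists), name, lang, description)
    Path(out_path).write_text(text, encoding=encoding)
```

A call such as `write_mqres(["list1.txt", "list2.txt"], "por#my_stopwords-PT.mqres", "my_stopwords-PT", "por", "Combined lists")` (the file names and the "por" language code are placeholders) would then give a file to import in the Resource Console. Unlike the spreadsheet function, this removes only exact duplicates, so "Der" and "der" are both kept.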

During an extraction session, words can be added to the chosen stopword list by selecting a term and clicking the Add as stopword command or pressing Ctrl+W. Unfortunately this works only one word at a time; I've been asking for multiple addition as a productivity measure for three years now.

You might be a little confused if you look for words you've added to a stopword list from the term extraction interface. They are not inserted in alphabetical order, but instead at the end of the words starting with a given letter. Thus, for example, the red box in the screen clipping below shows all the words I've added to my previously alphabetized list since I created it:


3 comments:

  1. The OK button is greyed!! dimmed
    that is very strange
    I created in a manual way as I mentioned, but I do not know why MemoQ does not allow this to work as it should.
    But certainly memoQ's term extract and terminology engine is becoming much better than that of SDL!!!

  2. Which OK button, Sameh? In which program or context?

    I used to use MultiTerm Extract a lot until a few years ago when I noticed that its so-called "intelligence" was causing it to offer me term candidates that were not actually in the text and which were not real words in the language of interest. At first I thought the client had used some complete idiots as writers, but when I failed to find those words even as substrings in the source text I realized that the SDL software was simply crazy. memoQ's facilities for term extraction are better on the whole (though I wish they had some of the features of MultiTerm Extract), but that's a bog standard comparison. There is much room for improvement.

  3. OK, now I understand the problem. Thanks for sending me your list - you forgot to add the XML header. The list *started* with the stopwords, so it could not be imported. This is exactly why I have been trying to tell Kilgray for several years now that their light resource management is in desperate need of reform. It should not be this complicated to handle what are basically just simple text files, and the changes that would have to be made to facilitate the management of stopword lists are fairly trivial. But as far as I know this blog post is still the only source with accurate (though annoyingly cumbersome) instructions on how to create a valid memoQ stopword list from resources found on the Internet.

