One of the major weaknesses (aside from never remembering my changes to the options or my preferred settings for extractions) is the management of stopword lists.
Overall, Kilgray's approach to stopwords is reasonably sophisticated; some time ago I published a rather incomprehensible post in which I demonstrated how the "stopword codes" - those three-digit binary numbers which appear next to stopwords - control whether a word can appear at the start, the end or the middle of a phrase even if it is excluded as a single word. These codes are quite useful in some cases. However, they also complicate the use of stopword data for most users.
memoQ ships with stopword lists for only a few languages - German, English, French and Italian, I think. Not even all the user interface languages are included. That's rather sad, because there are quite a few public domain stopword lists available on the Internet. However, most memoQ users have absolutely no idea how to incorporate these in memoQ, and Kilgray offers no features or information that I am aware of to facilitate this process.
I was reminded of this problem when I was asked to discuss terminology mining with masters students at my local university in Portugal. I thought it might be nice for the students to be able to make use of the stopword lists they could find on the Internet for their target languages (Spanish and Portuguese for that group). These lists are typically just text files of single words. When I built my own master stopword list for German a few years ago, I gathered half a dozen or more large, mostly redundant lists for a start. Then I carried out the following steps:
- Combine all the stopword lists for a language from various sources into one big text file.
- Open that text file in a spreadsheet program such as Microsoft Excel.
- Sort the list and use the integrated function to eliminate duplicates.
- Fill the number "111" in the second column of the spreadsheet. (This will completely exclude the term from phrases as well; if you want to make individual exceptions according to the scheme I described in an earlier blog post, you can do so now or at any time later after the list has been imported to memoQ.)
- Save the data as tab-delimited Unicode text.
- Open the text file and paste in this XML header, adapting the red parts to your particular list:
<MemoQResource ResourceType="Stopwords" Version="1.0">
<Description>Combined lists from many sources</Description>
- Save the file, change the file extension to MQRES, and import the file as a stopword list in the memoQ Resource Console.
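For anyone who would rather not go through a spreadsheet, the combining, sorting, de-duplicating and "111" steps above can be scripted. What follows is only a rough sketch in Python, not anything supplied by Kilgray. The function names are my own inventions, and I am assuming that duplicates should be matched without regard to case, that the source lists are UTF-8 text with one word per line, and that UTF-16 output corresponds to what Excel calls "Unicode text". The script writes only the tab-delimited body; the XML header still has to be pasted in by hand as described above.

```python
from pathlib import Path


def merge_stopwords(texts, code="111"):
    """Merge raw stopword lists (one word per line) into 'word<TAB>code' rows."""
    seen = {}
    for text in texts:
        for line in text.splitlines():
            word = line.strip()
            if not word:
                continue  # skip blank lines
            # Assumption: duplicates are matched case-insensitively,
            # and the first spelling encountered is the one kept.
            seen.setdefault(word.casefold(), word)
    words = sorted(seen.values(), key=str.casefold)
    return [f"{word}\t{code}" for word in words]


def write_stopword_body(source_paths, target_path, encoding="utf-16"):
    """Read several list files and write the merged tab-delimited body.

    Assumptions: sources are UTF-8 (with or without a byte order mark);
    the output encoding is chosen to resemble Excel's "Unicode text".
    """
    texts = [Path(p).read_text(encoding="utf-8-sig") for p in source_paths]
    rows = merge_stopwords(texts)
    Path(target_path).write_text("\n".join(rows) + "\n", encoding=encoding)
    return len(rows)
```

A call such as `write_stopword_body(["list1.txt", "list2.txt"], "combined.txt")` (hypothetical file names) would then give you the file to which the header is added before renaming it to MQRES. Individual codes can still be changed by hand afterward for words that should be allowed inside phrases.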
During an extraction session, words can be added to the chosen stopword list by selecting a term and clicking the Add as stopword command or pressing Ctrl+W. Unfortunately this works only one term at a time; I've been asking for multiple addition as a productivity measure for 3 years now.
You might be a little confused if you look for words you've added to a stopword list from the term extraction interface. They are not inserted in alphabetical order, but instead at the end of the words starting with a given letter. Thus, for example, the red box in the screen clipping below shows all the words I've added to my previously alphabetized list since I created it: