Apr 4, 2018

New in memoQ 8.4: easy stopword list creation!

This wasn't really on Kilgray's plan, but hey - it's now possible, and that makes my life easier. An accidental "feature".

Four years ago, frustrated by memoQ's inability to import stopword lists obtained from other sources, I published a somewhat complex workaround, which I have used in workshops and classes when I teach terminology mining techniques. For years I had suggested that adding and merging such lists be facilitated in some way, because the memoQ stopword list editor really sucks (and still does). Alas, the suggestion was not taken up, so translators of most source languages were left high and dry if they wanted to do term extraction in memoQ and avoid the noise of common, uninteresting words.

Enter memoQ version 8.4... with a lot of very nice improvements in terminology management features, which will be the subject of other posts in the future. I've had a lot of very interesting discussions with the Kilgray team since last autumn, and the directions they've indicated for terminology in memoQ have been very encouraging. The most recent versions (8.3 and 8.4) have delivered on quite a number of those promises.

I have used memoQ's term extraction module since it was first introduced in version 5, but it was really a prototype, not a properly finished tool, despite its superiority over many others in a lot of ways. One of its biggest weaknesses was the handling of stopwords (used to filter out unwanted "word noise"). It was difficult to build lists that did not already exist, and it was also difficult to add words to a list, because both the editor and the term extraction module allowed only one word to be added at a time. Quite a nuisance.

In memoQ 8.4, however, we can now add any number of selected words in an extraction session to the stopword list. This eliminates my main gripe with the term extraction module. And this afternoon, while I was chatting with Kilgray's Peter Reynolds about what I like about terminology in memoQ 8.4, a remark from him inspired the realization that it is now very easy to create a memoQ stopword list from any old stopword lists for any language.

How? Let me show you with a couple of Dutch stopword lists I pulled off the Internet :-)

I've been collecting stopword lists for friends and colleagues for years; I probably have 40 or 50 languages covered by now. I use these when I teach about AntConc for term extraction, but the manual process of converting them for use in memoQ has simply been too intimidating for most people.

But now we can import and combine these lists easily with a bogus term extraction session!

First I create a project in memoQ, setting the source language to the one for which I want to build or expand a stopword list. The target language does not matter. Then I import the stopword lists into that project as "translation documents".
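If the lists you collected are messy (duplicates, comment lines, mixed case), it may help to tidy them into a single plain-text file before importing. The following is only a minimal Python sketch of that optional clean-up, not anything memoQ provides or requires. It assumes UTF-8 text files with one entry per line and treats "#" and "|" as comment markers, which is true of some published lists but not all; the function name and file names are made up for illustration.

```python
from pathlib import Path


def merge_stopword_lists(paths, encoding="utf-8"):
    """Return a sorted, de-duplicated list of single-word stopwords.

    Assumptions (not from memoQ): one entry per line, UTF-8 text,
    and '#' or '|' starting a comment.
    """
    words = set()
    for path in paths:
        for line in Path(path).read_text(encoding=encoding).splitlines():
            # Drop anything after a comment marker, then normalize.
            entry = line.split("|", 1)[0].split("#", 1)[0].strip().lower()
            if not entry:
                continue  # blank line or pure comment
            if len(entry.split()) > 1:
                continue  # memoQ stopword lists work with single words
            words.add(entry)
    return sorted(words)


def write_merged_list(paths, output_path, encoding="utf-8"):
    """Write the merged list to a text file, one word per line."""
    merged = merge_stopword_lists(paths, encoding=encoding)
    Path(output_path).write_text("\n".join(merged) + "\n", encoding=encoding)
    return len(merged)
```

Calling `write_merged_list(["dutch_a.txt", "dutch_b.txt"], "dutch_merged.txt")` (hypothetical file names) would produce one file to import as a "translation document" instead of several. This step is entirely optional, since the extraction session counts each word only once as a candidate anyway.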

On the Preparation ribbon in the open project, I then choose Extract Terms and tell the program to use the stopword lists I imported as "translation documents". Some special settings are required for this extraction:

The two areas marked with red boxes (the frequency and length limits) are critical. Change all the values there to "1" to ensure that every word is included. Ordinarily, these values are higher, because the term extraction module in memoQ is designed to pick words based on their frequencies, and a typical minimum frequency is 3 or 4 occurrences. Some stopword lists I have seen include multi-word expressions, but memoQ stopword lists work with single words, so the maximum length in words needs to be one.
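To see why those values matter, here is a toy model in Python of a frequency-based candidate filter. It is emphatically not memoQ's extraction algorithm, just an illustration under simple assumptions: whitespace tokenization, and parameter names (`min_frequency`, `max_words`) that I made up to mirror the settings described above.

```python
from collections import Counter


def extract_candidates(segments, min_frequency, max_words):
    """Toy candidate extraction: count every 1..max_words word sequence
    in each segment and keep those seen at least min_frequency times."""
    counts = Counter()
    for segment in segments:
        tokens = segment.lower().split()
        for n in range(1, max_words + 1):
            for i in range(len(tokens) - n + 1):
                counts[" ".join(tokens[i:i + n])] += 1
    return sorted(term for term, count in counts.items() if count >= min_frequency)


# An imported stopword list looks like a document in which nearly
# every word occurs exactly once.
stopword_file = ["de", "het", "een", "ten minste"]

typical = extract_candidates(stopword_file, min_frequency=3, max_words=1)
everything = extract_candidates(stopword_file, min_frequency=1, max_words=1)
```

With a typical minimum frequency of 3, `typical` comes back empty, because no word in a stopword file is repeated that often. With both values set to 1, `everything` contains each individual word, and the expression "ten minste" contributes only its single words rather than a two-word candidate.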

Select all the words in the list (by selecting the first entry, scrolling to the bottom and then clicking on the last entry while holding down the Shift key to get everything), and then select the command from the ribbon to add the selected candidates to the stopword list.

But we don't have a Dutch stopword list! No matter:

Just create a new one when the dialog appears!

After the OK button is clicked to create the list, the new list appears with all the selected candidates included. When you close that dialog, be sure to click Yes to save the changes or the words will not be added!

Now my Dutch stopword list is available for term extraction in Dutch documents in the future and will appear in the dropdown menu of the term extraction session's settings dialog when a session is created or restarted. And with the new features in memoQ 8.4, it's a very simple matter to select and add more words to the list in the future, including all "dropped" terms if you want to do that.

More sophisticated use of your new list would include changing the 3-digit codes which are used with stopwords in memoQ to allow certain words to appear at the beginning, in the middle, or at the end of phrases. If anyone is interested in that, they can read about it in my blog post from six years ago. But even without all that, the new stopword lists should be a great help for more efficient term extractions for your source languages in the future.

And, of course, like all memoQ light resources, these lists can be exported and shared with other memoQ users who work with the same source language.
