Apr 14, 2018

memoQ filter for MS Outlook e-mail

A few days ago I was preparing screenshots in memoQ for lecture slides. As I tried to select a PDF file to import, the defective trackpad on my laptop caused a file farther down in the list to be selected, and I got a surprise. Not believing my eyes, I tried again and saw that, yes, what I saw was indeed possible...

... saved Microsoft Outlook MSG files (e-mail) are imported to memoQ with all their graphics and attachments! Kilgray created a filter some time ago and simply forgot to document its existence publicly. As of the current versions of memoQ you won't see this in the documentation or the filter lists of the interface, but memoQ can "see" MSG files, and if they are selected, this hidden filter will appear in the import dialog.

And this also works for LiveDocs.

At the time of this discovery, I was working on a little job for a friend's agency, and her project manager had sent me a list of abbreviations in an e-mail. I was too lazy to make the entries in my termbase, so I simply imported the mail to the LiveDocs corpus I maintain for her shop so that it would show up in concordance searches:

So when people tell you memoQ is good, don't believe them. It's actually better than that, but the truth is a well-kept secret :-)

Apr 4, 2018

New in memoQ 8.4: easy stopword list creation!

This wasn't really in Kilgray's plans, but hey - it's now possible, and that makes my life easier. An accidental "feature".

Four years ago, frustrated by memoQ's inability to import stopword lists obtained from other sources, I published a somewhat complex workaround, which I have used in workshops and classes when I teach terminology mining techniques. For years I had suggested that adding and merging such lists be facilitated in some way, because the memoQ stopword list editor really sucks (and still does). Alas, the suggestion was not taken up, so translators of most source languages were left high and dry if they wanted to do term extraction in memoQ and avoid the noise of common, uninteresting words.

Enter memoQ version 8.4... with a lot of very nice improvements in terminology management features, which will be the subject of other posts in the future. I've had a lot of very interesting discussions with the Kilgray team since last autumn, and the directions they've indicated for terminology in memoQ have been very encouraging. The most recent versions (8.3 and 8.4) have delivered on quite a number of those promises.

I have used memoQ's term extraction module since it was first introduced in version 5, but it was really a prototype, not a properly finished tool, despite its superiority over many others in a lot of ways. One of its biggest weaknesses was the handling of stopwords (used to filter out unwanted "word noise"). It was difficult to build lists that did not already exist, and it was also difficult to add words to a list, because both the editor and the term extraction module allowed only one word to be added at a time. Quite a nuisance.

In memoQ 8.4, however, we can now add any number of selected words in an extraction session to the stopword list. This eliminates my main gripe with the term extraction module. And this afternoon, while I was chatting with Kilgray's Peter Reynolds about what I like about terminology in memoQ 8.4, a remark from him inspired the realization that it is now very easy to create a memoQ stopword list from any old stopword lists for any language.

How? Let me show you with a couple of Dutch stopword lists I pulled off the Internet :-)

I've been collecting stopword lists for friends and colleagues for years; I probably have 40 or 50 languages covered by now. I use these when I teach about AntConc for term extraction, but the manual process of converting them for use in memoQ has simply been too intimidating for most people.

But now we can import and combine these lists easily with a bogus term extraction session!

First I create a project in memoQ, setting the source language to the one for which I want to build or expand a stopword list. The target language does not matter. Then I import the stopword lists into that project as "translation documents".
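Lists pulled off the Internet are often messy, with duplicates, mixed case and multi-word entries, and since memoQ stopword lists work with single words, it can help to tidy them before import. The following is only a minimal sketch of such a clean-up outside memoQ. It assumes plain-text lists with one entry per line and comment lines starting with "#"; the function name and the sample data are hypothetical, and nothing here is part of memoQ itself.

```python
def merge_stopword_lists(lists):
    """Merge several stopword lists into one sorted list of unique single words."""
    merged = set()
    for text in lists:
        for line in text.splitlines():
            entry = line.strip().lower()
            # Skip blank lines, comment lines (assumed to start with "#")
            # and multi-word expressions.
            if not entry or entry.startswith("#") or len(entry.split()) != 1:
                continue
            merged.add(entry)
    return sorted(merged)


# Two tiny, made-up Dutch lists standing in for downloaded files.
list_a = "de\nhet\neen\n# comment\nen\n"
list_b = "Het\nvan\nten opzichte van\n\nen\n"

merged = merge_stopword_lists([list_a, list_b])
print("\n".join(merged))
```

The printed result can be saved as a single text file and imported as one "translation document". Dropping multi-word entries is a design choice in this sketch; leaving them in is also harmless, because a maximum length of one word in the extraction settings breaks them up anyway.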

On the Preparation ribbon in the open project, I then choose Extract Terms and tell the program to use the stopword lists I imported as "translation documents". Some special settings are required for this extraction:

The two areas marked with red boxes are critical. Change all the values there to "1" to ensure that every word is included. Ordinarily, these values are higher, because the term extraction module in memoQ is designed to pick words based on their frequencies, and a typical minimum frequency used is 3 or 4 occurrences. Some stopword lists I have seen include multi-word expressions, but memoQ stopword lists work with single words, so the maximum length in words needs to be one.

Select all the words in the list (by selecting the first entry, scrolling to the bottom and then clicking on the last entry while holding down the Shift key to get everything), and then select the command from the ribbon to add the selected candidates to the stopword list.

But we don't have a Dutch stopword list! No matter:

Just create a new one when the dialog appears!

After the OK button is clicked to create the list, the new list appears with all the selected candidates included. When you close that dialog, be sure to click Yes to save the changes or the words will not be added!

Now my Dutch stopword list is available for term extraction in Dutch documents in the future and will appear in the dropdown menu of the term extraction session's settings dialog when a session is created or restarted. And with the new features in memoQ 8.4, it's a very simple matter to select and add more words to the list in the future, including all "dropped" terms if you want to do that.

More sophisticated use of your new list would include changing the 3-digit codes which are used with stopwords in memoQ to allow certain words to appear at the beginning, in the middle, or at the end of phrases. If anyone is interested in that, they can read about it in my blog post from six years ago. But even without all that, the new stopword lists should be a great help for more efficient term extractions for your source languages in the future.

And, of course, like all memoQ light resources, these lists can be exported and shared with other memoQ users who work with the same source language.

Complicated XML in memoQ: a filtering case example

Most of the time when I deal with XML files in memoQ things are rather simple. Most of the time, in fact, I can use the default settings of the standard XML import filter, and everything works fine. (Maybe that's because a lot of my XML imports are extracted from PDF files using iceni InFix, which is the alternative to the TransPDF XLIFF exports using iceni's online service; this overcomes any confidentiality issues by keeping everything local.)

Sometimes, however, things are not so simple. Like with this XML file a client sent recently:

Now if you look at the file, you might think the XLIFF filter should be used. But if you do that, the following error message would result in memoQ:

That is because the monkey who programmed the "XLIFF" export from the CMS system where the text resides was one of those fools who don't concern themselves with actual file format specifications. A number of the tags and attributes in the file simply do not conform to the XLIFF standards. There is a lot of that kind of stupidity to be found.

Fear not, however: one can work with this file using a modified XML filter in memoQ. But which one?

At first I thought to use the "Multilingual XML" filter that I have heard about and never used, but this turned out to be a dead end. It is language-pair specific, and really not the best option in this case. I was concerned that there might be more files like this in the future involving other language pairs, and I did not want to be bothered with customizing for each possible case.

So I looked a little closer... and noted that this export has the source text copied exactly to the "target". So I concentrated on building a customized XML filter configuration that would just pull the text to translate from between the target tags. A custom configuration of the XML filter was created after populating the tags by excluding the "source" tag content:

That worked, but not well enough. In the screenshot below, the excluded source content is shown with a gray background, but the imported content has a lot of HTML, for which the tags must be protected:

The next step is to do the import again, but this time including an HTML filter after the customized XML filter. In memoQ jargon, this sort of configuration is known as a "cascading filter" - where various filters are sequenced to handle compounded formats. Make sure, however, that the customized XML filter configuration has been saved first:

Then choose that custom configuration when you import the file using Import with Options:

This cascaded configuration can also be saved using the corresponding icon button.

This saved custom cascading filter configuration is available for later use, and like any memoQ "light resource", it can be exported to other memoQ installations.

The final import looks much better, and the segmentation is also correct now that the HTML tags have been properly filtered:

If you encounter a "special" XML case to translate, the actual format will surely be different, and the specific steps needed may differ somewhat as well. But by breaking the problem down in stages and considering what more might need to be done at each stage to get a workable result with all the non-translatable content protected, you or your technical support associates can almost always build a customized, re-usable import filter in reasonable time, giving you an advantage over those who lack the proper tools and knowledge and ensuring that your client's content can be translated without undue technical risks.
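For readers who like to see the logic spelled out, the two stages of such a cascade can be imitated in a few lines of Python. This is only a rough sketch of the idea, not how memoQ implements its filters. The sample file, the element names "file" and "unit", the numbered placeholders and the assumption that the HTML is entity-escaped inside the XML are all hypothetical; only the "source" and "target" element names come from the case described above.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the client's non-conforming "XLIFF" export.
SAMPLE = """<file>
  <unit id="1">
    <source>&lt;p&gt;Hello &lt;b&gt;world&lt;/b&gt;&lt;/p&gt;</source>
    <target>&lt;p&gt;Hello &lt;b&gt;world&lt;/b&gt;&lt;/p&gt;</target>
  </unit>
</file>"""

HTML_TAG = re.compile(r"<[^>]+>")


def extract_targets(xml_text):
    """Stage 1: take text only from <target> elements and ignore <source>."""
    root = ET.fromstring(xml_text)
    return [element.text or "" for element in root.iter("target")]


def protect_html(text):
    """Stage 2: swap each HTML tag for a numbered placeholder."""
    tags = HTML_TAG.findall(text)
    numbers = iter(range(1, len(tags) + 1))
    protected = HTML_TAG.sub(lambda match: "{%d}" % next(numbers), text)
    return protected, tags


segments = [protect_html(text) for text in extract_targets(SAMPLE)]
for protected, tags in segments:
    print(protected, tags)
```

The first function plays the role of the customized XML filter, and the second stands in for the HTML filter chained after it. The translator sees only the text and the placeholders, while the original tags are kept aside for the export.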

Apr 3, 2018

Dealing with tagged translatable text in memoQ

Lately I've been doing a bit of custom filter development for some translation agency clients. Most of it has been relatively simple stuff, like chaining an HTML filter after an Excel filter to protect HTML tags around the text in the Excel cells, but some of it is more involved; in a few cases, three levels of filters had to be combined using memoQ's cascading filter feature.

And sometimes things go too far....

A client had quite a number of JSON files, which were the basis for some online programming tutorials. There was quite a lot of non-translatable content that made it past memoQ's default JSON filter, much of which - if modified in any way - would mess up the functionality of the translated content and require a lot of troublesome post-editing and correction. In the example above, Seconds in a day: is clearly translatable text, but the special rules used with the Regex Tagger turned that text (and others) into protected tags. And unfortunately the rules could not be edited efficiently to avoid this without leaving a lot of untranslatable content unprotected and driving up the cost (due to increased word count) for the client.
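To illustrate how a protection rule can swallow translatable text, here is a small sketch in Python. The pattern and the sample string are invented for illustration and are not the client's actual rules or content; the bracketed marker merely stands in for a protected tag.

```python
import re

# Hypothetical rule: protect anything that looks like a print(...) call.
CODE_RULE = re.compile(r"print\(.*?\)")


def tag_code(text):
    """Wrap every match of the rule in a marker that stands in for a tag."""
    return CODE_RULE.sub(lambda match: "[tag: %s]" % match.group(0), text)


sample = 'Try this: print("Seconds in a day:", 86400)'
tagged = tag_code(sample)
print(tagged)
```

The whole call ends up inside the tag, including the string a translator ought to see. Narrowing a rule like this to spare the string literal quickly becomes fragile, which is the trade-off described above.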

In situations like this, there is only one proper thing to do in memoQ: edit the tags!

There are two ways to do this:

  • use the inline tag editing features of memoQ or
  • edit the tag on the target side of a memoQ RTF bilingual review file.

The second approach can be carried out by someone (like the client) in any reasonable text editor; tags in an RTF bilingual are represented as red text:

If, however, you go the RTF bilingual route, it's important to specify that the full text of the tags is to be exported, or all you'll get are numbers in brackets as placeholders:

Editing tags in the memoQ working environment is also straightforward:

On the Edit ribbon, select Tag Commands and choose the option Edit Inline Tag.

When you change the tag content as required, remember to click the Save button in the editing dialog each time, or your changes will be lost.

These methods can be applied to cases such as HTML or XML attribute text which needs to be translated but which instead has been embedded in a tag due to an incorrectly configured filter. I've seen that rather often unfortunately.

The effort involved here is greater than typical word- or character-based rates can fairly cover, so it should be charged at a decent hourly rate or be included in project management fees.

A lot of translators are rather "tag-phobic", but the reality of translation today is that tags are an essential part of the translatable content, serving to format translatable content in some cases and containing (unfortunately) embedded text which needs to be translated in other (fortunately less common) cases. Correct handling of tags by translation service providers delivers considerable value to end clients by enabling translations to be produced directly in the file formats needed, saving a great deal of time and money for the client in many cases.

One reasonable objection that many translators have is that the flawed compensation models typically used in the bulk market bog do not fairly include the extra effort of working with tags. In simple cases where the tags are simply part of the format (or are residual garbage from a poorly prepared OCR file, for example), a fair way of dealing with this is to count the tags as words or as an average character equivalent. This is what I usually do, but in the case of tags which need editing, this is not enough, and an hourly charge would apply.

In the filter development project for the JSON files received by my agency client, the text used was initially analyzed at
14,985 words; 111,085 characters; 65 tags
and after proper tagging of the coded content to be protected it was
8,766 words; 46,949 characters; 2,718 tags.
The reduction in text count more than covered the cost of the few hours needed to produce the cascading filter needed for this client's case and largely ensured that the translator could not alter text which would impair the function of the product.
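The arithmetic behind that statement can be checked with a few lines of Python. The counts are the ones given above. Weighting each tag as one word is an assumption made here to mirror the tag compensation approach mentioned earlier; the actual weighting used for billing is not stated.

```python
before = {"words": 14985, "characters": 111085, "tags": 65}
after = {"words": 8766, "characters": 46949, "tags": 2718}


def reduction(old, new):
    """Percentage reduction from old to new, rounded to one decimal place."""
    return round(100 * (old - new) / old, 1)


word_cut = reduction(before["words"], after["words"])
char_cut = reduction(before["characters"], after["characters"])

# Assumption: each tag is counted as one word.
weighted_before = before["words"] + before["tags"]
weighted_after = after["words"] + after["tags"]
weighted_cut = reduction(weighted_before, weighted_after)

print(word_cut, char_cut, weighted_cut)
```

The word count drops by 6,219 words, a little over 40 percent. Even with every tag counted as a word, the weighted total falls from 15,050 to 11,484, so the client still comes out well ahead.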