Dec 28, 2016

Go Figure (with memoQ!)

When translating patents, legal briefs, reports, manuals and many other kinds of documents, I inevitably encounter figure references to photographs and illustrations in the text as well as the labeled captions for these. In this morning's translation of a petition in a nullity suit, one such reference takes the form "in Verbindung mit Figur 1", but it might just as well appear as

Fig. 1
Fig 1
Abb. 1
or
Abbildung 1

in this or some other text; in documents with multiple and/or sloppy authors I might even find a mix of all these in the same text.

As I value consistency in writing even when the client might not care, I try to translate all of these to the same form in English where it makes sense to do so. That might be Figure 1 or Fig. 1 depending on the situation and the style guide stipulated for the project.

But when I finish the 10,000 or so words for this job and need to do my final check before sending it to the client, I expect to be a little tired, and I want to use my attention and energy to focus on the accuracy and reading comfort of my translation. In doing so I tend to miss little details like the occurrence of "Fig. 1" on page 32 as opposed to "Figure 1" on the other 40 pages. That is why I use the QA feature of memoQ to check the consistency with which I have translated the figure references as well as other matters such as the accurate use of special terminology for the project.

The specific feature I use here for quality assurance is


an auto-translation rule set (aka "autotranslatables"), which is highlighted and selected in the screenshot of the project's settings above.

As I have stated many times before, autotranslatables should be used, but not created, by the average translator. Aside from the fact that the regular expressions involved are not particularly easy even for most of the nerds among us, there are a lot of little subtleties that make the difference between a well-functioning rule set and annoying garbage, and even the "experts" struggle with this for sophisticated rules.

But the present example of Figure mapping is a comparatively simple case which can illustrate the principles and some of the "risks" to mere mortals.



My rule set for mapping figures from many German forms to a particular English form consists of a single rule.

All of the possibilities that I expect in German are compiled in a list, along with the English expression for each, and this translation pair list is named #figurelist# and is found on the corresponding dialog tab in the memoQ rule set editor for autotranslatables. (I usually edit rules externally in Notepad++ where I can comment them liberally, but in this case I felt no need to do so.) This named list is used as a variable in the regular expression for the rule to describe a source text match.

(#figurelist#)\.?\s+?\b(\d+)\b
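
For readers who want to poke at this rule outside memoQ, here is a minimal sketch in Python. It is only an emulation: the #figurelist# placeholder is memoQ syntax, so the sketch builds a plain alternation from a dictionary instead, and the list entries shown are my assumption about what such a list might contain. Python's regex engine is also not the one memoQ uses, so treat the results as an approximation.

```python
import re

# Assumed contents of the translation pair list; a real #figurelist# may differ.
FIGURE_LIST = {
    "Abbildung": "Figure",
    "Abb": "Figure",
    "Figur": "Figure",
    "Fig": "Figure",
}

# Emulate the #figurelist# placeholder with a plain alternation,
# longest entries first so that "Abbildung" is tried before "Abb".
alternation = "|".join(
    re.escape(term) for term in sorted(FIGURE_LIST, key=len, reverse=True)
)
SOURCE_RULE = re.compile(rf"({alternation})\.?\s+?\b(\d+)\b")


def find_figure_refs(text):
    """Return (matched label, number) pairs found in the text."""
    return [(m.group(1), m.group(2)) for m in SOURCE_RULE.finditer(text)]


sample = "in Verbindung mit Figur 1, siehe auch Abb. 2 und Fig 3 sowie Abbildung 14"
for label, number in find_figure_refs(sample):
    # Look up the English form for whichever German label was matched.
    print(f"{label} {number} -> {FIGURE_LIST[label]} {number}")
```

Each German variant in the sample maps to the same English label, which is the whole point of keeping the variants in a list rather than in the regex itself.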

Jeepers. That regex for the source text looks complicated, doesn't it? Wouldn't (#figurelist#) \d+ be just as good? After all, it seems to work just fine. Well, except that the list would need a few extra entries to account for abbreviations with and without periods.

No. "(#figurelist#) \d+" is total, incompetent crap. Here are some reasons why:
  • It is more efficient to express the possibility of a period after the text for "Figure" with the regex "\.?",  because you'll never have to worry about abbreviations with or without periods in your lists. Mine will get longer, as I'll probably expand these rules to cover Portuguese as well and use the same rule for both Portuguese and German sources.
  • There may be a single space, extra spaces or a different kind of space after the Figure expression. Simply typing a standard space after the (#figurelist#) group means that exactly one must be present and it must be an ordinary space to match. If someone typed two spaces or a non-breaking space (a reasonable thing to do to keep both parts of "Figure 1" on the same line), the rule will not work! Using \s+? to express 1 to n whitespace characters of any kind after "Fig." or whatever is the better way to go. (If the space might be missing altogether, as in "Abb.1", then \s* would be needed instead, because \s+? still requires at least one whitespace character.)
  • If you test the "simple" crappy regex, you'll also find that "Abb. 14" gives two results: Figure 1 and Figure 14. That is because the rule does not stipulate that the second part must be a whole "word", so the substring match with the first character also gives a result. Bad, bad, bad. The chaos that this sort of mistake can cause with more complex rules like currency expressions used in important financial translations is frightening.
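
The differences between the two patterns can be tried out with a short Python sketch. The list entries are assumed for illustration, and Python's engine stands in for memoQ's, so this shows what the regex elements themselves do, not how memoQ enumerates its match results (the double hit for "Abb. 14" is not reproduced here).

```python
import re

LABELS = r"Abbildung|Figur|Abb|Fig"  # assumed list entries, for illustration only

naive = re.compile(rf"({LABELS}) \d+")
robust = re.compile(rf"({LABELS})\.?\s+?\b(\d+)\b")

cases = [
    "Abb 7",        # ordinary space, no period
    "Abb. 7",       # abbreviation with a period
    "Abb.\u00a07",  # non-breaking space after the period
    "Abb  7",       # two ordinary spaces
    "Abb 7a",       # digits glued to a letter, so not a whole "word"
    "Abb.7",        # no whitespace at all
]


def matched_text(pattern, text):
    """Return the text matched by the pattern, or None if nothing matches."""
    m = pattern.search(text)
    return m.group(0) if m else None


for text in cases:
    print(repr(text), "| naive:", repr(matched_text(naive, text)),
          "| robust:", repr(matched_text(robust, text)))
```

In this sketch the naive pattern only copes with the first case and quietly takes "Abb 7" out of "Abb 7a", while the pattern with \.?, \s+? and \b handles the period, the non-breaking space and the doubled space. Neither matches "Abb.7", because \s+? needs at least one whitespace character.
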

The regex for the result also appears more complex than it should be, but there is a reason behind that as well. Instead of the simple $1 $2 (first group followed by a space followed by the second group), I specified output with a non-breaking space, because it looks rather unfortunate to have a line wrap in the middle of the expression for a figure. One sees that a lot, because it's a nuisance to remember to type non-breaking spaces all the time on the keyboard. This rule can also be used to check the use of the non-breaking space; an ordinary space will generate a warning when the memoQ QA profile is run with the autotranslatables check activated.
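
The idea behind that check can be sketched roughly in Python. This is not how memoQ implements its QA; the function names, the assumed list entries and the fixed "Figure" label are all hypothetical and only illustrate the principle of building the expected target text with a non-breaking space and then looking for it in the translation.

```python
import re

NBSP = "\u00a0"  # non-breaking space

# Assumed list entries, written out as a plain alternation.
SOURCE_RULE = re.compile(r"(Abbildung|Figur|Abb|Fig)\.?\s+?\b(\d+)\b")


def expected_targets(source, label="Figure"):
    """Build the expected English references, joined with a non-breaking space."""
    return [f"{label}{NBSP}{m.group(2)}" for m in SOURCE_RULE.finditer(source)]


def missing_in_target(source, target):
    """Return the expected references that do not appear in the target text."""
    missing = []
    for ref in expected_targets(source):
        # The lookahead keeps "Figure 1" from being satisfied by "Figure 14".
        if not re.search(re.escape(ref) + r"(?!\d)", target):
            missing.append(ref)
    return missing


source = "wie in Abb. 3 gezeigt"
print(missing_in_target(source, f"as shown in Figure{NBSP}3"))  # nothing to report
print(missing_in_target(source, "as shown in Figure 3"))        # ordinary space is flagged
print(missing_in_target(source, "as shown in Fig. 3"))          # inconsistent label is flagged
```

A translation with an ordinary space or with "Fig." instead of "Figure" comes back as missing the expected reference, which is the kind of little detail a tired final read-through tends to overlook.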

There are many ways in which regular expression rule sets can enhance the user experience and the quality of translation results when working in memoQ. It is not hard to use these rules, but it is beyond most users to create and maintain their own rule sets. Therefore
  • Kilgray should include more useful examples of rule sets (in addition to the very helpful number rules) in future releases of memoQ
  • The average user should ask the help of Kilgray Support for simple rules they need (in most cases this would fall under the usual commitment of paid support and maintenance for the year)
  • memoQ users should work with Kilgray's Professional Services department or other competent consultants to devise robust rule sets to boost their translation and quality assurance productivity. Beware of casual advice found in forums or social media; much of it does not consider issues like the problems described above despite the aggressive insistence one might see for a particular "solution". Truly, you get what you pay for :-)

Post scriptum:
An yet ye hack by night and sun, the work of regex be never done.
Of course something was forgotten in the example here. The myriad styles and customs of source text authors will inevitably offer up challenging variants to break your well-crafted rules. Today's is a text full of figure references like Abbildung 4.12, which would refer to the twelfth figure in the fourth chapter. For this the modified rule might be 

(#figurelist#)\.?\s+?(\b\d+\.?\d+?\b) 

Or perhaps not quite. Try it and you'll see a few problems. This is just another example of why it is good to make use of professional resources to help you with these challenges and to have a systematic way of recording and elaborating them. I'll explain more about such an effective system for planning and documentation in a future article. I've noticed that the "experts" in the translation field often care little for the usual standards of project specification, perhaps because they are sick and tired of translation projects with so many specification documents for those who know better.
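
For those who want to try the modified rule without opening memoQ, here is a small Python sketch. The list entries are assumed, Python's regex engine may differ from memoQ's in details, and the "candidate" pattern is only one possible repair that I have not tested in a real rule set.

```python
import re

LABELS = r"Abbildung|Figur|Abb|Fig"  # assumed list entries, for illustration only

# The modified rule as posted above.
posted = re.compile(rf"({LABELS})\.?\s+?(\b\d+\.?\d+?\b)")

# One possible repair (an assumption, not a tested memoQ rule):
# the "period plus digits" part is optional as a unit.
candidate = re.compile(rf"({LABELS})\.?\s+?(\b\d+(?:\.\d+)?\b)")


def number_of(pattern, text):
    """Return the figure number captured by the pattern, or None if no match."""
    m = pattern.search(text)
    return m.group(2) if m else None


for text in ["Abbildung 4.12", "Abbildung 12", "Abbildung 4", "siehe Abbildung 4. Danach"]:
    print(repr(text), "| posted:", number_of(posted, text),
          "| candidate:", number_of(candidate, text))
```

The posted pattern demands at least two digits in total, because both \d+ and \d+? need a digit of their own, so a plain "Abbildung 4" no longer matches at all. Making the period and the following digits optional together avoids that in this sketch, though other variants (such as "4.1.2" or "4a") would still need thought.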

1 comment:

  1. You end your post with "Truly you get what you pay for", but I get excellent advice from you for free on your blog! >:->
