Feb 20, 2017

Building a regex-savvy "termbase" in memoQ


For years I have been frustrated by and dissatisfied with how abbreviations are handled in the current memoQ termbase model. The crux of the problem is the handling of the periods in the expressions. This can be seen with termbase entries like the following, for example:


If the abbreviation "Art." appears in the source text, only the second source entry - the one without the period - will give a match result in memoQ. The first entry is simply ignored.

An additional problem which one would face, even if the terminal period character in the term did not pose a problem, is that authors are often notoriously variable in the way they write abbreviations. Take, for example, the abbreviation for the German expression "in Verbindung mit", usually written as "i.V.m."

In recent legal translation work, I have encountered this expression written as above, but also as "i. V. m." (with spaces), "iVm" (no spaces, no periods) and sloppily typed variations like "iV.m" or "i. V.m." What's a poor wordworker to do?

The answer came to me while refining a set of auto-translation rules for bibliography formatting and legal references. These can suffer from similar troubles: "page 7" might be abbreviated as "p. 7", but in the sloppy chaos of poorly edited source texts one might find "p.7", "p 7", "p7" or even variations with the letter capitalized, like "P.7". If you are translating nearly 1000 references in a bibliography, robust shortcuts save a lot of time, and if those shortcuts are based on memoQ auto-translation rules, they can also be used in a QA profile to ensure that every bit matches correctly.
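To make the page-reference case concrete, here is a minimal sketch in Python. It is only a stand-in: memoQ auto-translation rules are configured inside memoQ itself and use its own regex engine, and the names PAGE_REF and normalize_page_refs are invented for this illustration. The regex syntax used is explained further down in this post.

```python
import re

# Hypothetical pattern for illustration: the letter "p" in either case,
# an optional period, an optional whitespace character, then the page number.
# \b keeps the pattern from firing on a "p" inside a word such as "step7".
PAGE_REF = re.compile(r"\bp\.?\s?(\d+)", re.IGNORECASE)


def normalize_page_refs(text):
    """Rewrite every tolerated variant as 'p. <number>'."""
    return PAGE_REF.sub(r"p. \1", text)


for variant in ["p. 7", "p.7", "p 7", "p7", "P.7"]:
    print(repr(variant), "->", repr(normalize_page_refs(variant)))
```

All five variants from the paragraph above come out as "p. 7" in this sketch.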

As the screen capture from a memoQ Facebook group above suggests, the way to go about this is to identify which parts of the expression might vary with different deliberate and accidental typing. These are usually spaces and periods in the case of abbreviations; sometimes, particularly with German legal abbreviations, capitalization and dashes may play roles as well. (I tore my hair out not long ago trying to understand an Austrian legal text referring to two laws, which differed in their three-letter abbreviations only by a dash inserted after the first letter of one.)

In regular expressions, the question mark character means "zero or one" of whatever character precedes the question mark. So if I want a rule that acts in the case of one or no periods, I put a question mark after the period character. And because in the language of regular expressions, a period is shorthand for any character, if I want to talk about an actual period ("."), I have to precede that character by a backslash ("\."). In the technical jargon of Nerdworld that is known as "escaping the period" and there is no escaping such syntax if you want a regular expression rule about periods, period.
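The difference between an escaped and an unescaped period is easy to try out. The following sketch uses Python's re module purely as a test bench (memoQ has its own regex engine, but these basic constructs behave the same way); the variable names are made up for the example.

```python
import re

unescaped = re.compile(r"Art.")   # "." stands for any single character
escaped = re.compile(r"Art\.?")   # a literal period, zero or one time

print(bool(unescaped.fullmatch("Arts")))  # True: the "." swallowed the "s"
print(bool(escaped.fullmatch("Arts")))    # False
print(bool(escaped.fullmatch("Art.")))    # True: one period
print(bool(escaped.fullmatch("Art")))     # True: no period
```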

Spaces (normal or non-breaking ones) are represented by an escaped lowercase "s": "\s". So a matching rule for the English abbreviation "e.g." which catches a lot of typing variations might be

e\.?\s?g\.?

And in German, the target replacement rule might be

z.B.
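A quick way to see which variations the matching rule catches is to run it outside memoQ. This is a rough sketch in Python, assuming "z.B." (zum Beispiel) as the German target; the names EG, TARGET and translate_eg are invented here, and in memoQ itself the rule pair would be entered in the auto-translation rule editor rather than in code.

```python
import re

# The matching rule from above, unchanged.
EG = re.compile(r"e\.?\s?g\.?")

# Assumed German target for "e.g." ("zum Beispiel").
TARGET = "z.B."


def translate_eg(text):
    """Replace every variant the rule matches with the target string."""
    return EG.sub(TARGET, text)


for variant in ["e.g.", "e. g.", "eg", "e.g", "e g."]:
    print(repr(variant), "->", repr(translate_eg(variant)))
```

Each of the five variants is replaced by the target. The rule as written also fires inside ordinary words (the "eg" in "egg", for example), which is the over-matching concern raised in the comment below this post.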

Of course, if a typist is sloppy, there might be more than one space, or a comma might be typed accidentally instead of a period (the keys are adjacent, and if your screen is as dirty as mine gets sometimes, your eyes might not notice); capitalization might also differ accidentally or based on context. The regular expressions for matching can be adapted to handle all these cases if need be.
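One possible adaptation along these lines is sketched below in Python, again only as a test bench and not as the form a memoQ rule would take. The pattern and the helper names are assumptions for illustration. The trailing comma is deliberately not tolerated, since a comma after "e.g." is often real punctuation.

```python
import re

# Hypothetical tolerant variant of the "e.g." rule:
#   \b        start at a word boundary
#   [.,]?     period or (mistyped) comma after the "e", or nothing
#   \s*       any number of whitespace characters
#   \.?       optional final period
#   (?!\w)    no letter or digit may follow, so "egg" is left alone
TOLERANT_EG = re.compile(r"\be[.,]?\s*g\.?(?!\w)", re.IGNORECASE)


def _target_for(match):
    # Carry the capitalization of the source over to the assumed target.
    return "Z.B." if match.group(0)[0].isupper() else "z.B."


def translate_eg_tolerant(text):
    return TOLERANT_EG.sub(_target_for, text)


for variant in ["e.g.", "e,g.", "e.   g.", "E.g.", "eg", "egg"]:
    print(repr(variant), "->", repr(translate_eg_tolerant(variant)))
```

In this sketch "egg" passes through unchanged, while the other variants become "z.B." or, where the source began with a capital, "Z.B.".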

Rules of this type are not particularly difficult to construct, but refining them to accommodate all the variations you are likely to encounter may require an expert hand. Thus, as I have suggested before, the average user should focus on documenting all the possible source variations clearly in a table which includes the desired target equivalents, and this table should be given to an expert (Kilgray support, a qualified consultant like Marek Pawelec or a technical programmer familiar with regular expressions and their use in memoQ). Trust me, this will save a lot of frayed nerves and probably significant time and money as well.

So now I am building a few memoQ auto-translation rulesets which are essentially fault-tolerant abbreviation glossaries. These, together with the similar rulesets for formatting bibliographical references and references to sections, paragraphs, lines, margin notes, etc. in laws, have been very helpful in reducing the time spent translating messy legal source texts, and the accuracy of the work has been improved significantly. Give it a try for your translation challenges!
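The idea of a fault-tolerant abbreviation glossary can be sketched as a list of pattern and target pairs. The Python below is a rough illustration only: the English targets ("in conjunction with" for "i.V.m.", "e.g." for "z.B.") and all names are assumptions made for the example, and a real memoQ ruleset is built in memoQ's rule editor, not in code.

```python
import re

# Hypothetical German-to-English glossary: each entry tolerates missing
# periods and optional single spaces between the letters.
# (?!\w) stops a match from ending in the middle of a longer word.
GLOSSARY = [
    (re.compile(r"\bi\.?\s?V\.?\s?m\.?(?!\w)"), "in conjunction with"),
    (re.compile(r"\bz\.?\s?B\.?(?!\w)"), "e.g."),
]


def apply_glossary(text):
    """Apply every glossary rule in turn and return the rewritten text."""
    for pattern, target in GLOSSARY:
        text = pattern.sub(target, text)
    return text


for variant in ["i.V.m.", "i. V. m.", "iVm", "iV.m", "i. V.m."]:
    print(repr(variant), "->", repr(apply_glossary(variant)))
```

All five spellings of "i.V.m." mentioned earlier in this post map to the same target in this sketch.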

1 comment:

  1. Rule no. 1 in regex: it's easy to match what you need, it's much harder to match only that. Therefore I would suggest a correction:
    \be\.?\s?g\.?\b

