Feb 27, 2017

Planning special rules for structured "expressions" and multi-word abbreviations

Translators and editors often deal with what I'll call "structured expressions" or "patterned data" in many forms, which include:
  • long and short dates (2016-01-13; 1/13/16; 13.01.2016; January 13, 2016; 13th January 2016; etc.)
  • time expressions (14:35; 2:35 pm; 2:35 PM; 2:35 p.m.; etc.)
  • currency expressions (EUR 2.3 million; € 2,300,000; €2.3m; etc.)
  • legal references (Section 14a paragraph 3 line 2; section 14a (3) line 2; etc.)
  • bibliographical references for chapters, pages, margin notes, etc.
  • and much more.
There is also a wealth of abbreviations for multiple word expressions in some categories of text; favorites in German include:
  • in Verbindung mit (variously written as i.V.m., i. V. m., iVm or some typoed hybrid of the aforementioned with spaces and periods included or forgotten depending on the authors' preferences and degree of care)
  • im Sinne des (i.S.d., i. S. d., iSd, etc.)
These can be devilishly hard to check efficiently for consistency or other quality factors in a long text, and for the translation, there is often no single "right" way to format the target text equivalents, with many individual preferences to be found with translation buyers. Even with a good style guide (all too rare anyway), these issues can be challenging time-wasters.

Translation assistance tools such as Apsic Xbench, SDL Trados Studio and others, even memoQ, have various approaches to making life easier for a translator or editor faced with these challenges. Unfortunately for most people, these approaches usually involve the use of "regular expressions", or "regex" as nerds affectionately call them. Not an easy thing even for many hardcore techies!

On past occasions when I have written about the use of regex in translation tools, I have usually stated clearly that the best approach for the best, most reliable results is to have the regex "rules" for handling the text developed by a knowledgeable third party. The experts who deal with this stuff routinely can often reduce a task that would take a semi-skilled person like myself hours or even days to the time for a coffee break, and even if a task takes a while and runs up a bit of a bill, it's much more likely to be done right the first or second time.

But... there's usually a catch. Most of these regex fire-eaters are not skilled in mind reading, many are not translators, and even those familiar with translation challenges might not be familiar with your working languages or your particular subject areas and their possibly unique challenges. So effective communication is really, really important (it always is, of course, but here even more so if you are dealing with a verbally challenged, monolingual math freak who might be your local expert for regex).

Even for areas I know reasonably well and languages I more or less master, I am often frustrated by help requests from colleagues and clients who need special rulesets developed for a client's preferences for date and currency information, because the request is not clear in its scope and detail, and many important cases are left out, so the end result is not fully satisfactory.

Over the years and with a lot of back and forth (sometimes inside my own head with yours truly as my nightmare of a "client"), I have developed a system of simple documentation for planning and testing rules to help translate and quality check patterned information or multi-word abbreviations. This system provides an easy structure for non-techies (or even hardcore techies) to organize the help request for most efficient handling. Here is an example of part of such a planning sheet for a recent project involving Arabic:


When the time comes to test, just copy the source text column into a separate file, add whatever variations you want to the examples to test your accommodation of typos, etc., and then load that file as a "translation text" for testing in your working environment. If you have the same information for another, overlapping language pair, such as German and English, it is easy to couple that to make a ruleset which maps multiple source languages to a target language. An example of such a result is a memoQ auto-translation ruleset for mapping long dates and month-plus-day dates from German, English, French and Spanish into Portuguese which can be obtained here.

This simple, tabular approach to data collection to plan regular expression rules has made me a lot more efficient at such tasks and facilitated the re-use of data to make new rulesets for clients and colleagues (or myself) as needs arise. The liberal commenting of examples can be very helpful; information to include which could affect rule structure might involve capitalization, location in a sentence, variations or differences in particular contexts, etc.
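For anyone who wants to try out the idea before loading anything into a translation tool, a planning sheet of this kind can double as a small test suite. The following is only a minimal sketch in Python, not a memoQ ruleset: the sheet rows, the date rule and the function names are all invented for illustration, and the regex flavor of your tool may differ in its details.

```python
import re

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]

# Hypothetical planning sheet rows: (source text, desired target, comment)
PLANNING_SHEET = [
    ("13.01.2016", "January 13, 2016", "standard numeric date"),
    ("13. 01. 2016", "January 13, 2016", "spaces typed after the periods"),
    ("3.1.2016", "January 3, 2016", "no leading zeros"),
]

# Illustrative rule: day.month.year with an optional space after each period
DATE_RULE = re.compile(r"\b(\d{1,2})\.\s?(\d{1,2})\.\s?(\d{4})\b")


def apply_rule(text):
    """Rewrite numeric dates in the text as US-style long dates."""
    def to_long_date(match):
        day, month, year = int(match.group(1)), int(match.group(2)), match.group(3)
        if not 1 <= month <= 12:
            return match.group(0)  # not a plausible date, leave it untouched
        return f"{MONTHS[month - 1]} {day}, {year}"
    return DATE_RULE.sub(to_long_date, text)


def check_sheet(rows):
    """Return every row whose actual result differs from the desired target."""
    failures = []
    for source, expected, comment in rows:
        actual = apply_rule(source)
        if actual != expected:
            failures.append((source, expected, actual, comment))
    return failures


print(check_sheet(PLANNING_SHEET))
```

An empty list means every row in the sheet came out as desired. Adding a new typo variation to the sheet and re-running the check shows immediately whether the rule still copes.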

For my own work, rulesets include a series for dates, currency and legal reference formats from German to English for generic and client-specific use for US and UK English. With the help of these tabular planning sheets, I can adapt any of these quickly for most other languages.

For tracking the development of rules and their improvement history I have another set of templates which I use for systematic planning and identification of areas to improve. That will be discussed on another occasion.

Feb 23, 2017

memoQuickie: version 8.0 begins public "beta" testing

At breakfast in the Social Media Cafe this morning:


You may have seen the hype behind the "memoQ Adriatic" rollout yesterday. AFAIK this is the first version of the software released without beta-testing, so the release is essentially a beta test. Beware.

The early reaction of one LSP project manager on the memoQ Facebook group makes many of the relevant points. The "new" features are mostly quite beside the point for most of us and are dealt with better elsewhere.

The choice of version "name" also strikes many as bizarre and out of touch. When Kilgray began to ape Microsoft and SDL by including years in the release designation, I said it was a bad idea. This apparent attempt to take cues from Apple's marketing is even worse.

I think this version can be ignored for the most part. Certainly for now in this dangerous beta (or perhaps alpha?) phase. Style is all very pretty, folks, but we need some real substance to address the challenges of translation technology today. Really.

For a "management summary" of new features it seems that the online Help file is your best bet.

Feb 20, 2017

Building a regex-savvy "termbase" in memoQ


For years I have been frustrated by and dissatisfied with how abbreviations are handled in the current memoQ termbase model. The crux of the problem is the handling of the periods in the expressions. This can be seen with termbase entries like the following, for example:


If the abbreviation "Art." appears in the source text, only the second source entry - the one without the period - will give a match result in memoQ. The first entry is simply ignored.

An additional problem which one would face, even if the terminal period character in the term did not pose a problem, is that authors are often notoriously variable in the way they write abbreviations. Take, for example, the abbreviation for the German expression "in Verbindung mit", usually written as "i.V.m."

In recent legal translation work, I have encountered this expression written as above, but also as "i. V. m." (with spaces), "iVm" (no spaces, no periods) and sloppily typed variations like "iV.m" or "i. V.m." What's a poor wordworker to do?

The answer came to me while refining a set of auto-translation rules for bibliography formatting and legal references. These, too, can suffer from similar troubles: "page 7" might be abbreviated as "p. 7", but in the sloppy chaos of poorly edited source texts one might find "p.7", "p 7", "p7" or even variations with the letter capitalized, like "P.7". If you are translating nearly 1000 references in a bibliography, robust shortcuts are very helpful and save a lot of time, and if those shortcuts are based on memoQ auto-translation rules, they can also be used in a QA profile to ensure that every bit matches correctly.
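The page reference variations mentioned above can be caught with a single tolerant pattern. Here is a rough sketch in Python, used purely for illustration and not an actual memoQ rule; the pattern, the function name and the choice of "p. 7" as the tidy output form are my assumptions.

```python
import re

# Optional period and optional space between the letter and the page number;
# the letter may be upper or lower case.
PAGE_REF = re.compile(r"\b[Pp]\.?\s?(\d+)\b")


def normalize_page_refs(text):
    """Rewrite every page reference variant in one consistent form."""
    return PAGE_REF.sub(r"p. \1", text)


for variant in ["p. 7", "p.7", "p 7", "p7", "P.7"]:
    print(repr(variant), "->", repr(normalize_page_refs(variant)))
```

In a real translation ruleset the replacement would be the target-language form the client wants rather than a tidied copy of the source.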

As the screen capture from a memoQ Facebook group above suggests, the way to go about this is to identify which parts of the expression might vary with different deliberate and accidental typing. These are usually spaces and periods in the case of abbreviations; sometimes, particularly with German legal abbreviations, capitalization and dashes may play roles as well. (I tore my hair out not long ago trying to understand an Austrian legal text referring to two laws, which differed in their three-letter abbreviations only by a dash inserted after the first letter of one.)

In regular expressions, the question mark character means "zero or one" of whatever character precedes the question mark. So if I want a rule that acts in the case of one or no periods, I put a question mark after the period character. And because in the language of regular expressions, a period is shorthand for any character, if I want to talk about an actual period ("."), I have to precede that character by a backslash ("\."). In the technical jargon of Nerdworld that is known as "escaping the period" and there is no escaping such syntax if you want a regular expression rule about periods, period.
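To make those two points concrete, here is a minimal sketch in Python. Python's `re` module is used only for illustration and is not the engine inside any translation tool, though this part of the syntax is common to most regex flavors. The pattern names are my own.

```python
import re

# Escaped period followed by "?": "Art" with or without a literal period
ESCAPED = re.compile(r"Art\.?")

# Unescaped period: "." stands for any character, so this is too permissive
UNESCAPED = re.compile(r"Art.?")

for candidate in ["Art", "Art.", "Arts"]:
    print(candidate,
          bool(ESCAPED.fullmatch(candidate)),
          bool(UNESCAPED.fullmatch(candidate)))
```

The escaped pattern accepts "Art" and "Art." but rejects "Arts", while the unescaped one accepts all three.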

Spaces (normal or non-breaking ones) are represented by an escaped lowercase "s": "\s". So a matching rule for the English abbreviation "e.g." which catches a lot of typing variations might be

e\.?\s?g\.?

And in German, the target replacement rule might be

z.B.

Of course, if a typist is sloppy, there might be more than one space, or a comma might be typed accidentally instead of a period (the keys are adjacent, and if your screen is as dirty as mine gets sometimes, your eyes might not notice); capitalization might also differ accidentally or based on context. The regular expressions for matching can be adapted to handle all these cases if need be.
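As a rough illustration of such an adaptation, the sketch below compares the simple pattern from the text with a more tolerant variant. It is written in Python rather than as a memoQ rule, and the tolerant pattern is only my assumption about one reasonable way to cover the cases just listed.

```python
import re

# The simple pattern from the text
BASIC = re.compile(r"e\.?\s?g\.?")

# A more tolerant variant: either case, a comma accepted in place of the
# first period, any number of spaces, and a word boundary check at both
# ends so that a word like "eggs" is not picked up.
TOLERANT = re.compile(r"\b[Ee][.,]?\s*[Gg](?:\.|\b)")

VARIANTS = ["e.g.", "eg", "e. g.", "e.  g.", "e,g.", "E.G."]

for variant in VARIANTS:
    print(repr(variant),
          bool(BASIC.fullmatch(variant)),
          bool(TOLERANT.fullmatch(variant)))

print(TOLERANT.findall("Fruit, E.g. apples, e,g pears, eggs"))
```

A comma after the abbreviation is deliberately left alone, because "e.g.," followed by a comma is perfectly legitimate. The price of tolerance is the occasional false positive: initials such as "E. G." in a name would match too, which is one more reason to document the expected cases in a table first.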

Rules of this type are not particularly difficult to construct, but refining them to accommodate all the variations you are likely to encounter may require an expert hand. Thus, as I have suggested before, the average user should focus on documenting all the possible source variations clearly in a table which includes the desired target equivalents, and this table should be given to an expert (Kilgray support, a qualified consultant like Marek Pawelec or a technical programmer familiar with regular expressions and their use in memoQ). Trust me, this will save a lot of frayed nerves and probably significant time and money as well.

So now I am building a few memoQ auto-translation rulesets which are essentially fault-tolerant abbreviation glossaries. These, together with the similar rulesets for formatting bibliographical references and references to sections, paragraphs, lines, margin notes, etc. in laws, have been very helpful in reducing the time spent translating messy legal source texts, and the accuracy of the work has been improved significantly. Give it a try for your translation challenges!