Feb 27, 2017

Planning special rules for structured "expressions" and multi-word abbreviations

Translators and editors often deal with what I'll call "structured expressions" or "patterned data" in many forms, which include:
  • long and short dates (2016-01-13; 1/13/16; 13.01.2016; January 13, 2016; 13th January 2016; etc.)
  • time expressions (14:35; 2:35 pm; 2:35 PM; 2:35 p.m.; etc.)
  • currency expressions (EUR 2.3 million; € 2,300,000; €2.3m; etc.)
  • legal references (Section 14a paragraph 3 line 2; section 14a (3) line 2; etc.)
  • bibliographical references for chapters, pages, margin notes, etc.
  • and much more.
There is also a wealth of abbreviations for multiple word expressions in some categories of text; favorites in German include:
  • in Verbindung mit (variously written as i.V.m., i. V. m., iVm or some typoed hybrid of the aforementioned with spaces and periods included or forgotten depending on the authors' preferences and degree of care)
  • im Sinne des (i.S.d., i. S. d., iSd, etc.)
These can be devilishly hard to check efficiently for consistency or other quality factors in a long text, and for the translation, there is often no single "right" way to format the target text equivalents, with many individual preferences to be found with translation buyers. Even with a good style guide (all too rare anyway), these issues can be challenging time-wasters.
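
To make the abbreviation problem concrete, here is a minimal sketch in Python of how one pattern can catch the spelling variants listed above. It is an illustration only, not the rule syntax of memoQ, Xbench or any other tool; the function names and the choice of "i.V.m." as the preferred form are assumptions made for the example.

```python
import re
from collections import Counter


def abbreviation_pattern(letters):
    """Build a pattern for a multi-word abbreviation such as 'iVm'.

    Each letter may be followed by an optional period, and letters may be
    separated by at most one whitespace character. (Hypothetical helper.)
    """
    parts = [re.escape(ch) + r"\.?" for ch in letters]
    # \b keeps the match from starting inside a word;
    # (?!\w) keeps it from ending inside one.
    return re.compile(r"\b" + r"\s?".join(parts) + r"(?!\w)")


def count_variants(text, letters):
    """Count each distinct spelling found, for a consistency check."""
    return Counter(abbreviation_pattern(letters).findall(text))


def normalize(text, letters, preferred):
    """Rewrite every variant to the preferred (assumed) house style."""
    return abbreviation_pattern(letters).sub(preferred, text)


sample = "§ 5 i. V. m. § 7, Art. 3 iVm Art. 4 und § 2 i.V.m § 9"
print(count_variants(sample, "iVm"))
print(normalize(sample, "iVm", "i.V.m."))
```

One caveat with this sketch: the optional final period means that an abbreviation at the very end of a sentence swallows the sentence's own period. That is exactly the kind of case worth writing down before asking someone to build the rule.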

Translation assistance tools such as Apsic Xbench, SDL Trados Studio and others, even memoQ, have various approaches to making life easier for a translator or editor faced with these challenges. Unfortunately for most people, these approaches usually involve the use of "regular expressions", or "regex" as nerds affectionately call them. Not an easy thing even for many hardcore techies!

On past occasions when I have written about the use of regex in translation tools, I have usually stated clearly that the best approach for the best, most reliable results is to have the regex "rules" for handling the text developed by a knowledgeable third party. The experts who deal with this stuff routinely can often reduce a task that would take a semi-skilled person like myself hours or even days to the time for a coffee break, and even if a task takes a while and runs up a bit of a bill, it's much more likely to be done right the first or second time.

But there is usually a catch. Most of these regex fire-eaters are not skilled in mind reading, many are not translators, and even those familiar with translation challenges might not be familiar with your working languages or your particular subject areas and their possibly unique challenges. So effective communication is really, really important (it always is, of course, but here even more so if you are dealing with a verbally challenged, monolingual math freak who might be your local expert for regex).

Even for areas I know reasonably well and languages I more or less master, I am often frustrated by help requests from colleagues and clients who need special rulesets developed for a client's preferences for date and currency information, because the request is not clear in its scope and detail, and many important cases are left out, so the end result is not fully satisfactory.

Over the years and with a lot of back and forth (sometimes inside my own head with yours truly as my nightmare of a "client"), I have developed a system of simple documentation for planning and testing rules to help translate and quality check patterned information or multi-word abbreviations. This system provides an easy structure for non-techies (or even hardcore techies) to organize the help request for most efficient handling. Here is an example of part of such a planning sheet for a recent project involving Arabic:


When the time comes to test, just copy the source text column into a separate file, add whatever variations you want to the examples to test your accommodation of typos, etc., and then load that file as a "translation text" for testing in your working environment. If you have the same information for another, overlapping language pair, such as German and English, it is easy to couple that to make a ruleset which maps multiple source languages to a target language. An example of such a result is a memoQ auto-translation ruleset for mapping long dates and month-plus-day dates from German, English, French and Spanish into Portuguese which can be obtained here.
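
The same test idea can be sketched outside a translation tool. The Python below treats each row of a planning sheet as a source example, the expected target text and a comment, then reports every row where a rule produces something else. The rows, the rule and the "January 13, 2016" target format are invented for illustration, and the rule is ordinary Python regex rather than memoQ auto-translation syntax.

```python
import re

# German month names mapped to English (the rule's lookup table).
MONTHS = {
    "Januar": "January", "Februar": "February", "März": "March",
    "April": "April", "Mai": "May", "Juni": "June", "Juli": "July",
    "August": "August", "September": "September", "Oktober": "October",
    "November": "November", "Dezember": "December",
}

# Day, period, optional space (tolerates a common typo), month name, year.
LONG_DATE = re.compile(
    r"\b(\d{1,2})\.\s?(" + "|".join(MONTHS) + r")\s(\d{4})\b"
)


def translate_long_dates(text):
    """Rewrite German long dates in an assumed US style."""
    def repl(match):
        day, month, year = match.groups()
        return f"{MONTHS[month]} {int(day)}, {year}"
    return LONG_DATE.sub(repl, text)


# Hypothetical planning sheet rows: (source, expected target, comment).
PLANNING_SHEET = [
    ("13. Januar 2016", "January 13, 2016", "standard long date"),
    ("13.Januar 2016", "January 13, 2016", "typo: missing space"),
    ("03. Mai 2016", "May 3, 2016", "leading zero dropped"),
    ("am 1. März 2017 unterzeichnet", "am March 1, 2017 unterzeichnet",
     "date inside a sentence"),
]


def run_sheet(rows):
    """Return the rows for which the rule output differs from the sheet."""
    failures = []
    for source, expected, comment in rows:
        actual = translate_long_dates(source)
        if actual != expected:
            failures.append((source, expected, actual, comment))
    return failures


for failure in run_sheet(PLANNING_SHEET):
    print("MISMATCH:", failure)
```

Adding a new variation to test is then just a matter of adding another row, which keeps the examples and the expectations in one place.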

This simple, tabular approach to data collection to plan regular expression rules has made me a lot more efficient at such tasks and facilitated the re-use of data to make new rulesets for clients and colleagues (or myself) as needs arise. The liberal commenting of examples can be very helpful; information to include which could affect rule structure might involve capitalization, location in a sentence, variations or differences in particular contexts, etc.

For my own work, rulesets include a series for dates, currency and legal reference formats from German to English for generic and client-specific use for US and UK English. With the help of these tabular planning sheets, I can adapt any of these quickly for most other languages.
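
As a rough illustration of why one set of source patterns can serve several target conventions, the Python sketch below keeps a single pattern for German numeric dates and swaps only the output template per variant. The variant labels and the two output formats are assumptions standing in for real client preferences, and the code is not the syntax of any translation tool.

```python
import re

# German short date: day.month.year with a four-digit year.
NUMERIC_DATE = re.compile(r"\b(\d{1,2})\.(\d{1,2})\.(\d{4})\b")

# Hypothetical target conventions; a client style guide may differ.
TARGET_FORMATS = {
    "en-US": "{month}/{day}/{year}",
    "en-GB": "{day:02d}/{month:02d}/{year}",
}


def convert_numeric_dates(text, variant):
    """Rewrite German numeric dates using the template for the variant."""
    template = TARGET_FORMATS[variant]

    def repl(match):
        day, month, year = (int(group) for group in match.groups())
        return template.format(day=day, month=month, year=year)

    return NUMERIC_DATE.sub(repl, text)


print(convert_numeric_dates("Stand: 13.01.2016", "en-US"))
print(convert_numeric_dates("Stand: 13.01.2016", "en-GB"))
```

Supporting another client preference then means adding one template rather than rewriting the source pattern.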

For tracking the development of rules and their improvement history I have another set of templates which I use for systematic planning and identification of areas to improve. That will be discussed on another occasion.
