Jan 23, 2014

memoQ auto-translation regex blues

Can you write a regular expression that matches each of the three character sequences marked in green below?
abcdefg
abcde
abc
That's how one interactive tutorial I found starts off teaching regular expressions. I found that a bit puzzling, then noticed that the right side of the web page offered a list of notes on "regex" expressions. I've spent too much time in the last two weeks trying to sort out a variety of memoQ auto-translation rules with regular expressions, so the answer came quickly and I typed it in the text field. Then I looked again and realized there were a few alternatives that would work. So I tested them. And then a few more. Various expressions that would work include
[a-z]+
[a-g]+
[\w]*
[abcdefg]*
.+
\D+
and quite a few others. But which is correct? As with word choice in translation, that depends on context, and that's where it gets hard.
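
A quick way to see both points (all six expressions match the three samples, yet they differ as soon as the context changes) is a small Python check. This is only a sketch: Python's re module stands in for the tutorial's engine, and the helper names are made up for illustration.

```python
import re

SAMPLES = ["abcdefg", "abcde", "abc"]
PATTERNS = [r"[a-z]+", r"[a-g]+", r"[\w]*", r"[abcdefg]*", r".+", r"\D+"]


def matches_all(pattern, samples=SAMPLES):
    """True if the pattern matches every sample string in full."""
    return all(re.fullmatch(pattern, s) for s in samples)


def matching_patterns(text):
    """Return the patterns that match the given text in full."""
    return [p for p in PATTERNS if re.fullmatch(p, text)]


if __name__ == "__main__":
    print(all(matches_all(p) for p in PATTERNS))
    print(matching_patterns("abc123"))  # digits rule out most candidates
    print(matching_patterns("xyz"))     # letters outside a-g rule out others
    print(matching_patterns(""))        # the * variants accept an empty string
```

Feeding in strings the tutorial never showed, such as "abc123" or an empty string, is what separates the candidates from one another.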

Those who have delved into the configurations of various translation environment tools such as SDL Trados Studio or memoQ have seen that regular expressions are used in different ways to identify patterns in information and then filter or transform that information based on those patterns.

Although there is great power in regular expressions, I see their current role in memoQ configuration as more of a liability as far as the average user is concerned. Regular expressions are currently part of memoQ configuration for at least two import/conversion filters (a text filter and a tagger), segmentation rules and "auto-translation". I find the last feature particularly troublesome in its current form.

There are many misunderstandings about what auto-translation is in memoQ. It has nothing whatsoever to do with machine translation, which some propagandists prefer to call "automatic translation" to gloss over the many difficulties it can cause. Nor is it part of the pre-translation feature, though it can be applied in pre-translation to deal with things like catalog numbers, dates and figures in tables rather efficiently.

Auto-translation in memoQ is used to convert certain patterned information into the format needed in translation. In monolingual editing projects, it can also be used to unify the formatting used for things like dates and currency expressions.

memoQ ships with a number of standard rule sets for number conversions. The "English group" that I use consists of ten different rules for converting most of the screwy number formats I encounter with separators for decimals (periods or commas) and 3-order magnitude groups (thousands, millions, billions, etc.) that might be grouped with spaces, apostrophes, periods or something else. The rule for converting hundreds of millions with decimal fractions to my usual preference is:

(?<!(,|\.|\d|\d\s|\d'|\d’))([-|\u2212]?[\d]{1,3})(?:\.|,|\s|'|’)(\d\d\d)(?:\.|,|\s|'|’)(\d\d\d)(?:\.|,)([\d]{1,2}|[\d]{4,})(?!(,\d|\.\d|\d|\s\d|'\d|’\d))

with the replacement rule $2,$3,$4.$5

Pretty damned intimidating for most of us. In fact, most of the people who grasp the basics of regex syntax will scratch their heads over the number assignments of the groups until they realize that non-capturing groups in parentheses (like (?:\.|,)) don't count, while the capturing group inside the opening lookbehind does, which is why the replacement starts at $2. Nobody points that out in any tutorial I've read.
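
The group numbering can be checked with a small Python sketch. The pattern below is a simplified, hypothetical cousin of the shipped rule: it handles thousands only and uses single-character lookbehind alternatives, because Python's re module requires fixed-width lookbehind, unlike the regex flavor memoQ uses. Python also writes \2 where memoQ writes $2.

```python
import re

# Simplified illustration, NOT the rule shipped with memoQ.
RULE = re.compile(
    r"(?<!(\d|,))"   # group 1: capturing group inside the negative lookbehind
    r"(\d{1,3})"     # group 2: leading digits
    r"(?:\.|\s)"     # non-capturing: thousands separator, gets no number
    r"(\d{3})"       # group 3: thousands block
    r"(?:,)"         # non-capturing: decimal comma, gets no number
    r"(\d{1,2})"     # group 4: decimal fraction
)


def convert(text):
    """Rewrite 1.234,56 as 1,234.56 using groups 2, 3 and 4."""
    return RULE.sub(r"\2,\3.\4", text)


if __name__ == "__main__":
    print(RULE.groups)                      # number of capturing groups
    print(convert("Total: 1.234,56 EUR"))
```

Only parentheses that capture are counted, from left to right, and that includes the one hidden inside the lookbehind even though it never holds text in a successful match.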

If I suggested to most of my esteemed colleagues that they really need to learn this stuff, I think most juries in the civilized world would refuse to convict them of murdering me for the mental cruelty inflicted. There are brilliant translation technology consultants like Marek Pawelec, who eat stuff like this with their breakfast cereal and are a priceless resource to colleagues and corporate clients who need their expertise... and then there are the rest of us.

Kilgray CEO István Lengyel told me recently that there are plans to expand the examples shipped with memoQ later this year to include some reformatting for certain date structures and other information. Language Terminal has a few examples of useful conversion rules for dates, unusual number formats, e-mail addresses and more. You don't need to know regex to use these, just how to download the MQRES files and click Import in one of the memoQ modules for managing auto-translation rules to bring these into your set-up. Once there they can also be used in QA checks.

I think it would be nice if Kilgray or someone more expert than yours truly would produce a "recipe book" with clear documentation of examples so that users of more limited skill (like me) can adapt these examples to their specific needs. Some typical uses I see (some of which have eaten my evenings recently) are
  • stripping spaces from or adding them to numbers with percent signs, such as 37,5 % <> 37.5%
  • formatting other numbers with units according to a preferred convention, such as 1,2A >> 1.2 A
  • currency expression reformatting like
          TEUR 1.350 >> EUR 1,350 thousand
          34.664,45 €  >> €34,664.45
          € 1,2 Mrd.  >> €1.2 billion
          & cetera
  • Legal references such as
          § 15 Abs. 1 Nr. 3 GebrMG
          § 15(1) Nr. 3 GebrMG
          § 9 S. 2 Nr. 1 PatG
  • page designations and other elements often found in bibliographies and easily overlooked (for QA purposes)
  • conversion of dates like 23.05.67 to
          23 May 1967
          May 23, 1967
          1967-05-23
              or whatever other format one prefers, with or without non-breaking spaces
  • Lovely EU legislation designations like 93/42/EWG >> 93/42/EEC
  • Telephone number reformatting such as (0211) 45 66 - 500 >> +49 211 4566-500
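
Two of the simpler items in the list above (percent signs and numbers with units) might look like this as a rough Python sketch. The patterns and the short unit list are illustrative assumptions, not rules taken from memoQ or Language Terminal.

```python
import re

# "37,5 %" -> "37.5%": decimal comma to point, no space before the percent sign.
PERCENT = re.compile(r"(\d+),(\d+)\s?%")

# "1,2A" -> "1.2 A": decimal comma to point, one space before the unit.
# The unit list is a made-up sample; a real rule would need the client's units.
UNIT = re.compile(r"(\d+),(\d+)\s?(mA|kV|A|V|W)\b")


def reformat(text):
    """Apply the percent rule first, then the unit rule."""
    text = PERCENT.sub(r"\1.\2%", text)
    return UNIT.sub(r"\1.\2 \3", text)


if __name__ == "__main__":
    print(reformat("37,5 %"))
    print(reformat("1,2A"))
    print(reformat("bei 1,2 A und 37,5%"))
```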
Some might wonder about the date conversion above with the two-digit years. As a veteran of the Y2K scam, I'm fond of two-digit years; they were part of my ticket over the Atlantic. The implementation of regular expressions in memoQ allows for the use of custom reference lists, which are delimited by hash symbols (#) in the expressions. So when I created my rule for converting those two-digit years in dates
(\d{1,2})\s?\.\s?(#month-num-to-text#)\s?\.\s?(#21st-Century-2digit#)  and
(\d{1,2})\s?\.\s?(#month-num-to-text#)\s?\.\s?(\d{2})

with the replacement rules $1 $2 20$3 and $1 $2 19$3 respectively,
I used two custom lists, one with translation pairs like 01 = Jan. and the other with the two-digit years I felt like assigning to the 21st century (i.e. the ones I am most likely to encounter that belong there, like 00 through 19; I'll have to adjust this in some cases).
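
The two rules can be approximated in Python to see how they interact. This is a sketch under stated assumptions: a dictionary and a list stand in for the memoQ custom lists, the month abbreviations beyond "Jan." are guesses, and applying the rules one after another is only a crude model of how memoQ evaluates a rule set.

```python
import re

# Hypothetical stand-ins for #month-num-to-text# and #21st-Century-2digit#.
MONTHS = {
    "01": "Jan.", "02": "Feb.", "03": "Mar.", "04": "Apr.",
    "05": "May", "06": "June", "07": "July", "08": "Aug.",
    "09": "Sept.", "10": "Oct.", "11": "Nov.", "12": "Dec.",
}
YEARS_21ST = [f"{y:02d}" for y in range(20)]  # 00 through 19

month_alt = "|".join(MONTHS)
year_alt = "|".join(YEARS_21ST)

# Each rule is a (pattern, century prefix) pair.
RULE_21ST = (re.compile(rf"(\d{{1,2}})\s?\.\s?({month_alt})\s?\.\s?({year_alt})"), "20")
RULE_20TH = (re.compile(rf"(\d{{1,2}})\s?\.\s?({month_alt})\s?\.\s?(\d{{2}})"), "19")


def apply_rules(text, rules):
    """Apply each rule in list order, looking the month up in MONTHS."""
    for pattern, century in rules:
        text = pattern.sub(
            lambda m, c=century: f"{m.group(1)} {MONTHS[m.group(2)]} {c}{m.group(3)}",
            text,
        )
    return text


if __name__ == "__main__":
    print(apply_rules("23.05.67", [RULE_21ST, RULE_20TH]))
    print(apply_rules("01.02.14", [RULE_21ST, RULE_20TH]))
    print(apply_rules("01.02.14", [RULE_20TH, RULE_21ST]))  # catch-all rule first
```

With the catch-all rule first, every year lands in the 20th century, because the more general pattern swallows the dates before the more specific one gets a chance.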

In the case of my two-digit year conversion rules, the rule order is important. The conversion will not work as planned if the rules appear in reverse order. THIS IS A MAJOR PROBLEM with the current rules editor for memoQ auto-translatables. Each time a rule is edited, it goes to the bottom of the list. I'm currently working on a complex set of about 20 rules for converting financial expressions, and rule order is critical for several subgroups of rules (this was disputed after I originally made this post and I backed down... however, subsequent tests have proven that rule order is indeed critical!!!). So editing them in memoQ is a nightmare. (One expert told me how he generates a basic rule set in memoQ, exports it and does all further rule editing in Notepad++, which also allows him to keep better track of his work with comments.) Some current problems with the edit dialog for auto-translation rules in memoQ are
  • the need for "order stability" for rules being edited to maintain grouping (for a better overview), along with arrow buttons to move rules up and down easily for better grouping
  • insufficient field width/height and bad scrolling behavior, so that it is very difficult to edit long expressions - I usually have to paste them into Notepad to keep an overview while I work
  • strange, severe bugs in the test window, so the rule results shown are sometimes not accurate; I deal with this by adding a test document with sample data to the project and looking at what the rules do with that text
  • helpful <!-- comments inserted in the rules --> to explain them disappear when MQRES files are imported, and there is no way to maintain explanatory comments to keep track of one's own work in the editor (except for the pitifully limited comment box in the resource properties dialog)
A few people I've mentioned auto-translatables to have said to me that they would have no use for them, because they use dictation software. I do that myself, but I find that in many cases (like for those legal references) it is nicer to have a rule configured to client preferences, so I can use a single keystroke to insert
Section 55(2) No. 3 Sentence 2
Section 55 Paragraph 2 No. 3 Sentence 2  or
Sect. 55(2) No. 3 S. 2
as the job calls for. And run a QA check to confirm that I have formatted consistently.
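
A rule of this kind boils down to one pattern with a different replacement per client. The Python sketch below assumes a German source reference of the form "§ 55 Abs. 2 Nr. 3 S. 2" (the source text behind the three variants above is not given, so this is a guess), and the style names are invented for illustration.

```python
import re

# Assumed source form: "§ 55 Abs. 2 Nr. 3 S. 2" or "... Satz 2".
LEGAL_REF = re.compile(
    r"§\s?(\d+)\sAbs\.\s?(\d+)\sNr\.\s?(\d+)\s(?:S\.|Satz)\s?(\d+)"
)

# One replacement template per (hypothetical) client preference.
STYLES = {
    "short": r"Section \1(\2) No. \3 Sentence \4",
    "long": r"Section \1 Paragraph \2 No. \3 Sentence \4",
    "abbreviated": r"Sect. \1(\2) No. \3 S. \4",
}


def convert(text, style):
    """Rewrite every matching legal reference in the chosen style."""
    return LEGAL_REF.sub(STYLES[style], text)


if __name__ == "__main__":
    for name in STYLES:
        print(convert("§ 55 Abs. 2 Nr. 3 S. 2", name))
```

Because the pattern stays the same and only the template changes, the same expression can also serve as the basis for a consistency check on the target text.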


Dominique Pivard posted a nice video about memoQ auto-translatables on his "vlog" a while ago. It's worth a look if you want to see how to create these resources and see a good demonstration of how they work.

9 comments:

  1. Paul Filkin explained what ?: does: http://multifarious.filkin.com/2013/01/18/dogs-and-cats-regular-expressions-part-4/
    As with so many features of memoQ my first reaction is: let them first fix the bugs, then we'll see what improvements can be made to the functionality. Last July I pointed out to Support that numbers in unit conversions have the wrong decimal separator. It's still not fixed.

    1. You're right. Paul does cover that topic well! I owe him a thank you for that. I've seen some of his other posts on regex and the recommendation of his favorite "cheat tool". I'm probably an idiot for not dropping the $20 or whatever it is on that weeks ago.

      I haven't done much with the unit conversions page yet. What's the specific issue with that?

    2. Let's say I have a measurement in inches 24.5" and I want centimeters. The number is calculated correctly as 62.23cm (although one might argue that it conveys greater accuracy than is justifiable based on the original number) but for my language I should see a comma, not a period. And there should be a space between the number and the unit.

    3. You made me curious, so I was messing around with the auto-translation conversions while multitasking with a BBC episode of Father Brown. It's a wonder there's anyone left alive in that silly village at the rate they all die.

      There seems to be more than the trouble you describe unless you can get a correct answer for a conversion of 32 °F to centigrade :-)

  2. It works with scale 0.555555 and offset -17.777778 (taken from http://kilgray.com/memoq/60/help-en/index.html?edit_auto_translatables.html)
    However, the - in negative numbers is ignored. Not a big deal when 0 in the source unit equals 0 in the target unit, but 0 °F is not 0 °C so it gets all messed up.

    1. Ack. I should have realized that solution given the order of operations, but as you said, negatives are ignored and that's not particularly helpful ;-)

  3. When it comes to auto-conversion, it gets worse if decimals are included. I haven't tested the latest version, but at least in v. 5 decimal values were completely ignored for automatic conversion.

    BTW, when it comes to date format conversion, I've created something like this for English to Polish:
    (#day#),?\s(#month#)\s(\d{1,2})(?:st|nd|rd|th)?\s(\d{4}) > $1 $3. $2 $4 r.

    #day# and #month# are translation pairs lists.
    Source format: Monday, January 1st 2013 or Tuesday February 2 2013 or Saturday, March 23rd 2013,
    Target format: poniedziałek, 1. stycznia 2013 r.

    1. Marek, I've been looking at the latest set of financial auto-translation rules you showed me. That's about as good as I can imagine for practical documentation as you do it in Notepad++. And following your suggestions on editing and testing practices is a big help to me in overcoming the insanity of the integrated editor's limits.

      It's also been interesting to see auto-translation / regex questions on the Yahoo list in the past few days... always the same problems, time and again. Just little differences for particular languages. There is so much potential to reduce "pain" and improve QA if more relevant, well-documented examples are made available for use and adaptation!

    2. The decimal value issue was fixed in version 6.2.21 after I reported it to Support in July last year.


Notice to spammers: your locations are being traced and fed to the recreational target list for my new line of chemical weapon drones :-)