abcdefg
abcde
abc
That's how one interactive tutorial I found starts off teaching regular expressions. I found that a bit puzzling, then noticed that the right side of the web page offered a list of notes on "regex" expressions. I've spent too much time in the last two weeks trying to sort out a variety of memoQ auto-translation rules with regular expressions, so the answer came quickly and I typed it in the text field. Then I looked again and realized there were a few alternatives that would work. So I tested them. And then a few more. Various expressions that would work include
[a-z]+
[a-g]+
[\w]*
[abcdefg]*
.+
\D+
and quite a few others. But which is correct? As with word choice in translation, that depends on context, and that's where it gets hard.
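For anyone who wants to repeat the experiment outside the tutorial, here is a minimal Python sketch that checks the candidate expressions against the three sample strings. It assumes the tutorial wants each string matched in full, and the extra test string abc123 is my own invention to show how context narrows the choice.

```python
import re

# The three strings shown by the tutorial and the candidate patterns from the text.
samples = ["abcdefg", "abcde", "abc"]
candidates = [r"[a-z]+", r"[a-g]+", r"[\w]*", r"[abcdefg]*", r".+", r"\D+"]

# A candidate "works" if it matches every sample string in full.
results = {p: all(re.fullmatch(p, s) for s in samples) for p in candidates}

# Context decides which pattern is right: add a string the tutorial did not show
# (hypothetical example) and see which candidates still match it.
extra = "abc123"
still_matching = [p for p in candidates if re.fullmatch(p, extra)]
```

All six candidates pass on the original three strings, but only [\w]* and .+ also accept abc123. Whether that is desirable depends entirely on what the pattern is supposed to catch.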
Those who have delved into the configurations of various translation environment tools such as SDL Trados Studio or memoQ have seen that regular expressions are used in different ways to identify patterns in information and then filter or transform that information based on those patterns.
Although there is great power in regular expressions, I see their current role in memoQ configuration as more of a liability as far as the average user is concerned. Regular expressions are currently part of memoQ configuration for at least two import/conversion filters (a text filter and a tagger), segmentation rules and "auto-translation". I find the last feature particularly troublesome in its current form.
There are many misunderstandings about what auto-translation is in memoQ. It has nothing whatsoever to do with machine translation, which some propagandists prefer to call "automatic translation" to gloss over the many difficulties it can cause. Nor is it part of the pre-translation feature, though it can be applied in pre-translation to deal with things like catalog numbers, dates and figures in tables rather efficiently.
Auto-translation in memoQ is used to convert certain patterned information into the format needed in translation. In monolingual editing projects, it can also be used to unify the formatting used for things like dates and currency expressions.
memoQ ships with a number of standard rule sets for number conversions. The "English group" that I use consists of ten different rules for converting most of the screwy number formats I encounter with separators for decimals (periods or commas) and 3-order magnitude groups (thousands, millions, billions, etc.) that might be grouped with spaces, apostrophes, periods or something else. The rule for converting hundreds of millions with decimal fractions to my usual preference is:
(?<!(,|\.|\d|\d\s|\d'|\d’))([-|\u2212]?[\d]{1,3})(?:\.|,|\s|'|’)(\d\d\d)(?:\.|,|\s|'|’)(\d\d\d)(?:\.|,)([\d]{1,2}|[\d]{4,})(?!(,\d|\.\d|\d|\s\d|'\d|’\d))
with the replacement rule $2,$3,$4.$5
Pretty damned intimidating for most of us. In fact, most of the people who grasp the basics of regex syntax will scratch their heads over the number assignments of the groups until they realize that non-capturing groups (parentheses that open with ?:, like (?:\.|,)) don't count, while the plain parentheses inside the opening lookbehind do, which is why the replacement starts at $2. Nobody points that out in any tutorial I've read.
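The group numbering can be checked with a cut-down version of the rule. This is only a sketch in Python, not the shipped memoQ rule: Python's re module accepts only fixed-width lookbehinds, so the two-character alternatives are dropped, the pattern handles thousands rather than millions, and memoQ's own regex engine may differ in other details.

```python
import re

# Simplified stand-in for the shipped rule (an illustration, not the memoQ original).
pattern = re.compile(
    r"(?<!(,|\.|\d))"      # group 1: plain parentheses inside the lookbehind
    r"(-?\d{1,3})"         # group 2: leading digits
    r"(?:\.|,|\s|')"       # non-capturing group: gets no number
    r"(\d{3})"             # group 3: thousands block
    r"(?:\.|,)"            # non-capturing group: gets no number
    r"(\d{1,2})"           # group 4: decimal fraction
)

# Only the four plain (...) groups are counted.
group_count = pattern.groups

# Python writes \2 where the memoQ replacement rule writes $2.
converted = pattern.sub(r"\2,\3.\4", "Umsatz: 12.345,67 EUR")
```

Here group_count is 4 and converted is "Umsatz: 12,345.67 EUR". The replacement has to start at group 2 because group 1 belongs to the lookbehind.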
If I suggested to most of my esteemed colleagues that they really need to learn this stuff, I think most juries in the civilized world would refuse to convict them of murdering me for the mental cruelty inflicted. There are brilliant translation technology consultants like Marek Pawelec, who eat stuff like this with their breakfast cereal and are a priceless resource to colleagues and corporate clients who need their expertise... and then there are the rest of us.
Kilgray CEO István Lengyel told me recently that there are plans to expand the examples shipped with memoQ later this year to include some reformatting for certain date structures and other information. Language Terminal has a few examples of useful conversion rules for dates, unusual number formats, e-mail addresses and more. You don't need to know regex to use these, just how to download the MQRES files and click Import in one of the memoQ modules for managing auto-translation rules to bring these into your set-up. Once there they can also be used in QA checks.
I think it would be nice if Kilgray or someone more expert than yours truly would produce a "recipe book" with clear documentation of examples so that users of more limited skill (like me) can adapt these examples to their specific needs. Some typical uses I see (some of which have eaten my evenings recently) are
- stripping or adding spaces from numbers with percent signs, such as 37,5 % <> 37.5%
- formatting other numbers with units according to a preferred convention, such as 1,2A >> 1.2 A
- currency expression reformatting like
TEUR 1.350 >> EUR 1,350 thousand
34.664,45 € >> €34,664.45
€ 1,2 Mrd. >> €1.2 billion
& cetera
- legal references such as
§ 15 Abs. 1 Nr. 3 GebrMG
§ 15(1) Nr. 3 GebrMG
§ 9 S. 2 Nr. 1 PatG
- page designations and other elements often found in bibliographies and easily overlooked (for QA purposes)
- conversion of dates like 23.05.67 to
23 May 1967
May 23, 1967
1967-05-23
or whatever other format one prefers, with or without non-breaking spaces
- lovely EU legislation designations like 93/42/EWG >> 93/42/EEC
- telephone number reformatting such as (0211) 45 66 - 500 >> +49 211 4566-500
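To give an idea of what a "recipe" for a few of these might look like, here is a rough sketch in Python. The function names and patterns are my own illustrations, not rules shipped with memoQ, and Python writes \1 where memoQ writes $1. Each one covers only the simple case shown in the list above.

```python
import re

def percent_de_to_en(text):
    """37,5 % -> 37.5% (decimal comma to point, space before % removed)."""
    return re.sub(r"(\d+),(\d+)\s?%", r"\1.\2%", text)

def unit_de_to_en(text):
    """1,2A -> 1.2 A (decimal comma to point, space inserted before the unit).
    Only the units A, V and W are covered in this sketch."""
    return re.sub(r"(\d+),(\d+)\s?(A|V|W)\b", r"\1.\2 \3", text)

def eu_designation_de_to_en(text):
    """93/42/EWG -> 93/42/EEC."""
    return re.sub(r"\b(\d{2,4}/\d{1,4})/EWG\b", r"\1/EEC", text)
```

For example, percent_de_to_en("37,5 %") gives "37.5%" and unit_de_to_en("1,2A") gives "1.2 A". Real documents will need more alternatives for units, thousands separators and so on.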
For the date conversion, with the two rules
(\d{1,2})\s?\.\s?(#month-num-to-text#)\s?\.\s?(#21st-Century-2digit#) and
(\d{1,2})\s?\.\s?(#month-num-to-text#)\s?\.\s?(\d{2})
and the replacement rules $1 $2 20$3 and $1 $2 19$3 respectively, I used two custom lists, one with translation pairs like 01 = Jan. and the other with the two-digit years I felt like assigning to the 21st century (i.e. the ones I am most likely to encounter that do, like 00 through 19; I'll have to adjust this in some cases).
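How the two date rules and their custom lists work together can be sketched in Python. The month list here is a small hypothetical excerpt, and "the first matching rule wins" is my assumption about how the rules are applied, not documented memoQ behavior.

```python
import re

# Hypothetical stand-ins for the two custom lists (only a few months shown).
MONTHS = {"01": "Jan.", "05": "May", "12": "Dec."}
CENTURY_21 = [f"{y:02d}" for y in range(20)]  # "00" through "19"

month_alt = "|".join(MONTHS)
year21_alt = "|".join(CENTURY_21)

# (pattern, century prefix) pairs in the order given in the text.
RULES = [
    (re.compile(rf"(\d{{1,2}})\s?\.\s?({month_alt})\s?\.\s?({year21_alt})"), "20"),
    (re.compile(rf"(\d{{1,2}})\s?\.\s?({month_alt})\s?\.\s?(\d{{2}})"), "19"),
]

def convert(text, rules=RULES):
    """Apply the first rule that matches the whole text (assumed behavior)."""
    for pattern, century in rules:
        m = pattern.fullmatch(text)
        if m:
            return f"{m.group(1)} {MONTHS[m.group(2)]} {century}{m.group(3)}"
    return text
```

With the rules in this order, convert("23.05.67") gives "23 May 1967" and convert("23.05.12") gives "23 May 2012". Pass the rules in reverse order, convert("23.05.12", RULES[::-1]), and the general rule grabs the date first, producing "23 May 1912".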
In the case of my two-digit year conversion rules, the rule order is important. The conversion will not work as planned if the rules appear in reverse order. THIS IS A MAJOR PROBLEM with the current rules editor for memoQ auto-translatables. Each time a rule is edited, it goes to the bottom of the list. I'm currently working on a complex set of about 20 rules for converting financial expressions, and rule order is critical for several subgroups of rules (this was disputed after I originally made this post and I backed down... however, subsequent tests have proven that rule order is indeed critical!!!). So editing them in memoQ is a nightmare. (One expert told me how he generates a basic rule set in memoQ, exports it and does all further rule editing in Notepad++, which also allows him to keep better track of his work with comments.) Some current problems with the edit dialog for auto-translation rules in memoQ are
- the need for "order stability" for rules being edited to maintain grouping (for a better overview) and for arrow buttons to move rules up and down easily for better grouping
- insufficient field width/height and bad scrolling behavior, so that it is very difficult to edit long expressions; I usually have to paste them into Notepad to keep an overview while I work
- strange, severe bugs in the test window, so the rule results shown are sometimes not accurate; I deal with this by adding a test document with sample data to the project and looking at what the rules do with that text
- helpful <!-- comments inserted in the rules --> to explain them disappear when MQRES files are imported, and there is no way to maintain explanatory comments to keep track of one's own work in the editor (except for the pitifully limited comment box in the resource properties dialog)
Section 55(2) No. 3 Sentence 2
Section 55 Paragraph 2 No. 3 Sentence 2 or
Sect. 55(2) No. 3 S. 2
as the job calls for. And run a QA check to confirm that I have formatted consistently.
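A rule for legal references along these lines might look like the following Python sketch. The German input format (§ 55 Abs. 2 Nr. 3 S. 2) and the style names are my own assumptions for illustration. A real rule set would need variants for references without a paragraph or sentence number.

```python
import re

# Matches references of the assumed form "§ 55 Abs. 2 Nr. 3 S. 2".
LEGAL_REF = re.compile(r"§\s?(\d+)\s+Abs\.\s?(\d+)\s+Nr\.\s?(\d+)\s+S\.\s?(\d+)")

# One replacement template per house style (hypothetical style names).
STYLES = {
    "long": r"Section \1 Paragraph \2 No. \3 Sentence \4",
    "compact": r"Section \1(\2) No. \3 Sentence \4",
    "short": r"Sect. \1(\2) No. \3 S. \4",
}

def convert_reference(text, style):
    """Rewrite every matching legal reference in text using the chosen style."""
    return LEGAL_REF.sub(STYLES[style], text)
```

For instance, convert_reference("§ 55 Abs. 2 Nr. 3 S. 2", "compact") gives "Section 55(2) No. 3 Sentence 2". Switching the style name is then all it takes to suit a different client.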
Dominique Pivard posted a nice video about memoQ auto-translatables on his "vlog" a while ago. It's worth a look if you want to see how to create these resources and see a good demonstration of how they work.