Translation Tribulations: numbers


Jan 23, 2014

memoQ auto-translation regex blues

Can you write a regular expression that matches each of the three character sequences marked in green below?

abcdefg
abcde
abc

That's how one interactive tutorial I found starts off teaching regular expressions. I found that a bit puzzling, then noticed that the right side of the web page offered a list of notes on "regex" expressions. I've spent too much time in the last two weeks trying to sort out a variety of memoQ auto-translation rules with regular expressions, so the answer came quickly and I typed it in the text field. Then I looked again and realized there were a few alternatives that would work. So I tested them. And then a few more. Various expressions that would work include

[a-z]+
[a-g]+
[\w]*
[abcdefg]*
.+
\D+

and quite a few others. But which is correct? As with word choice in translation, that depends on context, and that's where it gets hard.
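To make the point about context concrete, here is a minimal sketch in Python (its `re` module, not the regex engine inside memoQ) that checks the six candidates against the three sample strings and then against one extra string. The extra string `xyz-42` and the helper names are assumptions added purely for illustration.

```python
import re

SAMPLES = ["abcdefg", "abcde", "abc"]
CANDIDATES = [r"[a-z]+", r"[a-g]+", r"[\w]*", r"[abcdefg]*", r".+", r"\D+"]

def matches_all(pattern, texts):
    """True if the pattern matches every text from start to end."""
    return all(re.fullmatch(pattern, t) for t in texts)

def accepted(texts):
    """Return the candidate patterns that fully match every text."""
    return [p for p in CANDIDATES if matches_all(p, texts)]

# All six candidates accept the tutorial's three strings.
print(accepted(SAMPLES))

# Add one string with a hyphen and digits, and most candidates drop out.
print(accepted(SAMPLES + ["xyz-42"]))
```

Every candidate is "correct" for the three samples; which ones stay correct depends entirely on what else turns up in the text.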

Those who have delved into the configurations of various translation environment tools such as SDL Trados Studio or memoQ have seen that regular expressions are used in different ways to identify patterns in information and then filter or transform that information based on those patterns.

Although there is great power in regular expressions, I see their current role in memoQ configuration as more of a liability as far as the average user is concerned. Regular expressions are currently part of memoQ configuration for at least two import/conversion filters (a text filter and a tagger), segmentation rules and "auto-translation". I find the last feature particularly troublesome in its current form.

There are many misunderstandings about what auto-translation is in memoQ. It has nothing whatsoever to do with machine translation, which some propagandists prefer to call "automatic translation" to gloss over the many difficulties it can cause. Nor is it part of the pre-translation feature, though it can be applied in pre-translation to deal with things like catalog numbers, dates and figures in tables rather efficiently.

Auto-translation in memoQ is used to convert certain patterned information into the format needed in translation. In monolingual editing projects, it can also be used to unify the formatting used for things like dates and currency expressions.

memoQ ships with a number of standard rule sets for number conversions. The "English group" that I use consists of ten different rules for converting most of the screwy number formats I encounter with separators for decimals (periods or commas) and 3-order magnitude groups (thousands, millions, billions, etc.) that might be grouped with spaces, apostrophes, periods or something else. The rule for converting hundreds of millions with decimal fractions to my usual preference is:

(?<!(,|\.|\d|\d\s|\d'|\d’))([-|\u2212]?[\d]{1,3})(?:\.|,|\s|'|’)(\d\d\d)(?:\.|,|\s|'|’)(\d\d\d)(?:\.|,)([\d]{1,2}|[\d]{4,})(?!(,\d|\.\d|\d|\s\d|'\d|’\d))

with the replacement rule $2,$3,$4.$5

Pretty damned intimidating for most of us. In fact, most of the people who grasp the basics of regex syntax will scratch their heads over the number assignments of the groups until they realize that non-capturing groups (like (?:\.|,)) don't count, while the ordinary parentheses inside the opening lookbehind do, which is why the first block of digits is $2 rather than $1. Nobody points that out in any tutorial I've read.
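The group numbering can be checked with a simplified stand-in for the rule above. This is a sketch in Python, not the shipped memoQ rule: Python's `re` module rejects variable-width lookbehind, so the lookarounds are cut down to single-character alternatives, and the literal `|` inside the sign bracket is left out. The name `to_english` is hypothetical.

```python
import re

# Simplified stand-in for the shipped rule (assumption: shortened lookarounds).
PATTERN = re.compile(
    r"(?<!(,|\.|\d))"        # group 1: capturing parentheses inside the lookbehind
    r"([-\u2212]?\d{1,3})"   # group 2: optional sign plus leading digits
    r"(?:\.|,|\s|'|’)"       # non-capturing separator, gets no number
    r"(\d\d\d)"              # group 3
    r"(?:\.|,|\s|'|’)"       # non-capturing separator, gets no number
    r"(\d\d\d)"              # group 4
    r"(?:\.|,)"              # non-capturing decimal separator
    r"(\d{1,2}|\d{4,})"      # group 5: decimal digits
    r"(?!\d)"                # simplified lookahead
)

def to_english(text):
    """Apply the replacement $2,$3,$4.$5 in Python's backslash notation."""
    return PATTERN.sub(r"\2,\3,\4.\5", text)

print(PATTERN.groups)                # number of capturing groups in the pattern
print(to_english("123.456.789,12"))
print(to_english("1 234 567,5"))
```

Only the five pairs of plain parentheses are counted; the three `(?:...)` groups and the lookahead contribute nothing to the numbering.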

If I suggested to most of my esteemed colleagues that they really need to learn this stuff, I think most juries in the civilized world would refuse to convict them of murdering me for the mental cruelty inflicted. There are brilliant translation technology consultants like Marek Pawelec, who eat stuff like this with their breakfast cereal and are a priceless resource to colleagues and corporate clients who need their expertise... and then there are the rest of us.

Kilgray CEO István Lengyel told me recently that there are plans to expand the examples shipped with memoQ later this year to include some reformatting for certain date structures and other information. Language Terminal has a few examples of useful conversion rules for dates, unusual number formats, e-mail addresses and more. You don't need to know regex to use these, just how to download the MQRES files and click Import in one of the memoQ modules for managing auto-translation rules to bring these into your set-up. Once there they can also be used in QA checks.

I think it would be nice if Kilgray or someone more expert than yours truly would produce a "recipe book" with clear documentation of examples so that users of more limited skill (like me) can adapt these examples to their specific needs. Some typical uses I see (some of which have eaten my evenings recently) are

stripping or adding spaces from numbers with percent signs, such as 37,5 % <> 37.5%
formatting other numbers with units according to a preferred convention, such as 1,2A >> 1.2 A
currency expression reformatting like
      TEUR 1.350 >> EUR 1,350 thousand
      34.664,45 € >> €34,664.45
      € 1,2 Mrd. >> €1.2 billion
      & cetera
Legal references such as
      § 15 Abs. 1 Nr. 3 GebrMG
      § 15(1) Nr. 3 GebrMG
      § 9 S. 2 Nr. 1 PatG
page designations and other elements often found in bibliographies and easily overlooked (for QA purposes)
conversion of dates like 23.05.67 to
      23 May 1967
      May 23, 1967
      1967-05-23
          or whatever other format one prefers, with or without non-breaking spaces
Lovely EU legislation designations like 93/42/EWG >> 93/42/EEC
Telephone number reformatting such as (0211) 45 66 - 500 >> +49 211 4566-500
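A few of the simpler items on this list can be sketched as single substitutions. The following is Python rather than memoQ rule syntax (memoQ writes group references as $1 where Python uses \1), and the unit list A, V, W as well as the function names are assumptions for illustration only.

```python
import re

def tidy_percent(text):
    """37,5 % -> 37.5% (decimal comma to point, space before % removed)."""
    return re.sub(r"(\d+),(\d+)\s?%", r"\1.\2%", text)

def tidy_unit(text):
    """1,2A -> 1.2 A (decimal comma to point, one space before the unit)."""
    return re.sub(r"(\d+),(\d+)\s?(A|V|W)\b", r"\1.\2 \3", text)

def tidy_directive(text):
    """93/42/EWG -> 93/42/EEC."""
    return re.sub(r"(\d+/\d+/)EWG", r"\1EEC", text)

print(tidy_percent("37,5 %"))
print(tidy_unit("1,2A"))
print(tidy_directive("93/42/EWG"))
```

Real rules need more care than this, for example to avoid touching figures that already use English formatting.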

Some might wonder about that date conversion with the two-digit years. As a veteran of the Y2K scam, I'm fond of two-digit years; they were part of my ticket over the Atlantic. The implementation of regular expressions in memoQ allows for the use of custom reference lists, which are delimited by hash symbols (#) in the expressions. So when I created my rule for converting those two-digit years in dates

(\d{1,2})\s?\.\s?(#month-num-to-text#)\s?\.\s?(#21st-Century-2digit#) and
(\d{1,2})\s?\.\s?(#month-num-to-text#)\s?\.\s?(\d{2})

with the replacement rules $1 $2 20$3 and $1 $2 19$3 respectively,

I used two custom lists, one with translation pairs like 01 = Jan. and the other with the two-digit years I felt like assigning to the 21st century (i.e. the ones I am most likely to encounter that belong there, like 00 through 19; I'll have to adjust this in some cases).
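The interplay of the two rules and the two lists can be imitated outside memoQ. The sketch below is Python, with a dictionary and a list standing in for the #month-num-to-text# and #21st-Century-2digit# lists; apart from 01 = Jan., the month spellings are assumptions, and so is the premise that the first rule that matches is the one applied.

```python
import re

# Stand-in for #month-num-to-text# (only "01 = Jan." comes from the post).
MONTHS = {
    "01": "Jan.", "02": "Feb.", "03": "Mar.", "04": "Apr.",
    "05": "May", "06": "June", "07": "July", "08": "Aug.",
    "09": "Sept.", "10": "Oct.", "11": "Nov.", "12": "Dec.",
}
# Stand-in for #21st-Century-2digit#: the years 00 through 19.
CENTURY_21 = [f"{n:02d}" for n in range(20)]

month_alt = "|".join(MONTHS)
year_alt = "|".join(CENTURY_21)

# (pattern, century prefix), mirroring "$1 $2 20$3" and "$1 $2 19$3".
RULES = [
    (re.compile(rf"(\d{{1,2}})\s?\.\s?({month_alt})\s?\.\s?({year_alt})"), "20"),
    (re.compile(rf"(\d{{1,2}})\s?\.\s?({month_alt})\s?\.\s?(\d{{2}})"), "19"),
]

def convert(date_text, rules=RULES):
    """Apply the first rule that matches the whole string (assumed behavior)."""
    for pattern, century in rules:
        m = pattern.fullmatch(date_text)
        if m:
            return f"{m.group(1)} {MONTHS[m.group(2)]} {century}{m.group(3)}"
    return date_text

print(convert("23.05.67"))
print(convert("07.10.14"))
print(convert("07.10.14", list(reversed(RULES))))  # general rule tried first
```

With the general rule tried first, the year 14 lands in the twentieth century, because the catch-all (\d{2}) matches every two-digit year.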

In the case of my two-digit year conversion rules, the rule order is important. The conversion will not work as planned if the rules appear in reverse order. THIS IS A MAJOR PROBLEM with the current rules editor for memoQ auto-translatables. Each time a rule is edited, it goes to the bottom of the list. I'm currently working on a complex set of about 20 rules for converting financial expressions, and rule order is critical for several subgroups of rules (this was disputed after I originally made this post and I backed down... however, subsequent tests have proven that rule order is indeed critical!!!). So editing them in memoQ is a nightmare. (One expert told me how he generates a basic rule set in memoQ, exports it and does all further rule editing in Notepad++, which also allows him to keep better track of his work with comments.) Some current problems with the edit dialog for auto-translation rules in memoQ are

the need for "order stability" for rules being edited to maintain grouping (for a better overview), along with arrow buttons to move rules up and down easily for better grouping
insufficient field width/height and bad scrolling behavior, so that it is very difficult to edit long expressions; I usually have to paste them into Notepad to keep an overview while I work
strange, severe bugs in the test window, so the rule results shown are sometimes not accurate; I deal with this by adding a test document with sample data to the project and looking at what the rules do with that text
helpful comments to explain the rules disappear when MQRES files are imported, and there is no way to maintain explanatory comments to keep track of one's own work in the editor (except for the pitifully limited comment box in the resource properties dialog)

A few people I've mentioned auto-translatables to have said to me that they would have no use for them, because they use dictation software. I do that myself, but I find that in many cases (like for those legal references) it is nicer to have a rule configured to client preferences, so I can use a single keystroke to insert

Section 55(2) No. 3 Sentence 2
Section 55 Paragraph 2 No. 3 Sentence 2 or
Sect. 55(2) No. 3 S. 2

as the job calls for. And run a QA check to confirm that I have formatted consistently.
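As a rough illustration of how one source pattern can feed several client styles, here is a sketch in Python. The German input form § 55 Abs. 2 Nr. 3 S. 2, the reading of "S." as "Satz", and the style names are all assumptions; an actual memoQ rule set would hold one replacement per client rather than a lookup table.

```python
import re

# Assumed German source form: § <section> Abs. <paragraph> Nr. <number> S. <sentence>
LEGAL_REF = re.compile(
    r"§\s*(\d+)\s+Abs\.\s*(\d+)\s+Nr\.\s*(\d+)\s+S(?:atz|\.)\s*(\d+)"
)

# Hypothetical style names mapped to the three target formats listed above.
STYLES = {
    "compact": r"Section \1(\2) No. \3 Sentence \4",
    "spelled_out": r"Section \1 Paragraph \2 No. \3 Sentence \4",
    "abbreviated": r"Sect. \1(\2) No. \3 S. \4",
}

def render(text, style):
    """Rewrite every legal reference in the text using the chosen style."""
    return LEGAL_REF.sub(STYLES[style], text)

source = "§ 55 Abs. 2 Nr. 3 S. 2"
for name in STYLES:
    print(render(source, name))
```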

Dominique Pivard posted a nice video about memoQ auto-translatables on his "vlog" a while ago. It's worth a look if you want to see how to create these resources and see a good demonstration of how they work.

Mar 4, 2012

International dates and English

Time and again I encounter problems with texts or translations because of dates. This is something we as translators and quality reviewers for language products and services must keep in mind.

After a recent patent translation, my agency client was asked to provide translations of my appointment as a translator for the German courts and of my examination certificate for the state exams in Berlin in which I qualified as a specialist for natural science. The agency principal is a decent translator himself, and since translating my own certificates would be a conflict of interest, he undertook the task and sent me the results for approval. As expected, the translations were OK except for one thing... the dates. At the top of one translation, I did a double-take at the revelation that I was born in July, rather than October. The date stood clearly as 07/10/1961, and it took a minute before I remembered that Brits, Australians and some others write their dates differently than Americans do. The German way of writing the same date as numbers (07.10.1961) at least has the virtue of using a different separator than Americans use, and it's a different language, so it's hard to be confused there if you know the German custom. But in English, well... the trap is there.

A day later an English friend fell into the same trap in memoQ. She wanted to export some TMX data from a rather large memory - just the work of the last few days for a particular client. The export was made on the second day of March, and when there were some questions regarding the exported data and I was asked to look at the TMX file, I was surprised to see data going back to the third of February. Fortunately, the TM from which the export was made contained only data for that client, so no breach of confidentiality occurred, but in a general TM this would have occurred. What was the problem? She was thinking in "UK mode" and confused 02/03 and 03/02.

One way to avoid this problem is to use the international date format, as Kilgray does in memoQ's filter for the TM and termbase editors, for example.

I am fond of using the international YYYY-MM-DD format wherever appropriate. I expect technical people anywhere in the world to be able to deal with it. However, this numeric format has its limitations too, because there are quite a number of people who simply aren't familiar with the format or who may be a bit dyslexic with numbers. The best solution in most cases is to use dates which include the month written as a word or at least an abbreviation thereof. For English, I would use a specifically American or British numeric date format only if one is very, very sure that wider distribution will never be required.
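The trap is easy to reproduce. This short Python sketch reads the same string from my certificate translation under both English conventions and prints each reading in the international format; the variable names are only illustrative.

```python
from datetime import datetime

raw = "07/10/1961"

# The same characters, read the American way and the British way.
as_american = datetime.strptime(raw, "%m/%d/%Y").date()
as_british = datetime.strptime(raw, "%d/%m/%Y").date()

print(as_american.isoformat())  # 1961-07-10
print(as_british.isoformat())   # 1961-10-07
```

Written as YYYY-MM-DD, the two readings can no longer be mistaken for each other.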
