Mar 17, 2016

Dynamic filtering with regular expressions in memoQ


Regular expressions (aka regex) are not a tool for everyone, though this is something that the nerdily inclined often fail to appreciate. For average users, a plain language query interface, perhaps with more limited options, is generally more accessible and more widely used. However, sometimes it's nice to have such "shortcuts" available to select particular structures in a text for translation or editing, and the many people who complained for years that Kilgray did not provide a dynamic regex filter for the working translation grid - a feature of SDL Trados Studio for quite a while now - did have a point worth addressing in development. Now that has happened, though the result is still a bit incomplete when measured against the full scope of memoQ's usual features for selecting text.

memoQ uses regex in a number of its modules, and Kilgray has several webinars which describe these applications, though they require some stamina to watch, and I expect that most people will become hopelessly confused if they try to take in more than one area of application in a single sitting. The uses of regex for segmentation rules, tagging, autotranslatables and text filtering on document import (with the Regex Text Filter) are very different in their approach, even though the underlying syntax of the regex is the same. However, all of these applications allow the configured rules to be saved and re-used, so one could ask an expert to create the settings needed and provide these in a resource file, and many users do exactly that. Thus as long as one understands that regex can be used for a particular problem, the details can be hired out.

This new application of regex for dynamic filtering, introduced in recent builds of memoQ 2015, is a little different (at present). Although the Find/Replace dialog will "remember" regex syntax in its dropdown menu of recent expressions, there is no way to store these expressions permanently, and they must be entered manually each time they are needed. This means that, for now, the average user will have to collect useful expressions like a tourist might scribble phrases in a notebook to use on holiday in a foreign country, and those with a little more sense of adventure might find themselves with a hovercraft full of eels and wonder why.

One such phrase might be the example in the screenshot above. I was translating some financial statements with several formats present for digits in account numbers, dates and monetary expressions. In order to work more systematically with these various formats, I used several different regular expressions to sort and separate them. In the example I was looking for instances where at least four digits were written together in a source segment. That isn't terribly selective, but most of these occurrences in my documents were account numbers, and filtering on them helpfully cleaned up the text a lot and allowed me to work a little faster. Other expressions were used to QA date formats and monetary expressions more specifically.
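For readers who want to try the idea outside memoQ, here is a minimal sketch in Python. The pattern `\d{4,}` (a run of four or more digits) is my assumption of what such a filter could look like, not necessarily the exact expression used in the project, and the sample segments and the `filter_segments` helper are invented for illustration. memoQ's regex flavour may differ from Python's in some details, so treat this as a test bench rather than a description of memoQ itself.

```python
import re

# Assumed pattern: a run of four or more consecutive digits.
FOUR_PLUS_DIGITS = re.compile(r"\d{4,}")

# Hypothetical source segments, invented for this illustration.
segments = [
    "Transfer to account 12345678 on 31.12.2015",
    "Total: 1,250.00 EUR",
    "See note 7 for details",
]


def filter_segments(pattern, rows):
    """Return only the rows in which the pattern matches somewhere."""
    return [row for row in rows if pattern.search(row)]


matching = filter_segments(FOUR_PLUS_DIGITS, segments)
print(matching)
```

Only the first segment survives the filter: "1,250.00" is broken up by the separators, so it never contains four digits in a row. The year in the first segment would also match on its own, which is why this pattern is not very selective.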

In the working grid for translation and editing, regular expressions can be used in one or both of the filter fields for the source and target text when the checkbox in the toolbar at the right is marked. Alternatively, the regular expressions option in the Find/Replace dialog can be used.


It is somewhat disappointing that regex cannot be used to create static views at the present time. While marking can be used in the Find dialog to enable one to go back and forth between the filter criteria and other configurations of the working grid, there is no way to make a permanent "record" of the filtered segments. For quite a few years, I have wished for the possibility to save the results of my filtering in the working grid in some sort of view, but I was always able at least to recreate the filtering criteria in the dialog to create a memoQ View, which could then be opened at any time or exported in various formats for clients and project collaborators. However, at the moment that is not possible with regex filtering. (There are workarounds involving a change in segment status, but these are often inconvenient in a project in progress.)

The addition of regex filtering to the working grid in memoQ is a welcome feature for many, which I hope will be expanded by Kilgray in the future to achieve more of its potential. But to take advantage of this potential in any way, the average user will indeed need a "phrase book" of sorts, and an efficient way of managing useful collected regex snippets (and naming them for easier re-use in searches and filtering) would be very desirable. If these "regex phrase books" for dynamic filtering and view creation could be saved as shareable light resources, it would be possible to build many useful collections to help users at all levels with translation, editing and quality assurance tasks.
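Until something like that exists in memoQ, a phrase book can be kept outside the tool. The sketch below is only one possible shape for it, written in Python: the names, the `PHRASE_BOOK` dictionary, the `matches` helper and the date pattern are all hypothetical, and nothing here reflects an actual memoQ resource format. It simply shows named snippets being looked up, tested and serialized to JSON so they could be shared.

```python
import json
import re

# Hypothetical phrase book: descriptive names mapped to regex snippets.
PHRASE_BOOK = {
    "four_or_more_digits": r"\d{4,}",
    "ends_with_period": r"\.$",
    "date_dd_mm_yyyy": r"\b\d{2}\.\d{2}\.\d{4}\b",
}


def matches(name, text):
    """Look up a snippet by name and report whether it matches the text."""
    return re.search(PHRASE_BOOK[name], text) is not None


# Serialize the collection so it can be passed on to a colleague.
shared = json.dumps(PHRASE_BOOK, indent=2)
restored = json.loads(shared)

print(matches("date_dd_mm_yyyy", "Due on 31.12.2015"))
print(restored == PHRASE_BOOK)
```

The snippets themselves still have to be pasted into the filter field by hand, but at least they have names and a home.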

3 comments:

  1. Eirik Birkeland, March 23, 2016 3:15 PM

    Good to see that someone else is using the RegEx functionality. Thanks for writing this great blog!

    I'd like to share some checks I found useful recently:

    Checking for missing period in target:
    Source: \.$
    Target: [^.]$

    Check if word is present in source, but missing in target string:

    Source: someWord
    Target: ^((?!someWord).)*$

    The target regex traverses the string character by character and uses a negative lookahead at each step to check that the word does not start there. This is somewhat computationally intensive. Whenever I write RegEx I mostly use regex101 (link below) as my testing ground.

    https://regex101.com/r/jK4aR3/1

    I totally agree that memoQ would benefit from having lists of regular expressions. That would really make it useful to anyone.

    Replies
    1. Thank you for sharing your tips here! Our brilliant colleague Marek Pawelec has also published a good regex tutorial recently on an Adobe blog. It can be found at http://blogs.adobe.com/techcomm/2016/03/framemaker-regular-expressions.html

      The two Kilgray videos (with Denis Hay and Miklos Urban) on using regex are also well-presented, but I think both suffer from too great a scope for most viewers to absorb in one sitting. Covering more than one application area for regex in memoQ in detail is a recipe for a burst head in most people. These are probably best dealt with in application area chunks, but of course they are not indexed, so this isn't very easy.

    2. I have been planning to make some memoQ RegEx demos, and I will eventually get around to putting up some - hopefully sooner than later! I will include live demos as well, so hopefully some translators can be converted.

      Thanks for the link. I read parts of it, and it looks like a clear and succinct summary of RegEx.

      I'd also like to recommend http://www.regular-expressions.info/ as a great look-up resource. It's by the author of the Regular Expressions Cookbook, which sadly doesn't hold that much relevance for translators, as it focuses on the common needs of developers (e-mail validation etc.). What's funny is that many programmers don't know regular expressions very well, and the language actually seems to be more relevant to us translators at times.

      Sometimes I think the world might benefit from a translation-oriented book on regular expressions. E.g. a kind of tutorial + dictionary of trivial as well as non-trivial expressions that are useful across many languages. However, I would expect world sales to be around ... 100? :)

      Have you considered using the following free tool in your workflow?
      https://languagetool.org/

      And here are the rule lists for German and English:
      http://community.languagetool.org/rule/list?lang=de
      http://community.languagetool.org/rule/list?lang=en

      I can't tell whether these are any good, as my most common target language, Norwegian, doesn't even have a database (I've thought of starting one though, but not sure if anyone would take notice TBH!)

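The two source/target checks shared in the first comment above can be tried outside memoQ as well. What follows is a rough sketch in Python, assuming that a segment is flagged when the source pattern matches the source text and the target pattern matches the target text. The `flag` helper, the sample pairs and the use of "memoQ" in place of someWord are invented for illustration, and memoQ's regex flavour may behave differently in edge cases.

```python
import re


def flag(source_pattern, target_pattern, pairs):
    """Keep the pairs where both patterns match their respective side."""
    return [
        (source, target)
        for source, target in pairs
        if re.search(source_pattern, source) and re.search(target_pattern, target)
    ]


# Hypothetical source/target pairs, invented for this illustration.
pairs = [
    ("The report is final.", "Der Bericht ist endgültig."),
    ("The report is final.", "Der Bericht ist endgültig"),
    ("Open the memoQ project", "Öffnen Sie das Projekt"),
    ("Open the memoQ project", "Öffnen Sie das memoQ-Projekt"),
]

# Check 1: source ends with a period, target ends with something else.
missing_period = flag(r"\.$", r"[^.]$", pairs)

# Check 2: word present in source, absent from the whole target string.
missing_word = flag(r"memoQ", r"^((?!memoQ).)*$", pairs)

print(missing_period)
print(missing_word)
```

Each check flags exactly one of the sample pairs: the second pair for the missing period and the third for the missing word. In a script, the second check could be written more simply as a plain substring test on the target, but the lookahead form is the one that fits into a single filter field.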
