Translation Tribulations: Regex Tagger


Oct 18, 2023

An Unfiltered Look at memoQ Filters (webinar, 19 October 2023, 15:00 CET)

This presentation and discussion covered some of the challenges and opportunities to improve memoQ project workflows through correct filter choice and design. There are many different aspects to filters in memoQ, and the right choices for a given translatable file or project are not always clear, or different options may offer particular advantages in your situation.

Cascading filters - an important feature for dealing with complex source texts - are also part of the talk, not just the basics but also examples of going beyond what visible memoQ features allow, to do "the impossible". This session is part of the weekly open office hours for the course "memoQuickies Resource Camp", but everyone is welcome to attend these talks regardless of enrollment status. Those interested in full access to all the course resources and teaching may enroll until the end of January 2024.

To join sessions for the October and November office hours, register here.

After registering, you will receive a confirmation email containing information about joining the meeting.

Here is an edited recording of the October 19th session, with a time-coded index available on YouTube in the Description field:

Mar 2, 2023

memoQ Regex Assistant workshops re-run

The series of three workshops on the use of regex resources in memoQ, with a particular emphasis on the integrated Regex Assistant library, has been updated and will be offered again on March 9, 16 and 23 from 3:00 pm to 4:30 pm Lisbon time (4:00 pm to 5:30 pm CET, 10:00 am to 11:30 am EST).

You can register here to attend any or all of the three sessions:
https://us02web.zoom.us/meeting/register/tZEpde-sqTkvGtCdMsrBl825tFrpDQ98FkAI

This is an evolving course, with the content continuously adapted in response to new questions, workflow challenges and process research as well as interoperability studies with other tools. Participants in the last series asked quite a number of interesting things during and after the talks, and their questions provided excellent material for new examples and approaches, and I hope for the same experience in this round.

The memoQ Regex Assistant is a unique library tool introduced in its current form in memoQ version 9.9. The little bit of public discussion there has been about this tool is quite misleading. Contrary to the "pitch" from memoQ employees and nerdy fans in the user base, this isn't really a tool for learning regular expressions. There are much better means for doing that. And I have strong personal objections to the idiotic statements I hear so often that "everyone should learn some regex". What utter nonsense.

What everyone should do is take advantage of the power regular expressions offer to simplify time-consuming tasks of translation, review, quality assurance and more to ensure accuracy and consistency in language resources and translations. The Regex Assistant helps with this by providing a platform where useful "expressions" can be collected and organized with readable names, labels and descriptions in any language. These libraries can be sorted, exchanged with other users and applied for filtering, find and replace operations, QA checks, segmentation improvements, structured translation of dates, currency expressions, bibliographic information, legal citations and more, or exported and converted to formats for easy use in other tools such as Trados Studio, Phrase/Memsource, Transtools+ and more. All without the need to learn any regular expression syntax!

HTML created from a memoQ Regex Assistant library export

An exported Regex Assistant library converted to a readable format by XSLT

My objective is not to teach regex syntax. It is to empower users to take more control of their work environment and save time and frustration for their teams and enjoy more life beyond the wordface. To help with that, I provide some usable examples in a follow-up mail after each session: resources that you can use in your own work and share freely with colleagues.

And in this next round of workshops, available for purchase, there will be some additional high value resources to help achieve better outcomes for work in particular language pairs and particular specialties, such as financial translations. These complex resources were developed over a period of years, sometimes at great cost. In the last session I'll be getting "down and dirty and a little nerdy" to show you my way of maintaining complex resources like these auto-translation rules and others in a very effective, sustainable way that enables you to adapt quickly to changing requirements and style guides.

Jan 12, 2023

memoQ&A: The Regex Assistant in Practice

Note: there will be another series of workshops for this subject matter in March. Details are HERE. Register once to attend any or all of the sessions.

Users of memoQ version 9.9 or later have a powerful library tool available with which they can organize solutions or solution elements that use regular expressions and apply these without the complications of learning regex syntax. In each session, we'll look at different ways in which these portable libraries can be used, with a particular emphasis on solving common problems faced by translators and reviewers. Materials will also be made available to participants for later study and practice.

There will be three sessions of 90 minutes each on three consecutive Thursdays: January 19, January 26 and February 2 at 11:00 a.m. Lisbon time (i.e. noon CET). The first session will introduce the Regex Assistant library and its basic functions for organizing and exchanging information and then move on to specific examples of using the library to deal with common problems encountered in translation and review work. Particular emphasis in the first session will be on filtering and Find/Replace operations.

The two later sessions will continue to explore filtering and options for making changes to texts and tags, and we will also take a tour of possibilities for using regular expression resources (from the library!) in other parts of memoQ such as the Regex Tagger, QA checks or auto-translation rules. As time permits, examples or requests from participants can also be explored.

Those interested in joining the free sessions can register here.

Update: a recording of the first session is available here: https://youtu.be/KKR5aH5oGH8

To get a little taste of what's to come, have a look at this video created by a colleague last year:

May 5, 2022

Understanding and mastering tags... with memoQ!

Everything you need to know... in 36 pages!

Following up on the success of his excellent guide to machine translation functions in memoQ, Marek Pawelec (Twitter: @wasaty) has now published his definitive guide to tag mastery in that translation environment. In a mere 36 pages of clearly written, engaging text, he has distilled more than a decade of personal expertise and exchanges with other top professionals in language services technology into simple recipes and strategies for success with situations which are often so messy that even experienced project managers and tech support gurus wail in despair. Garbage like this, for example:

This screenshot is taken from the import of The PPTX from Hell, which a frustrated PM asked for help with just as I began reviewing the draft of Marek's book about a month ago. It contained nearly 32,000 superfluous spacing tags and was such a mess that it choked all the best professional macros usually deployed to deal with such things. Last year, I had developed my own way of dealing with these things that involved RTF bilingual exports and some search and replace magic in Microsoft Word, but when I shared it with Marek, he said "There's a better way", and indeed there is. On page 23 of this book. It was much cleaner and faster, and in a few minutes I was able to produce a clean slide set that was much easier to read and translate in the CAT tool. A page that costs 50 cents (of the €18 purchase price of the guide) earned me a 140x return and saved hours of working frustration for the translation team.

The book covers a lot more than just the esoterica of really messed up source files. It is a superb introduction to dealing with tags and markup for students at university and for those new to the translation profession and its endemic technologies, and it has sober, engaging guidance at every level for experienced professionals. I consider it an essential troubleshooting work for those in support roles of internal translation departments and, quite honestly, for my esteemed colleagues in First Level Support at memoQ. Marek is a superb trainer and an articulate teacher, with a humility that masks expertise which very often surprises, delights and informs those of us who are sometimes thought to be experts.

I am also particularly pleased that in the final version of his text he addresses the seldom discussed matter of how to factor markup into cost quotations and service charges for translations. memoQ is particularly well designed to address these problems, because weighting factors equivalent to word or character counts can be incorporated in file statistics, offering a simple, transparent and fair way of dealing with the frustrations that too often leave project managers screaming and crying in frustration shortly before... or after planned deliveries.

Whatever aspect of tags may interest you in translation technology and most particularly in memoQ, this book will give you the concise, clear answers you need to understand the best actions to take.

The PDF e-book is available for purchase here: https://payhip.com/b/tHUDx

Jun 4, 2019

Regular expressions in memoQ demystified - THE workshop!

Next week in Utrecht there will be a unique workshop to enhance your productivity with memoQ, as you learn how to develop rules for automated formatting and QA of patterned expressions, such as dates, currency expressions, unusual or custom text formats and more. THIS knowledge is one of those "secret weapons" that I deploy to help the most sophisticated financial and legal translators I know save countless hours of mind-numbing donkey work doing QA on things like legal references and expressions involving currency (such as EUR 3 million vs. €3m, etc.) or creating those references in the first place and inserting them in the translation with a simple keystroke.

The course instructor, Marek Pawelec, is one of my personal resources when I am in over my head on technical problems or when I need to be very sure that a client of mine gets the right help in time. He has a rare gift of taking subject matter which many find baffling and presenting it in a way that makes it accessible to most any educated adult.

Because of the scope of this subject matter and the importance of proper follow-up and support while learning it, the workshop will be held over two days - June 10 and 11 (Monday and Tuesday) - from 10 am to 4 pm each day, which will give plenty of time to learn the basics and move on to apply your new technical skills to common and not-so-common technical challenges in translation projects where memoQ is involved.

Trust me on this one: we are talking about critical process secrets to save massive amounts of time and do better work on things like annual reports, court briefs and more. Or creating projects for text formats that seem impossible to work with at first glance. THIS is where the money is in an increasingly competitive market.

Information to register now can be found on the Facebook event page for the workshop or on the relevant Regex Workshop page for the host, the All Round Translator education cooperative in the Netherlands.

Apr 3, 2018

Dealing with tagged translatable text in memoQ

Lately I've been doing a bit of custom filter development for some translation agency clients. Most of it has been relatively simple stuff, like chaining an HTML filter after an Excel filter to protect HTML tags around the text in the Excel cells, but some of it is more involved; in a few cases, three levels of filters had to be combined using memoQ's cascading filter feature.
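To make the idea of chaining filters a little more concrete, here is a minimal sketch in Python. It is not how memoQ implements cascading filters; the function names, the tag pattern and the sample rows are invented for illustration. The first stage stands in for a spreadsheet filter that hands over cell text, and the second stage stands in for an HTML filter that separates markup from translatable text.

```python
import re

# Rough pattern for anything that looks like an HTML tag (assumption for this sketch).
HTML_TAG = re.compile(r"</?[A-Za-z][^<>]*>")

def first_stage(rows):
    """Stand-in for the spreadsheet filter: yield the text of each non-empty cell."""
    for row in rows:
        for cell in row:
            if cell.strip():
                yield cell

def second_stage(text):
    """Stand-in for the HTML filter: split a cell into (kind, value) pieces."""
    pieces, pos = [], 0
    for m in HTML_TAG.finditer(text):
        if m.start() > pos:
            pieces.append(("text", text[pos:m.start()]))
        pieces.append(("tag", m.group(0)))
        pos = m.end()
    if pos < len(text):
        pieces.append(("text", text[pos:]))
    return pieces

# Invented sample data: two rows of cells, one cell containing HTML markup.
rows = [["<b>Warning:</b> hot surface", ""], ["Plain cell"]]

# The "cascade": the output of the first stage is the input of the second.
segments = [second_stage(cell) for cell in first_stage(rows)]
for seg in segments:
    print(seg)
```

The point of the sketch is only the order of operations: the second filter never sees the spreadsheet, just the text the first filter extracted from it.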

And sometimes things go too far....

A client had quite a number of JSON files, which were the basis for some online programming tutorials. There was quite a lot of non-translatable content that made it past memoQ's default JSON filter, much of which - if modified in any way - would mess up the functionality of the translated content and require a lot of troublesome post-editing and correction. In the example above, Seconds in a day: is clearly translatable text, but the special rules used with the Regex Tagger turned that text (and others) into protected tags. And unfortunately the rules could not be edited efficiently to avoid this without leaving a lot of untranslatable content unprotected and driving up the cost (due to increased word count) for the client.

In situations like this, there is only one proper thing to do in memoQ: edit the tags!

There are two ways to do this:

use the inline tag editing features of memoQ or
edit the tag on the target side of a memoQ RTF bilingual review file.

The second approach can be carried out by someone (like the client) in any reasonable text editor; tags in an RTF bilingual are represented as red text:

If, however, you go the RTF bilingual route, it's important to specify that the full text of the tags is to be exported, or all you'll get are numbers in brackets as placeholders:

Editing tags in the memoQ working environment is also straightforward:

On the Edit ribbon, select Tag Commands and choose the option Edit Inline Tag.

When you change the tag content as required, remember to click the Save button in the editing dialog each time, or your changes will be lost.

These methods can be applied to cases such as HTML or XML attribute text which needs to be translated but which instead has been embedded in a tag due to an incorrectly configured filter. I've seen that rather often unfortunately.

The effort involved here is greater than the typical word- or character-based compensation schemes can justly compensate and should be charged at a decent hourly rate or be included in project management fees.

A lot of translators are rather "tag-phobic", but the reality of translation today is that tags are an essential part of the translatable content, serving to format translatable content in some cases and containing (unfortunately) embedded text which needs to be translated in other (fortunately less common) cases. Correct handling of tags by translation service providers delivers considerable value to end clients by enabling translations to be produced directly in the file formats needed, saving a great deal of time and money for the client in many cases.

One reasonable objection that many translators have is that the flawed compensation models typically used in the bulk market bog do not fairly include the extra effort of working with tags. In simple cases where the tags are simply part of the format (or are residual garbage from a poorly prepared OCR file, for example), a fair way of dealing with this is to count the tags as words or as an average character equivalent. This is what I usually do, but in the case of tags which need editing, this is not enough, and an hourly charge would apply.

In the filter development project for the JSON files received by my agency client, the text used was initially analyzed at

14,985 words; 111,085 characters; 65 tags

and after proper tagging of the coded content to be protected it was

8766 words; 46,949 characters; 2718 tags.

The reduction in text count more than covered the cost of the few hours needed to produce the cascading filter needed for this client's case and largely ensured that the translator could not alter text which would impair the function of the product.
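For anyone who wants to check the arithmetic, the figures above can be run through a few lines of Python. The counts are the ones quoted in this post; the tag weight at the end is a purely hypothetical value chosen to show how a tag-as-word equivalent could be factored in, not a recommendation.

```python
# Figures quoted in the post for the JSON project.
before = {"words": 14_985, "characters": 111_085, "tags": 65}
after = {"words": 8_766, "characters": 46_949, "tags": 2_718}

def reduction(old, new):
    """Return the absolute and percentage reduction from old to new."""
    diff = old - new
    return diff, round(100 * diff / old, 1)

word_cut = reduction(before["words"], after["words"])
char_cut = reduction(before["characters"], after["characters"])
print(word_cut)  # (6219, 41.5)
print(char_cut)  # (64136, 57.7)

# Hypothetical weighting: count each tag as a fraction of a word.
TAG_WEIGHT_WORDS = 0.25  # assumption for illustration only
weighted_words = after["words"] + TAG_WEIGHT_WORDS * after["tags"]
print(weighted_words)  # 9445.5
```

Even with the tags weighted this way, the billable volume in the sketch stays well below the original 14,985 words.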

Aug 6, 2016

Approaching memSource Cloud

It has been interesting to see the behavior of my codornizes since I moved them from the confines of a rabbit hutch in a stall at my old quinta to the fenced, outdoor enclosures in the shade of a Quercus suber grove. In the hutch, they were fearful creatures, panicking each time I opened their prison to give water and food or to collect eggs. Their diet was also rather miserable; the German hunters who first introduced me to these birds for training very authoritatively told me that they ate "only wheat", and I felt bold offering them anything different like cracked corn or rice. In the concentration camp-like conditions in which they lived, they also developed a serious case of mites and lost a lot of feathers. I thought about slaughtering and eating them as an act of mercy.

Then last spring I moved to a new place with a friend, who built a large enclosure for my goats and chickens. She didn't know about the quail. I brought them one day and hastily improvised an enclosure for them with a large circle of wire fence around a tree, because I was afraid the goats might trample them. There was far more space in this area than they had before, and real, dry dirt for taking dust baths. Soon the mite infestations improved (even before regular dunks in pyrethrin solution began), and the behavior of the birds began to change. They became less nervous, though sometimes when someone approached the enclosure they flew straight up in panic as quail sometimes do and bloodied themselves on the wire.

A few months later I built a much larger enclosure for a mother hen and her chicks to keep them out from under trampling feet or from wandering through the chain link fence of the enclosure into the hungry mouths of six dogs who watched the birds most of the day like Trump fans with a case of beer and an NFL game on the TV. The quail were moved in with the chickens as an afterthought. With nine square meters of sheltered space, the three little birds underwent further transformations, becoming much calmer, never flying in panic and allowing themselves to be approached and picked up with relative ease. They also exhibited a taste for quite a variety of foods, including fresh fruit and weeds such as purslane. Most astonishing of all, they began to lay eggs regularly in an overturned flower pot with a bit of dried grass. Nowhere else. All the reading I've done on quail on the Internet tells me that quail are stupid birds who drop their eggs anywhere, do not maintain nests and seem to have no maternal instincts whatsoever. I am beginning to doubt all that.

At various times in my life I have heard many statements made about the cultural proclivities of various ethnic minorities, but these assertions usually fail to take into account historical background and circumstances of poverty and prejudice, choosing instead to blame victims. In cases where I have seen people of this background offered the same opportunities I take for granted or far less than my cultural privilege has afforded me, I cannot see any result which would offer itself for objective negative commentary.

There are a lot of ignorant assumptions and assertions made about the class of digital sharecroppers known as translators. Some of the most offensive ones are heard from the linguistic equivalents of plantation owners, some of whom have long years of caring for these hapless, technophobic, unreliable "autistics" who simply could not survive without the patriarchal hand of their agencies.

Fortunately, technology continues to evolve in ways which make it ever easier to take up the White Man's Burden and extract value from these finicky, "artistic" human translation resources. The best of breed in this sense could make old King Leopold II envious with the civilization they have brought to us savage translators.

On many occasions, I have advocated the use of various server-based or shared online solutions for coordinating translation work with others. And I will continue to do so wherever that makes sense to me. However, I have observed a number of persistent, dangerous assumptions and practices which reduce or even eliminate the value to be obtained from this approach. It's not a matter of the platform per se, usually, unless it is Across to bear, but too often over the past decade, I have seen how the acquisition of a translation memory management server such as memoQ or memSource or a project management tool such as Plunet, OTM or home-rolled solutions has led to a serious deterioration in the business practices of an enterprise as they put their faith more in technology and less in the people who remain as cogs in their business engines.

As the emphasis has shifted more and more to technologies remote to the sharecroppers actually working the fields of words, a naive belief has established itself as the firm faith of many otherwise rational persons. This is expressed in many ways – sometimes as a pronouncement that browser-based tools are truly the future of translation, often in the dubious, self-serving utterances of bottom-feeding brokers and tool vendors who proclaim the primacy of machine pseudo-translation while hiding behind the fig leaf argument that we need such things to master the mass of data now being generated. It is fortunate for them perhaps that this leaf is opaque enough to hide their true linguistic and intellectual potency from public view.

A related error which I see too often is the failure to distinguish between the convenience of process and project managers and the optimum environment for translating professionals. I don't think this mistake is malicious or deliberately ignores the real factors for optimal work as a wordworker; it's simply damned hard much of the time to understand the needs of someone in a different role. I could say the same for translators not understanding the needs of project managers or even translation consumers, and in fact I often do.

So indeed, the best tool for a project manager or a corporate process coordinator might not be the best tool for the results these people desire from their translators. Fortunately, this is usually a situation where, with a little understanding and testing, both sides can win and work with what works best for them. The mechanism to achieve this is often referred to by the nerdy term "interoperability".

Riccardo Schiaffino, an Italian translator and team leader based in the US, recently published a few articles (trouble and memoQ interoperability) about memSource, a cloud-based tool whose popularity among translation agencies and corporate or public entities with large translation needs continues to grow. High-octane translators like Riccardo and others have trouble sometimes understanding why these parties would choose a tool with such great technical limitations compared to some market leaders like SDL or memoQ, but the simplicity of getting started and the convenience of infrastructure managed elsewhere on secure, high-performance servers with sufficient capacity available for peak use is an understandably powerful draw.

And the support team of memSource and the tools developers are noted for their competence and responsiveness, which is equal in weight to a fat basket full of sexy technical options.

So I will not argue against the use of memSource by agencies and organizational users whose technical needs are not particularly complex and who do not have concerns about a tool almost entirely dependent on reliable, high bandwidth internet connectivity at all times to fulfill its key promises. In fact, it's a good and easy place to start for many, perhaps more so than the rival memoQ Cloud at present, which suffers sometimes from limited capacities (at the same data center used by memSource and others!) during peak use. Unlike the barbed-wire, unstable and unfriendly solution Across, which has achieved some popularity in its native Germany and elsewhere through sales tactics relying on fear, uncertainty and doubt regarding illusionary or delusional data security, memSource works, works well, and the data are portable elsewhere if a company or individual makes another choice some day.

But damn... it's just not very efficient for professional work, especially not for those of us who have amassed considerable personal work resources and become habituated to other tools like SDL Trados Studio, Déjà Vu or memoQ like a carpenter is to his time- and work-tested favorite tools. Trading one of these for the memSource desktop editor or, God forbid, the browser-based translation interface feels worse than being forced to do carpentry with cheap Chinese tools cast from dodgy pot metal. Riccardo mentions a few of the disadvantages, and I could fill pages with a catalog of others. But compared to some other primitive tools, it's not so bad, and for those with little or no good experience with leading translation environment tools, it may seem perfectly OK. You don't miss a myriad of filtering options to edit text or sophisticated QA features if you are still amazed that a "translation memory" can spit out a sentence you translated once-upon-a-time if something similar shows up six months later.

And as mentioned, memSource - or some other tool - may indeed be the best solution on the project management side. So what's a professional translator to do if an interesting project is on offer but that platform is unavoidable? Riccardo's tips on how to process the MXLIFF files from memSource in memoQ offer part of a possible good solution which would work almost equally well in most other leading tools as well these days. One additional bit is needed in the memoQ Regex Tagger filter to handle the other tag type (dual curly brackets) in memSource, but otherwise the advice given will allow safe translation of the memSource files in other environments. I can even change the segmentation in memoQ if, as usual, the project manager has failed to create appropriate segmentation rules in memSource to account for some of the odd stuff one often sees in legal or financial texts, and this does not damage or change the segmentation seen later when the working file is returned to memSource.
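As a rough illustration of what that "additional bit" might look like, here is a sketch in Python. The exact tag syntax found in MXLIFF files is not shown in this post, so both tag shapes below (a single-bracket form like {1} and a dual-bracket form like {{name}}) are assumptions made only for the example, as is the sample sentence. Check real files before relying on any such pattern.

```python
import re

# Assumed tag shapes, for illustration only:
#   dual curly brackets:   {{...}}
#   single curly brackets: {...}
# The dual-bracket alternative comes first so it is tried before the single one.
TAG = re.compile(r"\{\{[^{}]*\}\}|\{[^{}]*\}")

# Invented sample segment.
segment = "Click {1}here{2} to open {{name}}."

found = TAG.findall(segment)
print(found)  # ['{1}', '{2}', '{{name}}']
```

Putting the longer alternative first matters: it ensures a dual-bracket tag is captured whole rather than being matched only from its inner bracket.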

Even concerns about the "lack" of access to shared online resources in memSource if an MXLIFF is translated elsewhere are easily addressed. A few useful things for this include:

pretranslation of the memSource files to put matches into the target before transferring to other environments,
leaving the browser-based or desktop editor for memSource open in the background for online term base or TM look-ups, and
occasionally exporting and synchronizing the MXLIFF in memSource to make the data available to team members working in parallel on a large project - this takes just a minute or two and allows one as much time as needed for polishing text in the other environment.

The last tip is particularly helpful to calm the nerves of project managers who are like mother hens on a nest of eggs which they fear might in fact be hand grenades and who panic if they don't see "progress" on their project servers days before anything is due. One can show them "progress" every twenty minutes or so without much ado if so inclined.

I am past the point where I recommend any translation memory management server in particular for agency and corporate processes. There are advantages to each (except Across, where these are actually hallucinations) and disadvantages, and where I see real problems, it is seldom due to the choice of platform but rather the lack of training and process knowledge by those responsible for the processes. The bright and shining prospects of a translation server are easily sold with a slick tongue, but without an honest analysis and recommendation of needs for initial and ongoing staff training these too often end up being bright and shining lies. I think very often of a favorite German customer who invested heavily in such a system four or five years ago and has not managed one single successful project with the system in all that time. This makes me sick to think of the waste of resources and possibilities.

So on the project management and process ownership side, memSource may be a great choice. Certainly some of my clients think so, and the improvements in their business often back this belief up. And for those who work with gangs of indigent, migrant or sharecropping translators whose marginal existences make the investment in professional resources like SDL Trados Studio or memoQ seem difficult or undesirable, it may be all that is needed by anyone.

The good news for those who depend on the efficiency of a favored tool, however, is that with a few simple steps, we need not compromise and can get full value from our better desktop tools while supporting interesting projects based in memSource. So each side of the translation project can work with what works best for them, without loss, compromise, risk or recriminations.

And the translating quail who start out in a dark box with a stunting lack of possibilities can look forward to the real possibilities of work liberation in a larger environment richer in healthy possibilities and rewards.

Jul 12, 2012

RegEx for translating DVX external view tables in memoQ

Atril's Déjà Vu was the first translation environment tool I am aware of to offer a means of exchanging translation content for review, correction and translation using an ordinary word processor. These "external views" were the original inspiration for memoQ's RTF bilingual tables, which are used in many interoperable workflows not only with people using a word processor but with many other CAT tools as well.

As with memoQ RTF bilinguals, the content in the "external view" which is not to be translated can be selected and hidden with a word processor, leaving only a target column into which the source text has been copied. But these steps alone with the standard RTF filter pose a problem:

The DVX "codes" (tags), which are represented by curly brackets enclosing a number, are not protected. Erasing parts of them can damage the content. It is also not possible to perform a tag check using the memoQ QA functions.

The solution is to use the Regex tagger in memoQ. There are two ways to do this.

If the document has already been imported, the tagger can be run from the Format menu.

Enter the appropriate regular expression to convert the DVX code to a protected tag: \{(\d+)\}

This expression describes the pattern of the text to protect: a curly bracket (with a backslash in front of it to indicate that this is to be interpreted literally as a character, not as a bracket for grouping something), one or more digits (\d indicates a digit as opposed to d, which is just the letter d, and the plus sign means one or more) and a closing curly bracket ("escaped" with a backslash so it is understood literally as the bracket character in the DVX code). The parentheses around \d+ form a group that captures the number inside the brackets.
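If you want to try the expression outside memoQ first, a minimal sketch in Python looks like this. The sample segment is invented, and the variable names are mine; an expression this simple should behave the same way in most regex flavors.

```python
import re

# The expression from the post: a literal "{", one or more digits, a literal "}".
DVX_CODE = re.compile(r"\{(\d+)\}")

# Invented sample segment, for illustration only.
segment = "Press {1}Start{2} to continue."

# Each match is one DVX code; group 1 holds the number inside the brackets.
codes = [m.group(0) for m in DVX_CODE.finditer(segment)]
numbers = [int(m.group(1)) for m in DVX_CODE.finditer(segment)]

print(codes)    # ['{1}', '{2}']
print(numbers)  # [1, 2]
```

Curly brackets that enclose anything other than digits, such as {abc}, are left alone by this pattern.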

Click Add to put the rule in the list, then click Run tagger now.

The result is protected tags in the translation grid of memoQ. These can also be verified with a QA tag check after the translation is completed.
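The idea behind such a tag check can be sketched in a few lines of Python. This is not memoQ's QA implementation, just an illustration of the principle under the assumption that every DVX code in the source should appear the same number of times in the target; the function name and the sample segments are invented.

```python
import re
from collections import Counter

DVX_CODE = re.compile(r"\{\d+\}")

def tag_check(source: str, target: str) -> dict:
    """Compare the DVX codes in source and target and report differences."""
    src = Counter(DVX_CODE.findall(source))
    tgt = Counter(DVX_CODE.findall(target))
    return {
        "missing": sorted((src - tgt).elements()),  # in source, not in target
        "extra": sorted((tgt - src).elements()),    # in target, not in source
    }

# Invented example: the translator dropped code {2}.
result = tag_check(
    "Press {1}Start{2} to continue.",
    "Drücken Sie {1}Start, um fortzufahren.",
)
print(result)  # {'missing': ['{2}'], 'extra': []}
```

A check like this only works once the codes are recognizable as units, which is exactly what tagging them achieves.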

Your regular expression rules can be saved in the dialog above and re-used, or exported from the list under Tools > Resource console... > Filter configurations and shared with others.

The regular expression tagger can also be used as a cascading filter when the RTF file for the external view is imported:

Here the configuration can also be saved or another one loaded.
