May 30, 2022

Cleaning up language variants in memoQ term bases

While the idea of using sublanguage variants, such as UK, US or Canadian versions of English, sounds nice in principle, in practice these often create headaches for users of translation environments such as memoQ, particularly when exchanging glossaries with others but also when viewing and editing the data in the built-in editors. Many times I have heard colleagues and clients express a wish to "go back" and work only with generic variants of a language in order to simplify their management of terminology data. In the video below, I share one method to do so.

At 3:08 in the video, I share a little "aside" about how the exported term data can be edited to mark a term as forbidden (for instance, if its use is not desired by the translation buyer). Other changes to the information are also possible at this stage, such as the addition of context and usage information. Other data fields from the term base can also be included in the export for cleanup if these play an important role in your memoQ term bases.

For years, users have requested an editing feature in memoQ that would make "unifying" language variants possible, but as you can see in this video tutorial, this possibility already exists and is neither difficult nor time-consuming to implement. 

If you do not wish to create a new term base to import the cleaned-up data (as shown in the video) but would rather bring it into the same term base, it is important to configure the settings for your import correctly so that the original data will be overwritten and you won't end up with messy duplication of information. This is achieved with the following setting marked in red:

However, it should be noted that the term base will still have all the now-unused language variants, albeit with no entries for them. These can be removed by unchecking the boxes for the respective language variants in the term base's Properties dialog.

Speaking of the Properties dialog, some may have noted that in recent versions of memoQ there is an automated option for cleaning up those unwanted language variants:

Why bother with the XLSX route then? Well, depending on what version of memoQ you use, you may not have that command available in the dialog. But more importantly, I find that when merging data from various language variants I often want to do additional editing of the term information, and that really isn't possible when merging language variants in the Properties dialog. Doing the edits in Microsoft Excel gives you an overview of the data and the option to make whatever adjustments may be needed. In Excel you can also make further changes, such as altering the match properties for better hit results or more accurate quality assurance.

May 28, 2022

Filtering formatted text in Microsoft Office files

Recently, I shared an approach to selecting text in a Microsoft Word file with editing restricted to certain paragraphs. This feature of Microsoft Word is, alas, not supported by any translation tool filters of which I am aware, so to import only the text designated for editing it is necessary to go inside the DOCX file (which is just a ZIP archive with the extension changed) and use the XML file which contains the document text with all its format markers.

This approach is generally valid for all formats applied to Microsoft Office files since Office 2007, such as DOCX from Word or PPTX from PowerPoint. I have prepared a video to show how the process of extracting the content and importing it for translation can work:

After translation, the relevant XML file is exported and the original XML is replaced with the translated file inside the archive. If the DOCX or PPTX file was unpacked to get at the XML, the folder structure can then be re-zipped and the extension changed to its original form to create the deliverable translated file.
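For those who would rather script the swap than use a ZIP tool, here is a minimal Python sketch of the replacement step. It is not the procedure from the video. The function name `replace_part` and the file names in the usage comment are made up for illustration, and it assumes the translated XML was exported as a UTF-8 file.

```python
import zipfile

def replace_part(src_path, dst_path, part_name, new_bytes):
    """Copy the Office file at src_path to dst_path, replacing one internal part.

    part_name is the path inside the archive, for example "word/document.xml".
    (Function and parameter names are illustrative, not from any tool.)
    """
    with zipfile.ZipFile(src_path) as src:
        if part_name not in src.namelist():
            raise KeyError(f"{part_name} not found in {src_path}")
        with zipfile.ZipFile(dst_path, "w", zipfile.ZIP_DEFLATED) as dst:
            for item in src.infolist():
                # Keep every part as it was, except the one being replaced.
                data = new_bytes if item.filename == part_name else src.read(item.filename)
                dst.writestr(item, data)

# Hypothetical usage (file names are placeholders):
# with open("document_translated.xml", "rb") as f:
#     replace_part("source.docx", "translated.docx", "word/document.xml", f.read())
```

Writing to a new file rather than modifying the original in place keeps the source document untouched if something goes wrong.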

What I do not show in the video is that the content can also be extracted by other means, such as convenient memoQ project templates using filters with masks to extract directly using various ZIP filter options. But the lower tech approach shown in the video is one that should be accessible to any professional with access to modern translation environment tools which permit filter customization with regular expressions.

Once a filter has been created for a particular format such as red text, adapting it to extract only green highlighted text or text in italics or some other format takes less than a minute in an editor. Different filters are necessary for the same formats in DOCX and PPTX, because unfortunately Microsoft's markup for yellow highlighting, for example, differs between Word and PowerPoint in the versions I tested.
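To give a feel for how small that adaptation is, here is a rough Python sketch of the same idea outside any translation tool. It is not the filter from the video. It assumes Word marks red text with a `w:color` element with the value `FF0000` and named highlighting with a `w:highlight` element inside the run properties; the exact markup in a given file should be checked in an editor first. The constant and function names are invented for the example, and the patterns apply to DOCX only, since PowerPoint uses different elements.

```python
import re

# A Word run (<w:r>...</w:r>); the \b keeps <w:rPr> and similar tags from matching.
RUN = re.compile(r"<w:r\b[^>]*>(.*?)</w:r>", re.DOTALL)
# Text inside a run (<w:t>...</w:t>), without matching <w:tab/> or <w:tbl>.
TEXT = re.compile(r"<w:t(?:\s[^>]*)?>(.*?)</w:t>", re.DOTALL)

# Assumed format markers; swap one for another to target a different format.
RED_TEXT = '<w:color w:val="FF0000"'
YELLOW_HIGHLIGHT = '<w:highlight w:val="yellow"'
ITALIC = "<w:i/>"

def runs_with_format(document_xml, format_marker):
    """Return the text of each run whose markup contains format_marker."""
    hits = []
    for run in RUN.findall(document_xml):
        if format_marker in run:
            hits.append("".join(TEXT.findall(run)))
    return hits

sample = (
    '<w:p><w:r><w:rPr><w:color w:val="FF0000"/></w:rPr><w:t>red</w:t></w:r>'
    '<w:r><w:rPr><w:highlight w:val="yellow"/></w:rPr><w:t>marked</w:t></w:r>'
    '<w:r><w:t>plain</w:t></w:r></w:p>'
)
print(runs_with_format(sample, RED_TEXT))          # ['red']
print(runs_with_format(sample, YELLOW_HIGHLIGHT))  # ['marked']
```

Changing the target format means changing one string, which is the same kind of edit one makes to the regular expression in the tool's filter configuration.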

Although this is a bit of a nerdy hack, it's probably easier for most people than various macro solutions to hide and unhide text. And it takes far less time and is more accurate than copying text to another file.

In cases where it is important to see the original context of the text being translated, memoQ's PDF Preview Tool can help: this viewer, available in recent versions, will track the imported text in a PDF made from the original file. The PDF can be created using the PDF Save options available in Microsoft applications.

May 5, 2022

Understanding and mastering tags... with memoQ!

Everything you need to know... in 36 pages!

Following up on the success of his excellent guide to machine translation functions in memoQ, Marek Pawelec (Twitter: @wasaty) has now published his definitive guide to tag mastery in that translation environment. In a mere 36 pages of clearly written, engaging text, he has distilled more than a decade of personal expertise and exchanges with other top professionals in language services technology into simple recipes and strategies for success with situations which are often so messy that even experienced project managers and tech support gurus wail in despair. Garbage like this, for example:

This screenshot is taken from the import of The PPTX from Hell, which a frustrated PM asked for help with just as I began reviewing the draft of Marek's book about a month ago. It contained nearly 32,000 superfluous spacing tags and was such a mess that it choked all the best professional macros usually deployed to deal with such things. Last year, I had developed my own way of dealing with these things that involved RTF bilingual exports and some search and replace magic in Microsoft Word, but when I shared it with Marek, he said "There's a better way", and indeed there is. On page 23 of this book. It was much cleaner and faster, and in a few minutes I was able to produce a clean slide set that was much easier to read and translate in the CAT tool. A page that costs 50 cents (of the €18 purchase price of the guide) earned me a 140x return and saved hours of working frustration for the translation team.

The book covers a lot more than just the esoterica of really messed up source files. It is a superb introduction to dealing with tags and markup for students at university and for those new to the translation profession and its endemic technologies, and it has sober, engaging guidance at every level for experienced professionals. I consider it an essential troubleshooting work for those in support roles of internal translation departments and, quite honestly, for my esteemed colleagues in First Level Support at memoQ. Marek is a superb trainer and an articulate teacher, with a humility that masks expertise which very often surprises, delights and informs those of us who are sometimes thought to be experts.

I am also particularly pleased that in the final version of his text he addresses the seldom discussed matter of how to factor markup into cost quotations and service charges for translations. memoQ is particularly well designed to address these problems, because weighting factors equivalent to word or character counts can be incorporated in file statistics, offering a simple, transparent and fair way of dealing with the tag messes that too often leave project managers screaming and crying in frustration shortly before... or after planned deliveries.

Whatever aspect of tags may interest you in translation technology and most particularly in memoQ, this book will give you the concise, clear answers you need to understand the best actions to take.

The PDF e-book is available for purchase here:

Forget the CAT, gimme a BAT!

It's been nine months since my last blog post. Rumors and celebrations of my demise are premature; I have simply felt a profound reluctance to wade in the increasingly troubled waters of public media and the trendy nonsense that too often passes for professional wisdom these days. And in pandemic times, when most everything goes online, I feel a better place for me is in a stall to be mucked or sitting on a stump somewhere watching rabbits and talking to goats, dogs or ducks. Certainly they have a better appreciation of the importance of technology than most advocates of "artificial intelligence".

But for those more engaged with such matters, a recent blog post by my friend and memoQ founder Balázs Kis, The Human Factor in the Development of Translation Software, is worth reading. In his typically thoughtful way, he explores some of the contradictions and abuses of technology in language services and postulates that

... for the foreseeable future, there will be translation software that is built around human users of extraordinary knowledge. The task of such software is to make their work as efficient and enjoyable as possible. The way we say it, they should not simply trudge through, but thrive in their work, partially thanks to the technology they are using. 

From the perspective of a software development organization, there are three ways to make this happen:  

  • Invent new functionality 
  • Interview power users and develop new functionality from them 
  • Go analytical and work from usage data and automate what can be automated; introduce shortcuts 

I think there is a critical element missing from that bullet list. Some time ago, I heard about a tribe in Africa where the men typically carry one tool with them into the field: a large knife. Whatever problem they might encounter is to be solved with two things: their human brains and, optionally, that knife. In a sense, we can look at good software tools in a similar way, as that optional knife. Beyond the basic range of organizing functions that one can expect from most modern translation environment tools, the solution to a challenge is more often to be found in the way we use our human brains to consider the matter, not so much the actual tool we use. So, from a user perspective and from the perspective of a software development organization, thriving work more often depends not so much on features but on a flexible approach to problem solving based on an understanding of the characteristics of the material challenge and the possibilities, often not adequately discussed, of the available tools. But developing capacities to think frequently seems much harder than "teaching" what to think, which is probably why the former approach is seldom found in professional language service training, even when the trainers may earnestly believe this is what they are facilitating.

I'll offer a simple example from recent experience. In the past year, most of my efforts have been devoted to consulting and training for language technology applications, trying to deal with crappy CMS systems for which developers never gave proper consideration to translation workflows or developing methods to handle really weird outliers like comment translation for distributed PDFs or filtering the "protected" content of Microsoft Word documents with restricted editing to... uh... protect the "restricted" parts.

That editing function in Microsoft Word was new to me despite the fact that I have explored and used many functions of that tool since I was first introduced to it in 1986. I qualify as a power user because I am probably familiar with at least five percent of the program's features, though I am constantly learning new ways to apply that five percent. And the 95% remaining is full of surprises:

Most of the text here can't be edited in MS Word, but default CAT tool filters cannot exclude it.

Only the highlighted text can be edited in the word processor, and that was also the only text to be translated. The real files were much larger than this example, of course, and the text to be translated was interspersed with a lot of text to be left alone. What can you do?

It was interesting to see the various "solutions" offered, some of which involved begging or instructing the customer to do one thing or another, which is not always a practical option. And imagine the hassles of any kind of manual selection, copying and replacement if you have hundreds of pages like this. So some kind of automation is needed, really. Oh, and you can't even hide the protected text. It will import with the default filters of the translation tool, where it will then be indistinguishable from the actual text to be translated and it can be modified. In other words, bye-bye "protection".

What can be done?

There are a number of possibilities that fall short of developing a new option for import filters, which could take years given the often sluggish development cycles for any major CAT tool. One would be...

... to consider that a Microsoft Word DOCX file is really a ZIP archive with a bunch of stuff inside it. That stuff includes a file called document.xml, which contains the actual text of the MS Word document:
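A quick way to confirm this without renaming or unpacking anything is to open the DOCX with Python's standard `zipfile` module. The sketch below assumes the main text part sits at `word/document.xml`, which is where Word normally puts it; the helper names and the file name in the usage comment are placeholders.

```python
import zipfile

def list_parts(docx_path):
    """Return the names of all files stored inside the DOCX archive."""
    with zipfile.ZipFile(docx_path) as archive:
        return archive.namelist()

def read_document_xml(docx_path, part_name="word/document.xml"):
    """Return the main document part as text (assumed to be UTF-8 encoded)."""
    with zipfile.ZipFile(docx_path) as archive:
        return archive.read(part_name).decode("utf-8")

# Hypothetical usage (file name is a placeholder):
# print(list_parts("restricted.docx"))
# print(read_document_xml("restricted.docx")[:500])
```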

That XML file has an interesting structure. All the document text is in one line as one can see when it is opened in a code editor like Notepad++:

I've highlighted the interesting part, the part with the only text I want to see after importing the file for translation (i.e. the text for which editing is not restricted in MS Word). Ah yes, my strategy here is to deal with the XML text container for the DOCX file and ignore the rest. When the question was raised, I knew there must be such a file, but despite exploring the internal bits of MS Office files with ZIP archive tools for about a decade now, I never actually had occasion to poke around inside of document.xml, and I knew nothing of that file's structure. But simple logic told me there must be a marker there somewhere which would offer a solution.

As it turned out, the relevant markers are a set of tags denoting the beginning and end of a text block with editing permission. These can be seen at the start and finish of the text I highlighted in the screenshot. So all that remains is to filter that mess. A simple thing, really.

In memoQ, there is a "filter" which is not really a filter: the Regex Text Filter. It's actually a toolkit for building filters for text-based files, and XML files are really just text files with a lot of funky markup. I don't care about any of that markup except in the blocks I want to import, so I customized the filter settings accordingly:

A smattering of regular expressions went a long way here, and the expressions used are just some of many possible ways to parse the relevant blocks. Then I added the default XML filter after the custom regex text filter, because memoQ makes filter sequencing of many kinds very easy that way. This problem can be solved with any major CAT tool I think, but I don't have to think very hard about such things when I work with memoQ. The result can be sent from memoQ as an XLIFF file to any other tool if the actual translator has other preferences. Oh, the joys of interoperable excellence....
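For readers curious what such expressions can look like outside memoQ, here is a hedged Python sketch of the same parsing idea. These are not the expressions from the filter configuration. The sketch assumes the permission markers are the `w:permStart` and `w:permEnd` elements of WordprocessingML, written as self-closing tags; the function name is invented for the example.

```python
import re

# Everything between a permission start marker and the next end marker.
# Assumes both markers are self-closing elements.
PERM_BLOCK = re.compile(r"<w:permStart\b[^>]*/>(.*?)<w:permEnd\b[^>]*/>", re.DOTALL)
# Text elements only; <w:tab/>, <w:tbl> and the like are not matched.
TEXT = re.compile(r"<w:t(?:\s[^>]*)?>(.*?)</w:t>", re.DOTALL)

def editable_text(document_xml):
    """Return the plain text of each block that falls between permission markers.

    XML entities such as &amp; are left escaped; this is a sketch, not a parser.
    """
    blocks = []
    for block in PERM_BLOCK.finditer(document_xml):
        blocks.append("".join(TEXT.findall(block.group(1))))
    return blocks

sample = (
    '<w:p><w:r><w:t>Locked</w:t></w:r></w:p>'
    '<w:p><w:permStart w:id="1" w:edGrp="everyone"/>'
    '<w:r><w:t xml:space="preserve">Translate </w:t></w:r>'
    '<w:r><w:tab/><w:t>me</w:t></w:r>'
    '<w:permEnd w:id="1"/></w:p>'
)
print(editable_text(sample))  # ['Translate me']
```

Unlike the cascaded filter described above, this only reads the text; it does nothing to write translations back into the markup, which is the job the translation tool does on export.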

The imported text for translation, with preview 

After translation, document.xml is replaced in the DOCX file by the new version, and the work is done, the "impossible" accomplished without any new features added to the basic toolkit. Computer assistance is all very well, but without brain-assisted translation you're more likely to achieve half the result with double the effort or more.