Translation Tribulations: Regex text filter


Oct 18, 2023

An Unfiltered Look at memoQ Filters (webinar, 19 October 2023, 15:00 CET)

This presentation and discussion covered some of the challenges and opportunities to improve memoQ project workflows through correct filter choice and design. There are many different aspects to filters in memoQ, and the right choices for a given translatable file or project are not always clear. Often several options will work, and each may offer particular advantages in your situation.

Cascading filters - an important feature for dealing with complex source texts - were also part of the talk: not just the basics, but also examples of going beyond what visible memoQ features allow, to do "the impossible". This session is part of the weekly open office hours for the course "memoQuickies Resource Camp", but everyone is welcome to attend these talks regardless of enrollment status. Those interested in full access to all the course resources and teaching may enroll until the end of January 2024.

To join sessions for the October and November office hours, register here.

After registering, you will receive a confirmation email containing information about joining the meeting.

Here is an edited recording of the October 19th session, with a time-coded index available on YouTube in the Description field:

Mar 2, 2023

memoQ Regex Assistant workshops re-run

The series of three workshops on the use of regex resources in memoQ, with a particular emphasis on the integrated Regex Assistant library, has been updated and will be offered again on March 9, 16 and 23 from 3:00 pm to 4:30 pm Lisbon time (4:00 pm to 5:30 pm CET, 10:00 am to 11:30 am EST).

You can register here to attend any or all of the three sessions:
https://us02web.zoom.us/meeting/register/tZEpde-sqTkvGtCdMsrBl825tFrpDQ98FkAI

This is an evolving course, with the content continuously adapted in response to new questions, workflow challenges and process research as well as interoperability studies with other tools. Participants in the last series asked quite a number of interesting things during and after the talks, and their questions provided excellent material for new examples and approaches, and I hope for the same experience in this round.

The memoQ Regex Assistant is a unique library tool introduced in its current form in memoQ version 9.9. The little bit of public discussion there has been about this tool is quite misleading. Contrary to the "pitch" from memoQ employees and nerdy fans in the user base, this isn't really a tool for learning regular expressions. There are much better means for doing that. And I have strong personal objections to the idiotic statements I hear so often that "everyone should learn some regex". What utter nonsense.

What everyone should do is take advantage of the power regular expressions offer to simplify time-consuming tasks of translation, review, quality assurance and more to ensure accuracy and consistency in language resources and translations. The Regex Assistant helps with this by providing a platform where useful "expressions" can be collected and organized with readable names, labels and descriptions in any language. These libraries can be sorted, exchanged with other users and applied for filtering, find and replace operations, QA checks, segmentation improvements, structured translation of dates, currency expressions, bibliographic information, legal citations and more. They can also be exported and converted to formats for easy use in other tools such as Trados Studio, Phrase/Memsource, Transtools+ and more. All without the need to learn any regular expression syntax!

HTML created from a memoQ Regex Assistant library export

An exported Regex Assistant library converted to a readable format by XSLT

My objective is not to teach regex syntax. It is to empower users to take more control of their work environment, save time and frustration for their teams and enjoy more life beyond the wordface. To help with that, I provide some usable examples in a follow-up mail after each session: resources that you can use in your own work and share freely with colleagues.

And in this next round of workshops, available for purchase, there will be some additional high value resources to help achieve better outcomes for work in particular language pairs and particular specialties, such as financial translations. These complex resources were developed over a period of years, sometimes at great cost. In the last session I'll be getting "down and dirty and a little nerdy" to show you my way of maintaining complex resources like these auto-translation rules and others in a very effective, sustainable way that enables you to adapt quickly to changing requirements and style guides.

May 28, 2022

Filtering formatted text in Microsoft Office files

Recently, I shared an approach to selecting text in a Microsoft Word file with editing restricted to certain paragraphs. This feature of Microsoft Word is, alas, not supported by any translation tool filters of which I am aware, so to import only the text designated for editing it is necessary to go inside the DOCX file (which is just a ZIP archive with the extension changed) and use the XML file which contains the document text with all its format markers.

This approach is generally valid for all the Microsoft Office file formats in use since Office 2007, such as DOCX from Word or PPTX from PowerPoint. I have prepared a video to show how the process of extracting the content and importing it for translation can work:

After translation, the relevant XML file is exported and the original XML is replaced with the translated file inside the archive. If the DOCX or PPTX file was unpacked to get at the XML, the folder structure can then be re-zipped and the extension changed to its original form to create the deliverable translated file.
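For anyone who would rather script that last step than re-zip by hand, here is a minimal Python sketch. It is not what the video shows; the function name `replace_part` and the file names in the usage note are made up for illustration, and only the standard `zipfile` module is used.

```python
import zipfile

def replace_part(src, dst, part_name, new_bytes):
    """Copy a DOCX/PPTX archive to dst, swapping one internal part.

    src and dst may be file paths or binary file-like objects.
    part_name is the path inside the archive, e.g. "word/document.xml".
    """
    with zipfile.ZipFile(src) as source, \
         zipfile.ZipFile(dst, "w", zipfile.ZIP_DEFLATED) as target:
        for item in source.infolist():
            if item.filename == part_name:
                data = new_bytes                    # the translated XML
            else:
                data = source.read(item.filename)   # everything else unchanged
            # Reusing the original entry keeps its name and archive order
            target.writestr(item, data)
```

A call might look like `replace_part("source.docx", "translated.docx", "word/document.xml", translated_xml_bytes)`, where both file names and the variable are placeholders. The output file already carries the right extension, so no renaming is needed afterwards.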

What I do not show in the video is that the content can also be extracted by other means, such as convenient memoQ project templates that use ZIP filter options with file masks to extract it directly. But the lower tech approach shown in the video is one that should be accessible to any professional with access to modern translation environment tools which permit filter customization with regular expressions.

Once a filter has been created for a particular format such as red text, adapting it to extract only green highlighted text or text in italics or some other format takes less than a minute in an editor. Different filters are necessary for the same formats in DOCX and PPTX, because unfortunately Microsoft's markup for yellow highlighting, for example, differs between Word and PowerPoint in the versions I tested.
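To give a feel for how small that adaptation is, here is a rough Python sketch in which the colour is the only thing that changes. It assumes that Word stores a highlight as a `w:highlight` element with a `w:val` colour name inside the run; check this against your own document.xml, as the markup in the real filter may differ. The function name is invented for the example, and the PowerPoint variant would need its own pattern.

```python
import re

def word_highlight_runs(document_xml, colour):
    """Return the Word runs (w:r elements) highlighted in the given colour.

    Assumption: the highlight appears as <w:highlight w:val="COLOUR"/>.
    """
    # Match any characters, but never step across the end of the current run
    inside_run = r"(?:(?!</w:r>).)*?"
    pattern = re.compile(
        r"<w:r\b[^>]*>" + inside_run
        + r'<w:highlight w:val="' + re.escape(colour) + r'"/>'
        + inside_run + r"</w:r>",
        re.DOTALL,
    )
    return pattern.findall(document_xml)

sample = (
    '<w:r><w:t>plain</w:t></w:r>'
    '<w:r><w:rPr><w:highlight w:val="yellow"/></w:rPr><w:t>marked</w:t></w:r>'
    '<w:r><w:rPr><w:highlight w:val="green"/></w:rPr><w:t>other</w:t></w:r>'
)
yellow_runs = word_highlight_runs(sample, "yellow")  # only the "marked" run
green_runs = word_highlight_runs(sample, "green")    # only the "other" run
```

Switching from yellow to green is a one-word change, which is the point: the structure of the pattern stays the same and only the format marker is swapped.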

Although this is a bit of a nerdy hack, it's probably easier for most people than various macro solutions to hide and unhide text. And it takes far less time and is more accurate than copying text to another file.

In cases where it is important to see the original context of the text being translated, this can be done, for example, using memoQ's PDF Preview Tool, a viewer available in recent versions which will track the imported text in a PDF made from the original file. Such a PDF can be created with the PDF save options available in Microsoft applications.

May 5, 2022

Forget the CAT, gimme a BAT!

It's been nine months since my last blog post. Rumors and celebrations of my demise are premature; I have simply felt a profound reluctance to wade in the increasingly troubled waters of public media and the trendy nonsense that too often passes for professional wisdom these days. And in pandemic times, when most everything goes online, I feel a better place for me is in a stall to be mucked or sitting on a stump somewhere watching rabbits and talking to goats, dogs or ducks. Certainly they have a better appreciation of the importance of technology than most advocates of "artificial intelligence".

But for those more engaged with such matters, a recent blog post by my friend and memoQ founder Balázs Kis, The Human Factor in the Development of Translation Software, is worth reading. In his typically thoughtful way, he explores some of the contradictions and abuses of technology in language services and postulates that

... for the foreseeable future, there will be translation software that is built around human users of extraordinary knowledge. The task of such software is to make their work as efficient and enjoyable as possible. The way we say it, they should not simply trudge through, but thrive in their work, partially thanks to the technology they are using.
From the perspective of a software development organization, there are three ways to make this happen:
Invent new functionality
Interview power users and develop new functionality from them
Go analytical and work from usage data and automate what can be automated; introduce shortcuts

I think there is a critical element missing from that bullet list. Some time ago, I heard about a tribe in Africa where the men typically carry one tool with them into the field: a large knife. Whatever problem they might encounter is to be solved with two things: their human brains and, optionally, that knife. In a sense, we can look at good software tools in a similar way, as that optional knife. Beyond the basic range of organizing functions that one can expect from most modern translation environment tools, the solution to a challenge is more often to be found in the way we use our human brains to consider the matter, not so much the actual tool we use. So, from a user perspective and from the perspective of a software development organization, thriving work more often depends not so much on features but on a flexible approach to problem solving based on an understanding of the characteristics of the material challenge and the possibilities, often not adequately discussed, of the available tools. But developing capacities to think frequently seems much harder than "teaching" what to think, which is probably why the former approach is seldom found in professional language service training, even when the trainers may earnestly believe this is what they are facilitating.

I'll offer a simple example from recent experience. In the past year, most of my efforts have been devoted to consulting and training for language technology applications, trying to deal with crappy CMS systems for which developers never gave proper consideration to translation workflows or developing methods to handle really weird outliers like comment translation for distributed PDFs or filtering the "protected" content of Microsoft Word documents with restricted editing to... uh... protect the "restricted" parts.

That editing function in Microsoft Word was new to me despite the fact that I have explored and used many functions of that tool since I was first introduced to it in 1986. I qualify as a power user because I am probably familiar with at least five percent of the program's features, though I am constantly learning new ways to apply that five percent. And the 95% remaining is full of surprises:

Most of the text here can't be edited in MS Word, but default CAT tool filters cannot exclude it.

Only the highlighted text can be edited in the word processor, and that was also the only text to be translated. The real files were much larger than this example, of course, and the text to be translated was interspersed with a lot of text to be left alone. What can you do?

It was interesting to see the various "solutions" offered, some of which involved begging or instructing the customer to do one thing or another, which is not always a practical option. And imagine the hassles of any kind of manual selection, copying and replacement if you have hundreds of pages like this. So some kind of automation is needed, really. Oh, and you can't even hide the protected text. It will import with the default filters of the translation tool, where it will then be indistinguishable from the actual text to be translated and it can be modified. In other words, bye-bye "protection".

What can be done?

There are a number of possibilities that fall short of developing a new option for import filters, which could take years given the often sluggish development cycles for any major CAT tool. One would be...

... to consider that a Microsoft Word DOCX file is really a ZIP archive with a bunch of stuff inside it. That stuff includes a file called document.xml, which contains the actual text of the MS Word document:
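If you want to see this for yourself without renaming and unpacking anything, a few lines of Python will do. This is just an illustrative sketch using the standard `zipfile` module; the two function names are made up here, while `word/document.xml` is the usual location of the main text part inside a DOCX.

```python
import zipfile

def list_parts(docx):
    """Return the names of all files stored inside the DOCX archive."""
    with zipfile.ZipFile(docx) as archive:
        return archive.namelist()

def read_document_xml(docx):
    """Return the main text part (word/document.xml) as a string."""
    with zipfile.ZipFile(docx) as archive:
        return archive.read("word/document.xml").decode("utf-8")
```

Both functions accept a file path or an open binary file, so `list_parts("some_file.docx")` (a placeholder name) shows the "bunch of stuff" and `read_document_xml` returns the text container discussed below.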

That XML file has an interesting structure. All the document text is in one line as one can see when it is opened in a code editor like Notepad++:

I've highlighted the interesting part, the part with the only text I want to see after importing the file for translation (i.e. the text for which editing is not restricted in MS Word). Ah yes, my strategy here is to deal with the XML text container for the DOCX file and ignore the rest. When the question was raised, I knew there must be such a file, but despite exploring the internal bits of MS Office files with ZIP archive tools for about a decade now, I never actually had occasion to poke around inside of document.xml, and I knew nothing of that file's structure. But simple logic told me there must be a marker there somewhere which would offer a solution.

As it turned out, the relevant markers are a set of tags denoting the beginning and end of a text block with editing permission. These can be seen at the start and finish of the text I highlighted in the screenshot. So all that remains is to filter that mess. A simple thing, really.
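As a rough illustration of the idea outside memoQ, the following Python sketch pulls out everything between such a pair of markers. The tag names `w:permStart` and `w:permEnd` are an assumption based on the Office Open XML markup for range permissions, since the screenshot is not reproduced here, and the expression is not necessarily the one used in the actual filter.

```python
import re

# Assumed marker names: self-closing w:permStart / w:permEnd elements
PERM_BLOCK = re.compile(
    r"<w:permStart\b[^>]*/>(.*?)<w:permEnd\b[^>]*/>",
    re.DOTALL,
)

def editable_blocks(document_xml):
    """Return the raw XML found between each permission start/end pair."""
    return PERM_BLOCK.findall(document_xml)

sample = (
    '<w:p><w:r><w:t>Locked text.</w:t></w:r></w:p>'
    '<w:p><w:permStart w:id="1" w:edGrp="everyone"/>'
    '<w:r><w:t>Editable text.</w:t></w:r>'
    '<w:permEnd w:id="1"/></w:p>'
)
blocks = editable_blocks(sample)
# blocks == ['<w:r><w:t>Editable text.</w:t></w:r>']
```

The result is still full of markup, of course. Cleaning that up is a separate job.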

In memoQ, there is a "filter" which is not really a filter: the Regex Text Filter. It's actually a toolkit for building filters for text-based files, and XML files are really just text files with a lot of funky markup. I don't care about any of that markup except in the blocks I want to import, so I customized the filter settings accordingly:

A smattering of regular expressions went a long way here, and the expressions used are just some of many possible ways to parse the relevant blocks. Then I added the default XML filter after the custom regex text filter, because memoQ makes filter sequencing of many kinds very easy that way. This problem can be solved with any major CAT tool I think, but I don't have to think very hard about such things when I work with memoQ. The result can be sent from memoQ as an XLIFF file to any other tool if the actual translator has other preferences. Oh, the joys of interoperable excellence....
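The second stage of such a cascade can be imitated in a few lines of Python, purely to show the principle: once the first stage has isolated a block of XML, the next one throws away the markup and keeps the readable text. This sketch assumes the text sits in `w:t` elements, as it normally does in Word's XML; the function name is hypothetical and memoQ's own XML filter naturally does far more than this.

```python
import re
from html import unescape

# Matches <w:t> and <w:t xml:space="preserve">, but not <w:tab/> or <w:tbl>
TEXT_ELEMENT = re.compile(r"<w:t(?:\s[^>]*)?>(.*?)</w:t>", re.DOTALL)

def visible_text(fragment):
    """Keep only the character data of the w:t elements in an XML fragment."""
    pieces = TEXT_ELEMENT.findall(fragment)
    # Turn entities such as &amp; back into ordinary characters
    return "".join(unescape(piece) for piece in pieces)

fragment = (
    '<w:r><w:rPr><w:b/></w:rPr>'
    '<w:t xml:space="preserve">Terms &amp; </w:t></w:r>'
    '<w:r><w:tab/><w:t>conditions</w:t></w:r>'
)
text = visible_text(fragment)
# text == 'Terms & conditions'
```

Keeping the two stages separate is what makes the approach flexible: the first expression decides which blocks matter, and the second stage does not need to know why they were chosen.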

The imported text for translation, with preview

After translation, document.xml is replaced in the DOCX file by the new version, and the work is done, the "impossible" accomplished without any new features added to the basic toolkit. Computer assistance is all very well, but without brain-assisted translation you're more likely to achieve half the result with double the effort or more.

Jun 4, 2019

Regular expressions in memoQ demystified - THE workshop!

Next week in Utrecht there will be a unique workshop to enhance your productivity with memoQ, as you learn how to develop rules for automated formatting and QA of patterned expressions, such as dates, currency expressions, unusual or custom text formats and more. THIS knowledge is one of those "secret weapons" that I deploy to help the most sophisticated financial and legal translators I know save countless hours of mind-numbing donkey work doing QA on things like legal references and expressions involving currency (such as EUR 3 million vs. €3m, etc.) or creating those references in the first place and inserting them in the translation with a simple keystroke.
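To make the idea of a rule for patterned expressions a little more concrete, here is a tiny Python sketch that rewrites amounts like "EUR 3 million" in the short style "€3m". It is only an illustration with an invented style rule and invented names, not one of the workshop resources, and real rules for financial texts have to cope with many more variants.

```python
import re

# Hypothetical style rule: "EUR 3 million" becomes "€3m", "EUR 2.5 billion" becomes "€2.5bn"
AMOUNT = re.compile(r"\bEUR\s+(\d+(?:\.\d+)?)\s+(million|billion)\b")
SUFFIX = {"million": "m", "billion": "bn"}

def short_form(text):
    """Rewrite long-form euro amounts in the short style."""
    def convert(match):
        number, unit = match.group(1), match.group(2)
        return "€" + number + SUFFIX[unit]
    return AMOUNT.sub(convert, text)

example = short_form("Revenue rose to EUR 3 million.")
# example == 'Revenue rose to €3m.'
```

The same pattern can serve for QA as well: if it still matches anything in a finished translation that is supposed to follow the short style, something was missed.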

The course instructor, Marek Pawelec, is one of my personal resources when I am in over my head on technical problems or when I need to be very sure that a client of mine gets the right help in time. He has a rare gift of taking subject matter which many find baffling and presenting it in a way that makes it accessible to most any educated adult.

Because of the scope of this subject matter and the importance of proper follow-up and support while learning it, the workshop will be held over two days - June 10 and 11 (Monday and Tuesday) - from 10 am to 4 pm each day, which will give plenty of time to learn the basics and move on to apply your new technical skills to common and not-so-common technical challenges in translation projects where memoQ is involved.

Trust me on this one: we are talking about critical process secrets to save massive amounts of time and do better work on things like annual reports, court briefs and more. Or creating projects for text formats that seem impossible to work with at first glance. THIS is where the money is in an increasingly competitive market.

Information to register now can be found on the Facebook event page for the workshop or on the relevant Regex Workshop page for the host, the All Round Translator education cooperative in the Netherlands.

Jul 8, 2013

Text preview workaround for memoQ

For years I have been nagging Kilgray to add a preview for plain text files imported to memoQ. I translate quite a number of TXT files sometimes, and it bothers me not to be able to see the text in the preview as I work. The workaround I have used for this - resaving the file as RTF, DOC or DOCX - runs the risk of getting me in trouble if I don't remember to re-save as plain text before delivery.

While editing some video caption texts recently, it dawned on me that users now have another alternative. One of the quiet improvements made in memoQ 6.2 was the introduction of a preview for files imported with the Regex text filter.

If a text file is brought into the project using Import with options... and the import filter is changed from the plain text filter to the regex text filter (with the defaults - no other settings needed), a preview will be created as shown in the screenshot above.

There will be lines between the paragraphs, and the imported text will have a green background highlight. A bit annoying perhaps, but better than the previous complete lack of a preview.

Those who work with Sanskrit and Arabic may note some minor problems in the example shown. I was a little surprised to notice that there were no problems displaying the words with other scripts on the right side of the working window, but the results in the translation grid are not quite as good. I thought perhaps this was due to my choice of display font, and that may be the case, as switching to Arial Unicode MS fixed all but the comma display, but perhaps there is more to the problem.

Jul 5, 2013

Translating video captions in memoQ

Since my first explorations of editing video caption files in a text editor last week, I've learned quite a few ways to improve the process. I found a free, cross-platform Open Source tool, Aegisub, for editing the captions. It is particularly helpful when the timing needs to be adjusted, and its use is fairly intuitive. It beats working in Notepad or Microsoft Word by a long shot.

For translating caption files I also discovered a useful resource on Kilgray's Language Terminal: a Regex text filter designed to filter out the segment numbers and time codes in the caption files. Useful exclusion rules to configure for this are as follows:

The resource file for the filter settings (MQRES) and some sample caption files in English can be downloaded here.
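For those curious about what such exclusion rules amount to, here is a small Python sketch of the same principle. It assumes SubRip-style (SRT) captions, with a number line and a time code line ahead of each caption; the expressions and names are illustrative and are not the ones stored in the MQRES file.

```python
import re

# A line containing nothing but a segment number
INDEX_LINE = re.compile(r"^\d+$")
# A time code line such as 00:00:01,000 --> 00:00:04,000
TIMECODE_LINE = re.compile(
    r"^\d{2}:\d{2}:\d{2}[,.]\d{3}\s*-->\s*\d{2}:\d{2}:\d{2}[,.]\d{3}"
)

def translatable_lines(caption_text):
    """Return only the caption text, skipping numbers, time codes and blank lines."""
    kept = []
    for line in caption_text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        if INDEX_LINE.match(stripped) or TIMECODE_LINE.match(stripped):
            continue
        kept.append(stripped)
    return kept

sample = """1
00:00:01,000 --> 00:00:04,000
Welcome to the farm.

2
00:00:04,500 --> 00:00:07,250
These are the goats.
"""
lines = translatable_lines(sample)
# lines == ['Welcome to the farm.', 'These are the goats.']
```

One weakness of this simple version: a caption line consisting only of digits would be treated as a segment number and skipped, so a more careful rule would also look at the position of the line within its block.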

Here is an example (preview) of how text is filtered:

To ensure that the correct filter settings are used for the captions text file, use Import with options... in memoQ and set the file type to "All files (*.*)" for the likely case that the file extension is not recognized by memoQ:

If you want to change the text breaks in a given time segment, use the Join function to combine segments, and place the tag for the line break wherever it makes sense to do so:
