
Oct 18, 2023

An Unfiltered Look at memoQ Filters (webinar, 19 October 2023, 15:00 CET)


 

This presentation and discussion covered some of the challenges and opportunities to improve memoQ project workflows through correct filter choice and design. There are many different aspects to filters in memoQ, and the right choices for a given translatable file or project are not always clear, or different options may offer particular advantages in your situation.

Cascading filters - an important feature for dealing with complex source texts - are also part of the talk: not just the basics, but also examples of going beyond what the visible memoQ features allow to do "the impossible". This session is part of the weekly open office hours for the course "memoQuickies Resource Camp", but everyone is welcome to attend these talks regardless of enrollment status. Those interested in full access to all the course resources and teaching may enroll until the end of January 2024.

To join sessions for the October and November office hours, register here.

After registering, you will receive a confirmation email containing information about joining the meeting. 

Here is an edited recording of the October 19th session, with a time-coded index available on YouTube in the Description field:

Sep 4, 2023

New online course: "memoQuickies Resource Camp"

Summer is almost over, but technically, "camping season" will continue in memoQ World until November 30th. Or maybe January 31st, depending on how you count.

Today, a three-month journey of exploration begins, covering six important kinds of resources to make work with the memoQ translation desktop and server environments more pleasant and efficient... and profitable. This self-guided online course will give participants full access to my 14 years of cumulative experience as a memoQ user translating, managing projects and developing hundreds of solutions with this world-leading productivity tool.

Click here or on the icon bar above to have a look at the course description and to see (and maybe download) some of the publicly available information and resources for better work in many language pairs. 

The emphasis of teaching will shift to a new resource every two weeks (with auto-translation rules as the main topic for the first two weeks), but throughout the course, information will be added continuously to all topic sections as I trawl through, sort, upgrade and publish the best or most interesting stuff from my archives. And course participants have access to open virtual office hours each week on Thursdays and some other occasions, where any questions can be asked and special requests made.

A special enrollment discount of 40% is available for the first week (code: HALFOFFLAUNCH) until September 10th, but you can join at any time and work with any of the material posted, ask questions and receive feedback. Learning material and downloadable, ready-to-use and -adapt resources will continue to be added until the end of November, and the full course will remain online through January 2024. Enrollment fees and content are subject to change without notice.

Addendum 1: On Thursday afternoon, September 7th, 2023, a presentation was made to introduce the first course topic - "Auto-translation Rules for Everyone". The recording and slides can be found here.

Addendum 2: Payment options for groups and monthly budgets have been introduced now. These options enable teams, departments and organizations to obtain blocks of passes for their members to receive continuing professional education in translation workflow tools. The host site applies VAT and other taxes where relevant and generates appropriate invoices. All relevant information can be found at the bottom of the information and enrollment page.



May 28, 2022

Filtering formatted text in Microsoft Office files

Recently, I shared an approach to selecting text in a Microsoft Word file with editing restricted to certain paragraphs. This feature of Microsoft Word is, alas, not supported by any translation tool filter of which I am aware, so to import only the text designated for editing, it is necessary to go inside the DOCX file (which is just a ZIP archive with the extension changed) and work with the XML file that contains the document text and all its format markers.

This approach is generally valid for all of the XML-based formats used by Microsoft Office since Office 2007, such as DOCX from Word or PPTX from PowerPoint. I have prepared a video to show how the process of extracting the content and importing it for translation can work:

 After translation, the relevant XML file is exported and the original XML is replaced with the translated file inside the archive. If the DOCX or PPTX file was unpacked to get at the XML, the folder structure can then be re-zipped and the extension changed to its original form to create the deliverable translated file.
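
For anyone who prefers a script to a ZIP utility for this round trip, the extraction and repacking steps can be expressed in a few lines of Python. This is only a minimal sketch of the process described above; word/document.xml is the standard location of the main text in a DOCX package, but the folder and file names used here are invented for illustration.

    import zipfile

    # Step 1: pull the text container out of the DOCX package (a ZIP archive).
    with zipfile.ZipFile("report.docx") as pkg:
        pkg.extract("word/document.xml", path="extracted")

    # ... the extracted document.xml is filtered and translated, and the
    # translated version is assumed to be saved as translated/document.xml ...

    # Step 2: rebuild the package, swapping in the translated document.xml.
    with zipfile.ZipFile("report.docx") as src, \
         zipfile.ZipFile("report_translated.docx", "w", zipfile.ZIP_DEFLATED) as dst:
        for name in src.namelist():
            if name == "word/document.xml":
                dst.write("translated/document.xml", arcname=name)
            else:
                dst.writestr(name, src.read(name))

For a PPTX, the translatable text lives in the ppt/slides/slideN.xml files instead, but the repacking logic is the same.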

What I do not show in the video is that the content can also be extracted by other means, such as convenient memoQ project templates using filters with masks to extract directly via various ZIP filter options. But the lower-tech approach shown in the video should be accessible to anyone working with a modern translation environment tool that permits filter customization with regular expressions.

Once a filter has been created for a particular format such as red text, adapting it to extract only green highlighted text or text in italics or some other format takes less than a minute in an editor. Different filters are necessary for the same formats in DOCX and PPTX, because unfortunately Microsoft's markup for yellow highlighting, for example, differs between Word and PowerPoint in the versions I tested.
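
To give a concrete idea of why the same visual format needs different rules in the two file types: in Word's XML a red run normally carries a <w:color w:val="FF0000"/> element in its run properties, while PowerPoint's DrawingML typically expresses the same color as a solidFill element (e.g. <a:solidFill><a:srgbClr val="FF0000"/></a:solidFill>). The sketch below covers only the Word case and is illustrative; the exact markup can vary with the Office version and with how the formatting was applied, so inspect your own files before building a filter on these patterns.

    import re

    with open("extracted/word/document.xml", encoding="utf-8") as f:
        xml = f.read()

    red_text = []
    # Walk through the runs (<w:r>...</w:r>) and keep the visible text of
    # those whose run properties declare red (FF0000) as the font color.
    for run in re.findall(r"<w:r\b.*?</w:r>", xml, re.DOTALL):
        if '<w:color w:val="FF0000"' in run:
            red_text += re.findall(r"<w:t[^>]*>(.*?)</w:t>", run, re.DOTALL)

    print(red_text)

Swapping FF0000 for another color value, or checking for a highlight or italics marker instead, is exactly the kind of one-minute adaptation described above.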

Although this is a bit of a nerdy hack, it's probably easier for most people than various macro solutions to hide and unhide text. And it takes far less time and is more accurate than copying text to another file.

In cases where it is important to see the original context of the text being translated, memoQ's PDF Preview Tool can help: this viewer, available in recent versions, tracks the imported text in a PDF made from the original file. Such a PDF can be created with the PDF save options available in the Microsoft applications.


May 5, 2022

Forget the CAT, gimme a BAT!

It's been nine months since my last blog post. Rumors and celebrations of my demise are premature; I have simply felt a profound reluctance to wade in the increasingly troubled waters of public media and the trendy nonsense that too often passes for professional wisdom these days. And in pandemic times, when most everything goes online, I feel a better place for me is in a stall to be mucked or sitting on a stump somewhere watching rabbits and talking to goats, dogs or ducks. Certainly they have a better appreciation of the importance of technology than most advocates of "artificial intelligence".


But for those more engaged with such matters, a recent blog post by my friend and memoQ founder Balázs Kis, The Human Factor in the Development of Translation Software, is worth reading. In his typically thoughtful way, he explores some of the contradictions and abuses of technology in language services and postulates that

... for the foreseeable future, there will be translation software that is built around human users of extraordinary knowledge. The task of such software is to make their work as efficient and enjoyable as possible. The way we say it, they should not simply trudge through, but thrive in their work, partially thanks to the technology they are using. 

From the perspective of a software development organization, there are three ways to make this happen:  

  • Invent new functionality 
  • Interview power users and develop new functionality from them 
  • Go analytical and work from usage data and automate what can be automated; introduce shortcuts 

I think there is a critical element missing from that bullet list. Some time ago, I heard about a tribe in Africa where the men typically carry one tool with them into the field: a large knife. Whatever problem they might encounter is to be solved with two things: their human brains and, optionally, that knife. In a sense, we can look at good software tools in a similar way, as that optional knife. Beyond the basic range of organizing functions that one can expect from most modern translation environment tools, the solution to a challenge is more often to be found in the way we use our human brains to consider the matter, not so much the actual tool we use. So, from a user perspective and from the perspective of a software development organization, thriving work more often depends not so much on features but on a flexible approach to problem solving based on an understanding of the characteristics of the material challenge and the possibilities, often not adequately discussed, of the available tools. But developing capacities to think frequently seems much harder than "teaching" what to think, which is probably why the former approach is seldom found in professional language service training, even when the trainers may earnestly believe this is what they are facilitating.

I'll offer a simple example from recent experience. In the past year, most of my efforts have been devoted to consulting and training for language technology applications: trying to deal with crappy CMS systems whose developers never gave proper consideration to translation workflows, or developing methods to handle really weird outliers like comment translation for distributed PDFs or filtering the "protected" content of Microsoft Word documents with restricted editing to... uh... protect the "restricted" parts.

That editing function in Microsoft Word was new to me despite the fact that I have explored and used many functions of that tool since I was first introduced to it in 1986. I qualify as a power user because I am probably familiar with at least five percent of the program's features, though I am constantly learning new ways to apply that five percent. And the 95% remaining is full of surprises:

Most of the text here can't be edited in MS Word, but default CAT tool filters cannot exclude it.

Only the highlighted text can be edited in the word processor, and that was also the only text to be translated. The real files were much larger than this example, of course, and the text to be translated was interspersed with a lot of text to be left alone. What can you do?

It was interesting to see the various "solutions" offered, some of which involved begging or instructing the customer to do one thing or another, which is not always a practical option. And imagine the hassles of any kind of manual selection, copying and replacement if you have hundreds of pages like this. So some kind of automation is needed, really. Oh, and you can't even hide the protected text. It will import with the default filters of the translation tool, where it will then be indistinguishable from the actual text to be translated and it can be modified. In other words, bye-bye "protection".

What can be done?

There are a number of possibilities that fall short of developing a new option for import filters, which could take years given the often sluggish development cycles for any major CAT tool. One would be...

... to consider that a Microsoft Word DOCX file is really a ZIP archive with a bunch of stuff inside it. That stuff includes a file called document.xml, which contains the actual text of the MS Word document:


That XML file has an interesting structure. All the document text is in one line as one can see when it is opened in a code editor like Notepad++:


I've highlighted the interesting part, the part with the only text I want to see after importing the file for translation (i.e. the text for which editing is not restricted in MS Word). Ah yes, my strategy here is to deal with the XML text container for the DOCX file and ignore the rest. When the question was raised, I knew there must be such a file, but despite exploring the internal bits of MS Office files with ZIP archive tools for about a decade now, I never actually had occasion to poke around inside of document.xml, and I knew nothing of that file's structure. But simple logic told me there must be a marker there somewhere which would offer a solution.

As it turned out, the relevant markers are a set of tags denoting the beginning and end of a text block with editing permission. These can be seen at the start and finish of the text I highlighted in the screenshot. So all that remains is to filter that mess. A simple thing, really.
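
In WordprocessingML these permission ranges are normally delimited by empty w:permStart and w:permEnd elements. The fragment below is only a rough Python illustration of that boundary logic, under the assumption that the markers appear in their standard self-closing form; the real solution described next keeps the intervening markup intact so the file can be rebuilt, rather than stripping everything down to plain text.

    import re

    with open("word/document.xml", encoding="utf-8") as f:
        xml = f.read()

    # Grab everything between an editing-permission start marker and the next
    # end marker, then reduce the captured run markup to its visible text.
    blocks = re.findall(r"<w:permStart\b[^>]*/>(.*?)<w:permEnd\b[^>]*/>", xml, re.DOTALL)
    for block in blocks:
        text = "".join(re.findall(r"<w:t[^>]*>(.*?)</w:t>", block, re.DOTALL))
        print(text)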

In memoQ, there is a "filter" which is not really a filter: the Regex Text Filter. It's actually a toolkit for building filters for text-based files, and XML files are really just text files with a lot of funky markup. I don't care about any of that markup except in the blocks I want to import, so I customized the filter settings accordingly:


A smattering of regular expressions went a long way here, and the expressions used are just some of many possible ways to parse the relevant blocks. Then I added the default XML filter after the custom regex text filter, because memoQ makes filter sequencing of many kinds very easy that way. This problem can be solved with any major CAT tool I think, but I don't have to think very hard about such things when I work with memoQ. The result can be sent from memoQ as an XLIFF file to any other tool if the actual translator has other preferences. Oh, the joys of interoperable excellence....

The imported text for translation, with preview 

After translation, document.xml is replaced in the DOCX file by the new version, and the work is done, the "impossible" accomplished without any new features added to the basic toolkit. Computer assistance is all very well, but without brain-assisted translation you're more likely to achieve half the result with double the effort or more.





Jun 4, 2019

Regular expressions in memoQ demystified - THE workshop!

Next week in Utrecht there will be a unique workshop to enhance your productivity with memoQ, as you learn how to develop rules for automated formatting and QA of patterned expressions, such as dates, currency expressions, unusual or custom text formats and more. THIS knowledge is one of those "secret weapons" that I deploy to help the most sophisticated financial and legal translators I know save countless hours of mind-numbing donkey work doing QA on things like legal references and expressions involving currency (such as EUR 3 million vs. €3m, etc.) or creating those references in the first place and inserting them in the translation with a simple keystroke.

The course instructor, Marek Pawelec, is one of my personal resources when I am in over my head on technical problems or when I need to be very sure that a client of mine gets the right help in time. He has a rare gift for taking subject matter which many find baffling and presenting it in a way that makes it accessible to almost any educated adult.

Because of the scope of this subject matter and the importance of proper follow-up and support while learning it, the workshop will be held over two days - June 10 and 11 (Monday and Tuesday) - from 10 am to 4 pm each day, which will give plenty of time to learn the basics and move on to apply your new technical skills to common and not-so-common technical challenges in translation projects where memoQ is involved.

Trust me on this one: we are talking about critical process secrets to save massive amounts of time and do better work on things like annual reports, court briefs and more. Or creating projects for text formats that seem impossible to work with at first glance. THIS is where the money is in an increasingly competitive market.

Information to register now can be found on the Facebook event page for the workshop or on the relevant Regex Workshop page for the host, the All Round Translator education cooperative in the Netherlands.

Jan 4, 2019

Translating Microsoft Publisher files

Every few months or so I run across a question in social media or am confronted with a project like this:



Some time ago, Paul Filkin published an interesting discussion of an Open Exchange application that enables SDL Trados Studio users to deal with the Microsoft Publisher format, with some limitations; in the article, he also discussed other approaches, including one I have known about for some time: the use of Western Standard's Fluency.

I looked at Fluency some years ago, and while I found some interesting things there, such as its transcription module, on the whole the application never seemed ready for prime time, with sloppy programming in its details. I spent some time trying to persuade its underfunded team to correct some of the problems I saw, but after a while it became clear that the company and its product were not able to cope with the demanding technical challenges routinely faced by language service providers today.

The discussion which followed the posted question suggested a number of approaches, but if the colleague's client expected to receive a translated PUB file instead of some other format, the only realistic option for this possibly one-off job would be to use Fluency in some way. I assumed (and suggested) that a workflow involving
  *.pub <-> Fluency <-> (exchange format) <-> memoQ
might do the trick (with the exchange format probably being XLIFF or, failing that, the bilingual RTF format that I remembered from my tests of Fluency long ago).

And so it proved to be. But the Devil is in the details.

The first sign of trouble came from a colleague - a professor at a local university who is known for his technical curiosity and flexibility in translation courses - who told me that Fluency does indeed offer an XLIFF export but that memoQ experienced problems importing it. His description of the error message sounded a lot to me like the typical mistakes that CAT tool programmers who are XLIFF newbies make when implementing a spec that they are probably too lazy to read and test. (I found the same error myself and submitted it to memoQ Support for comment a few hours ago.) He said that he had then tried the RTF export, but it wasn't clear to me what the result was and he was under time pressure, so I didn't press the matter but resolved to have a look myself.

I used a modified English template file for an invitation as my PUB file to test. The file imported easily into Fluency:

I assume that "terminology" download is some silly, unhelpful public domain dictionary I would never use.

The Fluency user interface offered a sort of WYSIWYG representation of the text, which made it seem not bad to work with, though appearances are deceiving. In fact, this proved to be a source of some trouble later.

As mentioned, the XLIFF export could not be used in memoQ, and although I am capable enough of analyzing structure problems in a tagged file, I wasn't in the mood to clean up someone else's mess, so I exported a "Fluency Work File" as my next attempt. That is app jargon for a bilingual RTF file similar to that found in other applications.


The difference with Fluency RTFs is that they include the WYSIWYG text representation. Nice, really, and this makes the work in another environment a little easier. I copied the source text column and pasted it into a new file (DOCX), then imported that to memoQ for translation:


Afterward, the translation exported from memoQ was pasted into the target column of the Fluency Work File (the bilingual RTF exchange file). I imported that bilingual file back into Fluency and then exported a translated PUB file using the File / Save As command. I got a strange error message saying that there had been some trouble with the export and that some manual adjustment might be needed in Microsoft Publisher.


At first glance I thought, "Looks OK" and then... WTF???  Everything was OK except the title. Not only was the text cut off, it was not even the text I had translated in German. When I copied the text out of the field and pasted it into Notepad, this is what I saw:
Tag der Tag der kulturellen Vielfalt
kulturellen Vielfalt
Vielfalt
kulturellen Vielfalt
kulturellen Vielfalt
Vielfalt
kulturellen Vielfalt
kulturellen Vielfalt
Vielfalt
No joke. Fluency somehow went berserk exporting the text of the title field, and sliced, diced and multiplied the whole mess in a truly bizarre way.

In my nearly 5 decades of casual and occasionally professional programming I have seen almost every stupidity imaginable, so in this case I imagined that somehow the problem lay in sloppy programming associated with text that is longer than the space provided in the field. Interestingly, Fluency enabled me to change the size of the target text in the translation window, so I reduced it by about half and tried to export a new target PUB file.


That worked in fact. So Fluency can indeed be used as a sort of filter for Microsoft Publisher files to be translated in other tools such as memoQ, but the process is not without trouble on the Fluency side, at least when text overruns the field size available, as one might expect to happen with some frequency.

Western Standard offers a 15-day trial of Fluency Now, their desktop tool for freelance translators, and the application is available on a monthly subscription of only 15 US dollars. So perhaps for the occasional project or client that requires work with PUB files, that is an option. Microsoft Publisher is not taken seriously as a layout and publishing tool by graphics professionals and CAT tool providers, but because it is part of the Microsoft Office suite, one will find it in use from time to time, and this imperfect solution may be the best option for helping such clients.

Apr 14, 2018

memoQ filter for MS Outlook e-mail

A few days ago I was preparing screenshots in memoQ for lecture slides. As I tried to select a PDF file to import, the defective trackpad on my laptop caused a file farther down in the list to be selected, and I got a surprise. Not believing my eyes, I tried again and saw that, yes, what I saw was indeed possible...


... saved Microsoft Outlook MSG files (e-mail) are imported to memoQ with all their graphics and attachments! Kilgray created a filter some time ago and simply forgot to document its existence publicly. As of the current versions of memoQ you won't see this in the documentation or the filter lists of the interface, but memoQ can "see" MSG files, and if they are selected, this hidden filter will appear in the import dialog.

And this also works for LiveDocs.


At the time of this discovery, I was working on a little job for a friend's agency, and her project manager had sent me a list of abbreviations in an e-mail. I was too lazy to make the entries in my termbase, so I simply imported the mail to the LiveDocs corpus I maintain for her shop so that it would show up in concordance searches:


So when people tell you memoQ is good, don't believe them. It's actually better than that, but the truth is a well-kept secret :-)

Apr 4, 2018

Complicated XML in memoQ: a filtering case example

Most of the time when I deal with XML files in memoQ things are rather simple. Most of the time, in fact, I can use the default settings of the standard XML import filter, and everything works fine. (Maybe that's because a lot of my XML imports are extracted from PDF files using iceni InFix, which is the alternative to the TransPDF XLIFF exports using iceni's online service; this overcomes any confidentiality issues by keeping everything local.)

Sometimes, however, things are not so simple. Like with this XML file a client sent recently:


Now if you look at the file, you might think the XLIFF filter should be used. But if you do that, the following error message would result in memoQ:


That is because the monkey who programmed the "XLIFF" export from the CMS system where the text resides was one of those fools who don't concern themselves with actual file format specifications. A number of the tags and attributes in the file simply do not conform to the XLIFF standards. There is a lot of that kind of stupidity to be found.

Fear not, however: one can work with this file using a modified XML filter in memoQ. But which one?

At first I thought to use the "Multilingual XML" filter that I had heard about but never used; this turned out to be a dead end. It is language-pair specific, and really not the best option in this case. I was concerned that there might be more files like this in the future involving other language pairs, and I did not want to be bothered with customizing for each possible case.

So I looked a little closer... and noted that this export has the source text copied exactly to the "target". So I concentrated on building a customized XML filter configuration that would just pull the text to translate from between the target tags. The custom configuration was created by first populating the tag list from the file and then excluding the content of the "source" tags:



That worked, but not well enough. In the screenshot below, the excluded source content is shown with a gray background, but the imported content contains a lot of HTML, whose tags must be protected:


The next step is to do the import again, but this time including an HTML filter after the customized XML filter. In memoQ jargon, this sort of configuration is known as a "cascading filter" - where various filters are sequenced to handle compounded formats. Make sure, however, that the customized XML filter configuration has been saved first:
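
The division of labor in such a cascade can be illustrated with a small, invented fragment that loosely resembles the client's export (the element names here are not the real ones): the XML stage selects only the target content, and the HTML stage then recognizes the markup embedded in that content so it can be protected as tags rather than offered up for translation.

    import re
    import xml.etree.ElementTree as ET

    # Hypothetical structure for illustration only:
    sample = """<units>
      <unit id="1">
        <source>Click &lt;b&gt;Save&lt;/b&gt; to store your changes.</source>
        <target>Click &lt;b&gt;Save&lt;/b&gt; to store your changes.</target>
      </unit>
    </units>"""

    root = ET.fromstring(sample)
    for unit in root.iter("unit"):
        # Stage 1 (custom XML filter): import only the target content and
        # exclude the source element, as in the configuration shown above.
        text = unit.findtext("target")
        # Stage 2 (cascaded HTML filter): the imported text still contains
        # HTML markup, which must be treated as tags, not translatable text.
        print(text)                              # Click <b>Save</b> to store your changes.
        print(re.findall(r"</?\w+[^>]*>", text)) # ['<b>', '</b>']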


Then choose that custom configuration when you import the file using Import with Options:


This cascaded configuration can also be saved using the corresponding icon button.


This saved custom cascading filter configuration is available for later use, and like any memoQ "light resource", it can be exported to other memoQ installations.

The final import looks much better, and the segmentation is also correct now that the HTML tags have been properly filtered:



If you encounter a "special" XML case to translate, the actual format will surely be different, and the specific steps needed may differ somewhat as well. But by breaking the problem down in stages and considering what more might need to be done at each stage to get a workable result with all the non-translatable content protected, you or your technical support associates can almost always build a customized, re-usable import filter in reasonable time, giving you an advantage over those who lack the proper tools and knowledge and ensuring that your client's content can be translated without undue technical risks.

Jan 30, 2018

Doing memSource better in memoQ with @wasaty!

This post has been updated. The good two-template solution has been improved to make a one-template solution. This is user engagement at its best in the world of memoQ.

Marek Pawelec (aka @wasaty), one of my favorite technical solution finders in translation, has published an effective improvement for those who prefer to do memSource projects in memoQ. I have done a good bit of this in the past, as I greatly dislike the limitations of the memSource local editor and dislike browser environments (from any firm) even more for translation, but the funky interpretation of XLIFF used by that tool requires some custom filter configuration to enable work to proceed without risk from unrecognized tags. Even so, the inability to transfer match percentage information and locked status for segments gave me more than a few headaches with these projects.

Someone at Kilgray mentioned a while ago that a proper memSource filter had been considered, but that resources were, alas, focused on other priorities, like 8.x "fixes" to features that weren't broken so that life would become more interesting for legal and financial translators whose work was becoming too easy with memoQ 7.8. No matter: once again, Marek has come through with an excellent professional solution for doing memSource better in memoQ.

Some highlights of the template provided:
  • memSource match rates are visible in memoQ
  • locked segments stay locked!
  • "translated" status will be kept
  • machine pseudo-translated garbage is marked with "MT" status in memoQ
  • memSource tags can be converted to memoQ tags
  • populated segments can be given "edited" status
Currently, this template is the best technical solution for working more efficiently and accurately with memSource MXLIFF files in memoQ and will probably remain so until Kilgray does get around to creating a properly integrated filter with configurable options. So if you have valued customers who use memSource but you want to leverage all your memoQ resources to do the work better, Marek's template is for you. Check out the detailed description and instructions on his blog!

Jun 5, 2017

Technology for Legal Translation

Last April I was a guest at the Buenos Aires University Facultad de Derecho, where I had an opportunity to meet students and staff from the law school's integrated degree program for certified public translators and to speak about my use of various technologies to assist my work in legal translation. This post is based loosely on that presentation and a subsequent workshop at the Universidade de Évora.

Useful ideas seldom develop in isolation, and to the extent that I can claim good practice in the use of assistive technologies for my translation work in legal and other domains it is largely the product of my interactions with many colleagues over the past seventeen years of commercial translation activity. These fine people have served as mentors, giving me my first exposure to the concepts of platform interoperability for translation tools, and as inspirations by sharing the many challenges they face in their work and clearly articulating the desired outcomes they hoped to achieve as professionals. They have also generously and frequently shared with me the solutions that they have found and have often unselfishly shared their ideas on how and why we should do better in our daily practice. And I am grateful that I can continue to learn with them, work better, and help others to do so as well.

A variety of tools for information management and transformation can benefit the work of a legal translator in areas which include but are not limited to:
  • corpus utilization,
  • text conversion,
  • terminology management,
  • diverse information retrieval,
  • assisted drafting,
  • dictated speech to text,
  • quality assurance,
  • version control and comparison, and
  • source and target text review.
Though not exhaustive, the list above can provide a fairly comprehensive basis for the education of future colleagues and the continued professional development of those already active as legal translators. But with any of the technologies discussed below, it is important to remember that the driving force is not the hardware and software we use in technical devices but rather the human mind and its understanding of subject matter and the needs of the particular task or work process in the legal domain. No matter how great our experience, there is always something new and useful to be learned, and often the best way to do this is to discuss the challenges of technology and workflow with others and keep an open mind for promising new approaches.


Reference texts of many kinds are important in legal translation work (and in other types of translation too, of course). These may be monolingual or multilingual texts, and they provide a wealth of information on subject matter, terminology and typical usage in particular contexts. These collections of text – or corpora – are most useful when the information found in them can be read in context rather than isolation. Translation memories – used by many in our work – are also corpora of a kind, but they are seriously flawed in their usual implementations, because only short segments of text are displayed in a bilingual format, and the meaning and context of these retrieved snippets are too often obscure.

An excerpt from a parallel corpus showing a treaty text in English, Portuguese and Spanish

The best corpus tools for translation work allow concordance searches in multiple selected corpora and provide access to the full context of the information found. Currently, the best example of integrated document context with information searches in a translation environment tool is found in the LiveDocs module of Kilgray's memoQ.

A memoQ concordance search with a link to an "aligned" translation
A past translation and its preview stored in a memoQ LiveDocs corpus, accessed via concordance search
A memoQ LiveDocs corpus has all the advantages of the familiar "translation memory" but can include other information, such as previews of the translated work as well. It is always clear in which document the information "hit" was found, and corpora can also include any number of monolingual documents in source and target languages, something which is not possible with a traditional translation memory.

In many cases, however, much context can be restored to a traditional translation memory by transforming it into a "document" in a LiveDocs corpus. This is because in most cases, substantial portions of the translation memory will have their individual segment records stored in document order; if the content is exported as a TMX file or tab-delimited text file and then imported as a bilingual document in a LiveDocs corpus, the result will be almost as if the original translations had been aligned and saved, and from a concordance hit one can open the bilingual content directly and read the parts before and after the text found in the concordance search.


Legal translation can involve text conversion in a broad sense in many ways. Legal translators must often deal with hardcopy or faxed material or scanned files created from these. Often documents to translate and reference documents are provided in portable document format (PDF), in which finding and editing information can be difficult. Using special software, these texts can be converted into documents which can be edited, and portions can be copied, pasted and overwritten easily, or they can be imported into translation assistance platforms such as SDL Trados Studio, Wordfast or memoQ. (Some of these environments include integrated facilities for converting PDF texts, but the results are seldom as suitable for work as PDF or scanned files converted with optical character recognition software such as ABBYY FineReader or OmniPage.)


Software tools like ABBYY FineReader can also convert "dead" scanned text images into searchable documents. This will even work with bad contrast or color images in the background, making it easier, for example, to look for information in mountains of scanned documents used in legal discovery. Text-on-image files like the example shown above completely preserve the layout and image context of the text, which is the best way to read it. I first discovered and used this option while writing a report for a client in which I had to reference sections of a very long, scanned policy document from the European Parliament. It was driving me crazy to page through the scanned document to find information I wanted to cite but had failed to make notes on during my first reading. Converting that scanned policy to a searchable PDF made it easy to find what I needed in seconds and accurately cite its page number, etc. Where there is text on pictures, difficult contrast and other complications, this is often far better for reference purposes than converting to an MS Word document, for example, where the layouts are likely to become garbled.


Software tools for translation can also make text in many other original formats accessible to translators in an ergonomically simpler form, also ensuring, where necessary, that no text is overlooked because of a complicated layout or because it is in an easily overlooked footnote or margin note. Text import filters in translation environments make it easy to read and translate the words in a uniform working environment, with many reference tools and other help available, and then render the translated text back into its original format or some more useful bilingual format.

An excerpt of translated patent claims exported as a bilingual table for review

Technology also offers many possibilities for identifying, recording and controlling relevant terminology in legal translation work.


Large quantities of text can be analyzed quickly to find the most frequent special vocabulary likely to be relevant to the translation work, and these terms can be saved in project glossaries, often enabling the work to be organized better, with much of the clarification of terms taking place prior to translation. This is particularly valuable in large projects where it may be advisable to ensure that a team of translators all use the same terms in the target language to avoid possible confusion and misunderstanding.

Glossaries created in translation assistance tools can provide terminology hints during work and even save keystrokes when linked to predictive, "intelligent" writing features.


Integrated quality checking features in translation environments enable possible deviations of terminology or other issues to be identified and corrected quickly.


Technical features in working software for translation allow not only desirable terms to be identified and elaborated; they also enable undesired terms to be recorded and avoided. Barred terms can be marked as such while translating or automatically identified in a quality check.

A patent glossary exported from memoQ and then made into a PDF dictionary via SDL Trados MultiTerm
Technical tools enable terminology to be shared in many different ways. Glossaries in appropriate formats can be moved easily between different environments to share them with others on a team which uses diverse technologies; they can also be output as spreadsheets, web pages or even formatted dictionaries (as shown in the example above). This can help to ensure consistency over time in the terms used by translators and attorneys involved in a particular case.

There are also many different ways that terminology can be shared dynamically in a team. Various terminology servers available usually suffer from being restricted to particular platforms, but freely available tools like Google Sheets coupled with web look-up interfaces and linked spreadsheets customized for importing into particular environments can be set up quickly and easily, with access restricted to a selected team.


The links in the screenshot above show a simple example using some data from SAP. There is a master spreadsheet where the data is maintained and several "slavesheets" designed for simple importing into particular translation environment tools. Forms can also be used for simplified data entry and maintenance.


If Google Sheets do not meet the confidentiality requirements of a particular situation, similar solutions can be designed using intranets, extranets, VPNs, etc.


Technical tools for translators can help to locate information in a great variety of environments and media in ways that usually integrate smoothly with their workflow. Some available tools enable glossaries and bilingual corpora to be accessed in any application, including word processors, presentation software and web pages.


Corpus information in translation memories, memoQ LiveDocs or external sources can be looked up automatically or in concordance searches based on whole or partial content matches or specified search terms, and then useful parts can be inserted into the target text to assist translation. In some cases, differences between a current source text and archived information are highlighted to assist in identifying and incorporating changes.


Structured information such as dates, currency expressions, legal citations and bibliographical references can also be prepared for simple keystroke insertion in the translated text or automated quality checking. This can save many frustrating hours of typing and copy revision. In this regard, memoQ currently offers the best options for translation with its "auto-translation" rulesets, but many tools offer rules-based QA facilities for checking structured information.
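
The pattern matching underneath such rules can be illustrated with an ordinary regular expression. The sketch below is written in Python rather than in memoQ's own auto-translation rule syntax, and the pattern and target convention are merely examples of the kind of substitution such rules perform.

    import re

    # Illustrative only: recognize amounts written as "EUR 3 million" or
    # "EUR 3.5 million" and render them in a shorter target convention.
    pattern = re.compile(r"\bEUR\s+(\d+(?:\.\d+)?)\s+million\b")

    source = "The credit facility was increased to EUR 3 million."
    print(pattern.sub(lambda m: f"€{m.group(1)}m", source))
    # -> The credit facility was increased to €3m.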


Voice recognition technologies offer ergonomically superior options for transcription in many languages and can often enable heavy translation workloads with short deadlines to be handled with greater ease, maintaining or even improving text quality. Experienced translators with good subject matter knowledge and voice recognition software skills can typically produce more finished text in a day than the best post-editing operations for machine pseudo-translation, with the difference that the text produced by human voice transcription is actually usable in most situations, while the "gloss" added to machine "translations" is at best lipstick on a pig.


Reviewing a text for errors is hard work, and a pressing deadline to file a brief doesn't make the job easier. Technical tools for translation enable tens of thousands of words of text to be scanned for particular errors in seconds or minutes, ensuring that dates and references are correct and consistent, that correct terminology has been used, et cetera.

The best tools even offer sophisticated features for tracking changes, differences in source and target text versions, and even historical revisions to a translation at the sentence level. And tools like SDL Trados Studio or memoQ enable a translation and its reference corpora to be updated quickly and easily by importing a modified (monolingual) target text.

When time is short and new versions of a source text may follow in quick succession, technology offers possibilities to identify differences quickly, automatically process the parts which remain unchanged and keep everything on track and on schedule.


For all its myriad features, good translation technology cannot replace human knowledge of language and subject matter. Those claiming the contrary are either ignorant or often have a Trumpian disregard for the truth and common sense and are all too eager to relieve their victims of the burdens of excess cash without giving the expected value in exchange.

Technologies which do not assist translation experts to work more efficiently or with less stress in the wide range of challenges found in legal translation work are largely useless. This really does include machine pseudo-translation (MpT). The best “parts” of that swindle are essentially the corpus matching for translation memory archives and corpora found in CAT tools like memoQ or SDL Trados Studio, and what is added is often incorrect and dangerously liable to lead to errors and misinterpretations. There are also documented, damaging effects on one’s use of language when exposed to machine pseudo-translation for extended periods.

Legal translation professionals today can benefit in many ways from technology to work better and faster, but the basis for this remains what it was ten, twenty, forty or a hundred years ago: language skill and an understanding of the law and legal procedure. And a good, sound, well-rested mind.

*******

Further references

Speech recognition 

Dragon NaturallySpeaking: https://www.nuance.com/dragon.html
Tiago Neto on applications: https://tiagoneto.com/tag/speech-recognition
Translation Tribulations – free mobile for many languages: http://www.translationtribulations.com/2015/04/free-good-quality-speech-recognition.html
Circuit Magazine - The Speech Recognition Revolution: http://www.circuitmagazine.org/chroniques-128/des-techniques
The Chronicle - Speech Recognition to Go: http://www.atanet.org/chronicle-online/highlights/speech-recognition-to-go/
The Chronicle - Speech Recognition Is in Your Back Pocket (or Wherever You Keep Your Mobile Phone): http://www.atanet.org/chronicle-online/none/speech-recognition-is-in-your-back-pocket-or-wherever-you-keep-your-mobile-phone/

Document indexing, search tools and techniques

Archivarius 3000: http://www.likasoft.com/document-search/
Copernic Desktop Search: https://www.copernic.com/en/products/desktop-search/
AntConc concordance: http://www.laurenceanthony.net/software/antconc/
Multiple, separate concordances with memoQ: http://www.translationtribulations.com/2014/01/multiple-separate-concordances-with.html
memoQ TM Search Tool: http://www.translationtribulations.com/2014/01/the-memoq-tm-search-tool.html
memoQ web search for images: http://www.translationtribulations.com/2016/12/getting-picture-with-automated-web.html
Upgrading translation memories for document context: http://www.translationtribulations.com/2015/08/upgrading-translation-memories-for.html
Free shareable, searchable glossaries with Google Sheets: http://www.translationtribulations.com/2016/12/free-shareable-searchable-glossaries.html

Auto-translation rules for formatted text (dates, citations, etc.)

Translation Tribulations, various articles on specifications, dealing with abbreviations & more:
http://www.translationtribulations.com/search/label/autotranslatables
Marek Pawelec, regular expressions in memoQ: http://wasaty.pl/blog/2012/05/17/regular-expressions-in-memoq/

Authoring original texts in CAT tools

Translation Tribulations: http://www.translationtribulations.com/2015/02/cat-tools-re-imagined-approach-to.html

Autocorrection for typing in memoQ

Translation Tribulations: http://www.translationtribulations.com/2014/01/memoq-autocorrect-update-ms-word-export.html

Mar 23, 2017

First month with SDL Trados 2017


A month ago, when I announced the Great Leap Forward from my rather neglected SDL Trados 2014 license to the latest, presumably greatest version, SDL Trados 2017, after seeing how wet the largely untested release of memoQ 8 (aka Adriatic) has proved to be, there was some surprise, as well as smiles and frowns from various quarters. It's been a busy month, and I am still testing options for effective workflow migration and exchange (useful in any case given how often memoQ users work together with those who prefer SDL tools) as well as discussing the good and bad experiences of friends, colleagues and clients who use SDL Trados Studio 2017.

As can be expected, this product has more than a bit of a bleeding-edge character, though on the whole it does seem to be a little more stable and less buggy than memoQ Adriatic so far, with fewer "what the Hell were they smoking" moments. However....

I was a little concerned at the report from a colleague in Lisbon that the integration of the plug-in for SDL Trados Studio access to Kilgray Language Terminal and memoQ Server translation memories doesn't work with SDL Trados 2017 after functioning so well in SDL Trados 2014 and 2015. Despite the stupid inter-company politics between SDL and Kilgray, which hindered the approval of the plug-in so that a warning dialog appeared each time it was loaded in SDL Trados Studio (bad form by the boys in Maidenhead), it was a great tool for users of SDL Trados Studio and memoQ to share TMs in small team projects. I was very happy with how it worked with SDL Trados Studio 2014, and I am very disappointed to see that API changes in the latest version have bunged things up so that Kilgray will have more work to re-enable this useful means of collaboration. I hope that SDL will see fit to be less petty and more cooperative with the upcoming "fixed" plug-in! It is in their interest to do so, as this makes it easier for SDL Trados users to stick to their favorite tool while working on jobs for or with those who prefer memoQ as their resource. Better work ergonomics for everyone and no BS with CAT wars.

I was pleased to see that SDL Trados Studio has added AutoCorrect facilities recently. And they seem to work reasonably well in English and mostly in German, though there was a strange quirk which hamstrung the "correct as you type" feature. That setting took a while to "stick" somehow when I tested it first with German. It was fine for Portuguese too. However, Ukrainian and Arab colleagues can't get it to work for some reason. I did not believe this at first until a colleague in Egypt showed me live via shared screens in Skype how the autocorrection simply failed to activate. Perhaps this is an issue with languages that don't use the Roman alphabet, so I suppose colleagues in Russia, Serbia, Japan and elsewhere may be tearing some hair out over this one. It doesn't affect me directly, but it looks like a pretty serious bug that ought to be addressed ASAP.

SDL generally kicks some butt with regex facilities in SDL Trados Studio; customer service guru Paul Filkin has written a lot about these features on his Multifarious blog, and most advanced users of the platform make heavy use of regular expressions in filters and QA rules. For a long time, memoQ users could only look on in envy at all the excellent possibilities before Kilgray belatedly added more regex options to its work environment. However, there are a few rough edges remaining.

My Arabic translator friend pinged me recently to ask if I was aware of the "regex trouble" in the latest Studio version. He made heavy use of these features for Arabic and English work in some rather amazing, creative and inspiring ways I had not imagined in earlier versions of SDL Trados Studio, and some of these features are rather broken at present in SDL Trados 2017. He gave me a very useful tutorial (which I had planned to beg him for anyway soon) in the use of regex in SDL Trados Studio for basic filtering, advanced filtering and QA checks. Overall I was very impressed with the possibilities, but the failure of some regular expressions which worked well in the advanced filters to work at all in the basic filter or in QA rulesets was very disturbing. We argued a little about what the basis of the problem could be in the software programming, but it is a major problem which limits the functionality of SDL's latest software severely and should cause advanced users and LSPs to wait and watch for the fix before upgrading to the latest version. The persistence of such a major flaw in such an important area as quality assurance some 6 months after release is frankly shocking. I hope this will be addressed very soon so that I can migrate and upgrade some of my favorite QA routines from memoQ.

Last but not least is an irritating bug in an auxiliary feature for what has always been one of my favorite terminology tools, MultiTerm. It was the first Trados product many years ago, and despite many quirks over the decades, it remains one of the best. Face it: the memoQ terminology model is OK for most practical uses, but for maintaining high quality corporate terminologies tracking many important attributes it is hopeless garbage. Most other CAT tool terminology databases and glossaries are far worse. MultiTerm sets the standard today still for affordable, flexible, powerful terminology management. For 17 years I have used this excellent platform for my best terminologies for my best clients and delighted in its output management options (even when they can be a pain in the butt to configure properly).

When I want to access my high value MultiTerm resources while translating in memoQ or working in web pages or MS Word, I use the convenient MultiTerm widget to access the data. However, I am very disappointed to find that recent versions do not display the attributes for terms when the widget is used for lookup. Damn. That makes the results just as annoying as the lobotomized MultiTerm/TBX imports into memoQ. I really hope that SDL fixes this flaw ASAP and remains on top of the terminology game with MultiTerm and its lookup tools as a valuable resource even for translators who hate Trados Studio and won't use it.

Overall I am seeing a lot of nice things in SDL Trados Studio 2017, and I would say it is probably more mature and stable than memoQ 8 at this point. But it really is just a late-stage beta release, and more fixes are needed before I can trust it for routine production work. We are all better off for now to stick with the prior versions of both SDL Trados Studio and memoQ.

Mar 17, 2016

Dynamic filtering with regular expressions in memoQ


Regular expressions (aka regex) are not a tool for everyone, though this is something that the nerdily inclined often fail to appreciate. For average users, a plain language query interface, perhaps with more limited options, is generally more accessible and more likely to be used. However, sometimes it's nice to have such "shortcuts" available to select particular structures in a text for translation or editing, and the many people who complained for years that Kilgray did not provide a dynamic regex filter for the working translation grid - a feature of SDL Trados Studio for quite a while now - did have a point worth addressing in development. Now that has happened, though still a bit incompletely when considered in the full scope of memoQ's usual features for selecting text.

memoQ uses regex in a number of its modules, and Kilgray has several webinars which describe these applications, though they require some stamina to watch, and I expect that most people will become hopelessly confused if they try to take in more than one area of application in a single sitting. The uses of regex for segmentation rules, tagging, autotranslatables and text filtering on document import (with the Regex Text Filter) are very different in their approach, even though the underlying syntax of the regex is the same. However, all of these applications allow the configured rules to be saved and re-used, so one could ask an expert to create the settings needed and provide these in a resource file, and many users do exactly that. Thus, as long as one understands that regex can be used for a particular problem, the details can be hired out.

This new application of regex for dynamic filtering, introduced in recent builds of memoQ 2015, is a little different (at present). Although the Find/Replace dialog will "remember" regex syntax in its dropdown menu of recent expressions, there is no way to store these expressions, and they must be entered manually to use them. This means that, for now, the average user will have to collect useful expressions like a tourist might scribble phrases in a notebook to use on holiday in a foreign country, and those with a little more sense of adventure might find themselves with a hovercraft full of eels and wonder why.

One such phrase might be the example in the screenshot above. I was translating some financial statements with several formats present for digits in account numbers, dates and monetary expressions. In order to work more systematically with these various formats, I used several different regex expressions to sort and separate them. In the example I was looking for instances where at least four digits were written together in a source segment. That isn't terribly selective, but most of these occurrences in my documents were account numbers, and this helpfully cleaned up the text a lot and allowed me to work a little faster. Other expressions were used to QA date formats and monetary expressions more specifically.
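
A few examples of that kind of "phrase book" entry are shown below, verified here in Python; for patterns this simple the syntax is the same as in memoQ's .NET-based regex engine. The digits pattern is the one mentioned above, while the date and currency patterns are illustrative rather than the exact expressions used in that project.

    import re

    segments = [
        "Account 1234567 was closed on 31.12.2015.",
        "A dividend of EUR 2.50 per share was proposed.",
    ]

    filters = {
        "four or more digits together (often account numbers)": r"\d{4,}",
        "dd.mm.yyyy dates": r"\b\d{2}\.\d{2}\.\d{4}\b",
        "EUR amounts": r"\bEUR\s?\d+(?:[.,]\d+)?\b",
    }

    for label, pattern in filters.items():
        hits = [s for s in segments if re.search(pattern, s)]
        print(label, "->", hits)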

In the working grid for translation and editing, regular expressions can be used in one or both of the fields for the source and target text when the checkbox in the toolbar at the right is marked. Or the regular expressions option in the Find/Replace dialog can be used.


It is somewhat disappointing that regex cannot be used to create static views at the present time. While marking can be used in the Find dialog to enable one to go back and forth between the filter criteria and other configurations of the working grid, there is no way to make a permanent "record" of the filtered segments. For quite a few years, I have wished for the possibility to save the results of my filtering in the working grid in some sort of view, but I was always able at least to recreate the filtering criteria in the dialog to create a memoQ View, which could then be opened at any time or exported in various formats for clients and project collaborators. However, at the moment that is not possible with regex filtering. (There are workarounds involving a change in segment status, but these are often inconvenient in a project in progress.)

The addition of regex filtering to the working grid in memoQ is a welcome feature for many, which I hope will be expanded by Kilgray in the future to achieve more of its potential. But to take advantage of this potential in any way, the average user will indeed need a "phrase book" of sorts, and an efficient way of managing useful collected regex snippets (and naming them for easier re-use in searches and filtering) would be very desirable. If these "regex phrase books" for dynamic filtering and view creation were able to be saved as shareable light resources, it would be possible to build many useful collections to help users at all levels in the translation, editing and quality assurance tasks.