Translation Tribulations: DOCX


May 28, 2022

Filtering formatted text in Microsoft Office files

Recently, I shared an approach to selecting text in a Microsoft Word file with editing restricted to certain paragraphs. This feature of Microsoft Word is, alas, not supported by any translation tool filters of which I am aware, so to import only the text designated for editing it is necessary to go inside the DOCX file (which is just a ZIP archive with the extension changed) and use the XML file which contains the document text with all its format markers.

This approach is generally valid for all formats applied to Microsoft Office files since Office 2007, such as DOCX from Word or PPTX from PowerPoint. I have prepared a video to show how the process of extracting the content and importing it for translation can work:

After translation, the relevant XML file is exported and the original XML is replaced with the translated file inside the archive. If the DOCX or PPTX file was unpacked to get at the XML, the folder structure can then be re-zipped and the extension changed to its original form to create the deliverable translated file.
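The replacement step can also be scripted instead of being done by hand. What follows is only a minimal Python sketch, not the procedure shown in the video. It assumes the translated XML has already been exported from the translation tool, and the function name `replace_part` is hypothetical. It copies every entry of the original archive into a new file and swaps in the translated bytes for one named part.

```python
import zipfile


def replace_part(src_path, dst_path, part_name, new_bytes):
    """Copy the Office archive at src_path to dst_path, replacing one part.

    part_name is the path inside the archive, e.g. "word/document.xml".
    Each entry keeps the compression method recorded in the original archive.
    """
    with zipfile.ZipFile(src_path) as zin:
        if part_name not in zin.namelist():
            raise KeyError(f"{part_name} not found in {src_path}")
        with zipfile.ZipFile(dst_path, "w") as zout:
            for item in zin.infolist():
                if item.filename == part_name:
                    data = new_bytes
                else:
                    data = zin.read(item.filename)
                # Passing the ZipInfo object keeps the entry name and
                # its original compression type.
                zout.writestr(item, data)
```

A call such as `replace_part("source.docx", "target.docx", "word/document.xml", translated_bytes)` (file names invented for illustration) would write the deliverable to a new file and leave the original untouched, which makes it easy to check the result in Word before sending it.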

What I do not show in the video is that the content can also be extracted by other means, such as convenient memoQ project templates using filters with masks to extract directly using various ZIP filter options. But the lower tech approach shown in the video is one that should be accessible to any professional with access to modern translation environment tools which permit filter customization with regular expressions.

Once a filter has been created for a particular format such as red text, adapting it to extract only green highlighted text or text in italics or some other format takes less than a minute in an editor. Different filters are necessary for the same formats in DOCX and PPTX, because unfortunately Microsoft's markup for yellow highlighting, for example, differs between Word and PowerPoint in the versions I tested.
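To illustrate why one filter cannot serve both formats, here is a rough Python sketch of the kind of pattern matching involved. It is not a memoQ filter configuration. The markup strings are assumptions based on typical WordprocessingML and DrawingML output (`<w:highlight w:val="yellow"/>` in Word runs, `<a:highlight>` wrapping an `<a:srgbClr>` colour in PowerPoint runs); real files may carry extra attributes or a different element order, so any pattern should be checked against the actual XML.

```python
import re

# A Word run: <w:r> or <w:r ...>, but not <w:rPr>.
WORD_RUN = re.compile(r"<w:r[ >].*?</w:r>", re.S)
# Word text element, with or without attributes such as xml:space.
WORD_TEXT = re.compile(r"<w:t(?: [^>]*)?>(.*?)</w:t>", re.S)

# A PowerPoint run and its text element.
PPT_RUN = re.compile(r"<a:r>.*?</a:r>", re.S)
PPT_TEXT = re.compile(r"<a:t>(.*?)</a:t>", re.S)


def highlighted_word_text(xml, colour="yellow"):
    """Return the text of Word runs that carry the given highlight colour."""
    marker = f'<w:highlight w:val="{colour}"/>'
    return ["".join(WORD_TEXT.findall(run))
            for run in WORD_RUN.findall(xml) if marker in run]


def highlighted_ppt_text(xml, rgb="FFFF00"):
    """Return the text of PowerPoint runs highlighted with the given RGB value."""
    marker = f'<a:highlight><a:srgbClr val="{rgb}"/>'
    return ["".join(PPT_TEXT.findall(run))
            for run in PPT_RUN.findall(xml) if marker in run]
```

Switching from yellow highlighting to another format means changing only the marker string, which matches the "less than a minute in an editor" experience described above.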

Although this is a bit of a nerdy hack, it's probably easier for most people than various macro solutions to hide and unhide text. And it takes far less time and is more accurate than copying text to another file.

In cases where it is important to see the original context of the text being translated, this can be done, for example, using memoQ's PDF Preview Tool, a viewer available in recent versions which will track the imported text in a PDF made from the original file. Such a PDF can be created using the PDF Save options available in Microsoft applications.

Nov 30, 2013

The state of the upgrade: memoQ 2013 R2

The memoQ 2013 release started off on the wrong foot with me in many ways. I was deeply disappointed by the features that were previewed in Budapest at the last memoQfest, and I was even less happy after I saw what a hash had been made of one of the features I use most: comments. In fact, I wrote a rather annoyed blog post about that not long after the release. There was a lot of talk about "game-changing innovation", but frankly I really could not see it. My translating colleagues asked me if it was worth it to upgrade, and aside from my usual warnings about the need to wait for at least 2 or 3 months after any release for it to mature and stabilize, I just could not find any compelling arguments for a freelance translator to move from the stable, excellent 6.2 version to the rather dodgy 6.5 version, or "memoQ 2013" as it was rechristened.

Almost on the usual schedule, however, two months later the bugs were largely sorted out, the initial mistakes in the comment feature redesign were well fixed, and I no longer saw the memoQ 2013 release in the same dim light, but could actually see some benefits for my freelance colleagues to upgrade to that version and no actual harm in doing so. And as I got to know the fuzzy term matching feature better and saw how it helped me deal with typo-laden source documents or the usual spelling chaos of German technical writers, I began to see some very compelling value in memoQ 2013 for translators.

Most of the "game changers" talked about in May actually arrived a month ago with Release 2 of memoQ 2013. I did my best to lower expectations for this release, not because I think it is crap, but because I think this is one of the best CAT tool version upgrades I have seen in 13 years, and I knew it would need the usual time to mature. I think by the end of the year this version will have so much to offer that I would rather not have people stressing over the small stuff that I am confident will be fixed well.

However, I decided to live dangerously, and I switched over all my production work to use this version even before the official release. The first few weeks were not fun with all the little quirks I discovered and duly reported, but I encountered nothing data-destroying or really shocking, mostly just housekeeping details like somebody forgetting to vacuum the rug after gutting the whole house and giving it a nice remodel.

One month after the official release, memoQ 2013 R2 is far more reliable than I remember any memoQ version being one month after release. There has been steady refinement in its features, and I continue to discover hidden gems that I sometimes suspect most of the Kilgray team aren't even aware of yet because so much was added and changed, but not in a way that disrupted older work processes. I have a long shopping list of refinements that I think should be made to new features like the TM search tool (which has only actually worked on my system since the release of the 6.8.5 build about a week ago) or that ground-breaking monolingual review feature (which will probably be the next big CAT feature to copy), but even the new features I consider rather immature are already looking pretty damned good. I can't guarantee that this release can be trusted for all your work right now (though it actually seems pretty good to me right now), but since it can be safely installed in parallel with older versions, I definitely recommend taking a look and joining the conversation on refinements still needed. I think Kilgray has been very responsive to user feedback in this round, and I can't say I am anything but encouraged by what I have seen in the last month.

One very exciting change for me in the current build (6.8.6) is that the rather risky non-optional export of target text comments with DOCX files has been sorted out very nicely. The solution seems a little strange to me right now, but it's a great step forward with some excellent possibilities.

When I saw those "severity levels" added to the commenting features in memoQ 2013 (6.5), I had very little good to say about them. I still don't think much about how they are named and wish I could choose my own labels, but now I can only applaud their usefulness. Why? Because the addition of the five checkboxes above has given me the control I want over comments to be included in an exported translation of a DOCX file. I can cleanly separate the comments which are notes to myself from those for my project partners and comments for my customers. This is very helpful.

I do think it is odd that this control was placed at Tools > Options > Miscellaneous > Translation when the comment exports (as far as I know) only affect DOCX files, but if there are plans to extend this feature to other exported formats, then this makes sense. I would like to see similar filtering controls for the ordinary view filters (on that last tab where comment and tag filtering criteria can be specified) and for comment inclusion in a bilingual RTF export. Either of these would be an enormous help to my frequent work processes, because I use a lot of comments intended for different people, and sorting these out cleanly can be laborious.

In recent weeks I have been working on the new edition of my memoQ tips book and taking a very close look at "corners" of the software that I suspect very few have time or inclination to look in. And I've had days when it really felt like Christmas has come early. One discovery after another of nice little refinements, lots of incremental improvements, which added together give a total with what I feel is a lot of value. I'm writing way too many private thank-yous to some of the people at Kilgray for what I see as excellent new directions even if I am inclined to argue over some of the details.

Since the release of memoQ 6.2 and its follow-ups with the bilingual text/Excel filter, there has been such a steady flow of useful improvements to help individual translators work better that those who claim that all the effort of development has been spent catering to the corporate sausage-making interests of the low-paying cattle call crowd simply haven't been paying attention. Or they have been confused by Kilgray's occasionally appalling failure to organize their messages properly for different interest groups. If you're talking to a big group of freelance translators and start discussing "great server features to monitor your translators' productivity", don't expect blown kisses and showers of rose petals. Sometimes it's obvious that the makers of the tool don't always understand the importance of what they have created for our work. Well, why should they? We're the ones doing it. But I tell you, right now there is a lot more gold for individual translators in the memoQ mine than anyone realizes. That goes for me too. I am surprised by fat new nuggets I find almost every week.

Do I care that so much effort is spent on developing cutting-edge project management features for memoQ translation servers, even ones that I think can be abused in some pretty awful ways by some companies whose business practices I detest? Well yes I do... I think it's great. Besides, I can actually come up with nice uses of those awful features. You can do a lot of things with a cutting edge: chop up a tasty salad... or the local nursery school. Blame the fool, not the tool.

Kilgray has avoided the disastrous errors committed by Atril in the last decade as their market mis-focus and disastrous failure to get the maintenance revenue needed to fix and develop features steadily eroded the ability of its loyal users to cope with a changing market. There was nearly a complete failure to compete for the business of translation agencies and corporate and government translation departments. And the solutions that prevailed in those quarters were mostly rather awful. I watched whole departments of Siemens traumatized by the disastrous Trados Teamworks, which made a number of those in the translation team of the medical products division look forward to retirement.

Kilgray has steadily built its business in the markets ignored by Atril a decade ago and in doing so has secured its future far better and ensured the funding of a truly remarkable series of improvements in the four and a half years I have been using memoQ. And now... when I look at the features of the recent SDL Trados 2014 release I see good things that I have known from other tools for a long time for the most part, nice to have really, but as I stifle a yawn I wonder if it all really has to be so complex since I'm not depending on consulting or training for SDL to pay my bills. And then I get back to memoQ and keep getting rocked by the "wow factor" as I find useful new things while trying to concentrate and get a job done. memoQ 2013 R2 is one of the worst offenders I've seen in a long time for its very real threats to make my work a lot easier and more fun!

Aug 2, 2013

Translating SDL Trados Studio SDLXLIFF files & more in memoQ!

My latest demonstration video actually covers a number of memoQ features so that I would have an excuse to create this video index:

Time Description
0:32 Importing the first SDLXLIFF file to memoQ
1:12 Exporting the finished translation
1:27 Viewing the translation in SDL Trados Studio 2009
1:40 Re-importing the edited translation for a TM update
3:24 Saving the translation in a LiveDocs corpus for later reference
3:55 Importing a new version of the text in an SDLXLIFF source file
4:25 Comparing source text versions
5:55 Document-based pretranslation ("X-Translate")
7:11 Examining a "warning" for forgotten tags
7:46 Results of the second translation in SDL Trados Studio

That is the sort of thing I was talking about in a recent blog post about new approaches for online instruction. Many times I have wished for just such an index for long webinars or even much shorter reference videos like this one.

This tutorial was inspired by a Skype chat with a colleague in the US a few days ago. She uses memoQ but works with a number of others who use various versions of SDL Trados Studio, and there were some questions about how one might deal with TM updates after a translation as well as the inevitable new versions that legal and financial translators often encounter.

I have also noticed that quite a number of people are not up to date on SDLXLIFF compatibility with memoQ; this video also shows that former issues with preserving segment status have been taken care of, and everything now works well.

What is not obvious in the video is that one can also change the segmentation of the SDLXLIFF in memoQ; this happens only in the memoQ environment to allow better translation and more sensible translation memory content, and when the SDLXLIFF file is exported from memoQ, the original segmentation from Trados is preserved in the Trados environment.

Also not shown in the video is how I imported a third version of the source text, this time as a Microsoft Word file, not an SDLXLIFF. The document-based pre-translation (X-Translate) worked perfectly, and the target file was exported in the proper format (DOCX).

There are, of course, many other ways one could handle a "project" like this, but the procedure shown is not unlike what I sometimes do in projects myself.

********

I apologize for the quirky click animation in this tutorial; Camstudio had some problems I have never encountered before, and I'll have to get to the bottom of that if I keep using that tool. Otherwise, the video quality is probably the best I have achieved so far, and I would like to thank the friend who revealed the "secret" of better quality video for YouTube.

Jul 17, 2013

How would you translate the chart in this DOCX file?

Can anyone tell me quickly the best way to translate the chart in this DOCX file? Or how to get an accurate word count of the words to be translated in the file?

*****

I love to see the different approaches people take to this problem. It's one which I think is encountered with some frequency by translators, and in the past I took many different approaches to it - long ago I usually did something involving PDF conversion, editing of the PDF and making a screenshot. But that is inefficient and doesn't allow the use of CAT tools.

Yesterday I picked up a project with 18 of those silly charts embedded in it. A real nuisance. Here's what happens if you try to edit one of those charts in situ:

Hopeless, right? A lot of very authoritative web pages make it clear that without having the linked Excel files, you cannot modify the text. Not true, actually. With or without hints, a number of technically versatile colleagues found ways to solve the problem or at least made close guesses. Some of these are here in the comments. One very interesting exchange on Twitter showed that somehow the settings of the OmegaT import filters can be tweaked to solve this:

The thing about OmegaT is that it's sort of geeky - the solution looks pretty good here, but I can't actually make it work myself.

The solution I worked out last night is very similar to the one described by Stanislas in the comments.

Change the file extension to ZIP
Look inside the ZIP file with Windows Explorer or another suitable tool as described in other blog posts.
Inside the "word" subfolder there is a folder named "charts". It contains XML data with all the chart headings, numbers and labels. Copy it.
Paste a copy of the folder where you want your source files. Import the chart XML files into any CAT tool or XML editor. It's a good idea to configure a filter to exclude and protect the references to the original Excel files with the data. (Though I am curious whether deliberately spoiling these data can protect against the unwanted update that one person worried about in the comments. I'll have to try that.)
When the translations are completed, paste the XML files back inside the charts folder in the file structure.
Rename the extension back to what it was at the start (DOCX in this case). You're done. No refresh necessary (unlike with embedded Excel or PowerPoint objects).
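For the word count question, the chart XML can also be read directly by a script. The following Python sketch is an illustration under stated assumptions, not the memoQ filter mentioned below: it assumes that visible chart text sits in `<a:t>` elements (titles and rich text) and in `<c:v>` values inside `<c:strCache>` (cached category and series names), and it deliberately skips the `<c:f>` cell references and the numeric caches. The function names are hypothetical.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespaces used by DrawingML chart parts.
NS = {
    "c": "http://schemas.openxmlformats.org/drawingml/2006/chart",
    "a": "http://schemas.openxmlformats.org/drawingml/2006/main",
}


def chart_strings(chart_xml):
    """Return the text strings found in one chart XML part."""
    root = ET.fromstring(chart_xml)
    # Titles and other rich text.
    texts = [t.text for t in root.iterfind(".//a:t", NS) if t.text]
    # Cached string labels; numeric caches (c:numCache) are ignored.
    texts += [v.text for v in root.iterfind(".//c:strCache/c:pt/c:v", NS)
              if v.text]
    return texts


def docx_chart_strings(docx_path):
    """Map each chart part in a DOCX file to its list of text strings."""
    result = {}
    with zipfile.ZipFile(docx_path) as z:
        for name in z.namelist():
            if name.startswith("word/charts/") and name.endswith(".xml"):
                result[name] = chart_strings(z.read(name))
    return result


def word_count(strings):
    """Count whitespace-separated words in a list of strings."""
    return sum(len(s.split()) for s in strings)
```

The count is only as good as the assumption about where the text lives, so it is worth comparing the extracted strings against what is visible in a few charts before quoting on that basis.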

A memoQ filter configuration for these XML files can now be found on Kilgray's Language Terminal.

Jul 10, 2013

Coping with objects and graphics to translate in Microsoft Office documents

About a year ago, I published a series of posts describing a simple way to get at the objects and graphics embedded in Microsoft Office documents, such as Microsoft Word DOCX documents or PowerPoint PPTX presentations. These investigations were inspired by a series of jobs where I had to cope with up to 60 embedded Excel tables in a Microsoft Word document. The four related posts are:

The post titles may differ a little from the text in the links here, which is updated for a little more clarity.

I've also added two short videos to my YouTube channel which illustrate how to remove embedded objects from a DOCX for translating separately from the Microsoft Word document and how to put them back afterward.

Here's how to extract the embeddings folder from the DOCX file:

And here is how to put the translated embedded objects into the DOCX file and refresh the view of the embedded objects in your translation:

These and other videos I've produced recently are part of an effort I began recently to develop integrated courses for self-instruction and review with software tools used by many of us. These courses use the Moodle platform and offer text, screenshots, audio, video and data files such as examples of file formats to translate, backups of memoQ practice projects to restore on your local computer for training, configuration resources for memoQ, useful macros to support work with many translation environment (CAT) tools and a host of other resources and learning links.

Aug 29, 2012

Replacing bitmap graphics in MS Office 2007/2010

A few weeks ago I published a series of posts about different aspects of embedded objects in DOCX, XLSX and PPTX files and how to translate them a bit more conveniently. When I described this procedure to a colleague this afternoon, he thought it was a solution for convenient substitution of bitmap graphics in these files as well. Well, sort of, but not quite.

If you rename the extension of a DOCX file to ZIP and open the ZIP file using Windows Explorer (in order not to mess up the compression), you will see a folder named word.

Inside this word folder are other folders of interest:

The embeddings folder has the objects such as Excel tables or PowerPoint slides described in previous posts. The bitmap graphics or pictures are in the media folder, however:

The view inside the media folder above shows one bitmap graphic (the JPEG file) and various other files with images of the embedded objects (an equation, an Excel table and a PowerPoint slide). Only the bitmap files are of real interest. If other graphic files localized for the target language are named the same as the original files in the media folder and substituted there, when the ZIP file is renamed to have its original extension, the substituted graphics will appear in the document the next time it is opened.

This way, for example, screen shots for an entire file can be substituted quickly. One could, of course, do this by a number of other means, but this way is fairly convenient and could probably be automated without much ado if your organization needs to make such substitutions a lot.
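As a rough idea of what such automation might look like, here is a minimal Python sketch. It assumes the localized graphics sit in one folder and carry exactly the same file names as the originals in the media folder; the function name `substitute_media` and its parameters are hypothetical. It writes a new file rather than modifying the original, and each entry keeps the compression method recorded in the source archive. Whether a scripted rewrite behaves as reliably as the Windows Explorer route is untested here, so try it on a copy first.

```python
import os
import zipfile


def substitute_media(src_path, dst_path, localized_dir,
                     media_prefix="word/media/"):
    """Rewrite an Office archive, swapping in same-named localized graphics.

    Returns the list of archive entries that were replaced.
    """
    replaced = []
    with zipfile.ZipFile(src_path) as zin, \
            zipfile.ZipFile(dst_path, "w") as zout:
        for item in zin.infolist():
            data = zin.read(item.filename)
            if item.filename.startswith(media_prefix):
                candidate = os.path.join(
                    localized_dir, os.path.basename(item.filename))
                # Only substitute when a localized file with the same
                # name actually exists; everything else is copied as-is.
                if os.path.isfile(candidate):
                    with open(candidate, "rb") as f:
                        data = f.read()
                    replaced.append(item.filename)
            zout.writestr(item, data)
    return replaced
```

For a PowerPoint file the prefix would presumably be `ppt/media/` instead. The returned list makes it easy to spot graphics for which no localized version was supplied.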

Addendum: I was curious about those other files in the media folder - the ones with the views of the embedded objects. So I deleted them to see if they would re-generate when the document is opened. Instead, this message was displayed at the location of each object:

Double-clicking the "broken" object display opened the object and restored the view. So clearly, refreshing object views involves updating the content of the media folder.

Jul 27, 2012

Translating embedded objects in Microsoft Office documents

Yesterday a colleague sent me a note to say he had been searching my blog for information about translating compound Microsoft Office documents (that is, documents with embedded objects) in memoQ and couldn't find any. I presume he was referring to the article about how often one CAT tool is not enough - combined workflows with other tools can frequently help solve many tricky translation problems, and DVX2 or STAR TRANSIT are definitely useful options for preparing compound Microsoft Office documents for translation in memoQ. Some time ago I recommended using STAR TRANSIT as a pre-processing tool to one of my agency friends, and he carried out a very large, complex project successfully using memoQ's excellent integration features for STAR TRANSIT projects.

There is, of course, another simple way to translate the embedded objects in a Microsoft Office document that does not involve purchasing other software licenses. I don't usually talk about it, because there are a few limitations, and until recently I had not figured out how to avoid corrupting the files when I tried to do things the "easy" way. This approach is not limited to memoQ and will actually work with most CAT tools - so SDL Trados Studio users can do this as well, for example.

It is useful to know that the Microsoft Office 2007/2010 file formats (DOCX, PPTX, XLSX) are really just ZIP files containing XML and a bunch of other stuff. That stuff includes a folder with the embedded objects in formats that can be dealt with directly.

If you have an older, binary MS Office document (DOC, PPT, XLS) with embedded objects, convert it to a 2007/2010 format.

If you rename the file extension DOCX, PPTX or XLSX to ZIP and unpack the ZIP file, inside the folder you will find a folder called "embeddings". The files in that folder can be copied elsewhere and usually handled directly in your CAT tool. But problems usually arise when you put them back, rezip the folder and change back to the original extension. The compression gets screwed up, and the Microsoft Office file is corrupted and won't open.

The only reliable method I have found for avoiding this is to use the Windows Explorer (under Windows 7) to open the ZIP file:

Here's what the "guts" of one DOCX file with a bunch of embedded Excel tables looks like:

Inside the word folder you'll find the embeddings folder:

The contents of the embeddings folder look like this:

Simply copy the embeddings folder somewhere safe, translate its contents, then copy them back to the ZIP file using Windows Explorer. Then rename the ZIP extension to the original extension for the file.

If you open the file and look at it, you'll get a shock. When you see all the objects in their original language, you might think something went wrong. Nothing bad has happened; you merely need to refresh the objects. This can be done by opening each briefly to edit or using a macro to open each object and close it again quickly. In a job with dozens of embedded objects in a long file, this macro is a helpful shortcut.

Given how easily accessible this embedded content actually is, one has to wonder why other major CAT tool providers like SDL and Kilgray have failed to offer the option of importing embedded content in their filters up to now. Let's hope they do soon. In the meantime, this workaround should enable many people to deal with this complex and irritating file format challenge.

Here's a summary of the procedure once again:

Rename the *.???x file to *.zip
Under Windows 7, right-click on the ZIP file and open it using the Windows Explorer. Using ZIP tools of any kind risks corruption by changing the compression ratios.
Find the embeddings folder inside the ZIP structure. Copy this elsewhere and use it as the source for translation. It will contain all the embedded objects as single files.
Copy the translated content back into the embeddings folder in the ZIP structure.
Rename the ZIP file to its original extension.
Open the file and refresh each embedded object (which will initially appear not to have been translated) by right-clicking and opening it from the context menu or running a macro to do that.
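The copy-out step of this procedure can also be sketched in a few lines of Python for anyone who prefers a script to Windows Explorer. This is only an illustration: the function name `extract_embeddings` is hypothetical, and it assumes the embeddings folder sits one level below the top of the archive (for example `word/embeddings` in a DOCX file). It reads the Office file directly without renaming it and does not modify it.

```python
import os
import zipfile


def extract_embeddings(office_path, out_dir):
    """Copy every file under an 'embeddings' folder out of an Office archive.

    Returns a dict mapping archive entry names to the extracted file paths,
    which records where each translated file has to go back later.
    """
    extracted = {}
    os.makedirs(out_dir, exist_ok=True)
    with zipfile.ZipFile(office_path) as z:
        for name in z.namelist():
            parts = name.split("/")
            # Expect e.g. "word/embeddings/<file>"; skip folder entries.
            if len(parts) == 3 and parts[1] == "embeddings" and parts[2]:
                target = os.path.join(out_dir, parts[2])
                with open(target, "wb") as f:
                    f.write(z.read(name))
                extracted[name] = target
    return extracted
```

Putting the translated files back is the delicate part, given the corruption problems described above, and the embedded objects still have to be refreshed in the Office application afterward either way.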

Jul 2, 2012

Sometimes one CAT tool is not enough

Not long ago, a colleague in New Zealand expressed her frustration about the limits of interoperability for common translation environment tools and her sense of unfulfilled promises:

In the case she was concerned with, she was quite right. There are workarounds for complex MS Word documents with footnotes, but none of these are really optimal for a team working simultaneously in several different CAT tools. In the case of memoQ 5 (which was part of the mix) the lack of support for footnotes in RTF/DOC bilinguals made it impossible to review an uncleaned translation done in WordFast Classic (not a problem for simpler files), and the use of a bilingual DOC export from memoQ used the "simple" format of one segment per line, thus losing the format for the working translator. I hope that will be dealt with in time by Kilgray's developers.

But fortunately, interoperability really does work - it is "the art of compromise" as one industry guru put it, but there are many acceptable compromise strategies that allow productive collaboration, and memoQ excels in this regard more than any other tool I know. But as I have said so often, we need a broad palette of tools to enable us to handle any job efficiently, and last week's project here was a good example of this.

No good deed goes unpunished, and my punishment for an almost miraculous rescue of the editing and harmonization of a large, complex financial report done in a hurry by several translators, some of whom don't use CAT tools at all, was that I got to do the update of that text and see all the little stuff we missed the first time around when the client CEO and I traded sleep for coffee and Excel spreadsheets. Actually, I loved that job, and I was proud of what we could accomplish in 48 hours that should have taken a week or more of overtime. All of it possible only thanks to memoQ LiveDocs and the QA module. And lots and lots of coffee.

In this round, however, I was determined to avoid some of the pain caused last time by file format problems. The Notes to the annual report contained about 30 embedded Excel tables in a Word document. "So what?" says the user of Star Transit or DVX2. "Uh oh!" say the Trados and memoQ users. This is where interoperability saved me hours of bother.

I'm no longer comfortable doing routine work in my former preferred tool, Déjà Vu. The working environment of memoQ is more ergonomic for me, and although I still miss a number of very useful features in DVX, on balance, the features I gained in memoQ allow me to do many more things better (or even at all). Nonetheless, this time Atril had the clear advantage.

I translated the main text of the Notes in memoQ, making full use of my translation memories, glossaries and QA settings there. I enjoyed the previews of the embedded Excel documents, which gave me necessary context for some of my work, but the actual content of those tables was untouchable in memoQ. Then I exported the translation, which was an English document with embedded tables in German.

This compound document was then imported to DVX2 together with my TM. I copied the source to target, locked all the English content (it was helpful that the content extracted from the Excel tables was at the end of the translation scroll) and pretranslated what remained from the TM. Less than an hour later I exported the completely finished translation - and saved a lot of fiddly work exporting and importing those stupid tables like I had to do before. I really do hope that memoQ's filters for MS Office documents will be updated to handle embedded objects soon - it's not uncommon that I have Excel, Visio or PowerPoint objects stuck in my Word documents.

After delivering the text, I then turned to the next task: exporting my terminology. Once again, interoperability came to my rescue here. This customer places a lot of importance on the correct use of IFRS and their own terminology. One of the ways we coordinate this is to exchange glossary information in a format that this customer, who doesn't know a CAT tool from a Persian feline, can cope with. A nicely formatted DOCX or PDF dictionary does the trick. But I can't do that with memoQ.

I've been advocating the addition of XSL script selection to memoQ's XML term export for some time now. My own efforts to create good scripts for my purposes are hampered by the fact that I haven't done much programming for a decade now and I've lost most of my skills. So until I sort that problem out, I take the terms in XML from memoQ and import them to SDL Trados MultiTerm. MultiTerm is unique among the terminology tools on the low end of the market in that it has always offered some useful export format templates (which can be adapted) for re-use of the term information in other environments. Formatted RTF dictionaries like the one shown here as a thumbnail, web pages, custom text exports... the sky's the limit if you can deal with the odd configuration options and unexpected crashes. Having traversed that minefield often enough in the past decade, I can usually produce something good-looking from my memoQ terminology with SDL Trados MultiTerm without much ado. And my clients like it a lot more than an ugly CSV export.

So why didn't I just use Déjà Vu or Trados in the first place? Re-read the text above. None of the three CAT tools I use was capable of doing everything I required as efficiently as I needed it done. DVX2 came the closest, but the lack of a preview, the primitive way that tags (codes) are still managed and the lack of comfort I feel translating in that environment (I'm much slower now) made it a poor option for the bulk of the work. But working in carefully planned concert, these three tools produced excellent results, made my client happy and made me happy by saving the rest of my day with an early delivery.

Jun 16, 2012

memoQuickie: footnote, cross-reference & index entry segmentation in Microsoft Word files

If you have a Microsoft Word DOC file or RTF to translate, it is important to be aware of the different behaviors of the memoQ import filter options you can use. If there are footnotes, cross-references or index entries, it is far better to use the option to import the DOC or RTF file as DOCX.

The DOC file shown below has a footnote, a cross-reference and an index entry:

Adding it to a memoQ project with the default filter for Microsoft Word in memoQ 5

gives the following segmentation result:

Importing the same document with the DOCX option of the filter

yields much cleaner segmentation and better tags to work with:

Compare what some other programs do with this file:

WordFast Pro

DVX2 (DOC)

DVX2 (DOCX)

TagEditor salad (partial)

SDL Trados Studio 2009 segmentation

SDL Trados Studio 2011

There is room for improvement with most tools.

Apr 26, 2012

Twitterview: SDL Trados Studio, memoQ, DVX2 and PDF extraction

When I began using Twitter somewhat hesitantly three years ago, I never expected that it would eventually prove to be one of the most useful social media tools for gathering information of professional value. Much of this is serendipitous; I really never know what will come floating down the twitstream or where some of the conversations in it will go. Like the direct chat I had with a colleague in New Zealand about features she liked best in the two main CAT tools she uses, SDL Trados Studio and memoQ.

We both really appreciate the TM-driven segmentation in memoQ and the superior leverage this offers. But to my surprise, she expressed a preference for SDL Trados Studio, particularly for the quality of its PDF text extractions from electronically generated files. This is not a feature I make heavy use of in either tool, though I have used it more often lately in memoQ for alignments in the LiveDocs module and found it generally satisfactory. Most of my work involving PDF files is with scanned documents - there one has no choice but to use a good OCR tool like OmniPage or ABBYY FineReader.

So I was quite intrigued by the claim that the quality of PDF text extraction in a CAT tool was "better" than that from standalone tools, especially because my experience is quite different. Further discussion (not shown in the graphic) revealed that what she actually meant was that the quality of the text extraction with the CAT tool usually beat the quality of text received from translation agencies that performed conversions. That is easy to explain, really. In my experience, most agencies are clueless about how to use conversion tools and too often use automated settings and save the results "with layout". This is very often utterly unsuited for work with translation environment tools or requires a lot of cleanup and code zapping.

For years I have recommended to agencies and colleagues that they spare themselves a lot of headaches by saving PDF conversions as plain text and adding any desired formatting later. Most people ignore that advice and suffer accordingly. So in a way, a CAT tool that does so encourages "best practice" for PDF translation for those files they are actually able to handle.

Encouraged by the Twitter exchange, I decided to do a few tests with files from recent projects. I took a PDF I had with various IFRS-related texts from EU publications. It appeared to extract quickly and cleanly in memoQ, giving me a translation grid full of nicely segmented text. SDL Trados Studio 2009 choked badly on it and extracted nothing. Her extraction in SDL Trados Studio 2011 caused a timeout with the project, I was told, but the text itself was completely extracted and converted to DOCX format. This is useful, because unlike the extraction to plain text in memoQ, this offers the possibility to add or change some text formatting in the translation grid. Other extraction examples from SDL Trados Studio 2011 showed that text formatting was preserved.

A closer examination of the extracted texts revealed some problems with both the memoQ and Trados Studio extractions. The memoQ 5 PDF text extraction engine proved incapable of handling text in multiple columns properly. The paragraph order was all fouled up. The extraction with SDL Trados Studio had a great number of superfluous spaces. Whether it is possible to optimize this in the settings somehow I do not know. The results of all the extraction tests are downloadable here in a 6 MB ZIP file. I've included the SDL Trados Studio extraction saved to plain text as well for a better comparison of the text order and surplus spaces problems.
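For anyone who wants to tidy the plain-text version of such an extraction before comparing it, here is a minimal Python sketch. It assumes the surplus spaces appear as runs of two or more spaces or as trailing spaces at line ends; I have not checked whether that covers every case in the test files, and the function name is just my own invention for illustration.

```python
import re


def collapse_spaces(text):
    """Reduce runs of spaces to a single space and trim each line.

    Assumption: the superfluous spaces are runs of ordinary spaces.
    Line breaks are left alone so the paragraph order is unchanged.
    """
    cleaned_lines = []
    for line in text.splitlines():
        cleaned_lines.append(re.sub(r" {2,}", " ", line).strip())
    return "\n".join(cleaned_lines)


if __name__ == "__main__":
    sample = "IFRS  9   Financial  Instruments \nnext line"
    print(collapse_spaces(sample))
```

This does nothing about spaces inserted in the middle of words or before punctuation; those would need rules of their own.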

Overall, I am personally not very pleased with the results of the text extractions from PDF in either tool. The results from SDL Trados Studio are clearly better, and other examples that were shared made it clear that this tool works better than many an untrained PM with better PDF conversion software. This is certainly much better than solutions I see many translators using. But really, nothing beats good OCR software, an understanding of how to use it well and a proper workflow to get a good TM and target file better fit for most purposes.

*****

Update 2012-05-22: I met colleague Victor Dewsbery at a recent gathering in Berlin, and he told me about his tests with the recently introduced PDF import feature of Atril's Déjà Vu X2 translation environment. He kindly offered to share his results (available for download here) and wrote:

Here is the result of the PDF>DVX2>RTF>ZIP process for your monster EU PDF file. Comments on the process and the result:

The steps involved were: 1. import the file into DVX2 as a PDF file; 2. mark all segments and copy source to target; 3. export the file as if it were a translated file (it comes out as an RTF file). The RTF file is 20 MB in size and zips to 3 MB.

Steps 1 and 3 took a long time, and DVX2 claimed to be not responding. For step 1 I just left it and it eventually came up with the goods. Step 3 exported the RTF file perfectly, even though DVX2 claimed that the export had not finished. I was able to open the RTF file (it was locked, but I simply renamed it), and this is the version which I enclose. Half an hour later DVX2 had still not ended the export process (and had to be closed via the Task Manager), although the exported file was in fact perfectly OK. The procedure worked more smoothly with a couple of smaller PDF files. Atril is working on streamlining the process and ironing out the glitches in the process, especially the “not responding” messages.

The result actually looks very good to me. There are hardly any codes in the DVX2 project file (the import routine also integrates CodeZapper). I didn’t spot any mistakes in the sequence of the text. Indented sections with numbering seem to be formatted properly - i.e. with tabs and without any multiple spaces.

The top and bottom page boundaries in the exported file are too wide, so most pages run over and the document has over 900 pages instead of just under 500. Marking the whole document and dragging the header/footer spaces in Word seems to fix this fairly quickly.

I note that some headlines are made up of individual letters with spaces between them. This may be related to the German habit of using letter spacing (“Sperrschrift”) for emphasis as an alternative to bold type.

I found one instance where text was chopped up into a table on page 857 of the file.

There are occasional arbitrary jumps in type size and right/left page boundaries between sections.

On the strength of this sample, it would usually be OK to simply import the PDF file into DVX2, translate in the normal way, and then fix any formatting problems in the exported file.

Dec 27, 2011

SDL Trados Studio: Translating memoQ bilingual RTF files

Some time ago, I noted that SDL Trados Studio experiences difficulties importing XLIFF files in which the sublanguages are not exactly specified if the default languages are not set to the same major language. So if you plan to translate an XLIFF from memoQ or another tool in SDL Trados Studio, it is necessary to ask the one generating the file to specify the sublanguages or, if that is not practical, use the workaround described here. I discovered this bug before the release of the 2011 version of Studio and spoke to SDL development and management staff specifically about this at the TM Europe conference in Warsaw, but apparently this is not a priority to fix compared to other issues, and it may be a while before SDL Trados Studio users can work with client XLIFF files without coping with this headache.

Several of my client agencies using memoQ for project management have quite a number of freelance translators using various Trados versions and who have no intention to stop doing so. It's important to work smoothly with these resources in a compatible way, which also protects the data and formats. In a recent article on processing memoQ content with Trados TagEditor, I published a procedure I developed which enables the memoQ tags in the text of the bilingual RTF table export to be protected as tags when working in SDL Trados TagEditor. Now I would like to present a similar approach for Trados Studio users, which can serve as an alternative to XLIFF exchange.

If the bilingual RTF table is created in memoQ specifying the mqInternal style for tags

this style setting can be specified as non-translatable in SDL Trados Studio. To do this, select the menu choice Tools > Options, and in the dialog which appears under File Types, add the mqInternal style to the list of styles to be converted to internal tags in the appropriate formats (RTF, and just in case the file gets re-saved as a Microsoft Word document, for Microsoft Word 2000-2003 and Microsoft Word 2007-2010 as well):

SDL Trados Studio dialog for setting RTF, DOC and DOCX styles as "non-translatable" (converting to tags)

Once the mqInternal style has been entered this way in SDL Trados Studio, the prepared bilingual RTF file can be imported. "Preparation" for import includes copying the source text to the target and hiding all the text you do not intend to translate (the file header, the source column, and the comments and status columns if present). The result will look something like this:

The prepared memoQ bilingual RTF file imported to SDL Trados Studio. Note that the bold and italic type are displayed normally as in memoQ, which offers the translator greater working ease.

Please note that the same procedure described for working with these files in TagEditor (hiding the red text of the tags, see the TagEditor article for details) also works for SDL Trados Studio, but this method involving the mqInternal style saves a few steps.

Clean up the tag mess with CodeZapper for all CAT tools

Readers of this blog probably know by now that I am a Dave Turner fan. His CodeZapper macros have probably saved me hundreds of hours of wasted time over the years (not an exaggeration), and I think there are a lot of other translators and project managers with similar experiences. It doesn't solve every problem with superfluous tags, but it solves a lot of them, and Mr. Turner works steadily at improving the tool. I blogged the release of the latest version not long ago; it is now available directly from him for a modest fee of 20 euros (see the link to the release announcement for a contact link). That means it pays for itself in far less than an hour of saved time.

Over the past few days I have been updating some training documentation and running a lot of tests on tagged files as part of this. During this work, I have been struck time and again by the differences in the tags "found" by different tools working with the same file. Sometimes one tool looks better than another, but the patterns are not always consistent. What is most consistent is the ability of CodeZapper to clean up the files in various versions of Microsoft Word and make the tag structures appear a little more uniform.

Here's an example of the same DOCX file "unzapped" in several tools:

Import into memoQ 5, as-is, no tag clean-up. Previous versions of the same file showed more tags in places.

SDL Trados Studio 2009 before tag clean-up.

TagEditor in SDL Trados 2007 before tag clean-up

Initially, OmegaT would not import that particular DOCX without a tag cleanup. I reported the problem to the developers, who upgraded the filter to handle a previously unfamiliar character in internal paths of the ZIP file (DOCX is actually just a renamed ZIP package like many other file types). See http://tech.groups.yahoo.com/group/OmegaT/message/23931 for information on the new release. Opening, editing and re-saving the troublesome file enabled it to be imported after all without the latest version bugfix. So users should keep that trick in mind if a similar problem is encountered. I've had to take similar action in the past with other tools, so this is probably a good general tip regardless of what tool you use. When I downloaded and tested the latest standard release of OmegaT (2.3.0_4), the tag structure looked fine - no zapping of the DOCX was necessary in this case.
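Because a DOCX is just a ZIP package, its internal paths can be inspected with a few lines of Python if an import fails mysteriously. The sketch below is only an illustration: I don't state above which character tripped up the OmegaT filter, so flagging anything outside a conservative set of "safe" characters is my own assumption, and the function names and demo package are hypothetical.

```python
import io
import zipfile

# Characters treated as unremarkable in internal paths (an assumption, not a rule
# taken from any tool's documentation).
SAFE_CHARS = set(
    "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_-./[]"
)


def list_docx_parts(docx_bytes):
    """Return the internal paths of a DOCX package, which is a ZIP archive."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as package:
        return package.namelist()


def unusual_parts(names):
    """Return the internal paths containing characters outside SAFE_CHARS."""
    return [name for name in names if any(ch not in SAFE_CHARS for ch in name)]


def make_demo_package():
    """Build a tiny stand-in package in memory so the sketch runs without a real file."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("[Content_Types].xml", "<Types/>")
        package.writestr("word/document.xml", "<w:document/>")
        package.writestr("word/media/image 1.png", b"")
    return buffer.getvalue()


if __name__ == "__main__":
    parts = list_docx_parts(make_demo_package())
    print(parts)
    print(unusual_parts(parts))
```

For a real file, read its bytes with open(path, "rb").read() and pass them to list_docx_parts.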

After treatment with CodeZapper, the file looked the same in memoQ (where the extra tags weren't present in the first place, though one can't count on things always being this way). The view in Trados Studio and TagEditor improved significantly, though there were still more tags, and OmegaT accepted the DOCX after tag cleaning.

SDL Trados Studio 2009 import of the DOCX file after tag cleanup with CodeZapper

SDL Trados 2007 TagEditor import of the DOCX file after tag cleanup with CodeZapper

OmegaT import of the DOCX file after tag cleanup with CodeZapper (OmegaT 2.3.0_3)

It is important to consider that superfluous tags mean wasted work time with formatting and QA corrections, perhaps even a higher risk of file failure (such as the inability to import the file at all into one tool). This is why for some time now, I and others have advocated modifying the costing of volume-based translation work to include the number of tags. This requires, of course, that you have access to a counting tool which reports the number of tags (SDL Trados Studio does this, Atril's Déjà Vu has long offered this feature, and memoQ even allows you to assign a word or character "weight" for counting purposes). This is the only fair way I know of to account for the extra work (besides time-based charges). Consider that everyone is affected: translators, reviewers and project managers! I've had to talk more than one of the last group through "tag rescue" techniques after hours.
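To make the idea of tag-weighted costing concrete, here is a small Python sketch of the arithmetic. The weight, the rate and the counts are hypothetical figures chosen for illustration, and this is not a description of how memoQ or any other tool calculates its statistics internally.

```python
def weighted_volume(word_count, tag_count, tag_weight=0.25):
    """Word count plus tags, each tag counted as a fraction (or multiple) of a word.

    tag_weight is a hypothetical default; agree on the real value with the client.
    """
    return word_count + tag_count * tag_weight


def price(word_count, tag_count, rate_per_word, tag_weight=0.25):
    """Price for a job billed by weighted volume, rounded to two decimals."""
    return round(weighted_volume(word_count, tag_count, tag_weight) * rate_per_word, 2)


if __name__ == "__main__":
    # 1000 words and 200 tags, each tag counted as half a word, at 0.15 per word
    print(weighted_volume(1000, 200, tag_weight=0.5))
    print(price(1000, 200, rate_per_word=0.15, tag_weight=0.5))
```

With these example figures, a file carrying 200 tags is billed as if it contained 100 more words than the same text without them.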

Perhaps it is worth considering as well that cleaner tagging will also improve "leverage" (match quality) in translation memories. So if a tool does offer cleaner tag structures (for a variety of source formats) consistently, working with that tool efficiently to manage projects will save time and money as well on top of the time and money saved with the use of CodeZapper macros in MS Word files.

Dec 26, 2011

OmegaT: Best practice for translating content from memoQ

OmegaT is popular in some circles because it is Java-based and thus cross-platform, and it is free. Although rather limited in many respects compared with full-featured commercial tools such as SDL Trados Studio or memoQ, this Open Source tool can handle quite a number of formats well, offers interoperability pathways with the leading commercial tools and there are a good number of excellent professional translators who are satisfied with its features. Thus outsourcers using memoQ should understand the best procedures to follow if working with translators using OmegaT in order to avoid difficulties.

In the past, I have recommended using the bilingual XLIFF exports from memoQ for compatibility with OmegaT. In theory, it's a nice approach, but I am encountering difficulties with memoQ-generated XLIFF files (possibly a Kilgray problem or a problem specific to my installation, not one having to do with OmegaT, which handled XLIFF from other sources properly in my tests). So for now I would say that a workflow involving memoQ's bilingual RTF tables is the best approach. Do the following to prepare the content for the translator:

Create a bilingual RTF table export from memoQ of the content to be translated. Use the "mqInternal" option for tags in order to change their color and facilitate proofreading of the final result.
Copy the source content cells into an empty DOCX or ODT file. OmegaT cannot read RTF and requires one of these two formats to be used in this case. The translator will be able to read these directly and translate.
Other resources such as TMs and glossaries:

OmegaT uses TMX for its translation memory. If you have a TM, provide it to the translator in this format.

The OmegaT glossary format is tab-delimited text with one entry per line:
source term [tab] target term [tab] additional information
Provide terminology to the translator in this format if possible.
OmegaT is also capable of reading TBX, the industry standard for glossary files.
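If the terminology is available as a simple list, producing the tab-delimited glossary text takes only a few lines of Python. This is a minimal sketch under my own assumptions: the function name and sample terms are invented for illustration, and tabs or line breaks inside a field are flattened to spaces so they cannot break the three-column layout.

```python
def format_glossary(entries):
    """Turn (source, target, note) tuples into tab-delimited glossary text.

    Whitespace inside each field is normalised to single spaces; an empty
    note leaves the line with just the source and target terms.
    """
    lines = []
    for source, target, note in entries:
        fields = [" ".join(field.split()) for field in (source, target, note)]
        lines.append("\t".join(fields).rstrip("\t"))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    terms = [
        ("Bilanz", "balance sheet", "IFRS context"),
        ("Anhang", "notes", ""),
    ]
    print(format_glossary(terms), end="")
```

Save the result as a text file in an encoding the translator's OmegaT installation can read; check the OmegaT documentation for the expected file extension and encoding.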

The table cell content from the prepared file will look something like this in OmegaT:

Note that the memoQ tags are surrounded by additional OmegaT tags. Since OmegaT does not actually protect tags in its working environment, it is important that the translator verify the tags and proofread carefully, checking that all tags are present and applied correctly.

Once the translation is ready in the target DOCX or ODT file, open it in Microsoft Word, copy the translated table cells and paste into the target column of the bilingual RTF file, add any comments necessary to the Comments column of the table (if present). After the bilingual RTF is re-imported to memoQ, run a QA check to verify the tags again. After that the work can be proofread for content in memoQ or a bilingual export of an appropriate kind and the target file generated and delivered afterward.
