
May 28, 2022

Filtering formatted text in Microsoft Office files

 Recently, I shared an approach to selecting text in a Microsoft Word file with editing restricted to certain paragraphs. This feature of Microsoft Word is, alas, not supported by any translation tool filters of which I am aware, so to import only the text designated for editing it is necessary to go inside the DOCX file (which is just a ZIP archive with the extension changed) and use the XML file which contains the document text with all its format markers.

This approach is generally valid for all formats applied to Microsoft Office files since Office 2007, such as DOCX from Word or PPTX from PowerPoint. I have prepared a video to show how the process of extracting the content and importing it for translation can work:

 After translation, the relevant XML file is exported and the original XML is replaced with the translated file inside the archive. If the DOCX or PPTX file was unpacked to get at the XML, the folder structure can then be re-zipped and the extension changed to its original form to create the deliverable translated file.
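
For those who prefer to script the unpack/repack cycle described above, here is a minimal Python sketch. It assumes the standard OOXML layout, in which the main document part of a DOCX is word/document.xml (for PPTX, the slide XML lives under ppt/slides/ instead); the file names in the usage comments are hypothetical.

    import pathlib
    import zipfile

    def extract_document_xml(docx_path, xml_out):
        """Pull the main document part out of a DOCX for filtering and translation."""
        with zipfile.ZipFile(docx_path) as zf:
            pathlib.Path(xml_out).write_bytes(zf.read("word/document.xml"))

    def replace_document_xml(docx_path, translated_xml, out_path):
        """Rebuild the archive with the translated XML swapped in."""
        with zipfile.ZipFile(docx_path) as zin, \
             zipfile.ZipFile(out_path, "w", zipfile.ZIP_DEFLATED) as zout:
            for item in zin.infolist():
                if item.filename == "word/document.xml":
                    zout.writestr(item, pathlib.Path(translated_xml).read_bytes())
                else:
                    zout.writestr(item, zin.read(item.filename))

    # extract_document_xml("report.docx", "document.xml")
    # ... translate document.xml in the CAT tool ...
    # replace_document_xml("report.docx", "document_translated.xml", "report_translated.docx")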

What I do not show in the video is that the content can also be extracted by other means, such as convenient memoQ project templates using filters with masks to extract directly via various ZIP filter options. But the lower-tech approach shown in the video should be accessible to any professional with access to modern translation environment tools which permit filter customization with regular expressions.

Once a filter has been created for a particular format such as red text, adapting it to extract only green highlighted text or text in italics or some other format takes less than a minute in an editor. Different filters are necessary for the same formats in DOCX and PPTX, because unfortunately Microsoft's markup for yellow highlighting, for example, differs between Word and PowerPoint in the versions I tested.
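
The reason lies in the underlying markup. Here is a simplified Python sketch of what such a filter rule does for red text in DOCX; the pattern and sample are illustrative only (real files vary in namespaces and attribute order) and are not the actual memoQ filter rules:

    import re

    # Simplified WordprocessingML: extract red text from a DOCX run.
    DOCX_RED = re.compile(
        r'<w:r>(?:(?!</w:r>).)*?<w:color w:val="FF0000"\s*/>'
        r'(?:(?!</w:r>).)*?<w:t[^>]*>(.*?)</w:t>',
        re.S)

    sample = '<w:r><w:rPr><w:color w:val="FF0000"/></w:rPr><w:t>translate me</w:t></w:r>'
    print(DOCX_RED.findall(sample))  # ['translate me']

    # PPTX encodes the same red text with DrawingML instead, e.g.
    # <a:solidFill><a:srgbClr val="FF0000"/></a:solidFill> inside <a:rPr>,
    # which is why a separate rule set is needed for each format.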

Although this is a bit of a nerdy hack, it's probably easier for most people than various macro solutions to hide and unhide text. And it takes far less time and is more accurate than copying text to another file.

In cases where it is important to see the original context of the text being translated, memoQ's PDF Preview Tool can help: this viewer, available in recent versions, tracks the imported text in a PDF made from the original file. Such a PDF can be created with the PDF save options available in Microsoft applications.


Oct 27, 2015

Beware the document Reimport trap in memoQ!


In between sneezes and hot shots of gingered lime tea I saw the Skype icon on my Windows task bar change to indicate a message. A distress call from a financial translator friend who had just received a new version of the Q3 report she was translating. memoQ has excellent version management features, including a document-based pretranslation (X-Translate) which uses a current or previous version of a translation to identify unchanged, already-translated sections when the client sends a new version. This avoids potential confusion with undesired matches coming out of any of the many translation memories or LiveDocs corpora which might be attached to a project.

This time, however, memoQ seemed to be getting weird on her, with error messages referring to ZIP archives and password protection. Her customer's file was not password protected, and as far as she knew, there was no ZIP archive anywhere in sight. She was dealing with "ordinary Word files". I have no idea what those are, but I hear about them often enough, and that is often where the trouble starts.

Last July I was teaching a week-long introductory memoQ course in Lisbon, and when I wanted to show the course participants how this X-Translate feature worked, everyone ran into unexpected problems. When the feature was first introduced in memoQ, I noticed that the updates would work with any format. A translation which starts out as a script in a word processing file might later be updated as a set of presentation slides, and memoQ's document-based pretranslation did an excellent job of enabling me to focus quickly on the new material. It still does, but since those early days, some advocate of unintelligent programming decided that the filter used by the Reimport function to bring in the updated source text should assume the source format is unchanged from the previous version rather than simply offer an appropriate filter for the current format. If this assumption is not correct, one must specify the filter to be used for the updated version (as I also explained in my book New Beginnings with memoQ shortly after noticing this).

I can probably guess why this was done. With certain filters, the filter to use is not obvious from the extension (the multilingual delimited text filter, for example, if it is needed), or there may be a custom configuration of an "obvious" filter needed. In these cases, the assumption of using the last filter settings makes a lot of sense. However, if there is a change of format, where it is clear that the new filter should not apply, then some action should be taken other than a virtual assault on the user with mysterious error messages.

In the case of my financial translator friend, the update came as a DOC file, where the original had been DOCX. Geeks who have nothing better to learn with their time might know that DOCX files are actually renamed ZIP files, so at least the confusing error message above was "truthful" in a sense.

I see this sort of "switch hitting" with Microsoft Word file formats of various generations, or changes from RTF to DOC or DOCX, rather often. But when importing new document versions, these changes mean trouble for memoQ if the user does not notice the difference. And given that the majority of working translators I have encountered who use Windows never change the default system setting which hides the extensions of known file types, the chances that your average mortal wordworker will figure out this problem are just about zilch.

Armed with new insight into the problem, my friend was able to import the new document version successfully by specifying the appropriate filter manually and then use X-Translate to get her previous translation applied to sections of source text which had not changed (so that inappropriate 100% matches from a TM or LiveDocs corpus could be avoided). But for the future, I hope that Kilgray will apply a little more intelligent logic to the selection of filters for the document Reimport function of memoQ.

Jan 23, 2014

memoQ auto-translation regex blues

Can you write a regular expression that matches each of the three character sequences marked in green below?
abcdefg
abcde
abc
That's how one interactive tutorial I found starts off teaching regular expressions. I found that a bit puzzling, then noticed that the right side of the web page offered a list of notes on "regex" expressions. I've spent too much time in the last two weeks trying to sort out a variety of memoQ auto-translation rules with regular expressions, so the answer came quickly and I typed it in the text field. Then I looked again and realized there were a few alternatives that would work. So I tested them. And then a few more. Various expressions that would work include
[a-z]+
[a-g]+
[\w]*
[abcdefg]*
.+
\D+
and quite a few others. But which is correct? As with word choice in translation, that depends on context, and that's where it gets hard.
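
All of these can be verified in a few lines; a minimal Python check using fullmatch, which requires the whole string to match:

    import re

    patterns = [r"[a-z]+", r"[a-g]+", r"[\w]*", r"[abcdefg]*", r".+", r"\D+"]
    targets = ["abcdefg", "abcde", "abc"]

    # Every pattern must match each target string in its entirety.
    for p in patterns:
        assert all(re.fullmatch(p, t) for t in targets), p
    print("all six patterns match all three strings")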

Those who have delved into the configurations of various translation environment tools such as SDL Trados Studio or memoQ have seen that regular expressions are used in different ways to identify patterns in information and then filter or transform that information based on those patterns.

Although there is great power in regular expressions, I see their current role in memoQ configuration as more of a liability as far as the average user is concerned. Regular expressions are currently part of memoQ configuration for at least two import/conversion filters (a text filter and a tagger), segmentation rules and "auto-translation". I find the last feature particularly troublesome in its current form.

There are many misunderstandings about what auto-translation is in memoQ. It has nothing whatsoever to do with machine translation, which some propagandists prefer to call "automatic translation" to gloss over the many difficulties it can cause. Nor is it part of the pre-translation feature, though it can be applied in pre-translation to deal with things like catalog numbers, dates and figures in tables rather efficiently.

Auto-translation in memoQ is used to convert certain patterned information into the format needed in translation. In monolingual editing projects, it can also be used to unify the formatting used for things like dates and currency expressions.

memoQ ships with a number of standard rule sets for number conversions. The "English group" that I use consists of ten different rules for converting most of the screwy number formats I encounter, with decimal separators (periods or commas) and digit groups of three orders of magnitude (thousands, millions, billions, etc.) that might be separated by spaces, apostrophes, periods or something else. The rule for converting hundreds of millions with decimal fractions to my usual preference is:

(?<!(,|\.|\d|\d\s|\d'|\d’))([-|\u2212]?[\d]{1,3})(?:\.|,|\s|'|’)(\d\d\d)(?:\.|,|\s|'|’)(\d\d\d)(?:\.|,)([\d]{1,2}|[\d]{4,})(?!(,\d|\.\d|\d|\s\d|'\d|’\d))

with the replacement rule $2,$3,$4.$5

Pretty damned intimidating for most of us. In fact, most of the people who grasp the basics of regex syntax will scratch their heads over the number assignments of the groups until they realize that the non-capturing groups in parentheses (like (?:\.|,)) don't count toward the numbering. Nobody points that out in any tutorial I've read.
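
The group numbering can be seen by running the rule outside memoQ. A sketch using Python's third-party regex module (memoQ's .NET engine allows the variable-length lookbehind here, which Python's built-in re rejects; memoQ's $2 replacement syntax becomes \2):

    import regex  # third-party module; "pip install regex"

    # The rule quoted above, unchanged. Group 1 sits inside the lookbehind
    # and group 6 inside the lookahead; the (?:...) groups capture nothing,
    # so the visible number parts are groups 2-5.
    pattern = (r"(?<!(,|\.|\d|\d\s|\d'|\d’))([-|\u2212]?[\d]{1,3})(?:\.|,|\s|'|’)"
               r"(\d\d\d)(?:\.|,|\s|'|’)(\d\d\d)(?:\.|,)"
               r"([\d]{1,2}|[\d]{4,})(?!(,\d|\.\d|\d|\s\d|'\d|’\d))")

    print(regex.sub(pattern, r"\2,\3,\4.\5", "Umsatz: 123.456.789,12 EUR"))
    # -> Umsatz: 123,456,789.12 EUR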

If I suggested to most of my esteemed colleagues that they really need to learn this stuff, I think most juries in the civilized world would refuse to convict them of murdering me for the mental cruelty inflicted. There are brilliant translation technology consultants like Marek Pawelec, who eat stuff like this with their breakfast cereal and are a priceless resource to colleagues and corporate clients who need their expertise... and then there are the rest of us.

Kilgray CEO István Lengyel told me recently that there are plans to expand the examples shipped with memoQ later this year to include some reformatting for certain date structures and other information. Language Terminal has a few examples of useful conversion rules for dates, unusual number formats, e-mail addresses and more. You don't need to know regex to use these, just how to download the MQRES files and click Import in one of the memoQ modules for managing auto-translation rules to bring these into your set-up. Once there they can also be used in QA checks.

I think it would be nice if Kilgray or someone more expert than yours truly would produce a "recipe book" with clear documentation of examples so that users of more limited skill (like me) can adapt these examples to their specific needs. Some typical uses I see (some of which have eaten my evenings recently) are
  • stripping or adding spaces from numbers with percent signs, such as 37,5 % <> 37.5%
  • formatting other numbers with units according to a preferred convention, such as 1,2A >> 1.2 A
  • currency expression reformatting like
          TEUR 1.350 >> EUR 1,350 thousand
          34.664,45 €  >> €34,664.45
          € 1,2 Mrd.  >> €1.2 billion
          & cetera
  • Legal references such as
          § 15 Abs. 1 Nr. 3 GebrMG
          § 15(1) Nr. 3 GebrMG
          § 9 S. 2 Nr. 1 PatG
  • page designations and other elements often found in bibliographies and easily overlooked (for QA purposes)
  • conversion of dates like 23.05.67 to
          23 May 1967
          May 23, 1967
          1967-05-23
              or whatever other format one prefers, with or without non-breaking spaces
  • Lovely EU legislation designations like 93/42/EWG >> 93/42/EEC
  • Telephone number reformatting such as (0211) 45 66 - 500 >> +49 211 4566-500
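To make the first item in that list concrete, here is a hypothetical rule of my own (not one of the sets shipped with memoQ) for the percent-sign case, sketched in Python; in memoQ the replacement would be written $1.$2% instead:

    import re

    # German "37,5 %" -> English "37.5%": swap the decimal separator and
    # strip the space before the percent sign.
    rule = re.compile(r"(\d+),(\d+)\s?%")
    print(rule.sub(r"\1.\2%", "Der Anteil stieg auf 37,5 %."))
    # -> Der Anteil stieg auf 37.5%.
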
Some might wonder about those date conversions with their two-digit years. As a veteran of the Y2K scam, I'm fond of two-digit years; they were part of my ticket over the Atlantic. The implementation of regular expressions in memoQ allows for the use of custom reference lists, which are delimited by hash symbols (#) in the expressions. So when I created my rules for converting those two-digit years in dates
(\d{1,2})\s?\.\s?(#month-num-to-text#)\s?\.\s?(#21st-Century-2digit#)  and
(\d{1,2})\s?\.\s?(#month-num-to-text#)\s?\.\s?(\d{2})

with the replacement rules $1 $2 20$3 and $1 $2 19$3 respectively,
I used two custom lists, one with translation pairs like 01 = Jan. and the other with the two-digit years I felt like assigning to the 21st century (i.e. the ones I am most likely to encounter, like 00 through 19; I'll have to adjust this in some cases).
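
In spirit, those hash-delimited lists behave like lookup tables. A rough Python emulation with excerpted, hypothetical list contents (memoQ expresses this as two ordered rules, collapsed here into one conditional):

    import re

    month = {"01": "Jan.", "02": "Feb.", "05": "May", "12": "Dec."}  # excerpt of #month-num-to-text#
    century21 = {f"{y:02d}" for y in range(20)}  # excerpt of #21st-Century-2digit#: "00" through "19"

    def convert(date_text):
        d, m, y = re.fullmatch(r"(\d{1,2})\s?\.\s?(\d{2})\s?\.\s?(\d{2})", date_text).groups()
        century = "20" if y in century21 else "19"
        return f"{d} {month[m]} {century}{y}"

    print(convert("23.05.67"))  # 23 May 1967
    print(convert("23.05.05"))  # 23 May 2005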

In the case of my two-digit year conversion rules, the rule order is important. The conversion will not work as planned if the rules appear in reverse order. THIS IS A MAJOR PROBLEM with the current rules editor for memoQ auto-translatables. Each time a rule is edited, it goes to the bottom of the list. I'm currently working on a complex set of about 20 rules for converting financial expressions, and rule order is critical for several subgroups of rules (this was disputed after I originally made this post and I backed down... however, subsequent tests have proven that rule order is indeed critical!!!). So editing them in memoQ is a nightmare. (One expert told me how he generates a basic rule set in memoQ, exports it and does all further rule editing in Notepad++, which also allows him to keep better track of his work with comments.) Some current problems with the edit dialog for auto-translation rules in memoQ are
  • the need for "order stability", so that rules being edited keep their place in the list (for a better overview), and for arrow buttons to move rules up and down easily for better grouping
  • insufficient field width/height and bad scrolling behavior, so that it is very difficult to edit long expressions; I usually have to paste them into Notepad to keep an overview while I work
  • strange, severe bugs in the test window, so the rule results shown are sometimes not accurate; I deal with this by adding a test document with sample data to the project and looking at what the rules do with that text
  • helpful <!-- comments inserted in the rules --> to explain them disappear when MQRES files are imported, and there is no way to maintain explanatory comments to keep track of one's own work in the editor (except for the pitifully limited comment box in the resource properties dialog)
A few people I've mentioned auto-translatables to have said they would have no use for them, because they use dictation software. I dictate myself, but I find that in many cases (like those legal references) it is nicer to have a rule configured to client preferences, so I can use a single keystroke to insert
Section 55(2) No. 3 Sentence 2
Section 55 Paragraph 2 No. 3 Sentence 2  or
Sect. 55(2) No. 3 S. 2
as the job calls for. And run a QA check to confirm that I have formatted consistently:


Dominique Pivard posted a nice video about memoQ auto-translatables on his "vlog" a while ago. It's worth a look if you want to see how to create these resources and see a good demonstration of how they work.

Oct 21, 2012

Put OCR in Your Business Model

This article originally appeared on an online translators portal four years ago and was long overdue for removal there. Here is an update.

*****
Optical character recognition (OCR) software is discussed often online and at translators' events, usually in the context of how to deal with PDF files. Hector Calabia, Peter Linton and others have made useful technical contributions on this subject in articles and forums and at various conferences. However, it is useful to consider OCR software in a broader translation business context. Document conversion is often very useful for translation purposes and greatly facilitates automated quality checks of the draft, for example, but OCR can also generate additional income for your business and reduce quotation risk.
OCR for translation
There are a number of programs available for this purpose, and which one is best for your purposes may depend on the language combinations you deal with and other factors. For years now I have used Abbyy FineReader, because years ago it gave the best test results for the particular set of European languages one of our clients offered. It is also relatively inexpensive (I paid about 100 euros for FineReader 11) and easy to use.

Many OCR conversions of TIFF, JPEG and PDF documents which I receive from agencies are difficult to use for translation purposes and require significant modification - if they can be used at all. Problems are particularly likely where TM tools are to be used or where target texts differ significantly in length from the source (especially when they are longer). The best ways to avoid these problems are
  • avoid automatic settings for OCR conversions; use zone definitions instead
  • avoid saving the converted texts with full formatting in most cases
  • use a suitable post-OCR workflow to clean up the converted document by joining broken sentences, removing superfluous characters, fixing conversion errors, etc.
If the idea of doing individual zone definitions on each page of a 100-page document is intimidating, take heart. Programs such as Abbyy FineReader often allow you to define layout templates, speeding up the work considerably. One translator I know became so skilled at the use of these OCR templates and so good with his conversions that agencies hire him just to do high-quality OCR work for them. Which brings me to….

OCR as an income-generating activity for the translator or agency
Hardcopy, scanned documents, faxes and PDF documents generally mean more work for translators than electronically editable documents and require different, sometimes more fallible quality control measures than a typical workflow for a translator using original electronic documents in a translation memory system. If no conversion is performed, it is more time-consuming to check terminology or use concordances during the translation, and it is also unfortunately too easy for eyes to skip over bits of text. Under time pressure this can lead to very serious problems. Even with conversion, the OCR text requires careful checking against the original document to identify and correct any errors introduced (and there will be some at times with even the best OCR software). So it is not at all unreasonable for a translator to charge a higher rate for dealing with hardcopy, scanned documents, faxes and PDF documents.

There are a number of ways to incorporate these higher charges into your business model. The two obvious ways are a premium (surcharged) word/line/page rate and hourly service charges. I usually offer both options to my clients, with the word/line rate surcharge representing the “fixed” rate and the hourly rate the “flexible” rate, where I make a non-binding estimate and they may end up paying more or less according to the actual effort. For pure OCR conversion jobs where I am not doing the translating, I charge a typical proofreading rate or a bit more, because I go through the entire document and see that it is correctly formatted for translation work and that obvious errors are fixed (i.e. basic spellcheck, etc.).

Sometimes I hear that “the client doesn’t want to pay for that”. Well, that’s OK, too. The client has the option of doing the work and doing it right and saving me the effort. The recognition that there is additional effort involved and that this effort should be compensated is important. But usually there is a way to sugar-coat the "bitter" cost pill, and this is where your marketing savvy comes into play. Some win-win arguments you might present include:
  • the availability of an editable source text the client can use for future versions;
  • the ability to create TM resources using the OCR text (which can save time/money later);
  • potentially better quality assurance, especially with tight deadlines. 
Returning a clean, nicely formatted OCR of the source document is often good "advertising". End clients may appreciate how this saves time and allows them to use the original text in a variety of ways (attorneys may like to quote arguments from the opposing side, and copy/paste beats retyping). Discriminating agencies may recognize your skill at creating documents that don’t go crazy when edited (because of screwy text boxes, bad font definitions and other format errors) and offer you more work. If your language pair is in low demand or is very competitive, this may be one more way of distinguishing yourself from the pack.
I got started doing OCR work and charging for it after suffering through the conversion of several long PDF documents by more manual methods. I finally wised up, bought FineReader and started to use it with most of the hardcopy, scanned documents, faxes and PDF documents I received, simply because it enabled me to use my TM tools and do better quality checks. I started sending the cleaner-looking source texts converted with OCR along with the target text translations, and soon I started getting requests for paid OCR work. A number of my agency clients then began to buy OCR tools and use them with varying degrees of success. Even if they do all the conversion work, I still win if they do it right, because I save time for what I enjoy more – the translation.

OCR as tool for quotation
Some people I know still haven’t learned to do a high-quality OCR (or they don’t care to), but they still use the software effectively in a very important area of their business: quotation and risk limitation.

There are lots of good tools out there for text counting, which is important to many methods of costing and time planning in the translation business. Some people even still do it manually, which, though time-consuming, is not a bad way of checking the numbers from an electronic estimate. A number of factors can result in text counts being too low – embedded objects, such as Excel tables or PowerPoint slides in a Microsoft Word document, or graphics with text - or even too high (as is the case with at least one CAT tool counting RTF and MS Word files). Keep using whichever method you prefer - I won't try to persuade you that any one approach is best. I use a number of methods myself.

When translating larger documents, however, or documents with a complex structure, it is often useful to have a “sanity check” for your text counts. On a number of occasions I have received translation jobs from agency clients where the text count was given as X words, when in fact there were quite a few more words embedded in Excel objects, bitmap graphics, Visio charts, etc. which had not been measured by the method used. In a few cases these clients had to take a loss on the job after giving a fixed price bid to the end client. Using OCR to check your estimates can prevent such an unfortunate scenario.

To do this, print the document (whatever it is) to a PDF file. Then run the PDF file through an OCR program with automatic settings (to save time – you don’t need to translate this OCR). Save the text and count it. There will probably be a bit more text due to headers or footers or perhaps garbage from graphics, but the results should be close to your other estimate. (You can always subtract an appropriate factor for the text count in headers and footers to improve your OCR estimate.) If there is a major deviation, this is a clear sign that you should take a much closer look at the document(s) before quoting the job.
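
The comparison itself is trivial to automate. A sketch with made-up numbers and an arbitrary 10% threshold (the file name is hypothetical):

    cat_count = 12480  # word count reported by the CAT tool or other method

    with open("ocr_output.txt", encoding="utf-8") as f:
        ocr_count = len(f.read().split())  # rough word count of the OCR text

    # Headers, footers and graphic garbage inflate the OCR count slightly,
    # so only a large deviation is a warning sign.
    if abs(ocr_count - cat_count) / cat_count > 0.10:
        print(f"Check the document before quoting: {ocr_count} vs. {cat_count} words")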

Searchable scanned documents
Another use I have found for OCR in recent years is creating searchable "text-on-image" documents from scanned PDFs, TIFF files and other bitmap formats. Although I have used these searchable PDFs mostly for reference while I work (searching for bits of text while viewing the original, unadulterated context) and supplied them to clients on only a few occasions, the potential for an additional value-added service is fairly obvious in this case.

Conclusion
OCR software is an essential tool for the work of many translators today, even more so than CAT software in many cases. Not just a tool for recovering “lost” electronic documents or making legacy typed material more accessible for translation work, it also offers possibilities for generating additional projects and income, differentiating one’s services and reducing risks when quoting large jobs. Key features of whatever OCR you choose should include the ability to select text areas for conversion and to determine their sequence in the converted text (using user-defined zones). Various options for saving the converted text (full page format, limited text formatting and no formatting) are also very helpful. Most important of all, though, is a good quality-checking workflow for your OCR documents (possibly including formatting) to avoid difficulties in the translation process and ensure that your work has a polished, professional appearance.

OCR software is another good tool for improving your visibility with clients and making your work processes easier in an age when many archiving and ERP systems are focused on the retention of PDF documents or TIFFs and even actively discourage saving original formats. The major providers of this software often have free, functional demonstration versions to use before making a purchase decision. Try several options and choose the best one for you. You won’t be sorry.

Jul 27, 2012

Translating embedded objects in Microsoft Office documents

Yesterday a colleague sent me a note to say he had been searching my blog for information about translating compound Microsoft Office documents (that is, documents with embedded objects) in memoQ and couldn't find any. I presume he was referring to the article about how often one CAT tool is not enough - combined workflows with other tools can frequently help solve many tricky translation problems, and DVX2 or STAR TRANSIT are definitely useful options for preparing compound Microsoft Office documents for translation in memoQ. Some time ago I recommended using STAR TRANSIT as a pre-processing tool to one of my agency friends, and he carried out a very large, complex project successfully using memoQ's excellent integration features for STAR TRANSIT projects.

There is, of course, another simple way to translate the embedded objects in a Microsoft Office document that does not involve purchasing other software licenses. I don't usually talk about it, because there are a few limitations, and until recently I had not figured out how to avoid corrupting the files when I tried to do things the "easy" way. This approach is not limited to memoQ and will actually work with most CAT tools - so SDL Trados Studio users can do this as well, for example.

It is useful to know that the Microsoft Office 2007/2010 file formats (DOCX, PPTX, XLSX) are really just ZIP files containing XML and a bunch of other stuff. That stuff includes a folder with the embedded objects in formats that can be dealt with directly.

If you have an older, binary MS Office document (DOC, PPT, XLS) with embedded objects, convert it to a 2007/2010 format.

If you rename the DOCX, PPTX or XLSX file extension to ZIP and unpack the ZIP file, you will find a folder called "embeddings" inside the folder structure. The files in that folder can be copied elsewhere and usually handled directly in your CAT tool. But problems usually arise when you put them back, rezip the folder and change the extension back to the original. The compression gets screwed up, and the Microsoft Office file is corrupted and won't open.

The only reliable method I have found for avoiding this is to use the Windows Explorer (under Windows 7) to open the ZIP file:



Here's what the "guts" of one DOCX file with a bunch of embedded Excel tables looks like:

Inside the word folder you'll find the embeddings folder:

The contents of the embeddings folder look like this:


Simply copy the embeddings folder somewhere safe, translate its contents, then copy them back to the ZIP file using Windows Explorer. Then rename the ZIP extension to the original extension for the file.

If you open the file and look at it, you'll get a shock. When you see all the objects in their original language, you might think something went wrong. Nothing bad has happened; you merely need to refresh the objects. This can be done by opening each briefly to edit or using a macro to open each object and close it again quickly. In a job with dozens of embedded objects in a long file, this macro is a helpful shortcut.

Given how easily accessible this embedded content actually is, one has to wonder why other major CAT tool providers like SDL and Kilgray have failed to offer the option of importing embedded content in their filters up to now. Let's hope they do soon. In the meantime, this workaround should enable many people to deal with this complex and irritating file format challenge.

Here's a summary of the procedure once again:
  1. Rename the *.???x file to *.zip 
  2. Under Windows 7, right-click on the ZIP file and open it using the Windows Explorer. Using ZIP tools of any kind risks corruption by changing the compression ratios. 
  3. Find the embeddings folder inside the ZIP structure. Copy this elsewhere and use it as the source for translation. It will contain all the embedded objects as single files. 
  4. Copy the translated content back into the embeddings folder in the ZIP structure.
  5. Rename the ZIP file to its original extension. 
  6. Open the file and refresh each embedded object (which will initially appear not to have been translated) by right-clicking and opening it from the context menu or running a macro to do that.
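
For batch work, or if the Windows Explorer trick is not an option, the archive can also be rebuilt programmatically. Here is a hedged Python sketch (the function and folder names are my own, and as with any re-zipping approach, the result should be round-trip tested):

    import pathlib
    import zipfile

    def replace_embeddings(docx_path, translated_dir, out_path):
        """Rebuild the archive with translated embedded objects swapped in.

        translated_dir holds the translated files, named exactly like the
        originals under word/embeddings/ (ppt/embeddings/ for PPTX)."""
        translated = {p.name: p for p in pathlib.Path(translated_dir).iterdir()}
        with zipfile.ZipFile(docx_path) as zin, \
             zipfile.ZipFile(out_path, "w", zipfile.ZIP_DEFLATED) as zout:
            for item in zin.infolist():
                name = item.filename.rsplit("/", 1)[-1]
                if item.filename.startswith("word/embeddings/") and name in translated:
                    zout.writestr(item, translated[name].read_bytes())
                else:
                    zout.writestr(item, zin.read(item.filename))

    # replace_embeddings("manual.docx", "translated_embeddings", "manual_translated.docx")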

Jun 16, 2012

memoQuickie: footnote, cross-reference & index entry segmentation in Microsoft Word files

If you have a Microsoft Word DOC file or RTF to translate, it is important to be aware of the different behaviors of the memoQ import filter options you can use. If there are footnotes, cross-references or index entries, it is far better to use the option to import the DOC or RTF file as DOCX.

The DOC file shown below has a footnote, a cross-reference and an index entry:


Adding it to a memoQ project with the default filter for Microsoft Word in memoQ 5


gives the following segmentation result:


Importing the same document with the DOCX option of the filter


yields much cleaner segmentation and better tags to work with:


Compare what some other programs do with this file:

WordFast Pro
DVX2 (DOC)
DVX2 (DOCX)

TagEditor salad (partial)

SDL Trados Studio 2009 segmentation

SDL Trados Studio 2011

There is room for improvement with most tools.


Jun 12, 2012

memoQuickie: using the memoQ PDF filter

The memoQ PDF filter is limited: it only works with PDFs created from editable text, extracts only plain text, and may have problems with complex layouts (such as multiple columns). A PDF with complex or scanned content requires tools such as OCR software (OmniPage, ABBYY FineReader, etc.) to create a source file in RTF, DOC or another format.

To translate a PDF with the memoQ filter, add it in the Project Wizard or via Project home > Translations > Add document. There are no configurable options.

 
Even simple files may have format problems.



Examine the extracted text carefully, compare it to the original and ensure that segmentation and word order are correct. If they are not, editing may be required after translation. Note how the ingredients list in Segment 3 is run together.

The target file exported after translation will be plain text. If formatting is needed, it must be applied with other software.

Apr 17, 2012

Final checks in memoQ

"Having to do a separate final check in Word is a major MemoQ disadvantage over the Word/Trados Workbench (and Wordfast Classic) WYSIWYG procedure. It might even make some of us abandon MemoQ."
I read that statement in a recent digest from the Yahoogroups memoQ forum with some puzzlement. What exactly does the author of those words mean?


There are a few arguments I can muster in favor of the necessity to do a final check in MS Word or another original format. The limited spellchecker in memoQ is one of these. Even when the MS Word spellchecker is used, as I recall memoQ (in the versions where I noticed this problem) did not flag doubled words, and I have a bad habit of typing "and and" and the like.


The use of style guide and consistency-checking tools like PerfectIt! is another good reason to do such external checks.


But when I do such things, I work on my second monitor and immediately incorporate changes in my memoQ project to keep the TM updated among other things. Also, the filters in memoQ enable me to examine the scope of some problems faster and with greater ease than multiple "Find" operations in a word processor or other software.


But if the person quoted meant simple ease of reading on the screen, I wonder if he has paid any attention to the optimal use of the memoQ translation preview. One could simply resize that pane after translation and read through a preview of the translation:


If a problem is found, clicking on the text will select it and cause the translation window (above the preview) to jump to the segment to be corrected. And of course this works for any format that yields a preview in memoQ, so you are not limited as you would be working with the Trados Workbench macros or Wordfast Classic in MS Word. Excel files, PowerPoint slides, HTML, ODT files and other formats enable you to work this way.


But another reason why I would hesitate strongly before regressing to the tools mentioned is that I would sacrifice the ability to do terminology checks with the QA module. (This is, of course, possible to a limited extent in TagEditor.) Or other QA checks which may be of interest. These features are severely underutilized, but they aren't hard to learn, and they offer considerable benefits to freelance translators in the competition for consistent formal quality.


Similar advantages are likely to be had from other recent versions of leading translation environment tools. Very often it pays to consider the points of difficulty we have with these and discuss them with other users, because often new and better ways of using them will come to light.

Apr 13, 2012

memoQuickie: character controls

memoQ has a few useful, though somewhat badly organized, functions in the Edit menu and on the toolbar which provide support for special characters and the visibility of non-printing characters (to clean up extra spaces or verify the presence of a non-breaking space, for example).

To toggle the display of non-printing characters (spaces, etc.), click the corresponding icon in the toolbar:

Special character support is a bit disjointed. The omega icon on the toolbar offers a few useful options:


However, if you need other characters, such as mathematical operators, copyright or registered trademark symbols, etc., you must go to Edit > Insert Symbol...


This opens the familiar Windows dialog:


Personally, I think this should be re-organized with that dialog also accessible under the toolbar's omega icon. I like the selection offered there, but adopting the "see more" approach of Microsoft Word would be helpful:



Jan 2, 2012

ODT files in translation environment tools

After an interesting afternoon with a friend who was a bit frustrated with the behavior of her translation assistance technology with an ODT (OpenOffice text) source file, I decided to have a look at how a variety of common tools handle this format. I created a small test file which contained some of the troublesome elements and saved it as *.odt for testing. The test file looked like this:

The ordered list was created using the numbering feature.

When the file was imported to OmegaT, the segmentation looked as follows:

Fairly clean, though the segmentation is a bit off due to the encoding of the space after the end of the sentence in the second block of text. Nine segments where there should have been ten.

With memoQ, the result was:

Altogether there were a dozen segments after import. The part with the hyperlink was incorrectly segmented into three parts instead of one. However, memoQ did handle the space tag after "tool." correctly and started a new segment at "Here". One can, of course, use the segment joining function to correct the segmentation until Kilgray gets around to fixing the segmentation on the hyperlink tag:

Update 9 January 2012: The developers at Kilgray have informed me now that this quirk in the ODT filter has been corrected and will be included in the next build released.

When I tried to test my SDL Trados Studio 2009 license, at first it refused to join the party:

Never a dull moment with SDL as we all know. SDL Trados 2007 was in fact installed, but the upgrade to Studio 2009 had trashed my 2007 installation, and I had been too irritated to do anything about it for over half a year, since I no longer use Trados for anything more than file preparation and compatibility testing, and I was still able to do that for my projects with the damaged installation. However, when I discovered that the ODT file caused TagEditor to run and hide without even saying goodbye, I sighed deeply and wasted half an hour reinstalling SDL Trados 2007. At least I didn't have to go through that insane check-in/check-out license procedure online. I trusted in God and my Windows Registry entries, the location of my license file was remembered, and all was well.

The second attempt at SDL Trados Studio 2009 was much better:

Same segmentation problem as OmegaT, and examining the tags reveals where the issue might be addressed in a tweak of the filter.

I haven't got the latest upgrade, but someone was kind enough to run my test file through SDL Trados Studio 2011, which appears to offer the best results for filtering ODT (the settings were slightly different, with the URL included, but that is also possible with some other tools):


SDL Trados TagEditor also worked after re-installation. The results were:

Oh dear. Well, it works, but if I still used TagEditor, I would run, not walk, to the much cleaner interface of OmegaT for this sort of thing if I didn't have the good sense to upgrade to Studio or something else commercial. Note the same segmentation issue and the need for filter modification.

Victor Dewsbery was kind enough to import my test file to the original Atril DVX and the newer DVX2 and send me the results:
DVX import of the test file
DVX2 import of the test file.
I also tried to test SDLX, Wordfast Pro and Wordfast Anywhere. The first two tools don't support ODT. Wordfast Anywhere claims to, but it went nowhere, with the following status message displayed in my browser for about half an hour before I gave up and went to lunch:

Of course I canceled. I had a blog post to write and a New Year to get on with. Anyone who wants to try the test file in another tool (to compare apples with apples) can get it here.

Dec 21, 2011

Presegmented "classic" Trados files

Given that many outsourcing translators, agencies and companies still use older versions of Trados but often want to work with qualified translators without tripping over tool issues, this is still a current topic despite the new SDL Trados tools having been on the market for several years. And my old published procedures on these matters are either no longer publicly available or are somewhat in need of updating.

Before I began blogging in 2008, I wrote a number of procedures to help my partner, colleagues and clients understand the best procedures for handling "Trados jobs" with other translation environment tools. When translating a TTX file with Déjà Vu, memoQ and many other applications, it is often best practice to "presegment" the file using a demo or licensed version of Trados 2007 or earlier. In fact, if this is done on the client's system, many little quirks of incompatibility that can be experienced if the translator used a different build of Trados (for example) can be avoided.

What does "presegment" actually mean? It is a particular method of pretranslation in which for segments where the translation memory offers no match, the source text is copied to the target segment. If performed with an empty TM, the target segments are initially identical to the source segments. If this procedure is followed, full, reliable compatibility is achieved between applications such as Déjà Vu and memoQ for clients using Trados versions predating Trados Studio 2009. For newer versions of Trados, the best procedure involves working with the SDLXLIFF files from Studio. If a freelance translator does not own a copy of SDL Trados 2007 or an earlier version used by an agency or direct client, this is the procedure to share with a request for presegmentation. While some clients might expect the translator to do such work using his or her own copy of Trados, I have experienced enough trouble with complex files over the years when different builds of the same version of Trados are used that I consider this to be the safest procedure to follow - safer even than having the translator do the work in Trados in many cases.

Step 1: Prepare the source files
Before creating a TTX file and presegmenting it for translation in DVX or creating a presegmented RTF, DOC or DOCX file compatible with the Trados Workbench or Wordfast Classic macros in Microsoft Word, it is a very good idea to take a look at the file and clean up any "garbage" such as optional hyphens, unwanted carriage returns or breaks, inappropriate tabbing in the middle of sentences, etc. Also, if the file has been produced by incompetent OCR processes, there may be a host of subtle font changes or spacing between letters, etc. that will create a horrible mess of tags when you try to work with most translation environment tools. Dave Turner's CodeZapper macros are a big help in such cases, and other techniques may include copying and pasting to and from WordPad or even converting to naked text in Notepad and reapplying any desired formatting. This will ensure that your work will not be burdened by superfluous tags and that the uncleaned file after the translation will have good quality segmentation.

Step 2: Segment the source files
If the source files are of types which Trados handles only via the TagEditor interface, then they may be pretranslated directly by Trados Workbench to produce presegmented TTX files. If they are RTF or Microsoft Word files, on the other hand, and a TTX file is desired, you must first launch TagEditor, open the files in that environment and then save them to create the TTX files, which are then subsequently pre-translated using Trados Workbench. If a presegmented RTF or Microsoft Word file is desired (for subsequent review using the word processor, for example), then the files can be processed directly with Trados Workbench.

Important Trados settings:
  • In Trados Workbench, select the menu option Options > Translation Memory Options… and make sure that the checkbox option Copy source on no match is marked. 

  • In the dialog for the menu option Tools > Translate, mark the options to Segment unknown sentences and Update document.

After the settings for Trados Workbench are configured correctly, select the files you wish to translate in the dialog for the Workbench menu option Tools > Translate and pretranslate them by clicking the Translate button. This will create the "presegmented" files for import into DVX, memoQ, etc. If the job involves a lot of terminology in a MultiTerm database, which cannot be made available for the translation in the other environment (perhaps due to password protection or no suitable MultiTerm installation on the other computer), you might want to consider selecting the Workbench option to insert the terms.

Note: to get a full source-to-target copy, use an empty Trados Workbench TM. However, if an original customer TM is used for this step you will often get better "leverage" (higher match rates) than if you work only with a TMX export of the TM to the other environment. If I am supplied with a TWB TM, I usually presegment with it first, then export it to TMX and bring it into memoQ or DVX for concordancing purposes. However, in some cases, such as with the use of memoQ's "TM-driven segmentation", you might get better matches in the other environment (not Trados).

Whoever performs the presegmentation might want to inspect the segmented files in TagEditor or MS Word to ensure that the segmentation does not require adjustment. Segments can typically be joined in other environments such as memoQ in order to have sensible TM entries there or to deal with structural issues in the language, but this will not prevent useless segments in the content for Trados. The best way to deal with those is by fixing the segments in Trados itself. Otherwise, I often provide a TMX export from memoQ to improve the quality of the Trados TM.

Step 3: Import the segmented source files into the other environment
The procedure for this varies depending on your translation environment tool. Usually the file type will be recognized and the appropriate filter offered. In some cases, the correct filter type must be specified (such as in memoQ, where a presegmented bilingual RTF/DOC must be imported using the "Add document as..." function and specifying "Bilingual DOC/RTF filter" instead of the default "Microsoft Word filter").

Some tools, like memoQ, offer the possibility of importing content which Trados ignores, such as numbers and dates. This is extremely useful when number and date formats differ between the languages involved. It saves tedious post-editing in Word or TagEditor and also enables a correct word count to be made.

A few words about output from the other (non-Trados) environment
If you import a TTX to Déjà Vu, memoQ, etc., what you will get when you export the result is a translated TTX file, which must then be cleaned using Trados under the usual conditions. Exporting a presegmented RTF or Microsoft Word file from DVX gives you the translated, presegmented file. The ordinary export from memoQ will clean that file and give you a deliverable target file. To get the bilingual format for review, etc. you will have to use the option to export a bilingual file.

Other environments such as memoQ or Déjà Vu may also offer useful features like the export of bilingual, commented tables for feedback. This saves time in communicating issues such as source file problems, terminology questions, etc. and is infinitely superior to the awful Excel feedback sheets that some translation agencies try to impose on their partners.

Editing translations performed with Trados
A translation performed using the Trados Workbench macros in Microsoft Word or using TagEditor can be easily reviewed in many other environments such as Déjà Vu or memoQ. In fact, I find that the QA tools and general working environment with this approach are far superior to working in TagEditor or Word, for example. Tag checks can be performed easily, compliance with standard terminology can be verified, content can be filtered for more efficient updates and more.

Editing translations performed with more recent versions of Trados (SDL Trados Studio 2009 and 2011) is also straightforward, as these SDLXLIFF files are XLIFF files which can be reviewed in any XLIFF-compatible tool.

Aug 7, 2011

Format surcharging in translation

One of the strategies I pursued early on in my career as a commercial translator was to equip myself with tools able to handle a great variety of source formats and then learn (mostly by nail-biting troubleshooting for my projects and those of others) to cope with the exceptions, typical problems for a given format and the insoluble and unexpected disasters of files with Hotel California workflows in various CAT tools. At a time when a majority of translators in my language combination were perceived as incorrigible technophobes and many agencies struggled to deal with the technical intricacies of IT and data exchange in translation, this was a path to appallingly rapid business growth.

How times have changed. Or haven't. Translation environment tools have evolved more in the past decade than some industry critics will admit, making more formats reliably accessible and data exchange between environments less likely to trigger calls to the local suicide hotline. Lots and lots of translators now "have" Trados or some other tool, or the tools have them. But it's a tenuous relationship in the majority of cases. More tenuous than many realize until suddenly the simply formatted Word file won't "save as target" from the TagEditor horror chamber, or they get the idea of actually using "integrated" terminology features in some tools and learn a new definition of despair.

My Luddite friends are right to speak of the complexities that can lurk in even the "simplest" translation environments, but I believe that dealing with these complexities in simple, rational ways and sharing the information will go farther toward simplifying our lives and enhancing our professional status than desperately clinging to outmoded ways that will increasingly restrict the flow of business. However, as we adopt new ways, we must think more about these complexities, the real effort involved and how to offset this effort in simple, economic terms.

Take file formats, for example. Over the years I have heard many suggestions from colleagues for how to charge different formats. Many of these seem rather arbitrary and not necessarily sensible to me, such as all the myriad ways people charge for the translation of PowerPoint slides. Work with PowerPoint can be simple and straightforward or a hideous nightmare requiring complex, creative combined strategies of pre-translation repair, filtering, dissection and reassembly and much more. Microsoft Word is seen as a simple format, but add a hundred footnotes, cross-references, formulae, "Word Art", embedded Excel and Visio objects, a rainbow of colors for coding and some massive, uncompressed images for good measure and you often face quite a challenge.

How do you deal with such complexity, plan for it in your schedule and charge it in a manner which is fair to the persons performing the service and those paying for it? The answer is not easy, but the typical response to the question - ignore it and charge "usual" rates or hocus pocus some percentage mark-up - is not very satisfactory.

Discrete, pre- or post-translation tasks such as OCR, format repairs, extraction and re-embedding of translatable objects or the transfer of these to separate documents for "key pair" translation are all fairly easy to handle in an acceptable, transparent way with hourly fees for the effort. When I deal with such matters, I occasionally provide the client with detailed work instructions for how to go about performing these tasks cleanly to "save" money with the caveat that if it isn't right, the work will be re-done and charged.

I have yet to come up with a standard way of coping with files that are simply so big that they choke the tools I use or tie up my resources for an hour while exporting a translated file. Here, technological aikido is usually the most effective strategy: at various stages in the past decade, for example, I have converted graphics-laden RTF or DOC files to HTML, TTX and now DOCX to minimize troubles and speed up processing. Once I have worked out a way of avoiding those big resource tie-ups (often at the cost of hours or days of thought and experimentation), I feel I don't have to consider the charge issue (but of course I'm really wrong). However, the risks of format failure are so great in my experience that "round trip" tests must be performed to ensure that once a translation has taken place, the results can be transferred to their deliverable form without much ado. If I forget to do this under pressure, I very often regret it. Think of round trip workflow testing for the files you translate as a possibly life-saving pre-flight safety check. You might not die in a crash, but business relationships will.

One issue I have meditated on for a very long time and mentioned at intervals in translators' forums without finding a reasonable answer is that of markup tags. For a long time many tools did not even count them; at one point Déjà Vu was the only one I was aware of that did. The best answers colleagues seemed to offer for tag-laden documents, which inevitably require more work and frequently lead to stability problems, were to "charge more" or "run like Hell". Both good answers, really, but lacking in the quantitative rigor my background in science leads me to prefer.

The solution arrived somewhat unexpectedly with the beta version of memoQ 5. At first I thought that SDL Trados Studio 2009 offers no solution here, but with that tool the context in which you view the statistics is important. Look at the analysis under "Reports", not "Files". Any counting tool that reports tag frequency can be used to calculate this solution, if need be with a spreadsheet if the factors cannot be added to the tool's internal statistics for words or characters.



The solution is obvious and really wasn't far from the discussion which has taken place over the years: simple word or character weighting for the tags. However, it was not until I saw the fields in the new memoQ count statistics window that I really began to think about what those factors should be.

In SDL Studio 2009 the tag statistics are found in the analysis under "Reports" as mentioned and look like this (thank you to Paul Filkin for the technical update and the graphic):



I thought about my own experience with tags in files over the years and the actual extra effort of inserting them for formatting at the beginning and end of segments or somewhere inline. For reasons I won't try to explain in an over-long post, I figure that a single tag costs me the effort of about half a word, or given the average word length in my source language, about 3.5 characters. So I put in "0.5" words and "3.5" characters as the weight factors in memoQ, and my count statistics are increased to compensate for the additional effort involved.
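
The arithmetic is as simple as it sounds; a sketch with made-up counts:

    words, chars, tags = 2000, 14000, 320  # counts from the analysis report

    weighted_words = words + 0.5 * tags  # 2160.0 words charged
    weighted_chars = chars + 3.5 * tags  # 15120.0 characters charged
    print(weighted_words, weighted_chars)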

Now you may disagree on the appropriate factor, saying it should be more or perhaps less. That's OK. I consider this a matter open to negotiation with clients. The important thing for me is that we have a quantitative basis for discussion and negotiation which anyone may check. It's important that this and other issues relevant to project planning and compensation be brought out of the closet and discussed rationally. Not just to get "fair" compensation, but to educate those involved in the processes about the effort involved and to set more realistic project goals as well.

For some of the OCR trash that clients produce ineptly and try to foist off on translators as source files, this "tag penalty" may encourage better practice or at least offset the effort of using Dave Turner's CodeZapper and other methods to clean up the mess. (However, basic structural problems caused by automated settings in OCR tools will never be overcome this way.)

In any case, this is a technique which I hope will inspire discussion and study to find its best application in various environments. And I do hope to see widespread adoption of such options in modern translation environment tools to further offset the grief occasionally encountered in modern translation.




Jan 2, 2011

Dining on Tag Salad


Often translation environment tools make our lives easier, but there are cases where this is clearly not true. Files and formats with a lot of mark-up pose particular problems in some cases. The screen shot above is just one (fairly mild) example of what one can encounter in a Microsoft Word file when the author has a psychedelic obsession with changing colors and fonts within a sentence (and, unfortunately, throughout a long document); more extreme examples can often be found with layout formats such as InDesign. (I have opted for an abbreviated tag view here; the full descriptive tags for the formatting would fill a page.) I once had a file in which I counted about 100 tags embedded in a single word. Three hundred words of easy text took a full day to translate. Files with a high density of tags are also highly prone to spacing errors and other problems in many cases, and it may be extremely difficult to identify these in anything but the final format.

The Devil is in the details, and the details of a typical CAT analysis with the same content untagged in the TM will show a high fuzzy match, which will be minimally compensated with the all-too-popular discount schemes for matches that one often encounters. The reality, however, is that arranging these formatting tags can cost more time than actual translation.

The text analysis of Déjà Vu X from Atril includes a count of tags, so one could in cases like this use that information to charge the tags in some quantified way. However, most other tools do not offer such a capability, leaving us to consider the best approach to negotiating fair compensation for such a mess. Hourly charges come to mind in this particular case, though I seldom favor that for translation work.

It is a great convenience for end clients to work in complex, native or tagged formats, but it is important to recognize the extra effort this may involve, discuss this with the client and make appropriate arrangements. What approach have you taken to this problem in the past?