Showing posts with label preparation. Show all posts

Dec 26, 2016

The challenge of too many little files to translate

It seems to me that most translators face this challenge eventually: a customer has many small files of some kind - tiny web pages, perhaps, or other content snippets in XML, text or Microsoft Word files, or even some bizarre proprietary format - and wants them translated.

Imagine a dictionary project with thousands of words with their definitions, each "entry" being stored in a separate text file. How would you translate that efficiently?

The brute force method of opening and translating each file individually is not very satisfactory. Not only does this take a long time, but when I have tried foolishness like that I tend to overlook some files and spend far too much time checking to ensure that nothing has been overlooked. And QA measures like spellchecking? Let's change the subject....

Some translation tools offer the possibility to "glue" the content of the little files together and then (usually) "unglue" them later to reconstitute the original structure of little files, now translated.

Other tools offer various ways to combine content in "views" to allow translation, editing, searching and filtering in one big pseudofile. This is very convenient, and this is the method I use most often in my work with memoQ or SDL Trados Studio after learning its virtues earlier as a Déjà Vu user.

Unusual file formats can often be dealt with the same way after some filter tweaking or development. But sometimes....

... there are those projects from Hell where you have to ask yourself what the customer was smoking when he structured his data that way, because some other way would be so much more practical and convenient... for you. Ours is generally not to question why some apparently insane data structure was chosen but to deal with the problem as efficiently as possible within budget and charge appropriately for any extra effort incurred. Hourly fees for translation rather than piece rates certainly have a place here.

Sometimes there is a technical solution, though it may not be obvious to most people. For example, in the case presented to me by a colleague on Christmas Eve


the brief was to write the translation in language XX in the empty cell in that column of the 3x2 table embedded in a DOCX file. There were hundreds of these files, each containing a single word to translate.

If these were Excel or delimited text files, a simple solution would have been to use the Multilingual Delimited Text Filter for memoQ and specify that the first row is a header. But that won't fly (yet) for MS Word files of any kind.

In the past when I have had challenging preparation to do in RTF or Microsoft Word formats - such as when only certain highlighted passages are to be translated and everything else is ignored - I have created macros in a Microsoft Office application to handle the job.

But this case was a little different. The others were always single files, or just a few files where individual processing was not inconvenient. And macro solutions often suffer from the difficulty that most mere mortals fear to install macros in Microsoft Word or Excel or simply have no idea how to do so.

So some kind of bulk external processing is called for. In this case, probably with a custom program of some kind.

I usually engineer such solutions with a simple scripting language - a dialect of the BASIC language which I learned some 45 years ago - using a free feature which is part of the Microsoft Windows operating system: Windows Scripting Host. And one-off, quick-and-dirty solutions with these tools do not require a lot of skill. The components of many solutions can be found on Microsoft Help pages or various internet forums with a little research if you have only a vague idea of what to do.

In this case, the tasks were to
  1. Select the files to process (all 272 of them)
  2. Open each file, copy the English word into the empty cell next to it
  3. Hide all the other text in the file so that it can be excluded from an import into a working tool like Déjà Vu, memoQ or SDL Trados Studio (using the options for importing Microsoft Word files in this case; the defaults usually ignore hidden text on import)
After that the entire folder structure of files could be imported into most professional translation support environments and all 300 or so words to translate could be dealt with in a single list view.

A more detailed definition of the technical challenge would include the fact that to manipulate data in some way in a Microsoft Office file format, the object model for the relevant program would probably have to be used in programming (for XML-based formats there are other possibilities that some might prefer).
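The "other possibilities" for XML-based formats can be sketched briefly: a DOCX file is just a ZIP archive, and its table text sits in word/document.xml, so it can be read without the Word object model at all. Here is a simplified illustration in Python using only the standard library (this is not the script used for this job, and real documents have more structure than it handles):

```python
# A DOCX file is a ZIP archive; its text lives in word/document.xml.
# Simplified sketch of reading table cells without the Word object model.
import zipfile
import xml.etree.ElementTree as ET

W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

def table_cell_texts(document_xml):
    # Collect the text of every table cell (w:tc) in document order.
    root = ET.fromstring(document_xml)
    return ["".join(t.text or "" for t in tc.iter(W + "t"))
            for tc in root.iter(W + "tc")]

def docx_cell_texts(path):
    # The same, straight from a .docx file on disk.
    with zipfile.ZipFile(path) as z:
        return table_cell_texts(z.read("word/document.xml"))
```

The catch with this route is writing the changes back (hiding text, in this case), which is where the object model earns its keep.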

Microsoft kindly makes the object models of all its programs available, usually for free, and there is a lot of documentation and examples to support work with them. That may in fact be a problem: there is a lot of information available, and it is sometimes a challenge to filter it all intelligently.

In this case, I needed to use the Microsoft Word object model. It also conveniently provided the methods I needed to create the selection dialog for my executable script file. The method I knew from the past and wanted to use at first is only available to licensed developers, and I am not one of these any more.

It is easy to find examples of table manipulation and text alteration techniques for Microsoft Word using its object model in VBScript or some other Microsoft Basic dialect like Visual Basic for Applications (VBA). The casual dabbler in such matters might run into trouble using these examples without an awareness of the differences between these dialects; a common stumbling block is VBA examples that declare variables by type (for example, "Dim i As Integer"). Declarations in VBScript must be untyped (i.e. "Dim i"), so a few changes are needed.

In this case, the quick and simple solution (comments are delimited by apostrophes) to make the files import-ready was:

' We have a folder full of DOCX files, each containing
' a three-column table where COL1 ROW2 needs to be copied to COL2 ROW2
' and then the COL1 ROW2 and other content needs to be hidden.

Option Explicit

Dim fso
Dim objWord
Dim WshShell
Dim File
Dim objFile
Dim fileCounter
Dim wrd ' Word app object
Dim oFile  ' Word doc object
Dim oCell1  ' first cell of interest in the table
Dim oCell2  ' second cell of interest in the table
Dim oCellx1  ' other uninteresting text
Dim oCellx2  ' other uninteresting text
Dim oCellx3  ' other uninteresting text 
Dim oCellx4  ' other uninteresting text 

fileCounter = 0

'set the type of dialog box you want to use
'1 = Open
'2 = SaveAs
'3 = File Picker
'4 = Folder Picker
Const msoFileDialogOpen = 1

Set fso = CreateObject("Scripting.FileSystemObject")
Set objWord = CreateObject("Word.Application")
Set WshShell = CreateObject("WScript.Shell")

'use the path selected in the SelectFolder method
'set the dialog box to open at the desired folder
objWord.ChangeFileOpenDirectory("c:\")

With objWord.FileDialog(msoFileDialogOpen)
   'set the window title to whatever you want
   .Title = "Select the files to process"
   .AllowMultiSelect = True
   'Get rid of any existing filters
   .Filters.Clear
   'Show only the desired file types
   .Filters.Add "All Files", "*.*"
   .Filters.Add "Word Files", "*.doc;*.docx"
         
   '-1 = Open the file
   ' 0 = Cancel the dialog box
   '-2 = Close the dialog box
   'If objWord.FileDialog(msoFileDialogOpen).Show = -1 Then  'long form
   If .Show = -1 Then  'short form
      'Set how you want the dialog window to appear
      'it doesn't appear to do anything so it's commented out for now
      '0 = Normal
      '1 = Maximize
      '2 = Minimize
      'objWord.WindowState = 2

      'the Word dialog must be a collection object
      'even with one file, one must use a For/Next loop
      '"File" returns a string containing the full path of the selected file
     
      For Each File in .SelectedItems  'short form
         'Change the Word dialog object to a file object for easier manipulation
         Set objFile = fso.GetFile(File)
         Set wrd = GetObject(, "Word.Application")
         wrd.Visible = False
         wrd.Documents.Open objFile.Path
         Set oFile = wrd.ActiveDocument

         Set oCell1 = oFile.Tables(1).Rows(2).Cells(1).Range  ' EN text
         oCell1.End = oCell1.End - 1
         Set oCell2 = oFile.Tables(1).Rows(2).Cells(2).Range  ' Target (XX)
         oCell2.End = oCell2.End - 1
         oCell2.FormattedText = oCell1.FormattedText  ' copies EN>XX
         oCell1.Font.Hidden = True  ' hides the text in the source cell

         ' hide the other cell texts (nontranslatable) now
         Set oCellx4 = oFile.Tables(1).Rows(2).Cells(3).Range
         oCellx4.Font.Hidden = True
         Set oCellx1 = oFile.Tables(1).Rows(1).Cells(1).Range
         oCellx1.Font.Hidden = True
         Set oCellx2 = oFile.Tables(1).Rows(1).Cells(2).Range
         oCellx2.Font.Hidden = True
         Set oCellx3 = oFile.Tables(1).Rows(1).Cells(3).Range
         oCellx3.Font.Hidden = True

         wrd.Documents.Close
         Set wrd = Nothing

         fileCounter = fileCounter + 1
      Next
   Else 
   End If
End With 

'Close Word
objWord.Quit

' saying goodbye
msgbox "Number of files processed was: " & fileCounter



After processing with the script, the individual files look like the screenshot above: all text in the top row is hidden, so the entire row is invisible, including its bottom border line. The script itself is saved in a text file with a *.vbs extension (it can be launched under Windows by double-clicking):


Of course the script could be made much shorter by declaring fewer variables and structuring it more efficiently, but this was a one-off job where time was of the essence and I just needed to patch together something that worked. If this were a routine solution for a client, I would be a bit more professional: lock the screen view, change to some sort of "wait" cursor during processing or show a progress bar in a dialog, and add all the other trimmings that one expects from professional software these days. But professional software development is a bit of a bore after so many decades, and I haven't got the patience to watch the same old stupid mistakes and deceits practiced by yet another generation of technowannabe world rulers. I just want to solve problems like this so I can get back to my translations or go play with the dogs and feed the chickens.

But before I could do that I had to save my friend from the Hell of manually unhiding all that table text after his little translation was finished, so I put another 5 minutes (or less) of effort into the "unhiding" script:

Option Explicit

Dim fso
Dim objWord
Dim WshShell
Dim File
Dim objFile
Dim fileCounter
Dim wrd 
Dim oFile  
Dim oCell1  ' source text cell in the table
Dim oCellx1  ' other uninteresting text
Dim oCellx2  ' other uninteresting text
Dim oCellx3  ' other uninteresting text 
Dim oCellx4  ' other uninteresting text 

fileCounter = 0

Const msoFileDialogOpen = 1

Set fso = CreateObject("Scripting.FileSystemObject")
Set objWord = CreateObject("Word.Application")
Set WshShell = CreateObject("WScript.Shell")

objWord.ChangeFileOpenDirectory("c:\")

With objWord.FileDialog(msoFileDialogOpen)
   .Title = "Select the files to process"
   .AllowMultiSelect = True
   .Filters.Clear
   .Filters.Add "All Files", "*.*"
   .Filters.Add "Word Files", "*.doc;*.docx"
   If .Show = -1 Then  
      For Each File in .SelectedItems
         Set objFile = fso.GetFile(File)
         Set wrd = GetObject(, "Word.Application")
         wrd.Visible = False
         wrd.Documents.Open objFile.Path
         Set oFile = wrd.ActiveDocument

         Set oCell1 = oFile.Tables(1).Rows(2).Cells(1).Range
         oCell1.Font.Hidden = False
         Set oCellx4 = oFile.Tables(1).Rows(2).Cells(3).Range
         oCellx4.Font.Hidden = False
         Set oCellx1 = oFile.Tables(1).Rows(1).Cells(1).Range
         oCellx1.Font.Hidden = False
         Set oCellx2 = oFile.Tables(1).Rows(1).Cells(2).Range
         oCellx2.Font.Hidden = False
         Set oCellx3 = oFile.Tables(1).Rows(1).Cells(3).Range
         oCellx3.Font.Hidden = False

         wrd.Documents.Close
         Set wrd = Nothing

         fileCounter = fileCounter + 1
      Next
   Else 
   End If
End With 

objWord.Quit
msgbox "Number of files processed was: " & fileCounter

Jan 13, 2014

Locking out other languages in memoQ source texts

One of the interesting and useful results of Kilgray introducing document language recognition features in memoQ 2013 R2 is the ability to identify and exclude segments in other languages. I see this sort of thing from time to time in German patent dispute documents which quote English patent texts extensively, or in texts to translate where new source language material has been added to an existing translation. In the past, I prepared such texts for translation by hiding the text which was already in the target language or in a language I cannot translate (such as French), or I locked it manually, which can be time-consuming in a long text. Now preparing such texts for translation is a little easier.


The screenshot above shows a patchwork document with German and English. The hundreds of segments in this job were a wild mix of the two languages with unfortunately few coherent blocks of the source language (German). To save time in preparation, I selected the option in the Operations menu to lock the segments:



The result of the locking procedure looked like this:


Most of the English segments were copied source to target and locked. The differentiation of languages is performed using statistics and is rather good, but not perfect. In slightly under 400 segments, there were 5 or 6 that were not correctly identified and locked. Several of these were in the bibliography and consisted of a long string of names plus one or two short English words or abbreviations. I saw no false positives (source language misidentified and locked), though I did hear a report of some from another translator working from Dutch to English with a very large mixed document. Discussions with Kilgray Support revealed that a "failure rate" of about 1-2% may be experienced for this feature.

So what good is it? A lot, really. It enabled me to do a quick estimate of effort and separate the two languages so I could make a reasonable assessment of the separate efforts for proofreading the English and translating the German. Obviously, if I were a project manager preparing a file for somebody else to translate, I would need to check the segments manually to correct any errors of identification. But this feature would still often save me a great deal of time in preparing the file, and manual checking is important to do anyway to ensure that there are no segmentation problems which might cause difficulties in translation.
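Kilgray has not published exactly how its statistical detection works, but the general idea behind this kind of segment locking can be illustrated with a crude stopword counter. This is purely my own sketch, not memoQ's algorithm; the stopword lists and scoring are invented for the example:

```python
# Purely illustrative sketch of statistical language identification for
# locking out non-source segments. NOT Kilgray's algorithm -- the
# stopword lists and scoring here are invented for this example.

DE_STOPWORDS = {"der", "die", "das", "und", "ist", "nicht", "mit",
                "ein", "eine", "von", "auf", "zu"}
EN_STOPWORDS = {"the", "and", "is", "not", "with", "a", "an",
                "of", "to", "in", "for", "on"}

def guess_language(segment):
    # Score the segment against each stopword list; more hits wins.
    words = segment.lower().split()
    de = sum(w in DE_STOPWORDS for w in words)
    en = sum(w in EN_STOPWORDS for w in words)
    if de > en:
        return "de"
    if en > de:
        return "en"
    return "unknown"  # very short segments often cannot be classified

def lock_plan(segments, source_lang="de"):
    # Pair each segment with a lock decision: lock anything that does
    # not look like the source language (mirroring copy-and-lock).
    return [(s, guess_language(s) != source_lang) for s in segments]
```

A toy scorer like this fails on short strings of names and abbreviations for exactly the reason described above: there are too few function words to produce a reliable score.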

Do you work with mixed language documents where this feature might be relevant? If you do, have you tried this yet? What has your experience been with your language pair(s)?

Oct 21, 2012

Put OCR in Your Business Model

This article originally appeared on an online translators portal four years ago and was long overdue for removal there. Here is an update.

*****
Optical character recognition (OCR) software is discussed often online and at translators' events, usually in the context of how to deal with PDF files. Hector Calabia, Peter Linton and others have made useful technical contributions on this subject in articles and forums and at various conferences. However, it is useful to consider OCR software in a broader translation business context. Document conversion is often very useful for translation purposes and greatly facilitates automated quality checks of the draft, for example, but OCR can also generate additional income for your business and reduce quotation risk.
OCR for translation
There are a number of programs available for this purpose, and which one is best for your purposes may depend on the language combinations you deal with and other factors. For years now I have used Abbyy FineReader, because years ago it gave the best test results for the particular set of European languages one of our clients offered. It is also relatively inexpensive (I paid about 100 euros for FineReader 11) and easy to use.

Many OCR conversions of TIFF, JPEG and PDF documents which I receive from agencies are difficult to use for translation purposes and require significant modification - if they can be used at all. Particularly in cases where TM tools are to be used or target texts differ significantly in length (especially when they are longer) there may be problems. The best ways to avoid these problems are
  • avoid automatic settings for OCR conversions; use zone definitions instead
  • avoid saving the converted texts with full formatting in most cases
  • use a suitable post-OCR workflow to clean up the converted document by joining broken sentences, removing superfluous characters, fixing conversion errors, etc.
If the idea of doing individual zone definitions on each page of a 100-page document is intimidating, take heart. Programs such as Abbyy FineReader often allow you to define layout templates, speeding up the work considerably. One translator I know became so skilled at the use of these OCR templates and was so good with his conversions that agencies hire him just to do high-quality OCR work for them. Which brings me to….

OCR as an income-generating activity for the translator or agency
Hardcopy, scanned documents, faxes and PDF documents generally require more work for translators than electronically editable documents and require different, sometimes more fallible quality control measures than a typical workflow for a translator using original electronic documents in a translation memory system. If no conversion is performed, it is more time-consuming to check terminology or use concordances during the translation, and it is also unfortunately too easy for eyes to skip over bits of text. Under time pressure this can lead to very serious problems. Even with conversion, the OCR text requires careful checking against the original document to identify and correct any errors introduced (and there will be some at times with even the best OCR software). So it is not at all unreasonable for a translator to charge a higher rate for dealing with hardcopy, scanned documents, faxes and PDF documents.

There are a number of ways to incorporate these higher charges into your business model. The two obvious ways are a premium (surcharged) word/line/page rate and hourly service charges. I usually offer both options to my clients, with the word/line rate surcharge representing the “fixed” rate and the hourly rate the “flexible” rate, where I make a non-binding estimate and they may end up paying more or less according to the actual effort. For pure OCR conversion jobs where I am not doing the translating, I charge a typical proofreading rate or a bit more, because I go through the entire document and see that it is correctly formatted for translation work and that obvious errors are fixed (i.e. basic spellcheck, etc.).
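The difference between the two options is simple arithmetic. A sketch with purely hypothetical figures (these are not my actual rates):

```python
# Hypothetical comparison of the two quotation options for hardcopy/PDF
# jobs. All figures are invented for illustration; they are not my rates.

def surcharged_quote(word_count, base_rate, surcharge_pct):
    # "Fixed" option: a premium per-word rate including the surcharge.
    return word_count * base_rate * (1 + surcharge_pct / 100)

def hourly_quote(estimated_hours, hourly_rate):
    # "Flexible" option: a non-binding estimate; the final bill follows
    # the actual effort and may be higher or lower.
    return estimated_hours * hourly_rate

fixed = surcharged_quote(5000, 0.15, 20)  # 5000 words, 0.15/word, 20% surcharge
flexible = hourly_quote(3, 60)            # estimated 3 hours of prep at 60/hour
```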

Sometimes I hear that “the client doesn’t want to pay for that”. Well, that’s OK, too. The client has the option of doing the work and doing it right and saving me the effort. The recognition that there is additional effort involved and that this effort should be compensated is important. But usually there is a way to sugar-coat the "bitter" cost pill, and this is where your marketing savvy comes into play. Some win-win arguments you might present include:
  • the availability of an editable source text the client can use for future versions;
  • the ability to create TM resources using the OCR text (which can save time/money later);
  • potentially better quality assurance, especially with tight deadlines. 
Returning a clean, nicely formatted OCR of the source document is often good "advertising". End clients may appreciate how this saves time and allows them to use the original text in a variety of ways (attorneys may like to quote arguments from the opposing side, and copy/paste beats retyping). Discriminating agencies may recognize your skill at creating documents that don’t go crazy when edited (because of screwy text boxes, bad font definitions and other format errors) and offer you more work. If your language pair is in low demand or is very competitive, this may be one more way of distinguishing yourself from the pack.
I got started doing OCR work and charging for it after suffering through the conversion of several long PDF documents by more manual methods. I finally wised up, bought FineReader and started to use it with most of the hardcopy, scanned documents, faxes and PDF documents I received, simply because it enabled me to use my TM tools and do better quality checks. I started sending the cleaner-looking source texts converted with OCR along with the target text translations, and soon I started getting requests for paid OCR work. A number of my agency clients then began to buy OCR tools and use them with varying degrees of success. Even if they do all the conversion work, I still win if they do it right, because I save time for what I enjoy more – the translation.

OCR as tool for quotation
Some people I know still haven’t learned to do a high-quality OCR (or they don’t care to), but they still use the software effectively in a very important area of their business: quotation and risk limitation.

There are lots of good tools out there for text counting, which is important to many methods of costing and time planning in the translation business. Some people even still do it manually, which, though time-consuming, is not a bad way of checking the numbers from an electronic estimate. A number of factors can result in text counts being too low – embedded objects, such as Excel tables or PowerPoint slides in Microsoft Word documents, or graphics with text – or even too high (as is the case with at least one CAT tool counting RTF and MS Word files). Keep using whichever method you prefer - I won't try to persuade you that any one approach is best. I use a number of methods myself.

When translating larger documents, however, or documents with a complex structure, it is often useful to have a “sanity check” for your text counts. On a number of occasions I have received translation jobs from agency clients where the text count was given as X words, when in fact there were quite a few more words embedded in Excel objects, bitmap graphics, Visio charts, etc. which had not been measured by the method used. In a few cases these clients had to take a loss on the job after giving a fixed price bid to the end client. Using OCR to check your estimates can prevent such an unfortunate scenario.

To do this, print the document (whatever it is) to a PDF file. Then run the PDF file through an OCR program with automatic settings (to save time – you don’t need to translate this OCR). Save the text and count it. There will probably be a bit more text due to headers or footers or perhaps garbage from graphics, but the results should be close to your other estimate. (You can always subtract an appropriate factor for the text count in headers and footers to improve your OCR estimate.) If there is a major deviation, this is a clear sign that you should take a much closer look at the document(s) before quoting the job.
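The sanity check boils down to comparing two word counts and flagging a large deviation. A minimal sketch (the 10% tolerance is my own assumption, chosen to absorb headers, footers and graphics garbage in the OCR output):

```python
# Minimal sketch of the quotation sanity check described above: compare
# the count from your usual method against a count of the OCR-dumped
# text and flag large deviations. The default tolerance is an assumption.

def word_count(text):
    return len(text.split())

def counts_agree(quoted_count, ocr_text, tolerance_pct=10):
    # Returns (agreement?, OCR count). A False result means: look much
    # more closely at the document before quoting the job.
    ocr_count = word_count(ocr_text)
    deviation = abs(ocr_count - quoted_count) / quoted_count * 100
    return deviation <= tolerance_pct, ocr_count
```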

Searchable scanned documents
Another use I have found for OCR in recent years is creating searchable "text-on-image" documents from scanned PDFs, TIFF files and other bitmap formats. Although I have used these searchable PDFs mostly for reference while I work (searching for bits of text while viewing the original, unadulterated context) and supplied them to clients on only a few occasions, the potential for an additional value-added service is fairly obvious in this case.

Conclusion
OCR software is an essential tool for the work of many translators today, even more so than CAT software in many cases. Not just a tool for recovering “lost” electronic documents or making legacy typed material more accessible for translation work, it also offers possibilities for generating additional projects and income, differentiating one’s services and reducing risks when quoting large jobs. Key features of whatever OCR you choose should include the ability to select text areas for conversion and to determine their sequence in the converted text (using user-defined zones). Various options for saving the converted text (full page format, limited text formatting and no formatting) are also very helpful. Most important of all, though, is a good quality-checking workflow for your OCR documents (possibly including formatting) to avoid difficulties in the translation process and ensure that your work has a polished, professional appearance.

OCR software is another good tool for improving your visibility with clients and making your work processes easier in an age when many archiving and ERP systems are focused on the retention of PDF documents or TIFFs and even actively discourage saving original formats. The major providers of this software often have free, functional demonstration versions to use before making a purchase decision. Try several options and choose the best one for you. You won’t be sorry.

Jun 27, 2012

OTM integration for SDL & more ahead

Some months ago, I dropped into the offices at LSP.net for a chat and was surprised by a white board covered with interesting scribbles that hinted at the most fundamental change of design yet for the translation workflow management tool OTM. In the two and a half years I have used the platform, I have seen many positive developments, which have scratched off nearly everything on my initial wish list (with other plans that may take care of the few remaining items of streamlining for freelancers by sometime late this year or early next year), but I really didn't expect to see significant integration with translation environment tools except perhaps a bit of analysis log reading for quotation purposes.

I was wrong. A few days ago, the official announcement of the integration plans for OTM and SDL Trados Studio hit my inbox, and discussions with the system's architect have made it clear that there's more ahead, though schedules are a bit vague at present. It's not clear from the post on the LSP.net blog, but both the server and desktop (freelance) versions of SDL Trados will be integrated with the new middleware component, though there will be some differences in function.

I'm not a huge Trados fan as much as I respect some of the developments in recent years, but I'm excited by hints that the online workflow tool may make it easier to use SDL Trados to generate project file collections not only for translators using Trados but for those with other tools as well. If this really happens, I think it will be a great boost for interoperability and reduce the perceived "bother" of working with translators who use tools with which the project manager may not be competent.

Another thing that I like about the development plans for CAT integration with OTM is that the middleware component is actually vendor-neutral. This means that it can facilitate integration at the server and desktop level with more than just SDL Trados in the future. Where this may lead is unclear right now, but I think that the changes announced for memoQ version 6 for Microsoft Office file handling in the server API make that an obvious candidate, though the lack of a client API for memoQ still means that there is more potential benefit to Trados users.

Although I'm quite encouraged by this new direction for my workflow tool, I probably won't use these functions myself, mostly because Trados is not my tool of choice for translation, though it is an important part of some of my preparation, data migration and terminology output workflows. OTM has an enormous range of features - it's designed to run a medium-sized or larger translation agency with widespread global interests - and some of the presentations of this enormous range of functions can be quite intimidating. But like other tools I use that have many functions - Microsoft Word and memoQ, for instance - I need only a small fraction of those functions for my routine work, and the underlying simplicity of the environment enables me to use the parts I need with great efficiency and security for my clients. And on those rare occasions when I need to work as more than just the Lone Translator, I can draw on whatever I need.

The pace of development for OTM has slowed since I first became involved with it as the issues identified in the pilot phase were solved, often in some surprisingly useful and original ways. But the SaaS solution continues to grow in very practical ways, such as improved security, and its development is still driven by the same basic needs that led to its creation in the first place - the needs of its users to "punch above their weight" profitably in competitive project management.

May 10, 2012

memoQfest 2012: practical outsourcing with memoQ desktop editions

This week I'm in Budapest for Kilgray's memoQfest, the annual gathering of users and curious bystanders as well as CAT tool competitors who want to learn how to get the technology right.

This morning I gave a presentation on a topic which has become a regular part of my consulting with colleagues and clients... or better said, has been a part of my work with them from the beginning of my association with the language services industry. Every time I hear a translation agency or corporate translation consumer say that a qualified translator with the subject expertise needed cannot be used because he or she doesn't work with the "right" translation environment tool, I am saddened by the foolishness of that statement or the lack of understanding it reveals. In mature IT sectors, interoperability and lossless data exchange have been common for decades, though sometimes one must be clever to get there. But the cottage industries for languages are, in many ways, stuck in the IT mentality of the early 1980s despite the fact that the actual technology today is more like Y2K. Gut Ding braucht Weile, as the Germans say: good things take time.

At a memoQ master class yesterday, Kilgray COO István Lengyel stated that "Interoperability is the art of compromise." True, but if you keep your wits about you and apply them, the compromises are usually not as awful as originally assumed.

memoQ is distinguished by the great ease with which it can manage data to be used in translation with nearly any other translation environment. SDL Trados Studio actually does a few things better, but overall, the utility and ease of use of memoQ are far greater in most cases. It's like a Swiss Army knife of translation integration, with one or more reasonable workflows for almost anything.

Today's talk was a 45-minute distillation of a workshop I usually deliver in half a day. It was a bit of a challenge to pare it down in the limited time available, what with last-minute translation projects having more content than planned and late nights talking to colleagues from around the world. For experienced users, most of what I had to say was "old hat"; some new memoQ users were surely overwhelmed by a flood of "new" information. I hope that each listener was able to leave the session with at least one useful idea to apply to improve their business. For those who fell asleep and couldn't take notes, or anyone else who likes to play "guess the context" with lecture slides, here is a link to the slides from the talk. Questions are welcome on whatever appears mysterious; most of it is covered somewhere on this blog in great detail. Re-use is permitted for any morally acceptable purpose (with attribution, please). When Kilgray puts the video of the talk online, I'll post it here so those slides make more sense.

May 4, 2012

Preparing MS Word text with a specific highlight color

If the Catholic Church decides it needs an official backup for Jerome as the patron saint of translators (and in these times of tribulation, one cannot have enough divine help, I suppose), Dave Turner of ASAP Traduction gets my vote. His CodeZapper macros for Microsoft Word have saved us many thousands of hours of grief dealing with rogue tags in RTF and MS Word files, which screw up TM and termbase matches and make work very difficult. His other, more recent contributions also offer useful support. He is the first one in my mind when I see a problem and think "There ought to be a macro for that!"

Dave's latest contribution was part of an old discussion about preparing texts for translation in which the text to translate was marked by a highlight color. As I remember the original discussion, there were several highlight colors, and only one was to be chosen for work. Usually, if text of a certain color is to be hidden or shown when preparing a file for translation with a CAT tool which filters out hidden text, I use the search and replace function in Microsoft Word. That does not work for selecting a highlight of a specific color. You need a macro for that, and I no longer have the VBA skills to handle it myself. I can adapt a working macro, but there is no way I could write a good one from scratch unless I spent a few days or more re-learning the skills of a decade ago.

So I was very happy when I saw his answer to the problem in the memoQ Yahoogroups forum, which I have reproduced here with just a minor change to reflect the usual highlighting I encounter:

Sub HideExceptYellow()
'
' Translation assistance macro
' by Dave Turner
' http://asap-traduction.com/
'
' Hides all text in the document except text highlighted in yellow,
' then re-applies the yellow highlight to the text left visible.

    Dim rDcm As Range

    ' Start by hiding all text in the document
    ActiveDocument.Range.Font.Hidden = True

    ' Walk backward through the highlighted ranges; for yellow ones,
    ' remove the highlight and unhide the text
    Set rDcm = ActiveDocument.Range
    With rDcm.Find
        .Text = ""
        .Highlight = True
        .Forward = False
        While .Execute
            If rDcm.HighlightColorIndex = wdYellow Then
                rDcm.HighlightColorIndex = wdNoHighlight
                rDcm.Font.Hidden = False
                rDcm.Collapse Direction:=wdCollapseStart
                rDcm.Start = ActiveDocument.Range.Start
            End If
        Wend
    End With

    ' Re-apply the yellow highlight to everything still visible
    Set rDcm = ActiveDocument.Range
    Options.DefaultHighlightColorIndex = wdYellow
    With rDcm.Find
        .Text = ""
        .Font.Hidden = False
        .Forward = False
        .Replacement.Highlight = True
        .Execute Replace:=wdReplaceAll
    End With
End Sub
In MS Word 2003 and earlier versions, the macro can be created under Tools > Macro > Macros. Name the macro, then click the Create button to paste in the code. The Run button will execute an existing macro if it is selected.


In MS Word 2007/2010 the same functionality is accessed on the View ribbon with the Macros icon or Alt+F8.

Here's a short video showing the procedure to copy the code into the Normal global template in Microsoft Word, where it is available to all open documents in Word:



To adapt this for another highlight color, just rename the macro and change the color designation (wdYellow). The macro can be adapted to deal with combinations of highlight colors as well, and similar methods can be used for text colors, though those can also be handled in the search and replace dialog.

Dec 27, 2011

Wordfast Pro: Translating memoQ bilingual RTF tables

After a recent crisis experienced by an agency friend of mine, in which a translator did a large job of some 22,000 words and was then unable to incorporate a reviewer's corrections (resulting in a rather creative but stressful rescue effort involving memoQ LiveDocs), I resolved to have a look at Wordfast Pro myself and see if there wasn't some better, easier way to work with translators who have this tool.

The current version of Wordfast Pro doesn't support XLIFF, so that's out as a possibility. However, it does read RTF files, so I tried the same techniques which have recently proved successful for improving the interoperability workflows with Trados TagEditor and SDL Trados Studio, among others. And indeed this approach was successful.

A view of the memoQ bilingual RTF file imported into Wordfast Pro for translation.
By hiding the red tags with the
mqInternal style, the tag content is protected in Wordfast Pro
To prepare content in memoQ for translation in Wordfast Pro, do as follows:
  1. Copy the source to the target for the entire text.
  2. Export a bilingual RTF file.
  3. Hide all the content of the RTF file which is not to be translated.
  4. Use the search and replace function in your word processor to hide the dark red text of the tags, which are marked with the mqInternal style. The settings for the dialog in Microsoft Word are shown below and are set using the Font... option (marked with a red arrow in the screenshot) in the Format dropdown menu of the Replace dialog.



    The font color to hide will be found under More Colors... in the font colors of the font properties dialog:

    In this way, the translation can proceed without the risk of damaging the text constituting a tag, and the QA features of Wordfast Pro can be used to do a tag check before delivery.

    After the translation is completed and the tags have been checked, export the RTF file and unhide all the text. If a comments column is available, any comments added to the table will be imported back into memoQ for feedback.

SDL Trados Studio: Translating memoQ bilingual RTF files

Some time ago, I noted that SDL Trados Studio experiences difficulties importing XLIFF files in which the sublanguages are not exactly specified, if the default languages are not set to the same major language. So if you plan to translate an XLIFF from memoQ or another tool in SDL Trados Studio, it is necessary to ask the person generating the file to specify the sublanguages or, if that is not practical, to use the workaround described here. I discovered this bug before the release of the 2011 version of Studio and spoke to SDL development and management staff specifically about it at the TM Europe conference in Warsaw, but apparently fixing it is not a priority compared to other issues, and it may be a while before SDL Trados Studio users can work with client XLIFF files without coping with this headache.
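For the technically inclined, the sublanguage attributes can also be patched directly in the XLIFF before import. Here is a minimal Python sketch using only the standard library; the attribute names follow the XLIFF 1.2 format, but the function name, file paths and locale codes are my own illustrative choices, so adapt as needed:

```python
import xml.etree.ElementTree as ET

XLF_NS = "urn:oasis:names:tc:xliff:document:1.2"
ET.register_namespace("", XLF_NS)  # avoid ns0: prefixes in the output

def set_sublanguages(xliff_path, out_path, src="en-US", trg="de-DE"):
    """Rewrite the source-language/target-language attributes of every
    <file> element so the sublanguages are fully specified."""
    tree = ET.parse(xliff_path)
    for file_el in tree.getroot().iter("{%s}file" % XLF_NS):
        file_el.set("source-language", src)
        file_el.set("target-language", trg)
    tree.write(out_path, encoding="utf-8", xml_declaration=True)
```

Run against a copy of the exported XLF before handing it to a Studio user; the rest of the file is left untouched.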

Several of my client agencies using memoQ for project management have quite a number of freelance translators using various Trados versions and who have no intention to stop doing so. It's important to work smoothly with these resources in a compatible way, which also protects the data and formats. In a recent article on processing memoQ content with Trados TagEditor, I published a procedure I developed which enables the memoQ tags in the text of the bilingual RTF table export to be protected as tags when working in SDL Trados TagEditor. Now I would like to present a similar approach for Trados Studio users, which can serve as an alternative to XLIFF exchange.

If the bilingual RTF table is created in memoQ with the mqInternal style specified for tags, this style setting can be declared non-translatable in SDL Trados Studio. To do this, select the menu choice Tools > Options, and in the dialog which appears under File Types, add the mqInternal style to the list of styles to be converted to internal tags in the appropriate formats (RTF and, in case the file gets re-saved as a Microsoft Word document, Microsoft Word 2000-2003 and Microsoft Word 2007-2010 as well):

SDL Trados Studio dialog for setting RTF, DOC and DOCX styles as "non-translatable" (converting to tags)

Once the mqInternal style has been entered this way in SDL Trados Studio, the prepared bilingual RTF file can be imported. "Preparation" for import includes copying the source text to the target and hiding all the text you do not intend to translate (the file header, the source column, and the comments and status columns if present). The result will look something like this:

The prepared memoQ bilingual RTF file imported to SDL Trados Studio. Note that the bold and
italic type are displayed normally as in memoQ, which offers the translator greater working ease.

Please note that the same procedure described for working with these files in TagEditor (hiding the red text of the tags, see the TagEditor article for details) also works for SDL Trados Studio, but this method involving the mqInternal style saves a few steps.

Clean up the tag mess with CodeZapper for all CAT tools

Readers of this blog probably know by now that I am a Dave Turner fan. His CodeZapper macros have probably saved me hundreds of hours of wasted time over the years (not an exaggeration), and I think there are a lot of other translators and project managers with similar experiences. It doesn't solve every problem with superfluous tags, but it solves a lot of them, and Mr. Turner works steadily at improving the tool. I blogged the release of the latest version not long ago; it is now available directly from him for a modest fee of 20 euros (see the link to the release announcement for a contact link). That means it pays for itself in far less than an hour of saved time.

Over the past few days I have been updating some training documentation and running a lot of tests on tagged files as part of this. During this work, I have been struck time and again by the differences in the tags "found" by different tools working with the same file. Sometimes one tool looks better than another, but the patterns are not always consistent. What is most consistent is the ability of CodeZapper to clean up the files in various versions of Microsoft Word and make the tag structures appear a little more uniform.

Here's an example of the same DOCX file "unzapped" in several tools:

Import into memoQ 5, as-is, no tag clean-up. Previous versions of the same file showed more tags in places.
SDL Trados Studio 2009 before tag clean-up.

TagEditor in SDL Trados 2007 before tag clean-up

Initially, OmegaT would not import that particular DOCX without a tag cleanup. I reported the problem to the developers, who upgraded the filter to handle a previously unfamiliar character in the internal paths of the ZIP file (a DOCX is actually just a renamed ZIP package, like many other file types). See http://tech.groups.yahoo.com/group/OmegaT/message/23931 for information on the new release. Opening, editing and re-saving the troublesome file also enabled it to be imported without the latest bugfix, so users should keep that trick in mind if a similar problem is encountered. I've had to do similar things in the past with other tools, so this is probably a good general tip regardless of what tool you use. When I downloaded and tested the latest standard release of OmegaT (2.3.0_4), the tag structure looked fine - no zapping of the DOCX was necessary in this case.
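For those comfortable with a little scripting, the open-and-re-save trick can be approximated outside Word: simply repacking the DOCX with a standard ZIP library rewrites the archive entries that sometimes confuse an importer. A hedged Python sketch (the function name is my own, and this only normalizes the archive itself, not the XML inside it):

```python
import zipfile

def repack_docx(src_path, dst_path):
    """Copy every entry of a DOCX (a renamed ZIP package) into a
    freshly written archive, normalizing the ZIP structure."""
    with zipfile.ZipFile(src_path) as src, \
         zipfile.ZipFile(dst_path, "w", zipfile.ZIP_DEFLATED) as dst:
        for name in src.namelist():
            dst.writestr(name, src.read(name))
```

If an import fails on a DOCX that opens fine in Word, repacking it this way is a cheap first thing to try.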

After treatment with CodeZapper, the file looked the same in memoQ (where the extra tags weren't present in the first place, though one can't count on things always being this way). The view in Trados Studio and TagEditor improved significantly, though there were still more tags, and OmegaT accepted the DOCX after tag cleaning.

SDL Trados Studio 2009 import of the DOCX file after tag cleanup with CodeZapper

SDL Trados 2007 TagEditor import of the DOCX file after tag cleanup with CodeZapper

OmegaT import of the DOCX file after tag cleanup with CodeZapper (OmegaT 2.3.0_3)

It is important to consider that superfluous tags mean wasted work time on formatting and QA corrections, and perhaps even a higher risk of file failure (such as the inability to import the file at all into one tool). This is why, for some time now, I and others have advocated modifying the costing of volume-based translation work to include the number of tags. This requires, of course, access to a counting tool which reports the number of tags (SDL Trados Studio does this, Atril's Déjà Vu has long offered the feature, and memoQ even allows you to assign a word or character "weight" to tags for counting purposes). This is the only fair way I know of to account for the extra work (besides time-based charges). Consider that everyone is affected: translators, reviewers and project managers! I've had to talk more than one of the last group through "tag rescue" techniques after hours.
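As a rough illustration of what tag-weighted counting means, here is a Python sketch. The tag pattern and the half-word weight are my own illustrative assumptions, not any tool's defaults, and real CAT tag syntax varies by tool:

```python
import re

# Crude matcher for XML/HTML-style inline tags; adjust for the
# actual tag syntax of the files you are counting.
TAG_PATTERN = re.compile(r"</?[^<>]+>")

def weighted_count(segment, tag_weight=0.5):
    """Count the words in a segment, billing each inline tag as a
    fraction of a word (0.5 is an arbitrary example weight)."""
    tags = len(TAG_PATTERN.findall(segment))
    words = len(TAG_PATTERN.sub(" ", segment).split())
    return words + tags * tag_weight
```

With a weight of 0.5, a segment of ten words containing four tags would thus be billed as twelve words.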

Perhaps it is worth considering as well that cleaner tagging will also improve "leverage" (match quality) in translation memories. So if a tool consistently offers cleaner tag structures (for a variety of source formats), managing projects efficiently with that tool will save still more time and money on top of what the CodeZapper macros save in MS Word files.





Dec 25, 2011

Trados TagEditor: Optimal translation of memoQ bilinguals

With the growing number of translation agencies, direct clients and outsourcing translators adopting Kilgray's memoQ as a working platform for managing translation project content, it is particularly important for these new memoQ users and their partners to understand the best approaches to working together with persons who use other tools. One tool which is still commonly found is SDL Trados TagEditor. Compared to the other "classic" Trados tool, the Workbench macros for Microsoft Word, TagEditor has the advantage of enabling many different file formats to be processed while protecting their formatting elements (also known as "tags").

SDL Trados TagEditor can work with two types of "bilingual" files prepared in memoQ: XLIFF (*.xlf) files and bilingual RTF tables. Each approach will be presented here along with some suggestions for best practice.

XLIFF files
TagEditor comes with a default INI file for translating XLIFF, typically found at the path C:\ProgramData\SDL International\Filters\XLIFF.ini. This INI enables the contents of the target segments of the memoQ XLF file to be translated as the source in TagEditor. Thus, for this approach to work, the source must be copied completely to the target in memoQ before the bilingual XLIFF is created using the Export bilingual function of the Translations page. This makes pretranslation undesirable in most cases, because the source text for matches will not be accessible and the translator will end up with a very screwy TM. Data for the TM should be supplied to the translator as TMX; be aware that match rates for the segments in TagEditor will differ significantly in some cases.
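If a batch of XLIFF files arrives with empty targets, the copy-source-to-target step can also be done after the fact with a script rather than in memoQ. A hedged Python sketch using only the standard library (element names follow XLIFF 1.2; the function name is my own):

```python
import copy
import xml.etree.ElementTree as ET

NS = "urn:oasis:names:tc:xliff:document:1.2"
ET.register_namespace("", NS)

def copy_source_to_target(in_path, out_path):
    """For every trans-unit whose <target> is empty or missing,
    clone the <source> content into it."""
    tree = ET.parse(in_path)
    for tu in tree.getroot().iter("{%s}trans-unit" % NS):
        source = tu.find("{%s}source" % NS)
        target = tu.find("{%s}target" % NS)
        if source is None:
            continue
        if target is None:
            target = ET.SubElement(tu, "{%s}target" % NS)
        if not (target.text or list(target)):  # leave filled targets alone
            target.text = source.text
            for child in source:
                target.append(copy.deepcopy(child))
    tree.write(out_path, encoding="utf-8", xml_declaration=True)
```

This is a sketch, not a substitute for doing the copy in memoQ when you control the export; always verify the result in the receiving tool before distributing files.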

The memoQ XLIFF files will have a lot of "junk" at the top of the file when viewed in TagEditor:

Skip the content between the mqfilterinformation tags and do not change it in any way. Place the cursor below it to start working. If you prefer not to see that information at all, use the XLIFF INI for TagEditor which I modified for use with memoQ XLF files. Then the XLIFF will look a bit cleaner, with the header information filtered out:

Astute observers may have noticed, however, that all is not really well with the tag structures in the views above. I think there is a problem with the way memoQ generates the XLIFF files, with some tag structures being replaced by entities. (You can see this if you open the XLIFF from memoQ in a text editor.) This causes consistent problems like the following in TagEditor:


This will require a lot of tag fixing. Thus I really can't recommend the XLIFF method at this point, not for my simple little test file in any case. The methods using the bilingual RTF tables with memoQ tag protection are safer and the structures that result are much simpler.

But if you do use this method, when the translation is complete, clean the TTX file using Trados Workbench or use the menu option File > Save Target As... in TagEditor to create an XLIFF file to return with the translated content. If the content inside the mqfilterinformation tags has not been segmented, an accurate count of the words translated will be shown in Trados Workbench upon cleaning the TTX (as accurate as that tool is given its limitations with numbers, dates, etc.)

Bilingual RTF tables
These are created in memoQ using the Two-column RTF option of the Export bilingual function. Technically speaking, the files have more than two columns (source and target, index numbers, and possibly columns for a second target text, comments and status). Good practice for working with these files in TagEditor and many other tools also requires the source to be copied to the target column. This can be done in memoQ or later in a word processor. The table might look like this, for example:


For best results in TagEditor, it is important that this file be generated with the "mqInternal" style selected for tag formatting. The dark red color imparted to the tags with this option means that proofreading in a word processor is easier, and it also enables the text of the tags to be selected and hidden using a search and replace function. If the RTF file is then saved as a Microsoft Word file, the memoQ tags in the table will then be protected in TagEditor!


If the "full text" option for tags is selected, this makes little or no difference in the TagEditor view.

Here's a quick look at what the protected memoQ tags look like in TagEditor and what can happen without protection:




One possible workflow for memoQ RTF tables in SDL Trados TagEditor consists of the following steps:
  1. Copy the source text to the target in memoQ
  2. Export a bilingual "two-column" RTF file with the mqInternal style option selected for the tags
  3. Re-save the RTF as a DOC or DOCX file! This is necessary so that TagEditor will use the right filter.
  4. Select and hide all the text in the file
  5. Select only the text to translate in the target column and unhide it
  6. Using search and replace, hide all the dark red text. The settings for the dialog are shown below and are set using the Font... option (marked with a red arrow in the screenshot) in the Format dropdown menu of the Replace dialog.


    The font color to hide will be found under More Colors... in the font colors of the font properties dialog:

  7. Launch TagEditor and open the Microsoft Word file with your content to translate. All the hidden text will be protected in tags. Translate the accessible text.
  8. Create a target MS Word file from your TTX as described above for the XLIFF files translated in TagEditor.
  9. Open the target file and unhide all the text.
  10. (Optional) When reviewing the text in the word processor, comments may be added if there is a comments column. These will be imported back into memoQ and can serve as valuable feedback.
  11. Re-save the target file as an RTF
  12. Re-import the RTF with the translated table into memoQ. The target text will be updated to include the translation. 
  13. A QA check for tags, terminology, etc. should be performed in memoQ before exporting the final file for delivery. If an external reviewer is used, another bilingual file in an appropriate format can be generated in memoQ for that work.
Steps 4 to 6 can be performed using a macro for convenience.

The procedure described above can, of course, be abbreviated considerably by simply copying the source text cells into a new Microsoft Word document, doing the search and replace to hide the dark red text for the tags, then processing the file in TagEditor. After translating, unhide the text in your working file, then paste the cells over the target cells in the RTF file.

Here's a look at the test file translated in TagEditor (with a comment added as shown by the dark speech balloon icon) after it was re-imported to memoQ:



And here's the translated file itself: