Translation Tribulations: CodeZapper


Feb 23, 2014

Cleaning up a crappy OCR job for translation

It's a sad fact in the professional work of translators that a lack of understanding of how to deal effectively with various PDF formats causes an enormous loss of productivity and results which are not really fit for purpose. The aggressive insistence of many colleagues possessed of a dangerous Halbwissen (half-knowledge) on using half-baked methods and inappropriate tools contributes to the problem, but, bowing to the wisdom about arguing with fools, I now mostly sit back with a bemused and amused smile and watch the tribulations of those who believe in salvation by PDF import filters and cheap or free OCR. "TANSTAAFL" is as true as it ever was.

Just before the weekend I got an inquiry from an agency client I rather like. Nice people, good attitude, but sometimes struggling to find their way with technology despite some in-country "expert" training. This inquiry looked a bit like ripe fish at first glance. The smell got stronger after I was told that, because the corporate end client had converted the PDF of their annual report and begun to edit the mess (and comment on it heavily too) in the OCR file, this would be all there was to work with. It was a thoroughly appetizing sight when imported into a translation environment:

There are so many issues in that tossed salad of translation terror that I don't even know where to start describing them.

The screenshot above was in memoQ. How does it look in SDL Trados Studio? Often just as messy. In this case, this was the result in an older version of Studio:

SDL Trados Studio choked and refused to import the file!

I do have the latest version of SDL Trados Studio 2014, but unfortunately it's on a system that does not yet have Microsoft Office, because I refuse to bow to Microsoft's insistence that I must buy a Portuguese version of that software. No MS Office, no file import in this case with SDL Trados Studio. memoQ fortunately has not needed MS Office to import its old file formats since the release of memoQ 6.0.

Ugly OCR trash like this file is all too common at this time of year, and as I am busy compiling the syllabus for the workshop I want to do on better living with well-used technology for legal and financial translators, I felt obliged to take this one on as a teaching example. It's actually not as bad as it looks. On the other hand, the best approach may not always be obvious, and the best solution for one document may not apply as well or at all to another.

My first approach was to use Dave Turner's CodeZapper macros. This isn't as straightforward as it used to be since I downgraded from Microsoft Office 2003 to later versions; for some reason the toolbar refuses to stay loaded between work sessions, and there's no way I can keep track of all the abbreviations for macros on it.

I can't deal with anything more complicated than clicking the "CZL" option for "Code Zapper lite", which did a rather decent job on the heavy mess above:

But all was not quite as well as it seemed:

Text in the header and footer remained trashed, and the heavy use of comments and tabbed lists meant that there were plenty of legitimate tags to deal with which were just too confusing with the DVX-like mess of memoQ's default import and display for an RTF file.

So I went for a kinder, gentler approach. I changed my import filter settings in memoQ:

There is actually seldom any good reason to import an RTF or DOC file into memoQ using the default filter settings. And marking those two little checkboxes at the bottom often accomplishes much of what CodeZapper does. Sometimes less. A bit more in this case.

The header and footer texts were absolutely clean. Don't let the extra tags in this sample fool you: overall, there were fewer than in the code-zapped file. Now there are still a number of issues to be seen in the screenshot above, including paragraph breaks in the middle of a sentence and awful manual hyphenation (many instances of that in the whole text) and joys like badly placed comments and links which mess up the text and prevent term identification by the software:

Source editing features of memoQ (F2) enable issues like the two above to be dealt with easily:

After a bit of repair like this in the memoQ environment (where it is really much, much easier to fix the problems of bad comment and link placement), I copied the entire source text to the target to enable me to export a cleaner source text file. I then opened this file in Microsoft Word and used various search and replace operations to fix the bad hyphenation and other problems like excess spaces. Replacing the hyphens had to be done occurrence-by-occurrence, because the style of writing in German meant that there were many legitimate instances of hyphens followed by spaces.
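The occurrence-by-occurrence review can be supported by a small script that lists the candidates first. What follows is only a minimal Python sketch, not what I actually did in Word: it assumes the exported source is available as plain text, that a broken word continues in lowercase after the hyphen, and that a hyphen followed by a conjunction such as "und" is a legitimate German suspended hyphen. The function name review_hyphens and the conjunction list are made up for illustration, and the list is certainly not exhaustive.

```python
import re

# Words that typically follow a legitimate suspended hyphen in German,
# as in "Ein- und Ausgang". Hypothetical, incomplete list.
CONJUNCTIONS = {"und", "oder", "bzw.", "sowie"}

# A hyphen directly after a word character, then whitespace,
# then a token starting with a lowercase letter (captured in the lookahead).
CANDIDATE = re.compile(r"(?<=\w)-\s+(?=([a-zäöüß]\S*))")

def review_hyphens(text):
    """Return (offset, context, suggestion) for each hyphen + whitespace hit."""
    hits = []
    for match in CANDIDATE.finditer(text):
        next_word = match.group(1)
        suggestion = "keep" if next_word in CONJUNCTIONS else "join?"
        context = text[max(0, match.start() - 15): match.end() + 15]
        hits.append((match.start(), context, suggestion))
    return hits

sample = "Die Ein- und Ausgänge der Gesell- schaft"
for offset, context, suggestion in review_hyphens(sample):
    print(offset, repr(context), suggestion)
```

The script only makes suggestions. Anything marked "join?" still needs a human look, and hyphens followed by a capitalized word are not reported at all, so this narrows the manual search rather than replacing it.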

After all was done, the "before and after" looked like this:

BEFORE

AFTER

The remaining tags were all legitimate formatting tags for comments, hyperlinks, tabs after section numbering, etc. These do, of course, require attention and add complexity to the work still, so they must be included in the charges for the job. memoQ makes this calculation particularly simple by allowing weighting factors to be specified in the analysis. These are the settings I typically use for a German source text:

I find this usually represents a fair minimum for the additional effort in translation and quality assurance that tags require. In this case, of course, time charges for the cleanup apply, but as you can probably guess from comparing the two analysis tables above, the customer is actually saving a lot of money by paying me to clean up the mess, and the results will be a lot more usable. My cleaned-up version of the source text will also be returned in case the authors intend to make more revisions in the source - this will save more time and money by avoiding redundant cleanup in that case.
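To make the arithmetic behind tag weighting concrete, here is a rough Python sketch. It is not memoQ's analysis algorithm. It assumes that inline tags appear in the text as numbered placeholders like {1}, and the weight of 0.25 words per tag is a purely hypothetical figure, not my actual setting. The function name weighted_count is also invented for the example.

```python
import re

# Assumed notation: inline tags rendered as numbered placeholders such as {1}.
TAG = re.compile(r"\{\d+\}")

def weighted_count(segments, tag_weight_words=0.25):
    """Return (words, tags, weighted word total) for a list of segments.

    tag_weight_words is a hypothetical weighting: each tag is billed
    as that fraction of a word.
    """
    words = 0
    tags = 0
    for segment in segments:
        tags += len(TAG.findall(segment))
        # Remove the tags before counting so a tag inside a word
        # does not split it into two words.
        words += len(TAG.sub("", segment).split())
    return words, tags, words + tags * tag_weight_words

segments = ["{1}Umsatz{2} und Ergebnis", "Aus{3}blick"]
print(weighted_count(segments))  # (4, 3, 4.75)
```

With these made-up numbers, four words and three tags are billed as 4.75 words. The same calculation run on the file before and after cleanup shows how much of the volume was nothing but tag trash.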

May 4, 2012

Preparing MS Word text with a specific highlight color

If the Catholic Church decides it needs an official backup for Jerome as the patron saint of translators (and in these times of tribulation, one cannot have enough divine help I suppose), Dave Turner of ASAP Traduction gets my vote. His CodeZapper macros for Microsoft Word have saved us so many thousands of hours of grief dealing with rogue tags in RTF and MS Word files, which screw up TM and termbase matches and make work very difficult, and other more recent contributions also offer useful support. He is the first one in my mind when I see a problem and think "There ought to be a macro for that!"

Dave's latest contribution was part of an old discussion about preparing texts for translation in which the text to translate was marked by a highlight color. As I remember the original discussion, there were several highlight colors, and only one was to be chosen for work. Usually, if text of a certain color is to be hidden or shown in preparing a file for translation with a CAT tool which filters out hidden text, I use the search and replace function in Microsoft Word. That does not work for selecting a highlight of a specific color. You need to use a macro for that, and I no longer have the VBA skills to handle that myself. I can adapt a working macro, but no way I can manage a good one from scratch unless I spend a few days or more re-learning the skills of a decade ago.

So I was very happy when I saw his answer to the problem in the memoQ yahoogroups forum, which I have reproduced here with just a minor change to reflect the usual highlighting I encounter:

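Sub HideExceptYellow()
'
' Translation assistance macro
' by Dave Turner
' http://asap-traduction.com/
'
' Hides all text except text highlighted in yellow, so that a CAT tool
' which filters out hidden text imports only the yellow passages.

    Dim rDcm As Range

    ' Step 1: hide the entire document.
    ActiveDocument.Range.Font.Hidden = True

    ' Step 2: search backward through all highlighted text and unhide
    ' only the passages highlighted in yellow, removing their highlight.
    Set rDcm = ActiveDocument.Range
    With rDcm.Find
        .Text = ""
        .Highlight = True
        .Forward = False
        While .Execute
            If rDcm.HighlightColorIndex = wdYellow Then
                rDcm.HighlightColorIndex = wdNoHighlight
                rDcm.Font.Hidden = False
                ' Continue the search in the text before this match.
                rDcm.Collapse Direction:=wdCollapseStart
                rDcm.Start = ActiveDocument.Range.Start
            End If
        Wend
    End With

    ' Step 3: restore the yellow highlight on all text that is now visible.
    Set rDcm = ActiveDocument.Range
    Options.DefaultHighlightColorIndex = wdYellow
    With rDcm.Find
        .Text = ""
        .Font.Hidden = False
        .Forward = False
        .Replacement.Highlight = True
        .Execute Replace:=wdReplaceAll
    End With
End Sub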

In MS Word 2003 and earlier versions, the macro can be created under Tools > Macros > View Macros. Name the macro, then click the Create button to paste in the code. The Run button will execute an existing macro if it is selected.

In MS Word 2007/2010 the same functionality is accessed on the View ribbon with the Macros icon or Alt+F8.

Here's a short video showing the procedure to copy the code into the Normal global template in Microsoft Word, where it is available to all open documents in Word:

To adapt this for another highlight color, just rename the macro and change the designation of the color (wdYellow). The macro can be adapted to deal with combinations of highlight colors as well, and similar methods can be used to deal with text colors, though these can be handled by the search and replace dialog.

Apr 16, 2012

Another approach to OCR for translation

OCR is often a touchy subject for translators. There is unfortunately too little expertise in this area, though the practice of converting scanned text for translation is now quite common. And recent developments in tools such as ABBYY FineReader have catered to the worst of the idiocy I have seen, starting processes automatically which are best executed manually with greater care.

Too many people rely on automatic settings for OCR conversion and save the result in a format (usually an MS Word or RTF file) which more or less preserves the look of the original. The result may look pretty to the ignorant eye, but when the translator begins work, a host of problems may arise. In CAT tools, there are usually innumerable superfluous tags (which sometimes even CodeZapper cannot clean up), even embedded in the middle of words, which prevents matching with TM and glossary entries. Kiss consistency and quality control features goodbye in such cases. In older versions of Trados, font changes and other format trashing are common. Disappearing text in ill-defined text boxes and columns is often a problem, even for those who do not use translation environment tools.
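Why a tag in the middle of a word kills glossary matching can be shown with a toy example. This Python sketch is only an illustration under simple assumptions: inline tags are written as numbered placeholders like {1}, and term lookup is a plain case-insensitive substring search, which is far cruder than what any real CAT tool does. The function find_terms and the sample sentence are invented.

```python
import re

# Assumed placeholder notation for inline tags, e.g. {1}.
TAG = re.compile(r"\{\d+\}")

def find_terms(segment, glossary, strip_tags=False):
    """Return the glossary terms found in the segment by substring lookup."""
    text = TAG.sub("", segment) if strip_tags else segment
    return [term for term in glossary if term.lower() in text.lower()]

segment = "Die Gewähr{1}leistung beträgt {2}zwei{3} Jahre."
glossary = ["Gewährleistung", "Jahre"]

print(find_terms(segment, glossary))                   # ['Jahre']
print(find_terms(segment, glossary, strip_tags=True))  # ['Gewährleistung', 'Jahre']
```

The term split by a tag is simply invisible to the lookup until the tag is gone, and the same thing happens to TM matches and QA checks that depend on recognizing the term.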

For these reasons and others, I have long been an advocate of manual zone definition and (where necessary) the use of templates to achieve the best conversion results, and I usually save the results as naked text or, at most, preserving some font formatting (in which case further adjustments by mass selection are usually necessary to ensure that body text is, for example, consistently 10 point and not 9 point or 10.5 point in some spots due to image distortions).

If the result of a translation will be given to a graphic artist for subsequent layout, you do in fact perform a good deed by avoiding the "save with layout" options for your OCR text. A document with a straight text flow is much easier to import into the layout environment (such as InDesign). Of course, where such things are to be done, it is often best to try to get the content in that environment's format in the first place, though in such cases, certain clean-up (of hyphenation, kerning and column breaks, for example) is necessary to avoid problems and tag checks are essential after translation.

However, when you work with OCR texts, no matter how they are created for translation, some errors are almost inevitable. When the OCR engine has a spellchecker and "intelligent" correction features, some of these errors may even be plausible, just wrong, so beware! It is vital to have a copy of the original scanned document as a printout or perhaps on a second working screen for reference. I have followed this approach for years. But when I have a good conversion, I might translate happily for a number of pages before encountering a whiskey tango foxtrot moment in which I must consult the original to see what the text really says. This was a real problem for me in multi-column patents with small type, and I often spent quite a bit of time looking in the scanned PDF for the relevant passage.

That was the case at least until one day, the light went on, and I realized that making a searchable PDF from the original scan could enable me to find the relevant text faster. If this searchable PDF is made from the same OCR process used to create the text to translate, then any errors will be the same, and by putting in the questionable text, you can go precisely to the right place in the document! My original post on this subject on another blog was primarily about making searchable PDFs for reference documents (to find terms and usage easily in documents not intended for translation), but I actually use this error-finding technique more often in my work lately.
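The principle behind this trick fits in a few lines of Python. The sketch assumes that the text layer of the searchable PDF has already been extracted into one string per page (how that is done depends on the tools at hand and is not shown). The function names and the sample pages are made up. The point is that the questionable text is searched exactly as it appears in the translation file, error and all.

```python
import re

def normalize(text):
    """Collapse line breaks and repeated spaces, and ignore case."""
    return re.sub(r"\s+", " ", text).strip().lower()

def find_pages(pages, snippet):
    """Return the 1-based numbers of the pages whose text contains the snippet."""
    needle = normalize(snippet)
    return [number for number, page in enumerate(pages, start=1)
            if needle in normalize(page)]

# Hypothetical OCR output with the misreading "fiange" for "flange".
pages = [
    "The valve\nbody is made of steel",
    "wherein the fiange\n  is welded to the housing",
]
print(find_pages(pages, "fiange is welded"))  # [2]
```

Searching for the correct spelling would find nothing here. Because both files come from the same OCR pass, the error itself is the search key.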

Dec 27, 2011

Clean up the tag mess with CodeZapper for all CAT tools

Readers of this blog probably know by now that I am a Dave Turner fan. His CodeZapper macros have probably saved me hundreds of hours of wasted time over the years (not an exaggeration), and I think there are a lot of other translators and project managers with similar experiences. It doesn't solve every problem with superfluous tags, but it solves a lot of them, and Mr. Turner works steadily at improving the tool. I blogged the release of the latest version not long ago; it is now available directly from him for a modest fee of 20 euros (see the link to the release announcement for a contact link). That means it pays for itself in far less than an hour of saved time.

Over the past few days I have been updating some training documentation and running a lot of tests on tagged files as part of this. During this work, I have been struck time and again by the differences in the tags "found" by different tools working with the same file. Sometimes one tool looks better than another, but the patterns are not always consistent. What is most consistent is the ability of CodeZapper to clean up the files in various versions of Microsoft Word and make the tag structures appear a little more uniform.

Here's an example of the same DOCX file "unzapped" in several tools:

Import into memoQ 5, as-is, no tag clean-up. Previous versions of the same file showed more tags in places.

SDL Trados Studio 2009 before tag clean-up.

TagEditor in SDL Trados 2007 before tag clean-up

Initially, OmegaT would not import that particular DOCX without a tag cleanup. I reported the problem to the developers, who upgraded the filter to handle a previously unfamiliar character in internal paths of the ZIP file (DOCX is actually just a renamed ZIP package like many other file types). See http://tech.groups.yahoo.com/group/OmegaT/message/23931 for information on the new release. Opening, editing and re-saving the troublesome file enabled it to be imported after all without the latest version bugfix. So users should keep that trick in mind perhaps if a similar problem is encountered. I've had to do similar actions in the past with other tools, so this is probably a good general tip to keep in mind regardless of what tool you use. When I downloaded and tested the latest standard release of OmegaT (2.3.0_4), the tag structure looked fine - no zapping of the DOCX was necessary in this case.
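Since a DOCX is a ZIP package, its internal paths can be inspected with any ZIP library. The following Python sketch uses the standard zipfile module to list internal paths containing characters outside a conservative set. Which character actually tripped up the OmegaT filter is not stated here, so the "safe" set below is purely an assumption, and the demo package is a stand-in built in memory, not a valid DOCX.

```python
import io
import zipfile

# Characters treated as unremarkable in internal paths (an assumption).
SAFE_CHARS = set(
    "abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "0123456789_./-[]"
)

def unusual_member_names(package_bytes):
    """Return internal paths that contain characters outside SAFE_CHARS."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as package:
        return [name for name in package.namelist()
                if not set(name) <= SAFE_CHARS]

def make_demo_package():
    """Build a tiny ZIP in memory for the demo; not a real DOCX."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("[Content_Types].xml", "<Types/>")
        package.writestr("word/document.xml", "<document/>")
        package.writestr("word/media/image 1.png", b"")
    return buffer.getvalue()

print(unusual_member_names(make_demo_package()))  # ['word/media/image 1.png']
```

For a real file, the bytes would come from something like open("report.docx", "rb").read(), where the file name is of course hypothetical. A hit does not prove that a tool will choke on the file, but it is a quick first check before falling back on the open, edit and re-save trick.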

After treatment with CodeZapper, the file looked the same in memoQ (where the extra tags weren't present in the first place, though one can't count on things always being this way). The view in Trados Studio and TagEditor improved significantly, though there were still more tags, and OmegaT accepted the DOCX after tag cleaning.

SDL Trados Studio 2009 import of the DOCX file after tag cleanup with CodeZapper

SDL Trados 2007 TagEditor import of the DOCX file after tag cleanup with CodeZapper

OmegaT import of the DOCX file after tag cleanup with CodeZapper (OmegaT 2.3.0_3)

It is important to consider that superfluous tags mean wasted work time with formatting and QA corrections, perhaps even a higher risk of file failure (such as the inability to import the file at all into one tool). This is why for some time now, I and others have advocated modifying the costing of volume-based translation work to include the number of tags. This requires, of course, that you have access to a counting tool which reports the number of tags (SDL Trados Studio does this, Atril's Déjà Vu has long offered this feature, and memoQ even allows you to assign a word or character "weight" for counting purposes). This is the only fair way I know of to account for the extra work (besides time-based charges). Consider that everyone is affected: translators, reviewers and project managers! I've had to talk more than one of the last group through "tag rescue" techniques after hours.

Perhaps it is worth considering as well that cleaner tagging will also improve "leverage" (match quality) in translation memories. So if a tool does offer cleaner tag structures (for a variety of source formats) consistently, working with that tool efficiently to manage projects will save time and money as well, on top of the time and money saved with the use of CodeZapper macros in MS Word files.

Oct 7, 2011

New version of CodeZapper

While I was traveling this week, our esteemed colleague Dave Turner released version 2.8 of his CodeZapper macros for Microsoft Word. I have written about these before; they are among the finest tools I know for cleaning up messy RTFs and MS Word formats that make work with translation environment tools Hell because of superfluous and disruptive tags.

CodeZapper can be a big help with any translation environment which displays tagging in some way. These include OmegaT, Déjà Vu, memoQ and the various Trados instances. For whatever reason, no tools vendor has seen fit to create a quality management tool of this same caliber, though Kilgray at least partially addressed this with a memoQ filter option that often does help with trash tags.

Version 2.8 of CodeZapper is currently available by direct request to the author, Dave Turner. There is now a separate "read me" file explaining the functions of the macro buttons in some detail.

If you benefit from this tool, please support its creator. I do. He has saved me many, many hours of tribulation in translation, far more value given than the little money he has received from me. Here is the first part of Mr. Turner's documentation to give more background on this useful tool:

What is “CodeZapper”?

"CodeZapper" is a set of Word macros (programs written in VBA to automate operations in applications) designed to “clean up” Word files before being imported into a translation environment program such as Deja Vu DVX, memoQ, SDL Trados Studio, TagEditor, Swordfish, OmegaT, etc.

Word documents are often strewn with junk or “rogue” tags (so-called “smart tags”, language tags, track changes tags, soft hyphenations, scaling and spacing changes, redundant bookmarks, etc.).

This tagged information shows up in the DVX or MemoQ grid as spurious {1}codes{2} around, or even in the mid{3}dle of, words, making sentences difficult to read and translate and generally negating many of the productivity benefits of the program.

OCR’d files or files converted from PDF are even worse.

CodeZapper tries to remove as many of these tags as possible while retaining formatting and layout. It also contains a number of other macros which may be useful before and after importing files into DVX or MQ (temporarily transferring bulky images (photos, etc.) out of a file, to speed up import, and then back in the right place after translation, moving footnotes to a table at the end of the document and back after translation, for example).

Is it freeware?

No. To help ensure its continued availability and improvement, there is now a one-time, 20 euro charge for the program. This will entitle you to free future upgrades.

Is it risk free?

Although it’s been fairly extensively tested on a range of files, you should obviously only use it on a backup copy of your files and at your own risk.

How do I install it?

CodeZapper comes in the form of a Word template (.dot file) with a custom toolbar which you can copy either to the Word startup directory (following the path in Tools/Options/File Locations/Startup), in which case it will be enabled on starting Word, or to the “Template” directory containing Normal.dot and other Word templates (following the path in Tools/Options/File Locations/User templates). You then enable it by selecting it in Tools/Templates and Add-ins, as and when needed.

May 27, 2009

Escape from Code Hell!

Although I've never met Dave Turner or corresponded with him, there are days that I think of him as one of my best friends. Why is that? Because he has made my working life a lot easier. And he has done the same for many other users of CAT tools like Déjà Vu, memoQ and Trados with his CodeZapper macro collection. Many of us have suffered with "rogue" codes (tags for you Trados users) in RTF and MS Word documents. A typical mess I see almost every day looks like this:

There are many different strategies for cleaning this up, but it is difficult to find one that works in every case. I often run the document through the OpenOffice word processor, but this is a bad idea for complex documents, because the format often gets trashed. Other methods like copying and pasting the content into a new document also have drawbacks. Dave's macros are about the easiest, most reliable way I know of escaping Code Hell most of the time. After running CodeZapper, the sample text above looked like this:

Now that's much easier to translate, isn't it? With results like that, which save massive amounts of working grief, you might wonder how much a cool solution like this will cost you. Update: A mere 20 euros. See the latest post on CodeZapper for information on where to get it.

This is one tool that is definitely worth adding to your bag of tricks.
