Dec 27, 2011

Clean up the tag mess with CodeZapper for all CAT tools

Readers of this blog probably know by now that I am a Dave Turner fan. His CodeZapper macros have saved me hundreds of hours of wasted time over the years (not an exaggeration), and I think there are a lot of other translators and project managers with similar experiences. CodeZapper doesn't solve every problem with superfluous tags, but it solves a lot of them, and Mr. Turner works steadily at improving the tool. I blogged about the release of the latest version not long ago; it is now available directly from him for a modest fee of 20 euros (see the link to the release announcement for a contact link). That means it pays for itself in far less than an hour of saved time.

Over the past few days I have been updating some training documentation and running a lot of tests on tagged files as part of this. During this work, I have been struck time and again by the differences in the tags "found" by different tools working with the same file. Sometimes one tool looks better than another, but the patterns are not always consistent. What is most consistent is the ability of CodeZapper to clean up the files in various versions of Microsoft Word and make the tag structures appear a little more uniform.

Here's an example of the same DOCX file "unzapped" in several tools:

Import into memoQ 5, as-is, no tag clean-up. Previous versions of the same file showed more tags in places.
SDL Trados Studio 2009 before tag clean-up.

TagEditor in SDL Trados 2007 before tag clean-up.

Initially, OmegaT would not import that particular DOCX without a tag cleanup. I reported the problem to the developers, who upgraded the filter to handle a previously unfamiliar character in internal paths of the ZIP file (DOCX is actually just a renamed ZIP package like many other file types). See for information on the new release. Opening, editing and re-saving the troublesome file enabled it to be imported after all, even without the bugfix in the latest version. Users should keep that trick in mind if they encounter a similar problem. I've had to do the same with other tools in the past, so this is probably a good general tip regardless of what tool you use. When I downloaded and tested the latest standard release of OmegaT (2.3.0_4), the tag structure looked fine - no zapping of the DOCX was necessary in this case.
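Since a DOCX is a ZIP package, its internal paths can be inspected directly when an import fails. The following is only a minimal sketch in Python using the standard zipfile module; the helper names and the set of "expected" characters are my own assumptions for illustration and say nothing about how OmegaT's filter actually works.

```python
import zipfile


def list_package_parts(path_or_file):
    """Return the internal paths stored in a ZIP-based package such as a DOCX."""
    with zipfile.ZipFile(path_or_file) as package:
        return package.namelist()


def unusual_part_names(names, allowed_extra="/._-[]"):
    """Return the names that contain anything other than ASCII letters,
    ASCII digits or the characters in allowed_extra.

    The allowed set is an assumption chosen for illustration only.
    """
    def is_plain(ch):
        return (ch.isascii() and ch.isalnum()) or ch in allowed_extra

    return [name for name in names if not all(is_plain(ch) for ch in name)]
```

Calling unusual_part_names(list_package_parts("troublesome.docx")) (a hypothetical file name) would list any internal paths worth mentioning in a bug report.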

After treatment with CodeZapper, the file looked the same in memoQ (where the extra tags weren't present in the first place, though one can't count on things always being this way). The view in Trados Studio and TagEditor improved significantly, though both still showed more tags than memoQ, and OmegaT accepted the DOCX after tag cleaning.

SDL Trados Studio 2009 import of the DOCX file after tag cleanup with CodeZapper

SDL Trados 2007 TagEditor import of the DOCX file after tag cleanup with CodeZapper

OmegaT import of the DOCX file after tag cleanup with CodeZapper (OmegaT 2.3.0_3)

It is important to consider that superfluous tags mean wasted work time with formatting and QA corrections, perhaps even a higher risk of file failure (such as the inability to import the file at all into one tool). This is why, for some time now, I and others have advocated modifying the costing of volume-based translation work to include the number of tags. This requires, of course, that you have access to a counting tool which reports the number of tags (SDL Trados Studio does this, Atril's Déjà Vu has long offered this feature, and memoQ even allows you to assign a word or character "weight" to tags for counting purposes). This is the only fair way I know of to account for the extra work (besides time-based charges). Consider that everyone is affected: translators, reviewers and project managers! I've had to talk more than one of the last group through "tag rescue" techniques after hours.
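To make the idea of a tag "weight" concrete, here is a rough sketch in Python of how a tag-weighted volume might be calculated. The function names, the default weight of 0.5 words per tag and the figures in the usage note are all hypothetical; this is not the formula used by memoQ or any other tool.

```python
def weighted_word_count(words, tags, tag_weight=0.5):
    """Return the billable volume when each tag counts as tag_weight words.

    tag_weight is an assumed value to be agreed with the client.
    """
    if words < 0 or tags < 0 or tag_weight < 0:
        raise ValueError("counts and weight must not be negative")
    return words + tag_weight * tags


def job_price(words, tags, rate_per_word, tag_weight=0.5):
    """Return the price for a job, rounded to two decimal places."""
    return round(weighted_word_count(words, tags, tag_weight) * rate_per_word, 2)
```

With these assumptions, a text of 1000 words and 200 tags would be charged as 1100 words, so a cleaner file with fewer tags directly lowers the cost.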

Perhaps it is worth considering as well that cleaner tagging will also improve "leverage" (match quality) in translation memories. So if a tool consistently offers cleaner tag structures (for a variety of source formats), working with that tool efficiently to manage projects will save time and money, on top of the time and money saved by using the CodeZapper macros on MS Word files.
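The effect of stray tags on match quality can be illustrated with a toy comparison. This sketch uses Python's difflib as a stand-in similarity measure and a made-up {n} placeholder notation for tags; real translation memory systems calculate matches in their own ways, so the numbers it produces mean nothing beyond "lower than a perfect match".

```python
import re
from difflib import SequenceMatcher

# Tokens: a {n} tag placeholder (hypothetical notation), a word,
# or a single punctuation character.
TOKEN = re.compile(r"\{\d+\}|\w+|[^\w\s]")


def tokenize(segment):
    """Split a segment into tag placeholders, words and punctuation."""
    return TOKEN.findall(segment)


def similarity(a, b):
    """Return a 0..1 similarity of two segments, compared token by token."""
    return SequenceMatcher(None, tokenize(a), tokenize(b)).ratio()


tm_entry = "Press the {1}Start{2} button to begin."
clean = "Press the {1}Start{2} button to begin."
messy = "Press the {1}St{2}{3}art{4} but{5}ton to begin."

print(similarity(tm_entry, clean))  # identical token sequences
print(similarity(tm_entry, messy))  # extra tags split words and lower the score
```

The sentence is the same in both cases; only the superfluous tags turn what should be a perfect match into a fuzzy one.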


  1. Perhaps it's worth adding that CodeZapper is now integrated into DVX2. So Dave's genius is now enhanced by Daniel B's brilliance, and I get the best of both.

  2. Thanks for confirming that, Victor. I had heard a rumor about its integration recently, and I think a move like that is long overdue. In fact, I think it would be an excellent idea in general for translation environment tools which import MS Office files to enable any MS Word VBA macro(s) to be run for pre-processing. This would simplify the handling of "external views" by other environments.

