Jun 4, 2010

Cleaning up superfluous tags in DOCX files with memoQ

"Rogue codes", "junk tags" - whatever you choose to call them, they can be a real nuisance for those attempting to work with Microsoft Word and RTF formats in translation environment tools. Previously, I described the use of Dave Turner's Code Zapper macros to clean up junk in DOC and RTF files. Now, for those who work with Microsoft Word 2007 DOCX format, there's another option with memoQ. In the current version of memoQ (4.2.8), there's an option in the "Add document as..." dialog that enables minor format change tags to be ignored. In a practical sense, this works much like Code Zapper.

The relevant option is marked here with a red box:

If that option is not selected, here's what a DOCX file made from the OCR of a PDF file might look like:

If the option is selected as shown in the screenshot of the dialog above, here's what the result looks like:

That's much easier to deal with. There are so many interesting and useful little things in the latest release of memoQ that I probably won't find them all before the next major upgrade. It's not easy keeping up with the progress of an active, dynamic development group like the one at Kilgray.

If you have a "dirty" DOC file with a lot of unwanted, superfluous tags, an obvious strategy is to save it as DOCX (also possible using the 2007 compatibility pack for MS Office 2003), follow the procedure described above, then after export re-save the DOCX file as DOC again if necessary.
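For readers curious about what "superfluous tags" look like inside the file itself: a DOCX paragraph is stored as a series of runs, each with its own formatting properties, and OCR output often splits a word into several runs that differ only in trivial settings. The Python sketch below merges adjacent runs when their formatting matches after such trivial properties are ignored. It is only an illustration of the idea, not how memoQ or Code Zapper actually works. The function names and the list of properties treated as "minor" are my own assumptions.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}
XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"

# Assumption: these run properties (character spacing, kerning, scaling,
# raised/lowered position) are treated as "minor" purely for illustration.
MINOR = {f"{{{W}}}{name}" for name in ("spacing", "kern", "w", "position")}


def format_key(run):
    """Comparable signature of a run's formatting, ignoring minor properties.

    Only the tag and attributes of each property are compared, which is
    enough for the simple, childless properties used in this sketch.
    """
    rpr = run.find("w:rPr", NS)
    if rpr is None:
        return ()
    return tuple(sorted(
        (prop.tag, tuple(sorted(prop.attrib.items())))
        for prop in rpr
        if prop.tag not in MINOR
    ))


def is_plain_text_run(run):
    """True if the run contains nothing but formatting and text elements."""
    return all(child.tag in (f"{{{W}}}rPr", f"{{{W}}}t") for child in run)


def merge_runs(paragraph):
    """Merge adjacent text runs of one w:p element that share a signature.

    The merged run keeps the formatting of the first run in each group.
    """
    previous = None
    for run in list(paragraph):
        if run.tag != f"{{{W}}}r" or not is_plain_text_run(run):
            previous = None  # anything else (tabs, fields, images) breaks the chain
            continue
        if previous is not None and format_key(previous) == format_key(run):
            text = "".join(t.text or "" for t in run.findall("w:t", NS))
            targets = previous.findall("w:t", NS)
            target = targets[-1] if targets else ET.SubElement(previous, f"{{{W}}}t")
            target.text = (target.text or "") + text
            target.set(XML_SPACE, "preserve")  # keep leading/trailing spaces
            paragraph.remove(run)
        else:
            previous = run


# A made-up paragraph: the second run differs from the first only in spacing.
SAMPLE = (
    f'<w:p xmlns:w="{W}">'
    '<w:r><w:rPr><w:b/></w:rPr><w:t>Clean</w:t></w:r>'
    '<w:r><w:rPr><w:b/><w:spacing w:val="-2"/></w:rPr><w:t>ing </w:t></w:r>'
    '<w:r><w:rPr><w:i/></w:rPr><w:t>up</w:t></w:r>'
    '</w:p>'
)

paragraph = ET.fromstring(SAMPLE)
merge_runs(paragraph)
runs = paragraph.findall("w:r", NS)
texts = ["".join(t.text or "" for t in r.findall("w:t", NS)) for r in runs]
print(texts)  # ['Cleaning ', 'up']
```

In the sample, the first two runs collapse into one because they differ only in character spacing, while the italic run stays separate. In a translation grid that would mean one tag boundary instead of two.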
