May 28, 2022

Filtering formatted text in Microsoft Office files

Recently, I shared an approach to selecting text in a Microsoft Word file in which editing is restricted to certain paragraphs. Alas, this Word feature is not supported by any translation tool filter of which I am aware, so to import only the text designated for editing, it is necessary to go inside the DOCX file (which is just a ZIP archive with the extension changed) and work with the XML file that contains the document text and all its format markers.
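For readers who prefer a script to unzipping by hand, here is a minimal Python sketch of the "go inside the DOCX" step. It assumes the usual package layout, in which a Word file keeps its main text in `word/document.xml` (a PowerPoint file keeps each slide in a part such as `ppt/slides/slide1.xml`). The function name `read_part` is made up for this illustration.

```python
import zipfile

def read_part(package_path, part_name="word/document.xml"):
    """Return one XML part of an Office package (DOCX, PPTX) as text.

    part_name defaults to the main text part of a Word file; this is an
    assumption about the package layout, so check the archive listing first.
    """
    with zipfile.ZipFile(package_path) as package:
        return package.read(part_name).decode("utf-8")

def list_parts(package_path):
    """List every entry in the package, to find the right part name."""
    with zipfile.ZipFile(package_path) as package:
        return package.namelist()
```

Calling `list_parts("report.docx")` on a file of your own (the file name here is only a placeholder) shows which parts exist before you pick one to extract.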

This approach is generally valid for any formatting applied in the XML-based Microsoft Office file formats used since Office 2007, such as DOCX from Word or PPTX from PowerPoint. I have prepared a video to show how the process of extracting the content and importing it for translation can work.

After translation, the relevant XML file is exported and the original XML is replaced with the translated file inside the archive. If the DOCX or PPTX file was unpacked to get at the XML, the folder structure can then be re-zipped and the extension changed to its original form to create the deliverable translated file.
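The swap can also be scripted. The following Python sketch, under the assumption that only one part needs replacing, copies every entry of the original archive into a new one and substitutes the translated XML for the chosen part. The name `replace_part` and its parameters are hypothetical, chosen only for this example.

```python
import zipfile

def replace_part(src_path, dst_path, part_name, new_xml):
    """Write a copy of the package at dst_path with one part replaced.

    src_path and dst_path must be different files. Entry names are copied
    unchanged, so the parts stay at the same place inside the archive.
    """
    with zipfile.ZipFile(src_path) as src, \
         zipfile.ZipFile(dst_path, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            if item.filename == part_name:
                data = new_xml.encode("utf-8")   # the translated XML
            else:
                data = src.read(item.filename)   # everything else as-is
            dst.writestr(item.filename, data)
```

Writing to a new file rather than modifying the original in place keeps the source document intact if something goes wrong with the translated XML.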

What I do not show in the video is that the content can also be extracted by other means, such as convenient memoQ project templates in which ZIP filter options with file masks pull the relevant XML directly out of the archive. But the lower-tech approach shown in the video should be accessible to any professional with a modern translation environment tool that permits filter customization with regular expressions.

Once a filter has been created for a particular format such as red text, adapting it to extract only green highlighted text or text in italics or some other format takes less than a minute in an editor. Different filters are necessary for the same formats in DOCX and PPTX, because unfortunately Microsoft's markup for yellow highlighting, for example, differs between Word and PowerPoint in the versions I tested.
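To give an idea of what such a filter looks for, here is a rough Python sketch of the same regex logic outside any translation tool. It is not the filter from the video. The markers are assumptions about typical markup: `<w:color w:val="FF0000"/>` for red text and `<w:highlight w:val="yellow"/>` for yellow highlighting in Word, and an `<a:highlight>` element wrapping `<a:srgbClr val="FFFF00"/>` in PowerPoint. Check them against the XML in your own file. The function name `formatted_text` is invented for the example.

```python
import re

def formatted_text(xml, prefix, marker):
    """Return the text of every run whose markup matches `marker`.

    prefix is "w" for Word runs (<w:r>, <w:t>) or "a" for PowerPoint runs
    (<a:r>, <a:t>). marker is a regular expression for the format of
    interest. Assumes runs are not nested inside other runs.
    """
    run = re.compile(rf"<{prefix}:r\b[^>]*>.*?</{prefix}:r>", re.DOTALL)
    text = re.compile(rf"<{prefix}:t\b[^>]*>(.*?)</{prefix}:t>", re.DOTALL)
    return ["".join(text.findall(m.group(0)))
            for m in run.finditer(xml)
            if re.search(marker, m.group(0))]

# Assumed markers; adapting the filter means changing only these strings.
WORD_RED = r'<w:color w:val="FF0000"'
WORD_YELLOW = r'<w:highlight w:val="yellow"'
PPT_YELLOW = r'<a:highlight>\s*<a:srgbClr val="FFFF00"'

WORD_SAMPLE = (
    '<w:p><w:r><w:t>plain </w:t></w:r>'
    '<w:r><w:rPr><w:color w:val="FF0000"/></w:rPr><w:t>red</w:t></w:r>'
    '<w:r><w:rPr><w:highlight w:val="yellow"/></w:rPr><w:t>marked</w:t></w:r></w:p>'
)

print(formatted_text(WORD_SAMPLE, "w", WORD_RED))     # ['red']
print(formatted_text(WORD_SAMPLE, "w", WORD_YELLOW))  # ['marked']
```

The extracted strings are still XML-escaped (an ampersand appears as `&amp;`, for instance), and a format inherited from a paragraph or character style rather than set directly on the run will not be caught by markers like these.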

Although this is a bit of a nerdy hack, it's probably easier for most people than various macro solutions to hide and unhide text. And it takes far less time and is more accurate than copying text to another file.

In cases where it is important to see the original context of the text being translated, one option is memoQ's PDF Preview Tool, a viewer available in recent versions which tracks the imported text in a PDF made from the original file. The PDF itself can be created with the PDF save options available in Microsoft applications.


2 comments:

  1. Hi Kevin, thanks for this video, very interesting! This opens up a lot of interesting opportunities.... I was just wondering: are you not experiencing issues re-zipping the file back to docx format? Because I did run into that issue when I tried... However, I did find quite an easy fix to that online (see the reply to https://stackoverflow.com/questions/26924974/reconstructing-docx-from-xml-files).

    Anyway, I also wanted to ask: did you maybe find a way of doing the same thing with XPath instead of regex? I was just wondering...

    Replies
    1. I didn't look into other options, Joop. It was one of those typical instances of a panicked support call and a short time to devote to resolving matters.

      I didn't have problems with re-zipping - the DOCX was fine when I used 7zip. But I do remember from my old work involving embedded objects in MS Office documents that the choice of tool can matter.

