Jun 3, 2010

Dealing with embedded XML and HTML in an Excel file

One of the occasionally gratifying aspects of translation for an IT geek like me is that IT challenges continue to follow me. Actually, that's one of the things about the current state of the profession that I hate too. (I'm a not-so-closeted Luddite.)

This week's challenge was more of a fun puzzle, because it wasn't my problem, but rather someone else's. An agency owner friend sent me an Excel file that was driving him nuts; his localization engineer, a former star at a Top Ten agency, had pronounced the task of filtering the data in a useful way to be impossible. I love it when engineers say something is impossible; it usually means there is a simple solution at hand if one gives the matter a little real thought.

The file structure looked something like this:

Only the yellow columns were to be translated; some had plain text content (with line beaks in some cases), other yellow columns had XML or HTML content.

Just for fun, I fired off a quick support request to Kilgray along with a copy of my test file, because I thought maybe there was a cascading filter feature I might have overlooked. (There isn't, but the idea was noted as a good one, so maybe we'll see it in the future.) In any case, Denis Hay offered a creative suggestion as he almost inevitably does:
Hi Kevin,

While waiting for "cascading filters" (which I also find a great idea), what you could do is simply copy these Excel columns to a Word table, than use either Tortoise Tagger, or preferably the +Tools from the Wordfast website to tag the HTML/XML content. Import that tagged word file into memoQ, and you should get what you wanted.

Once translated, just paste back to Excel.

Kind regards,
Denis Hay
Technical consulting and training
Kilgray Translation Technologies

There's another way I discovered by the time Denis' suggestion arrived. It works well manually, but it can also be automated with macros if you're dealing with content management system exports where the structure recurs and you'll be doing a lot of this.

Do the following:
  1. Copy each individual Excel column of interest (or at least the ones with XML/HTML) into a plain text file.
  2. In the case of the text files with tagged content (i.e. XML or HTML), change the file extension to fit the content (i.e "text2.txt" becomes "test2.xml", etc.).
  3. Translate the text files with your favorite translation environment tool, using the filters appropriate for each type of content.
  4. After exporting the files from your working environment, copy and paste the text file content back into the corresponding columns of the original Excel file. Note taht if there are line breaks somewhere, your row positions may get screwed up. This can be solved by performing this operation in OpenOffice Calc. (Maybe there's an appropriate setting for Excel to avoid this problem, but I don't know it.)
The key to sorting this puzzle out was to consider the discrete parts (i.e. the individual yellow columns) of the entire data set as separate collections of data. Dividing a problem up into its constituent parts is often a good way to find an easy solution.


  1. Hi Kevin,

    I have made positive experiences solving this scenario using one of two options, or a combination thereof: a) using the Excel "Save As" option "Microsoft XML spreadsheet" (available in XL 2003 and 2007, at least) - the resulting XML is not exactly compliant, but at least it's not native XLS - and b) a customized filter using the Okapi Framework (okapi.sourceforge.net). Takes some trial & error and some regular expression work, but then works like a charm. No copying & pasting required.


  2. Christian, in that scenario, how would you exclude the columns you don't want to translate?

  3. Kevin, that depends... Your question only affects output of part a) of my suggestion. The MS XML spreadsheet is a strange beast as it doesn't really work on columns, but rows, and that in a way that takes some getting used to. (If a row contains a cell that holds no value, XL doesn't save an empty cell inside that row - it saves no cell at all - so just by parsing the row, you would never even know that a column existed. This is what makes it hard.)

    The easiest scenario would be where one column contains only IDs, ideally all of them following a uniform naming scheme beginning or ending with the same string. In that case a script can be run that 'hides' any cell containing that string expression from the TMS. (In Trados-speak, you would essentially externalize the content.)

    Where that doesn't hold, you would probably have to consider gutting the entire column from the XLS and inserting it again after you're done.

    Not pretty and I'm breaking my own rule of not copying and pasting :).


Notice to spammers: your locations are being traced and fed to the recreational target list for my new line of chemical weapon drones :-)