Mar 18, 2014

The curious case of crappy XML in memoQ

Recently one of my collaboration partners sent me a distressed e-mail asking about a rather odd XML file he received. This one proved to be a little different than the ordinary filter adaptation challenge.

The problem, as it was explained to me, seemed to involve dealing with the trashed special characters in the German source text:


"Special" characters in German - äöüß - were all rendered as entities, which makes them difficult to read and screws up terminology and translation memory matches among other things. Entities are simply coded entries, which in this case all begin with an ampersand (&) and end with a semicolon. They are used to represent characters that may not be part of a particular text encoding system.

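To make the entity business concrete, here is a minimal Python sketch of what "decoding" such text means. The sample string is made up, and it assumes the file used standard named HTML entities such as `&auml;`; the real file may have used a different flavor.

```python
import html

# Hypothetical sample of German text in which the special characters
# were written as named character entities.
encoded = "Gr&ouml;&szlig;e &auml;ndern f&uuml;r &Uuml;bersicht"

# html.unescape() replaces named and numeric character references
# with the characters they stand for.
decoded = html.unescape(encoded)
print(decoded)
```

The decoded form is what a translator wants to see in the grid, and what terminology and translation memory lookups need in order to match.
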
At first I thought the problem was simply a matter of adjusting the settings of the XML filter. So I selected Import with options... in my memoQ project and had a look at what my possibilities were. The fact that the filter settings dialog had an Entities tab seemed like a good start.


This proved to be a complete dead end. None of the various options I tried in the dialog cleared up the imported garbage. So I resolved to create a set of "custom" entities to handle with the XML filter, and used the translation grid filter of memoQ to make an inventory of these.

Filtering for source text content in the memoQ translation grid
That's when I noticed that the translatable data in this crappy faux XML file was actually HTML text. So I thought perhaps the cascading filters feature of memoQ might help.
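
The idea behind a cascade can be sketched in a few lines of Python: a first pass pulls the text out of the XML, and a second pass treats that text as HTML. This is only an illustration of the principle, not how memoQ implements it. The sample data, the `text` element name, the `SegmentCollector` class and the numbered placeholders are all invented for the example, and it assumes the HTML sits in the XML in escaped form.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical sample: an XML element whose text content is escaped HTML.
SAMPLE = (
    "<item><text>&lt;p&gt;Gr&amp;ouml;&amp;szlig;e "
    "&lt;b&gt;&amp;auml;ndern&lt;/b&gt;&lt;/p&gt;</text></item>"
)

class SegmentCollector(HTMLParser):
    """Second stage: keep readable text, swap HTML tags for placeholders."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &ouml; in text.
        super().__init__(convert_charrefs=True)
        self.parts = []  # text and placeholders, in document order
        self.tags = []   # the protected tags behind each placeholder

    def handle_starttag(self, tag, attrs):
        self.tags.append(f"<{tag}>")
        self.parts.append(f"{{{len(self.tags)}}}")

    def handle_endtag(self, tag):
        self.tags.append(f"</{tag}>")
        self.parts.append(f"{{{len(self.tags)}}}")

    def handle_data(self, data):
        self.parts.append(data)

def cascade(xml_source):
    """First stage parses the XML; second stage parses each text as HTML."""
    results = []
    for node in ET.fromstring(xml_source).iter("text"):
        collector = SegmentCollector()
        collector.feed(node.text or "")
        collector.close()
        results.append(("".join(collector.parts), collector.tags))
    return results

for segment, tags in cascade(SAMPLE):
    print(segment, tags)
```

The XML pass alone would leave the markup and entities sitting in the text as plain characters; only the second pass turns them into protected tags and readable letters.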

Using all the defaults, I found that the HTML was fixed nicely with tags, but I did not want the tags that were created for non-breaking spaces (&nbsp;):


So I had another look at the settings of the cascaded HTML filter:


I noticed that if the option to import non-breaking spaces as entities is unmarked (it is selected by default), these are imported quite properly:
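
The difference between the two settings can be pictured with a small Python sketch. The function `import_segment`, its `nbsp_as_entity` parameter and the `{nbsp}` placeholder are hypothetical stand-ins for the behavior I observed, not memoQ's actual code.

```python
import html

NBSP = "\u00a0"  # the non-breaking space character itself

def import_segment(raw, nbsp_as_entity=True):
    """Hypothetical model of the import option described above."""
    if nbsp_as_entity:
        # Option marked: the entity is protected as a placeholder tag.
        raw = raw.replace("&nbsp;", "{nbsp}")
    # All remaining entities become ordinary characters.
    return html.unescape(raw)

print(repr(import_segment("10&nbsp;kg", nbsp_as_entity=True)))
print(repr(import_segment("10&nbsp;kg", nbsp_as_entity=False)))
```

With the option unmarked, the space arrives as a real non-breaking space character instead of a tag that has to be placed by hand in the target.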

 
Now the text of some 600 lines was much easier to work with - an ordinary, readable document with a few protected HTML tags.

I'll be the first one to admit that the solution here is not obvious; in fact, apparently one of my favorite Kilgray experts took a very different and complex path using external tools that I simply don't understand. There are many ways to skin a cat, most of them painful - at least for the cat.

As I go through and update various sections of my memoQ tips guide, I'll probably expand the chapters on cascading filters and XML to discuss this case. But I haven't quite figured out a simple way to prepare the average user for dealing with a situation like this where the problem is not obvious. One thing is clear however - it pays to look at the whole file in order to recognize where a different approach may be called for.

Maybe a decision matrix or tree would do the trick, but probably not for many people. In this case the file did not have a well-designed XML structure, and that contributed to the confusion. My colleague is an experienced translator with good research skills, and he scoured the memoQ Help and the Kilgray Knowledgebase in vain for guidance. Our work as translators poses many challenges. Some of these are old, familiar ones, repackaged in new and confusing ways, as in this case. So we must learn to look beyond mere features and instead observe carefully what we are confronted with, using that wit which distinguishes us from the dumb machines which the dummies fear might replace them.

3 comments:

  1. Actually there was a lot more to be processed here.

    The best I would recommend for this very specific file wouldn't be XML+HTML, but XML+XML, setting &nbsp; as a custom entity. :)

    1. Interesting suggestion; I'll have another look, but from what I saw the HTML in the cascade seemed to take care of everything I needed. Recognizing the non-breaking spaces as entities was actually unwanted, because this turned them into tags with the HTML filter. I'm curious to see what the XML options do.

  2. Actually, the reason for that recommendation is the entities. What you see in the HTML settings above works in that case, and in that one only. If you're having trouble with anything non-standard which is not a space... then you're toast (here, for instance, you're converting entities at import but not restoring them at export). With XML+XML, you'll quickly see a problem if things have not been converted first (you'll see ugly entities) and you're sure they'll be restored afterwards. But both options are valid if you're paying close attention.

