Apr 4, 2018

Complicated XML in memoQ: a filtering case example

Most of the time when I deal with XML files in memoQ, things are rather simple. In fact, I can usually use the default settings of the standard XML import filter, and everything works fine. (Maybe that's because a lot of my XML imports are extracted from PDF files using iceni InFix, which is the alternative to the TransPDF XLIFF exports using iceni's online service; this overcomes any confidentiality issues by keeping everything local.)

Sometimes, however, things are not so simple. Like with this XML file a client sent recently:


Now if you look at the file, you might think the XLIFF filter should be used. But if you do that, the following error message would result in memoQ:


That is because the monkey who programmed the "XLIFF" export from the CMS system where the text resides was one of those fools who don't concern themselves with actual file format specifications. A number of the tags and attributes in the file simply do not conform to the XLIFF standards. There is a lot of that kind of stupidity to be found.

Fear not, however: one can work with this file using a modified XML filter in memoQ. But which one?

At first I thought to use the "Multilingual XML" filter that I have heard about and never used, but this turned out to be a dead end. It is language-pair specific, and really not the best option in this case. I was concerned that there might be more files like this in the future involving other language pairs, and I did not want to be bothered with customizing for each possible case.

So I looked a little closer... and noted that this export has the source text copied exactly to the "target". So I concentrated on building a customized XML filter configuration that would pull the text to translate only from between the target tags. After populating the tag list in the filter settings, I created a custom configuration of the XML filter that excludes the content of the "source" tag:



That worked, but not well enough. In the screenshot below, the excluded source content is shown with a gray background, but the imported content has a lot of HTML, for which the tags must be protected:
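To make the situation concrete outside memoQ, here is a minimal Python sketch of what the customized XML filter is doing at this stage. The sample document, its element names (`trans-unit`, `source`, `target`) and the function name `extract_targets` are assumptions for illustration only; the client's actual file is not reproduced here, and memoQ's filter is configured in its settings dialog, not scripted. The sketch ignores `source`, reads only `target`, and shows that the extracted text still carries raw HTML markup:

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the client's pseudo-XLIFF export.
# The HTML inside each element is assumed to be entity-escaped.
SAMPLE = """<xliff>
  <file>
    <trans-unit id="1">
      <source>&lt;p&gt;Hello &lt;b&gt;world&lt;/b&gt;&lt;/p&gt;</source>
      <target>&lt;p&gt;Hello &lt;b&gt;world&lt;/b&gt;&lt;/p&gt;</target>
    </trans-unit>
    <trans-unit id="2">
      <source>Plain text</source>
      <target>Plain text</target>
    </trans-unit>
  </file>
</xliff>"""


def extract_targets(xml_text):
    """Return (id, target text) pairs; <source> content is never read."""
    root = ET.fromstring(xml_text)
    return [
        (unit.get("id"), unit.findtext("target", default=""))
        for unit in root.iter("trans-unit")
    ]


if __name__ == "__main__":
    for unit_id, text in extract_targets(SAMPLE):
        print(unit_id, repr(text))
```

For the first unit in this sample, the extracted string is `<p>Hello <b>world</b></p>`: the XML layer has been handled, but the HTML tags are still sitting in the text as if they were translatable content.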


The next step is to do the import again, but this time including an HTML filter after the customized XML filter. In memoQ jargon, this sort of configuration is known as a "cascading filter" - where various filters are sequenced to handle compounded formats. Make sure, however, that the customized XML filter configuration has been saved first:


Then choose that custom configuration when you import the file using Import with Options:


This cascaded configuration can also be saved using the corresponding icon button.


This saved custom cascading filter configuration is available for later use, and like any memoQ "light resource", it can be exported to other memoQ installations.
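The idea behind the second stage of the cascade can be sketched in a few lines of Python. This is only an illustration of the principle, not how memoQ implements its HTML filter: the class name `TagSplitter` and the function `split_html` are hypothetical, and the input string is an assumed example of what the XML stage might hand over. The sketch separates the markup, which must be protected, from the text, which is what the translator should see:

```python
from html.parser import HTMLParser


class TagSplitter(HTMLParser):
    """Collect ("tag", markup) and ("text", content) tokens in document order."""

    def __init__(self):
        super().__init__()  # character references in text are decoded by default
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        # Keep the start tag exactly as written, attributes included.
        self.tokens.append(("tag", self.get_starttag_text()))

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are recorded once, not as open + close.
        self.tokens.append(("tag", self.get_starttag_text()))

    def handle_endtag(self, tag):
        self.tokens.append(("tag", f"</{tag}>"))

    def handle_data(self, data):
        self.tokens.append(("text", data))


def split_html(fragment):
    """Second stage of the cascade: split an HTML fragment into tags and text."""
    parser = TagSplitter()
    parser.feed(fragment)
    parser.close()  # flush any text still buffered at the end of the input
    return parser.tokens


if __name__ == "__main__":
    # Assumed output of the XML stage for one translation unit.
    extracted = "<p>Hello <b>world</b></p>"
    for kind, value in split_html(extracted):
        print(kind, repr(value))
```

Only the `text` tokens would be offered for translation; the `tag` tokens correspond to the protected tags in the final import.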

The final import looks much better, and the segmentation is also correct now that the HTML tags have been properly filtered:



If you encounter a "special" XML case to translate, the actual format will surely be different, and the specific steps needed may differ somewhat as well. But by breaking the problem down in stages and considering what more might need to be done at each stage to get a workable result with all the non-translatable content protected, you or your technical support associates can almost always build a customized, re-usable import filter in reasonable time, giving you an advantage over those who lack the proper tools and knowledge and ensuring that your client's content can be translated without undue technical risks.

3 comments:

  1. Good approach, but you were lucky there were no actual translations in the target segments that you would have to retain. If you do have to deal with the real bilingual file, there are three approaches you can try:
    1. Use memoQ's multilingual XML filter. It's decent and easy to adapt to various languages once you have the basic rules, plus you can import multiple languages at once if your XML contains more than two.
    2. Use automated pre-processing with a project template. A relatively simple script could add the missing content and convert the source file into real XLIFF (this trick can be used on a multitude of bilingual source content).
    3. Use Okapi-based Rainbow to generate XLIFF files. This is somewhat similar to approach 2, but with a different tool, and it may be easier in some cases.

    Replies
    1. Marek, I don't think it's luck in this case. When I've seen bad files from CMS systems like this, it's typically a case of some IT worker who doesn't understand processes and standard interfaces very well and clumsily patches together his own hardly maintainable solution. At least this one wasn't using Excel as its exchange format. Converting to and from real XLIFF wouldn't offer me much in this case except an engineered process that I would have to pass on to a client who is just starting to come to grips with the technical aspects of memoQ filter customization, so inflicting Rainbow on them would not be kind. My personal geeky preference would be XSLT for the conversion, but that would be even less kind to the client.

      I agree that the Multilingual XML filter in memoQ might also ultimately be the route to go here, and the client was informed of such. However, that approach seemed overengineered for what I think was a first job with a new corporate client and no information about additional languages or other plans for the future. (In the meantime I have some info, and it sounds like an interesting grab bag of formats ahead.) My main objective in this case (the way I approached the filtering and the fact that I am blogging about it) is to start creating more public examples which include the other important aspect here: the need for multiple filters in sequence. These cases are ending up in my inbox with increasing frequency, and what seems to cause the most trouble in general for some of my agency clients these days is PMs or technical support staff getting their heads around this multiple filter problem. Individual translators have the same issue too, of course, but so far the cases with the most "layers" seem to hit agencies mostly. So I think this kind of case should be emphasized somewhere in the training workflows.......

  2. Thank you for this. I recently had this problem with some files sent by one client. This was really helpful

