Translation Tribulations: Complicated XML in memoQ: a filtering case example

Apr 4, 2018

Most of the time, XML files in memoQ are simple to deal with: I can usually use the default settings of the standard XML import filter, and everything works fine. (Maybe that's because a lot of my XML imports are extracted from PDF files using iceni InFix, the local alternative to the TransPDF XLIFF exports from iceni's online service; keeping everything local avoids any confidentiality issues.)

Sometimes, however, things are not so simple. Like with this XML file a client sent recently:

Now if you look at the file, you might think the XLIFF filter should be used. But if you try that, memoQ responds with the following error message:

That is because the monkey who programmed the "XLIFF" export from the CMS where the text resides was one of those fools who don't concern themselves with actual file format specifications. A number of the tags and attributes in the file simply do not conform to the XLIFF standard. There is a lot of that kind of stupidity to be found.

Fear not, however: one can work with this file using a modified XML filter in memoQ. But which one?

At first I thought I would use the "Multilingual XML" filter, which I had heard about but never used, but this turned out to be a dead end. It is language-pair specific, and really not the best option in this case. I was concerned that there might be more files like this in the future involving other language pairs, and I did not want to be bothered with customizing for each possible case.

So I looked a little closer... and noticed that this export has the source text copied exactly to the "target". So I concentrated on building a customized XML filter configuration that would pull the text to translate only from between the target tags. After populating the tag list from the file, I created a custom configuration of the XML filter that excludes the content of the "source" tags:
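
Outside memoQ, the idea behind this configuration can be sketched in a few lines of Python. This is only an illustration, not what memoQ does internally: the sample markup, the element names `export` and `unit`, and the assumption that the HTML arrives entity-escaped inside the elements are all hypothetical, since the client's actual file is not reproduced here.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the client's pseudo-XLIFF export:
# the "target" simply repeats the "source" text.
SAMPLE = """<export>
  <unit id="1">
    <source>&lt;p&gt;Hello &lt;b&gt;world&lt;/b&gt;&lt;/p&gt;</source>
    <target>&lt;p&gt;Hello &lt;b&gt;world&lt;/b&gt;&lt;/p&gt;</target>
  </unit>
  <unit id="2">
    <source>Second entry</source>
    <target>Second entry</target>
  </unit>
</export>"""


def local_name(tag):
    """Strip an optional '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def extract_targets(xml_text, wanted="target"):
    """Return the text of every <target> element, ignoring <source>."""
    root = ET.fromstring(xml_text)
    return [
        "".join(el.itertext())
        for el in root.iter()
        if local_name(el.tag) == wanted
    ]


targets = extract_targets(SAMPLE)
for text in targets:
    print(text)
```

Note that the extracted strings still carry raw HTML markup, which is exactly the problem the next step has to solve.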

That worked, but not well enough. In the screenshot below, the excluded source content is shown with a gray background, but the imported content has a lot of HTML, for which the tags must be protected:

The next step is to do the import again, but this time including an HTML filter after the customized XML filter. In memoQ jargon, this sort of configuration is known as a "cascading filter" - where various filters are sequenced to handle compounded formats. Make sure, however, that the customized XML filter configuration has been saved first:

Then choose that custom configuration when you import the file using Import with Options:

This cascaded configuration can also be saved using the corresponding icon button.

This saved custom cascading filter configuration is available for later use, and like any memoQ "light resource", it can be exported to other memoQ installations.
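
For readers who think in code, the cascade can be pictured as two passes over the same file: an XML pass that picks out the target content, then an HTML pass that separates protected tags from translatable text. The sketch below is a rough analogy under stated assumptions, not memoQ's implementation: the sample markup is invented, and `TagSplitter`, `split_html`, and `cascade` are hypothetical names.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical sample; the real export's structure is not shown here.
SAMPLE = (
    '<unit>'
    '<source>&lt;p&gt;Read &lt;a href="x.html"&gt;this&lt;/a&gt;.&lt;/p&gt;</source>'
    '<target>&lt;p&gt;Read &lt;a href="x.html"&gt;this&lt;/a&gt;.&lt;/p&gt;</target>'
    '</unit>'
)


class TagSplitter(HTMLParser):
    """Second stage: split an HTML fragment into tag and text tokens."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        # Keep the tag exactly as written so it can be protected as-is.
        self.tokens.append(("tag", self.get_starttag_text()))

    def handle_startendtag(self, tag, attrs):
        self.tokens.append(("tag", self.get_starttag_text()))

    def handle_endtag(self, tag):
        self.tokens.append(("tag", f"</{tag}>"))

    def handle_data(self, data):
        if data.strip():  # skip whitespace-only runs
            self.tokens.append(("text", data))


def split_html(fragment):
    parser = TagSplitter()
    parser.feed(fragment)
    parser.close()
    return parser.tokens


def cascade(xml_text):
    """First stage: take <target> content; second stage: tokenize its HTML."""
    root = ET.fromstring(xml_text)
    return [split_html("".join(t.itertext())) for t in root.iter("target")]


tokens = cascade(SAMPLE)[0]
translatable = "".join(value for kind, value in tokens if kind == "text")
print(tokens)
print(translatable)
```

Only the `text` tokens would be offered for translation; the `tag` tokens correspond to what the translator sees as protected inline tags.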

The final import looks much better, and the segmentation is also correct now that the HTML tags have been properly filtered:

If you encounter a "special" XML case to translate, the actual format will surely be different, and the specific steps needed may differ somewhat as well. But if you break the problem down into stages and consider what more needs to be done at each stage to get a workable result with all the non-translatable content protected, you or your technical support associates can almost always build a customized, re-usable import filter in reasonable time. That gives you an advantage over those who lack the proper tools and knowledge, and it ensures that your client's content can be translated without undue technical risks.

3 comments:

Wasaty, April 04, 2018 8:45 AM
Good approach, but you were lucky there were no actual translations in the target segments that you would have to retain. If you do have to deal with the real bilingual file, there are three approaches you can try:
1. Use memoQ's multilingual XML filter. It's decent and easy to modify to various languages once you have basic rules, plus you can import multiple languages at once if your XML contains more than two.
2. Use automated pre-processing with a project template. A relatively simple script could add the missing content and convert the source file into real XLIFF (this trick can be used on a multitude of bilingual source content).
3. Use Okapi-based Rainbow to generate XLIFF files. This is somewhat similar to approach 2, but with a different tool, and may be easier in some cases.
Anonymous, August 04, 2020 9:13 PM
Thank you for this. I recently had this problem with some files sent by one client. This was really helpful