Most of the time when I deal with XML files in memoQ, things are rather simple. Usually, in fact, I can use the default settings of the standard XML import filter, and everything works fine. (Maybe that's because a lot of my XML imports are extracted from PDF files using Iceni Infix, the local alternative to the TransPDF XLIFF exports made with Iceni's online service; keeping everything on my own computer avoids any confidentiality issues.)
Sometimes, however, things are not so simple. Like with this XML file a client sent recently:
Now if you look at the file, you might think the XLIFF filter should be used. But if you do that, memoQ responds with the following error message:
That is because the monkey who programmed the "XLIFF" export from the CMS system where the text resides was one of those fools who don't concern themselves with actual file format specifications. A number of the tags and attributes in the file simply do not conform to the XLIFF standards. There is a lot of that kind of stupidity to be found.
Fear not, however: one can work with this file using a modified XML filter in memoQ. But which one?
At first I thought to use the "Multilingual XML" filter, which I had heard about but never used; this turned out to be a dead end. It is language-pair specific and really not the best option in this case. I was concerned that there might be more files like this in the future involving other language pairs, and I did not want to be bothered with customizing for each possible case.
So I looked a little closer... and noticed that this export copies the source text exactly into the "target" elements. I therefore concentrated on building a customized XML filter configuration that would pull the text to translate only from between the target tags. After populating the tag list, I created a custom configuration of the XML filter that excludes the "source" tag content:
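The client's file is not reproduced here, but as a rough sketch (with invented element names and content), a translation unit in such a pseudo-XLIFF export might look something like this, with the source text duplicated in the target element and escaped HTML markup embedded in the translatable text:

```xml
<!-- invented example for illustration; the client's actual tag names and content differed -->
<trans-unit id="0001">
  <source>Our &lt;b&gt;new&lt;/b&gt; catalogue is now available.&lt;br/&gt;Order today!</source>
  <target>Our &lt;b&gt;new&lt;/b&gt; catalogue is now available.&lt;br/&gt;Order today!</target>
</trans-unit>
```

Excluding the content of the source element keeps the duplicated text out of the import; the HTML markup left in the target content is what the cascaded HTML filter described below takes care of, turning bits like <b> and <br/> into protected inline tags so that only the words remain editable.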
That worked, but not well enough. In the screenshot below, the excluded source content is shown with a gray background, but the imported content has a lot of HTML, for which the tags must be protected:
The next step is to do the import again, but this time with an HTML filter chained after the customized XML filter. In memoQ jargon, this sort of configuration is known as a "cascading filter" - a sequence of filters applied one after another to handle compound formats. Make sure, however, that the customized XML filter configuration has been saved first:
Then choose that custom configuration when you import the file using Import with Options:
This cascaded configuration can also be saved using the corresponding icon button.
This saved custom cascading filter configuration is available for later use, and like any memoQ "light resource", it can be exported to other memoQ installations.
The final import looks much better, and the segmentation is also correct now that the HTML tags have been properly filtered:
If you encounter a "special" XML case to translate, the actual format will surely be different, and the specific steps needed may differ somewhat as well. But by breaking the problem down into stages and considering what more might need to be done at each stage to get a workable result with all the non-translatable content protected, you or your technical support associates can almost always build a customized, re-usable import filter in reasonable time. That gives you an advantage over those who lack the proper tools and knowledge, and it ensures that your client's content can be translated without undue technical risks.
Sep 15, 2015
A quick trip to LiveDocs for EUR-Lex bilingual texts
Quite a number of friends and respected colleagues use EUR-Lex as a reference source for EU legislation. Being generally sensible people, some of them have backed away from the overfull slopbucket of bulk DGT data and built more selective corpora of the legislation which they actually need for their work.
However, the issue of how to get the data into a usable form with a minimum of effort has caused no little trouble at times. The various texts can be copied out or downloaded in the languages of interest and aligned, but depending on the quality of the alignment tool, the results are often unsatisfactory. I've been told that AlignFactory does a better job than most, but then the question of how best to deal with the HTML bitexts from AlignFactory remains.
memoQ LiveDocs is of course rather helpful for quick and sometimes dirty alignment, but if the synchronization of the texts is too many segments off, it is sometimes difficult to find the information one needs even when the (bilingual) document is opened from the context menu in a concordance window.
EUR-Lex offers bi- or tri-lingual views of most documents in a web page. The alignments are often imperfect, but the synchronization is usually off by only one or two segments, so finding the right text in a document's context is not terribly difficult. These often imperfect alignments are therefore usually quite adequate for use as references in a memoQ LiveDocs corpus. Here is a procedure one might follow to get the EUR-Lex data there.
The bilingual text of a view such as the one above can be selected by dragging the cursor to select the first part of the information, then scrolling to the bottom of the window and Shift+clicking to select all the text in both columns:
Copy this text, then paste it into Excel:
Then import the Excel file as a file for "translation" in a memoQ project with the right language settings. Because of quirks with data access in LiveDocs if the target language variants are specified and possibly not matched, I have created a "data conversion project" with generic language settings (DE + EN in my case, as opposed to my usual DE-DE + EN-US project settings) to ensure that data stored in LiveDocs can be accessed without trouble from any project. (This irritating issue of language variants in LiveDocs was introduced a few versions ago by Kilgray in an attempt to placate some large agencies, but it has caused enormous headaches for professional translators who work with multiple sublanguage settings. We hope that urgent attention will be given to this problem soon; until then, keep your LiveDocs language data settings generic to ensure trouble-free data access!)
When the Excel file is added to the Translations file list, there are two important changes to make in the import options. First, the filter must be changed from Microsoft Excel to "multilingual delimited text" (which also handles multilingual Excel files!). Second, the filter configuration must be "changed" to specify which data is in the columns of interest.
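As a purely illustrative sketch (placeholder text, not the actual judgment), the pasted worksheet is just a two-column table, and the filter configuration simply declares which column holds which language:

```text
Column A (source = DE)                Column B (target = EN)
Erster Absatz des Urteils ...         First paragraph of the judgment ...
Zweiter Absatz des Urteils ...        Second paragraph of the judgment ...
```

Each row then comes in as an aligned source/target pair (subject to memoQ's segmentation), which is what gets stored when the document is later added to the LiveDocs corpus.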
The screenshot above shows the import settings that were appropriate for the data I copied from EUR-Lex. Your settings will likely differ, but in each case the values need to be checked or set in the fields near the arrows (particularly "Source language" at the top and the three dropdown menus by the second arrow below).
Once the data are imported, some adjustments can be made by splitting or joining segments, but I don't think the effort is generally worth it, because in the cases I have seen, data are not far out of sync if they are mismatched, and the synchronization is usually corrected after a short interval.
In the Translations list of the Project home, the bilingual text can be selected and added to a LiveDocs corpus using the menus or ribbons.
The screenshot below shows the worst location of badly synchronized data in the text I copied here:
This minor dislocation does not pose a significant barrier to finding the information I might need to read and understand when using this judgment as a reference. The document context is available from the context menu in the memoQ Concordance as well as the context menu of the entry appearing in the Translation results pane.
A similar data migration procedure can be implemented for most bilingual tables in HTML files, word processing files or other data sources by copying the data into Excel and using the multilingual delimited text filter.
Mar 18, 2014
The curious case of crappy XML in memoQ
Recently one of my collaboration partners sent me a distressed e-mail asking about a rather odd XML file he had received. This one proved to be a little different from the ordinary filter adaptation challenge.
The problem, as it was explained to me, seemed to involve dealing with the trashed special characters in the German source text:
"Special" characters in German - äöüß - were all rendered as entities, which makes them difficult to read and screws up terminology and translation memory matches among other things. Entities are simply coded entries, which in this case all begin with an ampersand (&) and end with a semicolon. They are used to represent characters that may not be part of a particular text encoding system.
At first I thought the problem was simply a matter of adjusting the settings of the XML filter. So I selected Import with options... in my memoQ project and had a look at what my possibilities were. The fact that the filter settings dialog had an Entities tab seemed like a good start.
This proved to be a complete dead end. None of the various options I tried in the dialog cleared up the imported garbage. So I resolved to create a set of "custom" entities to handle with the XML filter, and used the translation grid filter of memoQ to make an inventory of these.
That's when I noticed that the translatable data in this crappy faux XML file was actually HTML text. So I thought perhaps the cascading filters feature of memoQ might help.
Using all the defaults, I found that the HTML was handled nicely with protected tags, but I did not want the tags that were created for non-breaking spaces (&nbsp;):
So I had another look at the settings of the cascaded HTML filter:
I noticed that if the option to import non-breaking spaces as entities is unmarked (it is selected by default), these are imported quite properly:
Now the text of some 600 lines was much easier to work with: an ordinary, readable document with a few protected HTML tags.
I'll be the first one to admit that the solution here is not obvious; in fact, one of my favorite Kilgray experts apparently took a very different and complex path using external tools that I simply don't understand. There are many ways to skin a cat, most of them painful - at least for the cat.
As I go through and update various sections of my memoQ tips guide, I'll probably expand the chapters on cascading filters and XML to discuss this case. But I haven't quite figured out a simple way to prepare the average user for dealing with a situation like this where the problem is not obvious. One thing is clear, however: it pays to look at the whole file in order to recognize where a different approach may be called for.
Maybe a decision matrix or tree would do the trick, but probably not for many people. In this case the file did not have a well-designed XML structure, and that contributed to the confusion. My colleague is an experienced translator with good research skills, and he scoured the memoQ Help and the Kilgray Knowledgebase in vain for guidance. Our work as translators poses many challenges. Some of these are old, familiar ones, repackaged in new and confusing ways, as in this case. So we must learn to look beyond mere features and instead observe carefully what we are confronted with, using that wit which distinguishes us from the dumb machines which the dummies fear might replace them.
Oct 5, 2013
Two years with an e-book reader
Nearly two years ago I acquired my first e-book reader, an Amazon Kindle like the one shown here. I had various thoughts of using it professionally but was in any case delighted with the fact that I could read text without eyestrain on it, even without my reading glasses. Some colleagues shared their experiences, and one was kind enough to mention Calibre, which I use periodically to convert file formats for better use in e-book readers or other media.
So what's the score after two years? On the professional side not so hot, because other distractions have prevented me from exploring all the possibilities of converting reference data for use on my reader. It's possible, but I'm still tweaking the technology to get exactly what I want with formatted, searchable RTF and HTML from terminology exports from my many CAT tool termbases. I could do that all along without much trouble using SDL Trados MultiTerm and various XSLT scripts, but I went down the rabbit hole of trying to make these solutions more accessible to colleagues who don't like a lot of technical fiddling, and though I think the problems are solved, I haven't had time to share most of the solutions or implement them on a large scale myself.
I do read literature related to the translation profession with some frequency. Found in Translation by Jost Zetzsche and Nataly Kelly gave me many pleasant, entertained hours in its Kindle version; attempts to read texts in PDF format by others have been less successful because of display issues, and the current version of my own book of memoQ tips is not a happy experience on a small black-and-white e-book reader. That has me thinking about what information might work in formats for e-book readers and smartphones, and it has also been one of the motivations for my recent experimentation with short video tutorials on YouTube. Not only should we consider the current trends in media such as e-book readers, tablets, smartphones and whatnot for our own professional learning and teaching needs, but also how our clients and prospects may use these media to create content which we might be asked to translate. This has already begun to happen with me in a small way, and those projects were possible only because of things I learned in my teaching experiments shortly before.
I also copy web pages into text or HTML files "to go" when I want to read up on a subject in the park while my dogs play or in a local café somewhere. My reader has a web browser, but many sites are difficult to view in a way that is friendly to a smaller screen. It's easier to grab what I want in separate files and organize these into a "collection" I can refer to easily later.
I never have done any proofreading or review with my Kindle, though I have used texts on it to translate manually (in a separate notebook) on occasion. However, that's not really compatible with most of the texts I work on.
What I have done most with my e-book reader is carry a growing library of world literature with me, familiar and unfamiliar old works and some new. I still hear some people talk about how they could not imagine reading without the heft of the book and the feel of the paper pages turned by their fingers. I'm just as caught up in the sensuality of a dusty old library as any other obsessive bibliophile, but the heft and feel don't mean much when accumulated nerve damage means that the book is more a source of pain than pleasure after ten minutes in your hands, and my once excellent eyesight has now decided that its term is served and I can find my own way with small type and lousy lighting conditions: there, the e-book reader is a gift of great value.
Most important to me, however, are the words. The finest binding, gold-edged pages and elegant type mean nothing if the words mean nothing. Words of beauty and power are worth straining to read in weathered stone inscriptions, on crumbled clay tablets written before the founding of Rome or on crumbling acid-paper pages in books forgotten in an attic. How much better then to have these same words in a legible format on your reader in minutes after a short search in an online database and a quick download or a purchase and transfer.
The Velveteen Rabbit had the same nursery magic on the Kindle in the cantinho last night as it would on the delicate old pages of the original edition, but I didn't have to worry about spilling my sangria on it. In the two years since I received my Kindle I have re-read many books that were lost as my library of thousands was slowly dispersed in my many relocations. Hundreds of new books from classic literature in two languages have come to me, go with me in my small, black volume with its Cloud-based backup, and this library will likely not be lost again wherever I go and no matter how lightly I travel.