Jun 17, 2018

Ferramentas de Tradução - CAT Tools Day at Universidade Nova de Lisboa

The Faculty of Social Sciences and Humanities held its first "CAT Tools Day" on June 16, 2018 with a diverse program intended to provide a lusophone overview of current best practices in the technologies that support professional translation work. The event offered presentations and demonstrations in a university auditorium, with parallel software introduction workshops for groups of up to 18 people in an instructional computer lab in another building.

The day began with morning sessions covering SDL Trados Studio and various aspects of speech recognition.

Dr. Helena Moniz explains aspects of speech analysis.

I found the presentation by Dr. Helena Moniz from the University of Lisbon faculty to be particularly interesting for its discussion of the many different voice models and how these are applied to speech recognition and text-to-speech synthesis. David Hardisty of FCSH at Universidade Nova also gave a good overview of the state of speech recognition for practical translation work, including his unobtrusive methods for utilizing machine pseudo-translation capabilities in dictated translations.

Parallel introductory workshops for software tools included memoQ 8, SDL Trados Studio 2017 and ABBYY FineReader - two sessions for each.

Attendees learned about ABBYY FineReader, SDL Trados Studio and memoQ in the translation computer lab

The ABBYY FineReader session I attended gave a good overview in Portuguese of basics and good practice, including a discussion of how to avoid common mistakes when converting scanned documents in a number of languages.

The afternoon featured several short, practical presentations by students, as well as my own talks on the upcoming integrated voice input solution for memoQ and on preparing PDF files for reference, translation, print deadline emergencies and customer relations.

Rúben Mata discusses Discord

The final session of the day was a "tools clinic" - an open Q&A about any aspect of translation technology and workflow challenges. This was a good opportunity to reinforce and elaborate on the many useful concepts and practical approaches shown throughout the day and to share ideas on how to adapt and thrive as a professional in the language services sector today.

Hosts David Hardisty and Marco Neves of FCSH plan to make this an annual event to exchange knowledge on technology and best practices in translation and editing work in discussions between practicing professionals and academics in the lusophone community. So watch for announcements of the next event in 2019!

Some of the topics of this year's conference will be explored in greater depth in three 25-hour courses offered in Portuguese and English this summer at Universidade Nova in Lisbon. On July 9th there will be a thorough course on memoQ Basics and workflows, followed by a Best Practices course on July 19th, covering memoQ and many other aspects of professional work. On September 3rd the university will offer a course on project management skills for language services, including the memoQ Server, project management business tools, file preparation and more. It is apparently also possible to get inexpensive housing at the university to attend these courses, which is quite a good thing given the rapidly rising cost of accommodation in Lisbon. Details on the housing option will be posted on this blog when I can find them.

iPhone Google Maps in translation

When I first moved to Portugal I had a TomTom navigation system that I had used for a few years when I traveled. Upon crossing a border, I would usually change the language for audio cues, because listening to street names in one language pronounced badly in another was simply too confusing and possibly dangerous. Eventually, the navigation device died as crappy electronics inevitably do, and I changed over to smartphone navigation systems, first Apple Maps on my iPhone and, after I tired of getting sent down impossible goat trails in Minho, Google Maps, which generally did a better job of not getting me lost and into danger.
For the most part, the experience with Google Maps has been good. It's particularly nice for calling up restaurant information (hours, phone numbers, etc.) on the same display where I can initiate navigation to find the restaurant. The only problem was that using audio cues was painful, because the awful American woman's voice butchering Portuguese street names meant that my only hope of finding anything was to keep my eyes on the actual map and try to shut out (or simply turn off) the audio.

What I wanted was navigation instructions in Portuguese, at least while I am in Portugal; across the border in Spain it would be nice to have Spanish to avoid confusion. Not the spoken English voice of some clueless tourist from Oklahoma looking to find the nearest McDonald's and asking for prices in "real money". But although I found that I could at least dictate street names in a given language if I switched the input "keyboard" to that language, the app always spoke that awful, ignorant English.

And then it occurred to me: switch the entire interface language of the phone! Set your iPhone's language to German and Google Maps will pronounce German place names correctly. Same story for Portuguese, Spanish, etc. Presumably Hungarian too; I'll have to try that in Budapest next time. And that may have an additional benefit: fewer puzzled looks when someone asks where I'm staying and I can't even pronounce the street name.

It's a little disconcerting now to see all my notifications on the phone in Portuguese. But that's also useful, as the puzzle pieces of the language are mostly falling into place these days, and the only time I get completely confused now is if someone drops a Portuguese bomb into the middle of an English sentence when I'm not expecting it. Street names make sense now; I'm less distracted by the navigation voice when I drive.

And if some level of discomfort means that I use the damned smartphone less, that's a good thing too.

(Kevin Lossner)

Jun 15, 2018

Better WordPress translation with memoQ

Translating websites is mostly a royal pain in the tush. And I avoid it most of the time. Why? Several reasons.
  • Those who request website translations often have no idea what platform is used nor do they really know how much content is present.
  • They have very little understanding of the technical details or importance of translatable information hidden in tag attributes, selection lists, etc. and so there are often misunderstandings about the true volume to be translated.
  • There are a lot of sloppy cowboys slogging through the bog, glibly bidding low rates to translate sites they neither understand nor truly care about, and their victims... uh, prospects, customers, whatever... usually lack the expertise or the patience to understand the difference between a wild-ass lowball guess from someone lacking the skills and tools to do the job right and a carefully researched, reasonably accurate estimate of time and effort from a professional.
Shopping for "quotes" when neither you nor the one submitting a "bid" actually understand the technical basis of the project is a process with no guarantee of a satisfactory outcome. And too often this process turns out badly.

These days, many small companies use the popular content management system WordPress to manage their web sites. It may not be the best by some technical standards, but sometimes it is better to define "best" according to the likelihood of finding someone to provide services involving a platform and of there being such experts available not only now but for a reasonable amount of time in the future. I think it is fair to say that WordPress has met that standard for some time and will probably do so for some time more.

I have had a good number of requests for translating WordPress content in the past, but none of the estimates given were accepted, because typically the content to be translated was an order of magnitude greater than the client realized or nobody could commit to a clear decision on what parts were to be translated and what parts were unimportant. And then we have the problem that many sites use themes which are poorly designed as multilingual structures.

The WordPress Multilingual Plug-in (WPML) makes sensible, professional management of websites with content in more than one language much easier. When I learned about this technology more than a year ago, I suggested its use to the person who requested a quotation for translation services, but that suggestion is probably still echoing somewhere out there in the Void.

At memoQ Fest 2018 this year in Budapest, I had the pleasure of attending a superb presentation by Stefan Weimar on how to cope with the translation of WordPress sites and some of what you need to know to use the WPML technology right. I was inspired and hoped to have an opportunity to look at things more closely some day.

That day turned out to be a week later. Funny how that goes.

Three or four years ago I translated a small website for a friend's company. At the time, the site used the Typo3 content management system, which proved to be troublesome. Not so much because of the technology, but because of the service provider using it, who rejected any suggestions for providing the content to be translated in a form that would not require his manual intervention at the text level. He copied, pasted and "improved" (German: verschlimmbesserte) my translation as only a German with full confidence in his grade school English skills could. It was... not what anyone had hoped for, and I never found the heart to mention all the mistakes in the final result.

So now, when someone asked me to have a look at their new site, I felt a bit queasy. Nunca mais, I thought. No way, José. Or Wolfgang as it were. But in the meantime, unbeknownst to me, he had switched service providers and CMS platforms, and the new provider managing his web content is a professional with a professional understanding of sites for international clients in many languages. And he uses WPML. The right way!

So now it was up to me to figure out what's what in memoQ. First I used the memoQ XLIFF filter on all the little XLIFF files exported by the plug-in. I quickly saw that a few other things were needed, like a cascaded HTML filter...

Somewhat messy, but doable once the HTML tags get properly protected by a chained filter.
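Why the chaining matters becomes clear if you look at what such an export typically contains: the post's HTML markup travels inside the XLIFF escaped as entities, so a plain XLIFF filter presents it as translatable text that a careless translator can damage. A small illustration (the sample segment is invented, not taken from a real WPML export):

```python
import html

# In a WPML XLIFF export, HTML markup inside a segment is typically
# escaped as entities. Without a chained HTML filter, the translator
# sees (and can break) the raw entities shown below.
escaped = "Visite a nossa &lt;strong&gt;loja online&lt;/strong&gt; hoje."

# A chained HTML filter effectively does this, then protects the tags:
unescaped = html.unescape(escaped)
print(unescaped)  # Visite a nossa <strong>loja online</strong> hoje.
```

The cascaded filter in memoQ does the equivalent of that unescaping and then converts the revealed tags into protected inline tags.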

Then I tried again, this time applying memoQ's WordPress (WPML) filter. And this was the result:

That was easy. Hmm. I think I know which method I prefer.

So for translations of WordPress websites properly configured to use WPML technology, the new memoQ filter looks like a winner!

Jun 14, 2018

Translating Wordfast GLP packages... elsewhere.

One reason to keep translation environment tool licenses up to date is that new formats continue to appear: new formats for translatable files as well as new file formats for the tools that help to process files for translation. Very often I have heard some "professional" say, "I'm a translator, not a [fill in the blank]. If the client wants this translated, I'll have to get it in a Microsoft Word file." Or something like that.

Let's get real for a moment.

  • That attitude is simply lazy and disrespectful toward translation consumers who would like to make use of one's services and
  • a lot of money is being left on the table here in many cases. I built a huge clientele at the start of the last decade, because my use of translation environment tools like Trados, Déjà Vu, STAR Transit and Wordfast enabled me as an individual to tackle translation challenges that many agencies at the time had no concept of how to cope with.
As translation agencies have acquired more technical tools, most unfortunately still remain unaware of how to use them properly or how to plan more than the simplest workflows, but that's a subject for another day. Also...
  • ... by using tools and techniques that are compatible with what your clients require for a final format, you can save your client a lot of time and money for further layout work - and probably avoid the introduction of errors in your translation work in its final format as well.
  • And in my experience, showing technical and process competence to benefit clients usually leads to greater trust and better work together.
So what has all this got to do with Wordfast?

Well... I didn't like the Wordfast brand for a very long time. Its various incarnations were perhaps the weakest of the popular tools in a technical sense, and inevitably when agency friends called me, desperate to fix some massive translator screw-up (usually by somebody in France), Wordfast "Pro" was often involved in the disaster.

I looked at the "newer" Wordfast versions a number of times over the years, and honestly they always seemed like lobotomized wannabe tools. This was about the time that many other toolmakers were trying to decide if they should support XLIFF.

Well, a lot has changed since then. I became aware of the changes the other day when somebody posted a question in a social media forum for memoQ asking how to handle Wordfast Pro 5 GLP packages. I had never heard of these, so of course I was curious and decided to take a look. This finally led me to download a 30-day trial of the latest Wordfast Pro software to evaluate its potential for interoperable work with other translation environments. I see a lot of changes since my last look, and so far I think they are all positive. Along the way I also had good cause to look at Wordfast Anywhere, the free web-based CAT tool that I talked some university colleagues into not wasting their time with a while ago. My recommendation in that regard might change, but that and commentary on the latest incarnation of WF Pro will have to wait for another day.

About those GLP packages....

Yes, those. This was the question:

Someone pointed out that GLP files - like every other translation "package" one finds from all the tool providers - are merely ZIP files with a particular internal structure and a renamed extension.

Gotta love Facebook. You'll always get an answer in some group, usually a wrong one. That's why I keep a blog. Good information gets buried in social media noise too often, and good luck finding it in any kind of search. In this case... we don' have no steenkeen TXML files as I learned... that's the old Wordfast Pro....

A colleague in Germany kindly provided me with a little GLP package to examine, which I promptly unzipped. I noticed that at least one tool (7-Zip) sees through the renamed extension nonsense and saved me the usual trouble of renaming it before unpacking.
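Python's standard zipfile module is equally indifferent to the extension, so the unpacking step can be scripted without any renaming at all; a minimal sketch (the package file name is a placeholder):

```python
import zipfile

def unpack_glp(package: str, dest: str = "unpacked") -> list[str]:
    """Extract a Wordfast GLP package (a renamed ZIP) and list its contents."""
    with zipfile.ZipFile(package) as z:  # the .glp extension is irrelevant here
        z.extractall(dest)
        return z.namelist()

# Example (hypothetical file name):
# for name in unpack_glp("project.glp"):
#     print(name)
```

The returned name list shows the package's internal folder structure, which is how I found the source and target folders described below.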

So far, so good... inside the folder for the unpacked GLP file I found the following:

The test package was an English to Portuguese project. But source? Hello? Let's have a look there!

Very interesting. The original source files (English) came along for the ride. This is good, because I often like to translate source files in memoQ - taking advantage of the preview there for many file types - and then use the translation memory to translate the file created by the other tool (usually SDL Trados SDLXLIFF files in my work). Now let's have a look inside the pt target folder. There's actually another folder named txlf inside that one. And there I found:

No TXML files! TXLF is a new instance of the rather ubiquitous XLIFF format one finds in the translation world, some variants of which have rather bothersome "extensions" that may require special handling in the translation process. In the simple test I performed, none of that was apparent; an ordinary XLIFF filter seemed to work well. Future tests will show me, I hope, if there are any quirks, but so far, so good.
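Since TXLF follows the XLIFF structure, a generic XLIFF parse already gets at the segment pairs. A hedged sketch - it assumes a plain XLIFF 1.2 namespace and ignores any Wordfast-specific extensions a real TXLF file may carry:

```python
import xml.etree.ElementTree as ET

XLIFF_NS = "urn:oasis:names:tc:xliff:document:1.2"  # assumed for TXLF

def list_segments(txlf_text: str) -> list[tuple[str, str]]:
    """Return (source, target) text pairs from an XLIFF/TXLF document."""
    root = ET.fromstring(txlf_text)
    pairs = []
    for unit in root.iter(f"{{{XLIFF_NS}}}trans-unit"):
        src = unit.find(f"{{{XLIFF_NS}}}source")
        tgt = unit.find(f"{{{XLIFF_NS}}}target")
        pairs.append(
            ((src.text or "") if src is not None else "",
             (tgt.text or "") if tgt is not None else "")
        )
    return pairs
```

This is only for peeking at content; for actual translation work, importing the TXLF with a proper XLIFF filter in a CAT tool is the sane route.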

So one strategy, with pretty much any CAT tool, would be to unpack the GLP file, get at those TXLF files and then bring them into another working environment using an XLIFF filter. Maybe also use my approach with the source files too, which will ensure that you can deliver a good target file even if quirky tags in the XLIFF lead you to produce less than an optimal result there. 

The current version of memoQ (8.4) does not recognize the TXLF extension, so as in all such cases, the All files option must be used and the correct filter applied in a later dialog. Unlike with some other tools, memoQ cannot be "trained" by the user to recognize new extensions as far as I know.

But what about importing the GLP files directly to memoQ? Wouldn't that be nice? And I thought it might be possible using the ZIP file filter recently introduced (and the same All files trick to get the GLP file and apply the ZIP filter later). Well...

It looked promising.

So much so that I even optimistically named and saved a custom configuration for the ZIP filter. All I need to do now is cascade an XLIFF filter!

Ack. Sooooo close. I've been here before. There are more things in heaven and down-to-earth cascading formats, Kilgray, than are dreamt of in your philosophy! Please, please expand the list of possible cascaded formats sensibly to make better use of this lovely new ZIP filter!

So for now, that's a no-go, but soon? Who knows? If you bother and tell the memoQ team how helpful it would be, maybe this and similar problems can be solved with relative ease.

In any case, for now it seems that the unpack-and-do-the-XLIFF approach will work for most anyone with a modern CAT tool. And that's good news, because in today's fast-changing technology environment for translation, interoperability of CAT tools is increasingly important. It is a foolish waste of time to translate in a large number of CAT tools, and according to my old research probably a bad idea to do so even in two or three. I've usually found that such JOATs (jacks of all trades) are, professionally, often stupid goats who lack the depth in one or two major environments which could allow them to get the most out of their tools and serve their clients in the best way with their linguistic skills and subject matter knowledge.

So is the latest Wordfast a tool worth checking out? I don't know yet. But it may be used by colleagues and clients with whom I like to work, and understanding how to share projects and project resources in painless ways will benefit all of us, no matter what our tool preferences may be. Wordfast seems to be developing very much in that spirit, so I will revisit it for more collaboration scenarios in the future.

Jun 3, 2018

Survey for Translation Transcription and Dictation

The website with the survey and short explainer video is
The idea is to build a human transcription service. We just need a few translators per language who want to work with a transcriptionist (due to RSI, for productivity, etc.), and we can use that data to build an ASR system for that language. There is also a good chance the ASR system will be accurate for domain-specific terminology and accents, as it will be adaptive and use source language context.
Take the Sight CAT survey - click here
Click on the graphic to go to the survey
memoQ Fest 2018 was, among other things, a good opportunity as always to spend time discussing things with some of the best and most interesting consultants, teachers, creative developers and brainstormers I know in the translation profession. One of these was my friend and colleague, John Moran, whose work on iOmegaT introduced me to the idea that properly designed, translator-controlled (voluntary) data logging could be a great boon to feature research and software development investment decisions. Sort of like SpyGate in translation, except that it isn't.

John and I have been talking, brainstorming and arguing about many aspects of translation technology for years now, dictation (voice recognition, ASR, whatever you want to call it) foremost among the topics. So I was very pleased to see him at the conference in Budapest last week, where he spoke about logging as a research tool in the program and a lot about speech recognition before and after in the breaks, bars, coffee houses and social event venues.

I think that one of the most memorable things about memoQ Fest 2018 was the introduction of the dictation tool currently called hey memoQ, which covers a lot of what John and I have discussed until the wee hours over the past four years or so and which also makes what I believe will be the first commercial use of source text guidance for target text dictation (not to mention switching to source text dictation when editing source texts!). John introduced that to me years ago based on some research that he follows. Fascinating stuff.

One of the things he has been interested in for a while for commercial, academic and ergonomic reasons is support for minor languages. Understandable for a guy who speaks Gaelic (I think) and has quite a lot of Gaelic resources which might contribute to a dictation solution some day. So while I'm excited about the coming memoQ release which will facilitate dictation in a CAT tool in 40 languages (more or less, probably a lot more in the future), John is thinking about smaller, underserved or unserved languages and those who rely on them in their working lives.

That's what his survey is about, and I hope you'll take the time to give him a piece of your mind... uh, share your thoughts I mean :-)

The Great Dictator in Translation.

I have no need for words. memoQ will have that covered in quite a few languages.

This is not your grandfather's memoQ!

May 25, 2018

Getting on Better Terms with memoQ

The pre-event webinar for the terminology workshop in Amsterdam on June 30th was held yesterday; for those who missed it, the recording is here, with a few call-outs added toward the end to make it easier to find information on other matters mentioned:

On the whole, I think the new presentation platform I'm using – Zoom – works well. I was particularly happy to discover, when my Internet router died suddenly and mysteriously about 50 minutes into the talk, that the recording was not lost; when the talk resumed a few minutes later with a different router connection, a recording of that part was made in a separate folder, so the parts could be joined later in a video editor without much ado.

I've held perhaps a dozen online meetings with clients since I licensed Zoom recently, and I'm very pleased with its flexibility, even though the many options have tripped me up in embarrassing ways a few times. So, time permitting, I'll try to do at least one talk like this each month on some aspect of translation technology in the interests of promoting better practice. The next will be held on June 21st and will cover various PDF workflows using iceni technology. Suggestions for later presentations are welcome.

The talk yesterday on terminology in memoQ was just a quick (ha ha - one hour) overview of possibilities; much more detail on these matters will be provided in the Amsterdam workshop and summer courses at Universidade Nova de Lisboa, which are open to the public at very reasonable rates (about €130 for 25 hours of instruction). There will be a lot more in this topic area in the future in various venues; right now there are some very interesting developments afoot with memoQ and other matters at Kilgray, and other providers also have good things in the works. So stay tuned.

May 21, 2018

Best Practices in Translation Technology: summer course in Lisbon July 16-21

As usual each year, the summer school at Universidade Nova de Lisboa is offering quite a variety of inexpensive, excellent intensive courses, including some for the practice of translation. This year includes a reprise of last year's Best Practices in Translation Technology from July 16th to 21st, with some different topics and approaches.

Centre for English, Translation and Anglo-Portuguese Studies

The course will be taught by the same team as last year – yours truly, Marco Neves and David Hardisty – and cover the following areas:
  • Good translation workflows.
  • Using voice recognition in translation.
  • Using machine translation in a humane, intelligent way.
  • Using checklists to improve communication in translation.
  • Using glossaries, bilingual texts and other references in multiplatform environments.
  • Good practices for using terminology and reference texts in the target language.
  • Planning and creating lists for auto-translation rules and the basics of regular expressions for filters.

Some knowledge of the memoQ translation environment and translation experience are required.

The course is offered in the evening from 6 pm to 10 pm Monday (July 16th) through Friday (July 20th), with a Saturday (July 21st) session for review and exams from 9 am to 2 pm. This allows free days to explore Lisbon and the surrounding region and get to know Portugal and its culture.

Tuition costs for the general public are €130 for the 25 hours of instruction. The university certainly can't be accused of price-gouging :-) Summer course registration instructions are here (currently available only in Portuguese; I'm not sure if/when an English version will be available, but the instructors can be contacted for assistance if necessary).

Two other courses offered this summer at Uni Nova with similar schedules and cost are: Introduction to memoQ (taught by David and Marco – a good place to get a solid grounding in memoQ prior to the Best Practices course) from July 9–14, 2018 and Translation Project Management Tools from September 3–8, 2018.

All courses are taught in English and Portuguese in a mix suitable for the participants in the individual courses.

May 10, 2018

Zooming inside iceni InFix for PDF translation: web meeting on 21 June 2018

Over the course of the last nine years, I have published a few articles about ways that I have found the PDF editor iceni InFix useful for my translation and terminology research work. Throughout that time iceni has continued to improve that product as well as develop other technologies for PDF translation assistance, such as the online TransPDF service now integrated with memoQ.
It's one thing to have a tool and in many cases quite another thing to know how to make the best use of it. This situation is further complicated by the very wide range of scenarios in which an editor like iceni InFix might be useful and the great differences one often finds in the needs and expectations of the clientele from one translator to another. In the product's early days I followed the commentaries of José Henrique Lamensdorf, a Brazilian engineer with long experience in technical translation, desktop publishing and other fields, and while I consider him to be among the most useful sources of good technical information for me in my early days as a commercial translator, his project needs were very different from mine, and most of the things he mentioned a decade or more ago, though very relevant to people heavily involved with publishing, weren't a fit for my clientele.
That changed as iceni expanded the feature set over the years and I began to encounter many cases where OCR and a full Adobe Acrobat license did not quite do what I needed in a simple way.

Some weeks ago I had an online meeting scheduled with a client company to discuss the advantages of certain support technologies with that company's translation and project management staff. We tried to use TeamViewer for the discussion, but unfortunately my license could not accommodate the 6+ people involved, and I was reluctant to fork over the extra cash needed for a 15 or 25 participant license, especially because some other clients had issues with TeamViewer which I never clearly understood, leading their IT departments to ban it. And the TVS recording files, while generally quite decent for viewing and of manageable size due to an excellent compression CODEC, are a nightmare to convert cleanly to MP4 or other common video formats. Just as I was caught in this dilemma, my esteemed Portuguese to UK English translation colleague and gifted instructor at Universidade Nova de Lisboa, David Hardisty, enthusiastically re-introduced me to Zoom videoconferencing.

I had seen Zoom before briefly when IAPTI decided to ditch the Citrix conferencing solutions and use it for webinars and staff meetings, but at the time I was too distracted by other matters to remember the name or notice the details. And, as we know, there one finds the Devil.

Zoom is powerful and flexible. For about €13 a month for my Pro license, I can invite up to 100 people for a web meeting, with quite a few useful options that I am still getting a grip on. Being used to the relative simplicity of TeamViewer, I am a little overwhelmed sometimes, and I have had a few recorded client meetings where the video was flawed because I got the screen sharing options mixed up. But the basics are actually dead simple if one pays a bit of attention.

A Zoom "web meeting", by the way, is what I would call a webinar, but that term means something else in Zoomworld, involving up to 50 speakers and something like 10,000 participants for some monthly premium. Not my thing. If the crowd is bigger than 10 in an online or a face-to-face class, I start to feel the constrictions of time and individual attention like an unruly anaconda around me.

But in any case, for someone who has spent many years looking for better teaching tools, Zoom is looking pretty good right now. And it enables me to share what I hope is useful professional information without dealing with the organizational nonsense and politics often associated with platforms licensed by some companies and professional associations. All for the monthly price of a cheap lunch.

So I've decided to do a series of free public talks using Zoom, not only to share some of a considerable backlog of new and exciting technical matters for translators, translation project managers and support staff and language service consumers, but also to get a better handle on how I can use this tool to support friends, colleagues and students around the world. Previously I announced a terminology talk (on May 24th, mostly about memoQ); now I have decided to share some of the ways that iceni InFix helps me in my work and what it might do for you too.

So on Thursday, June 21st at 16:00 Central European Time (15:00 Lisbon time), I'll be talking about how you can get your fix of useful PDF handling for a variety of challenging situations. You are welcome to join me for this.

The registration link is here.

May 1, 2018

All-round Translator Terminology Workshop Pre-event Webinar

Link to registration for the webinar

As previously announced, on June 30th in Amsterdam, the All-round Translator (ART) is offering the workshop "Coming to Terms: Mining & Management" covering a range of practical topics for applied corpus linguistics, optimizing terminology management and efficient sharing of terms in teams. This technical workshop will cover a range of tools and techniques as described on the ART event page.

A month before the Amsterdam workshop I will be presenting a free webinar offering an overview of some of the material planned for June as well as related topics with a particular focus on one of the tools I use most - memoQ - with highlights of recent improvements in its terminology features with memoQ versions 8.3 and 8.4.

This webinar is open to anyone interested regardless of whether or not they plan to attend the June workshop. The talk will use Zoom, which I adopted for remote teaching of corporate clients and others because of its greater versatility and superior recording facilities compared to my old favorite, TeamViewer. (Technically it's also a "meeting", not a "webinar" in Zoom-speak, but that's a distinction without a difference for people who don't feel up to 50 simultaneous speakers and 10,000 viewers.) The platform is free for participants to use, and if I'm not mistaken, a web browser can also be used, though interaction is more limited via that medium. (Don't ask me how, I am still gathering experience with this tool and its myriad options.)

The presentation (approximately one hour, starting at 4 pm Central European Time = 3 pm Lisbon time on May 24th) is free, but registration is required: the link for that is here.

Apr 14, 2018

memoQ filter for MS Outlook e-mail

A few days ago I was preparing screenshots in memoQ for lecture slides. As I tried to select a PDF file to import, the defective trackpad on my laptop caused a file farther down in the list to be selected, and I got a surprise. Not believing my eyes, I tried again and saw that, yes, what I saw was indeed possible...

... saved Microsoft Outlook MSG files (e-mail) are imported to memoQ with all their graphics and attachments! Kilgray created a filter some time ago and simply forgot to document its existence publicly. As of the current versions of memoQ you won't see this in the documentation or the filter lists of the interface, but memoQ can "see" MSG files, and if they are selected, this hidden filter will appear in the import dialog.

And this also works for LiveDocs.

At the time of this discovery, I was working on a little job for a friend's agency, and her project manager had sent me a list of abbreviations in an e-mail. I was too lazy to make the entries in my termbase, so I simply imported the mail to the LiveDocs corpus I maintain for her shop so that it would show up in concordance searches:

So when people tell you memoQ is good, don't believe them. It's actually better than that, but the truth is a well-kept secret :-)

Apr 4, 2018

New in memoQ 8.4: easy stopword list creation!

This wasn't really on Kilgray's plan, but hey - it's now possible, and that makes my life easier. An accidental "feature".

Four years ago, frustrated by memoQ's inability to import stopword lists obtained from other sources, I published a somewhat complex workaround, which I have used in workshops and classes when I teach terminology mining techniques. For years I had suggested that adding and merging such lists be facilitated in some way, because the memoQ stopword list editor really sucks (and still does). Alas, the suggestion was not taken up, so translators of most source languages were left high and dry if they wanted to do term extraction in memoQ and avoid the noise of common, uninteresting words.

Enter memoQ version 8.4... with a lot of very nice improvements in terminology management features, which will be the subject of other posts in the future. I've had a lot of very interesting discussions with the Kilgray team since last autumn, and the directions they've indicated for terminology in memoQ have been very encouraging. The most recent versions (8.3 and 8.4) have delivered on quite a number of those promises.

I have used memoQ's term extraction module since it was first introduced in version 5, but it was really a prototype, not a properly finished tool, despite its superiority over many others in a lot of ways. One of its biggest weaknesses was the handling of stopwords (used to filter out unwanted "word noise"). It was difficult to build lists that did not already exist, and it was also difficult to add words to a list, because both the editor and the term extraction module allowed only one word to be added at a time. Quite a nuisance.
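To make the role of stopwords concrete, the basic idea behind frequency-based candidate extraction can be sketched in a few lines of Python. This is a deliberately naive illustration, not memoQ's actual algorithm, and the sample text and stopword set are invented:

```python
from collections import Counter

def extract_candidates(text, stopwords, min_freq=3):
    """Count word frequencies, then drop stopwords and rare words.

    A toy illustration of frequency-based term extraction; memoQ's
    module is far more sophisticated (phrases, casing, prefixes etc.).
    """
    words = [w.lower() for w in text.split()]
    counts = Counter(words)
    return {w: n for w, n in counts.items()
            if n >= min_freq and w not in stopwords}

stopwords = {"the", "a", "of", "and", "to", "is"}
sample = "the filter the filter the filter converts the file and the file"
print(extract_candidates(sample, stopwords, min_freq=2))
# → {'filter': 3, 'file': 2}
```

Without the stopword set, "the" (five occurrences) would top the candidate list, which is exactly the noise a good stopword list suppresses.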

In memoQ 8.4, however, we can now add any number of selected words in an extraction session to the stopword list. This eliminates my main gripe with the term extraction module. And this afternoon, while I was chatting with Kilgray's Peter Reynolds about what I like about terminology in memoQ 8.4, a remark from him inspired the realization that it is now very easy to create a memoQ stopword list from any old stopword lists for any language.

How? Let me show you with a couple of Dutch stopword lists I pulled off the Internet :-)

I've been collecting stopword lists for friends and colleagues for years; I probably have 40 or 50 languages covered by now. I use these when I teach about AntConc for term extraction, but the manual process of converting these to use in memoQ has simply been too intimidating for most people.
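The manual merge itself is conceptually simple - combine, deduplicate and sort one-word-per-line text files - as this little Python sketch shows (the file names are hypothetical):

```python
from pathlib import Path

def merge_stopword_lists(paths):
    """Combine several one-word-per-line stopword files into a single
    lowercased, deduplicated, sorted list."""
    words = set()
    for p in paths:
        for line in Path(p).read_text(encoding="utf-8").splitlines():
            w = line.strip().lower()
            if w:
                words.add(w)
    return sorted(words)

# Hypothetical file names for two Dutch lists pulled off the internet:
# merged = merge_stopword_lists(["nl_stop1.txt", "nl_stop2.txt"])
# Path("nl_merged.txt").write_text("\n".join(merged), encoding="utf-8")
```

The intimidating part was never this step; it was getting the result into memoQ's stopword list format, which the trick below now makes unnecessary.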

But now we can import and combine these lists easily with a bogus term extraction session!

First I create a project in memoQ, setting the source language to the one for which I want to build or expand a stopword list. The target language does not matter. Then I import the stopword lists into that project as "translation documents".

On the Preparation ribbon in the open project, I then choose Extract Terms and tell the program to use the stopword lists I imported as "translation documents". Some special settings are required for this extraction:

The two areas marked with red boxes are critical. Change all the values there to "1" to ensure that every word is included. Ordinarily, these values are higher, because the term extraction module in memoQ is designed to pick words based on their frequencies, and a typical minimum frequency used is 3 or 4 occurrences. Some stopword lists I have seen include multiple word expressions, but memoQ stopword lists work with single words, so the maximum length in words needs to be one.

Select all the words in the list (by selecting the first entry, scrolling to the bottom and then clicking on the last entry while holding down the Shift key to get everything), and then select the command from the ribbon to add the selected candidates to the stopword list.

But we don't have a Dutch stopword list! No matter:

Just create a new one when the dialog appears!

After the OK button is clicked to create the list, the new list appears with all the selected candidates included. When you close that dialog, be sure to click Yes to save the changes or the words will not be added!

Now my Dutch stopword list is available for term extraction in Dutch documents in the future and will appear in the dropdown menu of the term extraction session's settings dialog when a session is created or restarted. And with the new features in memoQ 8.4, it's a very simple matter to select and add more words to the list in the future, including all "dropped" terms if you want to do that.

More sophisticated use of your new list would include changing the 3-digit codes which are used with stopwords in memoQ to allow certain words to appear at the beginning, in the middle, or at the end of phrases. If anyone is interested in that, they can read about it in my blog post from six years ago. But even without all that, the new stopword lists should be a great help for more efficient term extractions for your source languages in the future.

And, of course, like all memoQ light resources, these lists can be exported and shared with other memoQ users who work with the same source language.

Complicated XML in memoQ: a filtering case example

Most of the time when I deal with XML files in memoQ, things are rather simple. Usually, in fact, I can use the default settings of the standard XML import filter, and everything works fine. (Maybe that's because a lot of my XML imports are extracted from PDF files using iceni InFix, the alternative to TransPDF XLIFF exports via iceni's online service; this overcomes any confidentiality issues by keeping everything local.)

Sometimes, however, things are not so simple. Like with this XML file a client sent recently:

Now if you look at the file, you might think the XLIFF filter should be used. But if you do that, the following error message results in memoQ:

That is because the monkey who programmed the "XLIFF" export from the CMS system where the text resides was one of those fools who don't concern themselves with actual file format specifications. A number of the tags and attributes in the file simply do not conform to the XLIFF standards. There is a lot of that kind of stupidity to be found.

Fear not, however: one can work with this file using a modified XML filter in memoQ. But which one?

At first I thought to use the "Multilingual XML" filter, which I had heard about but never used, but this turned out to be a dead end. It is language-pair specific, and really not the best option in this case. I was concerned that there might be more files like this in the future involving other language pairs, and I did not want to be bothered with customizing for each possible case.

So I looked a little closer... and noticed that this export copies the source text exactly into the "target" elements. So I concentrated on building a customized XML filter configuration that would pull the text to translate only from between the target tags. After populating the tag list, I created a custom configuration of the XML filter that excludes the "source" tag content:
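As an aside, the same extraction idea can be sketched outside memoQ with Python's standard library. The tag names here are assumed from the pseudo-XLIFF structure described above; a real export will differ:

```python
import xml.etree.ElementTree as ET

def target_texts(xml_string):
    """Collect the text content of every <target> element,
    ignoring the duplicated <source> content entirely."""
    root = ET.fromstring(xml_string)
    return [el.text or "" for el in root.iter("target")]

sample = """<body>
  <trans-unit><source>Hello</source><target>Hello</target></trans-unit>
  <trans-unit><source>World</source><target>World</target></trans-unit>
</body>"""
print(target_texts(sample))  # → ['Hello', 'World']
```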

That worked, but not well enough. In the screenshot below, the excluded source content is shown with a gray background, but the imported content has a lot of HTML, for which the tags must be protected:

The next step is to do the import again, but this time including an HTML filter after the customized XML filter. In memoQ jargon, this sort of configuration is known as a "cascading filter" - where various filters are sequenced to handle compounded formats. Make sure, however, that the customized XML filter configuration has been saved first:
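Conceptually, the second stage of such a cascade just masks the inline HTML markup with protected placeholders and restores it after translation. Here is a rough Python sketch of that principle (not how memoQ implements it internally):

```python
import re

TAG_RE = re.compile(r"</?[a-zA-Z][^>]*>")

def protect_tags(text):
    """Replace inline HTML tags with numbered placeholders, returning
    the masked text and a map for restoring the tags later."""
    tags = {}
    def repl(m):
        key = "{%d}" % (len(tags) + 1)
        tags[key] = m.group(0)
        return key
    return TAG_RE.sub(repl, text), tags

def restore_tags(text, tags):
    """Put the original tags back after translation."""
    for key, tag in tags.items():
        text = text.replace(key, tag)
    return text

masked, tags = protect_tags("<b>Press</b> the <i>OK</i> button")
print(masked)  # → {1}Press{2} the {3}OK{4} button
```

The translator works on the masked text, the placeholders survive untouched as "tags", and the markup is reinstated on export.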

Then choose that custom configuration when you import the file using Import with Options:

This cascaded configuration can also be saved using the corresponding icon button.

This saved custom cascading filter configuration is available for later use, and like any memoQ "light resource", it can be exported to other memoQ installations.

The final import looks much better, and the segmentation is also correct now that the HTML tags have been properly filtered:

If you encounter a "special" XML case to translate, the actual format will surely be different, and the specific steps needed may differ somewhat as well. But by breaking the problem down in stages and considering what more needs to be done at each stage to get a workable result with all the non-translatable content protected, you or your technical support associates can almost always build a customized, re-usable import filter in reasonable time. That gives you an advantage over those who lack the proper tools and knowledge, and it ensures that your client's content can be translated without undue technical risk.

Apr 3, 2018

Dealing with tagged translatable text in memoQ

Lately I've been doing a bit of custom filter development for some translation agency clients. Most of it has been relatively simple stuff, like chaining an HTML filter after an Excel filter to protect HTML tags around the text in the Excel cells, but some of it is more involved; in a few cases, three levels of filters had to be combined using memoQ's cascading filter feature.

And sometimes things go too far....

A client had quite a number of JSON files, which were the basis for some online programming tutorials. There was quite a lot of non-translatable content that made it past memoQ's default JSON filter, much of which - if modified in any way - would mess up the functionality of the translated content and require a lot of troublesome post-editing and correction. In the example above, "Seconds in a day:" is clearly translatable text, but the special rules used with the Regex Tagger turned that text (and others) into protected tags. And unfortunately, the rules could not be edited efficiently to avoid this without leaving a lot of untranslatable content unprotected and driving up the cost (due to increased word count) for the client.
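The underlying classification problem can be imitated outside memoQ: walk the JSON structure and decide per string value whether it is text to translate or code to protect. The key names and the no-spaces heuristic in this Python sketch are invented for illustration; real rules are inevitably messier:

```python
import json
import re

# Naive heuristic: a string with no spaces, made only of identifier-like
# and operator characters, is probably code rather than UI text.
CODE_LIKE = re.compile(r"^[\w./:\-{}()\[\]=+*<>]+$")

def classify(obj, path=""):
    """Walk a JSON structure and split its string values into
    'translate' and 'protect' buckets."""
    translate, protect = {}, {}
    if isinstance(obj, dict):
        for k, v in obj.items():
            t, p = classify(v, f"{path}/{k}")
            translate.update(t)
            protect.update(p)
    elif isinstance(obj, list):
        for i, v in enumerate(obj):
            t, p = classify(v, f"{path}[{i}]")
            translate.update(t)
            protect.update(p)
    elif isinstance(obj, str):
        (protect if CODE_LIKE.match(obj) else translate)[path] = obj
    return translate, protect

data = json.loads('{"prompt": "Seconds in a day:", "answer": "24*60*60"}')
t, p = classify(data)
print(t)  # → {'/prompt': 'Seconds in a day:'}
print(p)  # → {'/answer': '24*60*60'}
```

Any such heuristic misfires at the margins, which is exactly why the regex rules in the real project overshot and tagged translatable text.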

In situations like this, there is only one proper thing to do in memoQ: edit the tags!

There are two ways to do this:

  • use the inline tag editing features of memoQ or
  • edit the tag on the target side of a memoQ RTF bilingual review file.
The second approach can be carried out by someone (like the client) in any reasonable text editor; tags in an RTF bilingual are represented as red text:

If, however, you go the RTF bilingual route, it's important to specify that the full text of the tags is to be exported, or all you'll get are numbers in brackets as placeholders:

Editing tags in the memoQ working environment is also straightforward:

On the Edit ribbon, select Tag Commands and choose the option Edit Inline Tag.

When you change the tag content as required, remember to click the Save button in the editing dialog each time, or your changes will be lost.

These methods can be applied to cases such as HTML or XML attribute text which needs to be translated but which instead has been embedded in a tag due to an incorrectly configured filter. I've seen that rather often unfortunately.

The effort involved here is greater than typical word- or character-based compensation schemes can fairly cover; it should be charged at a decent hourly rate or included in project management fees.

A lot of translators are rather "tag-phobic", but the reality of translation today is that tags are an essential part of the translatable content, serving to format translatable content in some cases and containing (unfortunately) embedded text which needs to be translated in other (fortunately less common) cases. Correct handling of tags by translation service providers delivers considerable value to end clients by enabling translations to be produced directly in the file formats needed, saving a great deal of time and money for the client in many cases.

One reasonable objection that many translators have is that the flawed compensation models typically used in the bulk market bog do not fairly include the extra effort of working with tags. In simple cases where the tags are simply part of the format (or are residual garbage from a poorly prepared OCR file, for example), a fair way of dealing with this is to count the tags as words or as an average character equivalent. This is what I usually do, but in the case of tags which need editing, this is not enough, and an hourly charge would apply.

In the filter development project for the JSON files received by my agency client, the text was initially analyzed at
14,985 words; 111,085 characters; 65 tags
and after proper tagging of the coded content to be protected it was
8,766 words; 46,949 characters; 2,718 tags.
The reduction in text count more than covered the cost of the few hours needed to produce the cascading filter for this client's case and largely ensured that the translator could not alter text that would impair the function of the product.
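For anyone curious about the arithmetic: counting each tag as a word equivalent (my usual convention for simple tags, not any industry standard), the change in billable volume can be checked quickly:

```python
def weighted_words(words, tags, tag_weight=1.0):
    """Treat each tag as a word equivalent for billing purposes."""
    return words + tags * tag_weight

before = weighted_words(14985, 65)    # → 15050.0
after = weighted_words(8766, 2718)    # → 11484.0
print(f"Reduction: {1 - after / before:.1%}")  # → Reduction: 23.7%
```

Even with every tag billed as a word, the client's weighted volume dropped by nearly a quarter, which is what paid for the filter development.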