Jun 4, 2014

OmegaT’s Growing Place in the Language Services Industry

Guest post by John Moran

As both a translator and a software developer, I have much respect for the sophistication of the well-known proprietary standalone CAT tools like memoQ, Trados, DejaVu and Wordfast. I started with Trados 2.0 and have seen it evolve over the years. To greater and lesser extents these software publishers do a reasonable job at remaining interoperable and innovating on behalf of their main customers - us translators. Kudos in particular to Kilgray for using interoperability standards to topple the once mighty Trados from its monopolistic throne and forcing SDL to improve their famously shoddy customer support. Rotten tomatoes to Across for being a non-interoperable island and having a CAT tool that is unpopular with most (but curiously not all) of the freelance translators I work with in Transpiral.

But this piece is about OmegaT. Unlike some of the other participants in the OmegaT project, I became involved with OmegaT for purely selfish reasons. I am currently in the hopefully final stage of a Ph.D. in computer science with an Irish research institute called the Centre for Next Generation Localisation (www.cngl.ie). I wanted to gather activity data from translators working in a CAT tool for my research in a manner similar to a translation process research tool called TransLog. My first thought was to do this in Trados as that was the tool I knew best as a translator but Trados’ Application Programming Interface did not let me communicate with the editor.

Thus, I was forced to look for an open-source CAT tool. After looking at a few alternatives like the excellent Virtaal editor and a really buggy Japanese one called Benten I decided on OmegaT. 

Aside from the fact that it was programmed in Java, a language I have worked with for about ten years as a freelance programmer, it had most of the features I was used to working with in Trados. I felt it must be reliable if translators were downloading it 4000 times every month. That was in 2010. Four years later that number is about to reach 10,000. Even if most of those downloads are updates, it should be a worrying trend for the proprietary CAT tools. Considering that SDL reports having 135,000 paid Trados licenses in total, that is a significant number.

Having downloaded the code, I added a logging feature to it called instrumentation (the “i” in iOmegaT) and programmed a small replayer prototype. Imagine pressing a record button in Trados and later replaying the mechanical act of crafting the translation as a video, character-by-character or segment-by-segment, and you will get the picture. So far we use the XML it generates mainly to measure the impact of machine translation on translation speed relative to not having MT. Funnily enough, when I developed it I assumed it would show me that MT was bunk. I was wrong. It can aid productivity, and my bias was caused by the fact that I had never worked with useful trained MT. My dreams of standing ovations at translator association meetings turned to dust.
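To give a rough idea of what record-and-replay involves, here is a minimal sketch in Python. It is not the iOmegaT log format or code: the `EditEvent` fields and the `replay` function are hypothetical names invented for this illustration, and it assumes each logged event stores a timestamp, a character offset, the number of characters deleted and the text inserted.

```python
from dataclasses import dataclass


@dataclass
class EditEvent:
    """One logged edit in the target segment (hypothetical structure)."""
    t_ms: int      # milliseconds since the segment was opened
    pos: int       # character offset where the edit happened
    deleted: int   # number of characters removed at pos
    inserted: str  # text typed or pasted at pos


def replay(events, initial=""):
    """Yield (timestamp, text) after each event, in time order."""
    text = initial
    for ev in sorted(events, key=lambda e: e.t_ms):
        # Apply the edit: drop `deleted` characters at pos, then insert.
        text = text[:ev.pos] + ev.inserted + text[ev.pos + ev.deleted:]
        yield ev.t_ms, text


# The translator types "Helo", fixes the typo, then finishes the segment.
events = [
    EditEvent(0, 0, 0, "Helo"),
    EditEvent(900, 3, 0, "l"),
    EditEvent(1500, 5, 0, " world"),
]
frames = list(replay(events))
for t_ms, text in frames:
    print(t_ms, repr(text))
```

Each frame is the state of the translation at one moment, so stepping through the frames is the character-by-character replay, and the gaps between timestamps are the raw material for any speed measurement.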

If I can’t beat MT I might as well join it. About a year and a half ago, using a government research commercialization feasibility grant, I was joined by my friend Christian Saam on the iOmegaT project. We studied computational linguistics in Ireland and Germany on opposite sides of an Erasmus exchange programme, so we share a deep interest in language technology and a common vocabulary. We set about turning the software I developed in collaboration with Welocalize into a commercial data analysis application for large companies that use MT to reduce their translation costs.

However, MT post-editing is just one use case. We hope to be able to use the same technique to measure the impact of predictive typing and Automatic Speech Recognition on translators. I believe these technologies are more interesting to most translators as they impose less on word order.

At this point I should point out that CNGL is a really big research project with over 150 paid researchers in areas like speech and language technology. Localization is big business in Ireland. My idea is to funnel less commercially sensitive translator user activity data securely, legally, transparently and, in most cases, anonymously from translators using instrumented CAT tools into a research environment to develop and, most importantly, test algorithms to help improve translation productivity. Someone once called it telemetry for offline CAT tools. My hope is that, although translation companies take NDAs very seriously, a controlled but automated data flow may be feasible, since many modern content types like User Generated Content and technical support responses appear on websites almost as soon as they are written in the source language. In the future it may also be possible to test algorithms for technologies like predictive typing without uploading any linguistic data from a working translator’s PC. Our bet is that researchers are data-tropic. If we build it they will come.

We have good cause to be optimistic. Welocalize, our industrial partner, is an enlightened kind of large translation company. They have a tendency to want to break down the walls of walled gardens. Many companies don’t trust anything that is free, but they know the dynamics of open-source. They had developed a complex but powerful open-source translation management system called GlobalSight, and its timing was propitious.

It was released around the same time SDL announced they were mothballing their newly acquired Idiom WorldServer system (now SDL WorldServer). This panicked a number of corporate translation buyers, who suddenly realized how deeply networked their translation department was via its web services and how strategically important the SDL TMS system was. As the song goes, "you don’t know what you’ve got till it’s gone", or, in this case, nearly gone.

SDL ultimately reversed the decision to mothball WorldServer and began to reinvest in its development, but that came too late for some corporates who migrated en masse to GlobalSight. It is now one of the most implemented translation management systems in the world in technology companies and Fortune 500 companies. A lot of people think open-source is for hippies, but for large companies open-source can be an easy sell. They can afford engineering support, department managers won’t be caught with their pants down if the company doing the development ceases to exist, and most importantly their reliance on SDL’s famously expensive professional services division is reduced to zero. If they need a new web service, they can program it themselves. GlobalSight is now used both by companies that are customers of Welocalize and by companies like Intel that are not. Across should pay heed. At a C-Suite level corporates don’t like risk.

However, GlobalSight had a weakness. Unlike Idiom WorldServer it didn’t have its own free CAT tool. Translators had a choice of download formats and could use Trados but Trados licenses are expensive and many translators are slow to upgrade. Smart big companies like to have as much technical control of their supply-chain as possible so Welocalize were on the lookout for a good open-source CAT tool. OpenTM2 was a runner for a while but it proved unsuitable. In 2012 they began an integration effort to make OmegaT compatible with GlobalSight. When I worked with Welocalize as an intern I saw wireframes for an XLIFF editor on the wall but work had not yet started. Armed with data from our productivity tests and Didier Briel, the OmegaT project manager, who was in Dublin to give a talk on OmegaT, I made the case for integrating OmegaT with GlobalSight. It was a lucky guess. Two years later it works smoothly and both applications benefit from each other.

What did I have to gain from this? Data.

So why this blog? Next week I plan to present our instrumentation work at the LocWorld tradeshow and I want Kilgray to pay heed. OmegaT is a threat to their memoQ Translator Pro sales and that threat is not going to reduce with time. Christian and I have implemented a sexy prototype of a two-column working grid, and we can do the same trick importing SDL packages with OmegaT as they do with memoQ. Other large LSPs are beginning to take note of OmegaT and GlobalSight.

However, I am a fan of memoQ, and even though the poison pill has been watered down to homeopathic levels, I also like Kilgray’s style. The translator community has nothing to gain if a developer of a good CAT tool suffers poor sales. This reduces manpower for new and innovative features. Segment-level A/B testing using time data is a neat trick. The recent editing time feature is a step in the right direction, but it could be so much better. The problem is that CAT tools waste inordinate amounts of translator time, and the recent trend towards CAT tools connected to servers makes that even worse. Slow servers that are based on request-response protocols instead of synchronization protocols, slow fuzzy matches, bad MT, bad predictive typing suggestions, hours wasted fixing automatic QA to catch a few double spaces. These are the problems I want to see fixed using instrumentation and independent reporting.
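To make the segment-level A/B idea concrete, here is a minimal sketch in Python. It is an illustration only, not how memoQ or iOmegaT do it: the function names, the `"mt"` / `"no_mt"` labels and the sample figures are all made up. The assumption is that each segment is randomly assigned to a condition before the translator opens it, and that the log records the source word count and the seconds spent per segment.

```python
import random
from collections import defaultdict


def assign_conditions(segment_ids, conditions=("mt", "no_mt"), seed=0):
    """Randomly assign each segment to one condition (hypothetical helper)."""
    rng = random.Random(seed)
    return {sid: rng.choice(conditions) for sid in segment_ids}


def seconds_per_word(segments):
    """segments: iterable of (condition, source_words, seconds_spent)."""
    words = defaultdict(int)
    secs = defaultdict(float)
    for cond, n_words, s in segments:
        words[cond] += n_words
        secs[cond] += s
    # Total time divided by total words, so long segments weigh more.
    return {c: secs[c] / words[c] for c in words if words[c] > 0}


# Made-up timing data for four segments.
log = [
    ("mt", 10, 40.0),
    ("no_mt", 12, 72.0),
    ("mt", 20, 50.0),
    ("no_mt", 8, 48.0),
]
rates = seconds_per_word(log)
print(rates)  # seconds per source word for each condition
```

With this sample data the MT segments come out at 3.0 seconds per word against 6.0 without. Real data would need far more segments before a difference like that means anything, and speed alone says nothing about quality.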

So here is my point in the second person singular. Kilgray – I know you read this blog. Listen! Implement instrumentation and support it as a standard. You can use the web platform Language Terminal to report on the data or do it in memoQ directly. On our side, we plan to implement an offline application and web application that lets translators analyse that data by manually importing it so they can see exactly how much they earn per hour for each client in any CAT tool that implements that standard. €10 says Trados will be last. A wise man once said you get the behavior you incentivize, and the per-word pricing model incentivizes agencies to not give a damn about how much a translator earns per hour. The important thing is to keep the choice about sharing translation speed data with the translator but let them share it with clients if they want to. Web-based CAT tools don’t give them that choice, so play to your strengths. Instrumentation is a powerful form of telemetry and software QA.
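The per-client report is simple arithmetic once the time data exists. A minimal sketch in Python, assuming each job record holds a client name, a word count, a per-word rate and the seconds logged by the CAT tool (the `hourly_earnings` function, the record layout and the figures are hypothetical):

```python
def hourly_earnings(jobs):
    """jobs: iterable of (client, words, rate_per_word, seconds_worked).

    Returns earnings per hour for each client.
    """
    totals = {}
    for client, words, rate, seconds in jobs:
        earned, worked = totals.get(client, (0.0, 0.0))
        totals[client] = (earned + words * rate, worked + seconds)
    return {
        client: earned / (worked / 3600.0)
        for client, (earned, worked) in totals.items()
        if worked > 0
    }


# Same per-word rate, very different hourly outcome (made-up figures).
jobs = [
    ("Client A", 2000, 0.10, 4 * 3600),
    ("Client B", 2000, 0.10, 8 * 3600),
    ("Client A", 1000, 0.10, 2 * 3600),
]
per_hour = hourly_earnings(jobs)
for client, amount in sorted(per_hour.items()):
    print(f"{client}: {amount:.2f} per hour")
```

Here both clients pay the same per-word rate, but Client A works out at about 50 per hour and Client B at about 25. The per-word price hides that difference; the logged time exposes it.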

So to summarize: OmegaT’s place in the language services industry is to keep proprietary CAT tool publishers on their toes!


See also the CNGL interview with Mr. Moran....


  1. John, I heartily agree that implementing a logging protocol such as you describe would provide valuable hard data to facilitate the evaluation of tools and features for real productivity. However, I think that you're mistaken that SDL Trados would come in last in an earnings per hour comparison. That honor will more likely be had by that virtual concentration camp of translation - Across. It was interesting to see the comments from consultants in a Facebook discussion that mentioned Across. None could refute the points about how Across damages the translator's productivity compared to other tools - indeed there was strong agreement - but they felt the need to pander to that cursed technology for the sake of corporate client interests and the chance of a few euros in their pockets. Even the fact that Across damages corporate adopters' interests makes no difference if they can pocket a little money today.

    I'm not sure what you mean by Kilgray's "terminal server web application". Do you mean Language Terminal? That has nothing to do with a terminal server, but it would be a good venue for the reporting for any tool's results I think.

    It's interesting how quick various people who have little or no direct experience with translation are to praise web-based CAT tools as the "future" of translation despite their miserable ergonomics. These tools exist for the convenience of the project managers, and I don't believe that in a single case the work output of the translator is increased by comparison with respectable desktop-based tools. Translators would be idiots to waste their time with any tool which will slash their effective hourly earnings as drastically as a browser-based tool can for significant volumes of work. So for a piece rate economy they are definitely a losing proposition, and instrumentation such as you describe could document that clearly enough. For translation buyers with tight deadlines and high volume needs, data logging such as you describe could steer them away from technologies which would lead to harmful production bottlenecks. Altogether a fine thing.

  2. I had this conversation with John before so I will allow myself to repeat my doubt about MT and productivity gains, at least as a universal determination. I think that there are many variables to factor in to make sure that one compares apples with apples and not with something else. But this is beside the point at the moment.

    I agree that implementing a standardized logging protocol that will produce reliable, comparable results is very important. First, for the translators using the tools, as it will allow them to better quantify their hourly productivity and earnings (an area that is sorely lacking at the moment). Second, it is important to have some kind of standardized benchmark to assess how technology really affects productivity, but I can also imagine how this type of benchmark or comparative tool can be abused in the commercial market. Furthermore, the same issue of data diversity probably applies here as well because not all users are equal in terms of experience, expertise, workflow, work efficiency, quality standards, etc., and while it is relatively easy to measure the effect on productivity, it is more difficult to measure how a specific technology affects quality. For example, there are already orders of magnitude more poor human translation/PEMT being produced each day - quick and on the cheap - compared to high quality human translation, but this doesn't mean that this workflow and business model are better just because they might seem more "productive".

    Still, I'm all in support for a logging protocol/mechanism.

    As for SDL Trados Studio, I'm not sure which version you have referred to in the article, but I think that the APIs of Studio 2014 should allow some access to the editor, although it might indeed be too limited. I'm curious to know.

  3. This post should really be seriously discussed, although reality for many translators might look different (see also Kevin's point on Across). And just a small note on Benten: this tool (development seems to have stopped in 2010) is (according to Wikipedia) a fork of OmegaT.

  4. Our student Miquel Esplà-Gomis has recently released an alternative, free/open-source session logging plugin that works with any OmegaT version 3.0 or higher: https://github.com/mespla/OmegaT-SessionLog/releases (see also https://groups.yahoo.com/neo/groups/omegat/conversations/messages/31522). We are currently working on mining the enormous amount of data each session produces to actually look for productivity indicators as part of a project funded by the Spanish Ministry of Economy and Competitiveness.

  5. Thank you for a very interesting post. It is going to be interesting to see how logging features for CAT tools develop, whether as a standard or plug-in feature. Shai Navé... Studio Time Tracker, released yesterday, is a (free) plug-in for SDL Trados 2014 to log time spent on projects and then use this with other input (hourly rates, task - translating, proofreading, client info) for invoicing purposes. It is a more commercially than academically oriented logging feature, but it certainly lets users log time spent and invoice in terms of time rather than text. More info at https://www.youtube.com/watch?v=kNtVpqwfRlU&feature=youtu.be and http://www.translationzone.com/openexchange/app/studiotimetracker-591.html?action=Download

  6. If I remember well, Benten was an attempt by a Japanese government agency to fund a "local" CAT tool. Benten used a number of OmegaT libraries and as such is not a fork of OmegaT. In fact it is the only project out of the 4 listed on Wikipedia that is not a fork. And apart from Autshumato, none of the projects is under development anymore (as far as one can see from their latest release dates).

  7. Just a quick comment, on a personal level rather than officially from SDL. When SDL purchased WorldServer I don't believe there was ever any intention to mothball SDL TMS. Anyone with any insight to these customers would know this is complete nonsense, as they would also know there was no migration en-masse... probably no migration full stop! Goodness knows where the author of this article got that from... maybe it's wishful thinking on his part!
    I also think there is a complete lack of recognition over the APIs available with Studio which is of course the reason why many organisations use this product in the first place; especially as they can do the work themselves without requiring any additional support from SDL. I think the sort of things being discussed here around MT post editing, predictive typing and speech recognition would be handled at desktop level anyway, so I'm not sure what SDL TMS or SDL WorldServer has to do with this? We're already seeing applications being written that can measure productivity, identify changes and record the work being carried out. Some of them are in the public domain already such as Post-Edit Compare and Studio Time Tracker for example. Imagine the possibilities when you put these two things together... and as the developer of these two publicly available applications is actually the same person you can imagine where it's going next.
    I found this an interesting article, but thought it was spoiled by the inaccuracy and infantile comments about SDL.

  8. Paul - First off allow me to apologise for my factual error. The announcement was to mothball WorldServer when SDL bought it. I'll touch base with Kevin to have that corrected as soon as possible. This was a genuine error and I am suitably embarrassed.

    The reason I mentioned it is that OmegaT now works well with GlobalSight and the work we have done on instrumentation works in a version of OmegaT we call iOmegaT. I mentioned it to highlight the fact that an open-source CAT tool now has the potential to gain traction in corporate environments. It was a salvo aimed squarely at Across but also to highlight the fact that OmegaT is not just a toy for language geeks or a solution for cash-strapped translators who have to use a CAT tool. It is stable and well-liked by a growing proportion of the translator community.

    In fact the €10 bet was based on a private conversation you and I had at the last ELIA conference in Malta. You told me SDL would never make it possible to measure translation speed in Trados so I was betting that SDL would be the last offline CAT tool to implement instrumentation. It was in no way meant to suggest that translators who use Trados are unproductive (when they can work!).

    Just to be clear, I have no religious bias against any piece of software. I am not an open-source fanboy. Though I can translate IT texts from German to English, I am mainly a developer. I have programmed both offline and web-based applications so I have no bias in that regard either. I think web-based tools are fine for short marketing-type jobs but longer complex work over days or weeks requires the sophistication and ergonomic comfort of tools like OmegaT, Trados, memoQ or DejaVu.

    Overall, my feeling is that in general translators are most productive in the CAT tool they are most used to working with (assuming predictive typing and ASR are well supported).

    What I am trying to do with this instrumentation initiative is make it easier for translators to see how technologies like EBMT, SMT, predictive typing and ASR impact on their hourly earnings. I think if more people could see this more would be done by researchers in centres like CNGL to improve them.

    John Moran

  9. Hi John,
    I'm fascinated by your final paragraph in the last comment:
    "What I am trying to do with this instrumentation initiative is make it easier for translators to see how technologies like EBMT, SMT, predictive typing and ASR impact on their hourly earnings. I think if more people could see this more would be done by researchers in centres like CNGL to improve them."

    I'm wondering why you are focusing so much on hourly earnings? When I translate I don't focus on how much I earn per hour. It's a matter of getting the job done by the deadline and the LSP does not know how many hours I spend on a particular job. In fact, as you say in the main text, they have no interest in that. So two jobs of equal word count may take different amounts of time. The freedom of freelancing means I may work 15 hours on one day and not at all on the following 3.

    What's wrong with word-based rates? They provide easy calculations and the linguist sets the limits for how much they want to earn and/or how much they want to work.

  10. Hi Gillian - there is nothing wrong with word based rates. I don't see them going away. The point behind CAT tool instrumentation is to be able to accurately assess the impact of various technologies on translator productivity. For example, imagine predictive typing turning itself on and off at random for segments and then being able to see a report that tells you the impact it had on your seconds per word translation speed after a few days working on a regular account. If you see you are faster maybe you will invest more time in adding words and phrases.

    The point is that as translators we are too busy translating to see exactly how fast we are working under different circumstances and there is not enough visibility on how features like autosuggest aid us in our work. Many people just switch it off because it annoys them. Others report 1000+ words per day improvements but these are guesses.

    The second point I want to make (and apologies for hijacking your question) is that productivity data will likely be used by certain types of LSPs to negotiate lower rates on material that can be translated fast. For that reason, as a buyer of the CAT tool it is important that it is at your discretion with whom that data is shared.

    My motivation for wanting to see this kind of data in CAT tools is that I know it could be used by researchers but I would also find it useful as a translator (e.g. to select an MT engine that saves me the most time).

