Nov 27, 2008

Practical use of corpora in acquiring or enhancing a translation specialty

For a long time now, I've had an interest in the use of text collections in particular subject domains as a means of developing specialized terminologies for my own use or the use of my clients. When I found my first practical guide on the use of bilingual corpora for terminology development, I immersed myself in the topic with delight, bought myself a license for Trados MultiTerm Extract and began to mine a lot of my past work to upgrade the termbases I had been creating for years.

There are, however, other useful applications of corpus linguistics for working translators. One interesting approach is described by Maher, Waller and Kerans in the July 2008 issue of the Journal of Specialised Translation in an article titled "Acquiring or enhancing a translation specialism: the monolingual corpus-guided approach".

The article discusses absolutely practical ways in which working translators can use concordancers and desktop-based indexers to acquire or enhance linguistic expertise for special subjects in their target languages. The target readers for the article are
  • novice translators seeking to specialize
  • experienced generalists who want to go "up market" with a specialty
  • translators who wish to enhance their subject-area expertise for a special client
  • translators working in a team who need to harmonize their use of language
The basic differences between concordancers and indexers are explained, and specific tools are mentioned (the freeware AntConc concordancer from Laurence Anthony and the commercial but inexpensive desktop indexer Archivarius). A concordancer looks up and aligns keywords (generally in the KWIC or keyword-in-context view) to allow you to see how they are used and identify patterns; frequency assessments and other functions are also available. A desktop indexer works like Google or any other search engine, except that it allows specific folders on the user's computer to be selected, so special text collections can be searched in a focused way.
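To make the KWIC idea concrete, here is a minimal toy sketch of what a concordancer does with a keyword. This is my own illustration, not AntConc's actual implementation: each hit is placed in a central column with a fixed amount of context on either side, which is what makes usage patterns jump out at you.

```python
import re

def kwic(text, keyword, width=30):
    """Keyword-in-context: one row per hit, with the keyword aligned
    in a central column and `width` characters of context per side."""
    rows = []
    for m in re.finditer(re.escape(keyword), text, re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        rows.append(f"{left:>{width}} {m.group(0)} {right:<{width}}")
    return rows
```

Run over a real corpus file, the aligned column lets you scan at a glance which verbs, prepositions and collocates a term actually takes in authentic texts.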

After reading the article, I downloaded the tools and tested them. I was very impressed. Archivarius is much better than Copernic, which I have used for some time - a key difference is that it can deal with morphology in 18 languages. I personally only care about two of these, but it ought to make many translators happy. A 30-day fully functional trial with 99 launches is available, and individual licences range between about 20 euros for students and 45 euros for businesses. (Maybe a freelancer qualifies for the 30 euro "personal" license - that wasn't clear to me when I looked at the web site. I'll find out, however, because I will license this tool!) Dealing with morphology means, for example, that I can search "gleich" and get "gleiche", "gleicher", "gleichen" and "gleiches" in German.
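The "gleich" example above can be sketched in a few lines of code. This is a deliberately crude suffix-stripper of my own, just to show the principle; a real morphology engine like the one in Archivarius uses proper linguistic analysis, not this kind of guesswork.

```python
def stem_de(word):
    """Very crude German suffix stripping -- a toy stand-in for real
    morphological analysis. Strips common inflectional endings."""
    w = word.lower()
    for suffix in ("es", "er", "en", "e"):
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: -len(suffix)]
    return w

def morph_search(query, words):
    """Return every word that shares a (crudely computed) stem with
    the query, so one search catches all the inflected forms."""
    target = stem_de(query)
    return [w for w in words if stem_de(w) == target]
```

The point is simply that a morphology-aware tool saves you from running four or five separate literal searches for what is, to a translator, one word.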

The article has a nice discussion of access to free, readily-available texts. I also discovered in my research that there are large corpora covering specialist domains available for free in some languages. The American National Corpus is one example - I found a Berlitz travel corpus there with over a million words. Not my interest, but for someone who specializes in tourism or wants to, this is probably useful. The authors put together a special corpus for corporate financial reports using publicly available documents, and other examples were given.

The discussion of sampling adequacy is very valuable in my opinion. This is a question which has nagged me for a long time; the several books on the subject of corpus linguistics which are in my library dance around this issue and never commit to hard numbers that I can do something with. I am grateful to the authors for sticking their necks out and saying, for example, that while 40,000 words might be an adequate basis for a language teacher wanting to get started in a specialist area, a translator's linguistic questions probably won't be usefully addressed with less than about 250,000 words, with 500,000 being the point where things really start to get good.

The authors use a practical model with two tiers: a high-quality base or "substrate corpus", which is carefully selected, maintained and cleaned of reference lists, non-linguistic content and extra spaces (these screw up frequency counts for phrases, not to mention their identification), plus Q&D (quick-and-dirty) corpora, which cover specific topics for a current job, etc. Q&D corpora of a million words or more can be assembled in minutes using online corpus collectors, such as the Sketch Engine. The article gives good, practical advice on balancing these two types of resources and how they can and should be stored on your hard drive.
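Why do stray spaces matter so much? A quick sketch of my own makes it obvious: a double space (the kind PDF extraction loves to leave behind) silently breaks phrase counting, so cleaning the substrate corpus pays off directly in better frequency data.

```python
import re
from collections import Counter

def clean(raw):
    """Collapse all runs of whitespace (newlines, tabs, the stray
    double spaces left by PDF extraction) to single spaces."""
    return re.sub(r"\s+", " ", raw).strip()

def bigram_counts(text):
    """Count two-word phrases; assumes whitespace-clean input."""
    tokens = text.split(" ")
    return Counter(zip(tokens, tokens[1:]))
```

On the dirty text, the phrase count comes out wrong or zero; after clean(), the same bigram is counted correctly. Exactly the sort of invisible damage the authors warn about.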

The discussion of "fair use" is thoughtful. I agree with it, but others, including some lawyers, may not. These topics get debated in public forums a lot, and have been the subject of articles in professional journals as well where intellectual property issues regarding translations and translation memories are raised. For those with an interest in such topics, there is enough out there to keep you busy reading for months. I tend to be cautious and share resources only when I am sure no legal objections will be raised.

The authors offer practical advice on storing and organizing corpora, including the importance of naming conventions for files and maintaining a log of corpora. This advice should be read carefully, as it reflects some hard-won experience.
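A corpus log needn't be anything fancy. As one hypothetical convention (my own, not the authors' prescription), a little script can append a row per text file with its name, word count and the date it was added, which is enough to answer "what is in this corpus and how big is it?" months later.

```python
import csv
import os
import time

def log_corpus(folder, log_path):
    """Append one CSV row per plain-text corpus file in `folder`:
    file name, word count, date logged. The three-column layout is
    just one illustrative convention."""
    with open(log_path, "a", newline="", encoding="utf-8") as log:
        writer = csv.writer(log)
        for name in sorted(os.listdir(folder)):
            if not name.endswith(".txt"):
                continue
            with open(os.path.join(folder, name), encoding="utf-8") as f:
                words = len(f.read().split())
            writer.writerow([name, words, time.strftime("%Y-%m-%d")])
```

Rerun it whenever texts are added and the log doubles as a rough check that a specialty corpus is approaching the word counts discussed above.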

In their conclusions, the authors emphasize that this approach not only has value for compensating uneven or insufficient knowledge of a field, genre or register, but it can also be important for counteracting source language interference for people like me who live in the country where the source language is spoken and do not have daily contact with speakers and the culture of the target language. That's a valuable point, because many of us have observed such problems with ourselves or others. (If you haven't, you're either a hermit or just incredibly dense.)

Given how often the question of specialization and how to acquire it is raised on forums like Translator's Café or ProZ, I think the article can help an enormous number of translators improve their situation. I particularly appreciated the clear language of the article and its example-based, practical advice. Many people who try to investigate the topic of corpus linguistics, especially those coming to translation from career or educational backgrounds other than languages or linguistics, get snowed under too quickly in a blizzard of academic bullshit. This is an article that you can read in an hour and apply in a useful way in the next hour.


  1. The importance of acquiring, maintaining and consulting monolingual corpora is seldom emphasized by schools and mostly overlooked by freelancers. To the sources mentioned in this post I would add the free corpora by Mark Davies, and to the tools I would add DocFetcher, which supports regular expressions in searches, and Logiterm Pro, which is commercial software with many capabilities (and a huge price tag).

  2. Hector, have you seen the article on the NIFTY method? In it I refer to a book by Lynne Bowker, which I find to be one of the most accessible treatments of practical corpus linguistics I have encountered. Since this article was written, the introduction of LiveDocs in Kilgray's memoQ has added other interesting aspects to the use of monolingual corpora in translation, and with predictive typing features some rather nice support is available for acquiring an authentic "voice" in the target language.

    The usual focus on bilingual corpora by novice translators is unfortunate; an understanding of the power of good monolingual corpora can usually take them much further. There needs to be a lot more teaching on this topic.


Notice to spammers: your locations are being traced and fed to the recreational target list for my new line of chemical weapon drones :-)