May 13, 2013

Corpus terminology workshop in the Netherlands

The professional translators' group Stridonium is organizing a networking event on June 17, 2013 in Holten (NL) to teach translators in legal, financial and other domains the effective use of text collections (corpora) for identifying important terminology.

The NIFTY corpus methodology uses specialized texts compiled by translators themselves to find appropriate target-language terms for particular types of text (such as joint venture agreements, offering circulars or divorce decrees). The methodology applies to all language pairs and was developed for efficiency (requiring 30 minutes on average) to meet the needs of working translators.

Further details on the workshop and registration are available here.


  1. Hi Kevin,

    Maybe a little off topic, but I just came across an interesting corpus search website which searches 29 different corpora!

    It searches the following corpora:

    1k Graded Corpus (530,000)
    2000 List Corpus (240,000)
    2k Graded Corpus (920,000)
    AA Academic Abstracts
    Academic Abstracts (174,000)
    BNC Commerce (3.8 million)
    BNC Humanities (3.3 million)
    ***BNC Law (2.2 million)***
    BNC Med (1.4 million)
    BNC speech (10 million)
    BNC Spoken (1 million)
    BNC Written (1 million)
    Brown (1 million wds)
    Brown + BNC Written (2+ m)
    Call of the Wild (24,000)
    Focus on Vocab (82,300)
    JPU Learner (300,000)
    NNS-Ts in Korea (123,000)
    NS-Ts in Korea (124,000)
    Presidential speeches (1.98 million)
    RAC Academic (103,000)
    RAC Research Articles Corpus (HK, 132,000 wds)
    TC Learner (Student) (150,000)
    TC Learner (Teacher) (61,000)
    TESL Prog (3,400)
    Univ. Word List (550,000)
    US TV Talk (2 million)
    V - Marlise
    Yenny Korean EFL teachers corpus


  2. Michael, what I particularly like about the method taught in this workshop is its focus on careful text selection with manageable scope in a specific specialist area. These large "bucket" corpora are more general in scope and likely less suited to making the sort of distinctions we would need.

  3. I poked around a bit in the legal and medical corpora - not bad for examples of general vocabulary - and then I discovered the link to bilingual dictionaries at the top of the concordance hitlist. Dangerous stuff in the hands of the ignorant. The English>German dictionary search for a medical term I was looking at pulled up hits that mostly had to do with traffic :-) It seems ill-advised to link dictionaries with no consideration of context.

  4. Hi Kevin,
    Yes, having attended Juliette's workshop at the Legal Translators' Conference in Portugal, I can confirm that this method promotes "developing *specific* domain terminologies in an efficient manner".

  5. Hi Kevin,

    Yes, that is of course always a danger of letting someone else build your corpus. I am currently trying to build a few of my own with tlCorpus (which btw now accepts PDFs!), but my time is limited and these ready-made online ones sure are a lot easier to set up;)

    Incidentally, I couldn't find those links to bilingual dictionaries you mentioned. Where exactly did you see them?


    1. The dictionaries are at the top of the concordance hitlists.

      PDF isn't much cause for excitement I think, given the usual range of difficulties with that format. You've been able to read text-extractable PDFs into memoQ corpora for a long time now, though the word order for a complex layout gets garbled. With many of the texts that would interest me I would have to do a decent OCR to get the words in a proper, usable sequence.

      Limited time is a factor that is taken into account very well in the approach promoted by this workshop. I've used it to create corpora for corporate sustainability reports and numerous specialized domains of technical marketing such as fire safety or security, and building and indexing a usable corpus for a project's specific area of interest typically takes well under an hour. And of course I add to it as new, relevant material is found. The only great divergence for me from the NIFTY approach is that I use memoQ LiveDocs as my repository, and with the improvements in concordance search in the next version (now in beta) the few disadvantages of doing so have been reduced. (I'm thinking particularly of researching collocations, which will now be easier in memoQ, if not as nice as in many dedicated concordancing tools.)

