
Nov 26, 2023

memoQ term base "roundup" chat

 

The last of the planned office hour discussions for the self-guided online course "memoQuickies Resource Camp" will be held on November 27th at 17:00 CET (8:00 PST). For those not already registered for the Zoom chats, the link to do that is HERE. If you have done so previously, the access URL is the same. The chat is open to everyone, whether or not they are enrolled in the online course.

We'll begin with open Q&A time on any of the course sections in the term base unit or any other presentations of the subject matter by me on this blog or my YouTube channel. Afterward, I will show my general method for editing or updating term base content via Microsoft Excel exports brought into the memoQ working grid, which facilitates certain kinds of changes or actions involving regular expressions. This goes beyond what the integrated memoQ term base editor can do; that editor will also be presented briefly.
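For those who prefer to script such cleanup outside memoQ, a short Python pass over the exported file can apply the same kinds of regex fixes before re-import. This is only a sketch of the idea, with illustrative column names ("English", "German") standing in for whatever headers your own export actually uses:

```python
import csv
import io
import re

def clean_term(term: str) -> str:
    """Typical regex cleanup for exported terms."""
    term = re.sub(r"\s+", " ", term).strip()   # collapse runs of whitespace
    term = re.sub(r"\s*\([^)]*\)$", "", term)  # drop a trailing parenthetical note
    return term

def clean_export(csv_text: str, columns=("English", "German")) -> str:
    """Apply clean_term to the named columns of a CSV term export."""
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        for col in columns:
            if row.get(col):
                row[col] = clean_term(row[col])
        writer.writerow(row)
    return out.getvalue()
```

The same patterns can, of course, be applied directly in the memoQ grid's regex-capable filter and replace functions; the script form is just easier to repeat on many files.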

A recording of the talk will be made available later in the course structure.

In December the course will move on to discussions of QA profiles and other aspects of quality assurance in memoQ, with some live discussion possibilities to be announced. All course material will remain online for access until the end of March. More information on the memoQuickies Resource Camp can be found HERE.

Nov 25, 2023

Book review: "Terminology Extraction for Translation and Interpretation Made Easy"


A few months ago, I received a pre-release copy of this book as a courtesy from the author, terminologist Uwe Muegge, with a request to give a quick language check to the English used by its native German author. As I expected, there wasn't much to complain about, because he has lived in the US for a long time and taught at university there as well as been involved in important corporate roles. I was particularly pleased by his disciplined style of writing, the plain, consistent English of the text and the overall clarity of the presentation. Anyone with good basic English skills should have no difficulty understanding and applying the material.

At the time I read the draft, I was completely focused on language use and style, but I found his approach and suggestions interesting, so I looked forward to "field testing" and direct comparisons with my usual approach to terminology mining with that feature in memoQ. About a day's worth of tests showed very interesting potential for applying the ChatGPT section of the book and also made the context and relevance of the other two sections clearer. I will discuss those sections first before getting to the part that interests me the most.

Uwe presents three approaches: Wordlist from Webcorp, OneClick Terms from Sketch Engine, and ChatGPT.
I wouldn't really call these three approaches alternatives, as the book does, because they operate in very different ways and are fit for different purposes. That didn't register fully in my mind while I was in "editor mode", although the first part of the book makes the differences, advantages and disadvantages clear enough. As soon as I began using each of the sites, however, the differences were quite apparent, as were the similarities to more familiar tools like memoQ's term extraction module.

Wordlist from Webcorp is functionally similar to Laurence Anthony's AntConc or memoQ's term extraction. It's essentially useful for getting frequency lists of words, but the inability to use my own stopword lists for filtering out uninteresting common vocabulary makes me prefer my accustomed desktop tools. However, the barriers to first acquaintance or use are lower than for AntConc or memoQ, so this would probably be a better classroom tool for introducing concepts of word frequencies and identifying possible useful terminology on that basis.
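For comparison, the stopword filtering I miss in Webcorp takes only a few lines in a general-purpose scripting language. This is a minimal sketch with a toy stopword set, useful for classroom demonstrations of the concept rather than as a replacement for a proper concordancer:

```python
import re
from collections import Counter

# A tiny illustrative stopword list -- in real use you would load
# your own, much longer list from a file.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for"}

def frequency_list(text: str, stopwords=STOPWORDS):
    """Return (word, count) pairs, most frequent first, with stopwords removed."""
    words = re.findall(r"[a-zA-Z]+(?:-[a-zA-Z]+)*", text.lower())
    counts = Counter(w for w in words if w not in stopwords)
    return counts.most_common()
```

The point is simply that the stopword list is under the user's control, which is exactly what the web tool does not allow.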

OneClick Terms was interesting to me mostly because friends and acquaintances in academia talk about Sketch Engine a lot. The results were similar to what I get with memoQ, including similar multiword "trash terms". I found the feature for term extraction from bilingual texts particularly interesting, and the fact that it can work well on the TMX files distributed by the Directorate-General for Translation (DGT) of the European Commission suggests that it could be an efficient tool for building glossaries to support translation referencing EU legislation, for example, though I expect only slight advantages over my usual routine with memoQ. These advantages are not worth the monthly subscription fee to me. However, for purposes of teaching and comparison, the inclusion of this platform in the book is helpful. I see more value for academic institutions and those rare large volume translation companies that do a lot of work with EU resources. 
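To illustrate what working with those DGT TMX files involves, here is a minimal sketch (my own, not from the book) that pulls aligned segment pairs out of a TMX document with Python's standard library, as raw material for bilingual term work:

```python
import xml.etree.ElementTree as ET

# TMX marks the language of each <tuv> with an xml:lang attribute,
# which ElementTree exposes under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def tmx_pairs(tmx_text: str, src_lang="en", trg_lang="de"):
    """Yield (source, target) segment pairs from a TMX document.
    Language codes are matched on their primary subtag (en, en-GB, ...)."""
    root = ET.fromstring(tmx_text)
    for tu in root.iter("tu"):
        segs = {}
        for tuv in tu.iter("tuv"):
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None and seg.text:
                segs[lang.split("-")[0]] = seg.text
        if src_lang in segs and trg_lang in segs:
            yield segs[src_lang], segs[trg_lang]
```

Real DGT files also contain inline markup inside some segments, which this sketch ignores; it is meant only to show the basic shape of the data.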

ChatGPT was an interesting surprise. I have a very low opinion of its use as a writing tool (mediocre on its best day, clumsy and boring in nearly all its output) or for regex composition (hopelessly incompetent for what I need, and anything it does right for regex is newbie stuff for which I need no support). However, as a terminology research tool it has shown excellent potential, though formatting the results can be problematic.

My testing was done with ChatGPT 3.5, not a Professional subscription with access to version 4.0. However, I am sorely tempted to try the subscription version to see if it is able to handle some formatting instructions (avoiding unnecessary capitalization) more efficiently. No matter how carefully I try to stipulate no default capitalization of the first letter of every expression, I inevitably have to repeat the instruction after a list of improperly capitalized candidate terms is created.
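An alternative to fighting the prompt is to fix the capitalization after the fact. The sketch below is a crude heuristic of my own for English term lists (it would wrongly lower-case German nouns, for example), and the acronym test is an assumption, not anything from the book:

```python
def decapitalize(term: str) -> str:
    """Lower-case a leading capital unless the first word looks like
    an acronym (e.g. "EU", "TMX"). Crude -- review the results by hand."""
    words = term.split()
    first = words[0] if words else ""
    if first.isupper() and len(first) > 1:  # leave acronyms alone
        return term
    return term[:1].lower() + term[1:]
```

Run over a pasted candidate list, this removes the spurious initial capitals in one pass instead of repeated follow-up prompts.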

I keep an e-book copy of Uwe's book in the Kindle app on my laptop, so I can simply copy and paste his suggested prompts, then add whatever additional instructions I want.

The prompt
Please examine the text below carefully and list words or expressions which may be difficult to translate, but when writing the list, do not capitalize any words or expressions which don't require capitalization.

is too long, and only the first instruction (listing the hard-to-translate words and expressions) is executed correctly, but this follow-up prompt will fix the capitalization in the list:

Please re-examine that text and this time when writing the list, do not capitalize any words or expressions which do not require capitalization.

Further tests involved suggesting translations for the expressions, with or without a translated text and building tables with example sentences:

Other prompt variations, for example to write terms bold in the example sentences, worked without complications.

What about the quality of the selections? Well, I ran memoQ's term extraction module on the same text I had submitted to ChatGPT, in order to compare this new process with something I know quite well.

memoQ identified a few terms based on frequency which ChatGPT ignored, but these were arguably terms that a qualified specialist would have known anyway. ChatGPT, for its part, did a superior job of selecting multi-word expressions, with no "noise". It also selected some very relevant single-occurrence phrases which might be expected to recur in later, similar texts.

Split screen review of memoQ extraction vs. ChatGPT results

The split screenshot is an intermediate result from one of my many tests. The overlaid red box was intended to show a conversation partner the limits of ChatGPT's "alphabetizing skill", and the capitalization of the German is not correct after a prompt to fix the capitalization of adjectives misfired. It is not always trivial to get formatting exactly as I want it. However, looking at the results of each program side by side like this showed me that ChatGPT had in fact identified nearly all of the most relevant single words and phrases in my text. And in other texts with dates or citation formats, these were also collected by ChatGPT as "relevant terms", giving me an indication of what legislation I might want to use as reference documents and what auto-translation rules might also be helpful.

I also found that the split view as above helped me to work my way through the noise in the memoQ term candidate list much faster and make decisions about which terms to accept. The terms of interest found in memoQ but not selected by ChatGPT were few enough that I am not at all tempted to suggest people follow my traditional approach with the memoQ term extraction module and skip the work with ChatGPT.

My preferred approach would be to do a quick screening in ChatGPT, import the results into a provisional (?) term base and then, as time permits, use that resource in a memoQ term extraction to populate the target fields in the extraction grid. With those populated terms in place, I think the review of the remaining candidates would proceed much more efficiently.

All in all, I found Uwe's book to be a useful reference for teaching and for my personal work; it is one of the few texts I have seen on LLM use which is sober and modest enough in its claims that I was inspired to test them. The sale price is also well within anyone's means: about $10 for the e-book and $16 for the paperback on Amazon. For the "term curious" without access to professional grade tools, it's a great place to get started building better glossaries and for more seasoned wordworkers it offers interesting, probably useful suggestions.

The book is available HERE from Amazon.

Sep 4, 2023

New online course: "memoQuickies Resource Camp"

Summer is almost over, but technically, "camping season" will continue in memoQ World until November 30th. Or maybe January 31st, depending on how you count.

Today, a three-month journey of exploration begins, covering six important kinds of resources to make work with the memoQ translation desktop and server environments more pleasant and efficient... and profitable. This self-guided online course will give participants full access to my 14 years of cumulative experience as a memoQ user translating, managing projects and developing hundreds of solutions with this world-leading productivity tool.

Click here or on the icon bar above to have a look at the course description and to see (and maybe download) some of the publicly available information and resources for better work in many language pairs. 

The emphasis of teaching will shift to a new resource every two weeks (with auto-translation rules as the main topic for the first two weeks), but throughout the course, information will be added continuously to all topic sections as I trawl through, sort, upgrade and publish the best or most interesting stuff from my archives. And course participants have access to open virtual office hours each week on Thursdays and some other occasions, where any questions can be asked and special requests made.

A special enrollment discount of 40% is available for the first week (code: HALFOFFLAUNCH) until September 10th, but you can join at any time and work with any of the material posted, ask questions and receive feedback. Learning material and downloadable, ready-to-use and -adapt resources will continue to be added until the end of November, and the full course will remain online through January 2024. Enrollment fees and content are subject to change without notice.

Addendum 1: On Thursday afternoon, September 7th, 2023, a presentation was made to introduce the first course topic - "Auto-translation Rules for Everyone". The recording and slides can be found here.

Addendum 2: Payment options for groups and monthly budgets have been introduced now. These options enable teams, departments and organizations to obtain blocks of passes for their members to receive continuing professional education in translation workflow tools. The host site applies VAT and other taxes where relevant and generates appropriate invoices. All relevant information can be found at the bottom of the information and enrollment page.



May 30, 2022

Cleaning up language variants in memoQ term bases

While the idea of using sublanguage variants, such as UK, US or Canadian versions of English, sounds nice in principle, in practice these often create headaches for users of translation environments such as memoQ, particularly when exchanging glossaries with others but also when viewing and editing the data in the built-in editors. Many times I have heard colleagues and clients express a wish to "go back" and work only with generic variants of a language in order to simplify their management of terminology data. In the video below, I share one method to do so.

At 3:08 in the video, I share a little "aside" about how the exported term data can be edited to mark a term as forbidden (for instance, if its use is not desired by the translation buyer). Other changes to the information are also possible at this stage, such as the addition of context and usage information. Other data fields from the term base can also be included in the export for cleanup if these play an important role in your memoQ term bases.
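As a sketch of what such an edit might look like if scripted rather than done by hand in Excel, the function below flags deprecated target terms as forbidden in a CSV export. The "Target" and "Forbidden" column names and the banned-word list are purely illustrative; check the headers of your own memoQ export before trying anything like this:

```python
import csv
import io

# Hypothetical list of terms the translation buyer does not want used.
BANNED = {"user-friendly", "leverage"}

def flag_forbidden(csv_text: str) -> str:
    """Set the (illustrative) "Forbidden" field to True for banned targets."""
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        if row.get("Target", "").lower() in BANNED:
            row["Forbidden"] = "True"
        writer.writerow(row)
    return out.getvalue()
```

The edited file would then be re-imported with the overwrite setting discussed below so the flags replace the original entries.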

For years, users have requested an editing feature in memoQ that would make "unifying" language variants possible, but as you can see in this video tutorial, this possibility already exists and is neither difficult nor time-consuming to implement. 

If you do not wish to create a new term base for the cleaned-up data (as shown in the video) but would rather bring it back into the same term base, it is important to configure your import settings correctly so that the original data is overwritten and you don't end up with messy duplication of information. This is achieved with the following setting marked in red:


However, it should be noted that the term base will still have all the now-unused language variants, albeit with no entries for them. These can be removed by unchecking the boxes for the respective language variants in the term base's Properties dialog.

Speaking of the Properties dialog, some may have noted that in recent versions of memoQ there is an automated option for cleaning up those unwanted language variants:


Why bother with the XLSX route then? Well, depending on what version of memoQ you use, you may not have that command available in the dialog. More importantly, when merging data from various language variants I often want to do additional editing of the term information, and that really isn't possible in the Properties dialog. Doing the edits in Microsoft Excel gives you an overview of the data and the option to make whatever adjustments may be needed, such as altering the match properties for better hit results or more accurate quality assurance.

Jan 6, 2021

Tweeting away....

Got up this morning to not altogether unexpected good news that the Empire of MAGATs has fallen:

 

Yeah. Life is starting to feel normal again despite the usual continued death and destruction. But what does one do with babies if not put them in cages? 

A course announcement for terminology users in memoQ (i.e. any sensible user):
... which leads one to ask: How do I get there? Well, try this:

Sep 26, 2019

10 Tips to Term Base Mastery in memoQ! (online course)

Note: the pilot phase for this training course has passed, free enrollment has been closed, and the content is being revised and expanded for re-release soon... available courses can be seen at my online teaching site: https://transtrib-tech.teachable.com/
In the past few years I have done a number of long webinars in English and German to help translators and those involved in translation processes using the memoQ environment work more effectively with terminology. These are available on my YouTube channel (subscribe!), and I think all of them have extensive hotlinked indexes to enable viewers to skip to exactly the parts that are relevant to them. A playlist of the terminology tutorial videos in English is available here.

I've also written quite a few blog posts - big and small - teaching various aspects of terminology handling for translation with or without memoQ. These can be found with the search function on the left side of this blog or using the rather sumptuous keyword list.

But sometimes just a few little things can get you rather far, rather quickly toward the goal of using terminology more effectively in memoQ, and it isn't always easy to find those tidbits in the hours of video or the mass of blog posts (now approaching 1000). So I'm trying a new teaching format, inspired in part by my old memoQuickie blog posts and past tutorial books. I have created a free course using the Teachable platform, which I find easier to use than Moodle (I have a server on my domain that I use for mentoring projects), Udemy and other tools I've looked at over the years.

This new course - "memoQuickies: On Better Terms with memoQ! 10 Tips toward Term Base Mastery" - is currently designed to give you one tip on using memoQ term bases or related functions each day for 10 days. Much of the content is currently shared as an e-mail message, but all the released content can be viewed in the online course at any time, and some tips may have additional information or resources, such as videos or relevant links, practice files, quality assurance profiles or custom keyboard settings you can import to your memoQ installation.

These are the tips (in sequence) that are part of this first course version:
  1. Setting Default Term Bases for New Terms
  2. Importing and Exporting Terms in Microsoft Excel Files
  3. Getting a Grip on Term Entry Properties in memoQ
  4. "Fixing" Term Base Default Properties
  5. Changing the Properties of Many Term Entries in a Term Base
  6. Sharing and Updating Term Bases with Google Sheets
  7. Sending New Terms to Only a Specific Ranked Term Base
  8. Succeeding with Term QA
  9. Fixing Terminology in a Translation Memory
  10. Mining Words with memoQ
There is also a summary webinar recorded to go over the 10 tips and provide additional information.
I have a number of courses already developed (which may or may not be publicly visible, depending on when you read this) and others under development. In these I try to tie together the many learning resources available for various professional translation technology subjects, because I think this approach offers the most flexibility and the best likelihood of communicating necessary skills and knowledge to a wider audience than I can serve with the hours available for consulting and training in my often too busy days.

I would also like to thank the professional colleagues and clients who have provided so much (often unsolicited) support to enable me to focus more on helping translators, other translation project participants and translation consumers work more effectively and reduce the frustrations too often experienced with technology.


May 27, 2017

CAT tools for weapons license study

More than a decade ago I found a very useful book on practical corpus linguistics, which has had perhaps the greatest impact of any single thing on the way I approach terminology. Among other things, it discusses how to create special text collections for particular subjects and then mine these for frequently used expressions in those domains. It has become a standard recommendation in my talks at professional conferences and universities as well as in private consultations for terminology.

Slide from my recent talk at the Buenos Aires University Facultad de Derecho
In the last two weeks I had an opportunity to test my recommendations in a somewhat different way than I usually apply them. Typically I use subject-specific corpora in English (my native language) to study the "authentic" voice of the expert in a domain that may be related to my own technical specialties but differs in its use of language in significant ways. This time I used these techniques to study subject matter I know reasonably well (the features, use and safety aspects of firearms for hunting), with the aim of acquiring vocabulary and an idea of what to expect in a weapons qualification test in Portugal, where I have lived for several years without yet achieving satisfactory competence in the language for my daily routine.

It all started two weeks ago when I attended an all-day course on Portugal's firearm and other weapon laws in Portalegre. Seven and a half solid hours of lecture left me utterly fatigued, but it was an interesting day with many aha! moments as I saw concepts presented in Portuguese which I knew well in German and English. Most of the time I looked up words from the slides or the course textbook prepared by the PSP and made pencil notes on vocabulary in my book.

Twelve days afterward I was scheduled to take a written test, and in the unlikely event that I passed it, I would be subject to a practical examination on the safe use of hunting firearms and related matters.

Years ago when I studied for a hunting license in Germany I had hundreds of hours of theoretical and practical instruction in a nine-month course concurrent with a one-year understudy with an experienced hunter. Participants in a German hunting course typically read dozens of supplemental books and study thousands of sample questions for the exam.

The pickings are a little slimmer in Portugal.

As far as I am aware, there are no study guides in Portuguese or any other language to help prepare for the weapons tests, except the slim book prepared by the police.

There are, however, a number of online forums where people talk about their experiences in the required courses and on the tests. Sometimes there are sample questions reproduced with varying degrees of accuracy, and there is a lot of talk about things which people found particularly challenging.

So I copied and pasted these discussions into text files and loaded them into a memoQ project for Portuguese-to-English translation. The corpus was not particularly large (about 4,000 words altogether), so the number of candidates found in a statistical survey was limited, but it was still useful to someone with my limited vocabulary. I then proceeded to translate about half of the corpus into English, manually selecting less frequent but quite important terms and making notes on perplexing bits of grammar or tricks hidden in the question examples.
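The statistical survey itself is nothing magical; a few lines of Python can produce a rough list of two-word candidates from such a small corpus. This sketch counts raw bigrams, so stopword pairs still have to be weeded out by hand, just as in a term extraction grid:

```python
import re
from collections import Counter

def bigram_candidates(text: str, min_count: int = 2):
    """Return two-word candidates occurring at least min_count times,
    most frequent first. Raw and unfiltered -- review by hand."""
    words = re.findall(r"\w+", text.lower(), flags=re.UNICODE)
    pairs = Counter(zip(words, words[1:]))
    return [(" ".join(p), n) for p, n in pairs.most_common() if n >= min_count]
```

On a 4,000-word forum corpus, a list like this mostly confirms what a tool such as AntConc or memoQ's extraction module would show; the value is in reviewing the candidates, not in the counting.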

A glossary in progress as I study for my Portuguese weapons license
The glossary also contained some common vocabulary that one might legitimately argue does not belong in a specialist glossary, but since these were common words likely to occur in the exam and I did not know them, it was entirely appropriate to include them.

Other resources on the subject are scarce; I did find a World War II vintage military dictionary for Portuguese and English which can easily be made into a searchable PDF using ABBYY Finereader or other tools but not much else.

Any CAT tool would have worked equally well for my learning objectives - the free tools AntConc and OmegaT are in no way inferior to what memoQ offered me.

On the day of the test, I was allowed to bring a Portuguese-to-English dictionary and a printout of my personal glossary. However, the translation work that I did in the course of building the glossary had imprinted the relevant vocabulary rather well on my mind, so I hardly consulted either. I was tired (having hardly slept the night before) and nervous (I mixed up the renewal intervals for driver's licenses and hunting licenses), and I just didn't have the stamina to pick apart some particularly long, obtuse sentences, but in the end I passed with a score of 90% correct. That wouldn't win me any kudos with a translation customer, but it allowed me to go on to the next phase.

Practical shooting test at the police firing range
During the day of lectures, I had dared to ask only one question, and I garbled it so badly that the instructor really didn't understand it, so I was not looking forward to the oral part of the exam. But much to my surprise, I understood all the instructions on exam day, and I was even able to joke with the policeman conducting the shooting test. In the oral examination, in which I had to identify various weapons and ammunition types and explain their use and legal status, and in the final part, where I went on a "hunt" with a police commissioner to demonstrate that I could handle a shotgun correctly under field conditions and respond appropriately to a police check, I had no difficulties at all except remembering the Portuguese word for "trigger lock". All the terms I had drilled for passive identification in the written exam had unexpectedly become active vocabulary, and I was able to hold my own in all the spoken interactions - not a usual experience in my daily routine.

The same professional tools and techniques that I rely on for my daily work proved far better as learning aids for my examination, and in a much greater scope, than I had expected. I am confident that a similar application could be helpful in other areas where my understanding and active use of Portuguese are weak.

If it works for me, it is reasonable to assume that others who must cope with challenges of a test or interactions of some kind in a foreign language might also benefit from learning with a translator's working tools.

Feb 3, 2014

Colors in memoQ lookup results - which termbase?


A subject that comes up time and again with experienced colleagues is the desire to distinguish more easily where matches come from in memoQ. Of course, clicking on a match in the Translation Results pane of memoQ's working grid provides additional information for each type of resource (a LiveDocs match shows different information than one would expect from a TM hit, a termbase match, a non-translatable or another kind of entry). But many want more obvious information in the working display and bilingual RTF exports to indicate the source of matches.

In the graphic at the left, segment matches from two different translation memories and two different LiveDocs corpora are shown. There is no visual clue to indicate the differences between these corpora and TMs. One would have to click on a particular entry and look at the meta-information at the bottom right to see which data collection the hit came from, the name of the document translated, when the translation unit was created, who wrote it, etc.

With termbases, however, the situation is now different. For those with excellent eyesight (I don't really qualify), there are subtle gradations of color to reflect termbase priority, with higher-priority termbases showing darker colors for their hits. This is clever and useful, but unfortunately it cannot be customized in any meaningful way at this time, as far as I can tell. I might like to set a special match color for a termbase I want to take particular note of but which has a lower (and hard to distinguish) priority. Can you tell how many termbases are showing hits in the screenshot here? Look carefully.

I find the color cues used for matches in the memoQ working window quite helpful in most cases. Although these can be customized under Tools > Options > Appearance > Lookup results, I refuse to do so, because I use these color cues to explain things to other users sometimes, and I cause enough chaos telling them to use my personal customized keyboard shortcuts based on an old version of Déjà Vu, having long forgotten what the default keyboard shortcuts are. I also don't see an easy way to reset the default colors if I mess things up.


I'm grateful for the little bit of help that color differentiation in termbase results provides in recent versions of memoQ, and I hope that Kilgray takes this concept further. Similar gradations of color for TMs and LiveDocs would be helpful, and it would be very nice if custom colors could be assigned temporarily to particular resources of any type where some collections of data require special consideration. And then we need a simple way to reset those temporary color assignments.

If you agree with this or have other ideas for improving the accessibility of match result information, please write to support@kilgray.com and express your thoughts. Too often users remain passive with their frustrations and their thoughts about changes or additional features needed. Kilgray tends to be a very responsive solution provider, but if the user community does not express its needs clearly and consistently, it's not reasonable to expect that what we need will happen, and it's even less reasonable to be annoyed when it doesn't. In the five years I have used memoQ, things have often taken time to implement, but the developers and product designers have usually given careful thought to matters and have mostly exceeded my expectations when they do provide a solution.