Jun 14, 2024

Substack in #xl8 & #l10n

As related in my last post, I've been testing the Substack platform as an alternative means of distributing and managing the kind of information I've published for many years now on this blog, professional sites and social media, course authoring platforms, YouTube, Xitter, Mastodon and probably a few other channels I've forgotten. And I'm quite surprised to find that any expectations I had have been exceeded so far. On the whole I can do a better job of creating accessible tutorials and other reference information there, and there is far less effort involved with maintenance, style sheet fiddling and whatnot.

All my various projects there can be found here on my profile page.

Other plans include The Diary and Letters of Charles Berry Senior, a US Civil War veteran from Yorkshire, England, who participated in Sherman's march to the sea and who is my great-great-grandfather. Some of his records exist in very deteriorated form in a university archive in the United States and can be found online, but much is missing. I am in possession of the complete transcript prepared by one of his daughters more than 100 years ago, when the family feared the information would be lost to the forces of material decay. I'll be preparing clean text from her handwritten record (the typescript a cousin made about 50 years ago has been lost and probably contained more errors anyway) and probably an audio reading. There's some hard stuff there, as well as some surprises and interesting lessons in how our world has changed in the past 160 years.

And at some point I'll probably share some of my culinary obsessions as well as what life was like traveling on the other side of the Iron Curtain sans papers or in dingy Paris bookshops and refugee hotels 40 years ago and more.

This past week, I've done a big blitz on memoQ LiveDocs, for which there are still another dozen or so drafts to be finished, and a lot of stuff from my CAT tools resource online class from last year should be appearing there in updated form in due course. There isn't a lot of translation-related activity that I've found on Substack, at least not for the technical side of things, but a lot of historians and authors I follow are very present there, so I'm hoping the better half of my translation technologist friends will join the Substack party at some point.

Most of my new text, video and teaching content will appear on those Substack channels. It's simply far easier to manage, and there you won't have the same RSS headaches you might have here. And the damned Google Blogger editor just keeps accumulating bugs that I don't have to cope with in Substack. And don't get me started on bloody WordPress!

I hope to see you in my Substack channels soon! I think there will be something like a memoQ QA course there before the summer is over....

Oct 29, 2019

Bilingual EU legislation the easy way in #xl8


Translators of European languages based in the EU, and many others, often deal with citations of EU legislation or need to consult relevant EU legislation for terminology in their translations. One popular source of information for that is the EUR-LEX website, which provides a convenient archive of legislation and related information, with the possibility of multilingual text displays, as seen here:


Some years ago, I published a description of how data from these multilingual EUR-LEX displays can be transferred to translation memories or other corpora for reference purposes, and more recently I produced a video showing this same procedure. But some people don't like the paragraph-level alignment format of the EUR-LEX displays, and these can also occasionally be seriously out of sync for some reason, as in this example (or worse):


Now I don't find that much of a nuisance when I use memoQ LiveDocs, because I can simply view the full bilingual document context and see where the corresponding information really is (kind of like leaving alignments in memoQ uncorrected until you actually find a use for the data and determine that the effort is worthwhile). But if you plan to feed that aligned data to a translation memory, it's a bit of a disaster. And many people prefer data aligned at the sentence level anyway.

Well, there is a simple way to get the EU legislation texts you want, aligned at the sentence level, with the individual bitexts ready to import into a translation memory, LiveDocs corpus or other reference tool. See that document number above with the large red arrow pointing to it? That's where you start....

Did you know that much of the information available in EUR-LEX is also available in the publicly available DGT translation memories? These are sentence-level alignments. But most people go about using this data in a rather klutzy and unhelpful way. The "big data" craze some years ago had a lot of people trying to load this information into translation memories and other places, usually with miserable results. These include:

  • the inability to load such enormous data quantities in a CAT tool's TM without having far more computer RAM than most translators ever think they'll need;
  • very slow imports, some apparently proceeding on a geological time scale; 
  • data overload - so many concordance hits that users simply can't find the focused information they need; and
  • system performance degradation, with extremely sluggish responses in a wide variety of tasks.
Bulk data is for monkeys and those who haven't evolved professionally much beyond that stage. Precision data selection makes more sense, and enables better use of the resources available. But how can you achieve that precision? If I want the full bilingual text of EU Regulation No. 575/2013 in some language pair, for example, with sentence-level alignment, how can I find that quickly in the vast swamp of DGT data?

Years ago, I published an article describing how it is better to load the individual TMX files found in the downloadable ZIP archives from the DGT into LiveDocs so that the full document context can be seen from concordance searches. What I didn't mention in that article is that the names of those individual TMX files correspond to the document numbers in EUR-LEX.

Armed with that knowledge, you can be very selective about which data, and how much, you load from the DGT collection. For example, if you organize the data releases in folders by year...


... and simply unpack the ZIP files in each year's folder...


... each folder will contain TMX files...


... the names of which correspond to the document number found in EUR-LEX. So a quick search in Windows Explorer or by other means can locate the exact document you want as a TMX file ready to import into your CAT tool:


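If you prefer scripting to clicking around in Windows Explorer, the unpacking and searching can be done in one pass with a few lines of Python. This is only a sketch under the assumptions described above: a folder per release year containing the downloaded ZIP archives, and TMX file names matching the EUR-LEX document numbers (32013R0575, the document number for Regulation No. 575/2013, is used as the example here).

    import zipfile
    from pathlib import Path

    # Root folder containing one subfolder per DGT release year,
    # each holding the downloaded ZIP archives (assumed layout).
    DGT_ROOT = Path("DGT-TM")
    DOC_NUMBER = "32013R0575"   # EUR-LEX document number to locate

    # Unpack every ZIP archive in place (re-running simply overwrites).
    for zip_path in DGT_ROOT.rglob("*.zip"):
        with zipfile.ZipFile(zip_path) as zf:
            zf.extractall(zip_path.parent)

    # Find the TMX file(s) whose names match the document number.
    for tmx in DGT_ROOT.rglob(f"{DOC_NUMBER}*.tmx"):
        print(tmx)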
These TMX files typically contain 24 EU languages now, but most CAT tools will filter just the language pair you want. So the same file can usually give you Polish+French, German+English, Portuguese+Greek or whatever combination you need among the languages present.
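Your CAT tool normally does this language filtering for you on import, but since TMX is plain XML, the same filtering is easy to script if you ever need a single language pair outside a CAT tool. A minimal sketch (file names are placeholders; matching is done on the main language only, so DE-DE and DE both count as German):

    import xml.etree.ElementTree as ET

    XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

    def tuv_lang(tuv):
        # TMX 1.4 uses xml:lang; some older files use a plain lang attribute.
        return (tuv.get(XML_LANG) or tuv.get("lang") or "").upper()

    def extract_pair(tmx_file, src="DE", tgt="EN"):
        """Yield (source, target) text pairs for one language combination."""
        tree = ET.parse(tmx_file)
        for tu in tree.iter("tu"):
            segs = {}
            for tuv in tu.iter("tuv"):
                lang = tuv_lang(tuv).split("-")[0]   # DE-DE -> DE
                seg = tuv.find("seg")
                if seg is not None:
                    segs[lang] = "".join(seg.itertext())
            if src in segs and tgt in segs:
                yield segs[src], segs[tgt]

    # Example: write a tab-delimited bitext in document order.
    with open("32013R0575_de-en.txt", "w", encoding="utf-8") as out:
        for de, en in extract_pair("32013R0575.tmx", "DE", "EN"):
            out.write(f"{de}\t{en}\n")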

I still prefer to import my TMX data into a LiveDocs corpus in memoQ, where I can use the option to import a folder structure. In the import dialog, I simply enter the name of the file I want, and all other files (thousands of them) are promptly excluded:


After I enter the file name in the Include files field, I click the Update button to refresh the view and confirm that only the file I want has been selected. Depending on where in memoQ you do the import, you may have to specify the languages to extract (in the Resource Console) or not (in a project, where the languages are already set). Of course, the data can also be imported to a translation memory in memoQ, but that is an inferior option, because then it is not possible to read the reference document in a bilingual view as you can in a LiveDocs corpus; only isolated segments can be viewed in the Concordance or Translation results pane.

How you work with these data and with what tools is up to you, but this procedure will provide you with a number of options for better data selection and improved access to the reference data you may need for EU legislation without getting stuck in the morass of millions of translation units in a performance-killing megabomb TM.

Apr 12, 2019

memoQ LiveDocs: What Good Is It, Anyway? (webinar, 23 April 2019)


I'll be presenting a free webinar on the uses, advantages, and quirks of memoQ's often underestimated and misunderstood LiveDocs module: everything you always needed to know about its features and corpora but never thought to ask.

When? On 23 April 2019 at 3:00 PM Lisbon time (4:00 PM Central European Time, 7:00 AM Pacific Standard Time); webinar attendance is free, but advance registration is required. The URL to register is:


After registering, you will receive a confirmation e-mail containing information about joining the presentation.

Particular questions regarding the subject matter or challenges you face with it are welcome in advance via e-mail, LinkedIn or other social media contact. I'll try to incorporate these in the presentation.

Accessing full bilingual document context in an alignment in LiveDocs from the memoQ Concordance

Dec 5, 2018

Bilingual Excel to LiveDocs corpora: video

A bit over three years ago, I published a blog post describing a simple way to move EUR-Lex data into a memoQ LiveDocs corpus so that the content can be used for matching and concordance work in translation and editing. The particular advantage of a LiveDocs corpus versus a translation memory is that the latter does not allow users to read the document context for concordance hits.

A key step in the import process is to bring the bilingual content in an Excel file into a memoQ project as a "translation document" and then send that content to LiveDocs from the translation files list. Direct import of bilingual content to a LiveDocs corpus using the multilingual delimited text import filter is still not possible, despite years of requests to the memoQ development team.

This is irritating, though in the case of a EUR-Lex alignment which may be slightly out of sync and need fixing, it is perhaps all for the best. And in some other situations, where the content may be incomplete and require filtering (in a View) before sending it to the corpus, it also makes sense to bring the file in as a translation document first to use the many tools available for selecting and modifying content. However, in many situations, it's simply a nuisance that the files cannot be sent directly to a LiveDocs corpus.

In any case, I've now done a short (silent) video to make the steps involved in this import process a little clearer:


Jun 22, 2017

Translation alignment: the memoQ advantage

The basic features for aligning translated texts in memoQ are straightforward and can be learned easily from Kilgray documentation such as the guides or memoQ Help. However, there are three aspects of alignment in memoQ which I think are worth particular attention and which distinguish it in important ways from alignments performed with other translation environment tools or aligners.

Aligning the content of two folders with source and target documents; automatic pairing by name
The first is memoQ's ability to pair source and target documents for alignment automatically based on the similarity of their names. This means that large numbers of files can be aligned in a single operation, whether as individual files or as entire folders containing perhaps hundreds of documents. Thus if the source files are in one folder, the translated files are in another, and the source and target file names are similar, the alignment process for a great number of files can be set up and run in a matter of minutes. Note in the example screenshot above that different file types may be aligned with each other.

The second important difference with alignment in memoQ is that it is really not necessary to feed the aligned content to a translation memory. memoQ LiveDocs alignments essentially function as a translation memory in the LiveDocs corpus, with one important difference: by right-clicking matches in the translation results pane or a concordance hit list, the aligned document can be opened directly and the full context of the content match can be read. A match or concordance hit found in a traditional translation memory is an isolated segment, divorced from its original context, which can be critical to understanding that translated segment. LiveDocs overcomes this problem.

A third advantage of alignment in memoQ is that, unlike environments in which aligned content can only be used after it is fed to a translation memory, a great deal of time can be saved by not “improving” the alignment unless its content has been determined to be relevant to a new source text for translation. If an analysis shows that there are significant matches to be found in a crude/bulk alignment, the specific relevant alignments can be determined and the contents of these finalized while leaving irrelevant aligned documents in an unimproved state. Should these unimproved alignments in fact contain relevant vocabulary for concordance searches, and if a concordance hit from them appears to be misaligned, opening the document via the context menu usually reveals the desired target text in a nearby segment.

Concordance lookup in memoQ with direct access to an aligned document in a LiveDocs corpus

Jun 5, 2017

Technology for Legal Translation

Last April I was a guest at the Buenos Aires University Facultad de Derecho, where I had an opportunity to meet students and staff from the law school's integrated degree program for certified public translators and to speak about my use of various technologies to assist my work in legal translation. This post is based loosely on that presentation and a subsequent workshop at the Universidade de Évora.

Useful ideas seldom develop in isolation, and to the extent that I can claim good practice in the use of assistive technologies for my translation work in legal and other domains, it is largely the product of my interactions with many colleagues over the past seventeen years of commercial translation activity. These fine people have served as mentors, giving me my first exposure to the concepts of platform interoperability for translation tools, and as inspirations, sharing the many challenges they face in their work and clearly articulating the outcomes they hoped to achieve as professionals. They have also generously shared the solutions they have found and their ideas on how and why we should do better in our daily practice. And I am grateful that I can continue to learn with them, work better, and help others to do so as well.

A variety of tools for information management and transformation can benefit the work of a legal translator in areas which include but are not limited to:
  • corpus utilization,
  • text conversion,
  • terminology management,
  • diverse information retrieval,
  • assisted drafting,
  • dictated speech to text,
  • quality assurance,
  • version control and comparison, and
  • source and target text review.
Though not exhaustive, the list above can provide a fairly comprehensive basis for the education of future colleagues and for continued professional development for those already active as legal translators. But with any of the technologies discussed below, it is important to remember that the driving force is not the hardware and software we use in technical devices but rather the human mind and its understanding of subject matter and the needs of the particular task or work process in the legal domain. No matter how great our experience, there is always something useful still to be learned, and often the best way to do this is to discuss the challenges of technology and workflow with others and to keep an open mind toward promising new approaches.


Reference texts of many kinds are important in legal translation work (and in other types of translation too, of course). These may be monolingual or multilingual texts, and they provide a wealth of information on subject matter, terminology and typical usage in particular contexts. These collections of text – or corpora – are most useful when the information found in them can be read in context rather than isolation. Translation memories – used by many in our work – are also corpora of a kind, but they are seriously flawed in their usual implementations, because only short segments of text are displayed in a bilingual format, and the meaning and context of these retrieved snippets are too often obscure.

An excerpt from a parallel corpus showing a treaty text in English, Portuguese and Spanish

The best corpus tools for translation work allow concordance searches in multiple selected corpora and provide access to the full context of the information found. Currently, the best example of integrated document context with information searches in a translation environment tool is found in the LiveDocs module of Kilgray's memoQ.

A memoQ concordance search with a link to an "aligned" translation
A past translation and its preview stored in a memoQ LiveDocs corpus, accessed via concordance search
A memoQ LiveDocs corpus has all the advantages of the familiar "translation memory" but can include other information, such as previews of the translated work as well. It is always clear in which document the information "hit" was found, and corpora can also include any number of monolingual documents in source and target languages, something which is not possible with a traditional translation memory.

In many cases, however, much context can be restored to a traditional translation memory by transforming it into a "document" in a LiveDocs corpus. This is because in most cases, substantial portions of the translation memory will have their individual segment records stored in document order; if the content is exported as a TMX file or tab-delimited text file and then imported as a bilingual document in a LiveDocs corpus, the result will be almost as if the original translations had been aligned and saved. From a concordance hit, one can then open the bilingual content directly and read the parts before and after the text found in the concordance search.


Legal translation can involve text conversion in a broad sense in many ways. Legal translators must often deal with hardcopy or faxed material or scanned files created from these. Often documents to translate and reference documents are provided in portable document format (PDF), in which finding and editing information can be difficult. Using special software, these texts can be converted into documents which can be edited, and portions can be copied, pasted and overwritten easily, or they can be imported in translation assistance platforms such as SDL Trados Studio, Wordfast or memoQ. (Some of these environments include integrated facilities for converting PDF texts, but the results are seldom as suitable for work as PDF or scanned files converted with optical character recognition software such as ABBYY FineReader or OmniPage.)


Software tools like ABBYY FineReader can also convert "dead" scanned text images into searchable documents. This will even work with bad contrast or color images in the background, making it easier, for example, to look for information in mountains of scanned documents used in legal discovery. Text-on-image files like the example shown above preserve the layout and image context completely, so the text can be read in the best possible way. I first discovered and used this option while writing a report for a client in which I had to reference sections of a very long, scanned policy document from the European Parliament. It was driving me crazy to page through the scanned document to find information I wanted to cite but had failed to make notes about during my first reading. Converting that scanned policy to a searchable PDF made it easy to find what I needed in seconds and to cite its page number accurately. Where there is text on pictures, difficult contrast or other such features, this is often far better for reference purposes than converting to an MS Word document, for example, where the layout is likely to become garbled.


Software tools for translation can also make text in many other original formats accessible to translators in an ergonomically simpler form, ensuring, where necessary, that no text is missed because of a complicated layout or because it sits in an easily overlooked footnote or margin note. Text import filters in translation environments make it easy to read and translate the words in a uniform working environment, with many reference tools and other help available, and then render the translated text back into its original format or some more useful bilingual format.

An excerpt of translated patent claims exported as a bilingual table for review

Technology also offers many possibilities for identifying, recording and controlling relevant terminology in legal translation work.


Large quantities of text can be analyzed quickly to find the most frequent special vocabulary likely to be relevant to the translation work and save these in project glossaries, often enabling that work to be organized better with much of the clarification of terms taking place prior to translation.  This is particularly valuable in large projects where it may be advisable to ensure that a team of translators all use the same terms in the target language to avoid possible confusion and misunderstanding.
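Translation environment tools have term extraction modules built in for this, but the principle behind them is simple frequency analysis, as this toy Python sketch shows (single words only, with a stoplist that would need to be far longer and language-specific in practice; real extractors also consider multi-word candidates):

    import re
    from collections import Counter

    # A minimal stoplist; real extraction uses much larger, language-specific lists.
    STOPWORDS = {"the", "and", "of", "to", "in", "a", "for", "shall", "be", "or"}

    def term_candidates(text, top_n=50):
        """Return the most frequent words as rough terminology candidates."""
        words = re.findall(r"[a-zA-Z][a-zA-Z-]+", text.lower())
        counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 3)
        return counts.most_common(top_n)

    with open("source_text.txt", encoding="utf-8") as f:
        for term, freq in term_candidates(f.read()):
            print(f"{freq:5d}  {term}")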

Glossaries created in translation assistance tools can provide terminology hints during work and even save keystrokes when linked to predictive, "intelligent" writing features.


Integrated quality checking features in translation environments enable possible deviations of terminology or other issues to be identified and corrected quickly.


Technical features in working software for translation allow not only desirable terms to be identified and elaborated; they also enable undesired terms to be recorded and avoided. Barred terms can be marked as such while translating or automatically identified in a quality check.

A patent glossary exported from memoQ and then made into a PDF dictionary via SDL Trados MultiTerm
Technical tools enable terminology to be shared in many different ways. Glossaries in appropriate formats can be moved easily between different environments to share them with others on a team which uses diverse technologies; they can also be output as spreadsheets, web pages or even formatted dictionaries (as shown in the example above). This can help to ensure consistency over time in the terms used by translators and attorneys involved in a particular case.

There are also many different ways that terminology can be shared dynamically in a team. Various terminology servers available usually suffer from being restricted to particular platforms, but freely available tools like Google Sheets coupled with web look-up interfaces and linked spreadsheets customized for importing into particular environments can be set up quickly and easily, with access restricted to a selected team.


The links in the screenshot above show a simple example using some data from SAP. There is a master spreadsheet where the data is maintained and several "slavesheets" designed for simple importing into particular translation environment tools. Forms can also be used for simplified data entry and maintenance.
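How the linked sheets stay synchronized depends on the setup, but one simple possibility, if the master sheet is shared for reading, is to pull its data through the CSV export URL. A hypothetical sketch with Python and pandas (the spreadsheet ID and column names are placeholders, not taken from the SAP example above):

    import pandas as pd

    # Placeholder ID: a Google Sheet shared for reading can be fetched
    # as CSV via its export URL.
    SHEET_ID = "YOUR_SPREADSHEET_ID"
    url = f"https://docs.google.com/spreadsheets/d/{SHEET_ID}/export?format=csv"

    glossary = pd.read_csv(url)   # e.g. columns: Source, Target, Note
    # Write a local copy formatted for import into a particular CAT tool.
    glossary.to_csv("glossary_for_import.csv", index=False)
    print(glossary.head())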


If Google Sheets do not meet the confidentiality requirements of a particular situation, similar solutions can be designed using intranets, extranets, VPNs, etc.


Technical tools for translators can help to locate information in a great variety of environments and media in ways that usually integrate smoothly with their workflow. Some available tools enable glossaries and bilingual corpora to be accessed in any application, including word processors, presentation software and web pages.


Corpus information in translation memories, memoQ LiveDocs or external sources can be looked up automatically or in concordance searches based on whole or partial content matches or specified search terms, and useful parts can then be inserted into the target text to assist translation. In some cases, differences between a current source text and archived information are highlighted to assist in identifying and incorporating changes.


Structured information such as dates, currency expressions, legal citations and bibliographical references can also be prepared for simple keystroke insertion in the translated text or automated quality checking. This can save many frustrating hours of typing and copy revision. In this regard, memoQ currently offers the best options for translation with its "auto-translation" rulesets, but many tools offer rules-based QA facilities for checking structured information.
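memoQ's auto-translation rules are regular expressions configured inside the program, but the underlying mechanism is easy to illustrate. A simplified Python sketch of the idea (not memoQ's actual rule syntax), converting German-style dates for an English target text:

    import re

    MONTHS = ["January", "February", "March", "April", "May", "June", "July",
              "August", "September", "October", "November", "December"]

    def convert_date(match):
        """Rewrite a date like 12.04.2019 as 12 April 2019."""
        day, month, year = match.groups()
        return f"{int(day)} {MONTHS[int(month) - 1]} {year}"

    pattern = re.compile(r"\b(\d{1,2})\.(\d{1,2})\.(\d{4})\b")

    print(pattern.sub(convert_date, "Judgment of 12.04.2019, file no. 2 BvR 1/19"))
    # -> Judgment of 12 April 2019, file no. 2 BvR 1/19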


Voice recognition technologies offer ergonomically superior options for transcription in many languages and can often enable heavy translation workloads with short deadlines to be handled with greater ease, maintaining or even improving text quality. Experienced translators with good subject matter knowledge and voice recognition software skills can typically produce more finished text in a day than the best post-editing operations for machine pseudo-translation, with the exception that the text produced by human voice transcription is actually usable in most situations, while the "gloss" added to machine "translations" is at best lipstick on a pig.


Reviewing a text for errors is hard work, and a pressing deadline to file a brief doesn't make the job easier. Technical tools for translation enable tens of thousands of words of text to be scanned for particular errors in seconds or minutes, ensuring that dates and references are correct and consistent, that correct terminology has been used, et cetera.
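A toy example of one such check, in the spirit of the QA modules described here (not any particular tool's implementation): verify that every number in a source segment also appears in the target, given a tab-delimited bitext. Real QA features also handle locale-specific number formatting, dates and much more.

    import re

    NUM = re.compile(r"\d+(?:[.,]\d+)*")

    def missing_numbers(source, target):
        """Return source numbers that do not appear in the target segment."""
        target_nums = NUM.findall(target)
        return [n for n in NUM.findall(source) if n not in target_nums]

    with open("bitext.txt", encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            src, _, tgt = line.rstrip("\n").partition("\t")
            missing = missing_numbers(src, tgt)
            if missing:
                print(f"Segment {lineno}: numbers missing in target: {missing}")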

The best environments even offer sophisticated features for tracking changes and differences between source and target text versions, and even historical revisions to a translation at the sentence level. And tools like SDL Trados Studio or memoQ enable a translation and its reference corpora to be updated quickly and easily by importing a modified (monolingual) target text.

When time is short and new versions of a source text may follow in quick succession, technology offers possibilities to identify differences quickly, automatically process the parts which remain unchanged and keep everything on track and on schedule.


For all its myriad features, good translation technology cannot replace human knowledge of language and subject matter. Those claiming the contrary are either ignorant or have a Trumpian disregard for truth and common sense, and they are all too eager to relieve their victims of the burdens of excess cash without giving the expected value in exchange.

Technologies which do not assist translation experts to work more efficiently or with less stress in the wide range of challenges found in legal translation work are largely useless. This really does include machine pseudo-translation (MpT). The best “parts” of that swindle are essentially the corpus matching for translation memory archives and corpora found in CAT tools like memoQ or SDL Trados Studio, and what is added is often incorrect and dangerously liable to lead to errors and misinterpretations. There are also documented, damaging effects on one’s use of language when exposed to machine pseudo-translation for extended periods.

Legal translation professionals today can benefit in many ways from technology to work better and faster, but the basis for this remains what it was ten, twenty, forty or a hundred years ago: language skill and an understanding of the law and legal procedure. And a good, sound, well-rested mind.

*******

Further references

Speech recognition 

Dragon NaturallySpeaking: https://www.nuance.com/dragon.html
Tiago Neto on applications: https://tiagoneto.com/tag/speech-recognition
Translation Tribulations – free, good-quality mobile speech recognition for many languages: http://www.translationtribulations.com/2015/04/free-good-quality-speech-recognition.html
Circuit Magazine - The Speech Recognition Revolution: http://www.circuitmagazine.org/chroniques-128/des-techniques
The Chronicle - Speech Recognition to Go: http://www.atanet.org/chronicle-online/highlights/speech-recognition-to-go/
The Chronicle - Speech Recognition Is in Your Back Pocket (or Wherever You Keep Your Mobile Phone): http://www.atanet.org/chronicle-online/none/speech-recognition-is-in-your-back-pocket-or-wherever-you-keep-your-mobile-phone/

Document indexing, search tools and techniques

Archivarius 3000: http://www.likasoft.com/document-search/
Copernic Desktop Search: https://www.copernic.com/en/products/desktop-search/
AntConc concordance: http://www.laurenceanthony.net/software/antconc/
Multiple, separate concordances with memoQ: http://www.translationtribulations.com/2014/01/multiple-separate-concordances-with.html
memoQ TM Search Tool: http://www.translationtribulations.com/2014/01/the-memoq-tm-search-tool.html
memoQ web search for images: http://www.translationtribulations.com/2016/12/getting-picture-with-automated-web.html
Upgrading translation memories for document context: http://www.translationtribulations.com/2015/08/upgrading-translation-memories-for.html
Free shareable, searchable glossaries with Google Sheets: http://www.translationtribulations.com/2016/12/free-shareable-searchable-glossaries.html

Auto-translation rules for formatted text (dates, citations, etc.)

Translation Tribulations, various articles on specifications, dealing with abbreviations & more:
http://www.translationtribulations.com/search/label/autotranslatables
Marek Pawelec, regular expressions in memoQ: http://wasaty.pl/blog/2012/05/17/regular-expressions-in-memoq/

Authoring original texts in CAT tools

Translation Tribulations: http://www.translationtribulations.com/2015/02/cat-tools-re-imagined-approach-to.html

Autocorrection for typing in memoQ

Translation Tribulations: http://www.translationtribulations.com/2014/01/memoq-autocorrect-update-ms-word-export.html

Oct 15, 2015

The Invisible Hand of memoQ LiveDocs - making "broken" corpora work

Last month I published a post describing the "rules" for document visibility in the list of documents for a memoQ LiveDocs corpus. Further study has revealed that this is only part of the real story and is somewhat misleading.

I (wrongly) assumed that, in a LiveDocs corpus, if a document was visible in the list its content was available in concordance searches or the Translation Results pane, and if it was not shown in the list of documents for the corpus in the project, its content would not be available in the concordance or Translation Results pane. Both assumptions proved wrong in particular cases.

In the most recent versions of memoQ, for corpora created and indexed in those versions, all documents in a corpus shown in the list will be available in the concordance search and the Translation Results pane as expected. And the rules for what is currently shown in the list are described accurately in my previous post on this topic. However,

  • if there are documents in the corpus which share the same main language (as EN-US and EN-UK both share the main language, English) but are not shown in the list, these will still be used for matching in the memoQ Concordance and Translation Results and
  • if the corpus was created in an older version of memoQ (such as memoQ 2013R2), documents shown in the list of a corpus may in fact not show up in a Concordance search or in the Translation Results. 
This second behavior - documents shown in the list but their content not appearing in searches - has been described to me recently by several people, but it could not be reproduced at first, so I thought they must be mistaken, and statements that "sometimes it works and sometimes it doesn't" made these pronouncements seem even more suspect. Except that they happen to be true and I now (sort of) understand why.

Prior to publishing my post describing the rules that govern the display of documents for a LiveDocs corpus in a project, I had been part of a somewhat confusing discussion with one of my favorite Kilgray experts, who mentioned monolingual "stub" documents a number of times as a possible solution to content availability in a corpus. But when I tried to test his suggestion and saw that the list of documents on display in the corpus had not expanded to include content I knew was there, I thought he was wrong. Actually, he was right; we were talking about two different things - the visibility of a document versus the availability of its content.

For purposes of this discussion, a stub document is a small file with content of no importance, added only to create the desired behavior in memoQ LiveDocs. It might be a little text file - "stubby.txt" - with any nonsense in it.

I went back to my test projects and corpora used to prepare the last article and found that in fact for the main languages in a project all the content was available from the corpora, regardless of whether the relevant documents were displayed in the list. In the case of a corpus not offered in the list for a project because of sublanguage mismatches in the source and target, adding a stub document with either a generic setting (DE, EN, PT, etc.) or sublanguage-specific setting for the source language or the correct sublanguage setting for the target (DE-CH, EN-US, etc.) made all the corpus content for the main languages available instantly. (In the project, documents added will have the project language settings; use the Resource Console for any other language settings you want.)

Content of a test corpus before a stub document was added. Viewed in the Resource Console.
The test corpus with the document list shown in my project; only the stub document is displayed, but
all the indexed content shown above is also available in the Concordance and Translation Results.
It is unfortunate that in the current versions of memoQ the document list for a corpus in a project may not correspond to its actual content for the main languages. Not only does this preclude accessing a document's content without a match or a search, it also means that binary documents (such as one of the PDF files shown in the list) cannot be opened from within the project. I hope this bug will be fixed soon.

Since a few of my friends, colleagues and clients were concerned about odd behavior involving older corpora, I decided to have a look at those as well. Kilgray Support had made a general recommendation of rebuilding these corpora or had at least suggested that problems might occur, so I was expecting something.

And I found it. Test corpora created in the older version of memoQ (2013 R2) behaved in a way similar to my tests with memoQ 2015 - although the "display rules" for documents in the list differed as I described in my previous blog post, the content of "hidden" documents was available in Concordance searches and in the Translation Results pane. But....

When I accessed these corpora created in memoQ 2013 R2 using memoQ 2015, even if I could see documents (for example, a monolingual source document with a generic setting), the content was available in neither the Concordance nor the Translation Results until I added an appropriate stub document under memoQ 2015. Then suddenly the index worked under memoQ 2015 and I could access all the content, regardless of whether the documents were displayed in the list. If I deleted the stub document, the content became inaccessible again.

So what should we do to make sure that all the content of our memoQ corpora is available for searches in the Concordance or matches in the Translation results?

If you always work out of the same main source language (which in my case would be German or "DE", regardless of whether the variant is from Germany, Austria or Switzerland), then add a generic language stub document for your source language to all corpora - old and new - under memoQ 2015 and all will be well.

If your corpora will be used bidirectionally, then add a generic stub for both the source and target to those corpora or add a "bilingual stub" with generic settings for both languages. This will ensure that the content remains available if you want to use the corpora later in projects with the source and target reversed.

Although it's hard to understand the principles governing what is displayed, when and why, following the advice in the two paragraphs above will at least eliminate the problem of content not being available for pretranslation, concordance searches and translation grid matches. And the mystery of inconsistent behavior for older corpora appears to be solved. The cases where these older corpora have "worked" - i.e. their content has been accessible in the Concordance, etc. - are cases where new documents were added to them under recent versions of memoQ. If you just keep adding to your corpora, particularly from a project with generic language settings, you'll not have to bother with stub documents and your content will be accessible.

And if Kilgray deals with that list bug so we actually see all the documents in a corpus which share the main languages of a project, including the binary ones, then I think the confusion among users will be reduced considerably.

Sep 16, 2015

Getting around language variant issues in memoQ LiveDocs

I was told by some other users that a fundamental change had been made in the way language data are accessed in LiveDocs. It was said that until a few versions ago it had been possible to use documents for reference in LiveDocs regardless of their sublanguage settings. So I was told. The truth is more complicated than that.

According to my tests, memoQ 2015 is the first version of memoQ to have a logically consistent treatment of language variants for both bilingual and monolingual documents in corpora. All the other versions tested (memoQ 2013R2, 2014, 2014R2) are equally screwed up and show the same results.

The "visibility" of a monolingual or bilingual document when viewed in a corpus attached to a project running under memoQ 2015 follows these rules:

  • the sublanguage (language variant) settings for source and target (of the document or the project) must match the project
  • or the language setting (of the document or the project) must be generic. 
Two rules. Pretty simple. It doesn't matter what version of memoQ the project or corpus was created in, only which version is actively running.

I created a test corpus with the following document mix:


The corpus contained 11 documents, both bilingual and monolingual with a mix of generic language settings and settings with language variants specified (such as German for Germany, Switzerland and Liechtenstein and English for Zimbabwe, the US and UK). 

In a project running under memoQ 2015 with the languages set to generic German and generic English, all 11 documents in the corpus were accessible. 

So if you want access to all LiveDocs corpus data for the major languages of your project, it is necessary to use generic language settings, either when you load the data into LiveDocs (difficult unless you always use the Resource Console, since adding documents to a corpus from within a project automatically applies the project's language settings!) or in the languages specified for the project itself. And this will only work with memoQ 2015. If you want to apply penalties to particular language variants, this can be done using keyword markers (as seen in the screenshot above) and configuring the More penalties tab of the LiveDocs settings file applied to that corpus.

If the same corpus is attached to a project running under memoQ 2015 with language settings for Swiss German and generic English, the documents available from the corpus are these:



For a Swiss German and UK English project under memoQ 2015, this is the picture:



And for Germany's German and US English:



 All the screenshots above can be predicted based on the two rules stated. Work it out.

"But what happens with earlier versions of memoQ?" you might wonder. It's messy. Here is a look at a Swiss German and UK English project under memoQ 2013 R2, 2014 and 2014 R2: 


And here's a project with generic German and generic English under memoQ 2013 R2, 2014 and 2014 R2:


In each case the five bilingual documents are visible no matter what the project's language settings are. However, there is strict adherence to language variants and the generic language setting for monolingual documents! In my opinion, that's for the birds. I see no good reason to follow a different rule for data availability in bilingual versus monolingual documents. So in a sense, Kilgray has cleaned up this inconsistency in the latest version of memoQ.

Some have expressed a desire for a "switch" setting to allow language variant settings to be ignored. And perhaps Kilgray will provide such a feature in the future. But the best way to get there now is simply to make your project's language settings generic.

Changing the language settings for bilingual data in an existing LiveDocs corpus 
If you have a corpus with a mix of language settings and you want to convert these to generic settings or a particular variant, this can currently be done only for bilingual documents, as follows:
  1. Select the bilingual documents to export from the corpus and export them to a folder. (If you choose to zip them all together, unpack the *.zip file later to make a folder of the exported *.mqxlz files.)
  2. Re-import the *.mqxlz files to the LiveDocs corpus via the Resource Console so you are able to specify the exact language settings you want. In the import dialog, you'll have to change the filter setting manually from "binary" to "XLIFF": these *.mqxlz files are not the same as bilingual files from a translation document in a project and are not recognized automatically.
Unfortunately, there is no way to change the language settings of a monolingual document except to re-import it in the Resource Console in its original form and set the language variant (or generic value) there.

So really, for now, the best way to go seems to be to use memoQ 2015 with generic project language settings.

Sep 15, 2015

A quick trip to LiveDocs for EUR-Lex bilingual texts

Quite a number of friends and respected colleagues use EUR-Lex as a reference source for EU legislation. Being generally sensible people, some of them have backed away from the overfull slopbucket of bulk DGT data and built more selective corpora of the legislation which they actually need for their work.

However, the issue of how to get the data into a usable form with a minimum of effort has caused no little trouble at times. The various texts can be copied out or downloaded in the languages of interest and aligned, but depending on the quality of the alignment tool, the results are often unsatisfactory. I've been told that AlignFactory does a better job than most, but then the question of how best to deal with the HTML bitexts from AlignFactory remains.

memoQ LiveDocs is of course rather helpful for quick and sometimes dirty alignment, but if the synchronization of the texts is too many segments off, it is sometimes difficult to find the information one needs even when the (bilingual) document is opened from the context menu in a concordance window.

EUR-Lex offers bi- or trilingual views of most documents in a web page. The alignments are often imperfect, but the synchronization is usually off by only one or two segments, so finding the right text in a document's context is not terribly difficult. These often imperfect alignments are thus usually quite adequate for use as references in a memoQ LiveDocs corpus. Here is a procedure one might follow to get the EUR-Lex data there.


The bilingual text of a view such as the one above can be selected by dragging the cursor to select the first part of the information, then scrolling to the bottom of the window and Shift+clicking to select all the text in both columns:


Copy this text, then paste it into Excel:


Then import the Excel file as a file for "translation" in a memoQ project with the right language settings. Because of quirks with data access in LiveDocs if the target language variants are specified and possibly not matched, I have created a "data conversion project" with generic language settings (DE + EN in my case, as opposed to my usual DE-DE + EN-US project settings) to ensure that data stored in LiveDocs will be accessed without trouble from any project. (This irritating issue of language variants in LiveDocs was introduced a few versions ago by Kilgray in an attempt to placate some large agencies, but it has caused enormous headaches for professional translators who work with multiple sublanguage settings. We hope that urgent attention will be given to this problem soon; until then, keep your LiveDocs language data settings generic to ensure trouble-free data access!)


When the Excel file is added to the Translations file list, there are two important changes to make in the import options. First, the filter must be changed from Microsoft Excel to "multilingual delimited text" (which also handles multilingual Excel files!). Second, the filter configuration must be "changed" to specify which data is in the columns of interest.


The screenshot above shows the import settings that were appropriate for the data I copied from EUR-Lex. Your settings will likely differ, but in each case the values need to be checked or set in the fields near the arrows ("Source language" particularly at the top and the three dropdown menus by the second arrow below).


Once the data are imported, some adjustments can be made by splitting or joining segments, but I don't think the effort is generally worth it, because in the cases I have seen, data are not far out of sync if they are mismatched, and the synchronization is usually corrected after a short interval.

In the Translations list of the Project home, the bilingual text can be selected and added to a LiveDocs corpus using the menus or ribbons.


The screenshot below shows the worst location of badly synchronized data in the text I copied here:


This minor dislocation does not pose a significant barrier to finding the information I might need to read and understand when using this judgment as a reference. The document context is available from the context menu in the memoQ Concordance as well as the context menu of the entry appearing in the Translation results pane.

A similar data migration procedure can be implemented for most bilingual tables in HTML files, word processing files or other data sources by copying the data into Excel and using the multilingual delimited text filter.
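For those comfortable with a bit of scripting, the copy-and-paste step can itself be automated for saved HTML pages. A sketch with Python and pandas (the file name is a placeholder, and the assumptions that the bitext is the first table on the page and occupies the first two columns must be checked for each source; the lxml and openpyxl packages are required):

    import pandas as pd

    # Read all tables from a saved bilingual HTML page.
    tables = pd.read_html("bilingual_page.html")   # placeholder file name

    # Assumption: the bitext is the first table, one language per column.
    bitext = tables[0].iloc[:, :2]
    bitext.columns = ["source", "target"]

    # Write an Excel file ready for the multilingual delimited text filter.
    bitext.to_excel("bitext_for_import.xlsx", index=False)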