Jun 29, 2013

Caption editing for YouTube videos

I've spent a great deal of time in recent weeks examining different means for remote instruction via the Internet. In the past I've had good success with TeamViewer to work on copywriting projects with a partner or deliver training to colleagues and clients at a distance. So far I have avoided doing webinars because of the drawbacks I see for that medium, both as an instructor and as a participant, but I haven't completely excluded the possibility of doing them eventually. I've also looked at course tools such as Citrix GoToTraining and a variety of other e-learning platforms, such as Moodle, which is used by universities and schools around the world and which also seems to be the choice of Kilgray, ProZ and others for certain types of instruction.

Recorded video can be useful with many of these platforms, and since I've grown tired of doing the same demonstrations of software functions time and again, I've decided to record some of these for easy sharing and re-use. When I noticed recently that my Open Source screen recording software, CamStudio, had been released in a new version, I decided quite spontaneously to make a quick video of pseudotranslation in memoQ to test whether a bug in the cursor display for the previous version of CamStudio had been fixed.

After I uploaded the pseudotranslation demo to YouTube, I noticed that rather appalling captions (subtitles) had been created by automatic voice recognition. Although voice recognition software such as Dragon NaturallySpeaking is usually very kind to me, Google's voice recognition on YouTube gave miserable results.

I soon discovered, however, that the captions were easy to edit and could also be exported as text files with time cues. These text files can be edited very easily to correct recognition errors or combine segments to improve the timing and subtitle display.
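To give an idea of how little is involved in combining segments, here is a minimal Python sketch. It assumes the export uses the SBV layout (a `start,end` time cue line, the caption text, then a blank line); the sample text, the function names and the choice of SBV are my assumptions for illustration, not something taken from the YouTube export itself.

```python
import re

# Assumed SBV cue line: H:MM:SS.mmm,H:MM:SS.mmm
CUE_RE = re.compile(r"^(\d+:\d{2}:\d{2}\.\d{3}),(\d+:\d{2}:\d{2}\.\d{3})$")

# Hypothetical sample with two short segments that read better as one
SAMPLE = """0:00:00.500,0:00:02.000
This is a short demo

0:00:02.000,0:00:04.500
of pseudotranslation in memoQ.
"""

def parse_sbv(text):
    """Return a list of (start, end, caption) tuples from SBV-style text."""
    cues = []
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.splitlines()
        match = CUE_RE.match(lines[0].strip())
        if not match:
            continue  # skip anything that does not start with a time cue
        caption = " ".join(line.strip() for line in lines[1:])
        cues.append((match.group(1), match.group(2), caption))
    return cues

def merge_cues(cues, index):
    """Combine cue `index` with the one after it, keeping the outer time cues."""
    start, _, first_text = cues[index]
    _, end, second_text = cues[index + 1]
    merged = (start, end, f"{first_text} {second_text}")
    return cues[:index] + [merged] + cues[index + 2:]

def format_sbv(cues):
    """Write the cues back out in the same layout."""
    return "\n\n".join(f"{s},{e}\n{t}" for s, e, t in cues) + "\n"

merged_text = format_sbv(merge_cues(parse_sbv(SAMPLE), 0))
```

For a three-minute video a plain text editor is just as quick, but a script like this might pay off when many short segments have to be joined.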

Once the captions for the original language are cleaned up and the timing is improved, the text files can be translated and uploaded to the video in YouTube to create caption tracks in other languages. As a test, I did this (with a little help from my friends), adding tracks for German and European Portuguese to the pseudotranslation demo. And if anyone else cares to create another track for their native language from this file, I'll add it with credits at the start of the track.

It's easy enough to understand why I might want to add captions in other languages to a video I record in English or German. But why would I want to do so in the original language? My thick American accent is one reason. I like to imagine that my English is clear enough for everyone to understand, but that is a foolish conceit. Of course I speak clearly - I couldn't use Dragon successfully if that were not true. But someone with a knowledge of English mostly based on reading or interacting with people who have very different accents might have trouble. It happens.

Although most of the demonstration videos SDL has online for SDL Trados Studio are easy to follow, some of the thick UK accents are really frightening and difficult for some people in places like Flyover America to follow. Some Kilgray videos of excellent content are challenging for those unaccustomed to the accents, and the many wonderful demos of memoQ, Wordfast, OmegaT and other tools by CAT Guru on YouTube would have been difficult for me before I was exposed to the linguistic challenges of the wide world that can English. All of these excellent resources in English would benefit from clear English subtitles.

How difficult is it to create captions? The three-minute pseudotranslation demo cost me about ten minutes of work to clean up the subtitles. The English captions for another slightly shorter video explaining the use of the FeeWizard Online to estimate equivalent rates for charging by source or target words, lines, pages, etc. also took me about 10 or 15 minutes with all the text and timing corrections. And I've spent a good bit of time in the past week transcribing a difficult spoken English lecture by a German professor: it took me about 7 hours of transcription work to cope with a spoken hour. I don't know if this is typical, because I almost never do this sort of thing, and there were a lot of WTF moments. But I suppose three to seven times the recording length might be a reasonable range for estimating the effort of a draft edit and some timing changes. Not bad, really.

So if you are involved in creating instructional videos to put on YouTube or use elsewhere, please consider this easy way of making good work even better by investing a little time in caption creation and editing. Once you have done this for the original language, it will also be a simple matter to translate those captions to make your content even more accessible.

Jun 7, 2013

Understanding fuzzy term matching in memoQ 2013

Perhaps the most interesting and potentially useful feature for me in the recently released memoQ 2013 is fuzzy term matching. I have wanted something like this for several years, and several efforts at harmonizing terminology in a large, collaborative project last year made it clear that this might be very helpful in identifying deviations from agreed terminology in cases where that terminology appears as part of a compound word (as it sometimes tends to do in German, my source language).

So when I finally downloaded the latest version of memoQ last weekend and began testing, fuzzy terminology was the second thing I looked at (after the current comment mess). My initial tests left me very, very confused. Each example I created gave different results, and it was not easy to discern a pattern from examining just a few terms. The explanation of why this feature works as it does can be difficult to follow, at least as far as I am able to explain it, so many readers may be better off reading my conclusions in the next paragraph and skipping everything below it (except maybe the last graphic).

Fuzzy term matching in memoQ 2013 is a real improvement for terminology matching and quality assurance involving terms, at least for my language pair. This is not an easy challenge that the developers have taken on, but some useful results have been achieved and no harm has been done to previous functionality. And I expect that this feature will be the subject of further refinement and improvement for other languages as users make the case for these.

My first quick test of the feature involved a verb, the German word for "to wait" ("warten"). I put it into a test termbase and then imported a translation text consisting of various sentences that used forms of the verb. I noticed that there was a term hit for "warte", but nothing for "gewartet". After adding "warte" to the termbase for fuzzy matching, there was still no match for "gewartet", although it contained that character sequence.

Then I tried another example with "Gesetz" (law). There I seemed to hit the jackpot. There were hits with Unweltgesetz (a typo, but typical of many source texts I see), Umweltgesetze, Gesetzentwurf and Umweltgesetzentwurf with blue background highlighting of the character sequence matching the termbase entry.

A third term produced more confusion: with "Ausführung" in the termbase, there was no match for "Farbausführungen", but there was a match for "Farbausführungsbeispiel". Clearly this is not a simple matching function.

A question to Kilgray Support brought an answer that explained the match behavior. The current implementation of fuzzy term matching in memoQ uses a combination of rules which depend on the index, the "edit distance" (calculated differences between the entry string and the characters in the term to match) and, depending on the language, character maps and a threshold length for possible compound words.

German, it seems, is a privileged language, the only one for which compound word recognition rules are currently active. Apparently five characters are the minimum to be recognized as a word, so the "Farb-" in "Farbausführungen" wasn't enough, but "-beispiel" triggered the compound recognition rule that caused "-ausführung-" to be matched in the middle of "Farbausführungsbeispiel". I imagine that compound matching would be useful for Dutch and some other languages, and the developer suggested that expanding coverage to other languages as needed would not be difficult.

Character mapping - defined equivalence between letters - is implemented for German, Hungarian, Italian and Spanish to allow matching in cases where letters may change with plural formation, for example. Thus the German word "Bratapfel" in the termbase would yield a hit for the plural form "Bratäpfel".
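The idea of character mapping can be illustrated with a few lines of Python. This is only a sketch under my own assumptions: the equivalence table below is hypothetical (the real maps memoQ uses were not described to me in detail), and the function names are invented for the example.

```python
# Hypothetical equivalence table for German umlauts; the real memoQ maps
# are not known to me and may well contain other pairs.
CHAR_MAP = str.maketrans({"ä": "a", "ö": "o", "ü": "u"})

def fold(word):
    """Lower-case a word and replace mapped letters with their base letter."""
    return word.lower().translate(CHAR_MAP)

def same_after_mapping(entry, candidate):
    """True if the two words are identical once the character map is applied."""
    return fold(entry) == fold(candidate)

hit = same_after_mapping("Bratapfel", "Bratäpfel")
```

With a table like this, the singular entry and the plural with its umlaut are treated as the same string before any further comparison takes place.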

Edit distance is calculated by dividing the number of deviating characters by the number of total characters in the fuzzy term entry. A match is currently assumed if the edit distance is 0.2 or less. The term "warte" differs from the six-letter entry "warten" by one character; 1/6 is less than 0.2, so memoQ 2013 reports a term match. In the example of "Farbausführungen" above, the calculated edit distance is 6/10 (because 6 letters - four on the left, two on the right - are added to the ten-letter term entry), and because this is larger than 0.2 no match is indicated. If, on the other hand, "Farbtonausführungen" occurs, a match will be found, because the added segment "Farbton-" meets the requirement of five or more letters for a compound word.
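The arithmetic above can be reproduced with a short Python sketch. Please treat it as my reading of the explanation from support, not as Kilgray's actual code: I assume that "deviating characters" means a standard Levenshtein distance, and that the compound rule fires when the entry occurs inside a word and at least one of the added parts has five or more letters. The names and the exact form of both rules are assumptions.

```python
THRESHOLD = 0.2          # maximum edit distance ratio, as stated by support
MIN_COMPOUND_PART = 5    # minimum length of an added compound part

def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(previous[j] + 1,                       # deletion
                               current[j - 1] + 1,                    # insertion
                               previous[j - 1] + (char_a != char_b))) # substitution
        previous = current
    return previous[-1]

def edit_ratio(entry, candidate):
    """Deviating characters divided by the length of the termbase entry."""
    return levenshtein(entry.lower(), candidate.lower()) / len(entry)

def compound_hit(entry, candidate):
    """Assumed rule: entry found inside the word, with a long enough added part."""
    entry, candidate = entry.lower(), candidate.lower()
    position = candidate.find(entry)
    if position < 0:
        return False
    left = candidate[:position]
    right = candidate[position + len(entry):]
    return len(left) >= MIN_COMPOUND_PART or len(right) >= MIN_COMPOUND_PART

def is_fuzzy_hit(entry, candidate):
    return edit_ratio(entry, candidate) <= THRESHOLD or compound_hit(entry, candidate)

for entry, word in [("warten", "warte"), ("warten", "gewartet"),
                    ("Ausführung", "Farbausführungen"),
                    ("Ausführung", "Farbtonausführungen"),
                    ("Ausführung", "Farbausführungsbeispiel")]:
    print(entry, word, round(edit_ratio(entry, word), 2), is_fuzzy_hit(entry, word))
```

Under these assumptions the sketch agrees with what I saw in my tests: hits for "warte", "Farbtonausführungen" and "Farbausführungsbeispiel", and none for "gewartet" or "Farbausführungen".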

How relevant can this feature be to your language? What changes might be required to the matching behavior to obtain useful results in your source language(s)? Your feedback to Kilgray's support team and feedback from others working with your languages are the best way to help improve the usefulness of fuzzy term matching for your language. So speak up.

If you find this feature useful and want fuzzy term matching as the default for new entries in a termbase, this can be set in the properties for the termbase under New term defaults... for termbases created in memoQ 2013. Older termbases will also display this option, but it won't actually work in practice. To use this new feature with old term collections, these will need to be migrated to a memoQ 2013 termbase.

Jun 3, 2013

Innovation? No comment. A black mark for memoQ 2013.

Update August 2013: problem solved.

"This comment thing is the last straw. I am definitely not upgrading to the new version of memoQ!"
The response from a friend who asked me to show how the comment function in memoQ had devolved in the latest version (memoQ 2013, aka version 6.5) revealed the frustration of a user whose main interest in new product features for much of the past year has been how to disable or avoid them. Unfortunately for her, there appears to be no escape from the latest innovation, which one member of the memoQ Yahoogroups list suggested would become known as Commentgate. This may be a good example of the old adage "if it ain't broke, don't fix it".

The old comment function has been at the heart of my use of memoQ for years, and the ability to export comments to share with my clients was one of the main reasons I pushed Kilgray to introduce the bilingual RTF table exports which are so popular for editing and translation. Comments added in a word processor (such as Microsoft Word) could be re-imported to the memoQ project and reviewed. It was all very quick and simple.

The old memoQ comment dialog with a personal note on edits required
If the comments included questions about a text or a term, the response in the comment column of the RTF table would be included when the bilingual file was re-imported, which was often convenient for editing purposes. A record of questions and answers could be maintained fairly easily in the editable comment field.

All that has changed with memoQ 2013. Disastrously so in the initial release on May 31. In their eagerness to implement an LQA quality assurance model relevant to only a minority of users, mostly the sort of agencies who prefer metrics in place of actual quality, Kilgray dynamited the previous straightforward, robust comment function, adding dropdown selection fields for "severity level" and scope.

That wouldn't be quite so bad despite the extra steps needed to add a correctly classified comment now if it were possible to edit the comments. It is not possible to edit comments in memoQ 2013 in the current release. Nor are all the comments included in an RTF table export. Only the last comment is included; all others are lost:

If a comment is altered in the bilingual RTF file, when the bilingual is re-imported to the project, a new comment is created and classified as "Information" applicable to the entire row:

This is all really a shame. In the effort to push server-based workflows and cater to a limited special interest group, memoQ's architects have managed to sabotage one of the tools (or two, depending on how you count) which have contributed to their great success in recent years. And unfortunately, unlike other recent, often irritating "innovations", such as the various target autotext options that drive many users of dictation software batty, this new type of comment can't be switched off so that we can work in the old, accustomed way.

Many users have already objected strenuously to this broken functionality, and some compromise solutions have been suggested by Kilgray. One of these, which involves a delimited export of all the comments in the RTF bilingual export, would be reasonable. Whatever changes are made to commentary in memoQ 2013, I hope these will include making comments editable very soon with appropriate rights.

The two selection fields in the new comments dialog have a default behavior that is probably useful in some cases. Once a comment has been created, the next comment made in the text will assume the same "severity" rating applies (not necessarily a valid assumption, but if I am going through a text marking things of the same type this can be useful). All comments are assumed to apply to the target text by default. This is a bit of a nuisance to me, as most of the comments I make in a file refer to errors or unclear expressions in the source text. But really, for the way I work, the two classification steps with the dropdown menus are just extra work and additional sources of possible errors and/or confusion, so I would be quite happy to bypass these altogether.

I do like the idea of a comment history for a text. This would be relevant and useful for the way I work. But overall, the current implementation of the comment function in memoQ 2013 does not serve my interests at all and creates unnecessary complications for me and many other users.

Jun 1, 2013

Translating multilingual Excel files in memoQ

Some weeks ago on a Friday, in the late afternoon, I received one of those typical Friday project inquiries: a request for a fast response on whether I would like to translate some 15 to 20 Excel files distributed in three folders, with file names redundant between the folders and over half the 50,000 source words already translated. My translations were, of course, to take the previous work into consideration and remain consistent with it. No translation memory resources were available. Fortunately for my blood pressure, I was offline that afternoon until after business hours. When I saw the request later that evening, I considered what sane approach there might be to such a project, and when none occurred to me at the time, I wrote a note to the project manager requesting more information about the source data, received no response and forgot the whole business as the usual Friday nonsense.

About a week later, while I was engaged in something completely different, it occurred to me that it would have been a fairly straightforward matter to translate the remaining text scattered through those files and build a reference translation memory from the existing translations. In fact, I could even use the available translations from other languages as references in a preview. How? By using a multilingual filter option that Kilgray added to memoQ version 6.2 (with build 15).

Finding that option is not exactly intuitive. I had heard about it but had not followed the discussions closely in the online lists, nor could I remember it from the online demonstrations I had viewed in recent months. But I knew that it worked with Excel files, so I started to import such a file and looked for the proper settings to import the source and target columns. And found nothing.

Fortunately, I used to be a software developer, so I put on my old developer's thinking cap and considered how I might best mystify users with a new feature. Aha! Name the feature something completely different! So I looked again at the list of import filters for my Excel file and found a likely candidate, the multilingual delimited text filter. (To use this filter, you must Import with options.)

The first page of the settings dialog for that filter offers Excel as one of the base formats to import:

The columns can be specified by marking the Simple bilingual configuration option, or with somewhat less confusion by examining the options on the Columns tab of the dialog. For the following test file with English source text and German as the desired target language

I used the following settings for the import:

After a little experimentation, I found that I could specify the third language (Portuguese) as a translation even though it was not a language indicated in the project. (Additional target languages are only possible with the PM edition of memoQ, but this information can be designated as a comment if needed in the Translator Pro edition.) This added the Portuguese translations (where available) to the preview in my working window:

Some odd property of the import filter in the version of memoQ tested caused the source text to be copied to the view of unpopulated translations of the project's target language in the preview, but that is of no real consequence. The preview, unlike a typical preview of an Excel file, bears no resemblance to the layout of the source file, but is instead organized by the source text grouped with other specified columns.

Considering how often I have encountered Excel files and other sources structured like this in the past decade, I would say this is probably one of the most useful filters that has been added to memoQ recently. More complex data structures may require cell background colors to be used to exclude unwanted parts of a spreadsheet (colors can be added while configuring the import). It's a shame that the current version of the filter doesn't support ranges or conditions for exclusion, but perhaps that will come later.

Making a translation memory from the existing translations in the file (which were locked and confirmed upon import in the example shown above) is a simple matter of using the command Operations > Confirm and update rows... and selecting the appropriate options. For the example shown here, selecting the locked rows would write all these to the primary translation memory:

Kilgray has a blog post and a recorded webinar (47 minutes long) with further details about using this filter. They state that "This webinar was designed for language service providers and enterprise users managing multilingual projects." However, given the frequency with which many freelancers encounter such formats and their desire to use other language information, comments, context data, etc. in their translations, I think this feature is just as relevant to freelance translators.

Update 2013-07-25: After a series of recent tests involving imports and segmentation, I wanted to see how the multilingual Excel filter would import data in which individual cells contain multiple sentences or line breaks. Theoretically the segments should correspond to the cell structures, but would they in fact? I decided to import one of my Excel files that I use for segmentation demos. To keep all the content of a cell in one segment with the regular Excel filter, I have to use a "paragraph segmentation" ruleset and set "soft breaks" as inline tags in the filter's import settings. But the default settings of the multilingual Excel filter achieve the same result:

This showed me that the "multilingual" filter might in fact save me time and trouble for importing files from certain customers where I want to avoid segmentation inside the cells altogether. And of course, the multilingual filter is an obvious quick way to load data from an Excel file and, as mentioned above for partial data, send it to a TM - a process which used to involve saving as a CSV file from Excel, worrying about saving as UTF-8, etc. That process might not even work with the test file shown here (I'm not really inclined to try it).