
May 28, 2022

Filtering formatted text in Microsoft Office files

 Recently, I shared an approach to selecting text in a Microsoft Word file with editing restricted to certain paragraphs. This feature of Microsoft Word is, alas, not supported by any translation tool filters of which I am aware, so to import only the text designated for editing it is necessary to go inside the DOCX file (which is just a ZIP archive with the extension changed) and use the XML file which contains the document text with all its format markers.

This approach is generally valid for all the XML-based formats used by Microsoft Office since Office 2007, such as DOCX from Word or PPTX from PowerPoint. I have prepared a video to show how the process of extracting the content and importing it for translation can work:

 After translation, the relevant XML file is exported and the original XML is replaced with the translated file inside the archive. If the DOCX or PPTX file was unpacked to get at the XML, the folder structure can then be re-zipped and the extension changed to its original form to create the deliverable translated file.
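For those who prefer to script that repacking step, here is a minimal sketch in Python, assuming the translated XML was exported under a placeholder name (all file names here are hypothetical):

import zipfile

src = "original.docx"       # hypothetical input file
dst = "translated.docx"     # deliverable to be created

# A DOCX is a ZIP archive; rebuild it, swapping in the translated
# word/document.xml and copying every other member unchanged.
with zipfile.ZipFile(src) as zin, zipfile.ZipFile(dst, "w", zipfile.ZIP_DEFLATED) as zout:
    for item in zin.infolist():
        if item.filename == "word/document.xml":
            with open("document_translated.xml", "rb") as f:
                zout.writestr(item, f.read())
        else:
            zout.writestr(item, zin.read(item.filename))

The same approach works for PPTX files; only the internal path of the XML part to be replaced differs.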

What I do not show in the video is that the content can also be extracted by other means, such as convenient memoQ project templates using filters with masks to extract directly via the various ZIP filter options. But the lower-tech approach shown in the video should be accessible to any professional with access to modern translation environment tools which permit filter customization with regular expressions.

Once a filter has been created for a particular format such as red text, adapting it to extract only green highlighted text or text in italics or some other format takes less than a minute in an editor. Different filters are necessary for the same formats in DOCX and PPTX, because unfortunately Microsoft's markup for yellow highlighting, for example, differs between Word and PowerPoint in the versions I tested.
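For illustration, the run markup I saw for yellow highlighting looked roughly like this; these patterns are observations from the files I tested, to be verified against your own documents:

DOCX (WordprocessingML):  <w:highlight w:val="yellow"/>
PPTX (DrawingML):         <a:highlight><a:srgbClr val="FFFF00"/></a:highlight>

A filter mask built around the first pattern will therefore miss highlighted text in PowerPoint files until it is adapted.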

Although this is a bit of a nerdy hack, it's probably easier for most people than various macro solutions to hide and unhide text. And it takes far less time and is more accurate than copying text to another file.

In cases where it is important to see the original context of the text being translated, memoQ's PDF Preview Tool can help: this viewer, available in recent versions, tracks the imported text in a PDF made from the original file using the PDF save options available in Microsoft applications.


May 5, 2022

Forget the CAT, gimme a BAT!

It's been nine months since my last blog post. Rumors and celebrations of my demise are premature; I have simply felt a profound reluctance to wade in the increasingly troubled waters of public media and the trendy nonsense that too often passes for professional wisdom these days. And in pandemic times, when most everything goes online, I feel a better place for me is in a stall to be mucked or sitting on a stump somewhere watching rabbits and talking to goats, dogs or ducks. Certainly they have a better appreciation of the importance of technology than most advocates of "artificial intelligence".


But for those more engaged with such matters, a recent blog post by my friend and memoQ founder Balázs Kis, The Human Factor in the Development of Translation Software, is worth reading. In his typically thoughtful way, he explores some of the contradictions and abuses of technology in language services and postulates that

... for the foreseeable future, there will be translation software that is built around human users of extraordinary knowledge. The task of such software is to make their work as efficient and enjoyable as possible. The way we say it, they should not simply trudge through, but thrive in their work, partially thanks to the technology they are using. 

From the perspective of a software development organization, there are three ways to make this happen:  

  • Invent new functionality 
  • Interview power users and develop new functionality from their input 
  • Go analytical: work from usage data, automate what can be automated and introduce shortcuts 

I think there is a critical element missing from that bullet list. Some time ago, I heard about a tribe in Africa where the men typically carry one tool with them into the field: a large knife. Whatever problem they might encounter is to be solved with two things: their human brains and, optionally, that knife. In a sense, we can look at good software tools in a similar way, as that optional knife. Beyond the basic range of organizing functions that one can expect from most modern translation environment tools, the solution to a challenge is more often to be found in the way we use our human brains to consider the matter, not so much in the actual tool we use. So, from a user perspective and from the perspective of a software development organization, thriving work depends not so much on features as on a flexible approach to problem solving, based on an understanding of the characteristics of the challenge at hand and the possibilities, often not adequately discussed, of the available tools.

But developing capacities to think frequently seems much harder than "teaching" what to think, which is probably why the former approach is seldom found in professional language service training, even when the trainers may earnestly believe this is what they are facilitating.

I'll offer a simple example from recent experience. In the past year, most of my efforts have been devoted to consulting and training for language technology applications, trying to deal with crappy CMS systems whose developers never gave proper consideration to translation workflows, or developing methods to handle really weird outliers like comment translation for distributed PDFs or filtering the "protected" content of Microsoft Word documents with restricted editing to... uh... protect the "restricted" parts.

That editing function in Microsoft Word was new to me despite the fact that I have explored and used many functions of that tool since I was first introduced to it in 1986. I qualify as a power user because I am probably familiar with at least five percent of the program's features, though I am constantly learning new ways to apply that five percent. And the remaining 95% is full of surprises:

Most of the text here can't be edited in MS Word, but default CAT tool filters cannot exclude it.

Only the highlighted text can be edited in the word processor, and that was also the only text to be translated. The real files were much larger than this example, of course, and the text to be translated was interspersed with a lot of text to be left alone. What can you do?

It was interesting to see the various "solutions" offered, some of which involved begging or instructing the customer to do one thing or another, which is not always a practical option. And imagine the hassles of any kind of manual selection, copying and replacement if you have hundreds of pages like this. So some kind of automation is needed, really. Oh, and you can't even hide the protected text. It will import with the default filters of the translation tool, where it will then be indistinguishable from the actual text to be translated and it can be modified. In other words, bye-bye "protection".

What can be done?

There are a number of possibilities that fall short of developing a new option for import filters, which could take years given the often sluggish development cycles for any major CAT tool. One would be...

... to consider that a Microsoft Word DOCX file is really a ZIP archive with a bunch of stuff inside it. That stuff includes a file called document.xml, which contains the actual text of the MS Word document:


That XML file has an interesting structure. All the document text is in one line as one can see when it is opened in a code editor like Notepad++:


I've highlighted the interesting part, the part with the only text I want to see after importing the file for translation (i.e. the text for which editing is not restricted in MS Word). Ah yes, my strategy here is to deal with the XML text container for the DOCX file and ignore the rest. When the question was raised, I knew there must be such a file, but despite exploring the internal bits of MS Office files with ZIP archive tools for about a decade now, I had never actually had occasion to poke around inside document.xml, and I knew nothing of that file's structure. But simple logic told me there must be a marker there somewhere which would offer a solution.

As it turned out, the relevant markers are a set of tags denoting the beginning and end of a text block with editing permission. These can be seen at the start and finish of the text I highlighted in the screenshot. So all that remains is to filter that mess. A simple thing, really.
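In the files I examined, those markers were the w:permStart and w:permEnd elements. Outside any CAT tool, the extraction logic can be sketched in a few lines of Python; this is only an illustration of the principle, with a hypothetical file name, and its regex is just one of many ways to parse the blocks:

import re
import zipfile

# A DOCX is a ZIP archive; read the text container directly.
with zipfile.ZipFile("restricted.docx") as z:   # hypothetical file name
    xml = z.read("word/document.xml").decode("utf-8")

# Editable ranges are delimited by w:permStart ... w:permEnd markers.
blocks = re.findall(r"<w:permStart[^>]*/>(.*?)<w:permEnd", xml, re.DOTALL)

for block in blocks:
    # Strip the remaining run markup to show only the editable text.
    print(re.sub(r"<[^>]+>", "", block))

In practice, though, the job belongs inside the translation environment, where matches, QA and preview are available.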

In memoQ, there is a "filter" which is not really a filter: the Regex Text Filter. It's actually a toolkit for building filters for text-based files, and XML files are really just text files with a lot of funky markup. I don't care about any of that markup except in the blocks I want to import, so I customized the filter settings accordingly:


A smattering of regular expressions went a long way here, and the expressions used are just some of many possible ways to parse the relevant blocks. Then I added the default XML filter after the custom regex text filter, because memoQ makes filter sequencing of many kinds very easy that way. This problem can be solved with any major CAT tool I think, but I don't have to think very hard about such things when I work with memoQ. The result can be sent from memoQ as an XLIFF file to any other tool if the actual translator has other preferences. Oh, the joys of interoperable excellence....

The imported text for translation, with preview 

After translation, document.xml is replaced in the DOCX file by the new version, and the work is done, the "impossible" accomplished without any new features added to the basic toolkit. Computer assistance is all very well, but without brain-assisted translation you're more likely to achieve half the result with double the effort or more.





Jan 2, 2019

Hacking the "Hey memoQ" dictation commands


In the initial release of the Hey memoQ dictation feature in memoQ version 8.7.3, it's a bit inconvenient to deal with command configuration. Unlike most configurations in memoQ, the dictation commands cannot yet be exported as a light resource and shared with other users, nor can a configuration for generic German, for example, be easily transferred to a desired variant such as "ger-DE" or "ger-CH". Surely this will be addressed soon, but at the moment it's a bit of a nuisance.

But fear not... there is usually a backdoor to hack memoQ configurations, and this is no exception.


The screenshot above shows the path to the current configuration file for dictation commands. The XML file contains all the configured commands for all the memoQ languages and variants, including those of no interest whatsoever.

Deep inside the Hey memoQ dictation command file with Notepad++

A peek inside the XML file reveals that the dictation commands are structured as key-value pairs. And here it is possible to enter the text for dictation commands, simply by typing the desired text between the string tags inside the Value tags.
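A rough illustration of what such a pair might look like, reconstructed from the description above (every element name except Value and string is hypothetical):

<KeyValuePair>
  <Key>InsertCurrentHit</Key>
  <Value>
    <string>insert this</string>
  </Value>
</KeyValuePair>

Typing a different phrase between the string tags changes what the speech recognizer listens for with that command.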

A configuration (Commands set) for one variant of a language - such as generic Portuguese - can also be copied to other variants - such as Brazilian or European Portuguese, saving the trouble of re-entering everything laboriously in the configuration dialog within memoQ.

I made a copy of the XML configuration and edited it to have only the variants of English and German that were of interest to me. Then I copied this file over the one in the memoQ configuration directory shown in the screenshot above. When I restarted memoQ, the file bloated a bit; upon examining it, I saw that all the deleted languages had been restored after the ones I had left in the edited file, but the new file was still only 247 KB in size because the senseless copying of English commands to the other languages was gone.

A customized XML file can be shared with other users, who can use it to replace the existing configuration file and probably save time configuring their languages and variants of interest. My file with generic English, EN-US, EN-UK, generic German and DE-DE is here.


Apr 4, 2018

Complicated XML in memoQ: a filtering case example

Most of the time when I deal with XML files in memoQ things are rather simple. Most of the time, in fact, I can use the default settings of the standard XML import filter, and everything works fine. (Maybe that's because a lot of my XML imports are extracted from PDF files using Iceni Infix, which is the alternative to the TransPDF XLIFF exports using Iceni's online service; this overcomes any confidentiality issues by keeping everything local.)

Sometimes, however, things are not so simple. Like with this XML file a client sent recently:


Now if you look at the file, you might think the XLIFF filter should be used. But if you do that, the following error message results in memoQ:


That is because the monkey who programmed the "XLIFF" export from the CMS system where the text resides was one of those fools who don't concern themselves with actual file format specifications. A number of the tags and attributes in the file simply do not conform to the XLIFF standards. There is a lot of that kind of stupidity to be found.

Fear not, however: one can work with this file using a modified XML filter in memoQ. But which one?

At first I thought to use the "Multilingual XML" filter, which I had heard about but never used; this turned out to be a dead end. It is language-pair specific and really not the best option in this case. I was concerned that there might be more files like this in the future involving other language pairs, and I did not want to be bothered with customizing for each possible case.

So I looked a little closer... and noticed that this export had the source text copied exactly to the "target" elements. So I concentrated on building a customized XML filter configuration that would pull the text to translate only from between the target tags, populating the tag list and then excluding the "source" tag content:
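The structure was roughly like the following reconstruction (an illustration only, not the client's actual markup, which deviated from the XLIFF standard in its details):

<trans-unit id="1">
  <source>Text to translate, with &lt;b&gt;embedded HTML&lt;/b&gt;</source>
  <target>Text to translate, with &lt;b&gt;embedded HTML&lt;/b&gt;</target>
</trans-unit>

Importing only the target content leaves the identical source copy untouched as a reference.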



That worked, but not well enough. In the screenshot below, the excluded source content is shown with a gray background, but the imported content has a lot of HTML, for which the tags must be protected:


The next step is to do the import again, but this time including an HTML filter after the customized XML filter. In memoQ jargon, this sort of configuration is known as a "cascading filter" - where various filters are sequenced to handle compounded formats. Make sure, however, that the customized XML filter configuration has been saved first:


Then choose that custom configuration when you import the file using Import with Options:


This cascaded configuration can also be saved using the corresponding icon button.


This saved custom cascading filter configuration is available for later use, and like any memoQ "light resource", it can be exported to other memoQ installations.

The final import looks much better, and the segmentation is also correct now that the HTML tags have been properly filtered:



If you encounter a "special" XML case to translate, the actual format will surely be different, and the specific steps needed may differ somewhat as well. But by breaking the problem down in stages and considering what more might need to be done at each stage to get a workable result with all the non-translatable content protected, you or your technical support associates can almost always build a customized, re-usable import filter in reasonable time, giving you an advantage over those who lack the proper tools and knowledge and ensuring that your client's content can be translated without undue technical risks.

Jun 24, 2017

The other sides of Iceni in Translation


The integration of the online TransPDF service from Iceni in memoQ 8.1 has raised the profile of an interesting company whose product, the Infix PDF Editor, has been reviewed before on this blog. TransPDF is a free service which extracts text content from PDF files, converts it to XLIFF for translation in common translation environments, and then re-integrates the target text from the translated XLIFF to create a PDF file in the target language.

This is a nice thing, though its applicability to my personal work is rather limited, as not many of my clients would be enthusiastic if I were to send PDF files as my translation results. Sometimes that fits, sometimes not. And of course, some have raised the question of whether using this online service is compatible with some non-disclosure restrictions.

I think it's a good thing that Kilgray has provided this integration, and I hope others follow suit, but for the cases where TransPDF doesn't meet the requirements of the job, it is useful to remember Iceni's other options for preparing text for translation.

Translatable XML or marked-up text export
For as long as I can remember, the Infix PDF Editor has offered the option to export text on your local computer (avoiding potential non-disclosure agreement violations) so that it can be translated and then re-imported later to make a PDF in the target language. Only the location of this option in the menus has changed: the menu choices for the current version 7 are shown below.



This solution suffers from the same problem as the TransPDF service: not everyone will be happy with the translation in PDF, as this complicates editing a little. However, I find the XML extract very useful to put the content of PDF files into a LiveDocs corpus for reference or term extraction. The fact that Infix also ignores password protection on PDFs is also helpful sometimes.

"Article" export
The Article Tool of the Iceni Infix PDF Editor enables text blocks on different pages of a PDF file to be marked, linked and extracted in various translatable formats such as RTF or HTML. The quality of the results varies according to the format.


Once "articles" are defined, they are exported via the command in the File menu:


The RTF export has some problems, as this view in Microsoft Word with the format characters made visible reveals:


However, the Simple HTML export opened in Microsoft Word shows no such troubles (and can be saved in RTF, DOCX or other formats):


Use of the article export feature requires a license for the Infix PDF editor, unlike the XML or marked-up text exports for translation. In demo mode, random characters are replaced by an "X" so that one can see how the function works but not receive any unjust enrichment from it. However, this feature has significant value for the work of translators and is well worth an investment, as the results are typically better than using OCR software on a "live" (text-accessible) PDF file.

But wait... there's more!
Version 7 also has an OCR feature:


I tested it briefly on some scanned Portuguese help-wanted ads that I'll probably use for a corpus linguistics lesson this summer; the results didn't look too awful, all things considered. This feature is worth a closer look as time permits, though it is unlikely to replace ABBYY FineReader as my tool of choice for "dead" PDFs.

Jan 6, 2017

A matter of priority in memoQ

Every memoQ user knows the Translation Results pane.


It's that subwindow on the upper right part of the memoQ translation/editing environment which shows content matches from various sources, including translation memories, LiveDocs corpora, term bases, etc.

Most of us don't really do much with it. And why should we? Well..........

Sometimes there are an awful lot of "hits" displayed in that pane. Lots of matches from the TM, and if you're like me and record a lot of specialized terminology and company names not to be translated, sometimes the entry you need to see is not apparent at a glance; you must scroll down some way to find it.

This is a real problem when I am doing financial or legal translations using specialized autotranslatables, or when certain names and nontranslatable acronyms come up very often and cannot be seen conveniently in the visible part of the list in the results pane.

So what's a memoQ user to do? Change the order of data types displayed, for example.


Under Options > Appearance, you are able to change the relative display priority of hits from every kind of memoQ data shown (as well as change the color codes, though I think this is usually a bad idea). The example above has the autotranslatable matches (coded green) set to display at the top of the list. If I had a lot of proper names saved in nontranslatable lists, I would move that category toward the top as well to take advantage of improved visibility and better keyboard shortcuts.

Some jobs definitely benefit from a customized display order in the Translation Results pane. You can change the order in the Options each time to meet the needs of a particular job, or...

... more conveniently, you can have several different configuration files with particular settings for certain work. The relevant configuration is saved in the file Preferences-editor.xml, which is found at C:\Users\{username}\AppData\Roaming\MemoQ.


There are, of course, a lot of other files in that folder. I keep a shortcut on my Desktop now so I can get to the various configuration files quickly when I want to make changes.

The relevant changes to make in Preferences-editor.xml are found between the tag sets for <hitorderex> and <disabledhittypes>:
<hitorderex>400,300,100,200,500,700,600</hitorderex> 
<disabledhittypes>100,500,700,600</disabledhittypes>
The first is the order in which the various types of translation hit results are to appear. The second lists those types in the sequence which are not to be displayed. Note that <hitorderex> also includes the types that will not be shown, so that if their display is re-enabled, memoQ will know where they belong.

The correlation of the numeric codes used here to the hit types is as follows:
100 = Translation memory
200 = Term base
300 = Non-translatable
400 = Auto-translation
500 = Fragment assembly
600 = LSC
700 = Machine translation
So in the example above, the display of TM, fragment, LSC and machine translation results has been suppressed.

One convenient way to switch quickly between configuration "profiles" is to keep versions of the XML configuration files with descriptive suffixes in the filename and put an alias (shortcut) for that file somewhere convenient, like on your Desktop. Such a file where autotranslatables and nontranslatables are shown at the top might be Preferences-editor_Autotrans-Nontrans.xml

Before starting memoQ, I find the shortcut for the configuration file I want loaded, open it by double-clicking and Save As... with the additions to the filename deleted. This will overwrite the preferences file that was used previously. To switch back, I quit memoQ, open the backup copy of the preferences file I usually use and save it under the name Preferences-editor.xml. Until Kilgray implements actual saveable/loadable user profiles, this is as easy as it will get. Of course this method can also encompass other aspects of configuration.
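The same swap can be scripted; here is a minimal sketch in Python, assuming the profile naming convention described above (run it only while memoQ is closed):

import os
import shutil

cfg_dir = os.path.expandvars(r"%APPDATA%\MemoQ")
live    = os.path.join(cfg_dir, "Preferences-editor.xml")
profile = os.path.join(cfg_dir, "Preferences-editor_Autotrans-Nontrans.xml")

shutil.copy(live, live + ".bak")   # back up the current settings first
shutil.copy(profile, live)         # activate the chosen profile

A second script, or the same one with the file names swapped, restores the usual configuration.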



Mar 18, 2014

The curious case of crappy XML in memoQ

Recently one of my collaboration partners sent me a distressed e-mail asking about a rather odd XML file he had received. This one proved to be a little different from the ordinary filter adaptation challenge.

The problem, as it was explained to me, seemed to involve dealing with the trashed special characters in the German source text:


"Special" characters in German - äöüß - were all rendered as entities, which makes them difficult to read and screws up terminology and translation memory matches among other things. Entities are simply coded entries, which in this case all begin with an ampersand (&) and end with a semicolon. They are used to represent characters that may not be part of a particular text encoding system.

At first I thought the problem was simply a matter of adjusting the settings of the XML filter. So I selected Import with options... in my memoQ project and had a look at what my possibilities were. The fact that the filter settings dialog had an Entities tab seemed like a good start.


This proved to be a complete dead end. None of the various options I tried in the dialog cleared up the imported garbage. So I resolved to create a set of "custom" entities to handle with the XML filter, and used the translation grid filter of memoQ to make an inventory of these.

Filtering for source text content in the memoQ translation grid
That's when I noticed that the translatable data in this crappy faux XML file was actually HTML text. So I thought perhaps the cascading filters feature of memoQ might help.

Using all the defaults, I found that the HTML was fixed nicely with tags, but I did not want the tags that were created for non-breaking spaces (&nbsp;):


So I had another look at the settings of the cascaded HTML filter:


I noticed that if the option to import non-breaking spaces as entities is unmarked (it is selected by default), these are imported quite properly:

 
Now the text of some 600 lines was much easier to work with - an ordinary readable document with a few protected HTML tags.

I'll be the first one to admit that the solution here is not obvious; in fact, one of my favorite Kilgray experts apparently took a very different and complex path using external tools that I simply don't understand. There are many ways to skin a cat, most of them painful - at least for the cat.

As I go through and update various sections of my memoQ tips guide, I'll probably expand the chapters on cascading filters and XML to discuss this case. But I haven't quite figured out a simple way to prepare the average user for dealing with a situation like this where the problem is not obvious. One thing is clear however - it pays to look at the whole file in order to recognize where a different approach may be called for.

Maybe a decision matrix or tree would do the trick, but probably not for many people. In this case the file did not have a well-designed XML structure, and that contributed to the confusion. My colleague is an experienced translator with good research skills, and he scoured the memoQ Help and the Kilgray Knowledgebase in vain for guidance. Our work as translators poses many challenges. Some of these are old, familiar ones, repackaged in new and confusing ways, as in this case. So we must learn to look beyond mere features and instead observe carefully what we are confronted with, using that wit which distinguishes us from the dumb machines which the dummies fear might replace them.

Feb 1, 2014

The fix is in for PDF charts

Over four years ago, I reviewed Iceni Infix after I began working with it. I'm not as strong a fan as some, because I generally have little enthusiasm for direct editing of PDFs and dealing with frequent problems such as missing unusual fonts and having to play the guess-my-optimum-font-substitution game, but I do find it useful in many situations. I found another one of those today.

A new client of a friend works with a horrible German program to produce reports full of charts. The main body of the text is written in Microsoft Word and is available as a reasonable DOCX file, but the charts are a problem, as they are available only in the specific, oddball tool or PDF format. Nobody wants to deal with that software, really. It is supported by no translation tools vendor I am aware of, and like another example of incompatible German software, Across, it enjoys the obscurity it deserves.

After thinking about the approach needed in this case, I realized that if the graphics could be isolated conveniently on pages, the XML export from the PDF document would contain only information from the graphics. After translation, the format could be touched up with Infix before making bitmap screenshots at an enlargement which would yield decent resolution when sized in layout. Of course, in projects involving multiple languages the XML files could be used with great convenience.

Selecting and deleting the text on the pages with Iceni Infix is really a no-brainer. The time charge for such work will be quite reasonable. And exporting the XML or marked-up text to translate is also quite straightforward:


The exports can be handled in nearly any CAT tool, so TMs and terminology resources can be put to full use. Or you can edit in a simple, free tool like Notepad++ or an XML-savvy editor.



The screenshot above shows the XML in memoQ. No customization of the default filter is required. Reports from other users who have worked in a similar way indicate that OmegaT and other environments generally have few, if any, problems. In one case there was trouble re-integrating the graphics in a project that also had 50 pages of text, but there may have been other issues I am not aware of in that case.


With the content in the TM, if the chart data are made available in another format, the translations can be transferred quickly to that format for even better results. The same approach can be used for a very wide variety of other electronically generated graphic formats (except some of the really insane ones I've seen where the text is broken up; I don't know whether Iceni sanitizes such messes or not).

I think this is an approach which can benefit many of us in a variety of projects. It is not really suited for cases of bitmap graphics, but I have other approaches there in which Iceni Infix may also play a useful role and allow CAT integration. Licenses for the tool are quite reasonably priced, and the trial version (in Pro mode) is entirely suited for testing and learning this process.

Jul 17, 2013

How would you translate the chart in this DOCX file?

Can anyone tell me quickly the best way to translate the chart in this DOCX file? Or how to get an accurate word count of the words to be translated in the file?

*****

I love to see the different approaches people take to this problem. It's one which I think is encountered with some frequency by translators, and in the past I tried many different approaches to it - long ago I usually did something involving PDF conversion, editing of the PDF and making a screenshot. But that is inefficient and doesn't allow the use of CAT tools.

Yesterday I picked up a project with 18 of those silly charts embedded in it. A real nuisance. Here's what happens if you try to edit one of those charts in situ:


Hopeless, right? A lot of very authoritative web pages make it clear that without having the linked Excel files, you cannot modify the text. Not true, actually. With or without hints, a number of technically versatile colleagues found ways to solve the problem or at least made close guesses. Some of these are here in the comments. One very interesting exchange on Twitter showed that somehow the settings of the OmegaT import filters can be tweaked to solve this:




The thing about OmegaT is that it's sort of geeky - the solution looks pretty good here, but I can't actually make it work myself.

The solution I worked out last night is very similar to the one described by Stanislas in the comments.
  1. Change the file extension to ZIP
  2. Look inside the ZIP file with Windows Explorer or another suitable tool as described in other blog posts.
  3. Inside the "word" subfolder there is a folder named "charts". It contains XML data with all the chart headings, numbers and labels. Copy it.
  4. Paste a copy of the folder where you want your source files. Import the chart XML files into any CAT tool or XML editor. It's a good idea to configure a filter to exclude and protect the references to the original Excel files with the data. (Though I am curious whether deliberately spoiling these data can protect against the unwanted update that one person worried about in the comments. I'll have to try that.)
  5. When the translations are completed, paste the XML files back inside the charts folder in the file structure.
  6. Rename the extension back to what it was at the start (DOCX in this case). You're done. No refresh necessary (unlike with embedded Excel or PowerPoint objects).
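Steps 1 through 4 lend themselves to scripting if this comes up often. Here is a minimal sketch in Python with placeholder file and folder names:

import os
import shutil
import zipfile

os.makedirs("charts_for_translation", exist_ok=True)

# A DOCX is a ZIP archive; the chart text lives in word/charts/*.xml.
with zipfile.ZipFile("report.docx") as z:   # hypothetical file name
    for name in z.namelist():
        if name.startswith("word/charts/") and name.endswith(".xml"):
            out = os.path.join("charts_for_translation", os.path.basename(name))
            with z.open(name) as src, open(out, "wb") as dst:
                shutil.copyfileobj(src, dst)

Re-inserting the translated files (step 5) works like the document.xml replacement shown earlier: rebuild the archive and substitute the translated chart XML for the original members.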



A memoQ filter configuration for these XML files can now be found on Kilgray's Language Terminal.

Sep 16, 2012

The Sodrat Suite: delimited text to MultiTerm

The growing library of tools in the Sodrat Suite for Translation Productivity now includes a handy drag & drop script sample for converting simple tab-delimited terminology lists into data which can be imported directly into the generations of (SDL) Trados MultiTerm with which we've been blessed for more than half a decade.

Many people rightly fear and loathe the MultiTerm Convert program from SDL, and despite many well-written tutorials for its use, intelligent, competent adult translators have become all too frequent callers on the suicide hotline in Maidenhead, UK.

Thus I've cast my lot with members of an Open Source rescue team dedicated to squeezing a little gain for the victims of all this pain and prescribing appropriate remedies for what ails so many of us by developing the Sodrat Software Suite. The solutions here are quick, but they aren't half as dirty as what some pay good money for.

The script below is deliberately unoptimized. It represents less work than drinking a cup of strong, hot coffee on a cold and clammy autumn morning. Anyone who feels like improving on this thing and making it more robust and useful is encouraged to do so. It was written quickly to cover what I believe is the most common case for this type of data conversion. An 80 or 90% solution is 100% satisfactory in most cases. Copy the script from below, put it in a text file and change the extension to VBS, or get the tool, a readme file and a bit of test data by clicking the icon link above.

To run the conversion, just put your tab-delimited text file in the folder with the VBS script and then drag it onto the script's icon. The MultiTerm XML import file will be created in the same folder, with a name based on that of the original terms file.

Drag & Drop Script for Converting Tab-delimited
Bilingual Data to MultiTerm XML

ForReading = 1
Set objArgs = WScript.Arguments
inFile = objArgs(0) ' name of the file dropped on the script

Set objFSO = CreateObject("Scripting.FileSystemObject")
Set objFile = objFSO.OpenTextFile(inFile, ForReading)

' read first line for language field names
strLine = objFile.ReadLine
arrFields = Split(strLine, chr(9))

q = chr(34) ' double quote character for XML attributes
outText = "<?xml version=" & q & "1.0" & q & " encoding=" & q & "UTF-16" & q & "?>" & chr(13) & "<mtf>" & chr(13)

Do Until objFile.AtEndOfStream
 strLine = objFile.ReadLine
 if strLine <> "" then
  arrTerms = Split(strLine, vbTab)

  ' each line of terms becomes one MultiTerm concept entry
  outText = outText & "<conceptGrp>" & chr(13)
      for i = 0 to (UBound(arrTerms))
        ' open a language group; the header field supplies the language name
        outText = outText & chr(9) & "<languageGrp>" & chr(13) & chr(9) & chr(9) _
                   & "<language lang=" & q & arrFields(i) & q & " type=" & q & arrFields(i) & q & "/>" & chr(13)
        ' write the term
        outText = outText & chr(9) & chr(9) & chr(9) & "<termGrp><term>" & _
               arrTerms(i) & "</term></termGrp>" & chr(13) & chr(9) & "</languageGrp>" & chr(13)
      next
  outText = outText & "</conceptGrp>" & chr(13)
 end if
Loop

outText = outText & "</mtf>"
objFile.Close
outFile = inFile & "-MultiTerm.xml"

' second param is overwrite, third is unicode
Set objFile = objFSO.CreateTextFile(outFile, 1, 1)
objFile.Write outText
objFile.Close


A new look for MultiTerm XML data in memoQ & Trados

Recently I got back to testing suggestions made last year for improving the quality and usability of terminology data from memoQ and Trados MultiTerm. With a bit of a refresher for rusty XSLT skills and brilliant help with sorting challenges from Stefan Gentz of Tracom, things are now looking quite promising. Here's a first look at early results for HTML conversions of term data in MultiTerm XML format:


This approach, perhaps including other useful conversions, will be included in the chapter on "possibly useful scripts and macros" in the tutorial guide memoQ 6 in Quick Steps, which will be released very soon with over 200 pages of productivity suggestions for the latest version of Kilgray's desktop technology.

Jul 5, 2012

Pride goeth before....

Yesterday a friend passed on the link to his new web site in English, which I had localized recently. The revolving graphic banner of heavy equipment loading cargo for shipment around the world looked brilliant. Of the various slogan suggestions my copywriting friend and I had made, he had chosen the best. A good job for a deserving company which provides excellent international service. And then I looked closer.

When I discovered that his service provider uses the Open Source content management system TYPO3, I thought the little request he had made would be an easy one with a guaranteed good outcome. After all, memoQ, my translation environment tool of choice, has special XML filter configurations for TYPO3. But it seems that things are not so easy in the world of German CMS service.

Vee haf vayz off enterink ze kontent, I was told, or words to that effect. I was dutifully informed that an XML export would contain "unnecessary information" of no interest to a mere translator and that I was to translate from an MS Word file provided. When I expressed my concern that copying errors might result from this procedure and encouraged the use of the free plug-in to export content from TYPO3 for translation, I was informed that his service fee for such frivolous nonsense would be EUR 95,00 zzgl. MwSt (plus VAT), bitte schön. Welcome to Servicewüste Deutschland, Germany's proverbial "service desert".

After a few days of negotiation on the technical aspects of this three-page translation, I finally decided to heed that old advice about not arguing with fools (who can tell the difference?) and simply sent the translation, albeit as a bilingual draft for simpler review. And I awaited feedback before issuing the final translation. Apparently all was well, because the draft was used to enter the content straightaway.

So far I've found four errors on the two pages I've looked at. All from copy/paste carelessness or retyping things and doing so with a bit too much Teutonic flair in ze zpellink. The thought of the potential damage to the image of my friend's business makes me decidedly queasy. God help him if they decide to do other languages like Chinese, Russian or Arabic, which is a possibility.

A good translation is more than just the right words for an occasion. It is a process. A process of communication among people, which sometimes involves technology. Humans are prone enough to error; even the best of us can overlook small but important details in a familiar text, and it's usually wise to stack the deck and deal with processes that minimize the risk of errors. Like providing translation content in a format that will minimize the human intervention required for information transfer. Please. Our customers deserve that at least.

Jun 3, 2010

Dealing with embedded XML and HTML in an Excel file

One of the occasionally gratifying aspects of translation for an IT geek like me is that IT challenges continue to follow me. Actually, that's one of the things about the current state of the profession that I hate too. (I'm a not-so-closeted Luddite.)

This week's challenge was more of a fun puzzle, because it wasn't my problem, but rather someone else's. An agency owner friend sent me an Excel file that was driving him nuts; his localization engineer, a former star at a Top Ten agency, had pronounced the task of filtering the data in a useful way to be impossible. I love it when engineers say something is impossible; it usually means there is a simple solution at hand if one gives the matter a little real thought.

The file structure looked something like this:

Only the yellow columns were to be translated; some had plain text content (with line breaks in some cases), other yellow columns had XML or HTML content.

Just for fun, I fired off a quick support request to Kilgray along with a copy of my test file, because I thought maybe there was a cascading filter feature I might have overlooked. (There isn't, but the idea was noted as a good one, so maybe we'll see it in the future.) In any case, Denis Hay offered a creative suggestion as he almost inevitably does:
Hi Kevin,

While waiting for "cascading filters" (which I also find a great idea), what you could do is simply copy these Excel columns to a Word table, then use either Tortoise Tagger or, preferably, the +Tools from the Wordfast website to tag the HTML/XML content. Import that tagged Word file into memoQ, and you should get what you wanted.

Once translated, just paste back to Excel.

Kind regards,
Denis Hay
Technical consulting and training
Kilgray Translation Technologies


There's another way I discovered by the time Denis' suggestion arrived. It works well manually, but it can also be automated with macros if you're dealing with content management system exports where the structure recurs and you'll be doing a lot of this.

Do the following:
  1. Copy each individual Excel column of interest (or at least the ones with XML/HTML) into a plain text file.
  2. In the case of the text files with tagged content (i.e. XML or HTML), change the file extension to fit the content (i.e. "text2.txt" becomes "text2.xml", etc.).
  3. Translate the text files with your favorite translation environment tool, using the filters appropriate for each type of content.
  4. After exporting the files from your working environment, copy and paste the text file content back into the corresponding columns of the original Excel file. Note that if there are line breaks somewhere, your row positions may get screwed up. This can be solved by performing this operation in OpenOffice Calc. (Maybe there's an appropriate setting for Excel to avoid this problem, but I don't know it.)
The key to sorting this puzzle out was to consider the discrete parts (i.e. the individual yellow columns) of the entire data set as separate collections of data. Dividing a problem up into its constituent parts is often a good way to find an easy solution.
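For the recurring CMS-export case, step 1 can also be scripted. A minimal sketch in Python using the openpyxl library (my choice here, not part of the original workflow; the column number and file names are placeholders):

from openpyxl import load_workbook

wb = load_workbook("cms_export.xlsx")   # hypothetical file name
ws = wb.active

# Dump column C (say, the one with HTML content) to a standalone file
# that can be imported with an HTML filter.
with open("column_c.html", "w", encoding="utf-8") as f:
    for (cell,) in ws.iter_rows(min_col=3, max_col=3):
        f.write(f"{cell.value or ''}\n")

The write-back in step 4 must preserve the row order exactly, which is why the line-break warning above matters: one spreadsheet row must remain one line in the text file.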