Dec 26, 2016

The challenge of too many little files to translate

It seems to me that most translators face this challenge eventually: a customer has many small files of some kind - tiny web pages perhaps or other content snippets in XML, text or Microsoft Word files or perhaps even in some bizarre proprietary format - and wants them translated.

Imagine a dictionary project with thousands of words with their definitions, each "entry" being stored in a separate text file. How would you translate that efficiently?

The brute force method of opening and translating each file individually is not very satisfactory. Not only does this take a long time, but when I have tried foolishness like that I tend to overlook some files and spend far too much time checking to ensure that nothing has been overlooked. And QA measures like spellchecking? Let's change the subject....

Some translation tools offer the possibility to "glue" the content of the little files together and then (usually) "unglue" them later to reconstitute the original structure of little files, now translated.

Other tools offer various ways to combine content in "views" to allow translation, editing, searching and filtering in one big pseudofile. This is very convenient, and this is the method I use most often in my work with memoQ or SDL Trados Studio after learning its virtues earlier as a Déjà Vu user.

Unusual file formats can often be dealt with the same way after some filter tweaking or development. But sometimes....

... there are those projects from Hell where you have to ask yourself what the customer was smoking when he structured his data that way, because some other way would be so much more practical and convenient... for you. Ours is generally not to question why some apparently insane data structure was chosen but to deal with the problem as efficiently as possible within budget and charge appropriately for any extra effort incurred. Hourly fees for translation rather than piece rates certainly have a place here.

Sometimes there is a technical solution, though it may not be obvious to most people. For example, in the case presented to me by a colleague on Christmas Eve


the brief was to write the translation in language XX in the empty cell in that column of the 3x2 table embedded in a DOCX file. There were hundreds of these files, each containing a single word to translate.

If these were Excel or delimited text files, a simple solution would have been to use the Multilingual Delimited Text Filter for memoQ and specify that the first row is a header. But that won't fly (yet) for MS Word files of any kind.

In the past when I have had challenging preparation to do in RTF or Microsoft Word formats - such as when only certain highlighted passages are to be translated and everything else is ignored - I have created macros in a Microsoft Office application to handle the job.

But this case was a little different. The others were always single files, or just a few files where individual processing was not inconvenient. And macro solutions often suffer from the difficulty that most mere mortals fear to install macros in Microsoft Word or Excel or simply have no idea how to do so.

So some kind of bulk external processing is called for. In this case, probably with a custom program of some kind.

I usually engineer such solutions with a simple scripting language, VBScript - a dialect of the BASIC language which I learned some 45 years ago - using a free feature which is part of the Microsoft Windows operating system: Windows Script Host. And one-off, quick-and-dirty solutions with these tools do not require a lot of skill. The components of many solutions can be found on Microsoft Help pages or various internet forums with a little research, even if you have only a vague idea of what to do.

In this case, the tasks were to
  1. Select the files to process (all 272 of them)
  2. Open each file, copy the English word into the empty cell next to it
  3. Hide all the other text in the file so that it can be excluded from an import into a working tool like Déjà Vu, memoQ or SDL Trados Studio (using the options for importing Microsoft Word files in this case; the defaults usually ignore hidden text on import)

After that the entire folder structure of files could be imported into most professional translation support environments and all 300 or so words to translate could be dealt with in a single list view.
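
The file dialog used in the script further below selects files in one folder at a time. Where the files are spread over a folder structure, step 1 could perhaps be handled by walking the folders with the FileSystemObject instead. The following is only a minimal sketch under that assumption: the helper name CollectDocx and the path "C:\jobs\dictionary" are invented for illustration and were not part of the actual project.

```vbscript
Option Explicit

' Hypothetical helper: collects the full paths of all *.docx files
' under folderPath, including subfolders, as keys of a Dictionary.
Sub CollectDocx(fso, folderPath, found)
   Dim folder, f, subFolder
   Set folder = fso.GetFolder(folderPath)
   For Each f In folder.Files
      ' skip Word's temporary lock files, whose names start with "~$"
      If LCase(fso.GetExtensionName(f.Name)) = "docx" _
            And Left(f.Name, 2) <> "~$" Then
         found.Add f.Path, True
      End If
   Next
   For Each subFolder In folder.SubFolders
      CollectDocx fso, subFolder.Path, found
   Next
End Sub

Dim fso, found, startFolder
Set fso = CreateObject("Scripting.FileSystemObject")
Set found = CreateObject("Scripting.Dictionary")

' placeholder path (an assumption), to be replaced with the real folder
startFolder = "C:\jobs\dictionary"

If fso.FolderExists(startFolder) Then
   CollectDocx fso, startFolder, found
   WScript.Echo found.Count & " DOCX files found"
Else
   WScript.Echo "Folder not found: " & startFolder
End If
```

The keys of the Dictionary (found.Keys) could then be looped over in place of the dialog's .SelectedItems collection.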

A more detailed definition of the technical challenge would include the fact that to manipulate data in some way in a Microsoft Office file format, the object model for the relevant program would probably have to be used in programming (for XML-based formats there are other possibilities that some might prefer).

Microsoft kindly makes the object models of all its programs available, usually for free, and there is a lot of documentation and examples to support work with them. That may in fact be a problem: there is a lot of information available, and it is sometimes a challenge to filter it all intelligently.

In this case, I needed to use the Microsoft Word object model. It also conveniently provided the methods I needed to create the selection dialog for my executable script file. The method I knew from the past and wanted to use at first is only available to licensed developers, and I am not one of these any more.

It is easy to find examples of table manipulation and text alteration techniques in Microsoft Word using its object model in VBScript or some other Microsoft Basic dialect like Visual Basic for Applications (VBA). The casual dabbler in such matters might run into some trouble using these examples without being aware of the differences between these dialects; trouble often arises with VBA examples that declare variables by type (example: "Dim i as Integer"). Declarations in VBScript must be untyped (i.e. "Dim i"), so a few changes are needed.
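
As a small illustration of that difference, here is a sketch of how a typed VBA declaration might be adapted for VBScript. The variable names are arbitrary and not taken from any particular example.

```vbscript
Option Explicit

' VBA would accept:   Dim i As Integer
' VBScript rejects the "As" clause when the script is compiled,
' so the type is simply dropped:
Dim i        ' untyped; every VBScript variable is a Variant
Dim total

total = 0
For i = 1 To 3
   total = total + i
Next

' TypeName reports the subtype the Variant currently holds
WScript.Echo "total = " & total & " (" & TypeName(total) & ")"
```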

In this case, the quick and simple solution to make the files import-ready (documentary comments begin with an apostrophe) was:

' We have a folder full of DOCX files, each containing
' a three-column table where COL1 ROW2 needs to be copied to COL2 ROW2
' and then the COL1 ROW2 and other content needs to be hidden.

Option Explicit

Dim fso
Dim objWord
Dim WshShell
Dim File
Dim objFile
Dim fileCounter
Dim wrd ' Word app object
Dim oFile  ' Word doc object
Dim oCell1  ' first cell of interest in the table
Dim oCell2  ' second cell of interest in the table
Dim oCellx1  ' other uninteresting text
Dim oCellx2  ' other uninteresting text
Dim oCellx3  ' other uninteresting text 
Dim oCellx4  ' other uninteresting text 

fileCounter = 0

'set the type of dialog box you want to use
'1 = Open
'2 = SaveAs
'3 = File Picker
'4 = Folder Picker
Const msoFileDialogOpen = 1

Set fso = CreateObject("Scripting.FileSystemObject")
Set objWord = CreateObject("Word.Application")
Set WshShell = CreateObject("WScript.Shell")

'set the folder in which the dialog box opens
objWord.ChangeFileOpenDirectory("c:\")

With objWord.FileDialog(msoFileDialogOpen)
   'set the window title to whatever you want
   .Title = "Select the files to process"
   .AllowMultiSelect = True
   'Get rid of any existing filters
   .Filters.Clear
   'Show only the desired file types
   .Filters.Add "All Files", "*.*"
   .Filters.Add "Word Files", "*.doc;*.docx"
         
   '.Show returns:
   '-1 = the user pressed the action button (Open)
   ' 0 = the user cancelled the dialog box
   'If objWord.FileDialog(msoFileDialogOpen).Show = -1 Then  'long form
   If .Show = -1 Then  'short form
      'Set how you want the dialog window to appear
      'it doesn't appear to do anything so it's commented out for now
      '0 = Normal
      '1 = Maximize
      '2 = Minimize
      'objWord.WindowState = 2

      '.SelectedItems is a collection, so a For Each loop is needed
      'even when only one file is selected
      '"File" is a string containing the full path of the selected file

      'Reuse the Word instance created above. GetObject could attach to
      'a Word session the user already has open and close its documents.
      Set wrd = objWord
      wrd.Visible = False

      For Each File In .SelectedItems  'short form
         'Get a file object for the selected path for easier manipulation
         Set objFile = fso.GetFile(File)
         Set oFile = wrd.Documents.Open(objFile.Path)

         Set oCell1 = oFile.Tables(1).Rows(2).Cells(1).Range  ' EN text
         oCell1.End = oCell1.End - 1  ' exclude the end-of-cell marker
         Set oCell2 = oFile.Tables(1).Rows(2).Cells(2).Range  ' Target (XX)
         oCell2.End = oCell2.End - 1
         oCell2.FormattedText = oCell1.FormattedText  ' copies EN>XX
         oCell1.Font.Hidden = True  ' hides the text in the source cell

         ' hide the other cell texts (nontranslatable) now
         Set oCellx4 = oFile.Tables(1).Rows(2).Cells(3).Range
         oCellx4.Font.Hidden = True
         Set oCellx1 = oFile.Tables(1).Rows(1).Cells(1).Range
         oCellx1.Font.Hidden = True
         Set oCellx2 = oFile.Tables(1).Rows(1).Cells(2).Range
         oCellx2.Font.Hidden = True
         Set oCellx3 = oFile.Tables(1).Rows(1).Cells(3).Range
         oCellx3.Font.Hidden = True

         ' save explicitly, then close only this document
         oFile.Save
         oFile.Close

         fileCounter = fileCounter + 1
      Next

      Set wrd = Nothing
   End If
End With

'Close Word
objWord.Quit

' saying goodbye
msgbox "Number of files processed was: " & fileCounter



After processing with the script, the individual files look like the screenshot above (all text in the top row is hidden, so the entire row is invisible, including its bottom border line). The script is saved in a text file with a *.vbs extension and can be launched under Windows by double-clicking.


Of course the script could be made much shorter by declaring fewer variables and structuring it more efficiently, but this was a one-off job where time was of the essence, and I just needed to patch together something fast that worked. If this were a routine solution for a client I would be a bit more professional: lock the screen view, change to some sort of "wait cursor" during processing or show a progress bar in a dialog, and add all the other trimmings that one expects from professional software these days. But professional software development is a bit of a bore after so many decades, and I haven't got the patience to see the same old stupid mistakes and deceits practiced by yet another generation of technowannabe world rulers. I just want to solve problems like this so I can get back to my translations or go play with the dogs and feed the chickens.

But before I could do that I had to save my friend from the Hell of manually unhiding all that table text after his little translation was finished, so I put another 5 minutes (or less) of effort into the "unhiding" script:

Option Explicit

Dim fso
Dim objWord
Dim WshShell
Dim File
Dim objFile
Dim fileCounter
Dim wrd 
Dim oFile  
Dim oCell1  ' source text cell in the table
Dim oCellx1  ' other uninteresting text
Dim oCellx2  ' other uninteresting text
Dim oCellx3  ' other uninteresting text 
Dim oCellx4  ' other uninteresting text 

fileCounter = 0

Const msoFileDialogOpen = 1

Set fso = CreateObject("Scripting.FileSystemObject")
Set objWord = CreateObject("Word.Application")
Set WshShell = CreateObject("WScript.Shell")

objWord.ChangeFileOpenDirectory("c:\")

With objWord.FileDialog(msoFileDialogOpen)
   .Title = "Select the files to process"
   .AllowMultiSelect = True
   .Filters.Clear
   .Filters.Add "All Files", "*.*"
   .Filters.Add "Word Files", "*.doc;*.docx"
   If .Show = -1 Then
      'Reuse the Word instance created above. GetObject could attach to
      'a Word session the user already has open and close its documents.
      Set wrd = objWord
      wrd.Visible = False

      For Each File In .SelectedItems
         Set objFile = fso.GetFile(File)
         Set oFile = wrd.Documents.Open(objFile.Path)

         ' unhide the source cell and the other nontranslatable cells
         Set oCell1 = oFile.Tables(1).Rows(2).Cells(1).Range
         oCell1.Font.Hidden = False
         Set oCellx4 = oFile.Tables(1).Rows(2).Cells(3).Range
         oCellx4.Font.Hidden = False
         Set oCellx1 = oFile.Tables(1).Rows(1).Cells(1).Range
         oCellx1.Font.Hidden = False
         Set oCellx2 = oFile.Tables(1).Rows(1).Cells(2).Range
         oCellx2.Font.Hidden = False
         Set oCellx3 = oFile.Tables(1).Rows(1).Cells(3).Range
         oCellx3.Font.Hidden = False

         ' save explicitly, then close only this document
         oFile.Save
         oFile.Close

         fileCounter = fileCounter + 1
      Next

      Set wrd = Nothing
   End If
End With

objWord.Quit
msgbox "Number of files processed was: " & fileCounter
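
As mentioned earlier, both scripts could be shorter. One possible way, sketched here purely as an illustration, is a single helper that loops over the cells of the table instead of using a separate variable for each one. The name SetSourceHidden is invented, and the sketch assumes the same layout as above: two rows and three columns in the first table, with the target cell in row 2, column 2.

```vbscript
Option Explicit

' Hypothetical helper, not part of the original scripts.
' Sets the Hidden font attribute for every cell of the first table
' except the target cell (row 2, column 2), which has to stay
' visible so that it is imported for translation.
Sub SetSourceHidden(doc, hide)
   Dim r, c
   For r = 1 To 2
      For c = 1 To 3
         If Not (r = 2 And c = 2) Then
            doc.Tables(1).Cell(r, c).Range.Font.Hidden = hide
         End If
      Next
   Next
End Sub

' Usage inside the file loop, where oFile is an open Word document:
'   SetSourceHidden oFile, True    ' before translation
'   SetSourceHidden oFile, False   ' after translation
```

With a helper like this, the hiding and unhiding scripts would differ only in the True or False passed in; the copying of the English word into the target cell would still be needed in the first script.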
