Jan 10, 2014

memoQ AutoCorrect update & MS Word export macro

Last summer I wrote about autocorrection of text in memoQ and offered an indexed embedding of a video I created to give an overview of the AutoCorrect functions in memoQ 2013. There have been a few enhancements since then in memoQ 2013 R2: where only "smart quote" toggling was possible before, there are now various options for correcting accidental miscapitalization.

I've also been looking to optimize the procedure for migrating the Microsoft Word autocorrection lists to memoQ. There are a number of problems with using the table-generating macro that Kilgray suggests in the knowledgebase article on using MS Word 2003 autocorrect data. When I created a 17,000-entry list from a large AutoCorrect file for one language, it was nearly impossible to do anything with the resulting table because of memory problems. The following macro, which can be put into the Normal template in MS Word, writes plain tab-delimited lines instead of a table and should be a little easier to work with:
Sub BuildAutoCorrectList()
  Dim ACE As AutoCorrectEntry
  ' Create new document.
  Documents.Add
  ' Iterate through AutoCorrect entries.
  For Each ACE In Application.AutoCorrect.Entries
    ' Insert each entry name and its value on a new line.
    Selection.TypeText ACE.Name & vbTab & ACE.Value & vbCr
  Next
End Sub
Open the macros dialog in MS Word with Alt+F8. Select the Normal.dot or Normal.dotm file (depending on your version of MS Office) from the dropdown list, enter the name of the new macro and click the Create button. Then paste in the code above. When the macro is run, it will create a new document with the autocorrection list as tab-delimited text. To bring the list into memoQ, you'll have to:
  1. Paste in the XML header needed by the "light resource" for AutoCorrect lists in memoQ. You can see what this looks like for the language setting you want by creating a dummy resource, exporting it and opening the file with a text editor. European Spanish might look like this, for example:
    <MemoQResource ResourceType="AutoCorrect" Version="1.0">
      <Resource>
        <Guid>6d61e3bc-da00-4cb8-a4f3-93c980543bba</Guid>
        <FileName>spa-ES#EU Spanish AutoCorrect.mqres</FileName>
        <Name>European Spanish</Name>
        <Description />
        <Language>spa-ES</Language>
      </Resource>
    </MemoQResource>
     
  2. Save the file as plain text with UTF-8 encoding.
  3. Change the file extension to ".mqres".
  4. Import the resource to memoQ.
AutoCorrect lists which are language-neutral (for example, lists of company names) use "all#" in the file name and "Neutral" between the <Language> tags.
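
If you'd rather not do the pasting and saving by hand, the steps above can be sketched in a few lines of Python. This is only a rough sketch under some assumptions: the header layout is copied from the Spanish example above (take the real one from a dummy export of your own memoQ version), the names build_mqres and HEADER_TEMPLATE are made up for illustration, and generating a fresh GUID with uuid4 is my guess at what is acceptable, not something memoQ documents.

```python
import uuid
from xml.sax.saxutils import escape

# Header layout copied from the exported example above (assumption: your
# memoQ version uses the same layout; check against a dummy export).
HEADER_TEMPLATE = """<MemoQResource ResourceType="AutoCorrect" Version="1.0">
  <Resource>
    <Guid>{guid}</Guid>
    <FileName>{file_name}</FileName>
    <Name>{name}</Name>
    <Description />
    <Language>{lang}</Language>
  </Resource>
</MemoQResource>
"""


def build_mqres(entries_text, name, lang, file_name):
    """Return the XML header followed by the tab-delimited entries."""
    header = HEADER_TEMPLATE.format(
        guid=uuid.uuid4(),            # hypothetical: a freshly generated GUID
        file_name=escape(file_name),
        name=escape(name),
        lang=escape(lang),
    )
    rows = []
    for line in entries_text.splitlines():
        # Keep only lines shaped like "wrong<TAB>right"; drop blanks.
        wrong, sep, _right = line.partition("\t")
        if sep and wrong.strip():
            rows.append(line)
    return header + "\n".join(rows) + "\n"


# Example use (file names are placeholders):
# with open("word_export.txt", encoding="utf-8") as src:
#     content = build_mqres(src.read(), "European Spanish", "spa-ES",
#                           "spa-ES#EU Spanish AutoCorrect.mqres")
# with open("spa-ES#EU Spanish AutoCorrect.mqres", "w", encoding="utf-8") as out:
#     out.write(content)
```

The output is written as UTF-8, as in step 2. For a language-neutral list you would pass "Neutral" as the language and a file name starting with "all#".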

Other sources for autocorrection data
With a bit of searching, one can find other sources of data to add to AutoCorrect resources for various languages. Wikipedia, for example, offers lists of commonly misspelled words, such as this one in English, which includes links to Dutch, Hungarian, Portuguese, Spanish and Turkish lists. The structure of the data lends itself easily to reformatting with the search and replace features of a text editor:
alamanya->almanya
aferim->aferin
agrasif->agresif
ağostos->ağustos
ahret->ahiret
ayle->aile
alarım->alarm
atmış->altmış
Copy the data from the Wikipedia page to a text file. Then use search and replace to substitute tabs for the "->" structures, add an appropriate XML header for the memoQ resource and save the file as UTF-8 with an MQRES extension and you have an AutoCorrect list ready for import to memoQ. An example of the Turkish list converted and ready for use in memoQ is available for download here.
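
The search and replace step can also be scripted. Here is a minimal Python sketch, assuming every useful line has the shape "wrong->right" as in the Turkish sample above; the function name arrows_to_tabs is just an illustration, and lines without an arrow are silently skipped. Lists that offer several corrections per error would need extra handling.

```python
def arrows_to_tabs(text, arrow="->"):
    """Convert 'wrong->right' lines into 'wrong<TAB>right' lines."""
    rows = []
    for line in text.splitlines():
        wrong, sep, right = line.partition(arrow)
        wrong, right = wrong.strip(), right.strip()
        # Skip headings, blank lines and anything else without an arrow.
        if sep and wrong and right:
            rows.append(f"{wrong}\t{right}")
    return "\n".join(rows)


sample = """alamanya->almanya
aferim->aferin
ağostos->ağustos"""

print(arrows_to_tabs(sample))
```

The result still needs the XML header for the memoQ resource and has to be saved as UTF-8 so that characters like ğ and ş survive.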

For German, there is a list of common spelling errors on Wikipedia which can be adapted with very little effort to make this resource.

The English list on the Oxford Dictionaries page can also be adapted without much ado. And there are many others to be found on the Internet.

Merging memoQ AutoCorrect resources
Entries from multiple AutoCorrect lists can be combined in a single tab-delimited file, and duplicates can be removed using Microsoft Excel, for example.

The screenshot above shows a merged German AutoCorrect list opened in Excel. When using the Remove Duplicates function on the Data ribbon, be sure that only Column A is selected in the dialog:


The reason Column B must not be selected is that it contains the desired text after correction, and there may be more than one error entry for a particular word.
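
For those who prefer a script to Excel, the same idea (compare only the first column, keep the first occurrence) might look like this in Python. It is a sketch under assumptions: the inputs are the tab-delimited entry lines without the XML header, the name merge_autocorrect is made up, and the comparison is an exact, case-sensitive match on the error text.

```python
def merge_autocorrect(*lists):
    """Merge tab-delimited lists, keeping the first entry per error text."""
    seen = set()
    merged = []
    for text in lists:
        for line in text.splitlines():
            wrong, sep, right = line.partition("\t")
            if not sep or not wrong:
                continue  # not an entry line
            if wrong in seen:
                continue  # same error already listed; keep the first one
            seen.add(wrong)
            merged.append(f"{wrong}\t{right}")
    return merged


list_a = "teh\tthe\nadress\taddress"
list_b = "teh\tthen\naddres\taddress"

for row in merge_autocorrect(list_a, list_b):
    print(row)
```

Note that "adress" and "addres" both survive even though they share the correction "address", because only the first column is checked. The second "teh" entry is dropped.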

After duplicates have been removed from the list, save the file as Unicode text, then import it to memoQ. A similar procedure with Excel may be followed to maintain other memoQ light resources; I do this rather frequently for segmentation exceptions to ensure that the lists for the different language variants I work with remain synchronized. (It would be nice, of course, if Kilgray would create a reasonable light resource manager with such capabilities. It gets tiring to do this so often with stopword lists and other resources.)

4 comments:

  1. Sadly, I get an error when trying to import my list to memoQ, despite following your blog instructions to the letter. Here is the error message: http://prntscr.com/gblmjx

    And here is what my list looks like: http://prntscr.com/gblnvw

    Any ideas?

    Replies
    1. I figured it out! The tags in the header were saying instead of . I copied the header from a new/dummy autocorrect file, so it ought to have worked with too >.>

    2. :-) If you type tags on this stupid blog, you have to use encoding or they disappear! & l t ; or & g t ; (without spaces) for the lesser than and greater than characters enclosing the tag. It sucks.

  2. Wow, thanks a lot! Works for OmegaT too.

