Jul 17, 2013

How would you translate the chart in this DOCX file?

Can anyone tell me quickly the best way to translate the chart in this DOCX file? Or how to get an accurate word count of the words to be translated in the file?

*****

I love to see the different approaches people take to this problem. It's one that translators encounter fairly often, and in the past I took many different approaches to it - long ago I usually did something involving PDF conversion, editing the PDF and making a screenshot. But that is inefficient and doesn't allow the use of CAT tools.

Yesterday I picked up a project with 18 of those silly charts embedded in it. A real nuisance. Here's what happens if you try to edit one of those charts in situ:


Hopeless, right? A lot of very authoritative web pages make it clear that without having the linked Excel files, you cannot modify the text. Not true, actually. With or without hints, a number of technically versatile colleagues found ways to solve the problem or at least made close guesses. Some of these are here in the comments. One very interesting exchange on Twitter showed that the settings of the OmegaT import filters can somehow be tweaked to solve this:




The thing about OmegaT is that it's sort of geeky - the solution looks pretty good here, but I can't actually make it work myself.

The solution I worked out last night is very similar to the one described by Stanislas in the comments.
  1. Change the file extension to ZIP
  2. Look inside the ZIP file with Windows Explorer or another suitable tool as described in other blog posts.
  3. Inside the "word" subfolder there is a folder named "charts". It contains XML data with all the chart headings, numbers and labels. Copy it.
  4. Paste a copy of the folder where you want your source files. Import the chart XML files into any CAT tool or XML editor. It's a good idea to configure a filter to exclude and protect the references to the original Excel files with the data. (Though I am curious whether deliberately spoiling these data can protect against the unwanted update that one person worried about in the comments. I'll have to try that.)
  5. When the translations are completed, paste the XML files back inside the charts folder in the file structure.
  6. Rename the extension back to what it was at the start (DOCX in this case). You're done. No refresh necessary (unlike with embedded Excel or PowerPoint objects).
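The six steps above can also be scripted. What follows is only a rough sketch in Python using the standard `zipfile` module, not a tested production tool. The function names `extract_charts` and `replace_charts` are made up for illustration, and the sketch assumes the chart parts sit directly under `word/charts/` as described in step 3. It writes a new DOCX instead of overwriting the original.

```python
import zipfile
from pathlib import Path

# Assumption: chart parts live directly under this folder inside the DOCX (see step 3).
CHARTS_PREFIX = "word/charts/"


def extract_charts(docx_path, out_dir):
    """Copy each chart XML file out of the DOCX into out_dir; return the file names."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    names = []
    with zipfile.ZipFile(docx_path) as docx:
        for name in docx.namelist():
            if not name.startswith(CHARTS_PREFIX):
                continue
            rest = name[len(CHARTS_PREFIX):]
            # Skip subfolders such as _rels; keep only chart1.xml, chart2.xml, ...
            if rest.endswith(".xml") and "/" not in rest:
                (out_dir / rest).write_bytes(docx.read(name))
                names.append(rest)
    return sorted(names)


def replace_charts(docx_path, translated_dir, out_path):
    """Write a copy of the DOCX in which chart XML files are taken from translated_dir."""
    translated_dir = Path(translated_dir)
    with zipfile.ZipFile(docx_path) as src, \
            zipfile.ZipFile(out_path, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            data = src.read(item.filename)
            if item.filename.startswith(CHARTS_PREFIX):
                candidate = translated_dir / item.filename[len(CHARTS_PREFIX):]
                if candidate.is_file():
                    # A translated version exists, so use it instead of the original part.
                    data = candidate.read_bytes()
            dst.writestr(item, data)
```

A typical run would call `extract_charts("source.docx", "charts")`, translate the XML files in the `charts` folder with a CAT tool, then call `replace_charts("source.docx", "charts", "target.docx")`. The file and folder names here are placeholders.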



A memoQ filter configuration for these XML files can now be found on Kilgray's Language Terminal.
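For the word count question at the top of this post, the same chart XML can be read directly. The following is a minimal sketch under several assumptions: the visible text is taken to be in DrawingML text runs (`a:t`) and in cached string values (`c:strCache`), numeric caches are skipped, and a "word" is simply a whitespace-separated token. The names `chart_strings` and `count_chart_words` are hypothetical. Labels cached more than once in the XML are counted each time, so the result may run higher than a count of what is visible in the chart, and counting tools differ in how they treat numbers and punctuation.

```python
import re
import zipfile
import xml.etree.ElementTree as ET

# Standard DrawingML namespaces used in chart parts.
A_NS = "http://schemas.openxmlformats.org/drawingml/2006/main"
C_NS = "http://schemas.openxmlformats.org/drawingml/2006/chart"


def chart_strings(xml_bytes):
    """Return the text strings found in one chart XML part."""
    root = ET.fromstring(xml_bytes)
    found = []
    # Rich text runs, e.g. chart and axis titles.
    for run in root.iter(f"{{{A_NS}}}t"):
        if run.text and run.text.strip():
            found.append(run.text.strip())
    # Cached strings, e.g. series names and category labels.
    # Numeric caches (c:numCache) are deliberately not visited.
    for cache in root.iter(f"{{{C_NS}}}strCache"):
        for value in cache.iter(f"{{{C_NS}}}v"):
            if value.text and value.text.strip():
                found.append(value.text.strip())
    return found


def count_chart_words(docx_path):
    """Count whitespace-separated words in all chart parts of a DOCX."""
    total = 0
    with zipfile.ZipFile(docx_path) as docx:
        for name in docx.namelist():
            if re.fullmatch(r"word/charts/[^/]+\.xml", name):
                for text in chart_strings(docx.read(name)):
                    total += len(text.split())
    return total
```

Adding this figure to the count reported by Word for the body text gives a rough total for quoting purposes.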

9 comments:

  1. The text in the figure is on another computer (in a .xlsx file). Would it be possible that your CAT tool would import this text too if it could find it? If you had this file you could change the source via File | Info | Edit Links to Files | Change Source. You might also be able to break the link, and then manually create the missing .xlsx file and re-link it? Just guessing really.

  2. PS: I count 65 words using ABBYY Screenshot Reader.

    Michael

  3. ABBYY Screenshot Reader? Good idea. That number sounds about right. The much-favored PractiCount that everyone depends on so much told me 17 words. Even Microsoft can do better than that.

  4. Sometimes LibreOffice gets it (after breaking the chart or Excel file up), but not here.

    You may print it as PDF and open it with a suitable PDF reader (PDF X-Change Viewer did and does a brilliant job here, while others did not get it), copy all text and paste it into a Word file to count it. This sounds complicated, but may be the faster solution for huge files. This way works only for word counting, as the layout gets screwed up and words get shortened (!) and separated by tabs. I count 74 words, including numbers.

  5. Stanislas Biron, July 17, 2013 2:31 PM

    Hi! This is how I managed to edit the chart text. However I don't know how the file will react when the Excel file becomes accessible again (It might synchronize and the text elements of the chart might revert to their original values).

    1. I renamed the .docx file to .zip
    2. I opened the zip file with WinRar and navigated to the folder \word\charts\
    3. There is a file there named "chart1.xml"
    4. Opened the file with a text (or XML) processor. The text elements are present within this file (I used Find/Replace to replace the text safely, because if a tag is altered by error, the file will be considered corrupt by Word)
    5. After all the text elements have been replaced, I closed WinRar, and it prompted me to update the file in the archive. I selected "Yes".
    6. Finally, I renamed the file to .docx and it opened correctly with my updated text.

    It works on my computer, I don't know if this will work for you but that may be worth the try!

    Stan

    Replies
    1. Very close to my procedure, yes. Time to update the blog post now....

  6. So after trying to count the words in this document, and failing (because I stupidly forgot to use the rename to .zip trick), I all of a sudden remembered that I actually bought the Enterprise edition of AnyCount recently, for around €95! To my not so surprise it also couldn’t get at the words in those pesky charts. I therefore decided to ask AnyCount’s support department what they thought, which went a little something like this:


    Michael Beijer (Client) Posted On: 17 July 2013 01:34 PM
    ________________________________________

    Subject: Could you tell me how to get an accurate word count from the attached file?
    I was hoping that AnyCount would catch the words in the figure as well. I managed a correct count using ABBYY Screenshot Reader, and was wondering why AnyCount didn't use OCR on the image portion... Is there a setting to make it do this?

    Michael


    Alexander Artamoshkin
    Jul 17 (1 day ago)


    to me


    Dear Michael,

    Thank you for you question.

    Unfortunately, there is no possibility to count text in the chart.

    The only way to do that is to make a printscreen of this chart and to remove all the diagrams and lines leaving only the text.

    Thereafter you will be able to count the text from the printscreen you have got.

    We are sorry for inconvenience.

    Please feel free to contact us if you have any questions.

    Best regards,
    Alexander.

    ----------------------------------------------
    Alexander Artamoshkin,
    AIT Software Development Team



    Michael Beijer
    12:37 PM (5 hours ago)


    to support


    Hi Alexander,

    Two things.

    1. I was under the impression that AnyCount could use OCR to count words in images. Why can't it figure out that there is text there that it can't get at and apply OCR?

    2. The text in question is actually present inside the file. Please have a look at Kevin Lossner's blog post on this exact problem (http://www.translationtribulations.com/2013/07/how-would-you-translate-chart-in-this.html) and how to get at it. Couldn't this be programmed into AnyCount? It's just a matter of renaming the .docx as a zip file and locating the text displayed in the charts.

    Michael


    Michael Beijer
    Translator & Terminologist
    (Dutch/Flemish into English)
    Skype/Twitter: michaelbeijer
    iMessage: michael@wordbook.nl

    Alexander Artamoshkin
    1:49 PM (4 hours ago)


    to me


    Hello Michael,

    Thank you for your questions.

    >Why can't it figure out that there is text there that it can't get
    at and apply OCR?

    AnyCount may recognize the pictures like a text. Because of this it may give you false results.

    >Please have a look at Kevin Lossner's blog post on this exact problem (http://www.translationtribulations.com/2013/07/how-would-you-translate-chart-in-this.html)
    and how to get at it

    I would like to mention that AnyCount uses its own OCR with its own abilities.

    Please feel free to contact us if you have any additional questions.

    Best regards,
    Alexander.

    ----------------------------------------------
    Alexander Artamoshkin,
    AIT Software Development Team

  7. So today I decided to count the file in various programs, and finally to do it myself. After all, how hard could it be, right? Here are the results:

    ------------------------------*
    PractiCount (FAILED):

    words: 31
    characters with spaces: 202
    characters without spaces: 173
    lines: 3
    pages: 1
    ------------------------------*
    AnyCount (FAILED):

    Text: 32 words
    Text Boxes: 0
    Shapes: 0
    Running Headers: 0
    Running Footers: 0
    Footnotes: 0
    End Notes: 0
    Embedded Object: 0
    Linked Object: 0
    Comments: 0
    Hidden Text: 0
    File Total: 0
    ------------------------------*
    MS Word (FAILED):

    words: 32
    ------------------------------*
    LibreOffice Writer (mangles document; doesn't display chart at all):

    31 words
    ------------------------------*
    Michael Beijer (SUCCEEDED):

    1. normal text (accessible in .docx file):

    A survey was conducted to determine feelings regarding the best communication strategy.
    Figure 5: What topics were particularly suited to communicate the content of the fire safety plan at public events?

    2. words in chart (just rename .docx as .zip, navigate to: \zip\Figure_5_chart_eng-US (1)\word\charts and count):

    Importance of representing specific action proposals (n=64)
    Importance of representing the draft of the fire safety plan (n=55)
    Extreme
    Strong
    Moderate
    Somewhat
    None

    = 56 words total

    That is, 56 - 31 = 25 missing words! (or put another way: you just lost 44.6% of your rate for this job)

  8. It seems that incessant complaining might actually work. After some back and forth between me and the AnyCount (AIT) support guy (in which I politely reminded him that AnyCount is not cheap (at €85 for the Enterprise edition) and really should be able to do this if it is going to offer us translators any added value), I received the email below.

    Incidentally, this just goes to show that we should all be more vocal when it comes to features we believe our tools should offer. Because if they don’t give us what we want, we can just choose another solution. There are enough good ones around. For example, Kilgray refused to add a way to remove duplicates from TBs in memoQ pro. Instead, they added it to qTerm. So what did I do, after asking and asking and not being listened to? Wrong. I didn't do nothing and just wait some more. I switched to CafeTran (http://cafetran.wikidot.com/). Igor (the developer of CafeTran) has added more features his (freelance) users actually want in the last week than Kilgray has in the last year! Have a look at this: http://wordbook.nl/new-features-added-to-CafeTran-in-the-last-week-alone.html

    ---------------------------------------------*
    Alexander Artamoshkin
    12:07 PM (1 hour ago)

    to me
    Dear Michael,

    Your suggestion has been reviewed by our lead software developer and entered into our corporate suggestion database. We will consider your suggestion while working on upcoming versions of AnyCount.

    Do not hesitate to contact us if you have any other questions or suggestions.

    Best regards,
    Alexander.

    ----------
    Alexander Artamoshkin,
    AIT Software Development Team

