
Feb 1, 2014

The fix is in for PDF charts

Over four years ago, I reviewed Iceni Infix after I began working with it. I'm not as strong a fan as some: I generally have little enthusiasm for editing PDFs directly, given the frequent problems with missing unusual fonts and having to play the guess-my-optimum-font-substitution game. But I do find it useful in many situations, and I found another one of those today.

A new client of a friend works with a horrible German program to produce reports full of charts. The main body of the text is written in Microsoft Word and is available as a reasonable DOCX file, but the charts are a problem, as they are available only in that oddball tool's own format or as PDF. Nobody wants to deal with that software, really. It is supported by no translation tools vendor I am aware of, and like another example of incompatible German software, Across, it enjoys the obscurity it deserves.

After thinking about the approach needed in this case, I realized that if the graphics could be isolated conveniently on pages, the XML export from the PDF document would contain only information from the graphics. After translation, the format could be touched up with Infix before making bitmap screenshots at an enlargement which would yield decent resolution when sized in the layout. Of course, in projects involving multiple languages, the same XML files could be reused conveniently for each target language.
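How much enlargement is "enough" for the screenshots can be estimated with a little arithmetic. The sketch below is only a rough guide and rests on assumptions of mine, not on anything Infix documents: that the PDF viewer shows one inch of page as 96 screen pixels at 100% zoom (this varies by viewer and display), and that 300 dpi is the resolution wanted in print. The function name and parameters are made up for illustration.

```python
def zoom_percent(target_dpi, screen_ppi=96.0, print_scale=1.0):
    """Estimate the viewer zoom (in percent) for a screenshot.

    target_dpi  -- resolution wanted in the final layout
    screen_ppi  -- ASSUMPTION: pixels per page inch at 100% zoom
    print_scale -- size in the layout relative to the original chart
                   (0.5 means the chart is placed at half its PDF size)
    """
    # At zoom z, one page inch covers z * screen_ppi pixels; placed at
    # print_scale inches, that gives z * screen_ppi / print_scale dpi.
    return 100.0 * target_dpi * print_scale / screen_ppi


print(zoom_percent(300))                   # chart placed at original size
print(zoom_percent(300, print_scale=0.5))  # chart placed at half size
```

Under those assumptions, a chart placed at its original size needs a bit more than 300% zoom, and one placed at half size needs roughly half of that.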

Selecting and deleting the text on the pages with Iceni Infix is really a no-brainer. The time charge for such work will be quite reasonable. And exporting the XML or marked-up text to translate is also quite straightforward:


The exports can be handled in nearly any CAT tool, so TMs and terminology resources can be put to full use. Or you can edit in a simple, free tool like Notepad++ or an XML-savvy editor.



The screenshot above shows the XML in memoQ. No customization of the default filter is required. Reports from other users who have worked in a similar way indicate that OmegaT and other environments generally have few, if any, problems. In one case there was trouble re-integrating the graphics in a project that also had 50 pages of text, but there may have been other issues I am not aware of in that case.
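For a quick look at what an export contains before it goes into a CAT tool, a few lines of script will do. This is a minimal sketch using Python's standard library; the sample XML and its element names (`export`, `page`, `para`, `b`) are hypothetical stand-ins of mine, and a real Infix export will use its own structure.

```python
import xml.etree.ElementTree as ET

# HYPOTHETICAL sample: the element names are invented for illustration
# and do not reflect the actual Infix export schema.
SAMPLE = """<export>
  <page n="1">
    <para>Umsatz nach Region</para>
    <para>Quartal <b>1</b></para>
  </page>
</export>"""


def list_text_nodes(xml_string):
    """Return every non-empty piece of text in document order."""
    root = ET.fromstring(xml_string)
    return [t.strip() for t in root.itertext() if t.strip()]


def count_words(texts):
    """Rough word count, splitting on whitespace."""
    return sum(len(t.split()) for t in texts)


texts = list_text_nodes(SAMPLE)
for t in texts:
    print(t)
print("words:", count_words(texts))
```

Note that the text inside the inline `b` element comes out as a separate piece. A CAT tool filter that treats such a tag as structural rather than inline would split the segment in the same way, so it is worth checking how inline formatting tags are handled on import.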


With the content in the TM, if the chart data are made available in another format, the translations can be transferred quickly to that for even better results. The same approach can be used for a very wide variety of other electronically generated graphic formats (except some of the really insane ones I've seen where the text is broken up; I don't know if Iceni sanitizes such messes or not).

I think this is an approach which can benefit many of us in a variety of projects. It is not really suited for cases of bitmap graphics, but I have other approaches there in which Iceni Infix may also play a useful role and allow CAT integration. Licenses for the tool are quite reasonably priced, and the trial version (in Pro mode) is entirely suited for testing and learning this process.

6 comments:

  1. I've been using Infix from time to time for over two years now, and it's excellent in cases where there's much more form than content, e.g. ads, folders, complicated tables, stuff like that. Unfortunately, when working with the XML export I usually have to modify the import settings, because some tags for inline formatting are treated as external by default, which produces wrong segmentation. But other than that it's great.
    Of course, if a client wants to receive a PDF file back, there's usually a problem with fonts, because folders often use proprietary fonts which I don't have and/or which lack full UTF support. But that's another story.

    1. "... more form than content" - that is more or less what another colleague remarked on his experience with it. Thanks for the correction/addendum on filter adaptation. For the graphics I was translating in this case, the exported XML required no adaptation whatsoever, but had I tried to deal with body text, I probably would have found the same thing. Do you know which tags had to be redefined as inline in your work?

      The fonts are indeed a headache. If the client is cooperative enough to provide them, however, that should not be the case.

      When our Brazilian colleague (who I think was involved in the inspiration of many of the translation features) began to push this tool many years ago, I thought the idea of translating a PDF directly and returning it in a PDF layout was rather idiotic, though I refrained from saying much about that in public. At the time I was doing mostly manuals and documents with a lot of text, and I think it is not suited for that. But for some of the flyers and other graphic-heavy documents I've seen (form over content) or cases like these charts, I have not seen a better tool.

  2. Sounds good, though I am looking for a Mac alternative for this. Recently I translated a landscape PDF after opening it in LibreOffice (or OpenOffice) Draw and saving it in Draw format (the memoQ ODF filter converts this format without any problem). It worked rather well and might be an alternative for smaller projects (there were some serious but resolvable issues with carriage returns).

    1. What's the problem? Look at the little icons at the left of the product box picture. I believe one of them indicates that there is a Mac version :-) This is by no means a Windows-specific post.

  3. Kevin, thanks for sharing this, it opens incredible new opportunities for us - LSPs who have been struggling with PDFs sent by customers which had to be OCRed (a cost no customer would ever be willing to pay for because they don't understand WHY you need to play with their files in such a way, why you just don't TRANSLATE them :-)).
    I have done a little bit of testing on some short files, forms for example, and it worked excellently. Have you got experience with processing longer and more complicated files using Infix?

    1. Vaclav, I have processed a few longer documents, but in a somewhat unusual way - deleting the text and isolating the graphics. Because of page transition issues, I am rather skeptical of the wisdom of trying to handle longer documents with the main emphasis on text flow. However, I've discovered another good use (well, several related uses) with long documents, which I'll share separately at some point. As you may have noticed, Infix is also good at extracting/inserting pages. You can also select page ranges for export as text or XML. What you may not have noticed is that the software can extract text from some documents in which text copying is otherwise blocked. Go look up the annual reports of the German Federal Patent Court (BPatG) and try to extract that text, for example. If you want to build a corpus for term extraction or general reference, Infix can be a nice tool to have in such cases.

