Apr 26, 2012

Twitterview: SDL Trados Studio, memoQ, DVX2 and PDF extraction

When I began using Twitter somewhat hesitantly three years ago, I never expected that it would eventually prove to be one of the most useful social media tools for gathering information of professional value. Much of this is serendipitous; I really never know what will come floating down the twitstream or where some of the conversations in it will go. One example is the direct chat I had with a colleague in New Zealand about the features she likes best in the two main CAT tools she uses, SDL Trados Studio and memoQ.

We both really appreciate the TM-driven segmentation in memoQ and the superior leverage this offers. But to my surprise, she expressed a preference for SDL Trados Studio, particularly for the quality of its PDF text extractions from electronically generated files. This is not a feature I make heavy use of in either tool, though I have used it more often lately in memoQ for alignments in the LiveDocs module and found it generally satisfactory. Most of my work involving PDF files is with scanned documents - there one has no choice but to use a good OCR tool like OmniPage or ABBYY FineReader.

So I was quite intrigued by her claim that the quality of PDF extraction in the CAT tool was "better" than that from standalone tools, especially because my experience is quite different. Further discussion (not shown in the graphic) revealed that what she actually meant was that the quality of the text extraction with the CAT tool usually beat the quality of text received from translation agencies who performed conversions. That is easy to explain, really. In my experience, most agencies are clueless about how to use conversion tools; too often they use automated settings and save the results "with layout". Such output is very often utterly unsuited for work with translation environment tools or requires a lot of cleanup and code zapping.

For years I have recommended to agencies and colleagues that they spare themselves a lot of headaches by saving PDF conversions as plain text and adding any desired formatting later. Most people ignore that advice and suffer accordingly. So in a way, a CAT tool that extracts the text without the layout baggage encourages "best practice" for PDF translation, at least for those files it is actually able to handle.

Encouraged by the Twitter exchange, I decided to do a few tests with files from recent projects. I took a PDF I had with various IFRS-related texts from EU publications. It appeared to extract quickly and cleanly in memoQ, giving me a translation grid full of nicely segmented text. SDL Trados Studio 2009 choked badly on it and extracted nothing. Her extraction in SDL Trados Studio 2011 caused a timeout with the project, I was told, but the text itself was completely extracted and converted to DOCX format. This is useful because, unlike the extraction to plain text in memoQ, it offers the possibility to add or change some text formatting in the translation grid. Other extraction examples from SDL Trados Studio 2011 showed that text formatting was preserved.

A closer examination of the extracted texts revealed some problems with both the memoQ and Trados Studio extractions. The memoQ 5 PDF text extraction engine proved incapable of handling text in multiple columns properly. The paragraph order was all fouled up. The extraction with SDL Trados Studio had a great number of superfluous spaces. Whether it is possible to optimize this in the settings somehow I do not know. The results of all the extraction tests are downloadable here in a 6 MB ZIP file. I've included the SDL Trados Studio extraction saved to plain text as well for a better comparison of the text order and surplus spaces problems.
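For anyone who wants to compare the plain text versions without the surplus spaces getting in the way, a little post-processing helps. What follows is only a minimal sketch in Python, not a setting in SDL Trados Studio or memoQ; the function name `collapse_spaces` is my own invention, and it assumes the extraction has already been saved as plain text.

```python
import re


def collapse_spaces(text: str) -> str:
    """Reduce runs of spaces, tabs and non-breaking spaces to one space per run.

    Hypothetical helper for cleaning plain text saved from a PDF extraction.
    Line breaks are kept, so the paragraph order can still be compared.
    """
    cleaned_lines = []
    for line in text.splitlines():
        # Collapse each run of horizontal whitespace, then trim the line ends.
        line = re.sub(r"[ \t\u00a0]+", " ", line).strip()
        cleaned_lines.append(line)
    return "\n".join(cleaned_lines)
```

This only touches spacing within a line. It does nothing about paragraphs extracted in the wrong order, which is the more serious problem with multi-column layouts.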

Overall, I am personally not very pleased with the results of the text extractions from PDF in either tool. The results from SDL Trados Studio are clearly better, and other examples that were shared made it clear that this tool works better than many an untrained PM with better PDF conversion software. This is certainly much better than solutions I see many translators using. But really, nothing beats good OCR software, an understanding of how to use it well and a proper workflow to get a good TM and target file better fit for most purposes.

*****

Update 2012-05-22: I met colleague Victor Dewsbery at a recent gathering in Berlin, and he told me about his tests with the recently introduced PDF import feature of Atril's Déjà Vu X2 translation environment. He kindly offered to share his results (available for download here) and wrote:

Here is the result of the PDF>DVX2>RTF>ZIP process for your monster EU PDF file. Comments on the process and the result:
  • The steps involved were: 1. import the file into DVX2 as a PDF file; 2. mark all segments and copy source to target; 3. export the file as if it were a translated file (it comes out as an RTF file). The RTF file is 20 MB in size and zips to 3 MB.
  • Steps 1 and 3 took a long time, and DVX2 claimed to be not responding. For step 1 I just left it and it eventually came up with the goods. Step 3 exported the RTF file perfectly, even though DVX2 claimed that the export had not finished. I was able to open the RTF file (it was locked, but I simply renamed it), and this is the version which I enclose. Half an hour later DVX2 had still not ended the export process (and had to be closed via the Task Manager), although the exported file was in fact perfectly OK. The procedure worked more smoothly with a couple of smaller PDF files. Atril is working on streamlining the process and ironing out the glitches in the process, especially the “not responding” messages.
  • The result actually looks very good to me. There are hardly any codes in the DVX2 project file (the import routine also integrates CodeZapper). I didn’t spot any mistakes in the sequence of the text. Indented sections with numbering seem to be formatted properly - i.e. with tabs and without any multiple spaces.
  • The top and bottom page boundaries in the exported file are too wide, so most pages run over and the document has over 900 pages instead of just under 500. Marking the whole document and dragging the header/footer spaces in Word seems to fix this fairly quickly.
  • I note that some headlines are made up of individual letters with spaces between them. This may be related to the German habit of using letter spacing (“Sperrschrift”) for emphasis as an alternative to bold type.
  • I found one instance where text was chopped up into a table on page 857 of the file.
  • There are occasional arbitrary jumps in type size and right/left page boundaries between sections.
On the strength of this sample, it would usually be OK to simply import the PDF file into DVX2, translate in the normal way, and then fix any formatting problems in the exported file.
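
The letter-spaced headlines Victor mentions can be repaired after export with a simple pattern match. The following is a rough sketch in Python under my own assumptions: the name `join_letter_spaced` is hypothetical, a "headline" is taken to be a run of at least four single letters separated by single spaces, and the text is assumed to be available as plain text.

```python
import re

# A run of four or more single letters, each separated by exactly one space,
# with whitespace (or the start/end of the text) on both sides of the run.
_SPACED_RUN = re.compile(r"(?<!\S)(?:[^\W\d_] ){3,}[^\W\d_](?!\S)")


def join_letter_spaced(text: str) -> str:
    """Join letter-spaced words such as 'A r t i k e l' into 'Artikel'.

    Hypothetical helper; ordinary words and numbers are left alone.
    """
    return _SPACED_RUN.sub(lambda match: match.group(0).replace(" ", ""), text)
```

A caveat: if two letter-spaced words are separated by only a single space, this sketch will run them together, so the results should be checked by eye.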

3 comments:

  1. Interesting article Kevin. I would completely agree with you about how best to handle PDFs, but also see a place for a CAT tool to have a method for handling these files as well as they can.
    Interestingly I downloaded your test files and gave them to our filter developer. The original PDF seems to contain very wide spaces (probably some special formatting settings). As the PDF converters are designed to replicate the design, multiple spaces are inserted to cope with this.
    Not all files are like these so it's probably not a consistent problem for Studio... but using a dedicated OCR tool like ABBYY FineReader would clearly do a better job with any PDF.

  2. Wanting too much, one ends up having almost nothing
    I confess, I do not believe CAT tools should deepen their capability in this respect. And yes, translators' requirements are absolutely unclear to anybody else... A nice layout is their goal, and if it is achieved with text boxes, they do not care.
    I prefer using specialised tools: PDF Transformer by Nuance, FineReader by ABBYY, and - believe me - PlusTools from the Wordfast tools...
    The latter gives the best results for translation (not layout preserving) purposes

    Stefan Pecen, simulta, Bratislava, Slovakia

  3. Recently, I OCRed a PDF file with a very complex layout using ABBYY FineReader 11. I had to edit the resulting Word file very heavily to get something useful. Then I tried to import the resulting Word file into a memoQ project, but memoQ would crash each time.

    I thought that this was an opportunity to test DVX2's PDF import. I imported the PDF file into a DVX2 project, copied source to target and exported the result. I got an almost perfect Word file without any manual work. There were far fewer tags than I expected, very few in fact.
    I was amazed at the quality of the results.

