20 January 2008

#54. Pretty Darn Fine PDF Conversion

While stirring the linguine and listening to old Porter and Dolly duets, a few of us got to talking about the problems of editing PDF files. Portable Document Format is a nearly universal standard for sharing and securing documents while preserving all their formatting. Adobe Systems devised the PDF format years ago, and publishes the gold-standard software, Acrobat, for creating PDFs from other programs.

How To Edit a PDF

So what if you need to modify a document, but you only have it in PDF format? If you have Adobe Acrobat, you can edit the PDF file directly, or save it to a Microsoft Word or Rich Text Format (RTF) file, which most word processors can read. But Acrobat costs hundreds of dollars, and is overkill if you just need to edit an occasional PDF.

The Internet offers a number of free utilities that convert PDF files to plain text (for example, Text Mining Tool, which extracts only the text from PDFs as well as HTML and CHM on-line help files). But what if you don’t want to lose all the document’s heading structure and formatting?

Like a naturalist chef who scours the meadows and forests for just the right herbs, I went in search of the perfect utility for transforming PDFs to Microsoft Word or RTF format while preserving the format and layout. I found some shareware programs that claim this ability, ranging in price from the $20s to the hundreds… but if I have to pay for it, it isn’t perfect, right? The search continued.

I recently heard about such a utility from my colleague Samer, the FreewareGenius, so I tried it out. Free PDF To Word Converter presents a simple, clear interface.

My test PDF documents were one page of heavily formatted text (60 KB), a seven-page pamphlet (416 KB), and a couple of large user guides with graphics (5 and 6 MB). None converted perfectly. In all my tests, all the text adopted Normal style, as expected, including its Times New Roman font, and spaces were substituted for tabs.

Among the other major conversion effects of Free PDF To Word Converter, each line in the resulting Word document was in a separate text box (required for proper positioning). Boldfacing and indents generally were preserved. Some bulleted lists lost their bullets entirely, while other lists saw their bullets represented as graphic symbols, not Word lists. And despite the author’s claim to preserve graphics, the user guides’ graphics were lost entirely. Worse yet, one user guide (the more complex one) came out so tiny that it was unreadable even at Word’s maximum 500% magnification, though the other one was properly sized.

To edit a file resulting from such a conversion, you have to extract the text from the text boxes, manually apply styles, reapply formatting and layout as needed throughout the document, and replace the graphics – for starters. This is a lot of work. However, when faced with a large document, you might still prefer that to reformatting it all from scratch. And you can tell Free PDF To Word Converter not to use text boxes, though page layouts can suffer as a result.

After several conversions, Free PDF To Word Converter demands that you get a free registration code from its Web site to continue. You have to solve a fairly simple arithmetic problem (which I am ashamed to admit I failed several times, despite using a calculator) to get the code, or you can bypass the math by buying a lifetime registration code for $15.

Next I tried the pdf995 conversion suite, which is among the better-known PDF converters that attempt to compete with Adobe Acrobat. This suite comprises four main programs, downloaded separately, so installation is a bit complicated. (You can use pdf995 for free, buy individual modules for $9.95 each, or buy the whole suite for $19.95.) Finding the PDF-to-Word or HTML conversion dialog box is not very intuitive.

Pdf995 first converts the PDF to HTML, then saves it as a Word DOC file. The resulting file lost most of the text formatting, but boldfacing was preserved. All the bullets, indentation, and graphics were gone. A soft return was inserted at the end of every line, but the lines were not encased in text boxes – making it easier to reformat the text to look like the original, using Find and Replace a lot. Some hyperlinks were preserved, and a gray page background was added. Unfortunately, pdf995 was unable to convert the large user guides at all.

The Internet offers another way to convert PDFs to Word or other formats: Upload your PDF file to the conversion Web site Zamzar, and provide your email address. In a few seconds to a few minutes, you’ll receive an emailed link to where your converted document awaits downloading. The free service limits your source files to 100 MB each, though that should be more than adequate for most needs; you can pay for faster service and a 1 GB file limit.

In converting my smaller PDFs, Zamzar enclosed large blocks of text in text boxes (not line by line). It converted list bullets to symbols, but preserved positioning, relative type sizes, boldfacing, and hyperlinks. It inserted nonbreaking spaces between words in one document but not another.
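Stray nonbreaking spaces like these are easy to clean up once the text is out of the converted file. Here is a minimal Python sketch, not part of any converter reviewed here. The function name normalize_spaces is hypothetical, and the sketch assumes you have already copied the converted text into a plain string.

```python
NBSP = "\u00a0"  # Unicode NO-BREAK SPACE


def normalize_spaces(text: str) -> str:
    """Replace every nonbreaking space with an ordinary space."""
    return text.replace(NBSP, " ")


if __name__ == "__main__":
    # Hypothetical sample standing in for text pasted from a converted file.
    sample = "Pretty\u00a0Darn\u00a0Fine"
    print(normalize_spaces(sample))
```

If you would rather stay inside Word, its Find and Replace dialog accepts ^s as the code for a nonbreaking space, so replacing ^s with a single space should have the same effect.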

Uploading a large file to Zamzar can take quite a long time. Unfortunately, Zamzar gagged on uploading my two large user guides and failed to complete the upload, though each is barely more than 1/20 of Zamzar’s declared size limit.

As an alternative, you can let Adobe do the work for you. Mail a PDF attachment to pdf2html@adobe.com, and Adobe sends you back an HTML file that you can open and edit in Word. Adobe lost the graphics and bullets in my smaller files and messed up the page layouts and page breaks. But the text generally came out reasonably well, preserving boldface and hyperlinks, and without text boxes or returns at the ends of lines. Adobe doesn’t specify a maximum file size, but it rejected my user guides of over 5 MB each.

If you have a Gmail account, you can get similar results by emailing a PDF file to yourself. When you receive the email, click View As HTML next to the attachment. Save the HTML page on your disk, then open it in Word. Unlike Adobe’s conversion, however, you will find a hard return at the end of every line.
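A return at the end of every line is tedious to remove by hand. The Python sketch below shows one possible way to rejoin such lines into paragraphs. It rests on an assumption that may not hold for your file: that a blank line marks a genuine paragraph break and every other line break is unwanted. The function name unwrap_lines is hypothetical.

```python
import re


def unwrap_lines(text: str) -> str:
    """Join hard-wrapped lines into paragraphs.

    Assumption: a blank line marks a real paragraph break; every other
    line break is treated as an unwanted return.
    """
    # Normalize Windows and old Mac line endings first.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Split into paragraphs on one or more blank (or whitespace-only) lines.
    paragraphs = re.split(r"\n\s*\n", text.strip())
    joined = [
        " ".join(line.strip() for line in para.split("\n") if line.strip())
        for para in paragraphs
    ]
    return "\n\n".join(joined)


if __name__ == "__main__":
    wrapped = "This line was\nbroken in two.\n\nSo was\nthis one.\n"
    print(unwrap_lines(wrapped))
```

If the converted text has no blank lines between paragraphs, this sketch would merge everything into one paragraph, so check the result before relying on it.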

In conclusion, I recommend Free PDF To Word Converter, especially for large files, though you’ll have to choose whether to use text boxes. Zamzar and Adobe are good alternatives for smaller files, though you should think twice about uploading confidential information. Remember, no matter how you convert your PDFs into Word, you still will have a lot of formatting clean-up work to do afterward.

In a future column, I’ll review some free programs for creating PDFs from Word and other documents without Acrobat.

Scalable Vector Graphic Collections

by Mark Lautman

In my column in post #52, I introduced scalable vector graphics (SVGs) and how to view them in web browsers. This week I'll provide a list of some places where you can get SVGs, as well as how to view them in common office applications.

There are two predominant open formats for SVG. The first is defined by the OpenDocument Format, the specification for OpenOffice files, which describes, among other things, the SVG format used by OpenOffice Draw. There are two significant collections of SVG graphics in this format:

  • The highest quality, easiest to use, and most finely honed collection is my very own at Custom OO Shapes. I started creating this collection almost two years ago, and maintain that it is the first such collection on the Internet.
  • Putting ego aside, a more comprehensive collection is available from OxygenOffice Professional. The developers of this project offer hundreds of shapes, and encourage designers to contribute their own drawings.

The other SVG format is specified by the World Wide Web Consortium (W3C), and this is what Web browsers display. There are many more graphics available in this format compared to OpenDocument.
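To give a feel for what a W3C SVG file looks like, here is a minimal Python sketch that builds one and checks that it is well-formed XML. The function name make_circle_svg and the circle itself are illustrative assumptions of mine; only the namespace URI comes from the W3C specification.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"  # namespace defined by the W3C SVG spec


def make_circle_svg(size: int = 100) -> str:
    """Return a minimal W3C SVG document containing one filled circle."""
    half = size // 2
    return (
        f'<svg xmlns="{SVG_NS}" width="{size}" height="{size}">\n'
        f'  <circle cx="{half}" cy="{half}" r="{half - 10}" fill="teal"/>\n'
        "</svg>\n"
    )


if __name__ == "__main__":
    svg_text = make_circle_svg()
    ET.fromstring(svg_text)  # raises ParseError if the XML is malformed
    print(svg_text, end="")
```

Save the printed output in a file with an .svg extension and open it in an SVG-capable Web browser; it should display a single circle.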

You can open these last two collections only in Web browsers, not directly in OpenOffice. Fortunately, you can convert them into something more useful. The SVG Import Filter for OpenOffice comes in two flavors. The first is an extension that you add onto OpenOffice, enabling you to open W3C files directly in OO. I could get this extension to work in Windows, but not in Linux. The other flavor is a command-line utility that converts SVG files to OO Draw format.

(In preparing this column, I discovered that the Gnome desktop also displays thumbnails of SVG files, as the illustration above indicates.)

You can also go the reverse direction. OpenOffice has an export feature that saves a diagram as SVG, which you can view in a Web browser.

In my next column on SVGs, I'll discuss a few tools for creating SVG files. Mark Lautman

Alert for Norton Internet Security Users

If you use the anti-spam module of Norton Internet Security 2007 or earlier, and have painstakingly compiled a black list (spammer addresses to be blocked) and white list (non-spammer addresses to be allowed), do not upgrade to NIS 2008 yet! It recently has been revealed that the upgrade package deletes your previous black and white lists. Symantec says they're working on a new 2008 upgrade that can import previous lists, so wait for that one – and back up your existing lists anyway.

