All About PDF Files

A PDF (Portable Document Format) is a type of file that can do a number of useful things. One of those things is creating a "virtual printout" from some other program.

A Brief History

Suppose you had a piece of software that most people don't have, say, a complicated design program, and you wanted to share some output. Your only option was to print out, on paper, whatever you wanted to share. That's hard to email, yes?

PDF was developed in the early 90's by Adobe as a way to create and share printouts as a file without having to physically print to real paper. A file that contained exactly what the printout would have looked like had it been printed. That need gave birth to what we know today as a PDF file.​

In July 2008, Adobe gave up control of the PDF specification, releasing it as an ISO open standard. Now any company can develop and offer products that create and manipulate PDF files with no royalties due. PDF is one of the most successful and utilized file formats in the world.

All you need to view that file is a PDF reader which is ubiquitous today. PDF readers are built into most computers and devices.

Swiss Army Knife

PDF files have a wide range of uses. Here's just a few...

  • Capture a printout as a file (that can be easily shared)

  • As a scanner output, making electronic copies of paper documents

  • Security controls that restrict what the recipient can do with the file. e.g. Prevent editing, prevent content extraction
    (Ctrl-C), prevent printing, forced expiration, encryption requiring a password to read, and other access controls

  • Long term document archiving. PDF as an ISO open standard will certainly be supported farther into the future than some obscure file format from a little-used application. Even once-popular formats, like early Microsoft Works files, can be hard to open today. PDF helps protect against such "file-format rot".

  • Eliminate paper files. Many business, especially paper intensive ones like a CPA or medical practice, might convert many thousands or even millions of sheets of paper that can occupy whole rooms, or even warehouses, to electronic format, saving much space and money.

Common Needs

Here, I'll focus on the first two uses shown above. That's what most of my clients use PDF for. Surprisingly, I've had clients make physical paper printouts (of reports, spreadsheets, whatever) then immediately scan them so they'll have an electronic copy. What an utterly unnecessary waste of paper, ink, time, and money.

When you want to "capture a printout" in an electronic format that you can send as an email attachment or store on your computer, then you'll simply use the print command in your software as before, but instead of selecting your real printer, you'll select a PDF Printer instead. Your software (Word, QuickBooks, etc.) doesn't know the difference. Windows 10 has a PDF printer built-in, called Microsoft Print to PDF. It's right in there with your other printers on the printer selector screen.

When printing to a PDF, you'll be asked to specify a file name that will contain the electronic "printout". That resulting file is known as a PDF File and it'll contain exactly what you would have seen had you physically printed to paper.

The free PDF printer that's included with Windows 10 is very basic. All it can do is capture printout and save it to a PDF file. If you have advanced needs, such as password protection, then you'll need to purchase a full-feature PDF creation product. Adobe Acrobat and Foxit Phantom are two popular choices. In my opinion, Foxit Phantom is the better of the two, and it's a lot cheaper as well.

Scanner Output as PDF

The other common use for PDF is to contain the scanned images of documents. e.g. You put a 25-page document on your scanner's document feeder then press the button. Then you'll get a single PDF file that contains all 25 pages. Easy enough, right? Not so fast... There's a few important things to consider here. Doing them correctly will optimize your results. Yeah, I know... Nothing is as easy as it seems...

First and foremost: If you can, you are always better off creating a PDF file directly from the source program instead of making a physical printout only to scan it right back in. Your PDF files will be much smaller and higher in quality.

Next, when scanning in printed sheets, consider the contents of the document. Most documents with 8 point or larger text will scan well at 300 DPI and maybe as low as 200 DPI. But if there's any "small print", text that is 6 points or lower, you might need to scan at a higher resolution to make it easily readable in the PDF.

 

Major caveat: If you even think you might OCR the document later on (more on OCR below) then a minimum of 300 DPI is a must, and even then only for an absolutely perfect source paper document with 8 point type or larger. I generally recommend no less than 600 DPI for scanned documents that may be OCR'ed later on. OCR software requires far more detail than a human eye to perform its "image to text" conversions. But beware, the higher the scanning resolution, the larger the resulting PDF file will be. If you plan to only store the file locally, that's probably okay. Disk space is cheap nowadays.

Next, should you scan in color or B&W? If the paper documents contain color then you'll almost certainly want that color in your scanned documents, so that's easy. But what if the documents are only black text on white paper? Scanning in B&W is faster and saves space. But it's more likely to result in choppy letters that are missing their tinier elements (serifs, ascenders, descenders, ears, and other small/thin elements), that make up the letters in some font families. If there are handwritten notations made in pencil (which is gray, not black) and some of those pencil strokes are light, they may be omitted in the scanned PDF. That's because when scanning in B&W, each pixel is either on or off -- there's no in-between. The scanner driver has to decide if a "pixel" on the scanned paper is dark enough to be "on" or light enough to be "off" in the scanned file.

Scanning in grayscale fixes that. Using grayscale, the scanner driver can accurately reproduce lightly printed "pixels" from the paper document, making for a more accurate and readable PDF file. Grayscale makes larger documents than B&W, because of the added shading in grayscale, but it's still smaller than scanning in color.

B&W is fine for perfect documents when the fonts are sans-serif and use solid, beefier elements. Otherwise, you're better off with grayscale.

Understanding all this will also help you understand the next topic -- OCR. When you scan-in a sheet of paper, the scanner is basically "taking a picture" of whatever is on the paper. If you scan in a 25-page document, your resulting PDF file will contain 25 photos of that document, one photo per physical page in the PDF file. As you probably know, photos can be pretty big, running several megabytes. Twenty-five photos can be really big! That PDF file might be too big to attach to an email.

"But it's all plain and boring text, why is it scanned as photos?"

Because your scanner doesn't know or care what you are scanning. All it sees is ink on paper. It could be text, a photo, a Rorschach test, or all three -- it doesn't know. All it can do is scan as a photo.

Consider the following two boxes.

The blue box contains output "printed" directly from the source program to a PDF file. The text is 462 bytes (characters) in length (that's tiny!). The resulting PDF will be a bit larger because of some overhead but not much.

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Sit amet purus gravida quis. Magna sit amet purus gravida. Sed faucibus turpis in eu mi bibendum. Porttitor leo a diam sollicitudin tempor id eu nisl. Gravida neque convallis a cras semper auctor neque vitae tempus. Neque laoreet suspendisse interdum consectetur libero id faucibus nisl. Nisi vitae suscipit tellus mauris a diam maecenas.

The green box contains the same text but captured and stored as an image. This is what happens when you scan in a paper document. The image below is 100,810 bytes in length which is over 200(!) times larger.

So, even though the two boxes look pretty much alike, they are very different indeed. For example, you can highlight (select) bits of the text in the blue box (and perhaps copy and paste into Microsoft Word). Go ahead and try that now. Using your mouse (or finger, if reading this on a mobile device) try to highlight a sentence in the blue box and again in the green box.

The blue box worked but not the green box! That's because the blue box text is actual characters (in your PDF file).

The green box "text" isn't text at all. It's a photo of text. And THAT'S the key difference!

OCR - Scanning in a paper document so you can edit it

Now you know the difference between a PDF that contains real text and one that contains a photo of text. And this is why you cannot simply scan-in, say, a 25-page document, such as a report or legal document, and open/edit it with Microsoft Word.

There are ways to convert a PDF file of scanned pages into actual editable text that can be opened with Microsoft Word. That process is called OCR (Optical Character Recognition) and most full-featured PDF products, like Acrobat and Phantom, can do that.

HOWEVER, OCR'ing a PDF file of scanned documents is fraught with problems. OCR software uses complex algorithms to analyze the pictures of letters and symbols in a scanned document, trying to convert them into actual editable text. Even if the scanned-in paper documents are perfect originals (not photocopies), free of blemishes and random marks, free of creases, and all the text is perfectly sharp and pristine, the OCR engine that performs the conversion will still make mistakes -- that must be manually corrected by you.

 

To the extent the paper document is anything less than entirely pristine, the OCR error rate will rapidly climb to untenable levels, making the entire OCR effort too much trouble.

And that's assuming the paper document is just plain old text. If you are trying to OCR a document containing tables, photos, illustrations, side bars, or any number of other page design/layout elements that isn't just plain old lines of text running from the left margin to the right, then you'll encounter a whole new set of issues when trying to edit the results.

It's far beyond the scope of this web page to get into the hairy details of how OCR works. But suffice to say, it's often a frustrating process yielding poor results. You're always better off finding the original electronic documents. Lacking that, having a real human recreate the document by manually typing it in may be the only way to get a super clean error-free document. You could easily find yourself spending nearly as much time correcting errors. If your paper document is hundreds of pages long and in less-than-pristine condition or contains complications like tables, illustrations, photos, etc., well, you've got a problem.