Portable Document Format

Portable Document Format (PDF) is a page description language.

PDF History

It was released by Adobe in 1993 as a proprietary technology.

It succeeded because it was the first PDL to solve the print fidelity problem, preserving the exact visual layout (fonts, vector graphics, raster images, and page geometry) independently of device, OS, or application.

It was released as an open ISO standard in 2008.

PDF Characteristics

Files from early PDF versions were largely human-readable.

Modern PDF files are only partly human-readable: most of their content is binary, because their streams are usually compressed.
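To make the "human-readable" point concrete, the following is a minimal sketch that assembles a one-page PDF by hand, with no compressed streams. The function name `build_minimal_pdf` is hypothetical, and the sketch assumes the text contains no parentheses or backslashes (those would need escaping inside a PDF string).

```python
def build_minimal_pdf(text: str = "Hello, PDF") -> bytes:
    """Assemble a tiny one-page PDF whose objects are all plain text."""
    # Content stream: use font F1 at 24 pt, move to (72, 720), show the text.
    content = f"BT /F1 24 Tf 72 720 Td ({text}) Tj ET".encode("latin-1")
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] "
        b"/Contents 4 0 R /Resources << /Font << /F1 5 0 R >> >> >>",
        b"<< /Length %d >>\nstream\n" % len(content) + content + b"\nendstream",
        b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",
    ]
    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset of "N 0 obj", needed by xref
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"
    xref_pos = len(out)
    # Cross-reference table: one 20-byte entry per object, plus the free entry 0.
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"
    for off in offsets:
        out += b"%010d 00000 n \n" % off
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_pos
    return bytes(out)


if __name__ == "__main__":
    with open("minimal.pdf", "wb") as handle:
        handle.write(build_minimal_pdf())
```

Opening the resulting file in a text editor shows every object, the page tree, and the drawing operators in plain text, which is roughly what early PDF files looked like.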

Advantages of PDF (or page description languages):

  • Keeps layout, design, and printability consistent across platforms
  • Hinders editing (when that is desired)

Disadvantages of PDF (or page description languages):

  • Reduces processability (compared to a markup language)
  • Reduces accessibility
  • Reduces reusability
  • Doesn’t ensure integrity natively

Linearized PDFs are optimized for web viewing: a viewer can display the first page before the whole file has been downloaded.
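A linearized file announces itself with a linearization dictionary in its first object, which the specification places within the first 1024 bytes of the file. A minimal heuristic check, assuming the file has not been modified since it was linearized (the function name `looks_linearized` is hypothetical):

```python
def looks_linearized(data: bytes) -> bool:
    """Heuristic: look for the /Linearized key near the start of the file.

    This only detects the marker. It does not verify that the hint tables
    are still valid, for example after an incremental update.
    """
    return b"/Linearized" in data[:1024]
```

A tool such as qpdf can perform a full check, which this sketch does not attempt.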

Tagged PDF is a PDF feature that stores the logical structure of a document (headings, paragraphs, lists, tables) as a tree of tags. The root of the structure tree in a tagged PDF is StructTreeRoot. It is defined in ISO 32000. It is used in both PDF/A and PDF/UA.
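As a rough illustration, a script can look for the two catalog entries that usually signal a tagged file: StructTreeRoot and a MarkInfo dictionary with Marked set to true. This is only a sketch: it scans raw bytes, so it misses files whose catalog is stored in a compressed object stream or whose MarkInfo is an indirect reference, and the name `tagging_hints` is hypothetical.

```python
import re

# Matches an inline dictionary such as: /MarkInfo << /Marked true >>
_MARKED = re.compile(rb"/MarkInfo\s*<<[^>]*/Marked\s+true")


def tagging_hints(data: bytes) -> dict:
    """Report raw-byte hints that a PDF claims to be tagged."""
    return {
        "has_struct_tree_root": b"/StructTreeRoot" in data,
        "marked_true": _MARKED.search(data) is not None,
    }
```

Finding both hints does not mean the tags are correct or complete; that requires a validator.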

PDF Compression

PDF compression techniques:

  • FlateDecode (zlib/deflate) – Common for text and streams.
  • LZW (Lempel-Ziv-Welch) – Older, less common now.
  • DCTDecode (JPEG) – Used for image data.
  • CCITTFaxDecode – Used for monochrome images (e.g., scanned documents).
  • JBIG2Decode – Used for bilevel image compression.

Most text, fonts, and metadata are compressed using FlateDecode, which is reversible.
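FlateDecode data uses the zlib format, so the reversibility can be shown with Python's standard `zlib` module. The sketch below assumes a stream with no predictor and no additional filters; the helper names are hypothetical.

```python
import zlib


def flate_encode(raw: bytes) -> bytes:
    """Compress stream data the way a /FlateDecode filter expects it."""
    return zlib.compress(raw)


def flate_decode(encoded: bytes) -> bytes:
    """Recover the original stream data from /FlateDecode-compressed bytes."""
    return zlib.decompress(encoded)


if __name__ == "__main__":
    content = b"BT /F1 12 Tf 72 720 Td (Hello, PDF) Tj ET\n" * 50
    packed = flate_encode(content)
    print(len(content), "bytes ->", len(packed), "bytes")
    print("lossless:", flate_decode(packed) == content)
```

This contrasts with DCTDecode (JPEG), which is normally lossy: decoding does not give back the original pixels.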

Types of PDF

PDF types featured in this post:

  • PDF/A
  • PDF/UA
  • PDF/E

A PDF can conform to several PDF types at once if it meets all their requirements. For example, a PDF file can be both PDF/A and PDF/UA.

PDF/A

PDF/A is focused on the long-term preservation of files.

PDF/A is defined in the ISO 19005 family of standards.

It uses Tagged PDF for its accessible (a) conformance levels.

It consists of several parts, each based on a PDF version:

  • A-1 – ISO 19005-1 – based on PDF 1.4
  • A-2 – ISO 19005-2 – based on PDF 1.7
  • A-3 – ISO 19005-3 – based on PDF 1.7
  • A-4 – ISO 19005-4 – based on PDF 2.0

Each part may define different conformance levels:

  • a = Accessible / fully tagged
  • b = Basic visual fidelity only
  • u = Unicode support (text reproducible)
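A PDF/A file declares its part and conformance level in its XMP metadata, through the `pdfaid:part` and `pdfaid:conformance` properties. The sketch below reads that declaration from raw bytes. It assumes the XMP packet is stored uncompressed and in an ASCII-compatible encoding, and the function name `pdfa_claim` is hypothetical. It reports what the file claims, not whether the claim is valid.

```python
import re

# XMP may write a property as an attribute (pdfaid:part="1")
# or as an element (<pdfaid:part>1</pdfaid:part>); accept both.
_PART = re.compile(rb"pdfaid:part\s*(?:=\s*[\"']|>)\s*(\d+)")
_CONF = re.compile(rb"pdfaid:conformance\s*(?:=\s*[\"']|>)\s*([A-Za-z])")


def pdfa_claim(data: bytes):
    """Return the declared PDF/A flavour (e.g. 'PDF/A-2b'), or None."""
    part = _PART.search(data)
    if part is None:
        return None
    conf = _CONF.search(data)
    level = conf.group(1).decode("ascii").lower() if conf else ""
    return f"PDF/A-{part.group(1).decode('ascii')}{level}"
```

Checking the claim itself is the job of a validator such as veraPDF.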

PDF/A-4 simplified the conformance levels, dropping the a, b, and u levels of the earlier parts.

PDF/A-4 conformance levels:

  • PDF/A-4 (base): simplified baseline
  • PDF/A-4f: allows file attachments
  • PDF/A-4e: for engineering (CAD) use

ISO 19005-1 official website

PDF/UA

PDF/Universal Accessibility (PDF/UA) is the global standard for PDF accessibility.

It is defined in ISO 14289.

It uses Tagged PDF.

Version:

  • PDF/UA-2

PDF/UA World (probably the same as the former PDF/UA Foundation) is an organization that promotes the use of PDF/UA. It provides a guide to using PDF/UA.

PDF/UA World official website

PDF/E

PDF/E is used in engineering workflows.

PDF Tagging

PDF allows tagging, as described in the PDF 2.0 specification standardized as ISO 32000-2.

PDF tagging is a prerequisite for accessibility.

Some PDF/A conformance levels (PDF/A-1a, PDF/A-2a, and PDF/A-3a), as well as PDF/UA, mandate tagging through a structure tree. PDF/A-4 no longer has such a level, and tagging is only recommended there, so among the current generations PDF/UA is the only one that mandates it.

PDF Accessibility

One of the main drawbacks of PDF is that it reduces accessibility and reusability.

One possible solution is to make the PDF accessible by following PDF/UA, which requires tagging.

Another possible solution is to author in a markup language (such as HTML) that can be exported to PDF, or to offer a processable format (a markup language, e.g. HTML) and a printable format (a page description language, e.g. PDF) independently. In either case, check that the PDF converter preserves the original tagging structure.

Once the PDF is created, its accessibility must be checked.

PDF Security

You can read this post about PDF security.

PDF Manipulation

How to parse PDF files

How to merge PDF files

How to compress PDF files

How to generate a PDF file from a script

How to print multiple ranges of PDF files in macOS X

PDF Tools

PDF Reading Tools

Sumatra PDF

Sumatra PDF is a PDF reader for desktop platforms (Windows).

ReadEra

ReadEra is a PDF reader for mobile platforms.

Moon+ Reader

Moon+ Reader is a PDF reader for mobile platforms.

PDF Manipulation Tools

PDF manipulation tools:

  • Stirling PDF
  • PDFsam
  • PDF Toolkit

Stirling PDF

Stirling PDF official website

PDFsam

PDFsam (from PDF split and merge) is a family of tools.

PDFsam Basic is FOSS.

PDFsam Basic code repository

PDFsam Enhanced is proprietary.

PDF Toolkit

PDF Toolkit (pdftk) is a CLI application to manipulate PDF data.

It can dump raw PDF data.

PDF Compression Tools

PDF compression tools:

  • qpdf
  • MuPDF

qpdf is a CLI tool and library to decompress, modify, and recompress PDFs.

MuPDF includes the mutool CLI command, which can also decompress, modify, and recompress PDFs.
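As a sketch of how these tools might be driven from a script, the helpers below build command lines that rewrite a PDF with its streams decompressed, so it can be inspected in a text editor. The flags (`--qdf` and `--object-streams=disable` for qpdf, `clean -d` for mutool) come from each tool's documentation and are worth verifying against the installed version. The helper names are hypothetical.

```python
import shutil
import subprocess


def qpdf_uncompress_command(src: str, dst: str) -> list[str]:
    """qpdf: write an editable, uncompressed copy of src to dst."""
    return ["qpdf", "--qdf", "--object-streams=disable", src, dst]


def mutool_uncompress_command(src: str, dst: str) -> list[str]:
    """mutool: rewrite src to dst with streams decompressed."""
    return ["mutool", "clean", "-d", src, dst]


def run_if_available(cmd: list[str]):
    """Run cmd and return its exit code, or None if the tool is not installed."""
    if shutil.which(cmd[0]) is None:
        return None
    return subprocess.run(cmd, check=False).returncode


if __name__ == "__main__":
    print(run_if_available(qpdf_uncompress_command("in.pdf", "out.pdf")))
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.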

PDF OCR Tools

OCRmyPDF

OCRmyPDF is an application that adds a searchable OCR text layer to scanned PDFs. It leverages the Tesseract OCR engine.

OCRmyPDF code repository

OCRmyPDF official documentation

Accessibility Checker Tools

PDF accessibility checkers:

  • veraPDF
  • PDF Accessibility Checker (PAC)
  • Adobe Acrobat Pro’s accessibility checker
  • AXE PDF

veraPDF is a FOSS validation tool for PDF/A that also supports PDF/UA checks.

It belongs to the Open Preservation Foundation.

veraPDF code repository

PDF Accessibility Checker (PAC) is a PDF accessibility checker. It is closed-source freeware, available as a desktop application for Windows.

It is developed by the Swiss foundation Access for All (Zugang für alle) through the company axes4 GmbH.

The project is funded mainly by Germanic public organizations.

PAC versions:

  • 2021
  • 2024
  • 2026

PAC official website

Adobe Acrobat Pro’s accessibility checker is a paid proprietary solution.

AXE PDF is a paid proprietary PDF accessibility checker.

PDF Security Tools

pdfid is a PDF scanner and analyzer used in security work. It is written in Python.
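The general idea behind this kind of triage is to count PDF names that are often associated with active content. The following is a simplified sketch of that idea, not pdfid's actual implementation: the keyword list is an illustrative assumption, and unlike a real scanner it does not handle obfuscated names or compressed object streams.

```python
import re

# Illustrative subset of names a security scan might flag (an assumption).
KEYWORDS = ("/JS", "/JavaScript", "/OpenAction", "/AA", "/Launch", "/EmbeddedFile")


def count_keywords(data: bytes) -> dict:
    """Count occurrences of each keyword as a complete PDF name."""
    counts = {}
    for keyword in KEYWORDS:
        # The lookahead stops /AA from matching inside a longer name like /AAB.
        pattern = re.escape(keyword.encode("ascii")) + rb"(?![A-Za-z0-9])"
        counts[keyword] = len(re.findall(pattern, data))
    return counts
```

A non-zero count is only a reason to look closer: many legitimate files use /OpenAction or embedded files.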

PDF Generation Tools

This section is about tools to generate a PDF from a markup language.

  • PDF-LIB
  • pdfme
  • clawPDF

PDF-LIB is a library written in JavaScript.

PDF-LIB code repository

pdfme is an application written in TypeScript.

pdfme code repository

clawPDF can create PDF documents, including the PDF/A-1b, PDF/A-2b, and PDF/A-3b formats.

clawPDF code repository

PDF Markup Recovery Tools

This section is about tools to reconstruct the source descriptive markup language document from a PDF.

You can find a post with a list of descriptive markup languages.

PDF Conversion Tools

This section is about tools to convert a PDF to a different output.

  • Calibre
  • pdf2epubEX
  • overcuriosity’s pdf2epub
