How to parse a PDF File

This post explains methods to parse a PDF file, that is, to extract its content and iterate over the elements of that content.

Parsing a PDF allows you to convert it to a more convenient file format, such as CSV, JSON or XML.

Python libraries to parse PDF files:

PyPDF2
PyMuPDF

Other PDF parser tools:

GROBID
LayoutParser
pdfminer

GROBID, LayoutParser and pdfminer are all free and open-source software (FOSS).

You can download a sample PDF file from this external link.

Using Python Module PyPDF2

PyPDF2 is a module (library) for the Python programming language that lets you manipulate PDF files.

You need to know the Python programming language to use this library. You can read an introduction to Python (including resources to learn it) in this post.

The module is free and open-source software (FOSS) and can be installed with the pip command.

You can type this command to install PyPDF2:

pip install pypdf2

The PyPDF2 documentation can be found on this external link.

This post shows source code examples using Python 3.11.3 and PyPDF2 3.0.1. Function names may vary between versions.
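
Because function names depend on the installed release, it can help to check which version of PyPDF2 is available before running the examples. The following is a minimal sketch that uses only the standard library; the helper name installed_version is hypothetical and not part of PyPDF2. Releases before 3.0 exposed some calls under other names (for example, PdfFileReader instead of PdfReader), so the result tells you which spelling to expect.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # "PyPDF2" is the distribution name that pip uses for this library.
    print("PyPDF2 version:", installed_version("PyPDF2"))
```

If the function returns None, the library is not installed in the current environment.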

The following source code example iterates over the pages of a PDF file, printing the document metadata and the text of each page:

import PyPDF2

# Open the file in binary mode; the with block closes it automatically.
with open("sample.pdf", mode="rb") as sample_file:
    sample_reader = PyPDF2.PdfReader(sample_file)
    print(sample_reader.metadata)

    # Number the pages from 1 in the printed output.
    for i, current_page in enumerate(sample_reader.pages, start=1):
        print("===================")
        print("Content on page: " + str(i))
        print("===================")
        print(current_page.extract_text())
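
The text returned by extract_text() is a plain string, so it has to be split into rows and cells before it can be treated as a table. The following is a minimal sketch of that step. It assumes that each line of text is one row and that cells are separated by two or more spaces; both are assumptions for illustration, and the function name text_to_rows is hypothetical. Real PDFs often need different rules.

```python
import re


def text_to_rows(page_text, min_gap=2):
    """Split extracted text into rows of cells.

    Assumption: each line is one table row and cells are separated
    by at least `min_gap` whitespace characters.
    """
    separator = re.compile(r"\s{%d,}" % min_gap)
    rows = []
    for line in page_text.splitlines():
        line = line.strip()
        if line:  # skip empty lines
            rows.append(separator.split(line))
    return rows


# Hand-written sample text standing in for the output of extract_text().
sample_text = "Name   Quantity\nApples   3\nPears    12\n"
print(text_to_rows(sample_text))
# → [['Name', 'Quantity'], ['Apples', '3'], ['Pears', '12']]
```

The result is a list of lists, which is the two-dimensional shape that the csv module expects.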

After you process the text to get a table or two-dimensional array, you can export it as CSV using the following code:

import csv

rows = [
    ['header1', 'header2'],
    ['row1_value1', 'row1_value2'],
    ['row2_value1', 'row2_value2'],
]

# newline='' lets the csv module control line endings itself.
with open('output.csv', 'w', newline='') as csvfile:
    csvwriter = csv.writer(csvfile)
    csvwriter.writerows(rows)
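
JSON is another of the target formats mentioned at the start of this post. As a minimal sketch, the same rows can be exported as JSON with the standard json module. The helper rows_to_records and the file name output.json are hypothetical names chosen for illustration; the sketch assumes that the first row holds the column headers.

```python
import json


def rows_to_records(rows):
    """Turn [header, row, row, ...] into a list of dicts keyed by header."""
    header, *data = rows
    return [dict(zip(header, row)) for row in data]


rows = [
    ['header1', 'header2'],
    ['row1_value1', 'row1_value2'],
    ['row2_value1', 'row2_value2'],
]

# Write one JSON object per data row, indented for readability.
with open('output.json', 'w', encoding='utf-8') as jsonfile:
    json.dump(rows_to_records(rows), jsonfile, indent=2)
```

Using one dictionary per row keeps the column names next to each value, which is usually easier to consume than a bare list of lists.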

Reading PDFs that contain text as images

You can extend the Python program so that it can read such PDFs and convert them to text using OCR (optical character recognition) technologies. You can read this external post where an approach is proposed.
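
Before adding OCR, it is useful to know which pages actually need it. The following is a minimal sketch of one possible heuristic, not the approach from the external post: a page whose extracted text is empty or nearly empty is assumed to be a scanned image. The function names and the 20-character threshold are hypothetical choices for illustration. PyPDF2 is imported only inside extract_page_texts, so the detection function can be tried without a PDF file.

```python
def pages_needing_ocr(page_texts, min_chars=20):
    """Return 1-based page numbers whose extracted text is (almost) empty.

    Assumption: a page with fewer than `min_chars` non-whitespace
    characters is probably a scanned image. The threshold is arbitrary.
    """
    flagged = []
    for number, text in enumerate(page_texts, start=1):
        visible = "".join((text or "").split())
        if len(visible) < min_chars:
            flagged.append(number)
    return flagged


def extract_page_texts(pdf_path):
    """Collect the text of every page using PyPDF2 (imported lazily)."""
    import PyPDF2

    with open(pdf_path, mode="rb") as pdf_file:
        reader = PyPDF2.PdfReader(pdf_file)
        return [page.extract_text() for page in reader.pages]


# Example with hand-written page texts instead of a real PDF:
print(pages_needing_ocr(["A page with plenty of extractable text.", "", "  \n "]))
# → [2, 3]
```

With a real file, pages_needing_ocr(extract_page_texts("sample.pdf")) would list the pages to hand over to an OCR tool.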

There are online tools that feature OCR technology to read text that has been saved as an image.

Some of these tools are PDF2Go’s PDF to text (which supports multiple languages), FreeConvert and Xodo.

External references

wellsr.com; “Read PDF file with Python using PyPDF2”; wellsr.com
