How to parse a PDF File

This post explains how to parse a PDF file, that is, how to extract its content and iterate through the elements of that content.

Parsing a PDF allows you to convert it to a more convenient file format, such as CSV, JSON or XML.

You can download a sample PDF file from this external link.

Using Python Module PyPDF2

PyPDF2 is a library for the Python programming language that allows you to manipulate PDF files.

You need to know the Python programming language in order to use this library. You can read an introduction to Python (including resources to learn it) in this post.

As the module is free and open-source software (FOSS), it can be installed using the pip command.

You can type this command to install PyPDF2:

pip install pypdf2

The PyPDF2 documentation can be found at this external link.

The source code examples in this post use Python 3.11.3 and PyPDF2 3.0.1. Function calls may vary between versions.

Source code example of how to parse the pages within a PDF file, printing its content:

import PyPDF2

# Open the file in binary mode; the with block closes it when done
with open("sample.pdf", mode="rb") as sample_file:
    sample_reader = PyPDF2.PdfReader(sample_file)
    print(sample_reader.metadata)

    # Print the text of each page, numbering pages from 1
    for i, current_page in enumerate(sample_reader.pages, start=1):
        print("===================")
        print("Content on page: " + str(i))
        print("===================")
        print(current_page.extract_text())
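
The text returned by extract_text() is a plain string, so it usually needs some processing before it looks like a table. The following is only a minimal sketch of one possible approach: the function name text_to_rows and the sample text are made up for illustration, and it assumes that each table row sits on its own line with columns separated by a tab or by two or more spaces. Real PDF files often do not follow that layout, so the splitting rule will likely need to be adapted.

```python
import re

def text_to_rows(page_text):
    """Split extracted page text into a list of rows (lists of cells).

    Assumption: one table row per line, columns separated by a tab
    or by two or more whitespace characters.
    """
    rows = []
    for line in page_text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        rows.append(re.split(r"\s{2,}|\t", line))
    return rows

# Hypothetical text standing in for the result of extract_text()
sample_text = "Name    Quantity\nApples    3\n\nPears    5\n"
rows = text_to_rows(sample_text)
print(rows)
```

In practice, the string passed to the function would be the text extracted from a page rather than a hard-coded sample.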

After you process the text to get a table or two-dimensional array, you can export it as CSV using the following code:

import csv

rows = [
    ['header1', 'header2'],
    ['row1_value1', 'row1_value2'],
    ['row2_value1', 'row2_value2'],
]

with open('output.csv', 'w', newline='') as csvfile:
    csvwriter = csv.writer(csvfile)
    for row in rows:
        csvwriter.writerow(row)
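
The same table can also be exported as JSON with the json module from the Python standard library. This is a sketch under one assumption: the first row holds the column names, so each remaining row becomes an object keyed by those names. The file name output.json is chosen only for illustration.

```python
import json

rows = [
    ['header1', 'header2'],
    ['row1_value1', 'row1_value2'],
    ['row2_value1', 'row2_value2'],
]

# Treat the first row as the header and the rest as data rows
header, *data = rows

# Build one dictionary per data row, keyed by the header names
records = [dict(zip(header, row)) for row in data]

with open('output.json', 'w', encoding='utf-8') as jsonfile:
    json.dump(records, jsonfile, indent=2)
```

If the table has no header row, the list of lists can be passed to json.dump directly instead of building dictionaries.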

