How to extract text from a PDF document with Python

In this article, we will explore how to extract text from a PDF document using PyPDF2.

Extracting text from a PDF document is useful in many applications, such as data mining, document analysis, and business process automation. Python's large library ecosystem offers several tools that make this task straightforward.

First, install the PyPDF2 library with pip:


pip install PyPDF2
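
To confirm that the installation worked, we can ask Python which version of the package it sees. The snippet below is a minimal sketch that uses only the standard library (Python 3.8 or later); the helper name installed_version is an illustrative choice and not part of PyPDF2.

```python
from importlib import metadata, util
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not available.

    `installed_version` is a hypothetical helper written for this article.
    """
    # find_spec returns None when the top-level module cannot be imported
    if util.find_spec(package) is None:
        return None
    try:
        # Look up the version recorded by pip for this distribution
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        # Importable, but no distribution metadata (e.g. a standard-library module)
        return None


if __name__ == "__main__":
    version = installed_version("PyPDF2")
    if version is None:
        print("PyPDF2 is not installed; run: pip install PyPDF2")
    else:
        print(f"PyPDF2 {version} is installed")
```

If the script reports that PyPDF2 is missing, the pip command was most likely run for a different Python interpreter or virtual environment than the one executing the script.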

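PyPDF2 is one of the most popular libraries for working with PDF files in Python. It offers basic functionality such as PDF merging, page rotation, and text extraction. Note that PyPDF2 3.0.x is the final release under that name; development continues in the pypdf package, which provides the same PdfReader interface used here.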

Here is a simple example of how to extract text from a PDF using PyPDF2:


import PyPDF2

# Open the PDF file
with open('document.pdf', 'rb') as file:
  # Create a PDF reader
  pdf_reader = PyPDF2.PdfReader(file)

  # Initialize an empty string for the extracted text
  text = ""

  # Iterate over the pages of the PDF
  for page in pdf_reader.pages:
    # Extract the text of each page; pages without a text layer yield an empty string
    text += (page.extract_text() or "") + "\n"

  print(text)

Explanation of the code:

  • We open the PDF file in binary reading mode ('rb').
  • We create a PdfReader object that reads the contents of the PDF.
  • We iterate through all the pages of the document and use the extract_text() method to get the text of each page.
  • The extracted text is then stored in a variable and can be printed or processed further.
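
For further processing, it is often more convenient to wrap these steps in reusable functions. The following is a minimal sketch under a few assumptions: the names extract_page_texts and extract_pdf_text and the separator parameter are illustrative and not part of PyPDF2, and the only PyPDF2 calls used are PdfReader, pages, and extract_text() from the example above.

```python
from typing import Iterable, List


def extract_page_texts(pages: Iterable) -> List[str]:
    """Return one string per page (hypothetical helper for this article).

    `pages` is any iterable of objects with an extract_text() method,
    such as PdfReader.pages. Pages that yield no text become "".
    """
    texts = []
    for page in pages:
        # Guard against a missing result so the list only contains strings
        texts.append(page.extract_text() or "")
    return texts


def extract_pdf_text(path: str, separator: str = "\n\n") -> str:
    """Open the PDF at `path` and return the text of all pages joined by `separator`."""
    # Imported here so extract_page_texts stays usable without PyPDF2 installed
    import PyPDF2

    # Open in binary mode, as in the example above
    with open(path, "rb") as file:
        reader = PyPDF2.PdfReader(file)
        return separator.join(extract_page_texts(reader.pages))
```

Calling extract_pdf_text('document.pdf') would then return the whole document as one string, while extract_page_texts keeps the text separated per page, which helps when page numbers matter. Scanned PDFs that contain only images have no text layer, so they produce empty strings with this approach.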

Conclusion

Text extraction is just one of the many operations that can be performed on PDFs with Python. Other possibilities include image extraction, page manipulation, and much more, opening up a wide range of automated document management applications.