How to extract text from a PDF document with Node.js

In this article we will see how to extract text from a PDF document using Node.js, through dedicated libraries.

Extracting text from a PDF file is a common task in many development contexts. Node.js has several open source libraries that make it straightforward to implement.

Several libraries work with PDFs in Node.js. Two popular ones are:

  1. pdf-parse: A library that parses PDF files and extracts their text with a single function call.
  2. pdf-lib: A library for creating and modifying PDFs. It can read document metadata and form fields, but it does not provide a text extraction API.

For this article, we will focus on pdf-parse, since it is the simpler of the two and is built specifically for text extraction.

Let's see an example of how to use it. The code in this article uses the function-style API of pdf-parse 1.x, which can be installed with npm install pdf-parse@1; other major versions may expose a different interface.


const fs = require('fs');
const pdfParse = require('pdf-parse');

// Read PDF file
const dataBuffer = fs.readFileSync('document.pdf');

// Perform text extraction
pdfParse(dataBuffer).then(function(data) {
  // Print extracted text
  console.log(data.text);
}).catch(function(error) {
  console.error('Error while extracting text:', error);
});

Code explanation:

  • fs.readFileSync('document.pdf'): Reads the PDF file synchronously and returns its content as a Buffer.
  • pdfParse(dataBuffer): Takes the buffer and returns a Promise that, once resolved, yields an object describing the parsed document.
  • data.text: The property of that object that holds the text extracted from the PDF document.
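The same flow can also be written with async/await and a non-blocking file read. The following is a minimal sketch, not part of pdf-parse itself: the helper name extractText and its optional parse parameter are hypothetical, and the default assumes the 1.x function-style API used above. Accepting the parser as a parameter makes the helper easy to exercise with a stub.

```javascript
const fs = require('fs');

// Hypothetical helper: reads a PDF from disk and resolves to its extracted text.
// `parse` defaults to pdf-parse (1.x function-style API, an assumption) and is
// loaded only when the caller does not supply another parser.
async function extractText(filePath, parse = require('pdf-parse')) {
  // Non-blocking read, unlike fs.readFileSync
  const dataBuffer = await fs.promises.readFile(filePath);
  // The parser resolves to an object whose `text` property holds the text
  const data = await parse(dataBuffer);
  return data.text;
}

// Usage sketch (assumes a file named document.pdf exists):
// extractText('document.pdf')
//   .then((text) => console.log(text))
//   .catch((error) => console.error('Error while extracting text:', error));
```

Because the function is async, read errors and parsing errors both surface as a rejected Promise, so a single catch handles them.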

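In addition to the text, the object returned by pdf-parse carries other information about the PDF, such as:

  • Number of pages (data.numpages).
  • Document information such as title and author, when present (data.info), and document metadata (data.metadata).
  • Version of the underlying PDF.js library (data.version).

The file size is not part of this object, but it is available as dataBuffer.length. Text for individual pages is not returned separately; obtaining it requires a custom pagerender option.

You can access this information through the data object that the Promise resolves to. Here is an example:


pdfParse(dataBuffer).then(function(data) {
  console.log('Number of pages:', data.numpages);
  // pdf-parse does not report the file size; use the buffer length in bytes
  console.log('File size (bytes):', dataBuffer.length);
  console.log('Extracted text:', data.text);
}).catch(function(error) {
  console.error('Error while extracting text:', error);
});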
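To get the text of each page separately, pdf-parse 1.x accepts a pagerender callback in its options object. The sketch below follows the pattern shown in the pdf-parse README; the helper name createPageCollector is hypothetical, and it assumes pages are rendered one after another in document order.

```javascript
// Hypothetical helper: builds a pagerender callback that stores each page's
// text in `pages` and also returns it, so data.text is still populated.
function createPageCollector() {
  const pages = [];

  async function pagerender(pageData) {
    const textContent = await pageData.getTextContent({
      normalizeWhitespace: false,
      disableCombineTextItems: false
    });
    let lastY;
    let text = '';
    for (const item of textContent.items) {
      // transform[5] is the vertical position; a change means a new line
      const y = item.transform[5];
      if (lastY !== undefined && y !== lastY) {
        text += '\n';
      }
      text += item.str;
      lastY = y;
    }
    pages.push(text);
    return text;
  }

  return { pages, pagerender };
}

// Usage sketch (assumes pdf-parse 1.x and a Buffer named dataBuffer):
// const { pages, pagerender } = createPageCollector();
// pdfParse(dataBuffer, { pagerender }).then(function() {
//   pages.forEach((text, i) => console.log('Page ' + (i + 1) + ':', text));
// });
```

The options object also accepts max, which limits the number of pages parsed; this can help when only the first pages of a large document are needed.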

Conclusion

In this article we have seen how to extract text from a PDF document using Node.js and the pdf-parse library. The process takes only a few lines of code and can be easily integrated into an application to automate the extraction and processing of data from PDF files.