How to extract text from a PDF document with Go

Extracting text from a PDF document is a common need in scenarios such as data mining, business automation, and data analysis. In this article, we will write a simple Go program that pulls the page content out of a PDF file. Go is a modern compiled language that is well suited to high-performance applications. To accomplish this task, we will use an open-source Go library called pdfcpu, which can process PDF files and extract their page content.

First, you need to install the pdfcpu package in your current Go project. This can be done using the command:


go get github.com/pdfcpu/pdfcpu

Here is a simple example of how to extract text from a PDF using pdfcpu in the project's main.go file:


package main

import (
  "fmt"
  "log"
  "os"

  "github.com/pdfcpu/pdfcpu/pkg/api"
)

func main() {
  // Check if the PDF file was passed as an argument
  if len(os.Args) < 2 {
    log.Fatalf("Usage: %s <file.pdf>\n", os.Args[0])
  }

  pdfFile := os.Args[1]

  // Directory that receives the extracted page content
  outDir := "out"
  if err := os.MkdirAll(outDir, 0o755); err != nil {
    log.Fatalf("Error creating output directory: %v\n", err)
  }

  // Extract the content of all pages (nil page selection)
  // using pdfcpu's default configuration (nil configuration)
  if err := api.ExtractContentFile(pdfFile, outDir, nil, nil); err != nil {
    log.Fatalf("Error extracting content: %v\n", err)
  }

  // Report where the extracted content was written
  fmt.Printf("Page content written to %s\n", outDir)
}


Code explanation:

  • Package import: The code uses the packages fmt, log and os from the Go standard library, as well as pdfcpu's api package for working with PDF files.
  • Checking arguments: The program checks whether a PDF file was passed as an argument. If not, it prints a usage message and exits.
  • Extracting content: The api.ExtractContentFile function writes the content of each page to the output directory. In recent pdfcpu releases it takes four parameters: the path to the PDF file, the output directory, a page selection (nil selects all pages) and a configuration (nil selects the default configuration).
  • Reading the result: The extracted files contain page content in PDF syntax, where the text appears as string operands of text-showing operators such as Tj and TJ. The program prints the name of the output directory once extraction has finished.
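
Inside a PDF, page text is stored in content streams as string operands of text-showing operators such as Tj and TJ. The following is a minimal sketch of pulling those strings out of a decoded content stream using only the standard library. It is not part of pdfcpu: the helper name extractLiteralStrings is hypothetical, and the sketch assumes simple literal strings in a standard encoding (hex strings, octal escapes and custom font encodings are not handled).

```go
package main

import (
	"fmt"
	"strings"
)

// extractLiteralStrings returns every literal string "(...)" found in a
// decoded PDF content stream. Hypothetical helper for illustration only.
func extractLiteralStrings(content string) []string {
	var out []string
	var sb strings.Builder
	depth := 0 // nesting level of parentheses; 0 means outside a string
	for i := 0; i < len(content); i++ {
		c := content[i]
		if depth == 0 {
			if c == '(' {
				depth = 1
				sb.Reset()
			}
			continue
		}
		switch c {
		case '\\':
			// Keep the escaped character as-is; escapes such as \n or
			// octal codes are not decoded in this sketch.
			if i+1 < len(content) {
				i++
				sb.WriteByte(content[i])
			}
		case '(':
			// Balanced parentheses may nest inside a literal string.
			depth++
			sb.WriteByte(c)
		case ')':
			depth--
			if depth == 0 {
				out = append(out, sb.String())
			} else {
				sb.WriteByte(c)
			}
		default:
			sb.WriteByte(c)
		}
	}
	return out
}

func main() {
	// A tiny hand-written content stream with one Tj and one TJ operator.
	stream := "BT /F1 12 Tf 72 720 Td (Hello, ) Tj [(Wor) -20 (ld \\(PDF\\))] TJ ET"
	fmt.Println(strings.Join(extractLiteralStrings(stream), ""))
	// prints: Hello, World (PDF)
}
```

Real-world PDFs often use compressed streams, hex strings and font-specific encodings, so this approach is only a starting point for simple documents.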

Conclusion

Extracting text from a PDF in Go is a relatively simple task thanks to the pdfcpu library. With just a few lines of code, you can effectively extract text from PDF files and use it for further processing. This tool can be very useful for a variety of applications, especially when integrated into automated processing pipelines or document management systems.