How to convert HTML to PDF with Pyppeteer in Python

In this article, we will explore a simple Python script to generate PDF files from HTML files using the Pyppeteer library, a Python implementation of Puppeteer. Puppeteer is a tool developed by Google that allows controlling a browser through code, offering the ability to automate many operations, including web page rendering and PDF creation. Let's see how the code works and how we can adapt it to our needs.

Pyppeteer is a library that replicates the functionality of Puppeteer (originally developed for JavaScript) in Python, allowing control over the Chrome or Chromium browser. Its capability to generate PDFs, take screenshots, and automate interactions with pages is particularly useful in contexts such as report generation, web page archiving, and web scraping.

Before running the script, it is necessary to install Pyppeteer. You can install it via pip:


pip install pyppeteer

Now let's look at the Python code that allows converting HTML files to PDF.


import asyncio
import os
from pyppeteer import launch

async def create_pdf_from_html(html, pdf_path):
  browser = await launch()
  page = await browser.newPage()

  await page.setContent(html)
  await page.pdf({'path': pdf_path, 'format': 'A4'})
  await browser.close()

async def main():
    html_files_dir = './html'
    pdf_files_dir = './pdf'
    for html_file in os.listdir(html_files_dir):
        if html_file.endswith('.html'):
          html_path = os.path.join(html_files_dir, html_file)
          pdf_path = os.path.join(pdf_files_dir, html_file.replace('.html', '.pdf'))
          html = open(html_path).read()
          await create_pdf_from_html(html, pdf_path)

asyncio.get_event_loop().run_until_complete(main())

The code consists of two asynchronous functions: create_pdf_from_html and main.

The create_pdf_from_html function is the heart of the program. It accepts two parameters:

html: the HTML content to convert to PDF.
pdf_path: the path to save the generated PDF file.

The function creates a browser instance, loads the HTML page, and generates a PDF in A4 format.


async def create_pdf_from_html(html, pdf_path):
  browser = await launch()  # Launches a browser instance
  page = await browser.newPage()  # Creates a new page

  await page.setContent(html)  # Sets the HTML content
  await page.pdf({'path': pdf_path, 'format': 'A4'})  # Generates the PDF in A4 format
  await browser.close()  # Closes the browser

The main function handles processing multiple HTML files. It defines the input (./html) and output (./pdf) directories, then loops through all HTML files in the html_files_dir directory. For each file, it reads the HTML content and generates a PDF file in the destination directory.


async def main():
    html_files_dir = './html'  # Directory for HTML files
    pdf_files_dir = './pdf'  # Directory for PDF files
    for html_file in os.listdir(html_files_dir):  # Loops through files in the HTML folder
        if html_file.endswith('.html'):  # Checks the file extension
          html_path = os.path.join(html_files_dir, html_file)  # Path of the HTML file
          pdf_path = os.path.join(pdf_files_dir, html_file.replace('.html', '.pdf'))  # PDF path
          html = open(html_path).read()  # Reads the HTML content
          await create_pdf_from_html(html, pdf_path)  # Converts to PDF

Script execution is managed using an asynchronous event loop. The asyncio.get_event_loop().run_until_complete(main()) method starts the execution of main and completes all defined asynchronous operations.

To ensure the script works correctly, it's important to organize files into two folders:

./html: where the HTML files to be converted to PDF are placed.
./pdf: where the generated PDF files will be saved.

Generating PDFs from HTML is a useful technique in many applications, such as:

Automatic report generation.
Archiving web pages for future reference.
Transforming dynamic content into printable documents.

This approach is especially advantageous for those working with dynamic web content who need to automate document generation in stable, print-ready formats.

Conclusions

In this article, we have shown how to use Pyppeteer to create PDFs from HTML files with Python. We analyzed the code in detail, highlighting how Python's asynchronous features and Pyppeteer make it easy to handle browser automation and document generation tasks.