In this article, we will explore a simple Python script to generate PDF files from HTML files using the Pyppeteer library, a Python implementation of Puppeteer. Puppeteer is a tool developed by Google that allows controlling a browser through code, offering the ability to automate many operations, including web page rendering and PDF creation. Let's see how the code works and how we can adapt it to our needs.
Pyppeteer is a library that replicates the functionality of Puppeteer (originally developed for JavaScript) in Python, allowing control over the Chrome or Chromium browser. Its capability to generate PDFs, take screenshots, and automate interactions with pages is particularly useful in contexts such as report generation, web page archiving, and web scraping.
Before running the script, it is necessary to install Pyppeteer. You can install it via pip:
pip install pyppeteer
Now let's look at the Python code that allows converting HTML files to PDF.
import asyncio
import os
from pyppeteer import launch
async def create_pdf_from_html(html, pdf_path):
browser = await launch()
page = await browser.newPage()
await page.setContent(html)
await page.pdf({'path': pdf_path, 'format': 'A4'})
await browser.close()
async def main():
html_files_dir = './html'
pdf_files_dir = './pdf'
for html_file in os.listdir(html_files_dir):
if html_file.endswith('.html'):
html_path = os.path.join(html_files_dir, html_file)
pdf_path = os.path.join(pdf_files_dir, html_file.replace('.html', '.pdf'))
html = open(html_path).read()
await create_pdf_from_html(html, pdf_path)
asyncio.get_event_loop().run_until_complete(main())
The code consists of two asynchronous functions: create_pdf_from_html
and main
.
The create_pdf_from_html
function is the heart of the program. It accepts two parameters:
html
: the HTML content to convert to PDF.pdf_path
: the path to save the generated PDF file.
The function creates a browser instance, loads the HTML page, and generates a PDF in A4 format.
async def create_pdf_from_html(html, pdf_path):
browser = await launch() # Launches a browser instance
page = await browser.newPage() # Creates a new page
await page.setContent(html) # Sets the HTML content
await page.pdf({'path': pdf_path, 'format': 'A4'}) # Generates the PDF in A4 format
await browser.close() # Closes the browser
The main
function handles processing multiple HTML files. It defines the input (./html
) and output (./pdf
) directories, then loops through all HTML files in the html_files_dir
directory. For each file, it reads the HTML content and generates a PDF file in the destination directory.
async def main():
html_files_dir = './html' # Directory for HTML files
pdf_files_dir = './pdf' # Directory for PDF files
for html_file in os.listdir(html_files_dir): # Loops through files in the HTML folder
if html_file.endswith('.html'): # Checks the file extension
html_path = os.path.join(html_files_dir, html_file) # Path of the HTML file
pdf_path = os.path.join(pdf_files_dir, html_file.replace('.html', '.pdf')) # PDF path
html = open(html_path).read() # Reads the HTML content
await create_pdf_from_html(html, pdf_path) # Converts to PDF
Script execution is managed using an asynchronous event loop. The asyncio.get_event_loop().run_until_complete(main())
method starts the execution of main
and completes all defined asynchronous operations.
To ensure the script works correctly, it's important to organize files into two folders:
- ./html: where the HTML files to be converted to PDF are placed.
- ./pdf: where the generated PDF files will be saved.
Generating PDFs from HTML is a useful technique in many applications, such as:
- Automatic report generation.
- Archiving web pages for future reference.
- Transforming dynamic content into printable documents.
This approach is especially advantageous for those working with dynamic web content who need to automate document generation in stable, print-ready formats.
Conclusions
In this article, we have shown how to use Pyppeteer to create PDFs from HTML files with Python. We analyzed the code in detail, highlighting how Python's asynchronous features and Pyppeteer make it easy to handle browser automation and document generation tasks.