Generate PDF from HTML Using Playwright Python

Marcelo Abreu, founder of pdforge

Oct 7, 2024

An Introduction to PDF Generation with Python and Playwright

Generating PDFs from HTML is a common requirement in SaaS applications, whether it’s for creating reports, invoices, or exporting user data. Playwright is a powerful browser automation tool that integrates seamlessly with Python, making it an ideal solution for converting HTML into PDFs. Its headless browser capabilities enable you to accurately render web pages and convert them into high-quality PDFs that support CSS, JavaScript, and dynamic content.

Playwright also has thorough official documentation that you can consult.

Alternative PDF Libraries: How Playwright Compares to Other Tools

Figure: Playwright download numbers

While Playwright is highly capable, there are alternative libraries you can consider:

Pyppeteer (2,063,960 monthly downloads): A Python port of Puppeteer, which can also control a browser for rendering HTML to PDF. However, Playwright offers more flexibility with multi-browser support and superior handling of dynamic content.

PyPDF2 (9,982,763 monthly downloads): This is a Python library for manipulating existing PDFs. While it cannot convert HTML directly into PDF, it can be useful if you need to merge or split PDF documents after generation.

Playwright’s ability to directly interact with HTML, CSS, and JavaScript gives it a clear advantage for more complex PDF generation needs.


Setting Up Playwright and Python for HTML to PDF Generation

Installing Playwright and Setting Up Your Python Environment

Start by installing the Playwright Python package, then download the browser binaries it drives:

pip install playwright
python -m playwright install

This setup allows you to control a headless browser using Python, which is essential for rendering HTML into PDF. Once installed, you can launch a browser, load your HTML content, and export it as a PDF.

Configuring Playwright for HTML to PDF Conversion

Here’s a basic script to get started with generating PDFs using Playwright:

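from pathlib import Path

from playwright.sync_api import sync_playwright


def generate_pdf(html_file, output_pdf):
    # file:// URLs need an absolute path, so resolve the input file first
    file_url = Path(html_file).resolve().as_uri()
    with sync_playwright() as p:
        browser = p.chromium.launch()  # headless Chromium supports PDF export
        page = browser.new_page()
        page.goto(file_url)
        page.pdf(path=output_pdf)
        browser.close()


generate_pdf('invoice.html', 'output.pdf')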

In this example, the script loads an HTML file, renders it in headless Chromium, and exports it as a PDF. You can pass options to the page.pdf() method to adjust formatting, such as margins and page size.

Creating a Complete Invoice HTML/CSS File for Example

Let’s use a basic invoice template to illustrate how HTML files are converted. The following HTML provides a simple, styled invoice:

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Invoice</title>
    <style>
        body { font-family: Arial, sans-serif; padding: 20px; }
        .invoice-header { text-align: center; margin-bottom: 20px; }
        .invoice-details { border: 1px solid #ddd; padding: 20px; }
        .total { font-weight: bold; margin-top: 20px; }
    </style>
</head>
<body>
    <div class="invoice-header">
        <h1>Invoice #12345</h1>
        <p>Date: 2024-10-01</p>
    </div>
    <div class="invoice-details">
        <p>Client: John Doe</p>
        <p>Service: Software Development</p>
        <p>Amount Due: $1000</p>
    </div>
    <div class="total">
        <p>Total: $1000</p>
    </div>
</body>
</html>

This invoice will be rendered into a PDF using Playwright, with the formatting from the CSS directly applied.

Step-by-Step Guide: Generating PDFs from HTML Using Playwright

Using Playwright to Render HTML and Convert It to a PDF

To generate a PDF from HTML, use Playwright’s page.pdf() function. You can specify the output format and customize settings like margins and page orientation:

page.pdf(path='output.pdf', format='A4', margin={'top': '20px', 'bottom': '20px'})

Playwright also supports adding headers and footers, which can include dynamic elements like page numbers:

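# The Python API uses snake_case keyword arguments. Templates need an explicit
# font size and the page needs top/bottom margins for them to be visible.
page.pdf(path='output.pdf', display_header_footer=True,
         margin={'top': '60px', 'bottom': '60px'},
         header_template='<div style="font-size: 10px; width: 100%; text-align: center;">Invoice Header</div>',
         footer_template='<div style="font-size: 10px; width: 100%; text-align: center;">Page <span class="pageNumber"></span> of <span class="totalPages"></span></div>')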

This provides full control over how the final PDF is structured, making it ideal for professional documents.
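The formatting options mentioned above can be collected in one place and reused across documents. The following is a minimal sketch, not a definitive setup: the helper names build_pdf_options and render_file_to_pdf, the A4 size, and the margin values are assumptions chosen for illustration. It also shows the landscape and print_background options, which control page orientation and whether CSS backgrounds are printed.

```python
from pathlib import Path


def build_pdf_options(output_path, landscape=False):
    """Collect keyword arguments for Playwright's page.pdf() (hypothetical helper)."""
    return {
        "path": output_path,
        "format": "A4",
        "landscape": landscape,
        "print_background": True,  # keep CSS background colors and images
        "margin": {"top": "20mm", "bottom": "20mm", "left": "15mm", "right": "15mm"},
    }


def render_file_to_pdf(html_file, output_path, landscape=False):
    """Render a local HTML file to PDF using the options built above."""
    # Imported lazily so build_pdf_options() works without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(Path(html_file).resolve().as_uri())
        page.pdf(**build_pdf_options(output_path, landscape))
        browser.close()
```

Calling render_file_to_pdf('invoice.html', 'invoice-landscape.pdf', landscape=True) would produce a landscape version of the invoice, assuming invoice.html exists in the working directory.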

HTML Template Engines: Jinja2 for Dynamic Content

When generating dynamic reports or invoices, manually creating HTML files can be cumbersome. Instead, you can use a template engine like Jinja2 to automate this process. Jinja2 allows you to create HTML templates and populate them with dynamic content.

Here’s how you can use Jinja2 with Python:

from jinja2 import Template

# Placeholders in {{ ... }} are filled in at render time
html_template = """
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Invoice</title>
</head>
<body>
    <h1>Invoice #{{ invoice_id }}</h1>
    <p>Client: {{ client_name }}</p>
    <p>Amount Due: ${{ amount_due }}</p>
</body>
</html>
"""

template = Template(html_template)
html_content = template.render(invoice_id="12345", client_name="John Doe", amount_due="1000")

# Save the rendered HTML so Playwright can load it from disk
with open('invoice.html', 'w', encoding='utf-8') as file:
    file.write(html_content)

Jinja2 templates enable you to generate complex, data-driven HTML dynamically, streamlining the PDF creation process.
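If you do not need the intermediate HTML file, the rendered string can likely be handed to Playwright directly. The sketch below assumes the HTML is already available as a Python string (for example, the html_content value from the Jinja2 example); the function name html_to_pdf_bytes is hypothetical. It uses page.set_content() to load the markup and relies on page.pdf() returning the PDF as bytes when no path is given.

```python
def html_to_pdf_bytes(html):
    """Render an HTML string to PDF and return the raw bytes (hypothetical helper)."""
    if not html or not html.strip():
        raise ValueError("html must be a non-empty string")

    # Imported lazily so the input check above runs without a browser installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        # set_content() loads markup directly, so no temporary file is needed.
        page.set_content(html, wait_until="load")
        pdf_bytes = page.pdf(format="A4")  # no `path`, so the PDF is only returned
        browser.close()
    return pdf_bytes
```

Returning bytes instead of writing to disk is convenient when the PDF goes straight into an HTTP response or an object store, which is a common pattern in SaaS backends.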

Best Practices for Optimizing HTML to PDF Performance in Production

Optimizing HTML to PDF conversion at scale is critical for SaaS applications. Here are some best practices:

Minimize JavaScript: Use lightweight JavaScript in your HTML templates to avoid performance bottlenecks.

Use Serverless Architectures: Deploy your PDF generation logic in a serverless environment, such as AWS Lambda or Google Cloud Functions. This allows you to scale your PDF generation automatically based on demand, reducing infrastructure costs. We have a full guide on how to deploy Playwright on AWS Lambda.

Cache Resources: Caching CSS and other assets helps reduce load times when generating PDFs from the same template multiple times.

Leverage Asynchronous Processes: Use Python’s async capabilities to handle multiple PDF generation requests simultaneously, improving overall throughput.
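The asynchronous approach from the list above can be sketched with Playwright's async API and asyncio. This is one possible arrangement under stated assumptions: the names gather_limited, render_pdf, and render_many are hypothetical, and the default limit of 4 concurrent pages is an arbitrary starting point to tune for your hardware. A single browser instance is shared across jobs, and a semaphore caps how many pages render at once.

```python
import asyncio


async def gather_limited(coros, limit):
    """Await all coroutines, running at most `limit` of them at a time."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(coro):
        async with semaphore:
            return await coro

    # gather() returns results in the same order as the inputs
    return await asyncio.gather(*(guarded(c) for c in coros))


async def render_pdf(browser, html):
    """Render one HTML string to PDF bytes in its own page."""
    page = await browser.new_page()
    try:
        await page.set_content(html)
        return await page.pdf(format="A4")
    finally:
        await page.close()


async def render_many(html_documents, limit=4):
    """Render several HTML strings with one shared Chromium instance."""
    # Imported lazily so the helpers above load without Playwright installed.
    from playwright.async_api import async_playwright

    async with async_playwright() as p:
        browser = await p.chromium.launch()
        try:
            jobs = [render_pdf(browser, html) for html in html_documents]
            return await gather_limited(jobs, limit)
        finally:
            await browser.close()
```

A caller would run asyncio.run(render_many([html_a, html_b])) and receive the PDFs as a list of bytes objects in input order. Reusing one browser avoids paying the launch cost per document, while the limit keeps memory use bounded.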

Security Considerations: Ensuring Safe PDF Generation in SaaS Systems

When generating PDFs in a SaaS context, security is paramount. Ensure the following:

Sanitize Input: Prevent malicious content by validating and sanitizing user input in your HTML templates.

Isolated Environments: Run Playwright in isolated, sandboxed environments to prevent potential security breaches.

Access Control: Ensure that sensitive data embedded in PDFs is only accessible by authorized users.
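As a small illustration of the input sanitization point, user-supplied values can be HTML-escaped before they reach a template. The sketch below uses only the standard library; the helper name escape_context and the sample data are assumptions for illustration. If you render with Jinja2, enabling its autoescape option is an alternative that escapes values at the template level.

```python
from html import escape


def escape_context(context):
    """Return a copy of `context` with every string value HTML-escaped."""
    return {
        key: escape(value) if isinstance(value, str) else value
        for key, value in context.items()
    }


# Hypothetical untrusted input, e.g. a client name typed into a form
untrusted = {"invoice_id": "12345", "client_name": "<script>alert(1)</script>"}
safe = escape_context(untrusted)
print(safe["client_name"])  # &lt;script&gt;alert(1)&lt;/script&gt;
```

Escaping matters here because the headless browser executes whatever JavaScript ends up in the page, so unescaped input would run inside your rendering environment.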

How to Use a PDF API to Automate PDF Creation at Scale

For SaaS applications that need to generate PDFs at scale, integrating with a third-party PDF API can be a more efficient approach. APIs such as pdforge offer extensive features like watermarking, encryption, and built-in scaling capabilities. They simplify the process of handling high volumes of PDF generation requests without the overhead of managing your own infrastructure.

Here’s an example of how you might integrate with a PDF API:

import requests

url = 'https://api.pdforge.com/v1/pdf/sync'
headers = {
    'Authorization': 'Bearer your-api-key',
    'Content-Type': 'application/json'
}
data = {
    'templateId': 'your-template',
    'data': {
        'html': 'your-html'
    }
}

# Send the payload as JSON and fail fast on an HTTP error,
# so an error body is never saved as output.pdf
response = requests.post(url, headers=headers, json=data, timeout=60)
response.raise_for_status()
with open('output.pdf', 'wb') as f:
    f.write(response.content)

Using a PDF API like this can help offload the heavy lifting involved in creating and managing large-scale PDF generation.

Conclusion

Playwright and Python provide an effective solution for generating PDFs from HTML, especially when you need precise control over rendering and layout. However, for applications that require large-scale PDF generation or more advanced features, a third-party service like pdforge may be the best option, offering both scalability and additional functionality.
