pdf guide

Javascript

How to Scale HTML to PDF with AWS Lambda and Playwright

Marcelo Abreu, founder of pdforge

Marcelo | Founder

Nov 27, 2024

Introduction to Serverless HTML to PDF Conversion

Looking for a step-by-step guide to deploying HTML to PDF conversion on a serverless architecture using AWS Lambda and Playwright? You've come to the right place!

While there are numerous guides on using HTML to PDF libraries, practical instructions on scaling this process are scarce. This article will help you implement a scalable solution for generating PDFs in your SaaS application.

The Need for Scalable PDF Generation in SaaS Applications

SaaS applications often require dynamic PDF generation for reports, invoices, and documentation. Traditional server-based methods can be resource-intensive and challenging to scale. Leveraging a serverless architecture allows you to generate PDFs efficiently while automatically handling scaling during peak usage.

Playwright Overview

Playwright is an open-source Node.js library that automates browser interactions. It supports Chromium, Firefox, and WebKit, making it ideal for rendering web pages and generating PDFs. Its ability to run headless browser instances makes it suitable for serverless environments like AWS Lambda.

We have several guides on how to use Playwright for PDF generation, so this article focuses mainly on the serverless architecture. You can check out the full guides here:

How to scale html to pdf generation with playwright and aws lambda

Setting Up Playwright and AWS Lambda for Serverless PDF Generation

Integrating Playwright with AWS Lambda enables scalable and cost-effective PDF generation without managing server infrastructure.

Implementing the HTML to PDF Serverless Function

First, set up a Node.js project and install Playwright:

npm init -y
npm install playwright

Create a script that navigates to a URL and generates a PDF:

const { chromium } = require('playwright');

exports.handler = async (event) => {
  const browser = await chromium.launch({
    args: ['--no-sandbox', '--disable-setuid-sandbox'],
  });
  const page = await browser.newPage();
  await page.goto('https://example.com', { waitUntil: 'networkidle' });
  const pdfBuffer = await page.pdf({ format: 'A4' });
  await browser.close();
  return {
    statusCode: 200,
    headers: { 'Content-Type': 'application/pdf' },
    body: pdfBuffer.toString('base64'),
    isBase64Encoded: true,
  };
};
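One practical constraint on the response above: synchronous Lambda invocations cap the response payload at 6 MB, and base64 encoding inflates the PDF by roughly a third. The sketch below moves the response construction into a hypothetical helper, `toPdfResponse` (not part of Playwright or any AWS library), that returns an error once the encoded body would exceed that limit. The 413 status code and the error message are illustrative choices, and the size check is an approximation because headers and the JSON envelope also count toward the limit.

```javascript
// Hypothetical helper: not part of Playwright or any AWS library.
// Synchronous Lambda invocations limit the response payload to 6 MB.
const MAX_RESPONSE_BYTES = 6 * 1024 * 1024;

function toPdfResponse(pdfBuffer, maxBytes = MAX_RESPONSE_BYTES) {
  // Base64 turns every 3 bytes into 4 characters (including padding).
  const encodedLength = Math.ceil(pdfBuffer.length / 3) * 4;

  if (encodedLength > maxBytes) {
    // Illustrative error shape; pick whatever your API contract expects.
    return {
      statusCode: 413,
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ message: 'Generated PDF is too large to return inline' }),
      isBase64Encoded: false,
    };
  }

  return {
    statusCode: 200,
    headers: { 'Content-Type': 'application/pdf' },
    body: pdfBuffer.toString('base64'),
    isBase64Encoded: true,
  };
}
```

In the handler, `return toPdfResponse(pdfBuffer);` would replace the inline response object.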

Configuring AWS Lambda

AWS Lambda doesn’t include the necessary Chromium binaries by default. You’ll need to include them in a Lambda Layer.

Uploading Chromium to an AWS Lambda Layer

Download a compatible version of Chromium for AWS Lambda. You can find precompiled binaries in repositories like alixaxel/chrome-aws-lambda or build your own.

1. Download Chromium Binary:

Download the headless Chromium binary compatible with AWS Lambda’s Amazon Linux environment.

2. Create a ZIP Archive:

Package the Chromium binary and the shared libraries it depends on into a ZIP file.

3. Create a Lambda Layer:

In the AWS Lambda console, navigate to “Layers” and create a new layer. Upload the ZIP archive you created.

4. Add the Layer to Your Function:

In your Lambda function’s configuration, add the newly created layer.
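Lambda extracts layer contents under /opt, so the function has to tell Playwright where the layer's Chromium binary lives. Here is a minimal sketch, assuming the ZIP places the executable at `/opt/chromium/chrome` (a hypothetical path that depends on how you structured the archive) and that an optional `CHROMIUM_PATH` environment variable, a name invented for this example, overrides it. `executablePath` is a standard option of Playwright's `chromium.launch`.

```javascript
// Hypothetical default: adjust to wherever your layer ZIP places the binary.
const DEFAULT_CHROMIUM_PATH = '/opt/chromium/chrome';

// Builds the options object passed to chromium.launch().
function buildLaunchOptions(env = process.env) {
  return {
    headless: true,
    // CHROMIUM_PATH is an assumed variable name, not something Lambda sets for you.
    executablePath: env.CHROMIUM_PATH || DEFAULT_CHROMIUM_PATH,
    args: ['--no-sandbox', '--disable-setuid-sandbox'],
  };
}

// Inside the handler:
//   const { chromium } = require('playwright');
//   const browser = await chromium.launch(buildLaunchOptions());
```

Keeping the path in one place makes it easier to switch between the layer and the Docker approach described next.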

Configuring Lambda with Docker (Recommended)

Alternatively, you can package your Lambda function and Chromium dependencies using Docker.

Dockerfile Example:

# Node.js 20 (Amazon Linux 2023, "AL2023")
FROM public.ecr.aws/lambda/nodejs:20.2024.04.24.10-x86_64

# Install Dependencies for AL2023 to run Playwright
RUN dnf -y install \
    nss \
    dbus \
    atk \
    cups \
    at-spi2-atk \
    libdrm \
    libXcomposite \
    libXdamage \
    libXfixes \
    libXrandr \
    mesa-libgbm \
    pango \
    alsa-lib \
    lsof

# Copy package*.json to the Lambda task Root directory
COPY package*.json ${LAMBDA_TASK_ROOT}/

# Copy all files in ./lambda_function to the Lambda task Root directory
# NOTE: The .dockerignore file is explicitly ignoring the `node_modules` directory. 
COPY . ${LAMBDA_TASK_ROOT}

# Set Correct File Permissions Before Building the Docker Image
RUN chmod 755 -R ${LAMBDA_TASK_ROOT}

# Run npm install to install all the dependencies
RUN npm install

# Set the path where Playwright should install Chromium
ENV PLAYWRIGHT_BROWSERS_PATH=/var/task

# Install Playwright and specific browser binaries
RUN npx playwright install chromium

# Set the CMD to your handler (could also be done as a parameter override outside of the Dockerfile)
CMD ["app.handler"]

If you want to learn more about building Docker images and how to optimize them, we recommend this guide.

Full Lambda Function

Here’s the complete Lambda function code:

const { chromium } = require('playwright');

exports.handler = async (event) => {
  // declared outside try so the finally block can reach them
  let browser;
  let ctx;

  try {
    const browser_args = {
      devtools: false,
      headless: true,
      args: [
        "--disable-dev-shm-usage",
        "--disable-setuid-sandbox",
        "--disk-cache-size=33554432",
        "--no-sandbox",
        "--single-process",
        "--disable-gpu",
      ],
    };

    const ctx_args = { ignoreHTTPSErrors: true };

    // launching browser
    browser = await chromium.launch(browser_args);

    // creating new context
    ctx = await browser.newContext(ctx_args);

    // creating new tab
    const page = await ctx.newPage();

    // injecting html
    await page.setContent(event.htmlContent, { waitUntil: "networkidle" });

    // generating pdf
    const pdfBuffer = await page.pdf();

    return {
      statusCode: 200,
      headers: { 'Content-Type': 'application/pdf' },
      body: pdfBuffer.toString('base64'),
      isBase64Encoded: true,
    };
  } catch (error) {
    // logging the failure so it shows up in CloudWatch
    console.error(error);

    // the error body is plain JSON, not a base64-encoded PDF
    return {
      statusCode: 500,
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ message: "Internal server error" }),
      isBase64Encoded: false,
    };
  } finally {
    // closing context, if it was created
    if (ctx) await ctx.close();

    // closing browser, if it was launched
    if (browser) await browser.close();
  }
};
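The handler reads `event.htmlContent` directly, which matches a direct invocation. If the function sits behind API Gateway or a Lambda function URL, the payload instead arrives as a JSON string in `event.body`, optionally base64-encoded. Below is a minimal sketch of a hypothetical helper, `extractHtml`, that accepts both shapes. The `htmlContent` field name follows the handler above, and returning `null` for unusable input is an illustrative choice.

```javascript
// Hypothetical helper: returns the HTML string from either event shape, or null.
function extractHtml(event) {
  if (!event) return null;

  // Direct invocation: { htmlContent: "<html>...</html>" }
  if (typeof event.htmlContent === 'string') return event.htmlContent;

  // Proxy-style event: body is a JSON string, possibly base64-encoded.
  if (typeof event.body !== 'string') return null;
  const raw = event.isBase64Encoded
    ? Buffer.from(event.body, 'base64').toString('utf8')
    : event.body;

  try {
    const parsed = JSON.parse(raw);
    return parsed && typeof parsed.htmlContent === 'string' ? parsed.htmlContent : null;
  } catch (err) {
    return null; // body was not valid JSON
  }
}
```

The handler could then start with `const html = extractHtml(event);` and return a 400 response when the result is `null`, before launching the browser at all.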

Uploading the Docker Image to AWS

To deploy the Docker image to AWS:

1. Build the Docker Image for AWS Lambda:

Note: If you're on an Apple silicon Mac (M1 or later), use docker buildx to build for the linux/amd64 platform.

docker build -t your-image-name .

2. Tagging Your Docker Image:

After building the image, tag it with your Amazon ECR repository URI to prepare it for pushing.

Create an ECR Repository (if you haven’t already):

aws ecr create-repository --repository-name your-repo-name

Authenticate Docker to Your ECR Registry:

aws ecr get-login-password --region your-region | docker login --username AWS --password-stdin your-account-id.dkr.ecr.your-region.amazonaws.com

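Tag Your Docker Image:

docker tag your-image-name:latest your-account-id.dkr.ecr.your-region.amazonaws.com/your-repo-name:latest

Replace your-image-name, your-account-id, your-region, and your-repo-name with your specific AWS account details and desired repository name.

3. Push the Image to ECR:

docker push your-account-id.dkr.ecr.your-region.amazonaws.com/your-repo-name:latest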

4. Deploy the Lambda Function:

In AWS Lambda, create a new function using the container image you’ve pushed to ECR.

Advanced Topics

Handling Concurrency and Scaling with AWS Lambda

By default, AWS Lambda allows 1,000 concurrent executions per account in each region. To increase this limit:

Request a Quota Increase:

Go to the AWS Service Quotas console and request an increase for “Concurrent executions” for Lambda.
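To judge whether the default limit is enough before requesting an increase, a rough estimate is: concurrent executions ≈ requests per second × average duration in seconds. Here is a small sketch of that arithmetic. The function name and the sample numbers are illustrative, not measurements.

```javascript
// Rough estimate: concurrent executions = requests/second * average duration (seconds).
function estimateConcurrency(requestsPerSecond, avgDurationMs) {
  return Math.ceil(requestsPerSecond * (avgDurationMs / 1000));
}

// Example: 50 PDF requests per second, each taking about 4 seconds to render.
const needed = estimateConcurrency(50, 4000);
console.log(needed); // 200
```

Because Chromium rendering often takes seconds rather than milliseconds, PDF workloads tend to reach the concurrency limit at a much lower request rate than typical API functions.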

Common Issues with Playwright and AWS Lambda

Running Playwright in a serverless environment can present challenges.

Chrome Being Memory-Intensive

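Chromium can consume significant memory and temporary disk space. To mitigate issues:

Clean the /tmp Directory:

AWS Lambda provides limited ephemeral storage in /tmp (512 MB by default, configurable up to 10,240 MB). Warm invocations reuse the same execution environment, so leftover temporary files accumulate unless you clean them up.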

const { exec } = require('child_process');

// cleaning files
exec("rm -rf /tmp/*", (_error, stdout) =>
  console.log(`Clearing /tmp directory: ${stdout}`)
);
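The same cleanup can be done without spawning a shell. Below is a sketch using Node's built-in `fs` module (`fs.rmSync` requires Node.js 14.14 or later). `clearDirectory` is a hypothetical helper name, and removing everything under the directory assumes nothing else in the execution environment still needs those files.

```javascript
const fs = require('fs');
const path = require('path');

// Hypothetical helper: deletes every entry inside `dir` but keeps `dir` itself.
// Returns the number of top-level entries removed.
function clearDirectory(dir) {
  const entries = fs.readdirSync(dir);
  for (const entry of entries) {
    // recursive handles subdirectories; force ignores entries that already vanished
    fs.rmSync(path.join(dir, entry), { recursive: true, force: true });
  }
  return entries.length;
}

// In the Lambda handler, after the browser has been closed:
//   clearDirectory('/tmp');
```

Running the cleanup only after the browser has closed avoids deleting profile or cache files that Chromium is still using.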

M1 vs. Intel Processors

If you’re developing on an M1 MacBook, you may encounter compatibility issues due to architecture differences.

Use Docker Buildx for Cross-Platform Builds:

docker buildx build --platform linux/amd64 -t your-image-name .

This command builds the image for the linux/amd64 platform (under emulation on Apple silicon), ensuring compatibility with the x86_64 Lambda base image used in the Dockerfile above.

Conclusion

Implementing HTML to PDF conversion in a serverless architecture with Playwright and AWS Lambda offers scalability and cost savings.

While setting up this infrastructure requires effort, it eliminates the need to manage servers.

Alternatively, third-party solutions like pdforge can handle PDF generation without the overhead of maintaining your own architecture.
