How to Run Puppeteer on AWS Lambda

Mar 28, 2025 - 14:52

Running Puppeteer on AWS Lambda can be challenging due to the serverless environment's limitations and Chrome's resource requirements. However, with the right setup and optimizations, it's possible to create a reliable web scraping solution that scales automatically. In this guide, we'll explore how to set up Puppeteer on AWS Lambda and provide a working boilerplate solution.

Why Run Puppeteer on AWS Lambda?

Running Puppeteer on AWS Lambda offers several advantages:

  • Serverless Architecture: No need to manage servers or worry about uptime
  • Cost-Effective: Pay only for the compute time you use
  • Auto-Scaling: Automatically handle varying workloads
  • Easy Integration: Works well with other AWS services

However, there are some challenges to consider:

  • Lambda's execution time limit (at most 15 minutes per invocation)
  • Memory constraints (at most 10,240 MB, roughly 10 GB)
  • Cold starts affecting performance
  • Chrome binary compatibility issues

Setting Up Puppeteer on AWS Lambda

I've created a boilerplate repository that handles these challenges and provides a working solution. Let's go through the setup process:

Prerequisites

  1. Node.js 18.x (recommended)
  2. AWS Account with Lambda and S3 access
  3. AWS CLI configured for local deployment

Local Development Setup

First, clone the repository and set up your local environment:

# Install Node.js 18
nvm install 18
nvm use 18

# Install dependencies
npm install

# Create environment file
echo "SECRET=your-secret-key-here" > .env

# Run locally
node index.js

AWS Configuration

  1. Create an S3 bucket for your Lambda deployment package
  2. Create a Lambda function with these recommended settings:
    • Runtime: Node.js 18.x
    • Memory: 1024 MB
    • Timeout: 30 seconds
    • Architecture: x86_64

Deployment Options

Manual Deployment

# Create deployment package
zip -r lambda.zip index.js node_modules

# Upload to S3
aws s3 cp lambda.zip s3://your-bucket-name/lambda.zip

Then update your Lambda function through the AWS Console:

  1. Go to the AWS Lambda console
  2. Select your function
  3. Open the Code tab
  4. Click "Upload from" -> "Amazon S3 location"
  5. Paste the S3 URL of your uploaded zip file
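
The console steps above can also be scripted. The following is a minimal sketch, not part of the boilerplate: it assumes the AWS SDK for JavaScript v3 package `@aws-sdk/client-lambda` is installed locally and that your AWS credentials and region are already configured. The helper names `buildUpdateParams` and `deploy` are hypothetical.

```javascript
// deploy.js: hypothetical helper script, not part of the boilerplate.

// Build the input for Lambda's UpdateFunctionCode operation from an S3 location.
function buildUpdateParams(functionName, bucket, key) {
  if (!functionName || !bucket || !key) {
    throw new Error("functionName, bucket, and key are all required");
  }
  return { FunctionName: functionName, S3Bucket: bucket, S3Key: key };
}

// Point the function at a zip that was already uploaded with `aws s3 cp`.
async function deploy(functionName, bucket, key) {
  // Required lazily so buildUpdateParams can be used without the SDK installed.
  const { LambdaClient, UpdateFunctionCodeCommand } = require("@aws-sdk/client-lambda");
  const client = new LambdaClient({}); // region and credentials come from your environment
  const result = await client.send(
    new UpdateFunctionCodeCommand(buildUpdateParams(functionName, bucket, key))
  );
  return result.LastModified;
}
```

Calling `deploy("your-function-name", "your-bucket-name", "lambda.zip")` after the upload step should have the same effect as the console steps, assuming the caller's IAM identity is allowed to update the function and read the object.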

Automated Deployment with GitHub Actions

The boilerplate includes a GitHub Actions workflow for automated deployment. To set it up:

  1. Add these secrets to your GitHub repository:

    • AWS_ACCESS_KEY_ID
    • AWS_SECRET_ACCESS_KEY
  2. Update the workflow file (.github/workflows/main.yml) with your values:

    • Replace {{your-bucket-name}} with your S3 bucket name
    • Replace {{your-function-name}} with your Lambda function name
  3. Push to the main branch to trigger deployment

Using the Lambda Function

The function accepts POST requests with this structure:

{
  "url": "https://example.com"
}

Required headers:

secret: your-secret-key
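
The boilerplate's handler is not reproduced here, so the following is only a sketch of how a request like the one above might be validated. It assumes a proxy-style event (as delivered by API Gateway or a Lambda function URL) with a `headers` object and a JSON string `body`, and that the expected secret comes from the `SECRET` environment variable. The function names are hypothetical.

```javascript
const nodeCrypto = require("node:crypto");

// Compare two secrets in constant time. Hashing first gives both buffers the
// same length, which crypto.timingSafeEqual requires.
function secretsMatch(provided, expected) {
  if (typeof provided !== "string" || typeof expected !== "string") return false;
  const a = nodeCrypto.createHash("sha256").update(provided).digest();
  const b = nodeCrypto.createHash("sha256").update(expected).digest();
  return nodeCrypto.timingSafeEqual(a, b);
}

// Build a proxy-style error response with a JSON body.
function errorResponse(statusCode, message) {
  return { statusCode, body: JSON.stringify({ error: message }) };
}

// Validate the event. Returns { url } on success or { error } with a response to send back.
function parseRequest(event, expectedSecret) {
  const headers = event.headers || {};
  if (!expectedSecret || !secretsMatch(headers.secret, expectedSecret)) {
    return { error: errorResponse(401, "Unauthorized") };
  }
  let payload;
  try {
    payload = JSON.parse(event.body || "{}");
  } catch {
    return { error: errorResponse(400, "Body must be valid JSON") };
  }
  let target;
  try {
    target = new URL(payload.url);
  } catch {
    return { error: errorResponse(400, "A valid url field is required") };
  }
  if (target.protocol !== "http:" && target.protocol !== "https:") {
    return { error: errorResponse(400, "Only http and https URLs are supported") };
  }
  return { url: target.href };
}
```

A handler could call `parseRequest(event, process.env.SECRET)` first and return the `error` response immediately when it is set, so that Chromium is only started for authenticated, well-formed requests.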

Key Features of the Boilerplate

  1. Stealth Mode: Uses puppeteer-extra-plugin-stealth to avoid detection
  2. AWS Compatibility: Uses @sparticuz/chromium for Lambda compatibility
  3. Security: Secret key authentication
  4. Automated Deployment: GitHub Actions workflow included
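
To make the first two features concrete, here is a rough sketch of how `puppeteer-extra`, the stealth plugin, and `@sparticuz/chromium` are commonly combined. It is not taken from the boilerplate, whose exact wiring may differ. The function names are hypothetical, and the `networkidle2` wait condition is an assumption you may want to change.

```javascript
// Build Puppeteer launch options from the @sparticuz/chromium module.
// The module is passed in so the options can be inspected without starting a browser.
async function buildLaunchOptions(chromium) {
  return {
    args: chromium.args, // flags suited to the Lambda environment
    defaultViewport: chromium.defaultViewport,
    executablePath: await chromium.executablePath(), // path of the unpacked Chromium binary
    headless: chromium.headless,
  };
}

// Launch a stealth-enabled browser. Dependencies are required lazily so this
// file can be loaded even where they are not installed.
async function launchBrowser() {
  const chromium = require("@sparticuz/chromium");
  const { addExtra } = require("puppeteer-extra");
  const StealthPlugin = require("puppeteer-extra-plugin-stealth");

  // Wrap puppeteer-core, which ships without a bundled browser download.
  const puppeteer = addExtra(require("puppeteer-core"));
  puppeteer.use(StealthPlugin());
  return puppeteer.launch(await buildLaunchOptions(chromium));
}

// Hypothetical example task: load a page and return its title.
async function getTitle(url) {
  const browser = await launchBrowser();
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: "networkidle2" });
    return await page.title();
  } finally {
    await browser.close(); // always release Chromium, even if navigation fails
  }
}
```

Passing the Chromium module into `buildLaunchOptions` is a design choice made for this sketch: it keeps the option-building logic easy to check locally with a stand-in object.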

Dependencies

The boilerplate uses these key dependencies:

  • @sparticuz/chromium: ^123.0.1
  • puppeteer-extra: ^3.3.4
  • puppeteer-core: 19.6
  • puppeteer-extra-plugin-stealth: ^2.11.1
  • puppeteer: ^21.5.0
  • dotenv: ^16.4.5

Alternative Solution: CaptureKit

While running Puppeteer on AWS Lambda is powerful, it requires significant maintenance and handling of edge cases. If you're looking for a managed solution that handles all the infrastructure and maintenance, consider using CaptureKit. It provides three powerful APIs in one platform:

Screenshot API

  • Reliable screenshot capture with no infrastructure management
  • Full-page screenshots with lazy loading support
  • Built-in ad and cookie banner blocking
  • Multiple output formats (PNG, WebP, JPEG, PDF)
  • Direct S3 upload integration

Content Extraction API

  • Clean, structured HTML extraction
  • Metadata parsing (title, description, OpenGraph & Schema data)
  • Link scraping (internal and external)
  • Consistent data without maintenance headaches
  • Perfect for data pipelines and web scraping

AI Analysis API

  • Instant webpage summarization
  • Key insights extraction
  • AI-powered content analysis
  • Scale your web research process
  • Focus on creating, not extracting content

All CaptureKit APIs are:

  • Developer-first with instant access
  • No credit card required for free tier
  • Lightning-fast support
  • Built for production use cases

Best Practices and Tips

  1. Memory Management

    • Monitor Lambda memory usage
    • Adjust memory allocation based on your needs
    • Clean up resources properly
  2. Performance Optimization

    • Use Lambda layers for dependencies
    • Implement connection pooling
    • Cache frequently accessed data
  3. Error Handling

    • Implement proper error logging
    • Set up CloudWatch alarms
    • Handle timeouts gracefully
  4. Security

    • Never commit AWS credentials
    • Use environment variables for secrets
    • Implement proper IAM roles
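
Two of the tips above, cleaning up resources and handling timeouts gracefully, can be sketched in a few lines. This is an illustration under stated assumptions rather than the boilerplate's code: `withBrowser` and `timeBudget` are hypothetical helpers, `launch` stands for whatever function starts Chromium in your setup, and the 2-second safety margin is an arbitrary default.

```javascript
// Run work(browser) with a deadline and always close the browser afterwards,
// whether the work succeeds, throws, or times out.
async function withBrowser(launch, work, timeoutMs) {
  const browser = await launch();
  let timer;
  const deadline = new Promise((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`Timed out after ${timeoutMs} ms`)),
      timeoutMs
    );
  });
  try {
    return await Promise.race([work(browser), deadline]);
  } finally {
    clearTimeout(timer);
    // Ignore close errors so they do not mask the original failure.
    await browser.close().catch(() => {});
  }
}

// Leave a margin before Lambda's hard timeout so the handler can still respond.
// context.getRemainingTimeInMillis() is provided by the Lambda Node.js runtime.
function timeBudget(context, marginMs = 2000) {
  return Math.max(0, context.getRemainingTimeInMillis() - marginMs);
}
```

Inside a handler, `withBrowser(launch, (browser) => doScrape(browser), timeBudget(context))` would give the scrape as much time as the invocation has left, minus the margin, where `doScrape` is a placeholder for your own page logic.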

Conclusion

Running Puppeteer on AWS Lambda is a powerful solution for serverless web scraping, but it requires careful setup and maintenance. The provided boilerplate handles many common challenges and provides a solid foundation for your projects.

For those who want to focus on their core business logic without managing infrastructure, CaptureKit offers a comprehensive solution that handles all the complexities of web scraping and content extraction.

Choose the approach that best fits your needs:

  • Use the Puppeteer Lambda boilerplate if you need full control and customization
  • Use CaptureKit if you want a managed solution with additional features