AWS Feed
Field Notes: Scaling Browser Automation with Puppeteer on AWS Lambda with Container Image Support

This post is contributed by Bill Kerr, SHI and Raj Seshadri, Global SA Lead, AWS.

Imagine you are launching a brand-new website selling goods and services. You are expecting a huge amount of traffic due to the seasonality of the product. You would like to test 100K simultaneous connections to the website and make sure it is working properly. How would you go about doing that? Try a headless browser automation tool like Puppeteer. Puppeteer can now be packaged as a container image in a Lambda function to perform browser automation or web scraping tasks.

Puppeteer is a Node.js library that allows you to automate tasks in headless Chrome. When you use Puppeteer in a Lambda function with container image support, you can scale browser automation horizontally. With container images, Node.js packages can be installed directly in the container instead of having to be packaged in Lambda layers. This blog post shows how to run Puppeteer and Chrome in a Lambda container function. In this example, multiple instances of Puppeteer simultaneously take screenshots of several popular news websites and store them in Amazon S3.
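
The handler code for this example lives in bin/app.ts in the example repository referenced below. As a rough illustration only, here is a minimal sketch of what such a screenshot handler could look like in TypeScript. It assumes the puppeteer-core and aws-sdk packages are installed in the image, that the event carries a URL and a bucket name, and that Chrome is available at the path where the rpm installed in the Dockerfile places it; the actual handler in the repository may differ.

import { S3 } from 'aws-sdk';
import puppeteer from 'puppeteer-core';

const s3 = new S3();

// Hypothetical event shape: the fan-out function passes a URL and a bucket name.
interface ScreenshotEvent {
  url: string;
  bucket: string;
}

export const lambdaHandler = async (event: ScreenshotEvent): Promise<string> => {
  // Chrome is installed by the rpm in the Dockerfile; this path is an assumption.
  const browser = await puppeteer.launch({
    executablePath: '/opt/google/chrome/chrome',
    args: ['--no-sandbox', '--disable-dev-shm-usage', '--single-process'],
  });

  try {
    const page = await browser.newPage();
    await page.goto(event.url, { waitUntil: 'networkidle2' });
    const screenshot = (await page.screenshot({ type: 'png' })) as Buffer;

    // Store the screenshot in the S3 bucket, keyed by the site's hostname.
    const key = `screenshots/${new URL(event.url).hostname}.png`;
    await s3
      .putObject({
        Bucket: event.bucket,
        Key: key,
        Body: screenshot,
        ContentType: 'image/png',
      })
      .promise();

    return key;
  } finally {
    await browser.close();
  }
};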

Solution Overview

[Architecture diagram: Puppeteer on Lambda with container image support]

The overall solution architecture is shown in the preceding diagram. Two Lambda functions are used in this example.

  1. A Puppeteer function that requires a URL and bucket name as inputs. This uses Puppeteer to take a screenshot of the URL in headless Chrome and save the image in the S3 bucket.
  2. A fan-out function that requires a list of URLs as input and asynchronously invokes the Puppeteer function for each URL in the list (see the sketch after this list).
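
The fan-out handler itself is not reproduced from the repository here; the following is only a rough sketch of the pattern it implements, assuming the aws-sdk package and two illustrative environment variable names (PUPPETEER_FUNCTION_NAME and BUCKET_NAME) that the real stack may define differently.

import { Lambda } from 'aws-sdk';

const lambda = new Lambda();

// These environment variable names are assumptions for this sketch.
const PUPPETEER_FUNCTION_NAME = process.env.PUPPETEER_FUNCTION_NAME!;
const BUCKET_NAME = process.env.BUCKET_NAME!;

// The event is a JSON array of URLs, as shown in the test event later in this post.
export const lambdaHandler = async (urls: string[]): Promise<void> => {
  await Promise.all(
    urls.map((url) =>
      lambda
        .invoke({
          FunctionName: PUPPETEER_FUNCTION_NAME,
          // 'Event' invokes the Puppeteer function asynchronously.
          InvocationType: 'Event',
          Payload: JSON.stringify({ url, bucket: BUCKET_NAME }),
        })
        .promise()
    )
  );
};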

Lambda container Dockerfile for Puppeteer function

Here is a documented version of the Dockerfile that is used to create a container for use with Lambda.

# Start with an AWS provided image that is ready to use with Lambda
FROM amazon/aws-lambda-nodejs:12

# Allow AWS credentials to be supplied when building this container locally for testing,
# so S3 can be accessed
ARG AWS_ACCESS_KEY_ID
ARG AWS_SECRET_ACCESS_KEY
ARG AWS_REGION=us-east-1

# Install Chrome to get all of the dependencies installed
ADD https://dl.google.com/linux/direct/google-chrome-stable_current_x86_64.rpm chrome.rpm
RUN yum install -y ./chrome.rpm

ENV AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY=$AWS_SECRET_ACCESS_KEY AWS_REGION=$AWS_REGION

# Copy files into the container
COPY jest.config.js package*.json tsconfig.json ${LAMBDA_TASK_ROOT}/
COPY bin/app.ts ${LAMBDA_TASK_ROOT}/bin/
COPY test/app.test.ts ${LAMBDA_TASK_ROOT}/test/

# Install, build and test the Puppeteer app and Chrome
RUN npm install
RUN npm run build
RUN npm test

# Lambda handler path
CMD [ "bin/app.lambdaHandler" ]

Deploy the cloud infrastructure

Prerequisite

AWS CDK must be installed. Review the CDK installation instructions.

Download and install the dependencies and example CDK application

In a terminal, check out the code used in this article and install it.

# Clone the example CDK application code
git clone https://github.com/shi/crpm-lambda-container-puppeteer

# Change to the infrastructure directory containing CDK and CRPM
cd crpm-lambda-container-puppeteer/infra

# Install the CDK application
npm install

# Deploy the CDK Toolkit CloudFormation stack
cdk bootstrap aws://unknown-account/unknown-region

# Deploy the Puppeteer example CloudFormation stack
cdk deploy puppeteer

Puppeteer Usage

The next steps are performed in the AWS Console.

1. In the AWS Console, open the Lambda function that was created by the CDK deployment above.

    • Look for InvokeLambdaFunctionName in the Outputs section to get the name of the function to open.
    • You can also find the function name in the Resources tab of the CloudFormation stack in the AWS Console.

2. In the function, click on the Test tab.

3. Create a new test event with JSON like the following. Feel free to change the URLs to whatever you want.

["https://news.yahoo.com/","https://news.google.com/","https://www.huffpost.com/","https://www.cnn.com/","https://www.nytimes.com/","https://www.foxnews.com/","https://www.nbcnews.com/","https://www.washingtonpost.com/","https://www.wsj.com/","https://abcnews.go.com/","https://www.usatoday.com/"]

4. Click on the Invoke button to invoke the fan-out function. (For a programmatic alternative using the AWS SDK, see the sketch after this list.)

5. Open the S3 bucket that was created by the CDK deployment.

6. To find the bucket name, look for puppeteer.BucketName in the Outputs section.

7. Within a minute of running the fan-out function, you should start to see images in the screenshots folder in the bucket. They will slowly trickle in as you refresh the folder until all of the screenshots are done.

8. If any screenshots are missing, you can view CloudWatch Logs for the Puppeteer function.

9. Search the log streams for "error" to determine how to implement improved error handling in the code.

10. You could modify the app to perform functional testing of a website, and save screenshots in S3 whenever errors occur.
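
As an alternative to the console's Test tab, the fan-out function can also be invoked programmatically. The following sketch uses the aws-sdk package; the function name is a placeholder that you would replace with the InvokeLambdaFunctionName value from the stack Outputs, and the URL list is shortened for brevity.

import { Lambda } from 'aws-sdk';

const lambda = new Lambda({ region: 'us-east-1' });

const urls = [
  'https://news.yahoo.com/',
  'https://news.google.com/',
  'https://www.cnn.com/',
];

async function invokeFanOut(): Promise<void> {
  const response = await lambda
    .invoke({
      // Placeholder: substitute the name from the InvokeLambdaFunctionName output.
      FunctionName: 'REPLACE_WITH_FAN_OUT_FUNCTION_NAME',
      InvocationType: 'RequestResponse',
      Payload: JSON.stringify(urls),
    })
    .promise();

  console.log('Status code:', response.StatusCode);
}

invokeFanOut().catch(console.error);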

Clean up

In the AWS Console, manually empty the bucket that was created by the CDK deployment. Look for puppeteer.BucketName in the Outputs section. You can also find the bucket name in the Resources tab of the CloudFormation stack. Then, after the bucket has been emptied, run the following commands.

# Destroy the Puppeteer example CloudFormation stack
cdk destroy puppeteer

# Delete the CDK Toolkit CloudFormation stack
aws cloudformation delete-stack --stack-name CDKToolkit

Conclusion

In this post, we showed you how to use Lambda functions packaged as container images to perform browser automation and web scraping. The possibilities for such applications are limitless when using Lambda with container image support.

For more serverless learning resources, visit the Serverlessland website.

Field Notes provides hands-on technical guidance from AWS Solutions Architects, consultants, and technical account managers, based on their experiences in the field solving real-world business problems for customers.
Bill Kerr

Bill Kerr is a senior developer at Stratascale who has worked at startup and Fortune 500 companies. He’s the creator of CRPM and he’s a super fan of CDK and cloud infrastructure automation.