Building a pool of containerized web scrapers.

Felipe Lujan
4 min read · Dec 4, 2020


In this tutorial, I’ll show you how to build a pool of containerized Puppeteer applications that, once deployed to Cloud Run (Google’s much cheaper equivalent to Fargate), will let you scrape hundreds of thousands of pages a day while generating no cost when the service automatically scales down to 0 instances.

Photography: PA on theguardian.com

In this tutorial, you’ll learn how to set up a web scraper in Cloud Run.

Let’s start with the basic setup of a NodeJS application.

With Node.js installed (download it here if you don’t have it), create a new folder for this project and execute the following command inside it.

npm init -y

Now install express, puppeteer and execution-time, with the following command:

npm i express puppeteer execution-time

Create a file called app.js and paste the following code

app.js

This code fires up an Express server on port 8080 and exposes a GET endpoint on the route “/”, but don’t hit it just yet. Let’s talk about web scraping.

Set up a web scraping workflow with puppeteer

Puppeteer is a Chromium automation tool that lets you launch Chromium browsers, open pages (think of them as tabs), and visit web pages.

Our application will fetch a piece of text from placeholder.com, so the workflow becomes the following.

  1. Open a browser.
  2. Open a page.
  3. Visit placeholder.com.
  4. Scrape the text in the h1 element.

Sounds like a fairly common browsing session, doesn’t it?

How to reuse a browser instance in Puppeteer

When demand increases and we need to scrape data from thousands of webpages, efficiency will be vital. For that, we’ll reuse the browser created in step 1 and repeat steps 2 through 4 many times. Who opens a new Chrome browser for every page, right?

To implement a reusable Puppeteer browser, paste the following code in app.js, below // BROWSER SECTION

This function launches the Chromium instance managed by Puppeteer and keeps a reference to it so it can be reused.

Now, let's define a function that connects to the newly created browser instance so we can finally visit some pages.

In app.js, paste this code below // PAGE SECTION

This is where the magic happens.

The tab function takes in two arguments: the address of the WebSocket where the Chromium browser is waiting for instructions, and a URL. The tab function will open a new page (or browserContext) and visit the provided URL.

Once placeholder.com loads, we can query the DOM by using document.querySelector(‘h1.entry-title’).innerText which returns the inner text of the h1 element. Then we log that to the console.

Placeholder.com — The Free Image Placeholder Service Favoured By Designers

I’ve also included some references to the execution-time library, that will help us measure performance.

Execute the following command to start the application and see what happens.

node app.js
console output

OK, the server started. Let's hit localhost:8080 a few times.

console output

Interesting. On the first call, puppeteer started the chromium instance and then visited placeholder.com. For subsequent requests, the previous browser instance was reused, reducing execution time significantly.

Deploy to Cloud Run

First, make sure that your Google Cloud project meets the following criteria.

  • Billing is enabled.
  • The Cloud Build API is enabled.
  • The Container Registry (GCR) API is enabled.
  • The Cloud Run API is enabled.

Once your project is ready, create a file called Dockerfile in the same folder as app.js.
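The Dockerfile contents aren’t shown in this copy. Below is a sketch of a typical Puppeteer-on-Cloud-Run setup; the base image and dependency list are assumptions, and your Puppeteer version may need a slightly different set of libraries.

```dockerfile
# Sketch — the original Dockerfile isn't reproduced, so this is an
# assumption of a typical Puppeteer-on-Cloud-Run setup.
FROM node:14-slim

# Shared libraries Chromium needs inside a slim image (a commonly
# used subset; adjust for your Puppeteer version).
RUN apt-get update && apt-get install -y --no-install-recommends \
    ca-certificates fonts-liberation libasound2 libatk-bridge2.0-0 \
    libatk1.0-0 libcups2 libdbus-1-3 libgbm1 libgtk-3-0 libnss3 \
    libx11-xcb1 libxcomposite1 libxdamage1 libxrandr2 \
 && rm -rf /var/lib/apt/lists/*

WORKDIR /app
COPY package*.json ./
RUN npm ci --only=production
COPY . .

# Cloud Run routes traffic to the container's port 8080 by default.
CMD ["node", "app.js"]
```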

With your Dockerfile ready, submit your containerized application to Cloud Build. Replace YOUR_PROJECT_NAME with the name of your GCP project.

gcloud builds submit --tag gcr.io/YOUR_PROJECT_NAME/puppeteer-lab

Once that’s complete, head over to the GCP console and search for Cloud Run. Once there, create a new service.

  1. Name your service puppeteer-lab or however you want. Click continue.
  2. Select the newly built container image.
Select the Docker image for your Cloud Run service

In the Advanced Settings menu, increase the Memory allocated to 512 MiB and under Maximum number of instances, type 1.

Click Continue

3. Select Allow unauthenticated invocations. Click Create.

Your Cloud Run deployment

You’ll be redirected to your Cloud Run service details page, which looks like this.

Cloud run service details page

Click on the service URL right next to the Region. This will start a Puppeteer browser and open a page that visits and scrapes placeholder.com.

See the logs

Logs after starting one Puppeteer browser: 3.5 seconds

Now, in the service details page, hold down the CONTROL key on your keyboard and click the service URL 5 times. Check out the logs once more.

Subsequent scraping requests run much faster

What’s next?

You can now edit the application to fit your needs. Pass query parameters to the Express endpoint or use a new POST endpoint and pass along the URL that you want to scrape in its body.

When you’re ready, tweak the number of maximum instances allowed and the amount of memory required for your workflow.

Written by Felipe Lujan

Google Developer Expert — Google Cloud.
