Building a pool of containerized web scrapers
In this tutorial, I’ll show you how to build a pool of containerized Puppeteer applications that, once deployed to Cloud Run (Google’s much cheaper equivalent to Fargate), will let you scrape hundreds of thousands of pages a day while generating no cost when it automatically scales down to 0 instances.

By the end, you’ll have a working web scraper running in Cloud Run.
Let’s start with the basic setup of a Node.js application.
With Node.js installed (you can download it from nodejs.org if you don’t have it), create a new folder for this project and execute the following command inside it.
npm init -y
Now install express, puppeteer, and execution-time with the following command:
npm i express puppeteer execution-time
Create a file called app.js for the application code.
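Here’s a minimal sketch of what app.js can look like at this stage. The getBrowserWSEndpoint and tab helper names are my own placeholders; we’ll fill in their definitions under the BROWSER SECTION and PAGE SECTION markers in the next steps.

const express = require('express');
const puppeteer = require('puppeteer');
const perf = require('execution-time')();

const app = express();

// BROWSER SECTION

// PAGE SECTION

app.get('/', async (req, res) => {
  perf.start();
  const wsEndpoint = await getBrowserWSEndpoint();
  const text = await tab(wsEndpoint, 'https://placeholder.com');
  const results = perf.stop();
  console.log(`Scraped in ${results.time} ms`);
  res.send(text);
});

app.listen(8080, () => console.log('Listening on port 8080'));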
This code fires up an Express server on port 8080 and exposes a GET endpoint on the route “/”, but don’t hit it just yet. Let’s talk about web scraping.
Set up a web scraping workflow with Puppeteer
Puppeteer is a Chromium automation tool that allows you to start Chromium browsers, open pages (think of them as tabs), and visit web pages.
Our application will fetch a piece of text from placeholder.com, so the workflow becomes the following.
- Open a browser.
- Open a page.
- Visit placeholder.com.
- Scrape the text in the h1 element.
Sounds like a fairly common browsing session, doesn’t it?
How to reuse a browser instance in Puppeteer
When demand increases and we need to scrape data from thousands of web pages, efficiency will be vital. For that, we’ll reuse the browser created in step 1 and repeat steps 2 through 4 many times. Who opens a new Chrome browser for every page, right?
To implement a reusable Puppeteer browser, paste the following code in app.js, below // BROWSER SECTION.
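A sketch of that section, assuming a launch-once-and-cache approach: the first request launches Chromium and stores its WebSocket endpoint, and every later request gets the cached endpoint back.

// BROWSER SECTION
// Launch Chromium once and cache its WebSocket endpoint for reuse.
let wsEndpoint = null;

const getBrowserWSEndpoint = async () => {
  if (wsEndpoint) return wsEndpoint;
  const browser = await puppeteer.launch({
    // Usually required when running as root inside a container.
    args: ['--no-sandbox', '--disable-setuid-sandbox'],
  });
  wsEndpoint = browser.wsEndpoint();
  return wsEndpoint;
};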
Now, let's define a function that connects to the newly created browser instance so we can finally visit some pages.
In app.js, paste this code below // PAGE SECTION.
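A sketch of the tab function, assuming the Puppeteer API of this era (createIncognitoBrowserContext; newer releases renamed it createBrowserContext):

// PAGE SECTION
// Connect to the running browser, open an isolated context and page,
// visit the URL, and scrape the text of the h1 element.
const tab = async (browserWSEndpoint, url) => {
  const browser = await puppeteer.connect({ browserWSEndpoint });
  const context = await browser.createIncognitoBrowserContext();
  const page = await context.newPage();
  await page.goto(url);
  const text = await page.evaluate(
    () => document.querySelector('h1.entry-title').innerText
  );
  console.log(text);
  await context.close(); // close the context, keep the browser alive
  return text;
};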
The tab function takes two arguments: the address of the WebSocket where the Chromium browser is waiting for instructions, and a URL. It opens a new page (in its own browserContext) and visits the provided URL.
Once placeholder.com loads, we can query the DOM with document.querySelector('h1.entry-title').innerText, which returns the inner text of the h1 element. Then we log that to the console:
Placeholder.com — The Free Image Placeholder Service Favoured By Designers
I’ve also included some references to the execution-time library, which will help us measure performance.
Execute the following command to start the application and see what happens.
node app.js

OK, the server started. Let's hit localhost:8080 a few times.

Interesting. On the first call, Puppeteer started the Chromium instance and then visited placeholder.com. For subsequent requests, the previous browser instance was reused, reducing execution time significantly.
Deploy to Cloud Run
First, make sure that your Google Cloud project meets the following criteria.
- Billing is enabled.
- The Cloud Build API is enabled.
- The Container Registry (GCR) API is enabled.
- The Cloud Run API is enabled.
Once your project is ready, create a file called Dockerfile in the same folder as app.js.
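A Dockerfile along these lines should work. The package list follows Puppeteer’s own troubleshooting guide for running its bundled Chromium on Debian-based images, so treat the exact set as a starting point rather than gospel.

FROM node:12-slim

# Shared libraries that Puppeteer's bundled Chromium needs on Debian.
RUN apt-get update && apt-get install -y \
    ca-certificates fonts-liberation libasound2 libatk-bridge2.0-0 \
    libatk1.0-0 libcairo2 libcups2 libdbus-1-3 libexpat1 libgbm1 \
    libglib2.0-0 libgtk-3-0 libnspr4 libnss3 libpango-1.0-0 libx11-6 \
    libx11-xcb1 libxcb1 libxcomposite1 libxdamage1 libxext6 libxfixes3 \
    libxrandr2 libxss1 libxtst6 wget xdg-utils \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app
COPY package*.json ./
RUN npm install
COPY . .

# Cloud Run routes requests to the port the app listens on (8080 here).
CMD ["node", "app.js"]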
With your Dockerfile ready, submit your containerized application to Cloud Build. Replace YOUR_PROJECT_NAME with the name of your GCP project.
gcloud builds submit --tag gcr.io/YOUR_PROJECT_NAME/puppeteer-lab
Once that’s complete, head over to the GCP console and search for Cloud Run. Once there, create a new service.

1. Name your service puppeteer-lab, or however you want. Click Continue.
2. Select the newly built container image.

In the Advanced Settings menu, increase the Memory allocated to 512 MiB and, under Maximum number of instances, type 1.
Click Continue.
3. Select Allow unauthenticated invocations. Click Create.
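If you’d rather skip the console, the same service can be created from the command line. The flags mirror the settings above; the region is an assumption, so pick your own.

gcloud run deploy puppeteer-lab \
  --image gcr.io/YOUR_PROJECT_NAME/puppeteer-lab \
  --platform managed \
  --region us-central1 \
  --memory 512Mi \
  --max-instances 1 \
  --allow-unauthenticated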
Your Cloud Run deployment
You’ll be redirected to your Cloud Run service details page, which looks like this.

Click on the service URL right next to the Region. This will start a Puppeteer browser and open a page that visits and scrapes placeholder.com.
See the logs

Now, on the service details page, hold down the CONTROL key on your keyboard and click the service URL 5 times. Check out the logs once more.

What’s next?
You can now edit the application to fit your needs. Pass query parameters to the Express endpoint, or add a new POST endpoint and pass along the URL that you want to scrape in its body.
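For example, a POST variant (hypothetical route name, reusing the same getBrowserWSEndpoint and tab helpers from earlier) could read the target URL from a JSON body:

app.use(express.json());

// POST /scrape with a body like { "url": "https://example.com" }
app.post('/scrape', async (req, res) => {
  const { url } = req.body;
  if (!url) return res.status(400).send('Missing "url" in request body');
  const wsEndpoint = await getBrowserWSEndpoint();
  const text = await tab(wsEndpoint, url);
  res.send(text);
});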
When you’re ready, tweak the number of maximum instances allowed and the amount of memory required for your workflow.