Build a web crawler with Queues and Browser Rendering
Example of how to use Queues and Browser Rendering to power a web crawler.
This tutorial explains how to build and deploy a web crawler with Queues, Browser Rendering, and Puppeteer.
Puppeteer is a high-level library used to automate interactions with Chrome/Chromium browsers. On each submitted page, the crawler will find the number of links to cloudflare.com
and take a screenshot of the site, saving results to Workers KV.
You can use Puppeteer to request all images on a page, save the colors used on a site, and more.
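As a preview of that per-page work, here is a minimal, hypothetical sketch, assuming a Puppeteer `Page` that has already been navigated to a submitted URL (the `inspectPage` helper is illustrative and not part of the final crawler code):

```ts
import type { Page } from "@cloudflare/puppeteer";

// Hypothetical helper sketching the per-page work described above.
// Assumes `page` has already been navigated to the submitted URL.
async function inspectPage(page: Page) {
  // Count anchors that link to cloudflare.com.
  const linkCount = await page.$$eval(
    "a",
    (anchors) => anchors.filter((a) => a.href.includes("cloudflare.com")).length,
  );

  // Capture a screenshot of the page (binary by default).
  const screenshot = await page.screenshot();

  return { linkCount, screenshot };
}
```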
- Sign up for a Cloudflare account ↗.
- Install Node.js ↗.
Node.js version manager
Use a Node version manager like Volta ↗ or nvm ↗ to avoid permission issues and change Node.js versions. Wrangler, discussed later in this guide, requires a Node version of 16.17.0 or later.
To get started, create a Worker application using the create-cloudflare
CLI ↗. Open a terminal window and run the following command:
npm create cloudflare@latest -- queues-web-crawler
yarn create cloudflare queues-web-crawler
pnpm create cloudflare@latest queues-web-crawler
For setup, select the following options:
- For What would you like to start with?, choose Hello World example.
- For Which template would you like to use?, choose Worker only.
- For Which language do you want to use?, choose TypeScript.
- For Do you want to use git for version control?, choose Yes.
- For Do you want to deploy your application?, choose No (we will be making some changes before deploying).
Then, move into your newly created directory:
cd queues-web-crawler
We need to create two KV namespaces: one for the link counts and one for the screenshots. This can be done through the Cloudflare dashboard or the Wrangler CLI. For this tutorial, we will use the Wrangler CLI.
npx wrangler kv namespace create crawler_links
yarn wrangler kv namespace create crawler_links
pnpm wrangler kv namespace create crawler_links
npx wrangler kv namespace create crawler_screenshots
yarn wrangler kv namespace create crawler_screenshots
pnpm wrangler kv namespace create crawler_screenshots
🌀 Creating namespace with title "web-crawler-crawler-links"
✨ Success!
Add the following to your configuration file in your kv_namespaces array:
[[kv_namespaces]]
binding = "crawler_links"
id = "<GENERATED_NAMESPACE_ID>"
🌀 Creating namespace with title "web-crawler-crawler-screenshots"
✨ Success!
Add the following to your configuration file in your kv_namespaces array:
[[kv_namespaces]]
binding = "crawler_screenshots"
id = "<GENERATED_NAMESPACE_ID>"
Add KV bindings to the Wrangler configuration file
Then, in your Wrangler file, add the following with the values generated in the terminal:
{ "kv_namespaces": [ { "binding": "CRAWLER_SCREENSHOTS_KV", "id": "<GENERATED_NAMESPACE_ID>" }, { "binding": "CRAWLER_LINKS_KV", "id": "<GENERATED_NAMESPACE_ID>" } ]}
kv_namespaces = [
  { binding = "CRAWLER_SCREENSHOTS_KV", id = "<GENERATED_NAMESPACE_ID>" },
  { binding = "CRAWLER_LINKS_KV", id = "<GENERATED_NAMESPACE_ID>" }
]
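Once configured, these bindings become available on the Worker's env parameter. A minimal sketch of a matching TypeScript interface (the Env name is conventional, and the exact interface you define may differ; KVNamespace comes from the Workers type definitions):

```ts
// Bindings declared in the Wrangler configuration above.
// The property names must match the `binding` values exactly.
export interface Env {
  CRAWLER_LINKS_KV: KVNamespace;
  CRAWLER_SCREENSHOTS_KV: KVNamespace;
}
```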
Now, you need to set up your Worker for Browser Rendering.
In your current directory, install Cloudflare’s fork of Puppeteer and robots-parser ↗:
npm i -D @cloudflare/puppeteer
yarn add -D @cloudflare/puppeteer
pnpm add -D @cloudflare/puppeteer
npm i robots-parser
yarn add robots-parser
pnpm add robots-parser
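robots-parser lets the crawler respect each site's robots.txt before visiting a page. A minimal sketch of how it could be used (the isCrawlable helper and the user agent string are placeholders for illustration):

```ts
import robotsParser from "robots-parser";

// Fetch a site's robots.txt and check whether a given URL may be crawled.
// The helper name and user agent string are placeholders for illustration.
async function isCrawlable(url: string): Promise<boolean> {
  const robotsUrl = new URL("/robots.txt", url).href;
  const response = await fetch(robotsUrl);
  if (!response.ok) {
    // No robots.txt (or it could not be fetched): assume crawling is allowed.
    return true;
  }
  const robots = robotsParser(robotsUrl, await response.text());
  // isAllowed returns undefined when the URL is not covered by any rule.
  return robots.isAllowed(url, "queues-web-crawler") ?? true;
}
```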
Then, add a Browser Rendering binding, which gives the Worker access to a headless Chromium instance that you will control with Puppeteer.
{ "browser": { "binding": "CRAWLER_BROWSER" }}
browser = { binding = "CRAWLER_BROWSER" }
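With the binding in place, the Worker can launch and drive a browser through Cloudflare's Puppeteer fork. Below is a minimal sketch, assuming an Env interface that types CRAWLER_BROWSER as a Fetcher (the URL is only an example; the actual crawler will receive its URLs from the queue set up next):

```ts
import puppeteer from "@cloudflare/puppeteer";

interface Env {
  // Matches the `binding` value in the Wrangler configuration above.
  CRAWLER_BROWSER: Fetcher;
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    // Launch a headless Chromium session over the Browser Rendering binding.
    const browser = await puppeteer.launch(env.CRAWLER_BROWSER);
    const page = await browser.newPage();

    // Example navigation only; the crawler will pull its URLs from the queue.
    await page.goto("https://developers.cloudflare.com/");
    const title = await page.title();

    await browser.close();
    return new Response(`Page title: ${title}`);
  },
};
```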
Now, we need to set up the Queue.
npx wrangler queues create queues-web-crawler
yarn wrangler queues create queues-web-crawler
pnpm wrangler queues create queues-web-crawler
Creating queue queues-web-crawler.
Created queue queues-web-crawler.
Then, in your Wrangler file, add the following: