Build a web crawler with Queues and Browser Rendering

Example of how to use Queues and Browser Rendering to power a web crawler.

This tutorial explains how to build and deploy a web crawler with Queues, Browser Rendering, and Puppeteer.

Puppeteer is a high-level library used to automate interactions with Chrome/Chromium browsers. On each submitted page, the crawler will find the number of links to cloudflare.com and take a screenshot of the site, saving results to Workers KV.

You can use Puppeteer to request all images on a page, save the colors used on a site, and more.
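To give a sense of where this tutorial is headed, here is a rough sketch of the per-page Puppeteer logic the crawler will perform. The function and variable names are illustrative, and the Page object will come from the Browser Rendering binding set up later in this guide:

// Sketch of the per-page logic: count links to cloudflare.com and capture a screenshot.
import type { Page } from "@cloudflare/puppeteer";

async function crawlPage(page: Page, url: string) {
  await page.goto(url);

  // Gather every anchor href on the page, then count those pointing at cloudflare.com.
  const hrefs = await page.$$eval("a[href]", (anchors) =>
    anchors.map((a) => (a as HTMLAnchorElement).href)
  );
  const cloudflareLinkCount = hrefs.filter((href) => href.includes("cloudflare.com")).length;

  // Capture a screenshot of the rendered page; both results are later written to Workers KV.
  const screenshot = await page.screenshot();

  return { cloudflareLinkCount, screenshot };
}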

Prerequisites

  1. Sign up for a Cloudflare account.
  2. Install Node.js.

Node.js version manager

Use a Node version manager like Volta or nvm to avoid permission issues and change Node.js versions. Wrangler, discussed later in this guide, requires a Node version of 16.17.0 or later.

1. Create new Workers application

To get started, create a Worker application using the create-cloudflare CLI. Open a terminal window and run the following command:

Terminal window
npm create cloudflare@latest -- queues-web-crawler

For setup, select the following options:

  • For What would you like to start with?, choose Hello World example.
  • For Which template would you like to use?, choose Worker only.
  • For Which language do you want to use?, choose TypeScript.
  • For Do you want to use git for version control?, choose Yes.
  • For Do you want to deploy your application?, choose No (we will be making some changes before deploying).

Then, move into your newly created directory:

Terminal window
cd queues-web-crawler

2. Create KV namespace

We need to create two KV namespaces: one to store the number of links found on each page and one to store the screenshots. This can be done through the Cloudflare dashboard or the Wrangler CLI. For this tutorial, we will use the Wrangler CLI.

Terminal window
npx wrangler kv namespace create crawler_links
Output
🌀 Creating namespace with title "web-crawler-crawler-links"
Success!
Add the following to your configuration file in your kv_namespaces array:
[[kv_namespaces]]
binding = "crawler_links"
id = "<GENERATED_NAMESPACE_ID>"

Terminal window
npx wrangler kv namespace create crawler_screenshots
Output
🌀 Creating namespace with title "web-crawler-crawler-screenshots"
Success!
Add the following to your configuration file in your kv_namespaces array:
[[kv_namespaces]]
binding = "crawler_screenshots"
id = "<GENERATED_NAMESPACE_ID>"

Add KV bindings to the Wrangler configuration file

Then, in your Wrangler file, add the following with the values generated in the terminal:

{
  "kv_namespaces": [
    {
      "binding": "CRAWLER_SCREENSHOTS_KV",
      "id": "<GENERATED_NAMESPACE_ID>"
    },
    {
      "binding": "CRAWLER_LINKS_KV",
      "id": "<GENERATED_NAMESPACE_ID>"
    }
  ]
}
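With these bindings in place, both namespaces are exposed to the Worker on its environment object. A minimal sketch of how they surface in TypeScript (the Env interface and the helper below are illustrative, not part of the generated project):

// The KV bindings declared above become typed properties on the Worker's env.
export interface Env {
  CRAWLER_LINKS_KV: KVNamespace;
  CRAWLER_SCREENSHOTS_KV: KVNamespace;
}

// Hypothetical usage inside a handler: store a link count and a screenshot per crawled URL.
async function saveResults(
  env: Env,
  url: string,
  linkCount: number,
  screenshot: ArrayBuffer | Uint8Array
) {
  await env.CRAWLER_LINKS_KV.put(url, linkCount.toString());
  await env.CRAWLER_SCREENSHOTS_KV.put(url, screenshot);
}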

3. Set up Browser Rendering

Now, you need to set up your Worker for Browser Rendering.

In your current directory, install Cloudflare’s fork of Puppeteer and also robots-parser:

Terminal window
npm i -D @cloudflare/puppeteer
Terminal window
npm i robots-parser
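robots-parser will let the crawler respect each site's robots.txt before visiting its pages. A quick sketch of how it can be used (the URL construction and user agent string are assumptions for illustration):

import robotsParser from "robots-parser";

// Fetch a site's robots.txt and check whether a given page may be crawled.
async function isCrawlAllowed(pageUrl: string): Promise<boolean> {
  const robotsUrl = new URL("/robots.txt", pageUrl).href;
  const robotsTxt = await (await fetch(robotsUrl)).text();
  const robots = robotsParser(robotsUrl, robotsTxt);
  // isAllowed returns undefined when the URL is out of scope; treat that as allowed here.
  return robots.isAllowed(pageUrl, "queues-web-crawler") ?? true;
}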

Then, add a Browser Rendering binding to your Wrangler file. This gives the Worker access to a headless Chromium instance that you will control with Puppeteer.

{
  "browser": {
    "binding": "CRAWLER_BROWSER"
  }
}
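This binding is exposed to the Worker as env.CRAWLER_BROWSER and is passed to Puppeteer's launch function. A minimal sketch of opening a page through it (the Env shape and helper name here are illustrative):

import puppeteer from "@cloudflare/puppeteer";

interface Env {
  CRAWLER_BROWSER: Fetcher;
}

// Launch a headless browser session through the binding and navigate to a URL.
async function openPage(env: Env, url: string) {
  const browser = await puppeteer.launch(env.CRAWLER_BROWSER);
  const page = await browser.newPage();
  await page.goto(url);
  return { browser, page };
}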

4. Set up a Queue

Now, we need to set up the Queue.

Terminal window
npx wrangler queues create queues-web-crawler
Output
Creating queue queues-web-crawler.
Created queue queues-web-crawler.

Add Queue bindings to the Wrangler configuration file

Then, in your Wrangler file, add the following: