Build a web crawler with Queues and Browser Rendering
Example of how to use Queues and Browser Rendering to power a web crawler.
This tutorial explains how to build and deploy a web crawler with Queues, Browser Rendering, and Puppeteer.
Puppeteer is a high-level library used to automate interactions with Chrome/Chromium browsers. On each submitted page, the crawler will find the number of links to cloudflare.com
and take a screenshot of the site, saving results to Workers KV.
You can use Puppeteer to request all images on a page, save the colors used on a site, and more.
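As a preview of that per-page work, here is a minimal, hypothetical sketch, assuming a Puppeteer `Page` that has already been navigated to a submitted URL (the `inspectPage` helper is illustrative and not part of the final crawler code):

```ts
import type { Page } from "@cloudflare/puppeteer";

// Hypothetical helper sketching the per-page work described above.
// Assumes `page` has already been navigated to the submitted URL.
async function inspectPage(page: Page) {
  // Count anchors that link to cloudflare.com.
  const linkCount = await page.$$eval(
    "a",
    (anchors) => anchors.filter((a) => a.href.includes("cloudflare.com")).length,
  );

  // Capture a screenshot of the page (binary by default).
  const screenshot = await page.screenshot();

  return { linkCount, screenshot };
}
```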
- Sign up for a Cloudflare account ↗.
- Install Node.js ↗.
Node.js version manager
Use a Node version manager like Volta ↗ or nvm ↗ to avoid permission issues and change Node.js versions. Wrangler, discussed later in this guide, requires a Node version of 16.17.0 or later.
To get started, create a Worker application using the create-cloudflare
CLI ↗. Open a terminal window and run the following command:
npm create cloudflare@latest -- queues-web-crawler
yarn create cloudflare queues-web-crawler
pnpm create cloudflare@latest queues-web-crawler
For setup, select the following options:
- For What would you like to start with?, choose Hello World example.
- For Which template would you like to use?, choose Worker only.
- For Which language do you want to use?, choose TypeScript.
- For Do you want to use git for version control?, choose Yes.
- For Do you want to deploy your application?, choose No (we will be making some changes before deploying).
Then, move into your newly created directory:
cd queues-web-crawler
We need to create two KV namespaces: one for the link counts and one for the screenshots. This can be done through the Cloudflare dashboard or the Wrangler CLI. For this tutorial, we will use the Wrangler CLI.
npx wrangler kv namespace create crawler_links
yarn wrangler kv namespace create crawler_links
pnpm wrangler kv namespace create crawler_links
npx wrangler kv namespace create crawler_screenshots
yarn wrangler kv namespace create crawler_screenshots
pnpm wrangler kv namespace create crawler_screenshots
🌀 Creating namespace with title "web-crawler-crawler-links"
✨ Success!
Add the following to your configuration file in your kv_namespaces array:
[[kv_namespaces]]
binding = "crawler_links"
id = "<GENERATED_NAMESPACE_ID>"
🌀 Creating namespace with title "web-crawler-crawler-screenshots"
✨ Success!
Add the following to your configuration file in your kv_namespaces array:
[[kv_namespaces]]
binding = "crawler_screenshots"
id = "<GENERATED_NAMESPACE_ID>"
Add KV bindings to the Wrangler configuration file
Then, in your Wrangler file, add the following with the values generated in the terminal:
{ "kv_namespaces": [ { "binding": "CRAWLER_SCREENSHOTS_KV", "id": "<GENERATED_NAMESPACE_ID>" }, { "binding": "CRAWLER_LINKS_KV", "id": "<GENERATED_NAMESPACE_ID>" } ]}
kv_namespaces = [
  { binding = "CRAWLER_SCREENSHOTS_KV", id = "<GENERATED_NAMESPACE_ID>" },
  { binding = "CRAWLER_LINKS_KV", id = "<GENERATED_NAMESPACE_ID>" }
]
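Once configured, these bindings become available on the Worker's env parameter. A minimal sketch of a matching TypeScript interface (the Env name is conventional, and the exact interface you define may differ; KVNamespace comes from the Workers type definitions):

```ts
// Bindings declared in the Wrangler configuration above.
// The property names must match the `binding` values exactly.
export interface Env {
  CRAWLER_LINKS_KV: KVNamespace;
  CRAWLER_SCREENSHOTS_KV: KVNamespace;
}
```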
Now, you need to set up your Worker for Browser Rendering.
In your current directory, install Cloudflare’s fork of Puppeteer and robots-parser ↗:
npm i -D @cloudflare/puppeteer
yarn add -D @cloudflare/puppeteer
pnpm add -D @cloudflare/puppeteer
npm i robots-parser
yarn add robots-parser
pnpm add robots-parser
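robots-parser lets the crawler respect each site's robots.txt before visiting a page. A minimal sketch of how it could be used (the isCrawlable helper and the user agent string are placeholders for illustration):

```ts
import robotsParser from "robots-parser";

// Fetch a site's robots.txt and check whether a given URL may be crawled.
// The helper name and user agent string are placeholders for illustration.
async function isCrawlable(url: string): Promise<boolean> {
  const robotsUrl = new URL("/robots.txt", url).href;
  const response = await fetch(robotsUrl);
  if (!response.ok) {
    // No robots.txt (or it could not be fetched): assume crawling is allowed.
    return true;
  }
  const robots = robotsParser(robotsUrl, await response.text());
  // isAllowed returns undefined when the URL is not covered by any rule.
  return robots.isAllowed(url, "queues-web-crawler") ?? true;
}
```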
Then, add a Browser Rendering binding, which gives the Worker access to a headless Chromium instance that you will control with Puppeteer.
{ "browser": { "binding": "CRAWLER_BROWSER" }}
browser = { binding = "CRAWLER_BROWSER" }
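With the binding in place, the Worker can launch and drive a browser through Cloudflare's Puppeteer fork. Below is a minimal sketch, assuming an Env interface that types CRAWLER_BROWSER as a Fetcher (the URL is only an example; the actual crawler will receive its URLs from the queue set up next):

```ts
import puppeteer from "@cloudflare/puppeteer";

interface Env {
  // Matches the `binding` value in the Wrangler configuration above.
  CRAWLER_BROWSER: Fetcher;
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    // Launch a headless Chromium session over the Browser Rendering binding.
    const browser = await puppeteer.launch(env.CRAWLER_BROWSER);
    const page = await browser.newPage();

    // Example navigation only; the crawler will pull its URLs from the queue.
    await page.goto("https://developers.cloudflare.com/");
    const title = await page.title();

    await browser.close();
    return new Response(`Page title: ${title}`);
  },
};
```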
Now, we need to set up the Queue.
npx wrangler queues create queues-web-crawler
yarn wrangler queues create queues-web-crawler
pnpm wrangler queues create queues-web-crawler
Creating queue queues-web-crawler.
Created queue queues-web-crawler.
Then, in your Wrangler file, add the following: