Crate spider

Website crawling library that rapidly crawls all pages to gather links via isolated contexts.

Spider is a multi-threaded crawler that can be configured to scrape web pages, and it can gather millions of pages within seconds.

§How to use Spider

There are a couple of ways to use Spider:

  • crawl: start concurrently crawling a site. Can be used to send each page (including URL and HTML) to a subscriber for processing, or just to gather links.

  • scrape: like crawl, but stores the raw HTML of each page so it can be parsed after the scrape completes (see the scrape example below).

§Examples

A simple crawl to index a website:

use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://spider.cloud");

    website.crawl().await;

    let links = website.get_links();

    for link in links {
        println!("- {:?}", link.as_ref());
    }
}
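
The scrape variant mentioned above keeps the raw HTML around instead of only links. Below is a minimal sketch; it assumes scrape and get_pages behave as described for the website module, with each collected page exposing its URL and HTML:

use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://spider.cloud");

    // Like crawl, but the raw HTML of each page is stored for later parsing.
    website.scrape().await;

    // The collected pages become available once the scrape completes.
    if let Some(pages) = website.get_pages() {
        for page in pages.iter() {
            println!("- {} ({} bytes)", page.get_url(), page.get_html().len());
        }
    }
}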

Subscribe to crawl events:

use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://spider.cloud");
    let mut rx2 = website.subscribe(16).unwrap();

    tokio::spawn(async move {
        while let Ok(res) = rx2.recv().await {
            println!("- {}", res.get_url());
        }
    });

    website.crawl().await;
}

§Spider Cloud Integration

Use Spider Cloud for anti-bot bypass, proxy rotation, and high-throughput data collection. Enable the spider_cloud feature and set your API key:

use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://example.com")
        .with_spider_cloud("YOUR_API_KEY")
        .with_limit(10)
        .build()
        .unwrap();

    website.crawl().await;

    for link in website.get_links() {
        println!("- {:?}", link.as_ref());
    }
}

§Chrome Rendering

Enable the chrome feature to render JavaScript-heavy pages. Use the env var CHROME_URL to connect to a remote instance:

use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://spider.cloud")
        .with_limit(10)
        .with_chrome_intercept(Default::default())
        .build()
        .unwrap();

    let mut rx = website.subscribe(16).unwrap();

    tokio::spawn(async move {
        while let Ok(page) = rx.recv().await {
            println!("{} - {}", page.get_url(), page.get_html_bytes_u8().len());
        }
    });

    website.crawl().await;
}

§Feature flags
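
Features are enabled through Cargo. The entry below is an illustrative sketch (the version shown is a placeholder; use the release you actually depend on) that turns on regex blacklisting and Chrome rendering:

[dependencies]
spider = { version = "2", features = ["regex", "chrome"] }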

§Core

  • ua_generator: Enables auto-generating a random, real User-Agent.
  • regex: Enables blacklisting paths with regex (see the configuration sketch after this list).
  • glob: Enables URL glob support.
  • fs: Enables storing resources to disk for parsing (may greatly increase performance at the cost of temp storage). Enabled by default.
  • sitemap: Include sitemap pages in results.
  • time: Enables duration tracking per page.
  • encoding: Enables handling the content with different encodings like Shift_JIS.
  • serde: Enables serde serialization support.
  • sync: Enables subscribing to Page data changes for asynchronous processing.
  • control: Enables the ability to pause, start, and shutdown crawls on demand.
  • full_resources: Enables gathering all content related to the domain, such as CSS and JS.
  • cookies: Enables storing and setting cookies for requests.
  • spoof: Spoofs HTTP headers for requests.
  • headers: Enables the extraction of header information on each retrieved page. Adds a headers field to the page struct.
  • balance: Enables balancing the CPU and memory to scale more efficiently.
  • cron: Enables the ability to start cron jobs for the website.
  • tracing: Enables tokio tracing support for diagnostics.
  • cowboy: Enables full concurrency mode with no throttle.
  • llm_json: Enables LLM-friendly JSON parsing.
  • page_error_status_details: Enables storing detailed error status information on pages.
  • extra_information: Enables extra page metadata collection.
  • cmd: Enables tokio process support.
  • io_uring: Enables Linux io_uring support for async I/O (default on Linux).
  • simd: Enables SIMD-accelerated JSON parsing.
  • inline-more: More aggressive function inlining for performance (may increase compile times).
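
Several of the Core flags above pair with builder methods on Website. The following is a configuration sketch, not a definitive recipe: it assumes with_respect_robots_txt, with_user_agent, and with_blacklist_url are available as in the crate's own examples, and the blacklist entry only acts as a regex pattern when the regex feature is enabled:

use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://spider.cloud")
        // Honor robots.txt rules while crawling.
        .with_respect_robots_txt(true)
        // Set a fixed User-Agent (or enable ua_generator for a random one).
        .with_user_agent(Some("SpiderBot/1.0"))
        // With the regex feature, blacklist entries are matched as patterns.
        .with_blacklist_url(Some(Vec::from(["/licenses/".into()])))
        .with_limit(25)
        .build()
        .unwrap();

    website.crawl().await;

    for link in website.get_links() {
        println!("- {:?}", link.as_ref());
    }
}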

§Storage

  • disk: Enables SQLite hybrid disk storage to balance memory usage with no TLS.
  • disk_native_tls: Enables SQLite hybrid disk storage to balance memory usage with native TLS.
  • disk_aws: Enables SQLite hybrid disk storage to balance memory usage with AWS TLS.

§Caching

  • cache: Enables caching HTTP requests to disk.
  • cache_mem: Enables caching HTTP requests in memory.
  • cache_openai: Enables caching OpenAI requests. This can drastically reduce costs when developing AI workflows.
  • cache_gemini: Enables caching Gemini AI requests.
  • cache_chrome_hybrid: Enables hybrid Chrome + HTTP caching to disk.
  • cache_chrome_hybrid_mem: Enables hybrid Chrome + HTTP caching in memory.

§Chrome / Browser

  • chrome: Enables Chrome headless rendering; use the env var CHROME_URL to connect to a remote instance.
  • chrome_headed: Enables Chrome headful rendering.
  • chrome_cpu: Disables GPU usage for the Chrome browser.
  • chrome_stealth: Enables stealth mode to make it harder to be detected as a bot.
  • chrome_store_page: Store the page object to perform other actions like taking screenshots conditionally.
  • chrome_screenshot: Enables storing a screenshot of each page on crawl. Defaults the screenshots to the ./storage/ directory. Use the env variable SCREENSHOT_DIRECTORY to adjust the directory.
  • chrome_intercept: Allows intercepting network requests to speed up processing.
  • chrome_headless_new: Use headless=new to launch the Chrome instance.
  • chrome_simd: Enables SIMD optimizations for Chrome message parsing.
  • chrome_tls_connection: Enables TLS connection support for Chrome.
  • chrome_serde_stacker: Enables serde stacker for deeply nested Chrome messages.
  • chrome_remote_cache: Enables remote Chrome caching in memory.
  • chrome_remote_cache_disk: Enables remote Chrome caching to disk.
  • chrome_remote_cache_mem: Enables remote Chrome caching in memory only.
  • adblock: Enables adblock support for Chrome to block ads during rendering.
  • real_browser: Enables the ability to bypass protected pages.
  • smart: Enables smart mode: requests run over plain HTTP until JavaScript rendering is needed, and the fetched content is re-used to avoid extra network requests (see the sketch after this list).
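
A minimal smart-mode sketch, assuming a crawl_smart entry point is exposed on Website when the smart feature is enabled:

use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://spider.cloud");

    // Runs plain HTTP requests first and falls back to browser rendering
    // only for pages that appear to require JavaScript.
    website.crawl_smart().await;

    for link in website.get_links() {
        println!("- {:?}", link.as_ref());
    }
}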

§WebDriver

  • webdriver: Enables WebDriver support via thirtyfour. Use with chromedriver, geckodriver, or Selenium.
  • webdriver_headed: Enables WebDriver headful mode.
  • webdriver_stealth: Enables stealth mode for WebDriver.
  • webdriver_chrome: WebDriver with Chrome browser.
  • webdriver_firefox: WebDriver with Firefox browser.
  • webdriver_edge: WebDriver with Edge browser.
  • webdriver_screenshot: Enables screenshots via WebDriver.

§AI / LLM

  • openai: Enables OpenAI to generate dynamic browser executable scripts; set the env var OPENAI_API_KEY (see the sketch after this list).
  • gemini: Enables Gemini AI to generate dynamic browser executable scripts; set the env var GEMINI_API_KEY.
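
A sketch for the openai flag, assuming the GPTConfigs type in spider::configuration and the with_openai builder method are available when the feature is enabled; the model name, prompt, and token budget below are placeholders:

use spider::configuration::GPTConfigs;
use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://spider.cloud");

    // Placeholder model, prompt, and max tokens; OPENAI_API_KEY must be set in the environment.
    let gpt_config = GPTConfigs::new("gpt-4o", "Extract the page title.", 512);
    website.with_openai(Some(gpt_config));

    website.crawl().await;
}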

§Spider Cloud

  • spider_cloud: Enables Spider Cloud integration for anti-bot bypass, proxy rotation, and API-based crawling.

§Agent

  • agent: Enables the spider_agent multimodal autonomous agent.
  • agent_openai: Agent with OpenAI provider.
  • agent_chrome: Agent with Chrome browser context.
  • agent_webdriver: Agent with WebDriver context.
  • agent_skills: Agent with dynamic skill system for web automation challenges.
  • agent_skills_s3: Agent skills with S3 storage.
  • agent_fs: Agent with filesystem support for temp storage.
  • agent_search_serper: Agent with Serper search integration.
  • agent_search_brave: Agent with Brave Search integration.
  • agent_search_bing: Agent with Bing Search integration.
  • agent_search_tavily: Agent with Tavily search integration.
  • agent_full: Full agent with all features enabled.

§Search

  • search: Enables search provider base.
  • search_serper: Enables Serper search integration.
  • search_brave: Enables Brave Search integration.
  • search_bing: Enables Bing Search integration.
  • search_tavily: Enables Tavily search integration.

§Networking

  • socks: Enables SOCKS5 proxy support.
  • wreq: Enables the wreq HTTP client alternative with built-in impersonation.

§Distributed

  • decentralized: Enables decentralized IO processing; requires starting spider_worker before crawls.
  • decentralized_headers: Enables extraction of suppressed header information during decentralized IO processing. This is needed if headers is enabled in both spider and spider_worker.
  • firewall: Enables the spider_firewall crate to block unwanted or malicious websites from being crawled.


§Re-exports

pub extern crate auto_encoder;
pub extern crate bytes;
pub extern crate case_insensitive_string;
pub extern crate hashbrown;
pub extern crate lazy_static;
pub extern crate percent_encoding;
pub extern crate quick_xml;
pub extern crate reqwest;
pub extern crate smallvec;
pub extern crate spider_fingerprint;
pub extern crate string_concat;
pub extern crate strum;
pub extern crate tokio;
pub extern crate tokio_stream;
pub extern crate ua_generator;
pub extern crate url;
pub use client::Client;
pub use client::ClientBuilder;
pub use case_insensitive_string::compact_str;

§Modules

  • black_list: Checks whether a URL exists in the black list.
  • client: Client interface.
  • configuration: Configuration structure for Website.
  • features: Optional features to use.
  • packages: Customized internal packages.
  • page: A scraped page.
  • utils: Application utils.
  • website: A website to crawl.

§Structs

  • CaseInsensitiveString: Case-insensitive string handling.

§Type Aliases

  • RelativeSelectors: The selectors type. The values are held to ensure the relative domain can still be crawled after base redirects.