Website crawling library that rapidly crawls all pages to gather links via isolated contexts.
Spider is a multi-threaded crawler that can be configured to scrape web pages. It can gather millions of pages within seconds.
§How to use Spider
There are a couple of ways to use Spider:
- crawl: start concurrently crawling a site. Can be used to send each page (including URL and HTML) to a subscriber for processing, or just to gather links.
- scrape: like crawl, but saves the raw HTML strings to parse after scraping is complete (see the sketch after this list).
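A minimal sketch of the scrape flow, assuming the get_pages accessor on Website and the get_url/get_html accessors on Page (check the page module for the exact API):

use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://spider.cloud");
    // scrape() walks the site like crawl() but retains the raw HTML per page.
    website.scrape().await;

    // get_pages() is assumed to return the collected pages with their HTML.
    if let Some(pages) = website.get_pages() {
        for page in pages.iter() {
            println!("{} - {} bytes", page.get_url(), page.get_html().len());
        }
    }
}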
§Examples
A simple crawl to index a website:
use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://spider.cloud");
    website.crawl().await;
    let links = website.get_links();

    for link in links {
        println!("- {:?}", link.as_ref());
    }
}

Subscribe to crawl events:
use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://spider.cloud");
    let mut rx2 = website.subscribe(16).unwrap();

    tokio::spawn(async move {
        while let Ok(res) = rx2.recv().await {
            println!("- {}", res.get_url());
        }
    });

    website.crawl().await;
}

§Spider Cloud Integration
Use Spider Cloud for anti-bot bypass, proxy rotation, and high-throughput
data collection. Enable the spider_cloud feature and set your API key:
use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://example.com")
        .with_spider_cloud("YOUR_API_KEY")
        .with_limit(10)
        .build()
        .unwrap();

    website.crawl().await;

    for link in website.get_links() {
        println!("- {:?}", link.as_ref());
    }
}

§Chrome Rendering
Enable the chrome feature to render JavaScript-heavy pages. Use the env var
CHROME_URL to connect to a remote instance:
use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://spider.cloud")
        .with_limit(10)
        .with_chrome_intercept(Default::default())
        .build()
        .unwrap();

    let mut rx = website.subscribe(16).unwrap();

    tokio::spawn(async move {
        while let Ok(page) = rx.recv().await {
            println!("{} - {}", page.get_url(), page.get_html_bytes_u8().len());
        }
    });

    website.crawl().await;
}

§Feature flags
§Core
- ua_generator: Enables auto-generating a random real User-Agent.
- regex: Enables blacklisting paths with regex (see the sketch after this list).
- glob: Enables URL glob support.
- fs: Enables storing resources to disk for parsing (may greatly increase performance at the cost of temp storage). Enabled by default.
- sitemap: Include sitemap pages in results.
- time: Enables duration tracking per page.
- encoding: Enables handling content in different encodings such as Shift_JIS.
- serde: Enables serde serialization support.
- sync: Subscribe to changes for async Page data processing.
- control: Enables the ability to pause, start, and shut down crawls on demand.
- full_resources: Enables gathering all content related to the domain, such as CSS and JS.
- cookies: Enables storing and setting cookies to use for requests.
- spoof: Spoof HTTP headers for the request.
- headers: Enables the extraction of header information on each retrieved page. Adds a headers field to the page struct.
- balance: Enables balancing CPU and memory to scale more efficiently.
- cron: Enables the ability to start cron jobs for the website.
- tracing: Enables tokio tracing support for diagnostics.
- cowboy: Enables full concurrency mode with no throttle.
- llm_json: Enables LLM-friendly JSON parsing.
- page_error_status_details: Enables storing detailed error status information on pages.
- extra_information: Enables extra page metadata collection.
- cmd: Enables tokio process support.
- io_uring: Enables Linux io_uring support for async I/O (default on Linux).
- simd: Enables SIMD-accelerated JSON parsing.
- inline-more: More aggressive function inlining for performance (may increase compile times).
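For example, with the regex feature enabled, blacklist entries can be expressed as patterns rather than exact URLs. A minimal sketch, assuming a with_blacklist_url builder method (the pattern and limit below are placeholders):

use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    // With the `regex` feature, blacklist entries are matched as patterns.
    let mut website: Website = Website::new("https://spider.cloud")
        .with_blacklist_url(Some(vec!["/blog/".into()]))
        .with_limit(50)
        .build()
        .unwrap();

    website.crawl().await;

    for link in website.get_links() {
        println!("- {:?}", link.as_ref());
    }
}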
§Storage
- disk: Enables SQLite hybrid disk storage to balance memory usage, with no TLS.
- disk_native_tls: Enables SQLite hybrid disk storage to balance memory usage, with native TLS.
- disk_aws: Enables SQLite hybrid disk storage to balance memory usage, with AWS TLS.
§Caching
- cache: Enables caching HTTP requests to disk (see the sketch after this list).
- cache_mem: Enables caching HTTP requests in memory.
- cache_openai: Enables caching OpenAI requests. This can drastically save costs when developing AI workflows.
- cache_gemini: Enables caching Gemini AI requests.
- cache_chrome_hybrid: Enables hybrid Chrome + HTTP caching to disk.
- cache_chrome_hybrid_mem: Enables hybrid Chrome + HTTP caching in memory.
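A minimal sketch of turning caching on from the builder, assuming the cache feature is enabled and a with_caching(bool) builder method exists (the limit is a placeholder):

use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    // Assumes the `cache` feature and a `with_caching(bool)` builder method.
    let mut website: Website = Website::new("https://spider.cloud")
        .with_caching(true)
        .with_limit(25)
        .build()
        .unwrap();

    // Responses are written to the HTTP cache so later runs can reuse them.
    website.crawl().await;
}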
§Chrome / Browser
- chrome: Enables Chrome headless rendering; use the env var CHROME_URL to connect remotely.
- chrome_headed: Enables Chrome headful rendering.
- chrome_cpu: Disables GPU usage for the Chrome browser.
- chrome_stealth: Enables stealth mode to make it harder to be detected as a bot.
- chrome_store_page: Stores the page object to perform other actions, like taking screenshots conditionally.
- chrome_screenshot: Enables storing a screenshot of each page on crawl. Defaults the screenshots to the ./storage/ directory. Use the env variable SCREENSHOT_DIRECTORY to adjust the directory.
- chrome_intercept: Allows intercepting network requests to speed up processing.
- chrome_headless_new: Uses headless=new to launch the Chrome instance.
- chrome_simd: Enables SIMD optimizations for Chrome message parsing.
- chrome_tls_connection: Enables TLS connection support for Chrome.
- chrome_serde_stacker: Enables serde stacker for deeply nested Chrome messages.
- chrome_remote_cache: Enables remote Chrome caching in memory.
- chrome_remote_cache_disk: Enables remote Chrome caching to disk.
- chrome_remote_cache_mem: Enables remote Chrome caching in memory only.
- adblock: Enables adblock support for Chrome to block ads during rendering.
- real_browser: Enables the ability to bypass protected pages.
- smart: Enables smart mode. This runs requests over HTTP until JavaScript rendering is needed, avoiding multiple network requests by re-using the content.
§WebDriver
- webdriver: Enables WebDriver support via thirtyfour. Use with chromedriver, geckodriver, or Selenium.
- webdriver_headed: Enables WebDriver headful mode.
- webdriver_stealth: Enables stealth mode for WebDriver.
- webdriver_chrome: WebDriver with the Chrome browser.
- webdriver_firefox: WebDriver with the Firefox browser.
- webdriver_edge: WebDriver with the Edge browser.
- webdriver_screenshot: Enables screenshots via WebDriver.
§AI / LLM
- openai: Enables OpenAI to generate dynamic browser executable scripts. Make sure to use the env var OPENAI_API_KEY (see the sketch after this list).
- gemini: Enables Gemini AI to generate dynamic browser executable scripts. Make sure to use the env var GEMINI_API_KEY.
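A rough sketch of wiring in OpenAI-generated scripts, assuming the openai feature, a GPTConfigs type under spider::configuration, and a with_openai builder; the model name, prompt, and token budget are placeholders, and OPENAI_API_KEY must be set in the environment:

use spider::configuration::GPTConfigs;
use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    // Placeholder model, prompt, and max-token values; adjust to your workflow.
    let gpt_config = GPTConfigs::new("gpt-4o", "Close any cookie banners on the page.", 512);

    let mut website: Website = Website::new("https://spider.cloud")
        .with_openai(Some(gpt_config))
        .with_limit(5)
        .build()
        .unwrap();

    website.crawl().await;
}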
§Spider Cloud
- spider_cloud: Enables Spider Cloud integration for anti-bot bypass, proxy rotation, and API-based crawling.
§Agent
- agent: Enables the spider_agent multimodal autonomous agent.
- agent_openai: Agent with OpenAI provider.
- agent_chrome: Agent with Chrome browser context.
- agent_webdriver: Agent with WebDriver context.
- agent_skills: Agent with dynamic skill system for web automation challenges.
- agent_skills_s3: Agent skills with S3 storage.
- agent_fs: Agent with filesystem support for temp storage.
- agent_search_serper: Agent with Serper search integration.
- agent_search_brave: Agent with Brave Search integration.
- agent_search_bing: Agent with Bing Search integration.
- agent_search_tavily: Agent with Tavily search integration.
- agent_full: Full agent with all features enabled.
§Search
- search: Enables the search provider base.
- search_serper: Enables Serper search integration.
- search_brave: Enables Brave Search integration.
- search_bing: Enables Bing Search integration.
- search_tavily: Enables Tavily search integration.
§Networking
- socks: Enables SOCKS5 proxy support (see the sketch after this list).
- wreq: Enables the wreq HTTP client alternative with built-in impersonation.
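A hedged sketch of routing requests through a SOCKS5 proxy, assuming the socks feature and a with_proxies builder that accepts a list of proxy URLs; the address below is a placeholder for your own endpoint:

use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    // Placeholder proxy address; socks5:// URLs require the `socks` feature.
    let mut website: Website = Website::new("https://spider.cloud")
        .with_proxies(Some(vec!["socks5://127.0.0.1:1080".into()]))
        .with_limit(10)
        .build()
        .unwrap();

    website.crawl().await;
}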
§Distributed
- decentralized: Enables decentralized processing of IO; requires the spider_worker to be started before crawls.
- decentralized_headers: Enables the extraction of suppressed header information during decentralized IO processing. This is needed if headers is set in both spider and spider_worker.
- firewall: Enables the spider_firewall crate to prevent crawling bad websites.
Re-exports§
pub extern crate auto_encoder;
pub extern crate bytes;
pub extern crate case_insensitive_string;
pub extern crate hashbrown;
pub extern crate lazy_static;
pub extern crate percent_encoding;
pub extern crate quick_xml;
pub extern crate reqwest;
pub extern crate smallvec;
pub extern crate spider_fingerprint;
pub extern crate string_concat;
pub extern crate strum;
pub extern crate tokio;
pub extern crate tokio_stream;
pub extern crate ua_generator;
pub extern crate url;
pub use client::Client;
pub use client::ClientBuilder;
pub use case_insensitive_string::compact_str;
Modules§
- black_list - Blacklist checking whether a URL exists in the list.
- client - Client interface.
- configuration - Configuration structure for Website.
- features - Optional features to use.
- packages - Customized internal packages.
- page - A scraped page.
- utils - Application utils.
- website - A website to crawl.
Structs§
- CaseInsensitiveString - Case-insensitive string handling.
Type Aliases§
- RelativeSelectors - The selectors type. The values are held to make sure the relative domain can be crawled upon base redirects.