Package index • crawlee

Crawler

Build, configure and run a crawler.

crawler(): Create a crawler
cr_options(): Set crawler options
cr_use_http(): Use the HTTP fetch backend
cr_use_browser(): Use the headless-browser fetch backend
cr_parallel(): Enable parallel (concurrent) fetching
cr_autoscale(): Enable autoscaled parallel fetching
cr_stream(): Enable the streaming scheduler
cr_run(): Run a crawl
cr_collect(): Collect crawl results

Discovery

Seed the queue from sitemaps and feeds.

cr_from_sitemap(): Discover URLs from a sitemap
cr_from_rss(): Discover URLs from an RSS or Atom feed

Handlers

Register handlers and act on fetched content.

cr_on_html(): Register an HTML handler
cr_on_pdf(): Register a PDF handler

RAG

Chunk, embed and export text for retrieval.

cr_chunk(): Chunk text for retrieval-augmented generation
cr_embed(): Attach embeddings to chunks
cr_export(): Export chunks (and embeddings) for retrieval

Persistence

Reproducible, resumable runs.

cr_persist(): Persist a crawl to a run directory (and resume it)
cr_dataset(): Configure the dataset backend
cr_close(): Release a crawler's resources

Storage & queue

Lower-level building blocks.

Crawler-class, Crawler: Crawler class
RequestQueue: Request queue
Dataset: Dataset
cr_store(): Configure the key-value store for binary content
KeyValueStore: Key-value store

Utilities

cr_normalize_url(): Normalise a URL into a canonical form
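
As a rough illustration of what "canonical form" usually means for a crawler, the base-R sketch below lower-cases the scheme and host, drops default ports and fragments, ensures a non-empty path, and sorts query parameters. The helper `normalize_url_sketch()` is hypothetical and is not the implementation or signature of `cr_normalize_url()`. The steps shown are common conventions assumed for the example, and it ignores userinfo, percent-encoding, and dot-segment handling.

```r
# Hypothetical helper for illustration only; not crawlee's cr_normalize_url().
normalize_url_sketch <- function(url) {
  # Capture groups: scheme, authority, path, optional query, optional fragment.
  pattern <- "^([A-Za-z][A-Za-z0-9+.-]*)://([^/?#]*)([^?#]*)(\\?[^#]*)?(#.*)?$"
  m <- regmatches(url, regexec(pattern, url))[[1]]
  if (length(m) == 0) stop("not an absolute URL: ", url)

  # Scheme and host are case-insensitive, so lower-case both.
  scheme <- tolower(m[2])
  host <- tolower(m[3])

  # Drop the port when it is the default for the scheme.
  default_port <- c(http = "80", https = "443")
  if (scheme %in% names(default_port)) {
    host <- sub(paste0(":", default_port[[scheme]], "$"), "", host)
  }

  # An empty path is equivalent to "/".
  path <- m[4]
  if (path == "") path <- "/"

  # Sort query parameters so that parameter order does not create duplicates.
  query <- sub("^\\?", "", m[5])
  if (nzchar(query)) {
    params <- sort(strsplit(query, "&", fixed = TRUE)[[1]], method = "radix")
    query <- paste0("?", paste(params, collapse = "&"))
  }

  # The fragment (m[6]) is dropped: it never reaches the server.
  paste0(scheme, "://", host, path, query)
}

normalize_url_sketch("HTTP://Example.COM:80?b=2&a=1#top")
```

Sorting query parameters is a deduplication heuristic rather than a guaranteed-safe transformation, since some servers treat parameter order as significant.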