Crawler

Build, configure and run a crawler.

crawler()
Create a crawler
cr_options()
Set crawler options
cr_use_http()
Use the HTTP fetch backend
cr_use_browser()
Use the headless-browser fetch backend
cr_parallel()
Enable parallel (concurrent) fetching
cr_autoscale()
Enable autoscaled parallel fetching
cr_stream()
Enable the streaming scheduler
cr_run()
Run a crawl
cr_collect()
Collect crawl results

Discovery

Seed the queue from sitemaps and feeds.

cr_from_sitemap()
Discover URLs from a sitemap
cr_from_rss()
Discover URLs from an RSS or Atom feed

Handlers

Register handlers and act on fetched content.

cr_on_html()
Register an HTML handler
cr_on_pdf()
Register a PDF handler

RAG

Chunk, embed and export text for retrieval.

cr_chunk()
Chunk text for retrieval-augmented generation
cr_embed()
Attach embeddings to chunks
cr_export()
Export chunks (and embeddings) for retrieval

Persistence

Reproducible, resumable runs.

cr_persist()
Persist a crawl to a run directory (and resume it)
cr_dataset()
Configure the dataset backend
cr_close()
Release a crawler's resources

Storage & queue

Lower-level building blocks.

Crawler-class, Crawler
Crawler
RequestQueue
Request queue
Dataset
Dataset
cr_store()
Configure the key-value store for binary content
KeyValueStore
Key-value store

Utilities

cr_normalize_url()
Normalise a URL into a canonical form