The mental model
crawlee mirrors the architecture of Crawlee in pure R. A crawler owns:
- a request queue — a deduplicating, resumable list of URLs to visit;
- one or more handlers — functions run on each fetched page;
- a dataset — the structured records your handlers produce.
You build a crawler with crawler() and configure it with
cr_* verbs that compose through the native pipe
(|>).
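The three parts above can be pictured with a toy crawl in base R. This is only a sketch of the idea, not crawlee's implementation: `site`, `toy_crawl()` and the shape of the context list are hypothetical names made up for illustration, and the "pages" are an in-memory list rather than fetched documents.

```r
# Toy in-memory "site": each URL maps to a title and its outgoing links.
# Everything here is hypothetical and independent of the crawlee API.
site <- list(
  "/"  = list(title = "Home",   links = c("/a", "/b")),
  "/a" = list(title = "Page A", links = c("/b", "/")),
  "/b" = list(title = "Page B", links = character(0))
)

toy_crawl <- function(site, start, handler) {
  queue   <- start   # request queue: URLs still to visit
  seen    <- start   # every URL ever enqueued, used for deduplication
  dataset <- list()  # records pushed by the handler

  while (length(queue) > 0) {
    url   <- queue[[1]]
    queue <- queue[-1]
    page  <- site[[url]]

    # Context handed to the handler, loosely modelled on ctx.
    ctx <- list(
      request   = list(url = url),
      page      = page,
      push_data = function(data) dataset[[length(dataset) + 1]] <<- data,
      enqueue_links = function() {
        new   <- setdiff(page$links, seen)  # skip URLs already enqueued
        seen  <<- c(seen, new)
        queue <<- c(queue, new)
      }
    )
    handler(ctx)
  }

  # Bind the collected records into one data frame.
  do.call(rbind, lapply(dataset, as.data.frame, stringsAsFactors = FALSE))
}

result <- toy_crawl(site, "/", function(ctx) {
  ctx$push_data(list(url = ctx$request$url, title = ctx$page$title))
  ctx$enqueue_links()
})
```

Each page is visited once even though `/a` links back to `/` and both `/` and `/a` link to `/b`, because the queue only accepts URLs it has not seen before.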
A minimal crawl
```r
resultado <- crawler("https://example.com") |>
  cr_options(delay = 0.5, max_depth = 2) |>
  cr_use_http() |>
  cr_on_html(function(ctx) {
    ctx$push_data(list(
      url = ctx$request$url,
      titulo = ctx$page |> rvest::html_element("h1") |> rvest::html_text2()
    ))
    ctx$enqueue_links()
  }) |>
  cr_run() |>
  cr_collect()
```

The handler context
Every handler receives a context object, conventionally named
ctx:
| Element | Description |
|---|---|
| ctx$request | The current request (url, label, depth, …). |
| ctx$response | The raw httr2 response. |
| ctx$page | The parsed page (xml_document) for HTML/XML, else NULL. |
| ctx$push_data(data) | Append a record (list or data frame) to the dataset. |
| ctx$enqueue_links(...) | Discover and enqueue links from the page. |
| ctx$log | Logging helpers (info(), success(), warn(), error()). |
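Because ctx$page is NULL for non-HTML responses, a handler may want to guard against that before parsing. The sketch below exercises such a handler against a hand-built stand-in for the context. `mock_ctx()` and `extract_title()` are hypothetical helpers, not part of crawlee, and the sketch assumes that `ctx$log$warn()` accepts a single message string, which this section does not confirm.

```r
# Hypothetical stand-in for the handler context; the real object is built
# by the crawler. Only the elements used by the handler are mocked.
mock_ctx <- function(url, html = NULL) {
  records  <- list()
  messages <- character(0)
  list(
    request   = list(url = url, label = NULL, depth = 0),
    page      = if (!is.null(html)) rvest::read_html(html) else NULL,
    push_data = function(data) records[[length(records) + 1]] <<- data,
    # Assumption: the log helpers take one message string.
    log       = list(warn = function(msg) messages <<- c(messages, msg)),
    records   = function() records,    # inspection helpers for the mock only
    messages  = function() messages
  )
}

# Handler that skips responses without a parsed page.
extract_title <- function(ctx) {
  if (is.null(ctx$page)) {
    ctx$log$warn(paste("no parsed page for", ctx$request$url))
    return(invisible(NULL))
  }
  ctx$push_data(list(
    url   = ctx$request$url,
    title = ctx$page |> rvest::html_element("h1") |> rvest::html_text2()
  ))
}

html_ctx <- mock_ctx("https://example.com",
                     "<html><body><h1>Hello</h1></body></html>")
extract_title(html_ctx)

json_ctx <- mock_ctx("https://example.com/data.json")
extract_title(json_ctx)
```

After these calls, `html_ctx$records()` holds one record and `json_ctx$messages()` holds one warning, so the handler logic can be checked without touching the network.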
Controlling link discovery
enqueue_links() accepts glob,
include/exclude patterns and a
same_domain flag (on by default), so you only follow the
links you care about:
```r
ctx$enqueue_links(
  glob = "*/blog/*",
  exclude = "*/tag/*",
  label = "article"
)
```

Requests enqueued with a label are routed to the matching handler registered with
cr_on_html(..., label = "article").
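To get a feel for what such filters select, the same kind of rules can be approximated in base R. This is a rough sketch under stated assumptions: `filter_links()` is a hypothetical helper, `utils::glob2rx()` stands in for whatever glob matching crawlee actually performs, and the host comparison is a simple exact match that may differ from the package's notion of "same domain".

```r
# Hypothetical approximation of glob / exclude / same_domain filtering.
filter_links <- function(links, base_host, glob = NULL, exclude = NULL,
                         same_domain = TRUE) {
  # Extract the host part of an absolute URL (no userinfo or port handling).
  host_of <- function(u) {
    sub("^[a-z]+://([^/:?#]+).*$", "\\1", u, ignore.case = TRUE)
  }

  keep <- rep(TRUE, length(links))
  if (same_domain)       keep <- keep & host_of(links) == base_host
  if (!is.null(glob))    keep <- keep & grepl(utils::glob2rx(glob), links)
  if (!is.null(exclude)) keep <- keep & !grepl(utils::glob2rx(exclude), links)
  links[keep]
}

links <- c(
  "https://example.com/blog/post-1",
  "https://example.com/blog/tag/r",   # dropped by the exclude pattern
  "https://example.com/about",        # does not match the glob
  "https://other.org/blog/post-2"     # different host
)

kept <- filter_links(links, base_host = "example.com",
                     glob = "*/blog/*", exclude = "*/tag/*")
```

With these inputs only the first link survives all three checks.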
Reproducibility
The request queue deduplicates URLs by a normalised key (see
cr_normalize_url()), so the same page is never fetched
twice and crawls are deterministic. Persistent, resumable storage
backends (DuckDB, Parquet) are on the roadmap.
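Deduplicating by a normalised key can be sketched as follows. The rules used here (lower-case the scheme and host, drop the fragment, drop trailing slashes from the path) are assumptions chosen for illustration; the actual behaviour of cr_normalize_url() is not described in this section, and `normalize_url()` below is a hypothetical stand-in.

```r
# Hypothetical normalisation rules; not the actual cr_normalize_url() logic.
normalize_url <- function(url) {
  url   <- sub("#.*$", "", url)  # drop the fragment
  parts <- regmatches(
    url,
    regexec("^([A-Za-z][A-Za-z0-9+.-]*)://([^/?]+)([^?]*)(.*)$", url)
  )[[1]]
  if (length(parts) == 0) return(url)  # not an absolute URL; leave untouched

  path <- sub("/+$", "", parts[4])     # drop trailing slashes from the path
  # Scheme and host are case-insensitive; path and query are kept as-is.
  paste0(tolower(parts[2]), "://", tolower(parts[3]), path, parts[5])
}

urls <- c(
  "https://example.com/Blog",
  "HTTPS://Example.com/Blog/#top",   # same page as the first URL
  "https://example.com/Blog?page=2"  # different query, so a different key
)

keys     <- vapply(urls, normalize_url, character(1), USE.NAMES = FALSE)
to_fetch <- urls[!duplicated(keys)]  # keep the first URL seen for each key
```

Here the second URL collapses onto the first, so only two requests would be made.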