Skip to contents

The mental model

crawlee mirrors the architecture of Crawlee in pure R. A crawler owns:

  • a request queue — a deduplicating, resumable list of URLs to visit;
  • one or more handlers — functions run on each fetched page;
  • a dataset — the structured records your handlers produce.

You build a crawler with crawler() and configure it with cr_* verbs that compose through the native pipe (|>).

A minimal crawl

resultado <- crawler("https://example.com") |>
  cr_options(delay = 0.5, max_depth = 2) |>
  cr_use_http() |>
  cr_on_html(function(ctx) {
    ctx$push_data(list(
      url    = ctx$request$url,
      titulo = ctx$page |> rvest::html_element("h1") |> rvest::html_text2()
    ))
    ctx$enqueue_links()
  }) |>
  cr_run() |>
  cr_collect()

The handler context

Every handler receives a context object, conventionally named ctx:

Element Description
ctx$request The current request (url, label, depth, …).
ctx$response The raw httr2 response.
ctx$page The parsed page (xml_document) for HTML/XML, else NULL.
ctx$push_data(data) Append a record (list or data frame) to the dataset.
ctx$enqueue_links(...) Discover and enqueue links from the page.
ctx$log Logging helpers (info(), success(), warn(), error()).

enqueue_links() accepts glob, include/exclude patterns and a same_domain flag (on by default), so you only follow the links you care about:

ctx$enqueue_links(
  glob    = "*/blog/*",
  exclude = "*/tag/*",
  label   = "article"
)

Requests enqueued with a label are routed to the matching handler registered with cr_on_html(..., label = "article").

Reproducibility

The request queue deduplicates URLs by a normalised key (see cr_normalize_url()), so the same page is never fetched twice and crawls are deterministic. Persistent, resumable storage backends (DuckDB, Parquet) are on the roadmap. ```