Persist a crawl to a run directory (and resume it)

Wires a crawler to a directory on disk so that a crawl is reproducible and resumable. The persisted artifacts are listed under Details.

Usage

cr_persist(crawler, dir, dataset = c("jsonl", "duckdb", "memory"))

Arguments

crawler: A Crawler.
dir: Run directory (created if needed).
dataset: Dataset backend to use: "jsonl" (default), "duckdb" or "memory" (not persisted).

Value

The crawler, invisibly.

Details

cr_persist() persists the following in dir:
the request queue state (queue.rds): pending requests, seen keys and handled count, checkpointed during cr_run();
the dataset, via a persistent Dataset backend (dataset.jsonl or dataset.duckdb);
binary content saved by ctx$save_body() (under kv/);
a run manifest (manifest.rds, plus manifest.json when jsonlite is available).
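
To see which of these artifacts a given run has produced so far, a small base R check can help. The sketch below is illustrative only: run_dir_contents() is a hypothetical helper, not part of the package, and it assumes the file names listed above sit directly under dir.

```r
# Report which of the documented run-directory artifacts exist under `dir`.
# `run_dir_contents()` is a hypothetical helper, not a package function.
run_dir_contents <- function(dir) {
  files <- c(
    queue          = "queue.rds",
    dataset_jsonl  = "dataset.jsonl",
    dataset_duckdb = "dataset.duckdb",
    manifest_rds   = "manifest.rds",
    manifest_json  = "manifest.json"
  )
  present <- file.exists(file.path(dir, files))
  names(present) <- names(files)
  # kv/ is a directory, so it is checked separately.
  c(present, kv = dir.exists(file.path(dir, "kv")))
}
```

Calling run_dir_contents("runs/example") returns a named logical vector; a TRUE for queue suggests that the next cr_run() on that directory would resume rather than start fresh.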

If a queue state already exists in dir, the crawl resumes: the saved pending/seen/handled state is restored, so cr_run() continues where it left off and already-fetched URLs are not fetched again.
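
The resume behaviour can be pictured with a small base R sketch. This is not the package's implementation: the state layout (a list with pending, seen and handled), the helper names and the discover() stand-in for fetching a page are all assumptions made for illustration. Only the idea of checkpointing the queue to an .rds file and restoring it on the next run comes from the description above.

```r
# Hypothetical queue state holding the three pieces named above.
new_state <- function(urls) {
  list(pending = urls, seen = urls, handled = 0L)
}

# Restore a saved state if one exists, otherwise start from `seeds`.
load_or_init <- function(path, seeds) {
  if (file.exists(path)) readRDS(path) else new_state(seeds)
}

# Process up to `max_steps` pending URLs, checkpointing after each one.
# `discover` stands in for fetching a page and extracting its links.
run_steps <- function(state, path, discover, max_steps = Inf) {
  steps <- 0
  while (length(state$pending) > 0 && steps < max_steps) {
    url <- state$pending[[1]]
    state$pending <- state$pending[-1]
    links <- setdiff(discover(url), state$seen)  # skip keys already seen
    state$seen <- c(state$seen, links)
    state$pending <- c(state$pending, links)
    state$handled <- state$handled + 1L
    saveRDS(state, path)                         # checkpoint to disk
    steps <- steps + 1
  }
  state
}

# A toy link graph: "a" links to "b" and "c", "b" links to "c".
links <- list(a = c("b", "c"), b = "c", c = character(0))
discover <- function(url) links[[url]]
path <- tempfile(fileext = ".rds")

# First run is cut short after one URL; the second call restores the
# checkpoint and handles only what is still pending.
s1 <- run_steps(load_or_init(path, "a"), path, discover, max_steps = 1)
s2 <- run_steps(load_or_init(path, "a"), path, discover)
```

In this sketch the second call never revisits "a", because the restored seen set already contains it. That is the property the resume logic described above relies on.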

Call cr_persist() before cr_run(). For the "duckdb" backend, collect results with cr_collect() before cr_close().

Examples

if (FALSE) { # \dontrun{
crawler("https://example.com") |>
  cr_persist("runs/example", dataset = "duckdb") |>
  cr_on_html(\(ctx) ctx$push_data(list(url = ctx$request$url))) |>
  cr_run() |>
  cr_collect()
# Re-running the same pipeline resumes from the queue state saved in runs/example.
} # }