Wires a crawler to a directory on disk so that a crawl is reproducible and resumable. What it persists is listed under Details.
Usage
cr_persist(crawler, dir, dataset = c("jsonl", "duckdb", "memory"))
Arguments
- crawler
A Crawler.
- dir
Run directory (created if needed).
- dataset
Dataset backend to use: "jsonl" (default), "duckdb" or "memory" (not persisted).
Details
cr_persist() persists:
- the request queue state (queue.rds): pending requests, seen keys and handled count, checkpointed during cr_run();
- the dataset, via a persistent Dataset backend (dataset.jsonl or dataset.duckdb);
- binary content saved by ctx$save_body() (under kv/);
- a run manifest (manifest.rds, plus manifest.json when jsonlite is available).
If a queue state already exists in dir, the crawl resumes: the saved
pending/seen/handled state is restored, so cr_run() continues where it left
off and already-fetched URLs are not fetched again.
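Because resuming depends on whether a queue state already exists in dir, it can help to check a run directory before starting. The following is a minimal base R sketch, not part of the package: run_dir_status() is a hypothetical helper, and it assumes the file names listed above sit directly inside dir. It only tests for the presence of files and does not read queue.rds, whose internal structure is not documented here.

```r
# Hypothetical helper (not part of the package): report which of the
# documented run-directory files exist, based on file names only.
run_dir_status <- function(dir) {
  files <- c("queue.rds", "dataset.jsonl", "dataset.duckdb",
             "manifest.rds", "manifest.json")
  present <- file.exists(file.path(dir, files))
  names(present) <- files
  list(
    # A saved queue state is what makes the next cr_run() resume.
    will_resume = unname(present["queue.rds"]),
    present = files[present],
    # ctx$save_body() content is documented as living under kv/.
    has_saved_bodies = dir.exists(file.path(dir, "kv"))
  )
}

# Demonstration on a throwaway directory.
demo_dir <- tempfile("run-dir-demo")
dir.create(demo_dir)
before <- run_dir_status(demo_dir)

# Placeholder file standing in for a real queue state (assumption: only the
# file's existence matters for this check).
saveRDS(list(), file.path(demo_dir, "queue.rds"))
after <- run_dir_status(demo_dir)
```

Here before$will_resume is FALSE and after$will_resume is TRUE. To start a crawl from scratch rather than resume, the simplest option under these assumptions is to pass a directory that does not yet contain a queue state.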
Call cr_persist() before cr_run(). For the "duckdb" backend, collect
results with cr_collect() before cr_close().
Examples
if (FALSE) { # \dontrun{
  crawler("https://example.com") |>
    cr_persist("runs/example", dataset = "duckdb") |>
    cr_on_html(\(ctx) ctx$push_data(list(url = ctx$request$url))) |>
    cr_run() |>
    cr_collect()
  # Re-running the same pipeline resumes from runs/example.
} # }