Storage and resumable runs

library(crawlee)

Crawlee separates three kinds of storage: the request queue (what to crawl), the dataset (structured results), and the key-value store (binary blobs). The crawlee R package mirrors that split and adds a one-call setup for reproducible, resumable runs.

The dataset

Handlers call ctx$push_data() to append records; cr_collect() returns them as one tibble. By default the dataset lives in memory.

result <- crawler("https://books.toscrape.com/") |>
  cr_on_html(function(ctx) {
    ctx$push_data(list(url = ctx$request$url))
  }) |>
  cr_run() |>
  cr_collect()

For larger or longer crawls, choose a persistent backend with cr_dataset():

"jsonl" — append-only, schema-flexible, one JSON object per line;
"duckdb" — appended to a DuckDB table, ready for SQL.

crawler("https://books.toscrape.com/") |>
  cr_dataset(backend = "duckdb", path = "books.duckdb") |>
  cr_on_html(function(ctx) ctx$push_data(list(url = ctx$request$url))) |>
  cr_run()

Both persistent backends resume from an existing file: re-opening the same path keeps the rows already there and appends new ones after them.
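The resume behaviour of an append-only file can be illustrated outside the package. The sketch below is not crawlee's implementation: it assumes the jsonlite package is installed, and append_record() is a hypothetical helper introduced only for this example. It shows why writing to the same JSONL path in append mode keeps earlier rows, and why records with different fields can share one file.

```r
library(jsonlite)

# Hypothetical helper (not part of crawlee): write one record as one JSON line.
append_record <- function(path, record) {
  line <- as.character(toJSON(record, auto_unbox = TRUE))
  cat(line, "\n", sep = "", file = path, append = TRUE)
  invisible(path)
}

path <- tempfile(fileext = ".jsonl")

# First "run": two records.
append_record(path, list(url = "https://books.toscrape.com/"))
append_record(path, list(url = "https://books.toscrape.com/catalogue/page-2.html"))

# Second "run" against the same path: the file is extended, not overwritten.
# The extra `title` field illustrates the schema-flexible side of JSONL.
append_record(path, list(
  url   = "https://books.toscrape.com/catalogue/page-3.html",
  title = "Page 3"
))

# Read every line back into one data frame; missing fields become NA.
rows <- stream_in(file(path), verbose = FALSE)
nrow(rows)  # counts the rows written by both "runs"
```

A DuckDB table would behave similarly as long as new rows are appended to the existing table rather than the table being recreated.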

The key-value store

Use the key-value store for raw, non-tabular content — PDFs, images, page snapshots. ctx$save_body() writes the current response there, and cr_store() sets the directory.

crawler("https://example.com/report.pdf") |>
  cr_store("downloads") |>
  cr_on_pdf(function(ctx) {
    ctx$push_data(list(url = ctx$request$url, pages = length(ctx$pdf_text())))
    ctx$save_body(ext = "pdf") # -> downloads/<sanitised-url>.pdf
  }) |>
  cr_run()
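The comment above mentions a sanitised URL being used as the file name. The exact rule crawlee applies is not stated here, so the following is only a sketch of one plausible mapping; sanitise_key() is a hypothetical helper, not a crawlee function.

```r
# Hypothetical helper (crawlee's real rule may differ): turn a URL into a
# string that is safe to use as a file name.
sanitise_key <- function(url, ext = NULL) {
  key <- sub("^[A-Za-z][A-Za-z0-9+.-]*://", "", url)  # drop the scheme
  key <- gsub("[^A-Za-z0-9._-]+", "_", key)           # replace unsafe runs
  key <- gsub("^_+|_+$", "", key)                     # trim stray underscores
  # Add the extension only when the key does not already end with it.
  if (!is.null(ext) && !endsWith(key, paste0(".", ext))) {
    key <- paste0(key, ".", ext)
  }
  key
}

file.path("downloads", sanitise_key("https://example.com/report.pdf", ext = "pdf"))
```

A mapping like this is lossy: two different URLs can collapse to the same key, so a real store may need to append a hash or a counter to keep keys unique.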

The request queue and reproducibility

The request queue deduplicates by a normalised key (see cr_normalize_url()), so each URL is fetched at most once and repeated runs visit the same set of URLs, assuming the site itself has not changed. It can also persist its state (pending requests, seen keys and handled count), which is what makes a crawl resumable.
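Deduplication by a normalised key can be sketched in a few lines of base R. This is an illustration under stated assumptions, not crawlee's queue: normalise_key(), new_queue() and enqueue() are hypothetical names, and the normalisation rules (lower-case scheme and host, no fragment, no trailing slash) are chosen for the example and may differ from what cr_normalize_url() does.

```r
# Hypothetical normaliser; cr_normalize_url() may apply different rules.
normalise_key <- function(url) {
  url <- sub("#.*$", "", url)  # drop the fragment
  m <- regmatches(url, regexec("^([^:/?#]+)://([^/?#]*)(.*)$", url))[[1]]
  if (length(m) == 0) return(url)  # not an absolute URL: leave as-is
  rest <- sub("/+$", "", m[4])     # ignore trailing slashes
  paste0(tolower(m[2]), "://", tolower(m[3]), rest)
}

# Queue state mirrors the three pieces named above.
new_queue <- function() {
  list(pending = character(), seen = character(), handled = 0L)
}

# Add only URLs whose normalised key has not been seen before.
enqueue <- function(queue, urls) {
  for (url in urls) {
    key <- normalise_key(url)
    if (!key %in% queue$seen) {
      queue$seen    <- c(queue$seen, key)
      queue$pending <- c(queue$pending, url)
    }
  }
  queue
}

queue <- enqueue(new_queue(), c(
  "https://books.toscrape.com/",
  "https://Books.ToScrape.com",       # same key: case and trailing slash
  "https://books.toscrape.com/#top",  # same key: fragment only
  "https://books.toscrape.com/catalogue/page-2.html"
))
queue$pending  # two URLs remain after deduplication
```

Keeping the seen keys separate from the pending list is what lets a queue skip a URL even after it has already been handled and removed from the pending list.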

One-call setup: cr_persist()

cr_persist(dir) wires everything to a run directory:

the queue is checkpointed to queue.rds during the run;
the dataset uses a persistent backend (dataset.jsonl or dataset.duckdb);
ctx$save_body() writes under kv/;
a manifest (manifest.rds / manifest.json) records the start URLs, an options snapshot and run statistics.

crawl <- crawler("https://books.toscrape.com/") |>
  cr_persist("runs/books", dataset = "duckdb") |>
  cr_on_html(function(ctx) {
    ctx$push_data(list(url = ctx$request$url))
    ctx$enqueue_links(glob = "*/catalogue/*")
  }) |>
  cr_run()

data <- cr_collect(crawl)
cr_close(crawl) # release the DuckDB connection

Resuming

If a run is interrupted, run the exact same pipeline again. Because the state already exists in runs/books, cr_persist() restores it and the crawl continues where it left off — already-fetched URLs are skipped.

# Same code as above: it resumes instead of starting over.
crawler("https://books.toscrape.com/") |>
  cr_persist("runs/books", dataset = "duckdb") |>
  cr_on_html(function(ctx) {
    ctx$push_data(list(url = ctx$request$url))
    ctx$enqueue_links(glob = "*/catalogue/*")
  }) |>
  cr_run()

For the DuckDB backend, call cr_collect() before cr_close() — closing releases the connection.
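The idea behind resuming, restoring a checkpoint if one exists and otherwise starting fresh, can be sketched with base R's saveRDS() and readRDS(). This is a simplified model and not crawlee's internals: load_or_init() and run() are hypothetical helpers, the fetch step is left as a comment, and max_steps exists only to simulate an interruption.

```r
# Hypothetical helper: restore saved state, or start from the given URLs.
load_or_init <- function(path, start_urls) {
  if (file.exists(path)) return(readRDS(path))
  list(pending = start_urls, seen = start_urls, handled = 0L)
}

# Process pending URLs one at a time, checkpointing after each one.
run <- function(state, path, max_steps = Inf) {
  steps <- 0
  while (length(state$pending) > 0 && steps < max_steps) {
    url <- state$pending[1]
    # ... fetch `url` and run the handlers here ...
    state$pending <- state$pending[-1]
    state$handled <- state$handled + 1L
    saveRDS(state, path)  # checkpoint so a later call can pick up from here
    steps <- steps + 1
  }
  state
}

checkpoint <- tempfile(fileext = ".rds")
urls <- paste0("https://books.toscrape.com/catalogue/page-", 1:5, ".html")

# First run is "interrupted" after two URLs.
first <- run(load_or_init(checkpoint, urls), checkpoint, max_steps = 2)

# Second run uses the same calls and continues with the remaining three.
second <- run(load_or_init(checkpoint, urls), checkpoint)
second$handled
```

Because both calls go through load_or_init(), the second run never needs to know whether it is a fresh start or a continuation, which is the same property that lets the pipeline above be rerun unchanged.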
