Storage and resumable runs
Crawlee separates three kinds of storage: the request queue (what to crawl), the dataset (structured results) and the key-value store (binary blobs). crawlee mirrors that split and adds a one-call setup for reproducible, resumable runs.
The dataset
Handlers call ctx$push_data() to append records;
cr_collect() returns them as one tibble. By default the
dataset lives in memory.
result <- crawler("https://books.toscrape.com/") |>
  cr_on_html(function(ctx) {
    ctx$push_data(list(url = ctx$request$url))
  }) |>
  cr_run() |>
  cr_collect()

For larger or longer crawls, choose a persistent backend with cr_dataset():
- "jsonl": append-only, schema-flexible, one JSON object per line;
- "duckdb": appended to a DuckDB table, ready for SQL.
crawler("https://books.toscrape.com/") |>
  cr_dataset(backend = "duckdb", path = "books.duckdb") |>
  cr_on_html(function(ctx) ctx$push_data(list(url = ctx$request$url))) |>
  cr_run()

Both persistent backends resume from an existing file: re-opening the same path keeps the rows already there.
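To make the append-only behaviour concrete, here is a minimal base-R sketch of how a JSONL file can accumulate rows across runs. It is an illustration only, not the package's implementation: append_record() is a hypothetical helper, and the hand-built JSON string skips the escaping a real JSON library would do.

```r
# Hypothetical helper, not part of crawlee: append one record as a JSON line.
# sprintf() does no escaping; real code would use a JSON library instead.
append_record <- function(path, url, title) {
  line <- sprintf('{"url": "%s", "title": "%s"}', url, title)
  cat(line, "\n", sep = "", file = path, append = TRUE)
}

path <- tempfile(fileext = ".jsonl")

# First "run" writes two records.
append_record(path, "https://example.com/a", "A")
append_record(path, "https://example.com/b", "B")

# Second "run" re-opens the same path; the earlier rows are still there.
append_record(path, "https://example.com/c", "C")

rows <- readLines(path)
length(rows)
```

Because every write is an append, nothing has to be loaded or rewritten when a later run picks up the same file.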
The key-value store
Use the key-value store for raw, non-tabular content — PDFs, images,
page snapshots. ctx$save_body() writes the current response
there, and cr_store() sets the directory.
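As a rough mental model, a key-value store for raw content can be pictured as a directory in which each key is a file name and each value is the file's bytes. The base-R sketch below assumes exactly that layout; kv_put() and kv_get() are hypothetical helpers, and the package's actual on-disk format may differ.

```r
# Hypothetical sketch: a directory used as a key-value store for raw bytes.
kv_dir <- file.path(tempdir(), "kv-sketch")
dir.create(kv_dir, showWarnings = FALSE)

# Store the bytes under a file named after the key.
kv_put <- function(key, bytes) writeBin(bytes, file.path(kv_dir, key))

# Read the whole file back as a raw vector.
kv_get <- function(key) {
  f <- file.path(kv_dir, key)
  readBin(f, what = "raw", n = file.size(f))
}

body <- charToRaw("%PDF-1.7 pretend this is a PDF")
kv_put("report.pdf", body)
identical(kv_get("report.pdf"), body)
```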
The request queue and reproducibility
The request queue deduplicates by a normalised key (see
cr_normalize_url()), so each URL is fetched at most once
and a crawl is deterministic. It can also persist its state — pending
requests, seen keys, handled count — which is what makes a crawl
resumable.
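The deduplication idea can be sketched in a few lines of base R. The normaliser below is a deliberately simple stand-in chosen for illustration (lowercase everything, drop the fragment and trailing slashes); cr_normalize_url() may apply different rules, and normalize_key() and enqueue() are hypothetical names.

```r
# Hypothetical normaliser for illustration only. Lowercasing the whole URL is
# a simplification; paths are case-sensitive on many servers.
normalize_key <- function(url) {
  url <- tolower(url)
  url <- sub("#.*$", "", url)   # drop the fragment
  sub("/+$", "", url)           # drop trailing slashes
}

queue <- list(pending = character(), seen = character())

# Add a URL only if its normalised key has not been seen before.
enqueue <- function(queue, url) {
  key <- normalize_key(url)
  if (!key %in% queue$seen) {
    queue$seen <- c(queue$seen, key)
    queue$pending <- c(queue$pending, url)
  }
  queue
}

urls <- c(
  "https://example.com/page",
  "https://example.com/page/",
  "https://EXAMPLE.com/page#top",
  "https://example.com/other"
)
for (u in urls) queue <- enqueue(queue, u)

queue$pending
```

The first three URLs collapse to the same key, so only two requests end up pending. The `seen` vector is the part that has to be persisted for a later run to skip URLs it has already handled.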
One-call setup: cr_persist()
cr_persist(dir) wires everything to a run directory:
- the queue is checkpointed to queue.rds during the run;
- the dataset uses a persistent backend (dataset.jsonl or dataset.duckdb);
- ctx$save_body() writes under kv/;
- a manifest (manifest.rds / manifest.json) records the start URLs, an options snapshot and run statistics.
crawl <- crawler("https://books.toscrape.com/") |>
  cr_persist("runs/books", dataset = "duckdb") |>
  cr_on_html(function(ctx) {
    ctx$push_data(list(url = ctx$request$url))
    ctx$enqueue_links(glob = "*/catalogue/*")
  }) |>
  cr_run()
data <- cr_collect(crawl)
cr_close(crawl) # release the DuckDB connection

Resuming
If a run is interrupted, run the exact same pipeline
again. Because the state already exists in
runs/books, cr_persist() restores it and the
crawl continues where it left off — already-fetched URLs are
skipped.
# Same code as above: it resumes instead of starting over.
crawler("https://books.toscrape.com/") |>
  cr_persist("runs/books", dataset = "duckdb") |>
  cr_on_html(function(ctx) {
    ctx$push_data(list(url = ctx$request$url))
    ctx$enqueue_links(glob = "*/catalogue/*")
  }) |>
  cr_run()

For the DuckDB backend, call cr_collect() before cr_close(), because closing releases the connection.
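The resume behaviour described in this section follows a general checkpoint-and-restore pattern, which the base-R sketch below illustrates. It is not the package's implementation: run_crawl() and its max_steps argument are hypothetical, the "fetch" is a placeholder, and the state is a plain list saved with saveRDS().

```r
# Hypothetical sketch of checkpoint-and-resume; not crawlee's implementation.
run_crawl <- function(state_file, start_urls, max_steps = Inf) {
  # Restore saved state if it exists, otherwise start fresh.
  state <- if (file.exists(state_file)) {
    readRDS(state_file)
  } else {
    list(pending = start_urls, handled = character())
  }

  steps <- 0
  while (length(state$pending) > 0 && steps < max_steps) {
    url <- state$pending[1]
    # A real crawler would fetch `url` here.
    state$pending <- state$pending[-1]
    state$handled <- c(state$handled, url)
    saveRDS(state, state_file)  # checkpoint after every request
    steps <- steps + 1
  }
  state
}

state_file <- tempfile(fileext = ".rds")
urls <- c(
  "https://example.com/1",
  "https://example.com/2",
  "https://example.com/3"
)

# First run is "interrupted" after one request.
first <- run_crawl(state_file, urls, max_steps = 1)

# Running the same call again resumes from the checkpoint.
second <- run_crawl(state_file, urls)
second$handled
```

The second call ignores its start URLs because saved state takes precedence, which is why re-running identical code continues the crawl rather than restarting it.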