The stateful object at the center of crawlee. It holds the request queue,
the dataset, the registered handlers and the run configuration. You will
rarely create one with Crawler$new() directly; use crawler() and the
cr_* verbs, which return the crawler invisibly so they compose with the
native pipe (|>).
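The invisible return is what makes the verbs chainable. The sketch below is not crawlee code: `ToyCrawler` and `toy_set_option()` are hypothetical stand-ins that only illustrate the pattern, assuming the R6 package is installed and R 4.1 or later for the native pipe.

```r
library(R6)

# Hypothetical stand-in for the real Crawler class.
ToyCrawler <- R6Class("ToyCrawler",
  public = list(
    options = list(),
    set_options = function(...) {
      # Merge the named overrides into the stored options.
      self$options <- modifyList(self$options, list(...))
      invisible(self)
    }
  )
)

# Hypothetical verb: mutates the crawler, then returns it invisibly
# so the next step in the pipe receives the same object.
toy_set_option <- function(crawler, ...) {
  crawler$set_options(...)
  invisible(crawler)
}

cr <- ToyCrawler$new() |>
  toy_set_option(depth = 2) |>
  toy_set_option(delay = 0.5)

str(cr$options)
```

Because the object has reference semantics, every step in the chain modifies the same crawler; the pipe only passes the reference along.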
Public fields
options
Named list of run options.
queue
The RequestQueue.
dataset
The Dataset.
handlers
Named list of label-specific handlers.
defaults
Named list of default handlers by content kind
(html, pdf, any).
kv
Lazily-created KeyValueStore for binary content.
mode
Fetch mode, "http" (default) or "browser".
stats
Named list of run statistics.
Methods
Crawler$new()
Create a crawler.
Arguments
start_urls
Character vector of seed URLs.
...
Options forwarded to cr_options().
Crawler$set_options()
Update one or more options.
Arguments
...
Named options to override.
Crawler$set_handler()
Register a handler for a label, or a default handler for a content kind when label is NULL.
Usage
Crawler$set_handler(handler, label = NULL, kind = "html")
Arguments
handler
A function of one argument, the handler context.
label
Optional label; NULL registers a default handler.
kind
Content kind for the default handler ("html", "pdf",
"any"). Ignored when label is given.
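The registration rule described by these arguments can be sketched with a toy class. `ToyRegistry` is hypothetical and mirrors only the documented behavior (a label goes into `handlers`, otherwise the handler becomes the default for `kind`); the real method may validate or store handlers differently.

```r
library(R6)

# Hypothetical class that mimics only the documented registration rule.
ToyRegistry <- R6Class("ToyRegistry",
  public = list(
    handlers = list(),
    defaults = list(),
    set_handler = function(handler, label = NULL, kind = "html") {
      stopifnot(is.function(handler))
      if (!is.null(label)) {
        # A label takes precedence; kind is ignored.
        self$handlers[[label]] <- handler
      } else {
        kind <- match.arg(kind, c("html", "pdf", "any"))
        self$defaults[[kind]] <- handler
      }
      invisible(self)
    }
  )
)

reg <- ToyRegistry$new()
reg$set_handler(function(ctx) "default html")
reg$set_handler(function(ctx) "default pdf", kind = "pdf")
# kind = "pdf" has no effect here because a label is given.
reg$set_handler(function(ctx) "product page", label = "product", kind = "pdf")

names(reg$handlers)
names(reg$defaults)
```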
Crawler$get_kv()
Return the key-value store for binary content, creating it on first use.
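Lazy creation of this kind usually means the field stays NULL until the getter is first called. A minimal sketch, assuming a hypothetical `ToyLazy` class in which a plain environment stands in for the real KeyValueStore:

```r
library(R6)

# Hypothetical class; an environment stands in for the real store.
ToyLazy <- R6Class("ToyLazy",
  public = list(
    kv = NULL,
    get_kv = function() {
      if (is.null(self$kv)) {
        # Created only on first access.
        self$kv <- new.env(parent = emptyenv())
      }
      self$kv
    }
  )
)

x <- ToyLazy$new()
was_null <- is.null(x$kv)   # nothing has been created yet
store <- x$get_kv()         # first call creates the store
same <- identical(store, x$get_kv())  # later calls reuse it
```

A crawl that never handles binary content therefore never pays for the store.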
Crawler$set_persist_dir()
Set the run directory where the manifest is written.
Usage
Crawler$set_persist_dir(dir)
Crawler$close()
Release resources (browser session, DuckDB connection).
Crawler$run()
Run the crawl until the queue drains or a limit is hit.
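The two stopping conditions can be illustrated with a plain loop. This is a sketch of the idea only, not the crawler's implementation: `toy_run()` is hypothetical, the queue is a character vector, and `max_requests` is an illustrative limit rather than a documented option.

```r
# Hypothetical loop: stop when the queue is empty or the limit is reached.
toy_run <- function(queue, handler, max_requests = Inf) {
  processed <- 0
  while (length(queue) > 0 && processed < max_requests) {
    url <- queue[[1]]
    queue <- queue[-1]
    # The handler may return newly discovered URLs to enqueue.
    found <- handler(url)
    queue <- c(queue, setdiff(found, queue))
    processed <- processed + 1
  }
  list(processed = processed, remaining = queue)
}

# Toy handler: the seed page "a" links to "b" and "c"; other pages link nowhere.
links <- function(url) if (url == "a") c("b", "c") else character(0)

drained <- toy_run("a", links)
limited <- toy_run("a", links, max_requests = 2)
```

Here `drained` ends with an empty queue, while `limited` stops early and leaves one URL unprocessed.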
Crawler$clone()
The objects of this class are cloneable with this method.
Usage
Crawler$clone(deep = FALSE)
Arguments
deep
Whether to make a deep clone.
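The difference matters for objects that hold other R6 objects. The sketch below uses hypothetical `ToyQueue` and `ToyOwner` classes, on the assumption that fields such as the queue and dataset are themselves R6 objects: a shallow clone shares them with the original, while a deep clone gets its own copies.

```r
library(R6)

# Hypothetical classes standing in for a crawler and its queue.
ToyQueue <- R6Class("ToyQueue",
  public = list(
    urls = character(),
    add = function(url) {
      self$urls <- c(self$urls, url)
      invisible(self)
    }
  )
)

ToyOwner <- R6Class("ToyOwner",
  public = list(
    queue = NULL,
    initialize = function() {
      self$queue <- ToyQueue$new()
    }
  )
)

a <- ToyOwner$new()
shallow <- a$clone()
deep <- a$clone(deep = TRUE)

a$queue$add("https://example.com")

length(shallow$queue$urls)  # shares a's queue, so it sees the new URL
length(deep$queue$urls)     # independent copy, unaffected
```

R6's default deep clone copies only fields that are R6 objects. Other reference objects, such as plain environments or external connections, remain shared unless the class defines its own `deep_clone` method.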