This article covers the two sides of crawling at scale, following Crawlee’s Scaling our crawlers
and Avoid getting blocked guides: going faster
(concurrency) while staying polite (rate limits and
robots.txt).
Being a good web citizen
By default crawlee is conservative and respectful:
-
robots.txtis honoured (respect_robots = TRUE): disallowed URLs are skipped, and aCrawl-delaydirective is applied. - set a descriptive
user_agentso site owners can identify your crawler; -
delayadds a pause between requests; -
max_requestsandmax_depthbound the crawl; - failed requests are retried
(
max_retries) with backoff.
crawler("https://books.toscrape.com/") |>
cr_options(
user_agent = "my-research-bot (you@example.com)",
delay = 0.5, # seconds between requests
max_requests = 500,
max_depth = 4,
respect_robots = TRUE
) |>
cr_on_html(function(ctx) ctx$enqueue_links()) |>
cr_run()Going faster
The default engine is sequential. For higher throughput there are three concurrent engines; all keep handlers running sequentially in R (so your dataset and queue are never touched concurrently) — only the network I/O runs in parallel.
Fixed-concurrency batches — cr_parallel()
Drains the queue in batches whose network requests run together.
crawler("https://books.toscrape.com/") |>
cr_parallel(concurrency = 8) |>
cr_on_html(function(ctx) ctx$enqueue_links()) |>
cr_run()Adaptive batches — cr_autoscale()
Like cr_parallel(), but the batch size adapts at run
time (additive-increase on clean batches, halving on back-pressure such
as HTTP 429/503 or transport failures), staying within
[min, max].
crawler("https://books.toscrape.com/") |>
cr_autoscale(min = 2, max = 16) |>
cr_on_html(function(ctx) ctx$enqueue_links()) |>
cr_run()Continuous streaming pool — cr_stream()
Keeps concurrency requests in flight at all times: the
moment one finishes, its handler runs and the next request is pulled in.
This avoids the batch engines’ “wait for the slowest request in the
batch” stall and shines when response latency varies a lot.
crawler("https://books.toscrape.com/") |>
cr_stream(concurrency = 10) |>
cr_on_html(function(ctx) ctx$enqueue_links()) |>
cr_run()Choosing an engine
| Engine | When to use |
|---|---|
| sequential (default) | small crawls; strict per-request pacing |
cr_parallel() |
steady throughput with a known good concurrency |
cr_autoscale() |
unknown/variable server capacity — let it find the level |
cr_stream() |
many pages with widely varying latency; maximum throughput |
Concurrency and politeness pull in opposite directions. The batch engines apply
delay/Crawl-delaybetween batches; the streaming engine treats concurrency itself as the throttle and does not enforce per-request pacing. For strict rate limits, prefer a batch engine with adelay.
Combining with persistence
Any engine composes with cr_persist() for resumable,
checkpointed runs:
crawler("https://books.toscrape.com/") |>
cr_autoscale(min = 2, max = 16) |>
cr_persist("runs/books", dataset = "duckdb") |>
cr_on_html(function(ctx) ctx$enqueue_links()) |>
cr_run()