This article follows the same path as the Crawlee fundamentals: start from a single
page, then teach the crawler to follow links,
control its scope, route different
page types and discover URLs from a sitemap. The
examples target books.toscrape.com, a
public sandbox built for practising web scraping.
The model
A crawler owns three things:
- a request queue — a deduplicating, resumable list of URLs to visit;
- one or more handlers — functions run on each fetched page;
- a dataset — the structured records your handlers produce.
You build a crawler with crawler() and configure it with
cr_* verbs that compose through the native pipe
(|>), then run it with cr_run().
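To make the composition concrete: the native pipe passes its left-hand side as the first argument of the call on its right, so a chain of verbs is equivalent to nested function calls. The sketch below uses hypothetical stand-ins (toy_crawler(), toy_options(), toy_on_html()) that are not part of the package, and it assumes R 4.1 or later for |>.

```r
# Hypothetical stand-ins, not the package's functions: each "verb" takes a
# configuration object as its first argument and returns a modified copy.
toy_crawler <- function(start_url) {
  list(start_urls = start_url, options = list(), handlers = list())
}

toy_options <- function(crawler, ...) {
  crawler$options <- utils::modifyList(crawler$options, list(...))
  crawler
}

toy_on_html <- function(crawler, handler) {
  crawler$handlers <- c(crawler$handlers, list(handler))
  crawler
}

# Piped form: reads top to bottom.
piped <- toy_crawler("https://books.toscrape.com/") |>
  toy_options(max_requests = 50) |>
  toy_on_html(function(ctx) NULL)

# Nested form: the same calls written inside out.
nested <- toy_on_html(
  toy_options(toy_crawler("https://books.toscrape.com/"), max_requests = 50),
  function(ctx) NULL
)

str(piped$options)
```

Because every verb returns the configuration object, the order of the verbs in the chain only matters where one verb overrides another.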
Your first crawler
Fetch a single page and extract a couple of fields. The handler
receives a context object (ctx) exposing the parsed page
and the action push_data().
result <- crawler("https://books.toscrape.com/") |>
  cr_on_html(function(ctx) {
    ctx$push_data(list(
      url = ctx$request$url,
      title = ctx$page |> rvest::html_element("title") |> rvest::html_text2()
    ))
  }) |>
  cr_run() |>
  cr_collect()
result

Following links
Real crawls discover new URLs as they go.
ctx$enqueue_links() extracts links from the current page
and adds them to the queue; the crawler keeps going until the queue
drains. Because the queue deduplicates by a normalised URL, each page is
visited at most once.
crawler("https://books.toscrape.com/") |>
  cr_on_html(function(ctx) {
    ctx$push_data(list(url = ctx$request$url))
    ctx$enqueue_links() # follow every same-domain link
  }) |>
  cr_options(max_requests = 50) |>
  cr_run()

enqueue_links() only follows same-domain links by
default, so a crawl cannot wander off across the whole web.
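The queue behaviour described above (deduplicate by a normalised URL, keep going until the queue drains or a request cap is hit) can be sketched in base R. This is an illustration only: normalise_url(), new_queue() and the toy link table are hypothetical, and the normalisation rules shown (lower-case scheme and host, drop the fragment, drop a trailing slash) are assumptions, not the package's documented rules.

```r
# Assumed normalisation: strip the #fragment, lower-case scheme and host,
# and drop one trailing slash. Real crawlers may apply different rules.
normalise_url <- function(url) {
  url <- sub("#.*$", "", url)
  m <- regmatches(url, regexec("^([A-Za-z][A-Za-z0-9+.-]*://[^/?]+)(.*)$", url))[[1]]
  if (length(m) == 0) return(url)          # not an absolute URL: leave as-is
  paste0(tolower(m[2]), sub("/$", "", m[3]))
}

new_queue <- function() {
  seen <- new.env()        # every normalised URL ever enqueued
  pending <- character()   # URLs still waiting to be visited
  enqueue <- function(urls) {
    for (u in urls) {
      key <- normalise_url(u)
      if (!exists(key, envir = seen, inherits = FALSE)) {
        assign(key, TRUE, envir = seen)
        pending <<- c(pending, key)
      }
    }
    invisible(length(pending))
  }
  dequeue <- function() {
    if (length(pending) == 0) return(NULL)
    nxt <- pending[1]
    pending <<- pending[-1]
    nxt
  }
  list(enqueue = enqueue, dequeue = dequeue, size = function() length(pending))
}

# Toy link table standing in for "links found on each fetched page".
links <- list(
  "https://books.toscrape.com" = c("https://books.toscrape.com/index.html",
                                   "https://books.toscrape.com/#top"),
  "https://books.toscrape.com/index.html" = "https://Books.toscrape.com/"
)

q <- new_queue()
q$enqueue("https://books.toscrape.com/")
visited <- character()
max_requests <- 50
while (length(visited) < max_requests && !is.null(url <- q$dequeue())) {
  visited <- c(visited, url)
  q$enqueue(links[[url]])   # duplicates of already-seen URLs are ignored
}
visited
```

In this toy run the "#top" and mixed-case variants collapse onto the start URL, so only two distinct pages are visited.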
Controlling scope
You rarely want every link. enqueue_links()
takes glob (a shorthand for include),
include/exclude patterns and a
same_domain flag; the crawler itself enforces
max_depth and max_requests.
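As a rough illustration of include/exclude filtering, the base R sketch below keeps a URL when it matches at least one include glob and no exclude glob. The helper names are hypothetical, and it assumes that "*" matches any run of characters including "/"; the package's own glob semantics may differ.

```r
# TRUE for each URL that matches at least one of the globs.
matches_any <- function(urls, globs) {
  hit <- rep(FALSE, length(urls))
  for (g in globs) hit <- hit | grepl(utils::glob2rx(g), urls)
  hit
}

# Keep URLs that match an include glob and no exclude glob.
filter_links <- function(urls, include = "*", exclude = character()) {
  urls[matches_any(urls, include) & !matches_any(urls, exclude)]
}

found <- c(
  "https://books.toscrape.com/catalogue/page-2.html",
  "https://books.toscrape.com/catalogue/category/books/travel_2/index.html",
  "https://books.toscrape.com/index.html"
)

kept <- filter_links(found, include = "*/catalogue/*", exclude = "*/category/*")
kept
```

Here only the pagination URL survives: the category page is excluded and the home page never matches the include pattern.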
crawler("https://books.toscrape.com/") |>
  cr_options(max_depth = 3, max_requests = 200) |>
  cr_on_html(function(ctx) {
    ctx$push_data(list(url = ctx$request$url, depth = ctx$request$depth))
    ctx$enqueue_links(
      glob = "*/catalogue/*", # only follow catalogue pages
      exclude = "*/category/*"
    )
  }) |>
  cr_run() |>
  cr_collect()

Routing different page types
Most sites have a few kinds of page — listings vs. detail pages, say.
Give a label when enqueuing and register a handler for that
label. Listing pages enqueue detail pages; detail pages extract the
data.
books <- crawler("https://books.toscrape.com/") |>
  # listing pages: enqueue book detail pages, labelled "book"
  cr_on_html(function(ctx) {
    ctx$enqueue_links(glob = "*/catalogue/*index.html", label = "book")
    ctx$enqueue_links(glob = "*/page-*.html") # pagination, default handler
  }) |>
  # detail pages
  cr_on_html(label = "book", function(ctx) {
    ctx$push_data(list(
      title = ctx$page |> rvest::html_element("h1") |> rvest::html_text2(),
      price = ctx$page |> rvest::html_element(".price_color") |> rvest::html_text2()
    ))
  }) |>
  cr_run() |>
  cr_collect()
books

A request’s label always wins over the content-kind
default, so labelled routing and
cr_on_html()/cr_on_pdf() defaults compose
cleanly.
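The "label wins over content kind" rule can be pictured as a two-step lookup. The sketch below is hypothetical: pick_handler() and the two handler tables are illustrative names, and falling back to the content-kind handler when a label has no registered handler is an assumption rather than documented behaviour.

```r
# Choose a handler for a request: a registered label takes precedence,
# otherwise use the handler for the response's content kind.
pick_handler <- function(request, label_handlers, kind_handlers) {
  label <- request$label
  if (!is.null(label) && !is.null(label_handlers[[label]])) {
    return(label_handlers[[label]])
  }
  kind_handlers[[request$kind]]
}

label_handlers <- list(
  book = function(req) paste("book detail:", req$url)
)
kind_handlers <- list(
  html = function(req) paste("generic html:", req$url),
  pdf  = function(req) paste("generic pdf:", req$url)
)

requests <- list(
  list(url = "https://books.toscrape.com/catalogue/page-2.html", kind = "html"),
  list(url = "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
       kind = "html", label = "book")
)

routed <- vapply(
  requests,
  function(r) pick_handler(r, label_handlers, kind_handlers)(r),
  character(1)
)
routed
```

Both requests are HTML, but only the unlabelled one reaches the generic HTML handler; the labelled one is routed to the "book" handler.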
Crawling from a sitemap
When a site publishes a sitemap.xml, you can seed the
queue directly from it instead of discovering links page by page —
cr_from_sitemap() handles sitemap indexes and gzipped
sitemaps, and can filter by glob or by <lastmod>
date.
crawler() |>
  cr_from_sitemap("https://books.toscrape.com/sitemap.xml", label = "book") |>
  cr_on_html(label = "book", function(ctx) {
    ctx$push_data(list(url = ctx$request$url))
  }) |>
  cr_run() |>
  cr_collect()

The companion cr_from_rss() does the same for RSS and
Atom feeds.
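For a sense of what seeding from a sitemap involves, the sketch below parses a small inline sitemap with the xml2 package and filters entries by glob and by <lastmod> date. It is not how cr_from_sitemap() is implemented: the sample XML, the cutoff date and the variable names are made up for illustration, and sitemap indexes and gzipped files are left out.

```r
library(xml2)

# A made-up sitemap in the sitemaps.org format.
sitemap_xml <- '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/catalogue/a.html</loc><lastmod>2024-03-01</lastmod></url>
  <url><loc>https://example.com/catalogue/b.html</loc><lastmod>2023-11-15</lastmod></url>
  <url><loc>https://example.com/about.html</loc><lastmod>2024-05-20</lastmod></url>
</urlset>'

doc <- read_xml(sitemap_xml)
xml_ns_strip(doc)  # drop the default namespace so plain XPath names work

entries <- xml_find_all(doc, ".//url")
sitemap <- data.frame(
  loc = xml_text(xml_find_first(entries, "./loc")),
  lastmod = as.Date(xml_text(xml_find_first(entries, "./lastmod"))),
  stringsAsFactors = FALSE
)

# Keep catalogue pages modified on or after an arbitrary cutoff date.
keep <- grepl(utils::glob2rx("*/catalogue/*"), sitemap$loc) &
  !is.na(sitemap$lastmod) &
  sitemap$lastmod >= as.Date("2024-01-01")
seeds <- sitemap$loc[keep]
seeds
```

The resulting vector is the kind of URL list a crawler could use as its starting queue in place of link discovery.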
Rendering JavaScript pages
If a page builds its content with JavaScript, the plain HTTP backend
sees an empty shell. Switch to the headless-browser backend with
cr_use_browser() (requires the package and a
Chrome/Chromium install). Handlers are unchanged; you additionally get
ctx$screenshot().
crawler("https://example.com") |>
  cr_use_browser(wait_selector = ".content") |>
  cr_on_html(function(ctx) {
    ctx$push_data(list(url = ctx$request$url))
    ctx$screenshot()
  }) |>
  cr_run()

Where next
- Politeness & speed: robots.txt is respected by default; cr_options(delay = ) rate-limits, and cr_parallel() fetches concurrently.
- Documents: cr_on_pdf() extracts text from PDFs; ctx$save_body() stores raw files in a key-value store.
- Reproducible, resumable runs: cr_persist(dir) checkpoints the queue and persists the dataset, so an interrupted crawl continues where it left off.
- RAG: cr_chunk(), cr_embed() and cr_export() turn crawled text into a retrieval-ready table.