Fetches a sitemap (or sitemap index, recursively) and enqueues the page URLs
it lists. Supports gzipped sitemaps, glob filtering and a since filter on
<lastmod> for incremental re-crawls of large sites that publish dated
sitemaps.
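The recursion into sitemap indexes can be pictured with a small base-R sketch. This is only an illustration under stated assumptions, not the package's implementation: `docs`, `extract_locs()` and `collect_urls()` are hypothetical names, the "site" is an in-memory list rather than fetched over the network, `<loc>` values are pulled out with a regular expression instead of an XML parser, and the way levels are counted here may differ from how `max_levels` is counted in the package.

```r
# Toy in-memory "site": URL -> XML body (hypothetical data, no network access).
docs <- list(
  "https://example.com/sitemap.xml" = paste0(
    "<sitemapindex>",
    "<sitemap><loc>https://example.com/posts.xml</loc></sitemap>",
    "</sitemapindex>"
  ),
  "https://example.com/posts.xml" = paste0(
    "<urlset>",
    "<url><loc>https://example.com/a</loc></url>",
    "<url><loc>https://example.com/b</loc></url>",
    "</urlset>"
  )
)

# Pull the text of every <loc> element out of one XML string.
extract_locs <- function(xml) {
  m <- regmatches(xml, gregexpr("<loc>[^<]*</loc>", xml))[[1]]
  trimws(gsub("</?loc>", "", m))
}

# Return page URLs; follow index entries until `max_levels` is reached.
collect_urls <- function(url, max_levels = 3L, level = 0L) {
  xml <- docs[[url]]
  locs <- extract_locs(xml)
  # A plain sitemap (<urlset>) lists pages directly.
  if (!grepl("<sitemapindex", xml, fixed = TRUE)) return(locs)
  # An index lists further sitemaps; stop descending at the depth limit.
  if (level >= max_levels) return(character(0))
  nested <- lapply(locs, collect_urls,
                   max_levels = max_levels, level = level + 1L)
  as.character(unlist(nested))
}

collect_urls("https://example.com/sitemap.xml")
```

In this sketch a depth limit of zero means the index is read but none of its child sitemaps are followed, so no page URLs come back.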
Usage
cr_from_sitemap(
crawler,
url,
label = NULL,
include = NULL,
exclude = NULL,
since = NULL,
max = Inf,
max_levels = 3L
)
Arguments
- crawler
A Crawler.
- url
URL of a sitemap.xml or sitemap index.
- label
Optional handler label routing the enqueued URLs.
- include, exclude
Optional glob patterns (see cr_on_html()).
- since
Optional date (or YYYY-MM-DD string); only URLs whose <lastmod> is on or after this date are enqueued (URLs without a lastmod are kept).
- max
Maximum number of URLs to enqueue.
- max_levels
Maximum recursion depth into nested sitemap indexes.
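How the include, exclude, since and max arguments might combine can be sketched in base R. This is a stand-in written for illustration, not the package's code: `entries`, `matches_any()` and `filter_entries()` are hypothetical names, globs are translated with `utils::glob2rx()` (the package's own glob rules, described under cr_on_html(), may differ), and `<lastmod>` values are compared on their date part only.

```r
# Hypothetical sitemap entries; `lastmod` is NA where the sitemap omits it.
entries <- data.frame(
  loc = c(
    "https://example.com/blog/a",
    "https://example.com/blog/draft-b",
    "https://example.com/blog/c",
    "https://example.com/tag/x"
  ),
  lastmod = c("2026-02-10", "2026-03-01", NA, "2025-06-30T12:00:00+00:00"),
  stringsAsFactors = FALSE
)

# TRUE where a URL matches at least one glob (base R glob semantics).
matches_any <- function(urls, globs) {
  Reduce(`|`, lapply(globs, function(g) grepl(utils::glob2rx(g), urls)))
}

# Illustrative stand-in for the documented filters (assumed behaviour).
filter_entries <- function(entries, include = NULL, exclude = NULL,
                           since = NULL, max = Inf) {
  keep <- rep(TRUE, nrow(entries))
  if (!is.null(include)) keep <- keep & matches_any(entries$loc, include)
  if (!is.null(exclude)) keep <- keep & !matches_any(entries$loc, exclude)
  if (!is.null(since)) {
    # Compare on the date part only; entries without a lastmod are kept.
    lastmod <- as.Date(substr(entries$lastmod, 1, 10))
    keep <- keep & (is.na(lastmod) | lastmod >= as.Date(since))
  }
  kept <- entries$loc[keep]
  # Cap the number of URLs, mirroring the documented `max` argument.
  if (length(kept) > max) kept <- kept[seq_len(max)]
  kept
}

filter_entries(
  entries,
  include = "https://example.com/blog/*",
  exclude = "*draft*",
  since = "2026-01-01"
)
```

The entry without a lastmod passes the since filter, matching the rule stated for that argument, while the draft URL is dropped by exclude even though its date is recent.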
Examples
if (FALSE) { # \dontrun{
crawler() |>
cr_on_html(\(ctx) ctx$push_data(list(url = ctx$request$url))) |>
cr_from_sitemap("https://example.com/sitemap.xml", since = "2026-01-01")
} # }