Beyond crawling, crawlee provides three helpers that turn collected
text into a corpus ready for retrieval-augmented generation (RAG):
cr_chunk(), cr_embed(), and cr_export(). They operate on plain
tibbles, so they slot in right after cr_collect().
1. Crawl and collect text
pages <- crawler("https://books.toscrape.com/") |>
  cr_options(max_requests = 100) |>
  cr_on_html(function(ctx) {
    ctx$push_data(list(
      url = ctx$request$url,
      title = ctx$page |> rvest::html_element("title") |> rvest::html_text2(),
      text = ctx$page |> rvest::html_element("body") |> rvest::html_text2()
    ))
    ctx$enqueue_links(glob = "*/catalogue/*")
  }) |>
  cr_run() |>
  cr_collect()

2. Chunk
cr_chunk() splits text into overlapping windows. On a
data frame, name the text column; every other column is carried along as
per-chunk metadata (so each chunk keeps its url and
title).
chunks <- cr_chunk(pages, text = text, size = 1000, overlap = 200, by = "char")
chunks
#> columns: doc_id, chunk_id, chunk, text, n_chars, url, title

Use by = "word" to size chunks in words instead of
characters.
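The windowing that size and overlap describe can be pictured with a few lines of base R. This is a minimal sketch, not crawlee's implementation: chunk_chars() is a hypothetical helper, and it assumes character-based sizing in which each window starts size - overlap characters after the previous one.

```r
# Hypothetical helper (not part of crawlee): split one string into
# overlapping character windows of width `size`.
chunk_chars <- function(text, size = 1000, overlap = 200) {
  stopifnot(size > overlap, overlap >= 0)
  n <- nchar(text)
  if (n <= size) return(text)          # short texts become a single chunk
  step <- size - overlap               # distance between window starts
  # A new window starts every `step` characters until one reaches the end.
  starts <- seq(1, n - overlap, by = step)
  substring(text, starts, pmin(starts + size - 1, n))
}

x <- paste(rep("abcdefghij", 5), collapse = "")  # 50 characters
chunk_chars(x, size = 20, overlap = 5)
```

Each window repeats the last overlap characters of the previous one, so a phrase cut at one boundary can still appear intact in the neighbouring chunk.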
3. Embed
cr_embed() is provider-agnostic:
crawlee never calls an embedding service itself. You pass
embed_fn, a function that maps a character vector to a
numeric matrix (one row per input) or a list of numeric vectors. It is
applied in batches and adds an embedding list-column.
# A real embedder typically calls an HTTP API (any provider) with httr2:
embed_fn <- function(texts) {
  # return a length(texts) x d numeric matrix
  resp <- httr2::request("https://api.example.com/v1/embeddings") |>
    httr2::req_auth_bearer_token(Sys.getenv("EMBEDDINGS_API_KEY")) |>
    httr2::req_body_json(list(input = texts)) |>
    httr2::req_perform()
  do.call(rbind, lapply(httr2::resp_body_json(resp)$data, \(x) unlist(x$embedding)))
}
embedded <- cr_embed(chunks, embed_fn, batch_size = 32)

For a quick local experiment you can pass any function, even a trivial one.
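As an illustration of the embed_fn contract, here is a sketch of such a trivial function. toy_embed() is a hypothetical name and its three count features are arbitrary; it only demonstrates the expected shape (one numeric row per input) and is not a useful embedding.

```r
# Hypothetical toy embedder (not part of crawlee): maps each string to a
# row of three simple counts, giving a length(texts) x 3 numeric matrix.
toy_embed <- function(texts) {
  cbind(
    n_chars  = nchar(texts),
    n_words  = lengths(strsplit(texts, "\\s+")),
    n_digits = nchar(gsub("[^0-9]", "", texts))
  )
}

m <- toy_embed(c("A Light in the Attic", "Price: 51.77"))
dim(m)  # 2 rows (one per input), 3 columns
```

Following the cr_embed() call shown earlier, it could be passed as cr_embed(chunks, toy_embed), which is enough to exercise the rest of the pipeline without an API key.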
4. Export for retrieval
cr_export() writes the chunk table (with embeddings) to
a retrieval-friendly format. parquet and jsonl
preserve the embedding vectors natively; csv and
duckdb serialise them to a [...] string.
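When a csv or duckdb export is read back into R, the serialised embeddings have to be turned into numeric vectors again. The sketch below assumes the string is a comma-separated list of numbers wrapped in square brackets, such as "[0.12, -0.5, 3]"; the exact layout crawlee writes is not shown here, so inspect a real export first. parse_embedding() is a hypothetical helper, not a crawlee function.

```r
# Hypothetical helper: parse one bracketed, comma-separated string
# (assumed format) back into a numeric vector.
parse_embedding <- function(s) {
  inner <- gsub("^\\s*\\[|\\]\\s*$", "", s)   # drop the surrounding brackets
  as.numeric(trimws(strsplit(inner, ",", fixed = TRUE)[[1]]))
}

v <- parse_embedding("[0.12, -0.5, 3]")
v
```

Applied over a column with lapply(), this would restore a list-column of numeric vectors comparable to what the parquet and jsonl formats keep natively.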
End to end
crawler("https://books.toscrape.com/") |>
  cr_options(max_requests = 100) |>
  cr_on_html(function(ctx) {
    ctx$push_data(list(
      url = ctx$request$url,
      text = ctx$page |> rvest::html_element("body") |> rvest::html_text2()
    ))
    ctx$enqueue_links(glob = "*/catalogue/*")
  }) |>
  cr_run() |>
  cr_collect() |>
  cr_chunk(text = text, size = 1000, overlap = 200) |>
  cr_embed(embed_fn) |>
  cr_export("corpus.parquet", format = "parquet")

From here, load corpus.parquet into your vector store or
do nearest-neighbour search in R to retrieve chunks for a prompt.
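For the in-R route, a brute-force cosine-similarity search is often enough for a small corpus. The following is a minimal sketch under stated assumptions: the embeddings have already been stacked into a numeric matrix with one row per chunk, the prompt has been embedded with the same function, and nearest_chunks() is a hypothetical helper rather than part of crawlee.

```r
# Hypothetical helper: row indices of the k embeddings most similar to `query`.
nearest_chunks <- function(emb, query, k = 3) {
  # Cosine similarity = dot product divided by the product of the norms.
  sims <- as.vector(emb %*% query) / (sqrt(rowSums(emb^2)) * sqrt(sum(query^2)))
  head(order(sims, decreasing = TRUE), k)
}

# Toy data standing in for a real embedding matrix and an embedded prompt.
emb <- rbind(c(1, 0), c(0.9, 0.1), c(0, 1), c(-1, 0))
query <- c(1, 0)
top <- nearest_chunks(emb, query, k = 2)
top
```

The returned indices can be used to subset the chunk table and pull the matching text, url and title into a prompt. For larger corpora a dedicated vector store will scale better than this full scan.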