A RAG pipeline • crawlee

library(crawlee)

Beyond crawling, crawlee provides three helpers that turn collected text into a corpus ready for retrieval-augmented generation (RAG): cr_chunk(), cr_embed() and cr_export(). They operate on plain tibbles, so they slot in right after cr_collect().

1. Crawl and collect text

pages <- crawler("https://books.toscrape.com/") |>
  cr_options(max_requests = 100) |>
  cr_on_html(function(ctx) {
    ctx$push_data(list(
      url   = ctx$request$url,
      title = ctx$page |> rvest::html_element("title") |> rvest::html_text2(),
      text  = ctx$page |> rvest::html_element("body") |> rvest::html_text2()
    ))
    ctx$enqueue_links(glob = "*/catalogue/*")
  }) |>
  cr_run() |>
  cr_collect()

2. Chunk

cr_chunk() splits text into overlapping windows. On a data frame, name the text column; every other column is carried along as per-chunk metadata (so each chunk keeps its url and title).

chunks <- cr_chunk(pages, text = text, size = 1000, overlap = 200, by = "char")
chunks
#> columns: doc_id, chunk_id, chunk, text, n_chars, url, title

Use by = "word" to size chunks in words instead of characters.
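To make the size and overlap arguments concrete, here is a minimal base R sketch of overlapping character windows. chunk_chars() is a hypothetical helper written for illustration; it is not part of crawlee, and cr_chunk() may pick boundaries differently.

```r
# Hypothetical helper, not part of crawlee: split one string into
# overlapping character windows of at most `size` characters.
chunk_chars <- function(x, size = 1000, overlap = 200) {
  stopifnot(length(x) == 1, overlap >= 0, size > overlap)
  n <- nchar(x)
  if (n == 0) return(character(0))
  step <- size - overlap
  # A new window is needed only while the previous one stops short of the end
  starts <- seq(1, max(1, n - overlap), by = step)
  substring(x, starts, pmin(starts + size - 1, n))
}

x <- paste(rep("a", 2500), collapse = "")
chunk_lengths <- nchar(chunk_chars(x))
chunk_lengths
```

Each window starts size - overlap characters after the previous one, so consecutive chunks share overlap characters and the final chunk may be shorter than size.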

3. Embed

cr_embed() is provider-agnostic: crawlee never calls an embedding service itself. You pass embed_fn, a function that maps a character vector to a numeric matrix (one row per input) or a list of numeric vectors. It is applied in batches and adds an embedding list-column.

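# A real embedder typically calls an HTTP API (any provider) with httr2.
# The URL is a placeholder, and the response shape (a `data` list whose items
# each carry an `embedding`) is an assumption; adapt both to your provider.
embed_fn <- function(texts) {
  resp <- httr2::request("https://api.example.com/v1/embeddings") |>
    httr2::req_auth_bearer_token(Sys.getenv("EMBEDDINGS_API_KEY")) |>
    # as.list() keeps `input` a JSON array even when the batch has one text
    httr2::req_body_json(list(input = as.list(texts))) |>
    httr2::req_perform()
  # Stack one embedding per input into a length(texts) x d numeric matrix
  do.call(rbind, lapply(httr2::resp_body_json(resp)$data, \(x) unlist(x$embedding)))
}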

embedded <- cr_embed(chunks, embed_fn, batch_size = 32)

For a quick local experiment you can pass any function, even a trivial one:

fake_embed <- function(x) matrix(nchar(x), nrow = length(x), ncol = 1)
embedded <- cr_embed(chunks, fake_embed)
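The batching contract described above can be sketched in base R as follows. embed_in_batches() and toy_embed() are hypothetical names used only for illustration; this shows the idea and is not crawlee's actual implementation.

```r
# Hypothetical sketch of batched embedding; crawlee's internals may differ.
embed_in_batches <- function(texts, embed_fn, batch_size = 32) {
  # Group positions into consecutive batches of at most `batch_size`
  batches <- split(seq_along(texts), ceiling(seq_along(texts) / batch_size))
  out <- lapply(batches, function(idx) {
    res <- embed_fn(texts[idx])
    # Accept a matrix (one row per input) or a list of numeric vectors
    if (is.matrix(res)) res <- lapply(seq_len(nrow(res)), function(i) res[i, ])
    stopifnot(length(res) == length(idx))
    res
  })
  # Flatten the per-batch lists into one list with one vector per text
  unlist(out, recursive = FALSE, use.names = FALSE)
}

# Toy embedder: a 2-dimensional vector per text (its length and a constant)
toy_embed <- function(x) cbind(nchar(x), 1)
emb <- embed_in_batches(c("a", "bb", "ccc"), toy_embed, batch_size = 2)
length(emb)
```

The result is a plain list with one numeric vector per input, which is the shape a list-column such as embedding needs.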

4. Export for retrieval

cr_export() writes the chunk table (with embeddings) to a retrieval-friendly format. parquet and jsonl preserve the embedding vectors natively; csv and duckdb serialise them to a [...] string.

cr_export(embedded, "corpus.parquet", format = "parquet")
cr_export(embedded, "corpus.jsonl", format = "jsonl")
cr_export(embedded, "corpus.duckdb", format = "duckdb", table = "chunks")

End to end

crawler("https://books.toscrape.com/") |>
  cr_options(max_requests = 100) |>
  cr_on_html(function(ctx) {
    ctx$push_data(list(
      url  = ctx$request$url,
      text = ctx$page |> rvest::html_element("body") |> rvest::html_text2()
    ))
    ctx$enqueue_links(glob = "*/catalogue/*")
  }) |>
  cr_run() |>
  cr_collect() |>
  cr_chunk(text = text, size = 1000, overlap = 200) |>
  cr_embed(embed_fn) |>
  cr_export("corpus.parquet", format = "parquet")

From here, load corpus.parquet into your vector store or do nearest-neighbour search in R to retrieve chunks for a prompt.
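For the in-R option, a brute-force cosine-similarity search is often enough for small corpora. The sketch below assumes the corpus is already in memory as a data frame (for example, read back with arrow::read_parquet()) and that embedding is a list-column of equal-length numeric vectors. top_k_chunks() is a hypothetical helper, not a crawlee function, and the toy corpus stands in for real data.

```r
# Hypothetical helper, not part of crawlee: return the k rows whose
# embeddings have the highest cosine similarity to a query embedding.
top_k_chunks <- function(corpus, query_vec, k = 5) {
  emb <- do.call(rbind, corpus$embedding)        # n x d matrix
  emb_norm <- emb / sqrt(rowSums(emb^2))         # scale each row to unit length
  q_norm <- query_vec / sqrt(sum(query_vec^2))
  scores <- as.vector(emb_norm %*% q_norm)       # cosine similarities
  top <- head(order(scores, decreasing = TRUE), k)
  cbind(
    corpus[top, setdiff(names(corpus), "embedding"), drop = FALSE],
    score = scores[top]
  )
}

# Toy corpus standing in for the exported chunk table
corpus <- data.frame(chunk = c("cats", "dogs", "cars"))
corpus$embedding <- list(c(1, 0), c(0.9, 0.1), c(0, 1))
res <- top_k_chunks(corpus, query_vec = c(1, 0), k = 2)
res
```

The query vector must come from the same embed_fn used for the corpus, otherwise the similarity scores are not comparable.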