Split a document into RAG-ready chunks

Apply a Docling chunker to a converted document and return the chunks as a tidy tibble. The default "hybrid" chunker produces tokenization-aware, context-enriched chunks well suited to embedding and retrieval pipelines; the "hierarchical" chunker follows the document's structural hierarchy without a token budget.

Usage

docling_chunk(
  x,
  chunker = c("hybrid", "hierarchical"),
  tokenizer = NULL,
  max_tokens = NULL,
  contextualize = TRUE,
  ...
)

Arguments

x: A docling_document from docling_convert().
chunker: Either "hybrid" (default) or "hierarchical".
tokenizer: Hugging Face model id whose tokenizer is used to count tokens (hybrid chunker only). When NULL (default) and max_tokens is set, a small sentence-embedding tokenizer is used; when both are NULL, Docling's built-in default applies.
max_tokens: Optional integer token budget per chunk (hybrid chunker only). When NULL, the tokenizer's own maximum is used.
contextualize: When TRUE (default), each chunk's text is enriched with surrounding headings and table context via the chunker's contextualize() method; this is the form you typically embed. The raw text is always returned in raw_text as well.
...: Additional keyword arguments forwarded to the Python chunker constructor (for example merge_peers or repeat_table_header).

Value

A tibble::tibble with one row per chunk and columns:

chunk_id — 1-based index.
text — contextualized text (or raw text if contextualize = FALSE).
raw_text — the chunk's unmodified text.
n_chars — number of characters in text.
headings — list-column of heading paths for the chunk.
pages — list-column of integer page numbers the chunk spans.
n_doc_items — number of underlying document items in the chunk.
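
Because headings and pages are list-columns, they usually need a small amount of unpacking before filtering or display. The sketch below works on a hand-built tibble that only imitates the documented columns; the values, and the assumption that each headings entry is a character vector of nested headings, are illustrative and not real docling_chunk() output.

```r
library(tibble)

# Mock of the documented return shape (illustrative values only).
chunks <- tibble(
  chunk_id = 1:3,
  text = c(
    "Intro\nDocling converts documents.",
    "Methods\nChunking\nWe chunk by tokens.",
    "Methods\nTables\nTables keep headers."
  ),
  raw_text = c(
    "Docling converts documents.",
    "We chunk by tokens.",
    "Tables keep headers."
  ),
  n_chars = nchar(text),
  # Assumed shape: one character vector of nested headings per chunk.
  headings = list("Intro", c("Methods", "Chunking"), c("Methods", "Tables")),
  pages = list(1L, 1:2, 2L),
  n_doc_items = c(1L, 2L, 1L)
)

# Keep only the chunks that touch page 2.
touches_page_2 <- vapply(chunks$pages, function(p) 2L %in% p, logical(1))
on_page_2 <- chunks[touches_page_2, ]

# Collapse each heading path into a single display label.
chunks$heading_label <- vapply(
  chunks$headings, paste, character(1), collapse = " > "
)
```

The same vapply() pattern applies to any per-chunk summary of a list-column, such as the number of pages a chunk spans.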

Details

The hybrid chunker is token-aware: it packs content up to a token budget and splits oversized passages. Control this with tokenizer (the model whose tokenizer defines "a token") and max_tokens (the budget). These are ignored by the hierarchical chunker.
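
To make the idea of a token budget concrete, here is a toy sketch of "split oversized passages, then pack neighbours up to the budget". It is not Docling's implementation: it counts whitespace-separated words as tokens rather than using a model tokenizer, it ignores document structure, and the helper names (count_tokens, split_oversized, pack_passages) are hypothetical.

```r
# Toy tokenizer: one token per whitespace-separated word (an assumption;
# the real chunker counts tokens with the configured model tokenizer).
count_tokens <- function(x) lengths(strsplit(trimws(x), "\\s+"))

# Split a single passage into pieces of at most max_tokens words.
split_oversized <- function(passage, max_tokens) {
  words <- strsplit(trimws(passage), "\\s+")[[1]]
  groups <- ceiling(seq_along(words) / max_tokens)
  unname(vapply(split(words, groups), paste, character(1), collapse = " "))
}

# Greedily merge neighbouring pieces while the budget allows.
pack_passages <- function(passages, max_tokens) {
  pieces <- unlist(lapply(passages, split_oversized, max_tokens = max_tokens))
  chunks <- character(0)
  current <- ""
  for (piece in pieces) {
    candidate <- if (nzchar(current)) paste(current, piece) else piece
    if (count_tokens(candidate) <= max_tokens) {
      current <- candidate
    } else {
      # Budget exceeded: close the current chunk and start a new one.
      chunks <- c(chunks, current)
      current <- piece
    }
  }
  if (nzchar(current)) chunks <- c(chunks, current)
  chunks
}

packed <- pack_passages(
  c("a b", "c d e", "f g h i j k l", "m"),
  max_tokens = 4
)
```

With a budget of 4, the seven-word passage is split in two, and the trailing one-word passage is merged into the second half because the result still fits. Every returned chunk stays within the budget.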

Examples

if (FALSE) { # \dontrun{
doc <- docling_convert("paper.pdf")
chunks <- docling_chunk(doc, max_tokens = 512)
chunks$text[1]

# Match your embedding model's tokenizer
docling_chunk(doc, tokenizer = "BAAI/bge-small-en-v1.5", max_tokens = 512)
} # }
