Apply a Docling chunker to a converted document and return the chunks as a
tidy tibble. The default "hybrid" chunker produces tokenization-aware,
context-enriched chunks well suited to embedding and retrieval pipelines; the
"hierarchical" chunker follows the document's structural hierarchy without
a token budget.
Usage
docling_chunk(
x,
chunker = c("hybrid", "hierarchical"),
tokenizer = NULL,
max_tokens = NULL,
contextualize = TRUE,
...
)Arguments
- x
A
docling_documentfromdocling_convert().- chunker
Either
"hybrid"(default) or"hierarchical".- tokenizer
Hugging Face model id whose tokenizer is used to count tokens (hybrid chunker only). Defaults to a small sentence-embedding tokenizer when
max_tokensis set;NULLuses Docling's built-in default.- max_tokens
Optional integer token budget per chunk (hybrid chunker only). When
NULL, the tokenizer's own maximum is used.- contextualize
When
TRUE(default), each chunk'stextis enriched with surrounding headings and table context via the chunker'scontextualize()method — the form you typically embed. The raw text is always also returned inraw_text.- ...
Additional keyword arguments forwarded to the Python chunker constructor (for example
merge_peersorrepeat_table_header).
Value
A tibble::tibble with one row per chunk and columns:
chunk_id— 1-based index.text— contextualized text (or raw text ifcontextualize = FALSE).raw_text— the chunk's unmodified text.n_chars— number of characters intext.headings— list-column of heading paths for the chunk.pages— list-column of integer page numbers the chunk spans.n_doc_items— number of underlying document items in the chunk.
Details
The hybrid chunker is token-aware: it packs content up to a token budget and
splits oversized passages. Control this with tokenizer (the model whose
tokenizer defines "a token") and max_tokens (the budget). These are ignored
by the hierarchical chunker.
Examples
if (FALSE) { # \dontrun{
doc <- docling_convert("paper.pdf")
chunks <- docling_chunk(doc, max_tokens = 512)
chunks$text[1]
# Match your embedding model's tokenizer
docling_chunk(doc, tokenizer = "BAAI/bge-small-en-v1.5", max_tokens = 512)
} # }