Overview
doclingr turns messy documents — PDF, DOCX, PPTX, HTML, images — into structured, AI-ready data. It wraps the Docling Python library through reticulate, giving you layout-aware parsing, table extraction and retrieval-ready chunking with a small, tidy R API.
This vignette walks the full path: document → structure → tables → chunks → embeddings, i.e. everything you need to stand up a retrieval-augmented generation (RAG) corpus from R.
One-time setup
doclingr needs the Docling Python package. Install it once into a managed environment, then restart R:
library(doclingr)
install_docling() # creates an "r-docling" Python environment
# ...restart R...
docling_available() # TRUE once the backend is ready
Converting a document
docling_convert() runs Docling’s understanding pipeline
over a file path or URL and returns a lightweight handle:
doc <- docling_convert("https://arxiv.org/pdf/2408.09869")
doc
#> <docling_document>
#> source: https://arxiv.org/pdf/2408.09869
#> pages: 9
#> tables: 5
#> figures: 3
Tune the pipeline when you need to. OCR and the accurate table model cost time; disable OCR or switch to the fast table model for born-digital documents or large batches:
doc <- docling_convert(
  "report.pdf",
  ocr = FALSE,         # skip OCR for born-digital PDFs
  table_mode = "fast", # "accurate" (default) or "fast"
  device = "mps"       # "auto", "cpu", "cuda", "mps"
)
# Convert many sources in one batch
docs <- docling_convert(c("a.pdf", "b.docx", "c.html"))
Exporting structure
Render the understood document into the format your downstream tools expect:
as_markdown(doc) # layout-aware Markdown
as_text(doc) # plain text
as_html(doc) # HTML
as_json(doc) # structured DoclingDocument as a nested R list
as_doctags(doc) # Docling's DocTags representation
Tables as tibbles
Every detected table comes back as a tibble, in document order:
tables <- docling_tables(doc)
length(tables)
tables[[1]]
#> # A tibble: 12 x 4
#> Method Recall Precision F1
#> <chr> <chr> <chr> <chr>
#> 1 Baseline 0.81 0.78 0.79
#> ...
Figures
Pull figure captions and pages, and optionally save the images
(requires images = TRUE at conversion time):
doc <- docling_convert("paper.pdf", images = TRUE)
figs <- docling_figures(doc, image_dir = "figures")
figs
#> # A tibble: 3 x 4
#> figure_id caption page image_path
#> <int> <chr> <int> <chr>
#> 1 1 "Figure 1: pipeline ..." 2 figures/figure-001.png
#> ...
Chunking for retrieval
docling_chunk() splits the document into context-rich
chunks. The default hybrid chunker is token-aware: pass the tokenizer of
your embedding model and set max_tokens so that every chunk fits within
that model’s input limit.
chunks <- docling_chunk(
  doc,
  tokenizer = "BAAI/bge-small-en-v1.5",
  max_tokens = 512
)
chunks
#> # A tibble: 84 x 7
#> chunk_id text raw_text n_chars headings pages n_doc_items
#> <int> <chr> <chr> <int> <list> <list> <int>
#> 1 1 "Docling: ..." "Docling..." 412 <chr [2]> <int [1]> 3
#> ...
Each chunk’s text is contextualized (enriched with its heading path
and table context), which is the form you typically embed. The
unmodified text is kept in raw_text.
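When chunks are later shown as retrieval results, it can help to carry a readable source label with each one. The following is a minimal base R sketch, assuming a chunk table shaped like the output above. The chunks data frame is hand-built mock data (only the column names follow the vignette), and chunk_label() is a hypothetical helper, not part of doclingr.

```r
# Mock stand-in for the tibble returned by docling_chunk(); the values are
# invented for illustration and only the column layout follows the vignette.
chunks <- data.frame(chunk_id = 1:2)
chunks$raw_text <- c("Docling converts PDFs.", "Tables are parsed with a model.")
chunks$headings <- list(c("Docling", "Introduction"),
                        c("Docling", "Methods", "Tables"))
chunks$pages <- list(1L, c(3L, 4L))

# Hypothetical helper: collapse the heading path and page list into one label.
chunk_label <- function(headings, pages) {
  path <- paste(headings, collapse = " > ")
  page_range <- if (length(pages) > 1) {
    paste0("pp. ", min(pages), "-", max(pages))
  } else {
    paste0("p. ", pages)
  }
  paste0(path, " (", page_range, ")")
}

# mapply() walks the two list-columns in parallel, one chunk at a time.
chunks$label <- mapply(chunk_label, chunks$headings, chunks$pages)
chunks$label
```

Keeping the label as an ordinary character column means it travels with the chunk through embedding and storage without any special handling.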
From chunks to embeddings
doclingr is deliberately provider-agnostic about embeddings: you
supply a function that maps a character vector to vectors, and
docling_embed() handles batching and tidy assembly. Here is
a sketch against an OpenAI-style API:
embed_api <- function(texts) {
  # Call your embedding endpoint; return a matrix with one row per text.
  # e.g. httr2 -> a list of vectors, or a matrix.
  stop("embed_api() is a placeholder: call your embedding endpoint here")
}
corpus <- doc |>
  docling_chunk(tokenizer = "BAAI/bge-small-en-v1.5", max_tokens = 512) |>
  docling_embed(embed_api, batch_size = 64)
corpus
#> # ... your chunks plus `embedding` (list-column) and `n_dim`
At this point corpus is a tidy table of chunks with their headings,
pages and embeddings, ready to write to a vector store, a database,
or an in-memory nearest-neighbor index for RAG.
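To make the in-memory option concrete, here is a small base R sketch of a cosine-similarity lookup over an embedding list-column. It assumes a corpus shaped like the one above. The mock corpus, the letter-count embedder embed_toy() and the helper nearest_chunks() are all hypothetical stand-ins for illustration, not part of doclingr; a real setup would use a proper embedding model.

```r
# Toy embedder (assumption): counts of the letters a-z, one row per text.
# It only stands in for a real embedding model so the example runs offline.
embed_toy <- function(texts) {
  t(vapply(
    tolower(texts),
    function(x) {
      chars <- strsplit(x, "", fixed = TRUE)[[1]]
      as.numeric(table(factor(chars, levels = letters)))
    },
    numeric(26),
    USE.NAMES = FALSE
  ))
}

# Mock corpus mirroring the vignette's layout: text plus an embedding list-column.
texts <- c("table extraction", "ocr for scanned pages", "hybrid chunker")
emb <- embed_toy(texts)
corpus <- data.frame(chunk_id = seq_along(texts), text = texts,
                     stringsAsFactors = FALSE)
corpus$embedding <- lapply(seq_len(nrow(emb)), function(i) emb[i, ])

# Hypothetical helper: rank chunks by cosine similarity to a query vector.
nearest_chunks <- function(corpus, query_vec, k = 2) {
  m <- do.call(rbind, corpus$embedding)            # chunks x dimensions
  sims <- as.vector(m %*% query_vec) /
    (sqrt(rowSums(m^2)) * sqrt(sum(query_vec^2)))  # cosine similarity
  top <- head(order(sims, decreasing = TRUE), k)
  data.frame(chunk_id = corpus$chunk_id[top], text = corpus$text[top],
             similarity = sims[top], stringsAsFactors = FALSE)
}

# Embed the query with the same function used for the corpus, then search.
query_vec <- embed_toy("extract tables")[1, ]
hits <- nearest_chunks(corpus, query_vec, k = 2)
hits
```

A brute-force scan like this is usually adequate for a few thousand chunks; larger corpora are where a dedicated vector store or approximate index starts to pay off.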
Where to go next
- Use as_json(doc) when you need the full structural detail Docling captured.
- Persist corpus (for example with arrow::write_parquet()) to avoid re-converting and re-embedding.
- See the Docling documentation for the breadth of supported formats and pipeline options.
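The persistence advice above can be sketched as a small cache-or-compute wrapper. This is a base R illustration that uses saveRDS() rather than Parquet so that it runs without extra packages. cached() is a hypothetical helper, and build_corpus() is a placeholder for your own convert, chunk and embed pipeline.

```r
# Hypothetical helper: return the cached object if the file exists,
# otherwise run build(), save its result, and return it.
cached <- function(path, build) {
  if (file.exists(path)) {
    return(readRDS(path))
  }
  value <- build()
  saveRDS(value, path)
  value
}

# Placeholder for the expensive pipeline (convert -> chunk -> embed).
# The counter only records how many times the pipeline actually ran.
n_builds <- 0
build_corpus <- function() {
  n_builds <<- n_builds + 1
  data.frame(chunk_id = 1:2, text = c("first chunk", "second chunk"),
             stringsAsFactors = FALSE)
}

cache_file <- tempfile(fileext = ".rds")
corpus <- cached(cache_file, build_corpus)        # runs the pipeline, writes the cache
corpus_again <- cached(cache_file, build_corpus)  # reads the cache, pipeline not re-run
```

Swapping saveRDS() and readRDS() for arrow::write_parquet() and arrow::read_parquet() keeps the same pattern while producing a file that tools outside R can read.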