This vignette collects practical advice for reading large DATASUS
files efficiently with datasusr.
Column selection
The single most effective way to speed up reads is to select only the columns you need. The C parser skips over unneeded fields at the byte level, so fewer columns means less allocation and less parsing:
library(datasusr)
# Slow: reads all ~100+ columns
x <- read_datasus_dbc("RDPE2401.dbc")
# Fast: reads only 4 columns
x <- read_datasus_dbc(
"RDPE2401.dbc",
select = c("uf_zi", "ano_cmpt", "munic_res", "val_tot")
)Type inference
When guess_types = TRUE (the default), the reader scans
every value of each numeric field to decide whether it fits in an
integer or requires a double. Disabling this skips the scan and relies
on the DBF field metadata alone:
x <- read_datasus_dbc("RDPE2401.dbc", guess_types = FALSE)This is safe for most analyses and can save noticeable time on files with millions of rows.
Explicit types
When you know exactly which types you need, col_types
lets you specify them upfront and skip inference entirely:
x <- read_datasus_dbc(
"RDPE2401.dbc",
select = c("uf_zi", "dt_inter", "val_tot"),
col_types = c(uf_zi = "character", dt_inter = "date", val_tot = "double"),
parse_dates = TRUE,
guess_types = FALSE
)Date parsing
Date parsing (parse_dates = TRUE) adds a small overhead
per date field. Only enable it when you actually need Date
objects; otherwise the raw YYYYMMDD character strings are
often sufficient for filtering and grouping.
Benchmarking example
library(bench)
file <- "RDPE2401.dbc"
bench::mark(
default = read_datasus_dbc(file),
no_guess = read_datasus_dbc(file, guess_types = FALSE),
selected = read_datasus_dbc(file, select = c("uf_zi", "val_tot")),
typed = read_datasus_dbc(
file,
select = c("uf_zi", "dt_inter", "val_tot"),
col_types = c(uf_zi = "character", dt_inter = "date", val_tot = "double"),
parse_dates = TRUE,
guess_types = FALSE
),
check = FALSE,
iterations = 5
)