
Registers a handler invoked for responses classified as PDF, either by Content-Type (application/pdf) or by a URL ending in .pdf. The handler context adds PDF-specific helpers on top of the usual ones.

Usage

cr_on_pdf(crawler, handler, label = NULL)

Arguments

crawler

A Crawler.

handler

A function of one argument (the context). See Context.

label

Optional handler label; NULL registers the default PDF handler.

Value

The crawler, invisibly.

Details

Requests carrying an explicit label are always routed to the handler registered for that label, regardless of content kind. With label = NULL, the handler is registered as the default PDF handler.

Context

In addition to the elements documented in cr_on_html(), a PDF handler's context provides:

kind

"pdf".

pdf_text()

Extracts the text of each page and returns it as a character vector with one element per page. Requires the pdftools package.

body_raw()

The raw PDF bytes.

save_body(key, ext)

Persists the PDF to the KeyValueStore.

Examples

if (FALSE) { # \dontrun{
crawler("https://example.com/report.pdf") |>
  cr_on_pdf(function(ctx) {
    text <- ctx$pdf_text()
    ctx$push_data(list(url = ctx$request$url, n_pages = length(text)))
    ctx$save_body(ext = "pdf")
  }) |>
  cr_run()
} # }
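
Because a handler is an ordinary function of one argument, it can be tried out without running a crawl. The sketch below is a minimal illustration under stated assumptions: summarise_pdf() and fake_context() are hypothetical names, and the stand-in context only mimics the elements used here (request$url, pdf_text(), push_data()); it is not the object the crawler actually passes.

```r
# Hypothetical handler: summarise a PDF response as one dataset record.
summarise_pdf <- function(ctx) {
  pages <- ctx$pdf_text()  # character vector, one element per page
  ctx$push_data(list(
    url     = ctx$request$url,
    n_pages = length(pages),
    n_chars = sum(nchar(pages))
  ))
}

# Stand-in context for local testing. This is an assumption made for the
# example, not the package's real context object.
fake_context <- function(url, pages) {
  records <- list()
  list(
    kind      = "pdf",
    request   = list(url = url),
    pdf_text  = function() pages,
    push_data = function(x) {
      # Append the record to the list captured by this closure.
      records[[length(records) + 1L]] <<- x
      invisible(NULL)
    },
    records   = function() records
  )
}

ctx <- fake_context("https://example.com/report.pdf",
                    c("Page one", "Page two."))
summarise_pdf(ctx)
str(ctx$records())
```

The same summarise_pdf function could then be passed to cr_on_pdf() in place of the inline function shown above, assuming the real context exposes these elements as documented.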