pdfplumber

pdfplumber, maintained by Jeremy Singer-Vine, is a widely used Python library for extracting text and tables from PDFs. Built on pdfminer.six, it exposes every character, line, and rectangle on a page, plus a table detector that can work from drawn ruling lines or, for unruled tables, from the alignment of the text. It is deterministic and dependency-light, and it pairs well with LLM pipelines when the documents already have a clean text layer (machine-generated PDFs rather than scans).
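
As a rough illustration of the object-level access and the two table modes described above, the sketch below counts the characters, lines, and rectangles on the first page and chooses a table strategy. The helper names table_settings and inspect_first_page are hypothetical, and the path is whatever file you pass in. The page.chars, page.lines, and page.rects attributes and the vertical_strategy / horizontal_strategy settings come from pdfplumber itself.

```python
def table_settings(ruled: bool) -> dict:
    """Pick pdfplumber table strategies (hypothetical helper).

    "lines" follows ruling lines drawn in the PDF; "text" infers
    cell boundaries from how the words line up.
    """
    strategy = "lines" if ruled else "text"
    return {"vertical_strategy": strategy, "horizontal_strategy": strategy}


def inspect_first_page(path: str, ruled: bool = True) -> dict:
    """Summarise the low-level objects on page 1 and extract its table."""
    # Imported lazily so table_settings() can be used without pdfplumber.
    import pdfplumber

    with pdfplumber.open(path) as pdf:
        page = pdf.pages[0]
        return {
            "n_chars": len(page.chars),   # one dict per character
            "n_lines": len(page.lines),   # drawn line segments
            "n_rects": len(page.rects),   # drawn rectangles
            "table": page.extract_table(table_settings=table_settings(ruled)),
        }
```

If a ruled table comes back empty or badly split, retrying with ruled=False is a reasonable first experiment, though which strategy works best depends on the document.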

Framework facts

Category: rag
Language: Python
License: MIT
Repository: https://github.com/jsvine/pdfplumber

Install

pip install pdfplumber

Quickstart

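import pdfplumber

with pdfplumber.open('invoice.pdf') as pdf:
    page = pdf.pages[0]            # first page of the document
    text = page.extract_text()     # plain text from the page's text layer
    table = page.extract_table()   # largest table as a list of rows, or None
print(text, table)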
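
The table returned by the quickstart is a list of rows, each a list of cell strings, and cells can be None. Before handing it to an LLM or a dataframe, it often helps to key each row by the header. The following is a minimal sketch of that step. The function name rows_to_records is hypothetical, and it assumes the first row is the header, which is not true of every table.

```python
def rows_to_records(table):
    """Convert extract_table() output into a list of dicts keyed by header.

    Assumes the first row holds the column names (an assumption, not
    something pdfplumber guarantees). Empty (None) cells become "".
    """
    if not table:  # extract_table() returns None when no table is found
        return []
    header, *rows = table
    keys = [(cell or "").strip() for cell in header]
    return [
        {key: (cell or "").strip() for key, cell in zip(keys, row)}
        for row in rows
    ]


if __name__ == "__main__":
    sample = [["Item", "Qty"], ["Widget", "2"], ["Gadget", None]]
    print(rows_to_records(sample))
```

With the quickstart above, this would be called as rows_to_records(table).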

Alternatives

PyMuPDF: faster, AGPL-3.0 (or a commercial license)
Marker: layout-aware
Docling: MIT, ML-backed

Frequently asked questions

When should I use pdfplumber vs Marker?

Use pdfplumber when the PDF already has a good text layer and you need deterministic, cheap extraction — invoices, forms, statements. Use Marker/Docling when layout is complex (multi-column papers, math, scans).

Does pdfplumber do OCR?

No. It only reads text that is already in the PDF. For scanned documents, pair it with an OCR tool such as Tesseract, Surya, or olmOCR.
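
One way to decide which pages to send to OCR is to check how much text pdfplumber can actually read from each page. The sketch below flags pages whose extracted text is nearly empty. The names needs_ocr and pages_needing_ocr are hypothetical, and the 20-character threshold is an arbitrary assumption to tune for your documents. The OCR step itself is left out.

```python
def needs_ocr(text, min_chars: int = 20) -> bool:
    """Return True when a page has too little extractable text.

    min_chars is an arbitrary cut-off, not a pdfplumber setting.
    """
    return len((text or "").strip()) < min_chars


def pages_needing_ocr(path: str, min_chars: int = 20) -> list:
    """List zero-based indexes of pages that look like scans."""
    # Imported lazily so needs_ocr() can be used without pdfplumber.
    import pdfplumber

    with pdfplumber.open(path) as pdf:
        return [
            index
            for index, page in enumerate(pdf.pages)
            if needs_ocr(page.extract_text(), min_chars)
        ]
```

Pages returned by pages_needing_ocr can then be rendered to images and passed to whichever OCR tool you use, while the remaining pages go through pdfplumber directly.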

Sources

pdfplumber GitHub — accessed 2026-04-20
pdfminer.six — accessed 2026-04-20