pdfplumber
pdfplumber, maintained by Jeremy Singer-Vine, is the workhorse of Python PDF extraction. Built on pdfminer.six, it exposes every character, line, and rectangle on a page, plus a table detector that handles both ruled and unruled tables. It is deterministic, dependency-light, and pairs well with modern LLM pipelines for documents that already have a clean text layer (machine-generated PDFs).
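As a rough illustration of the two capabilities described above, the sketch below counts the low-level objects on a page and picks a table-detection strategy depending on whether the table has drawn rules. The helper names (`describe_page`, `table_settings_for`, `extract_tables_from`) are hypothetical and not part of pdfplumber; `page.chars`, `page.lines`, `page.rects`, and the `vertical_strategy` / `horizontal_strategy` table settings are the library's own.

```python
def table_settings_for(ruled: bool) -> dict:
    """Hypothetical helper: choose a detection strategy for extract_tables().

    "lines" follows drawn rules on the page; "text" infers cell
    boundaries from how words align, which suits unruled tables.
    """
    strategy = "lines" if ruled else "text"
    return {"vertical_strategy": strategy, "horizontal_strategy": strategy}


def describe_page(page) -> dict:
    """Count the low-level objects pdfplumber exposes on a page."""
    return {
        "chars": len(page.chars),
        "lines": len(page.lines),
        "rects": len(page.rects),
    }


def extract_tables_from(path: str, ruled: bool = True, page_index: int = 0):
    """Open a PDF, report its object counts, and return the tables on one page."""
    import pdfplumber  # imported lazily so the helpers above work without it

    with pdfplumber.open(path) as pdf:
        page = pdf.pages[page_index]
        print(describe_page(page))
        return page.extract_tables(table_settings=table_settings_for(ruled))
```

Calling `extract_tables_from('invoice.pdf', ruled=False)` would try the text-alignment strategy, which may be worth a second pass when the default line-based detection finds nothing.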
Framework facts
- Category: rag
- Language: Python
- License: MIT
- Repository: https://github.com/jsvine/pdfplumber
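Install
    pip install pdfplumber
Quickstart
    import pdfplumber

    # Open the PDF; the context manager closes the file afterwards
    with pdfplumber.open('invoice.pdf') as pdf:
        page = pdf.pages[0]
        text = page.extract_text()    # text from the page's embedded text layer
        table = page.extract_table()  # rows of the largest table found, or None
    print(text, table)
Alternatives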
- PyMuPDF: faster, AGPL-3.0 (commercial license also available)
- Marker: layout-aware
- Docling: MIT, ML-backed
Frequently asked questions
When should I use pdfplumber vs Marker?
Use pdfplumber when the PDF already has a good text layer and you need deterministic, cheap extraction — invoices, forms, statements. Use Marker/Docling when layout is complex (multi-column papers, math, scans).
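One way to automate the first half of that decision is to check how many pages actually carry embedded characters. The sketch below is a heuristic under stated assumptions: the function names and both thresholds (`min_chars_per_page`, `min_fraction`) are hypothetical and should be tuned on your own documents. It only tests for a text layer; it says nothing about whether the layout is complex.

```python
def has_text_layer(char_counts, min_chars_per_page=50, min_fraction=0.8):
    """Return True when most pages carry enough embedded characters.

    char_counts is a list with one character count per page.
    Both thresholds are illustrative assumptions, not library defaults.
    """
    if not char_counts:
        return False
    good_pages = sum(1 for n in char_counts if n >= min_chars_per_page)
    return good_pages / len(char_counts) >= min_fraction


def count_chars_per_page(path):
    """Count the characters pdfplumber finds on each page of a PDF."""
    import pdfplumber  # imported lazily so has_text_layer() works without it

    with pdfplumber.open(path) as pdf:
        return [len(page.chars) for page in pdf.pages]


def choose_extractor(path):
    """Route a file: pdfplumber for text-layer PDFs, otherwise a layout/OCR tool."""
    if has_text_layer(count_chars_per_page(path)):
        return "pdfplumber"
    return "layout-or-ocr-tool"
```

A scanned statement typically yields counts of zero on every page, so it would be routed away from pdfplumber before any extraction is attempted.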
Does pdfplumber do OCR?
No. It only reads text that is already in the PDF. Combine it with Tesseract, Surya, or olmOCR for scans.
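A minimal sketch of that combination, assuming Tesseract is reached through the `pytesseract` package (an assumption; any OCR callable that accepts a PIL image would do). The function names and the `min_chars` threshold are hypothetical. `page.to_image(...).original` is pdfplumber's way of getting the rendered page as a PIL image.

```python
def page_text_with_ocr_fallback(page, ocr, min_chars=20, resolution=300):
    """Use the embedded text when present; otherwise rasterize and OCR the page.

    ocr is any callable that takes a PIL image and returns a string.
    min_chars is an illustrative threshold for "this page has no real text".
    """
    text = page.extract_text() or ""
    if len(text.strip()) >= min_chars:
        return text
    # Render the page to a PIL image and hand it to the OCR engine
    image = page.to_image(resolution=resolution).original
    return ocr(image)


def extract_document(path):
    """Return one string per page, using OCR only where the text layer is missing."""
    import pdfplumber
    import pytesseract  # assumption: pytesseract and the Tesseract binary are installed

    with pdfplumber.open(path) as pdf:
        return [
            page_text_with_ocr_fallback(page, pytesseract.image_to_string)
            for page in pdf.pages
        ]
```

Passing the OCR engine in as a callable keeps pdfplumber's deterministic path untouched and makes it easy to swap Tesseract for another engine.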
Sources
- pdfplumber GitHub — accessed 2026-04-20
- pdfminer.six — accessed 2026-04-20