Docling
Docling uses IBM's layout and table-structure models to convert complex documents into a faithful structured representation that preserves headings, lists, and tables. It has loaders for LangChain and LlamaIndex, runs fully on-prem, and includes specialised models for scientific papers and financial filings.
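To illustrate why preserved headings matter for retrieval, here is a minimal sketch in plain Python (no Docling dependency) that splits Markdown of the kind returned by `export_to_markdown()` into heading-scoped chunks. The function name `split_by_heading` and the sample text are hypothetical, chosen for illustration only. Docling ships its own chunkers, which this sketch does not reproduce.

```python
import re

# Matches ATX-style Markdown headings such as "# Title" or "## Methods".
HEADING = re.compile(r"^(#{1,6})\s+(.+?)\s*$")


def split_by_heading(markdown: str) -> list[dict]:
    """Split Markdown into chunks, one per heading (hypothetical helper).

    Each chunk keeps the heading text as metadata so a retriever can show
    where a passage came from. Text before the first heading gets
    heading=None. Limitation: '#' lines inside code fences are also
    treated as headings.
    """
    chunks: list[dict] = []
    heading = None
    body: list[str] = []

    def flush() -> None:
        # Store the accumulated body under the current heading, if non-empty.
        text = "\n".join(body).strip()
        if text:
            chunks.append({"heading": heading, "text": text})

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            heading = match.group(2)
            body = []
        else:
            body.append(line)
    flush()
    return chunks


# Illustrative input, not real Docling output.
sample = """# Title

Intro paragraph.

## Methods

| a | b |
|---|---|
| 1 | 2 |
"""

chunks = split_by_heading(sample)
for chunk in chunks:
    print(chunk["heading"], "->", len(chunk["text"]), "chars")
```

Because tables stay intact as Markdown inside a chunk, each retrieved passage carries both its section title and any table rows that belong to it.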
Framework facts
- Category: rag
- Language: Python
- License: MIT
- Repository: https://github.com/docling-project/docling
Install
pip install docling
Quickstart
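from docling.document_converter import DocumentConverter
# Convert a PDF from a URL (a local file path also works)
conv = DocumentConverter()
result = conv.convert('https://arxiv.org/pdf/2408.09869')
# Print the first 2,000 characters of the Markdown export
print(result.document.export_to_markdown()[:2000])
Alternatives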
- Unstructured.io — broader file-type coverage
- LlamaParse — hosted, strongest on slides
- Marker — academic PDF → Markdown specialist
- PyMuPDF / pdfplumber — lower-level libraries
Frequently asked questions
Does Docling need internet access?
No — all models run locally. The first run downloads weights from Hugging Face, after which Docling works fully offline.
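For air-gapped machines, one possible approach is to fetch the weights ahead of time (for example with the `docling-tools models download` command) and point the PDF pipeline at that directory. The sketch below assumes `PdfPipelineOptions` accepts an `artifacts_path` argument and that the default download location is `~/.cache/docling/models`; check both against the Docling version you have installed. The helper name `build_offline_converter` is hypothetical.

```python
from pathlib import Path

# Assumed default location used by `docling-tools models download`;
# verify against your installed Docling version.
DEFAULT_ARTIFACTS = Path.home() / ".cache" / "docling" / "models"


def build_offline_converter(artifacts_path: Path = DEFAULT_ARTIFACTS):
    """Return a DocumentConverter that loads models from a local directory.

    Hypothetical helper. Raises FileNotFoundError if the directory is
    missing, so a forgotten prefetch fails early instead of triggering
    a download attempt.
    """
    artifacts_path = Path(artifacts_path)
    if not artifacts_path.is_dir():
        raise FileNotFoundError(f"No prefetched models at {artifacts_path}")

    # Imported lazily: docling pulls in heavy model dependencies.
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import PdfPipelineOptions
    from docling.document_converter import DocumentConverter, PdfFormatOption

    options = PdfPipelineOptions(artifacts_path=str(artifacts_path))
    return DocumentConverter(
        format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=options)}
    )


# Usage (requires docling and prefetched models):
#   converter = build_offline_converter()
#   result = converter.convert("report.pdf")  # hypothetical local file
```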
What about OCR for scanned PDFs?
Docling includes an optional OCR pipeline (Tesseract and EasyOCR adapters). Enable it by setting `do_ocr=True` in the PDF pipeline options passed to the converter, or let Docling auto-detect image-only pages.
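A minimal sketch of the OCR setup, assuming the option classes `PdfPipelineOptions`, `EasyOcrOptions`, and `TesseractCliOcrOptions` from `docling.datamodel.pipeline_options` as found in recent Docling releases; class names and defaults may differ in your version. The helper `build_ocr_converter` and its `engine` parameter are hypothetical.

```python
def build_ocr_converter(engine: str = "easyocr"):
    """Return a DocumentConverter with OCR enabled for PDFs (hypothetical helper).

    `engine` selects the OCR backend: "easyocr" or "tesseract".
    """
    if engine not in ("easyocr", "tesseract"):
        raise ValueError(f"Unsupported OCR engine: {engine!r}")

    # Imported lazily: docling pulls in heavy model dependencies.
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import (
        EasyOcrOptions,
        PdfPipelineOptions,
        TesseractCliOcrOptions,
    )
    from docling.document_converter import DocumentConverter, PdfFormatOption

    options = PdfPipelineOptions()
    options.do_ocr = True  # OCR is a pipeline option, not a converter argument
    if engine == "easyocr":
        options.ocr_options = EasyOcrOptions()
    else:
        # Assumes the `tesseract` binary is installed and on PATH.
        options.ocr_options = TesseractCliOcrOptions()

    return DocumentConverter(
        format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=options)}
    )


# Usage (requires docling and the chosen OCR backend):
#   converter = build_ocr_converter("tesseract")
#   result = converter.convert("scanned.pdf")  # hypothetical local file
#   print(result.document.export_to_markdown()[:500])
```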
Sources
- Docling — docs — accessed 2026-04-20
- Docling GitHub — accessed 2026-04-20