Docling

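Docling converts complex documents such as PDFs into a structured representation that preserves headings, lists, and tables. It relies on layout-analysis and table-structure models that originated at IBM Research. It provides integrations for LangChain and LlamaIndex and can run entirely on-premises. Its layout model is trained on data that includes scientific articles and financial reports, so these document types are handled by the same pipeline rather than by separate per-domain models.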

Framework facts

Category: rag
Language: Python
License: MIT
Repository: https://github.com/docling-project/docling

Install

pip install docling

Quickstart

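from docling.document_converter import DocumentConverter

# convert() accepts a local file path or a URL
converter = DocumentConverter()
result = converter.convert('https://arxiv.org/pdf/2408.09869')

# Print the first 2,000 characters of the Markdown export
print(result.document.export_to_markdown()[:2000])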
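
To keep the converted output instead of printing it, a small wrapper along the following lines may be enough. This is a sketch: the helper names `save_markdown` and `convert_and_save` and the output path are hypothetical and not part of Docling. The only Docling calls it relies on are the ones already shown in the quickstart.

```python
from pathlib import Path
from typing import Protocol


class MarkdownExportable(Protocol):
    """Anything with export_to_markdown(), e.g. `result.document` above."""

    def export_to_markdown(self) -> str: ...


def save_markdown(document: MarkdownExportable, out_path: str) -> Path:
    """Write the document's Markdown export to out_path and return the path.

    Hypothetical helper, not part of Docling.
    """
    path = Path(out_path)
    # Create missing parent directories so the write cannot fail on them.
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(document.export_to_markdown(), encoding="utf-8")
    return path


def convert_and_save(source: str, out_path: str) -> Path:
    """Convert a local path or URL with Docling, then save the Markdown."""
    # Imported here because loading Docling pulls in its model dependencies.
    from docling.document_converter import DocumentConverter

    result = DocumentConverter().convert(source)
    return save_markdown(result.document, out_path)
```

Calling `convert_and_save('https://arxiv.org/pdf/2408.09869', 'out/paper.md')` would then write the full Markdown export to disk. Keeping the file-writing step separate from the conversion makes it easy to exercise without loading any models.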

Alternatives

Unstructured.io — broader file-type coverage
LlamaParse — hosted, strongest on slides
Marker — academic PDF → Markdown specialist
PyMuPDF / pdfplumber — lower-level libraries

Frequently asked questions

Does Docling need internet access?

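Only on first use. All models run locally; the first run downloads their weights from Hugging Face, after which Docling can work offline. On air-gapped machines, the weights can be fetched ahead of time and loaded from a local directory.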
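
One way to make the offline setup explicit is to point the PDF pipeline at a directory of pre-downloaded weights through the `artifacts_path` pipeline option. The sketch below assumes the directory was filled beforehand, for example with the `docling-tools models download` command. The helper name `offline_converter` and the directory shown in the usage note are hypothetical.

```python
from pathlib import Path


def offline_converter(models_dir: str):
    """Return a DocumentConverter that loads model weights from models_dir.

    Hypothetical helper, not part of Docling. models_dir is assumed to
    already contain the downloaded model weights.
    """
    path = Path(models_dir).expanduser()
    # Fail early with a clear message instead of attempting a download.
    if not path.is_dir():
        raise FileNotFoundError(f"model directory not found: {path}")

    # Imported here because loading Docling pulls in its model dependencies.
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import PdfPipelineOptions
    from docling.document_converter import DocumentConverter, PdfFormatOption

    options = PdfPipelineOptions(artifacts_path=str(path))
    return DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(pipeline_options=options)
        }
    )
```

A call such as `offline_converter('~/.cache/docling/models').convert('report.pdf')` would then read the weights from disk; the path and file name here are placeholders.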

What about OCR for scanned PDFs?

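Docling's PDF pipeline includes an optional OCR stage with interchangeable engines, among them EasyOCR and Tesseract. It is controlled by the `do_ocr` flag on the PDF pipeline options, which are passed to the converter, rather than by an argument on the converter itself. With OCR enabled, Docling applies it to the image regions of a page; the `force_full_page_ocr` option on the OCR settings runs it over whole pages, which suits fully scanned documents.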
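
As an illustration, OCR could be configured roughly as follows. This is a sketch under assumptions: the helper name `ocr_converter` and its `engine` argument are hypothetical, and the Tesseract variant assumes the `tesseract` command-line binary is installed on the machine.

```python
def ocr_converter(engine: str = "easyocr", force_full_page: bool = False):
    """Return a DocumentConverter with OCR enabled for PDFs.

    Hypothetical helper, not part of Docling. engine selects one of the
    two OCR backends named in this FAQ.
    """
    if engine not in ("easyocr", "tesseract"):
        raise ValueError(f"unsupported OCR engine: {engine!r}")

    # Imported here because loading Docling pulls in its model dependencies.
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import (
        EasyOcrOptions,
        PdfPipelineOptions,
        TesseractCliOcrOptions,
    )
    from docling.document_converter import DocumentConverter, PdfFormatOption

    ocr_cls = EasyOcrOptions if engine == "easyocr" else TesseractCliOcrOptions
    options = PdfPipelineOptions(
        do_ocr=True,
        # force_full_page_ocr=True runs OCR over whole pages, not just
        # the image regions found on a page.
        ocr_options=ocr_cls(force_full_page_ocr=force_full_page),
    )
    return DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(pipeline_options=options)
        }
    )
```

For a fully scanned PDF, `ocr_converter(force_full_page=True).convert('scan.pdf')` would be the likely starting point; `scan.pdf` is a placeholder file name.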

Sources

Docling — docs — accessed 2026-04-20
Docling GitHub — accessed 2026-04-20