Docling

Docling uses IBM's layout and table-structure models to convert complex documents into a faithful structured representation — preserving headings, lists, and tables. It has loaders for LangChain and LlamaIndex, runs fully on-prem, and includes specialised models for scientific papers and financial filings.

Framework facts

Category: rag
Language: Python
License: MIT
Repository: https://github.com/docling-project/docling

Install

pip install docling

Quickstart

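from docling.document_converter import DocumentConverter

# Build a converter with the default pipeline options.
conv = DocumentConverter()
# The source can be a local path or a URL; this one is fetched over the network.
result = conv.convert('https://arxiv.org/pdf/2408.09869')
# Export the parsed document to Markdown and preview the first 2,000 characters.
print(result.document.export_to_markdown()[:2000])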
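
Because the exported Markdown keeps the document's headings, it can be split into heading-delimited chunks before indexing for retrieval. The following is a minimal plain-Python sketch of that idea, not Docling's own chunking API; `split_by_headings` and the sample string are hypothetical names introduced for illustration.

```python
import re

# Matches ATX-style Markdown headings such as "# Title" or "## Method".
# Limitation: a "# comment" line inside a fenced code block would also match.
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_headings(markdown: str) -> list[dict]:
    """Split Markdown into one chunk per heading, keeping the heading text."""
    chunks = []
    current = {"heading": "", "lines": []}

    def flush(chunk):
        # Keep a chunk only if it has a heading or some non-blank text.
        if chunk["heading"] or any(line.strip() for line in chunk["lines"]):
            chunks.append(
                {
                    "heading": chunk["heading"],
                    "text": "\n".join(chunk["lines"]).strip(),
                }
            )

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            # A new heading closes the running chunk and starts the next one.
            flush(current)
            current = {"heading": match.group(2).strip(), "lines": []}
        else:
            current["lines"].append(line)

    flush(current)
    return chunks


# Stand-in for result.document.export_to_markdown() from the quickstart.
sample = "# Title\n\nIntro text.\n\n## Method\n\nStep one.\n"
chunks = split_by_headings(sample)
for chunk in chunks:
    print(chunk["heading"], "->", chunk["text"])
```

Each chunk carries its heading alongside its text, so the heading can be stored as metadata when the chunks are embedded.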

Alternatives

  • Unstructured.io — broader file-type coverage
  • LlamaParse — hosted, strongest on slides
  • Marker — academic PDF → Markdown specialist
  • PyMuPDF / pdfplumber — lower-level libraries

Frequently asked questions

Does Docling need internet access?

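Only on the first run. All models run locally: the first run downloads the weights from Hugging Face, after which Docling works fully offline. Converting a remote URL, as in the quickstart, still needs network access to fetch the document itself.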
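
For air-gapped machines, a minimal sketch of an offline setup follows. It assumes the weights have already been copied into a local directory (the Docling docs describe a `docling-tools models download` command for prefetching; verify it against your installed version). `build_offline_converter` and the example path are hypothetical; `artifacts_path` is the pipeline option that points Docling at local weights.

```python
from pathlib import Path


def build_offline_converter(artifacts_path: str):
    """Return a DocumentConverter that loads model weights from a local directory."""
    models_dir = Path(artifacts_path)
    # Fail early with a clear message instead of attempting a download later.
    if not models_dir.is_dir():
        raise FileNotFoundError(f"Model directory not found: {models_dir}")

    # Imports are deferred because importing docling loads heavy dependencies.
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import PdfPipelineOptions
    from docling.document_converter import DocumentConverter, PdfFormatOption

    # Point the PDF pipeline at the prefetched weights.
    pipeline_options = PdfPipelineOptions(artifacts_path=str(models_dir))
    return DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
        }
    )


# Example usage (hypothetical paths):
# converter = build_offline_converter("/opt/docling-models")
# result = converter.convert("report.pdf")
```

Checking the directory before constructing the converter keeps a missing-weights problem from surfacing as a network error on a machine with no connectivity.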

What about OCR for scanned PDFs?

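Docling includes an optional OCR stage with adapters for engines such as Tesseract and EasyOCR. It is controlled by the `do_ocr` flag on the PDF pipeline options, which are passed to the converter through its format options; `do_ocr=True` is not an argument of `DocumentConverter` itself. Docling can also auto-detect image-only pages.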
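
A minimal sketch of switching OCR on explicitly and choosing an engine. `build_ocr_converter` and its `use_tesseract` parameter are hypothetical helpers introduced for illustration; the option classes come from `docling.datamodel.pipeline_options`, and the Tesseract path assumes the relevant Tesseract bindings are installed.

```python
def build_ocr_converter(use_tesseract: bool = False):
    """Return a DocumentConverter whose PDF pipeline has OCR switched on."""
    # Imports are deferred because importing docling loads heavy dependencies.
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import (
        EasyOcrOptions,
        PdfPipelineOptions,
        TesseractOcrOptions,
    )
    from docling.document_converter import DocumentConverter, PdfFormatOption

    pipeline_options = PdfPipelineOptions()
    # do_ocr lives on the pipeline options, not on DocumentConverter.
    pipeline_options.do_ocr = True
    # Pick the OCR engine adapter.
    pipeline_options.ocr_options = (
        TesseractOcrOptions() if use_tesseract else EasyOcrOptions()
    )

    return DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
        }
    )


# Example usage (hypothetical file name):
# converter = build_ocr_converter()
# result = converter.convert("scanned.pdf")
# print(result.document.export_to_markdown()[:2000])
```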

Sources

  1. Docling — docs — accessed 2026-04-20
  2. Docling GitHub — accessed 2026-04-20