Marker
Marker is a high-accuracy converter from Datalab that turns PDF, EPUB, DOCX, and PPTX files into Markdown, JSON, or HTML. It chains layout detection, OCR fallback (via Surya), table extraction, and equation reconstruction into a single pipeline, and it consistently ranks among the top open-source parsers on quality benchmarks. It is widely used in open-source RAG stacks, particularly when Unstructured or Docling fall short on complex layouts.
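Because Marker can emit Markdown, JSON, or HTML, the output format is typically chosen through configuration. The sketch below assumes the `ConfigParser` helper and the `output_format` key as shown in the Marker README; `build_marker_config` and `convert` are hypothetical wrapper names, and the exact arguments may differ between versions.

```python
# Requires: pip install marker-pdf
# Marker is imported lazily so the config helper works even where it is not installed.
ALLOWED_FORMATS = ("markdown", "json", "html")  # the formats named in the description


def build_marker_config(output_format: str = "markdown") -> dict:
    """Return a config dict for Marker, rejecting formats this sketch does not cover."""
    if output_format not in ALLOWED_FORMATS:
        raise ValueError(f"unsupported output format: {output_format!r}")
    return {"output_format": output_format}


def convert(path: str, output_format: str = "markdown"):
    """Convert one document; assumes the README-style ConfigParser API."""
    from marker.config.parser import ConfigParser
    from marker.converters.pdf import PdfConverter
    from marker.models import create_model_dict

    parser = ConfigParser(build_marker_config(output_format))
    converter = PdfConverter(
        config=parser.generate_config_dict(),
        artifact_dict=create_model_dict(),
        renderer=parser.get_renderer(),  # renderer matching the chosen format
    )
    return converter(path)
```

Under these assumptions, `convert('paper.pdf', 'json')` would return a rendered object in JSON form rather than Markdown.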
Framework facts
- Category: rag
- Language: Python
- License: GPL-3.0 (paid commercial license available via Datalab)
- Repository: https://github.com/datalab-to/marker
Install
pip install marker-pdf
Quickstart
from marker.converters.pdf import PdfConverter
from marker.models import create_model_dict
from marker.output import text_from_rendered
# Load the models once and reuse the converter across files.
converter = PdfConverter(artifact_dict=create_model_dict())
rendered = converter('paper.pdf')
# Returns the output text, a format label (ignored here), and any extracted images.
markdown, _, images = text_from_rendered(rendered)
Alternatives
- Docling — IBM, Apache-2.0
- Unstructured — unified loader
- LlamaParse — LlamaIndex's hosted parser
Frequently asked questions
Marker vs Docling?
Both are excellent open-source parsers. Marker has historically led on complex math-heavy PDFs; Docling is Apache-2.0 (friendlier for commercial use) and has stronger structured JSON output. Many teams run both and compare on their corpus.
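Comparing two parsers on your own corpus can be sketched as follows. This is a minimal illustration, assuming each parser is wrapped in a function that takes a file path and returns Markdown, and that a few hand-checked reference transcriptions exist; `similarity` and `compare_parsers` are hypothetical names, and character-level similarity is only a rough proxy for parse quality.

```python
from difflib import SequenceMatcher
from statistics import mean
from typing import Callable, Dict

Parser = Callable[[str], str]  # takes a file path, returns Markdown text


def similarity(candidate: str, reference: str) -> float:
    """Character-level similarity in [0, 1], ignoring whitespace differences."""
    def normalize(text: str) -> str:
        return " ".join(text.split())
    return SequenceMatcher(None, normalize(candidate), normalize(reference)).ratio()


def compare_parsers(parsers: Dict[str, Parser], references: Dict[str, str]) -> Dict[str, float]:
    """Average each parser's similarity to the reference text over all files."""
    return {
        name: mean(similarity(parse(path), ref) for path, ref in references.items())
        for name, parse in parsers.items()
    }
```

In practice the `parsers` dict would map names such as "marker" and "docling" to thin wrappers around each library, and `references` would map a handful of representative file paths to their expected Markdown.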
Is Marker's GPL license a problem?
Only if you embed Marker into closed-source software. Datalab offers a commercial license, and running Marker as a batch ingestion step outside your distributed product usually does not trigger GPL obligations, but talk to counsel.
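One way to run Marker as a standalone batch step is to call its command-line tool from a script instead of importing it. The sketch below assumes the `marker_single` command and its `--output_dir` and `--output_format` options are available after installation; `marker_single_command` and `run_ingestion` are hypothetical helper names. It illustrates a deployment pattern only and says nothing about how the license applies to your case.

```python
import subprocess
from typing import List


def marker_single_command(pdf_path: str, output_dir: str,
                          output_format: str = "markdown") -> List[str]:
    """Build the argument list for one marker_single invocation (no shell involved)."""
    return [
        "marker_single", pdf_path,
        "--output_dir", output_dir,
        "--output_format", output_format,
    ]


def run_ingestion(pdf_paths: List[str], output_dir: str) -> None:
    """Convert each PDF in its own process; raises CalledProcessError on failure."""
    for pdf_path in pdf_paths:
        subprocess.run(marker_single_command(pdf_path, output_dir), check=True)
```

The converted files land in `output_dir`, where a downstream indexing job can pick them up without importing Marker itself.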
Sources
- Marker GitHub — accessed 2026-04-20
- Datalab docs — accessed 2026-04-20