Marker

Marker is a high-accuracy PDF (plus EPUB, DOCX, PPTX) to Markdown / JSON / HTML converter from Datalab. It chains layout detection, OCR fallback (via Surya), table extraction, and equation reconstruction into a single pipeline, and it scores at or near the top of the open-source parsers in the benchmarks published in its repository. It is a common choice in open-source RAG stacks when Unstructured or Docling fall short on complex layouts.

Framework facts

Category: rag
Language: Python
License: GPL-3.0 (paid commercial license available via Datalab)
Repository: https://github.com/datalab-to/marker

Install

pip install marker-pdf

Quickstart

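from marker.converters.pdf import PdfConverter
from marker.models import create_model_dict
from marker.output import text_from_rendered

# Load the models the pipeline needs (layout, OCR, tables) once and reuse them
converter = PdfConverter(artifact_dict=create_model_dict())
# Run the full conversion pipeline on a single file
rendered = converter('paper.pdf')
# Returns the text, an output-format tag (ignored here), and the extracted images
markdown, _, images = text_from_rendered(rendered)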
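
In a RAG pipeline, the Markdown string from the quickstart is usually split into chunks before indexing. The following is a minimal sketch of one way to do that, assuming the output uses ATX-style headings (lines starting with "#"). The helper name split_markdown_by_heading and the chunk dictionary shape are hypothetical and not part of Marker.

```python
import re
from typing import Dict, List

# Hypothetical helper for illustration; not part of Marker's API.
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")
FENCE = "`" * 3  # Markdown code-fence marker


def split_markdown_by_heading(markdown: str) -> List[Dict[str, str]]:
    """Split Markdown into one chunk per heading-delimited section."""
    chunks: List[Dict[str, str]] = []
    title = ""               # text before the first heading gets an empty title
    body: List[str] = []
    in_fence = False         # '#' lines inside code fences are not headings

    def flush() -> None:
        # Store the section collected so far, skipping fully empty sections.
        text = "\n".join(body).strip()
        if title or text:
            chunks.append({"title": title, "text": text})

    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        match = None if in_fence else HEADING_RE.match(line)
        if match:
            flush()
            title = match.group(2)
            body = []
        else:
            body.append(line)
    flush()
    return chunks
```

With the quickstart variables, this would be called as chunks = split_markdown_by_heading(markdown). Splitting on headings keeps each chunk aligned with a section of the document; very long sections may still need a size-based split afterwards.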

Alternatives

  • Docling: open-source parser from IBM, MIT license
  • Unstructured: open-source library with a unified loader for many document types
  • LlamaParse: LlamaIndex's hosted parser

Frequently asked questions

Marker vs Docling?

Both are capable open-source parsers. Marker has historically led on complex, math-heavy PDFs; Docling is MIT-licensed (friendlier for commercial use) and has stronger structured JSON output. Many teams run both and compare the results on their own corpus.
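
One rough way to compare parsers on your own corpus is to hand-check a reference transcription for a few pages and score each parser's output against it. The sketch below uses a plain character-level similarity from Python's standard library. The metric and the function names are assumptions chosen for illustration, not the benchmark method used by either project.

```python
from difflib import SequenceMatcher
from typing import Dict


def normalize(text: str) -> str:
    """Collapse whitespace so layout differences do not dominate the score."""
    return " ".join(text.split())


def score_parsers(reference: str, outputs: Dict[str, str]) -> Dict[str, float]:
    """Return a similarity in [0, 1] to the reference for each parser's output."""
    ref = normalize(reference)
    return {
        name: SequenceMatcher(None, ref, normalize(out), autojunk=False).ratio()
        for name, out in outputs.items()
    }
```

Here outputs would map a parser name (for example "marker" or "docling") to the Markdown it produced for the same page. A single similarity number hides a lot, so it is worth reading the lowest-scoring pages by eye, especially tables and equations.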

Is Marker's GPL license a problem?

Only if you embed Marker in closed-source software that you distribute. Datalab offers a commercial license, and running Marker as a batch ingestion step outside your distributed product usually does not trigger GPL obligations, but talk to counsel.

Sources

  1. Marker GitHub — accessed 2026-04-20
  2. Datalab docs — accessed 2026-04-20