
olmOCR

olmOCR is the Allen Institute for AI's open-source document-to-text pipeline, originally built to extract clean training data for the OLMo family of open models. It fine-tunes a Qwen2-VL-7B backbone on ~260k PDF pages with human-verified output and reaches state-of-the-art quality on complex layouts at a fraction of the cost of commercial OCR APIs. It ships with model weights, training data, and inference code.

Framework facts

Category: rag
Language: Python
License: Apache-2.0
Repository: https://github.com/allenai/olmocr

Install

pip install olmocr
# requires a GPU and vLLM

Quickstart

# Convert a PDF on a GPU machine
python -m olmocr.pipeline ./workspace --pdfs paper.pdf
# → JSONL in ./workspace/results/
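The files in ./workspace/results/ are newline-delimited JSON, one record per processed document. A minimal loader sketch follows; the "id" and "text" field names are assumptions about the output schema, so inspect a real results file before relying on them:

```python
import json
import tempfile
from pathlib import Path

def load_results(workspace: str) -> list[dict]:
    """Collect all JSONL records from an olmOCR workspace's results dir.

    Assumes one JSON object per line with a "text" field (verify
    against the schema your olmOCR version actually emits).
    """
    records = []
    for path in sorted((Path(workspace) / "results").glob("*.jsonl")):
        for line in path.read_text().splitlines():
            if line.strip():
                records.append(json.loads(line))
    return records

# Demo against a synthetic record, since the real schema may differ:
with tempfile.TemporaryDirectory() as ws:
    results_dir = Path(ws) / "results"
    results_dir.mkdir()
    (results_dir / "output.jsonl").write_text(
        json.dumps({"id": "paper.pdf", "text": "Sample extracted text."}) + "\n"
    )
    docs = load_results(ws)
    print(len(docs), docs[0]["text"])
# → 1 Sample extracted text.
```

From here the records can be chunked and embedded like any other text corpus.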

Alternatives

  • Marker — CPU-friendly, classic pipeline
  • Docling — IBM, Apache-2.0
  • Google Document AI — hosted commercial

Frequently asked questions

Is olmOCR good enough to replace AWS Textract or Google Document AI?

For general reading-order extraction, yes — benchmarks show olmOCR matching or beating commercial APIs. For forms and KV-pair extraction, commercial APIs still lead because they ship pretrained form schemas.

What hardware do I need?

A single GPU with ~20GB VRAM (e.g., A100 40GB, L40, 3090) runs the 7B model comfortably via vLLM. CPU-only is not supported.
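A preflight check can save a failed launch. This sketch applies the ~20 GB floor stated above; it assumes torch is importable (vLLM depends on it) but degrades gracefully when it is not:

```python
# The 20 GB floor is the rough requirement quoted in the FAQ above.
MIN_VRAM_GB = 20

def has_enough_vram(total_bytes: int, required_gb: float = MIN_VRAM_GB) -> bool:
    """True if total device memory clears the recommended floor."""
    return total_bytes / 1024**3 >= required_gb

try:
    import torch  # assumed present as a vLLM dependency
    if torch.cuda.is_available():
        props = torch.cuda.get_device_properties(0)
        status = "OK" if has_enough_vram(props.total_memory) else "too small"
        print(f"{props.name}: {status} for the 7B model")
    else:
        print("No CUDA device found; olmOCR needs a GPU")
except ImportError:
    print("torch not installed; set up olmOCR's GPU environment first")

# Sanity checks against known capacities (in bytes):
print(has_enough_vram(40 * 1024**3))  # A100 40GB → True
print(has_enough_vram(8 * 1024**3))   # 8GB consumer card → False
```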

Sources

  1. olmOCR GitHub — accessed 2026-04-20
  2. AllenAI olmOCR announcement — accessed 2026-04-20