olmOCR
olmOCR is the Allen Institute for AI's open-source document-to-text pipeline, originally built to extract clean training data for the OLMo family of open models. It fine-tunes a Qwen2-VL-7B backbone on roughly 260k PDF pages and reaches state-of-the-art quality on complex layouts at a fraction of the per-page cost of commercial OCR APIs. The release includes model weights, training data, and inference code.
Framework facts
- Category: rag
- Language: Python
- License: Apache-2.0
- Repository: https://github.com/allenai/olmocr
Install

pip install olmocr
# requires a GPU and vLLM

Quickstart

# Convert a PDF on a GPU machine
python -m olmocr.pipeline ./workspace --pdfs paper.pdf
# → JSONL in ./workspace/results/

Alternatives
- Marker — CPU-friendly, classic pipeline
- Docling — IBM, Apache-2.0
- Google Document AI — hosted commercial
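The quickstart's output under ./workspace/results/ is JSON Lines, one record per document. A minimal sketch of pulling the plain text back out; the record shape, in particular the top-level "text" field, is an assumption based on the Dolma-style documents olmOCR emits, so verify it against your own output:

```python
import json
from pathlib import Path

def extract_texts(jsonl_lines):
    """Yield the plain-text payload of each JSONL record.

    Assumes Dolma-style records with a top-level "text" field
    (an assumption; check a real results file).
    """
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        yield record.get("text", "")

def texts_from_workspace(workspace="./workspace"):
    """Collect extracted text from every results file in a workspace."""
    results = Path(workspace) / "results"
    for path in sorted(results.glob("*.jsonl")):
        yield from extract_texts(path.read_text(encoding="utf-8").splitlines())

# Demo on an in-memory record rather than a real pipeline run:
sample = ['{"id": "paper.pdf-1", "text": "Abstract. We present olmOCR."}']
print(list(extract_texts(sample)))  # → ['Abstract. We present olmOCR.']
```

From there the text drops straight into a chunker or embedding step, which is the usual next stage in a RAG ingest path.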
Frequently asked questions
Is olmOCR good enough to replace AWS Textract or Google Document AI?
For general reading-order extraction, yes — benchmarks show olmOCR matching or beating commercial APIs. For forms and KV-pair extraction, commercial APIs still lead because they ship pretrained form schemas.
What hardware do I need?
A single GPU with ~20GB VRAM (e.g., A100 40GB, L40, 3090) runs the 7B model comfortably via vLLM. CPU-only is not supported.
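As a preflight, the ~20 GB figure above can be checked against `nvidia-smi` before launching the pipeline. A small sketch, assuming the standard `--query-gpu=memory.total --format=csv,noheader,nounits` output of one MiB integer per line; the 20 GiB threshold simply mirrors the answer above:

```python
import subprocess

REQUIRED_GIB = 20  # rough VRAM floor for the 7B model under vLLM (from the FAQ answer)

def parse_vram_mib(smi_output):
    """Parse per-GPU total memory (MiB) from nvidia-smi's nounits CSV output."""
    return [int(line) for line in smi_output.splitlines() if line.strip().isdigit()]

def can_run_7b(total_mib, required_gib=REQUIRED_GIB):
    """True if a single GPU's VRAM clears the rough floor for the 7B model."""
    return total_mib >= required_gib * 1024

def preflight():
    """Query nvidia-smi and report whether any local GPU has enough VRAM."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return any(can_run_7b(mib) for mib in parse_vram_mib(out))

# A hypothetical 40 GiB reading clears the bar:
print(can_run_7b(40960))  # → True
```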
Sources
- olmOCR GitHub — accessed 2026-04-20
- AllenAI olmOCR announcement — accessed 2026-04-20