Molmo 72B
Molmo 72B is the flagship open-weight multimodal language model from the Allen Institute for AI (Ai2), released in late 2024. Built on Qwen2-72B with a CLIP-ViT vision encoder, Molmo is notable for being trained exclusively on Ai2's own open PixMo dataset — no distillation from closed VLMs — and matching proprietary models like GPT-4V and Claude 3.5 Sonnet on several visual benchmarks.
Model specs
- Vendor: Allen AI (Ai2)
- Family: Molmo
- Released: 2024-09
- Context window: 32,000 tokens
- Modalities: text, vision
Strengths
- Fully open training data (PixMo) — avoids distillation from closed VLMs
- Competitive with GPT-4V and Claude 3.5 Sonnet on many visual benchmarks
- Supports grounded 'pointing' outputs: 2-D coordinates located on the input image
- Apache-2.0 licensed
Limitations
- The 72B model requires substantial GPU infrastructure (multiple A100s or H100s)
- No audio or video modalities
- Lags frontier closed multimodals on hardest reasoning tasks
Use cases
- Fully-open multimodal research and reproductions
- Visual Q&A and document understanding
- Fine-tuning for grounded tasks using Molmo's 'pointing' capability
- Academic benchmarking without closed-model contamination
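Molmo returns pointing results as coordinates embedded in its text output, which downstream code must parse. The sketch below assumes a single-point response formatted as an XML-like `<point>` tag with `x`/`y` given as percentages (0-100) of the image dimensions, as shown in Ai2's examples; the exact tag format and attribute order are assumptions, and multi-point responses may use a different tag.

```python
import re

# Assumed format of a Molmo pointing reply: an XML-like <point> tag whose
# x/y attributes are percentages (0-100) of image width/height.
POINT_RE = re.compile(r'<point\s+x="([\d.]+)"\s+y="([\d.]+)"[^>]*>([^<]*)</point>')

def parse_points(response: str, img_w: int, img_h: int):
    """Extract (label, pixel_x, pixel_y) tuples from a pointing response.

    Coordinates are rescaled from percentages to pixels for the given
    image size.
    """
    points = []
    for m in POINT_RE.finditer(response):
        x_pct, y_pct, label = float(m.group(1)), float(m.group(2)), m.group(3)
        points.append((label, round(x_pct / 100 * img_w), round(y_pct / 100 * img_h)))
    return points

reply = '<point x="25.0" y="50.0" alt="the mug">the mug</point>'
print(parse_points(reply, img_w=640, img_h=480))  # [('the mug', 160, 240)]
```

Rescaling from percentages keeps the model's output resolution-independent: the same reply maps to correct pixel positions at any image size.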
Benchmarks
| Benchmark | Score | As of |
|---|---|---|
| MMMU | ~54% | 2026-04 |
| DocVQA | ~93% | 2026-04 |
| ChartQA | ~87% | 2026-04 |
Frequently asked questions
What is Molmo 72B?
Molmo 72B is the Allen Institute for AI's flagship open-weight vision-language model. It combines Qwen2-72B with a CLIP-ViT vision encoder and is trained entirely on the open PixMo dataset.
Why is Molmo's training data important?
Many open VLMs are distilled from closed frontier models, which makes them partially non-reproducible and license-fragile. Molmo's PixMo dataset was collected from scratch by Ai2 annotators, giving researchers a fully open, legally clean baseline.
Sources
- Molmo 72B on HuggingFace — accessed 2026-04-20
- Molmo paper (arXiv) — accessed 2026-04-20