Molmo 72B

Molmo 72B is the flagship open-weight multimodal language model from the Allen Institute for AI (Ai2), released in September 2024. Built on Qwen2-72B with a CLIP-ViT vision encoder, Molmo is notable for being trained exclusively on Ai2's own open PixMo dataset — no distillation from closed VLMs — and for matching proprietary models such as GPT-4V and Claude 3.5 Sonnet on several visual benchmarks.

Model specs

Vendor: Allen AI (Ai2)
Family: Molmo
Released: 2024-09
Context window: 32,000 tokens
Modalities: text, vision

Strengths

  • Fully open training data (PixMo) — avoids distillation from closed VLMs
  • Competitive with GPT-4V and Claude 3.5 Sonnet on many visual benchmarks
  • Supports grounded 'pointing' outputs: answers can include pixel coordinates for referenced objects
  • Apache-2.0 licensed

Limitations

  • 72B parameters require substantial GPU infrastructure (multiple A100s or H100s)
  • No audio or video modalities
  • Lags frontier closed multimodals on hardest reasoning tasks

Use cases

  • Fully-open multimodal research and reproductions
  • Visual Q&A and document understanding
  • Fine-tuning for pointing and grounded tasks (Molmo's 'pointing' prompt)
  • Academic benchmarking without closed-model contamination
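In published examples, Molmo's pointing answers arrive as XML-like tags embedded in the text reply, with coordinates expressed as percentages of the image dimensions. The exact tag schema below (`<point x=".." y="..">`) is an assumption based on those examples, not an official spec — a minimal parsing sketch:

```python
import re

def parse_points(text, width, height):
    """Extract Molmo-style <point x=".." y=".."> tags from a model reply.

    Assumes x/y are percentages (0-100) of the image dimensions, and
    converts them to pixel coordinates for the given image size.
    """
    points = []
    for m in re.finditer(r'<point\s+x="([\d.]+)"\s+y="([\d.]+)"', text):
        x_pct, y_pct = float(m.group(1)), float(m.group(2))
        points.append((x_pct / 100.0 * width, y_pct / 100.0 * height))
    return points

# Hypothetical reply for the prompt "Point to the cat."
reply = '<point x="61.5" y="40.2" alt="cat">cat</point>'
pts = parse_points(reply, width=640, height=480)
print(pts)
```

For fine-tuning on grounded tasks, the same percentage convention keeps targets resolution-independent: the label stays valid however the image is resized at training time.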

Benchmarks

All scores as of 2026-04:

  • MMMU: ~54%
  • DocVQA: ~93%
  • ChartQA: ~87%

Frequently asked questions

What is Molmo 72B?

Molmo 72B is the Allen Institute for AI's flagship open-weight vision-language model. It combines Qwen2-72B with a CLIP-ViT vision encoder and is trained entirely on the open PixMo dataset.

Why is Molmo's training data important?

Many open VLMs are distilled from the outputs of closed frontier models, making them only partially reproducible and license-fragile. Molmo's PixMo dataset was collected from scratch by Ai2 annotators, giving researchers a fully open, legally clean baseline.

Sources

  1. Molmo 72B on HuggingFace — accessed 2026-04-20
  2. Molmo paper (arXiv) — accessed 2026-04-20