Molmo 72B

Molmo 72B is the flagship open-weight multimodal language model from the Allen Institute for AI (Ai2), released in September 2024. Built on Qwen2-72B with a CLIP-ViT vision encoder, Molmo is notable for being trained exclusively on Ai2's own open PixMo dataset — no distillation from closed VLMs — and for matching proprietary models such as GPT-4V and Claude 3.5 Sonnet on several visual benchmarks.

Model specs

Vendor: Allen AI (Ai2)
Family: Molmo
Released: 2024-09
Context window: 32,000 tokens
Modalities: text, vision

Strengths

  • Fully open training data (PixMo) — avoids distillation from closed VLMs
  • Competitive with GPT-4V and Claude 3.5 Sonnet on many visual benchmarks
  • Supports grounded 'pointing' outputs: answers can include pixel coordinates for referenced objects
  • Apache-2.0 licensed

Limitations

  • 72B parameters require substantial GPU infrastructure (multiple A100s or H100s)
  • No audio or video modalities
  • Lags frontier closed multimodals on hardest reasoning tasks

Use cases

  • Fully-open multimodal research and reproductions
  • Visual Q&A and document understanding
  • Fine-tuning for pointing and grounded tasks (Molmo's 'pointing' prompt)
  • Academic benchmarking without closed-model contamination
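In published examples, Molmo's pointing answers arrive as XML-like tags embedded in the text reply, with coordinates expressed as percentages of the image dimensions. The exact tag schema below (`<point x=".." y="..">`) is an assumption based on those examples, not an official spec — a minimal parsing sketch:

```python
import re

def parse_points(text, width, height):
    """Extract Molmo-style <point x=".." y=".."> tags from a model reply.

    Assumes x/y are percentages (0-100) of the image dimensions, and
    converts them to pixel coordinates for the given image size.
    """
    points = []
    for m in re.finditer(r'<point\s+x="([\d.]+)"\s+y="([\d.]+)"', text):
        x_pct, y_pct = float(m.group(1)), float(m.group(2))
        points.append((x_pct / 100.0 * width, y_pct / 100.0 * height))
    return points

# Hypothetical reply for the prompt "Point to the cat."
reply = '<point x="61.5" y="40.2" alt="cat">cat</point>'
pts = parse_points(reply, width=640, height=480)
print(pts)
```

For fine-tuning on grounded tasks, the same percentage convention keeps targets resolution-independent: the label stays valid however the image is resized at training time.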

Benchmarks

All scores as of 2026-04:

  • MMMU: ~54%
  • DocVQA: ~93%
  • ChartQA: ~87%

Frequently asked questions

What is Molmo 72B?

Molmo 72B is the Allen Institute for AI's flagship open-weight vision-language model. It combines Qwen2-72B with a CLIP-ViT vision encoder and is trained entirely on the open PixMo dataset.

Why is Molmo's training data important?

Many open VLMs are distilled from the outputs of closed frontier models, making them only partially reproducible and license-fragile. Molmo's PixMo dataset was collected from scratch by Ai2 annotators, giving researchers a fully open, legally clean baseline.

Sources

  1. Molmo 72B on HuggingFace — accessed 2026-04-20
  2. Molmo paper (arXiv) — accessed 2026-04-20