Curiosity · AI Model

Qwen2-VL 72B

Qwen2-VL 72B is Alibaba's flagship open vision-language model, released in September 2024 under the Qwen 2 family. It introduces a Naive Dynamic Resolution visual encoder and M-RoPE multimodal positional embeddings, letting it natively process arbitrary-resolution images and up to ~20-minute videos. The 72B instruct variant is competitive with GPT-4o on MMMU, DocVQA, and ChartQA while being downloadable from Hugging Face under the Qwen licence.

Model specs

Vendor: Alibaba / Qwen team
Family: Qwen 2
Released: 2024-09
Context window: 32,768 tokens
Modalities: text, vision, video

Strengths

Dynamic-resolution encoder avoids fixed-size image patches
M-RoPE gives strong temporal understanding for video
Competitive with GPT-4o on several benchmarks
Open weights for research and most commercial use

Limitations

72B inference needs 2× A100 / H100-class GPUs at reasonable latency
Qwen licence has monthly-active-user restrictions at scale
Video understanding degrades past ~20 minutes
English creative writing behind GPT-4o

Use cases

Document understanding and multilingual OCR
Long-form video QA up to ~20 minutes
Chart and diagram reasoning
Open-weights replacement for closed VLM APIs

Benchmarks

Benchmark	Score	As of
MMMU (val)	≈64%	2024-09
DocVQA	≈96%	2024-09
MathVista	≈70%	2024-09
Video-MME (w/ subtitles)	≈72%	2024-09

Frequently asked questions

What is Qwen2-VL 72B?

Qwen2-VL 72B is the flagship open vision-language model from Alibaba's Qwen team, released in September 2024. It natively handles arbitrary-resolution images and long videos.

How does Qwen2-VL compare to GPT-4o Vision?

On benchmarks like DocVQA, MathVista, and Video-MME, Qwen2-VL 72B is competitive with GPT-4o. It lags in some English creative tasks but offers open weights.

What licence does Qwen2-VL use?

The Qwen licence permits research and commercial use, with conditions for very large deployments (>100M MAU).

Sources

Qwen2-VL blog — accessed 2026-04-20
Qwen2-VL 72B on Hugging Face — accessed 2026-04-20