Qwen2.5-VL 72B
Qwen2.5-VL 72B, released in January 2025 as part of Alibaba's Qwen2.5 family, is one of the most capable open-weights vision-language models. Beyond standard image captioning, it supports GUI-agent grounding (predicting clickable regions), reasoning over videos up to roughly an hour long, and structured document parsing at near-commercial quality.
Model specs
- Vendor: Alibaba
- Family: Qwen2.5
- Released: 2025-01
- Context window: 128,000 tokens
- Modalities: text, vision, video
Strengths
- Leads open-weights VLMs on document VQA and chart reasoning
- Explicit GUI-agent grounding for click-target prediction
- Handles video up to roughly an hour in length
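In a GUI-agent loop, the grounding capability is typically prompted to return bounding boxes as JSON. The exact schema depends on the prompt; the `bbox_2d` key and absolute pixel coordinates below are assumptions for illustration, not the model's guaranteed output format. A minimal sketch of turning such a response into click points:

```python
import json

def click_targets(model_output: str):
    """Parse assumed grounding JSON like
    [{"bbox_2d": [x1, y1, x2, y2], "label": "Submit"}]
    into (label, center_x, center_y) click points."""
    targets = []
    for item in json.loads(model_output):
        # Absolute pixel coordinates are assumed here.
        x1, y1, x2, y2 = item["bbox_2d"]
        targets.append((item["label"], (x1 + x2) // 2, (y1 + y2) // 2))
    return targets

# Hypothetical model response:
raw = '[{"bbox_2d": [100, 200, 300, 260], "label": "Submit"}]'
print(click_targets(raw))  # [('Submit', 200, 230)]
```

In practice the agent would pass the click point to an OS automation layer and feed a fresh screenshot back to the model.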
Limitations
- 72B footprint requires multi-GPU serving
- Still behind frontier closed models on pure visual reasoning
- License constraints on certain commercial deployments — check terms
Use cases
- Open-weights GUI agents operating desktop or phone UIs
- Long-form video understanding and summarisation
- Document and form extraction for enterprise pipelines
- Research on agentic vision-language models
Benchmarks
| Benchmark | Score | As of |
|---|---|---|
| MMMU | ≈70% | 2025-01 |
| MathVista | ≈74% | 2025-01 |
| DocVQA | ≈96% | 2025-01 |
Frequently asked questions
What is Qwen2.5-VL 72B?
Qwen2.5-VL 72B is Alibaba's top-tier open-weights vision-language model, released in January 2025 with strong document VQA, long-video reasoning, and agentic GUI grounding capabilities.
Can Qwen2.5-VL 72B run locally?
It is designed for multi-GPU serving; A100 or H100 clusters are typical. Quantised variants can run on a single 80 GB H100 with reduced throughput.
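The multi-GPU requirement follows from simple arithmetic on the weight footprint. A back-of-the-envelope estimator (illustrative only; real serving also needs memory for the KV cache, activations, and the vision encoder):

```python
def weights_gb(params_b: float, bits_per_param: int) -> float:
    """Approximate weight memory in GB for a model with
    params_b billion parameters at the given precision."""
    return params_b * 1e9 * bits_per_param / 8 / 1e9

for bits in (16, 8, 4):
    print(f"{bits}-bit: ~{weights_gb(72, bits):.0f} GB")
# 16-bit weights alone (~144 GB) exceed one 80 GB GPU,
# while 4-bit (~36 GB) fits — which is why quantised
# variants can run single-GPU at reduced throughput.
```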
What is Qwen2.5-VL 72B good at?
Structured document understanding, chart reasoning, long-video Q&A, and predicting interactive regions for GUI-agent loops — it is a strong open-weights alternative to closed frontier VLMs.
Sources
- Alibaba Qwen — Qwen2.5-VL blog — accessed 2026-04-20
- Hugging Face — Qwen/Qwen2.5-VL-72B-Instruct — accessed 2026-04-20