Qwen2.5-VL 72B
Qwen2.5-VL 72B, released in January 2025 as part of Alibaba's Qwen2.5 family, is one of the most capable open-weights vision-language models. Beyond standard image captioning, it supports GUI-agent grounding (predicting clickable regions), reasoning over videos up to roughly an hour long, and structured document parsing at near-commercial quality.
Model specs
- Vendor: Alibaba
- Family: Qwen2.5
- Released: 2025-01
- Context window: 128,000 tokens
- Modalities: text, vision, video
Strengths
- Leads open-weights VLMs on document VQA and chart reasoning
- Explicit GUI-agent grounding for click-target prediction
- Handles video up to roughly an hour in length
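In a GUI-agent loop, the grounding capability is typically prompted to return bounding boxes as JSON. The exact schema depends on the prompt; the `bbox_2d` key and absolute pixel coordinates below are assumptions for illustration, not the model's guaranteed output format. A minimal sketch of turning such a response into click points:

```python
import json

def click_targets(model_output: str):
    """Parse assumed grounding JSON like
    [{"bbox_2d": [x1, y1, x2, y2], "label": "Submit"}]
    into (label, center_x, center_y) click points."""
    targets = []
    for item in json.loads(model_output):
        # Absolute pixel coordinates are assumed here.
        x1, y1, x2, y2 = item["bbox_2d"]
        targets.append((item["label"], (x1 + x2) // 2, (y1 + y2) // 2))
    return targets

# Hypothetical model response:
raw = '[{"bbox_2d": [100, 200, 300, 260], "label": "Submit"}]'
print(click_targets(raw))  # [('Submit', 200, 230)]
```

In practice the agent would pass the click point to an OS automation layer and feed a fresh screenshot back to the model.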
Limitations
- 72B footprint requires multi-GPU serving
- Still behind frontier closed models on pure visual reasoning
- License constraints on certain commercial deployments — check terms
Use cases
- Open-weights GUI agents operating desktop or phone UIs
- Long-form video understanding and summarisation
- Document and form extraction for enterprise pipelines
- Research on agentic vision-language models
Benchmarks
| Benchmark | Score | As of |
|---|---|---|
| MMMU | ≈70% | 2025-01 |
| MathVista | ≈74% | 2025-01 |
| DocVQA | ≈96% | 2025-01 |
Frequently asked questions
What is Qwen2.5-VL 72B?
Qwen2.5-VL 72B is Alibaba's top-tier open-weights vision-language model, released in January 2025 with strong document VQA, long-video reasoning, and agentic GUI grounding capabilities.
Can Qwen2.5-VL 72B run locally?
It is designed for multi-GPU serving; A100 or H100 clusters are typical. Quantised variants can run on a single 80 GB H100 with reduced throughput.
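The multi-GPU requirement follows from simple arithmetic on the weight footprint. A back-of-the-envelope estimator (illustrative only; real serving also needs memory for the KV cache, activations, and the vision encoder):

```python
def weights_gb(params_b: float, bits_per_param: int) -> float:
    """Approximate weight memory in GB for a model with
    params_b billion parameters at the given precision."""
    return params_b * 1e9 * bits_per_param / 8 / 1e9

for bits in (16, 8, 4):
    print(f"{bits}-bit: ~{weights_gb(72, bits):.0f} GB")
# 16-bit weights alone (~144 GB) exceed one 80 GB GPU,
# while 4-bit (~36 GB) fits — which is why quantised
# variants can run single-GPU at reduced throughput.
```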
What is Qwen2.5-VL 72B good at?
Structured document understanding, chart reasoning, long-video Q&A, and predicting interactive regions for GUI-agent loops — it is a strong open-weights alternative to closed frontier VLMs.
Sources
- Alibaba Qwen — Qwen2.5-VL blog — accessed 2026-04-20
- Hugging Face — Qwen/Qwen2.5-VL-72B-Instruct — accessed 2026-04-20