
Qwen2.5-VL 72B

Qwen2.5-VL 72B, released in January 2025 as part of Alibaba's Qwen2.5 family, is one of the most capable open-weights vision-language models. Beyond standard image captioning, it supports GUI-agent grounding (clickable-region prediction), long-video reasoning up to an hour, and structured document parsing at near-commercial quality.
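In practice the model is usually served behind an OpenAI-compatible endpoint (vLLM and similar servers support this). The sketch below builds a minimal multimodal chat payload for such an endpoint; the helper name and example URL are illustrative assumptions, not part of any official API.

```python
# Sketch: a chat-completions payload for Qwen2.5-VL behind an
# OpenAI-compatible server (e.g. vLLM). Assumed, not official: the
# build_image_request helper name and the example image URL.

def build_image_request(prompt: str, image_url: str,
                        model: str = "Qwen/Qwen2.5-VL-72B-Instruct") -> dict:
    """Build a chat payload with one image part and one text part."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": image_url}},
                    {"type": "text", "text": prompt},
                ],
            }
        ],
    }

req = build_image_request("Describe the chart.", "https://example.com/chart.png")
```

The same payload shape works for captioning, document VQA, and chart reasoning; only the prompt and image change.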

Model specs

Vendor: Alibaba
Family: Qwen2.5
Released: 2025-01
Context window: 128,000 tokens
Modalities: text, vision, video

Strengths

  • Leads open-weights VLMs on document VQA and chart reasoning
  • Explicit GUI-agent grounding for click-target prediction
  • Handles video up to roughly an hour in length

Limitations

  • 72B footprint requires multi-GPU serving
  • Still behind frontier closed models on pure visual reasoning
  • License constraints on certain commercial deployments — check terms

Use cases

  • Open-weights GUI agents operating desktop or phone UIs
  • Long-form video understanding and summarisation
  • Document and form extraction for enterprise pipelines
  • Research on agentic vision-language models

Benchmarks

Benchmark     Score   As of
MMMU          ≈70%    2025-01
MathVista     ≈74%    2025-01
DocVQA        ≈96%    2025-01

Frequently asked questions

What is Qwen2.5-VL 72B?

Qwen2.5-VL 72B is Alibaba's top-tier open-weights vision-language model, released in January 2025 with strong document VQA, long-video reasoning, and agentic GUI grounding capabilities.

Can Qwen2.5-VL 72B run locally?

It is designed for multi-GPU serving; A100 or H100 clusters are typical. Quantised variants can run on a single 80 GB GPU with reduced throughput.
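The sizing follows from simple arithmetic: weight memory is roughly parameter count times bytes per parameter. The figures below cover weights only; real deployments also need headroom for the KV cache and activations.

```python
# Back-of-envelope VRAM estimate for a 72B-parameter model.
# Bytes-per-parameter values are standard for each precision;
# KV cache and activation memory are deliberately excluded.

def weight_vram_gb(params_billions: float, bytes_per_param: float) -> float:
    """Approximate weight memory in GB (1e9 params x bytes = GB)."""
    return params_billions * bytes_per_param

bf16_gb = weight_vram_gb(72, 2.0)  # 144 GB -> needs at least 2 x 80 GB GPUs
int4_gb = weight_vram_gb(72, 0.5)  # 36 GB  -> weights fit on one 80 GB GPU
```

This is why bf16 serving requires a multi-GPU setup while 4-bit quantised weights fit on a single 80 GB card, at the cost of throughput.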

What is Qwen2.5-VL 72B good at?

Structured document understanding, chart reasoning, long-video Q&A, and predicting interactive regions for GUI-agent loops — it is a strong open-weights alternative to closed frontier VLMs.
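In a GUI-agent loop, the model's grounding output is typically parsed into a click target. The sketch below assumes the JSON box format ("bbox_2d" in pixel coordinates) shown in Qwen's grounding examples; verify the exact shape against your own prompt template before relying on it.

```python
import json

# Hedged sketch: turn a Qwen2.5-VL grounding response into a click point.
# Assumption: the model returns a JSON list of objects with a "bbox_2d"
# key holding [x1, y1, x2, y2] pixel coordinates, as in Qwen's examples.

def click_point(response: str) -> tuple[int, int]:
    """Return the centre of the first predicted bounding box."""
    box = json.loads(response)[0]["bbox_2d"]  # [x1, y1, x2, y2]
    x1, y1, x2, y2 = box
    return ((x1 + x2) // 2, (y1 + y2) // 2)

print(click_point('[{"bbox_2d": [100, 40, 300, 80], "label": "Submit"}]'))
# → (200, 60)
```

The centre point then feeds directly into a desktop- or phone-automation backend as the coordinates to click.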

Sources

  1. Alibaba Qwen — Qwen2.5-VL blog — accessed 2026-04-20
  2. Hugging Face — Qwen/Qwen2.5-VL-72B-Instruct — accessed 2026-04-20