InternVL 2.5

InternVL 2.5 is the December 2024 release of OpenGVLab's open multimodal model family. It ships in sizes from 1B to 78B parameters and is the first open VLM to exceed 70% on the MMMU benchmark, matching GPT-4o's multimodal reasoning. Architecturally it pairs an InternViT vision encoder (up to 6B parameters in the largest variants) with a Qwen 2.5 or InternLM 2.5 language backbone, and uses chain-of-thought rollouts at test time for long multimodal reasoning.
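Like its predecessors, InternVL 2.5 uses dynamic-resolution preprocessing that splits each input image into 448×448 tiles before the vision encoder, choosing a tile grid that approximates the image's aspect ratio. A minimal sketch of that grid selection (the function name, tile budget, and selection rule here are illustrative assumptions, not the official preprocessing code):

```python
def tile_grid(width, height, tile=448, max_tiles=12):
    """Pick a (cols, rows) grid of `tile`-px tiles whose aspect ratio
    best matches the input image. Illustrative sketch only; the real
    InternVL pipeline also handles resizing and a thumbnail tile."""
    target = width / height
    best, best_diff = (1, 1), float("inf")
    for cols in range(1, max_tiles + 1):
        for rows in range(1, max_tiles + 1):
            if cols * rows > max_tiles:
                continue  # stay within the tile budget
            diff = abs(cols / rows - target)
            if diff < best_diff:
                best, best_diff = (cols, rows), diff
    return best
```

For a 896×448 landscape image this picks a 2×1 grid; a square image maps to a single tile.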

Model specs

Vendor
OpenGVLab (Shanghai AI Lab)
Family
InternVL
Released
2024-12
Context window
32,768 tokens
Modalities
text, vision, video
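The 32,768-token context is shared between text and visual tokens. Assuming roughly 256 visual tokens per 448×448 tile (the figure reported in the InternVL technical reports; verify against the model card, as it is an assumption here), a quick back-of-the-envelope for how many tiles fit alongside a prompt:

```python
CONTEXT = 32_768       # context window from the spec above
TOKENS_PER_TILE = 256  # ~256 visual tokens per 448x448 tile (assumed)

def max_image_tiles(prompt_tokens, reply_budget=1024):
    """Tiles that fit after reserving room for the prompt and reply."""
    free = CONTEXT - prompt_tokens - reply_budget
    return max(free // TOKENS_PER_TILE, 0)
```

With a 512-token prompt and a 1,024-token reply budget, about 122 tiles fit, which is why multi-image and video inputs remain practical at this context length.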

Strengths

  • First open VLM above 70% on MMMU
  • Broad size ladder (1B to 78B) lets teams pick the right quality/cost trade-off
  • Strong math and OCR results
  • Active release cadence with detailed technical reports

Limitations

  • Largest sizes require 8× A100-class hardware to deploy
  • Chain-of-thought rollouts increase inference cost
  • English creative writing lags Western closed models
  • License has some commercial restrictions at the top tier

Use cases

  • Multimodal reasoning research
  • Open fine-tuning backbone for vertical VLMs
  • Visual document QA and math reasoning
  • Model-card benchmarking and reproducibility

Benchmarks

Benchmark     Score        As of
MMMU (val)    ≈70% (78B)   2024-12
MathVista     ≈72%         2024-12
OCRBench      ≈852         2024-12

Frequently asked questions

What is InternVL 2.5?

InternVL 2.5 is OpenGVLab's December 2024 open multimodal family, covering sizes from 1B to 78B and designed to match GPT-4o on multimodal reasoning benchmarks.

How does InternVL 2.5 compare to Qwen2-VL?

Both are leading open VLMs. InternVL 2.5 leads slightly on MMMU and MathVista thanks to test-time scaling; Qwen2-VL leads on long-video understanding.

Can I use InternVL 2.5 commercially?

Smaller variants are permissively licensed; the 78B model has extra restrictions. Check the Hugging Face model card for specifics.

Sources

  1. InternVL 2.5 paper (arXiv) — accessed 2026-04-20
  2. InternVL 2.5 on Hugging Face — accessed 2026-04-20