InternVL 2.5

InternVL 2.5 is the December 2024 release of OpenGVLab's open multimodal model family. It ships in sizes from 1B to 78B parameters and is the first open VLM to exceed 70% on the MMMU benchmark, matching GPT-4o's multimodal reasoning. Architecturally it pairs an InternViT vision encoder (up to 6B parameters in the largest variants) with a Qwen 2.5 or InternLM 2.5 language backbone, and uses chain-of-thought rollouts at test time for long multimodal reasoning.
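Like its predecessors, InternVL 2.5 uses dynamic-resolution preprocessing that splits each input image into 448×448 tiles before the vision encoder, choosing a tile grid that approximates the image's aspect ratio. A minimal sketch of that grid selection (the function name, tile budget, and selection rule here are illustrative assumptions, not the official preprocessing code):

```python
def tile_grid(width, height, tile=448, max_tiles=12):
    """Pick a (cols, rows) grid of `tile`-px tiles whose aspect ratio
    best matches the input image. Illustrative sketch only; the real
    InternVL pipeline also handles resizing and a thumbnail tile."""
    target = width / height
    best, best_diff = (1, 1), float("inf")
    for cols in range(1, max_tiles + 1):
        for rows in range(1, max_tiles + 1):
            if cols * rows > max_tiles:
                continue  # stay within the tile budget
            diff = abs(cols / rows - target)
            if diff < best_diff:
                best, best_diff = (cols, rows), diff
    return best
```

For a 896×448 landscape image this picks a 2×1 grid; a square image maps to a single tile.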

Model specs

Vendor
OpenGVLab (Shanghai AI Lab)
Family
InternVL
Released
2024-12
Context window
32,768 tokens
Modalities
text, vision, video
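The 32,768-token context is shared between text and visual tokens. Assuming roughly 256 visual tokens per 448×448 tile (the figure reported in the InternVL technical reports; verify against the model card, as it is an assumption here), a quick back-of-the-envelope for how many tiles fit alongside a prompt:

```python
CONTEXT = 32_768       # context window from the spec above
TOKENS_PER_TILE = 256  # ~256 visual tokens per 448x448 tile (assumed)

def max_image_tiles(prompt_tokens, reply_budget=1024):
    """Tiles that fit after reserving room for the prompt and reply."""
    free = CONTEXT - prompt_tokens - reply_budget
    return max(free // TOKENS_PER_TILE, 0)
```

With a 512-token prompt and a 1,024-token reply budget, about 122 tiles fit, which is why multi-image and video inputs remain practical at this context length.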

Strengths

  • First open VLM above 70% on MMMU
  • Broad size ladder (1B to 78B) lets teams pick the right quality/cost trade-off
  • Strong math and OCR results
  • Active release cadence with detailed technical reports

Limitations

  • Largest sizes require 8× A100-class hardware to deploy
  • Chain-of-thought rollouts increase inference cost
  • English creative writing lags Western closed models
  • License has some commercial restrictions at the top tier

Use cases

  • Multimodal reasoning research
  • Open fine-tuning backbone for vertical VLMs
  • Visual document QA and math reasoning
  • Model-card benchmarking and reproducibility

Benchmarks

Benchmark     Score        As of
MMMU (val)    ≈70% (78B)   2024-12
MathVista     ≈72%         2024-12
OCRBench      ≈852         2024-12

Frequently asked questions

What is InternVL 2.5?

InternVL 2.5 is OpenGVLab's December 2024 open multimodal family, covering sizes from 1B to 78B and designed to match GPT-4o on multimodal reasoning benchmarks.

How does InternVL 2.5 compare to Qwen2-VL?

Both are leading open VLMs. InternVL 2.5 leads slightly on MMMU and MathVista thanks to test-time scaling; Qwen2-VL leads on long-video understanding.

Can I use InternVL 2.5 commercially?

Smaller variants are permissively licensed; the 78B model has extra restrictions. Check the Hugging Face model card for specifics.

Sources

  1. InternVL 2.5 paper (arXiv) — accessed 2026-04-20
  2. InternVL 2.5 on Hugging Face — accessed 2026-04-20