Pixtral 12B
Pixtral 12B, released in September 2024, was Mistral's first foray into open-weights multimodality. It pairs a 12-billion-parameter Mistral-NeMo decoder with a 400M vision encoder and supports arbitrary-resolution image inputs, setting a new open-weights bar on multimodal benchmarks like MathVista and DocVQA at launch.
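Because the encoder accepts native resolutions, the number of image tokens grows with the input size rather than being fixed. As a rough sketch of that trade-off, the helper below estimates image-token counts under assumptions consistent with Pixtral's reported design (16x16 pixel patches, images capped at 1024 px on the long side, one row-break token per patch row plus an end-of-image token); the exact special-token scheme and resizing policy are assumptions, not taken from this page.

```python
import math

def pixtral_image_tokens(width: int, height: int,
                         patch: int = 16, max_dim: int = 1024) -> int:
    """Estimate how many tokens one image consumes in the context window.

    Assumptions (not from this page): 16x16 patches, long side capped at
    1024 px with aspect ratio preserved, one break token per patch row,
    and one end-of-image token.
    """
    # Downscale so the longer side fits max_dim, keeping aspect ratio.
    scale = min(1.0, max_dim / max(width, height))
    cols = max(1, math.ceil(width * scale / patch))
    rows = max(1, math.ceil(height * scale / patch))
    # One token per patch, plus per-row break tokens and one end token.
    return rows * cols + rows + 1
```

Under these assumptions a full 1024x1024 image costs about 4,161 tokens, while a 512x512 image costs about 1,057, which is why high-resolution pages consume a noticeable slice of the 128K context.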
Model specs
| Spec | Value |
|---|---|
| Vendor | Mistral AI |
| Family | Pixtral |
| Released | 2024-09 |
| Context window | 128,000 tokens |
| Modalities | text, vision |
Strengths
- Open weights under Apache 2.0
- Handles arbitrary image resolutions and aspect ratios
- Competitive document-VQA performance at launch
Limitations
- Small 12B backbone limits pure-text reasoning vs. larger models
- Vision quality below Pixtral Large and frontier closed models
- Audio is not supported
Use cases
- Document understanding and form extraction
- Chart and plot reasoning in analytics pipelines
- Open-weights alternative to GPT-4o vision and Gemini Pro vision
- Education on multimodal transformer design
Benchmarks
| Benchmark | Score | As of |
|---|---|---|
| MMMU | ≈52% | 2024-09 |
| MathVista | ≈58% | 2024-09 |
| DocVQA | ≈90% | 2024-09 |
Frequently asked questions
What is Pixtral 12B?
Pixtral 12B is Mistral AI's first open-weights vision-language model, combining a 12-billion-parameter text decoder with a 400M vision encoder that accepts arbitrary-resolution images.
Where can I run Pixtral 12B?
Weights are available on Hugging Face under `mistralai/Pixtral-12B-2409`, and Mistral hosts the model on La Plateforme. Community inference stacks such as vLLM and SGLang support it for local deployment.
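Servers such as vLLM expose Pixtral behind an OpenAI-compatible chat endpoint, where an image is passed as a content part alongside the text prompt. The sketch below only builds that request payload; the model id, the example URL, and the exact content-part schema are assumptions based on the common OpenAI-style vision format, not an official Mistral SDK call.

```python
def vision_chat_payload(prompt: str, image_url: str,
                        model: str = "mistralai/Pixtral-12B-2409") -> dict:
    """Build an OpenAI-style chat request mixing text and one image.

    Assumption: the serving layer (e.g. a vLLM OpenAI-compatible server)
    accepts "text" and "image_url" content parts in a user message.
    """
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }

payload = vision_chat_payload(
    "Extract all line items from this invoice as JSON.",
    "https://example.com/invoice.png",  # hypothetical image URL
)
```

A payload like this would typically be POSTed to the server's `/v1/chat/completions` route with an HTTP client of your choice.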
What tasks is Pixtral 12B good at?
Document VQA, chart and table reasoning, image captioning, and visual instruction following. It is especially strong for its size on structured-document understanding.
Sources
- Mistral — Pixtral 12B launch — accessed 2026-04-20
- Hugging Face — mistralai/Pixtral-12B — accessed 2026-04-20