Pixtral 12B
Pixtral 12B, released in September 2024, was Mistral's first foray into open-weights multimodality. It pairs a 12-billion-parameter Mistral-NeMo decoder with a 400M vision encoder and supports arbitrary-resolution image inputs, setting a new open-weights bar on multimodal benchmarks like MathVista and DocVQA at launch.
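Because the encoder accepts native resolutions, the number of image tokens grows with the input size rather than being fixed. As a rough sketch of that trade-off, the helper below estimates image-token counts under assumptions consistent with Pixtral's reported design (16x16 pixel patches, images capped at 1024 px on the long side, one row-break token per patch row plus an end-of-image token); the exact special-token scheme and resizing policy are assumptions, not taken from this page.

```python
import math

def pixtral_image_tokens(width: int, height: int,
                         patch: int = 16, max_dim: int = 1024) -> int:
    """Estimate how many tokens one image consumes in the context window.

    Assumptions (not from this page): 16x16 patches, long side capped at
    1024 px with aspect ratio preserved, one break token per patch row,
    and one end-of-image token.
    """
    # Downscale so the longer side fits max_dim, keeping aspect ratio.
    scale = min(1.0, max_dim / max(width, height))
    cols = max(1, math.ceil(width * scale / patch))
    rows = max(1, math.ceil(height * scale / patch))
    # One token per patch, plus per-row break tokens and one end token.
    return rows * cols + rows + 1
```

Under these assumptions a full 1024x1024 image costs about 4,161 tokens, while a 512x512 image costs about 1,057, which is why high-resolution pages consume a noticeable slice of the 128K context.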
Model specs
| Spec | Value |
|---|---|
| Vendor | Mistral AI |
| Family | Pixtral |
| Released | 2024-09 |
| Context window | 128,000 tokens |
| Modalities | text, vision |
Strengths
- Open weights under Apache 2.0
- Handles arbitrary image resolutions and aspect ratios
- Competitive document-VQA performance at launch
Limitations
- Small 12B backbone limits pure-text reasoning vs. larger models
- Vision quality below Pixtral Large and frontier closed models
- Audio is not supported
Use cases
- Document understanding and form extraction
- Chart and plot reasoning in analytics pipelines
- Open-weights alternative to GPT-4o vision and Gemini Pro vision
- Education on multimodal transformer design
Benchmarks
| Benchmark | Score | As of |
|---|---|---|
| MMMU | ≈52% | 2024-09 |
| MathVista | ≈58% | 2024-09 |
| DocVQA | ≈90% | 2024-09 |
Frequently asked questions
What is Pixtral 12B?
Pixtral 12B is Mistral AI's first open-weights vision-language model, combining a 12-billion-parameter text decoder with a 400M vision encoder that accepts arbitrary-resolution images.
Where can I run Pixtral 12B?
Weights are available on Hugging Face under `mistralai/Pixtral-12B-2409`, and Mistral hosts the model on La Plateforme. Community inference stacks such as vLLM and SGLang support it for local deployment.
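Servers such as vLLM expose Pixtral behind an OpenAI-compatible chat endpoint, where an image is passed as a content part alongside the text prompt. The sketch below only builds that request payload; the model id, the example URL, and the exact content-part schema are assumptions based on the common OpenAI-style vision format, not an official Mistral SDK call.

```python
def vision_chat_payload(prompt: str, image_url: str,
                        model: str = "mistralai/Pixtral-12B-2409") -> dict:
    """Build an OpenAI-style chat request mixing text and one image.

    Assumption: the serving layer (e.g. a vLLM OpenAI-compatible server)
    accepts "text" and "image_url" content parts in a user message.
    """
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }

payload = vision_chat_payload(
    "Extract all line items from this invoice as JSON.",
    "https://example.com/invoice.png",  # hypothetical image URL
)
```

A payload like this would typically be POSTed to the server's `/v1/chat/completions` route with an HTTP client of your choice.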
What tasks is Pixtral 12B good at?
Document VQA, chart and table reasoning, image captioning, and visual instruction following. It is especially strong for its size on structured-document understanding.
Sources
- Mistral — Pixtral 12B launch — accessed 2026-04-20
- Hugging Face — mistralai/Pixtral-12B — accessed 2026-04-20