
Pixtral 12B

Pixtral 12B, released in September 2024, was Mistral AI's first open-weights multimodal model. It pairs a 12-billion-parameter Mistral NeMo decoder with a 400M-parameter vision encoder, accepts images at their native resolution and aspect ratio, and set a new open-weights bar on multimodal benchmarks such as MathVista and DocVQA at launch.
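
Native-resolution input has a direct token cost: the encoder tiles each image into patches that occupy the context window alongside text. A back-of-the-envelope sketch, assuming the 16x16-pixel patch size reported for Pixtral's encoder (the exact special-token bookkeeping may differ):

    import math

    def estimate_image_tokens(width: int, height: int, patch: int = 16) -> int:
        """Rough token cost of one image at native resolution."""
        cols = math.ceil(width / patch)   # patches per row
        rows = math.ceil(height / patch)  # patch rows
        return rows * cols + rows         # + one break token per row (assumption)

    # A 1024x1024 image costs roughly 4,160 of the 128,000-token context.
    print(estimate_image_tokens(1024, 1024))  # 4160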

Model specs

Vendor            Mistral AI
Family            Pixtral
Released          2024-09
Context window    128,000 tokens
Modalities        text, vision

Strengths

  • Open weights under Apache 2.0
  • Handles arbitrary image resolutions and aspect ratios
  • Competitive document-VQA performance at launch

Limitations

  • Small 12B backbone limits pure-text reasoning vs. larger models
  • Vision quality below Pixtral Large and frontier closed models
  • Audio is not supported

Use cases

  • Document understanding and form extraction (see the API sketch after this list)
  • Chart and plot reasoning in analytics pipelines
  • Open-weights alternative to closed vision models such as GPT-4o and Gemini Pro Vision
  • Education on multimodal transformer design
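
The first two use cases map directly onto a hosted API call. A minimal sketch, assuming the v1 mistralai Python client, the pixtral-12b-2409 model ID on la Plateforme, and a hypothetical local scan invoice.png (client details may vary across SDK versions):

    import base64
    import os

    from mistralai import Mistral  # pip install mistralai

    client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])

    # Encode a local scan as a data URI; the API also accepts plain HTTPS URLs.
    with open("invoice.png", "rb") as f:
        data_uri = "data:image/png;base64," + base64.b64encode(f.read()).decode()

    response = client.chat.complete(
        model="pixtral-12b-2409",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Extract the invoice number, date, and total as JSON."},
                {"type": "image_url", "image_url": data_uri},
            ],
        }],
    )
    print(response.choices[0].message.content)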

Benchmarks

Benchmark     Score   As of
MMMU          ≈52%    2024-09
MathVista     ≈58%    2024-09
DocVQA        ≈90%    2024-09

Frequently asked questions

What is Pixtral 12B?

Pixtral 12B is Mistral AI's first open-weights vision-language model, combining a 12-billion-parameter text decoder with a 400M vision encoder that accepts arbitrary-resolution images.

Where can I run Pixtral 12B?

Weights are on Hugging Face under 'mistralai/Pixtral-12B-2409', and Mistral serves the model on la Plateforme as 'pixtral-12b-2409'. Community tools such as vLLM and SGLang support it for local deployment; a vLLM sketch follows below.
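
A minimal local-serving sketch, assuming vLLM's offline LLM API and the "mistral" tokenizer mode its docs use for Mistral-format checkpoints; flags, defaults, and the example image URL are illustrative and may vary by vLLM version:

    from vllm import LLM, SamplingParams

    # Mistral-format checkpoints load with the "mistral" tokenizer mode.
    llm = LLM(
        model="mistralai/Pixtral-12B-2409",
        tokenizer_mode="mistral",
        limit_mm_per_prompt={"image": 2},  # cap images per request
    )

    # OpenAI-style multimodal message; the image URL is a placeholder.
    messages = [{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this chart in one sentence."},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/chart.png"}},
        ],
    }]

    outputs = llm.chat(messages, sampling_params=SamplingParams(max_tokens=256))
    print(outputs[0].outputs[0].text)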

What tasks is Pixtral 12B good at?

Document VQA, chart and table reasoning, image captioning, and visual instruction following. It is especially strong for its size on structured-document understanding.

Sources

  1. Mistral — Pixtral 12B launch — accessed 2026-04-20
  2. Hugging Face — mistralai/Pixtral-12B-2409 — accessed 2026-04-20