Curiosity · AI Model
Grok 2 Vision
Grok 2 Vision is xAI's 2024 multimodal variant of Grok 2, adding image input alongside text. It remains xAI's main vision-capable offering while the team rolls native vision into the Grok 4 series, and is often the cheapest route for basic chart, screenshot, and photo understanding on the xAI API.
Model specs
- Vendor
- xAI
- Family
- Grok
- Released
- 2024-10
- Context window
- 32,000 tokens
- Modalities
- text, vision
- Input price
- $2/M tok
- Output price
- $10/M tok
- Pricing as of
- 2026-04-20
Strengths
- Tight integration with the X product surface
- Cheaper than GPT-4o or Gemini Flash for visual tasks
- Simple two-modality interface (image + text)
Limitations
- Outperformed on OCR and chart reading by Gemini 2.5 Pro and GPT-5
- 32k context is small vs. modern vision LLMs
- No video or audio modalities
Use cases
- Screenshot Q&A inside the X app
- Basic chart and infographic reading
- Photo description for accessibility
- Document OCR at modest scale
Benchmarks
| Benchmark | Score | As of |
|---|---|---|
| MMMU | ~66% | 2026-04 |
| DocVQA | ~92% | 2026-04 |
| ChartQA | ~85% | 2026-04 |
Frequently asked questions
What is Grok 2 Vision?
Grok 2 Vision is xAI's 2024 multimodal model, extending Grok 2 with image input. It supports photos, screenshots, charts, and document pages alongside text prompts.
Is Grok 2 Vision replaced by Grok 4?
Grok 4 has native vision baked in, so new builds should prefer Grok 4. Grok 2 Vision remains available for cost-sensitive workloads on the xAI API.
Sources
- xAI — Grok 2 announcement — accessed 2026-04-20
- xAI API vision docs — accessed 2026-04-20