Curiosity · AI Model

Grok 2 Vision

Grok 2 Vision is xAI's 2024 multimodal variant of Grok 2, adding image input alongside text. It remains xAI's main vision-capable offering while the team rolls native vision into the Grok 4 series, and is often the cheapest route for basic chart, screenshot, and photo understanding on the xAI API.

Model specs

Vendor: xAI
Family: Grok
Released: 2024-10
Context window: 32,000 tokens
Modalities: text, vision
Input price: $2/M tok
Output price: $10/M tok
Pricing as of: 2026-04-20

Strengths

Tight integration with the X product surface
Cheaper than GPT-4o or Gemini Flash for visual tasks
Simple two-modality interface (image + text)

Limitations

Outperformed on OCR and chart reading by Gemini 2.5 Pro and GPT-5
32k context is small vs. modern vision LLMs
No video or audio modalities

Use cases

Screenshot Q&A inside the X app
Basic chart and infographic reading
Photo description for accessibility
Document OCR at modest scale

Benchmarks

Benchmark	Score	As of
MMMU	~66%	2026-04
DocVQA	~92%	2026-04
ChartQA	~85%	2026-04

Frequently asked questions

What is Grok 2 Vision?

Grok 2 Vision is xAI's 2024 multimodal model, extending Grok 2 with image input. It supports photos, screenshots, charts, and document pages alongside text prompts.

Is Grok 2 Vision replaced by Grok 4?

Grok 4 has native vision baked in, so new builds should prefer Grok 4. Grok 2 Vision remains available for cost-sensitive workloads on the xAI API.

Sources

xAI — Grok 2 announcement — accessed 2026-04-20
xAI API vision docs — accessed 2026-04-20