Curiosity · AI Model

Grok 2 Vision

Grok 2 Vision is xAI's 2024 multimodal variant of Grok 2, adding image input alongside text. It remains xAI's main vision-capable offering while the team rolls native vision into the Grok 4 series, and is often the cheapest route for basic chart, screenshot, and photo understanding on the xAI API.

Model specs

Vendor
xAI
Family
Grok
Released
2024-10
Context window
32,000 tokens
Modalities
text, vision
Input price
$2/M tok
Output price
$10/M tok
Pricing as of
2026-04-20

Strengths

  • Tight integration with the X product surface
  • Cheaper than GPT-4o or Gemini Flash for visual tasks
  • Simple two-modality interface (image + text)

Limitations

  • Outperformed on OCR and chart reading by Gemini 2.5 Pro and GPT-5
  • 32k context is small vs. modern vision LLMs
  • No video or audio modalities

Use cases

  • Screenshot Q&A inside the X app
  • Basic chart and infographic reading
  • Photo description for accessibility
  • Document OCR at modest scale

Benchmarks

BenchmarkScoreAs of
MMMU~66%2026-04
DocVQA~92%2026-04
ChartQA~85%2026-04

Frequently asked questions

What is Grok 2 Vision?

Grok 2 Vision is xAI's 2024 multimodal model, extending Grok 2 with image input. It supports photos, screenshots, charts, and document pages alongside text prompts.

Is Grok 2 Vision replaced by Grok 4?

Grok 4 has native vision baked in, so new builds should prefer Grok 4. Grok 2 Vision remains available for cost-sensitive workloads on the xAI API.

Sources

  1. xAI — Grok 2 announcement — accessed 2026-04-20
  2. xAI API vision docs — accessed 2026-04-20