Prompt Caching vs RAG

Prompt caching and RAG are sometimes pitched as alternatives, but they solve different problems. Prompt caching is about reusing the expensive compute of a repeated long prefix — system prompts, tool descriptions, stable context. RAG is about selecting the right chunk of knowledge to include in a context at inference time. Most production apps use both.

Side-by-side

| Criterion | Prompt Caching | RAG (Retrieval-Augmented Generation) |
|---|---|---|
| What it optimizes | Cost/latency of repeated prefixes | Relevance of retrieved knowledge |
| Handles "which docs"? | No — prefix is fixed | Yes — its whole purpose |
| Handles "avoid re-processing"? | Yes — KV-cache hit | No — retrieved chunks are re-processed on every call |
| Cost savings | Up to 90% on cached tokens (Anthropic, OpenAI) | Keeps context small vs. dumping everything in |
| Latency savings | Significant on long prefixes | None directly — retrieval adds a lookup step, though smaller prompts help |
| Setup complexity | Low — mark the cached prefix | Medium — embedder, vector DB, chunker |
| Index freshness | N/A | Core concern — handled by index updates |
| Works with closed APIs | Yes (Anthropic, OpenAI, Gemini) | Yes (provider-agnostic) |
| Combined use | Cache the stable prefix (tools, system prompt) | Retrieve the variable chunks |

Verdict

Prompt caching and RAG are not competing strategies — they're complementary layers of the same optimization stack. Cache what's stable (system prompt, tool definitions, long framing documents) and RAG the dynamic parts (user-relevant passages). A typical production prompt looks like: [cached system prompt + tools] + [RAG-retrieved passages] + [user query]. That pattern minimizes cost without compromising freshness.
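The combined pattern above can be sketched as request assembly. This is a minimal illustration assuming the shape of Anthropic's Messages API, where a `cache_control` marker of type `"ephemeral"` flags the end of the cacheable prefix; the system-prompt text and model name are placeholders, not real values.

```python
# Stable prefix (placeholder text): cached across calls because it is byte-identical.
SYSTEM_PROMPT = "You are a support assistant. Answer only from the provided context."

def build_request(retrieved_passages: list[str], user_query: str) -> dict:
    """Assemble a request: cached prefix first, variable content after."""
    context = "\n\n".join(retrieved_passages)
    return {
        "model": "claude-sonnet-4-5",  # placeholder model name
        "max_tokens": 1024,
        "system": [
            {
                "type": "text",
                "text": SYSTEM_PROMPT,
                # Cache breakpoint: everything up to here is reused on later calls.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        # Variable suffix: RAG passages + user query sit after the breakpoint,
        # so they can change per request without invalidating the cache.
        "messages": [
            {
                "role": "user",
                "content": f"Context:\n{context}\n\nQuestion: {user_query}",
            }
        ],
    }
```

The key design choice is ordering: anything that varies per request goes after the cache breakpoint, so the prefix stays identical and keeps hitting the cache.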

When to choose each

Choose Prompt Caching if…

  • You have a long, stable prefix that every call reuses.
  • Latency on long-context calls is hurting UX.
  • You want up to 90% token-cost savings on repeat traffic.
  • Your system prompt or tool definitions are >1000 tokens.

Choose RAG (Retrieval-Augmented Generation) if…

  • Your knowledge base is larger than your context window.
  • Knowledge changes regularly and freshness matters.
  • You need source citations for auditability.
  • Different queries need different passages of a large corpus.

Frequently asked questions

Can I use prompt caching with RAG?

Yes, and it's the best pattern. Cache the stable prefix (system prompt + tools), place the RAG-retrieved passages after the cache boundary, and put the user query last. You keep the caching savings without giving up retrieval.
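The reason ordering matters is that a cache hit requires everything before the boundary to match exactly. A small provider-agnostic sketch (the system prompt and tool list are hypothetical) that serializes the two halves of a prompt and checks the prefix stays byte-identical while the retrieved suffix varies:

```python
import json

def split_prompt(system_prompt: str, tools: list[dict],
                 passages: list[str], query: str) -> tuple[str, str]:
    """Return (cacheable_prefix, variable_suffix) as canonical JSON strings."""
    prefix = json.dumps({"system": system_prompt, "tools": tools}, sort_keys=True)
    suffix = json.dumps({"passages": passages, "query": query}, sort_keys=True)
    return prefix, suffix

SYSTEM = "You answer from the provided context only."  # hypothetical
TOOLS = [{"name": "search_docs"}]                      # hypothetical

p1, s1 = split_prompt(SYSTEM, TOOLS, ["chunk about billing"], "How do refunds work?")
p2, s2 = split_prompt(SYSTEM, TOOLS, ["chunk about SSO"], "How do I enable SSO?")

assert p1 == p2  # identical prefix → second call can hit the cache
assert s1 != s2  # retrieved content varies freely after the boundary
```

If you interleave retrieved passages into the system prompt instead, the prefix changes on every request and the cache never hits.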

How long does a cached prefix live?

It depends on the provider. Anthropic's default ephemeral cache has a TTL of about 5 minutes, refreshed on each hit; a 1-hour TTL is available at a higher cache-write price. OpenAI and Google also offer prefix caching with their own TTLs, OpenAI automatically and Gemini via explicit or implicit context caching.

What's the minimum prefix for caching to pay off?

Roughly 1024 tokens on Anthropic (higher for some models); other providers set their own minimums. Below the minimum the prefix simply isn't cached, and for short prefixes the higher cache-write price can outweigh the read savings.
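The break-even arithmetic is easy to sketch. The multipliers below are the commonly quoted Anthropic ones (cache write ≈ 1.25× the base input price, cache read ≈ 0.1×) and are assumptions — check current provider pricing before relying on them:

```python
def caching_cost(prefix_tokens: int, calls: int,
                 write_mult: float = 1.25, read_mult: float = 0.10) -> float:
    """Relative cost of the prefix with caching: one write, then cached reads."""
    return prefix_tokens * (write_mult + (calls - 1) * read_mult)

def plain_cost(prefix_tokens: int, calls: int) -> float:
    """Relative cost without caching: full input price on every call."""
    return prefix_tokens * calls

# With these multipliers, caching loses on a single call but wins from the
# second call onward:
print(caching_cost(2000, 1), plain_cost(2000, 1))    # 2500.0 2000 — one call loses
print(caching_cost(2000, 10), plain_cost(2000, 10))  # 4300.0 20000 — ~78% saved
```

The same formula shows why tiny prefixes don't pay: the fixed write premium dominates when `prefix_tokens` is small relative to per-call traffic.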

Sources

  1. Anthropic — Prompt caching — accessed 2026-04-20
  2. OpenAI — Prompt caching — accessed 2026-04-20