Curiosity · Concept

Retrieval-Augmented Generation (RAG)

Retrieval-Augmented Generation (RAG) is the dominant pattern for grounding LLMs on private or fresh data without retraining. At query time, the system retrieves relevant passages from an index (vector, keyword, or hybrid) and inserts them into the prompt; the model then answers from the retrieved context, which reduces hallucinations and keeps answers current as the knowledge base changes.
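The retrieve-then-prompt loop can be sketched in a few lines. This is a toy illustration, not a production implementation: a bag-of-words cosine similarity stands in for a learned embedding model, and the passages, function names, and prompt template are invented for the example.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy "embedding": bag-of-words term counts over a naive whitespace
    # split. Real systems use a learned embedding model.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse term-count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, index: list[str], k: int = 2) -> list[str]:
    # Rank every indexed passage against the query and keep the top k.
    q = embed(query)
    ranked = sorted(index, key=lambda p: cosine(q, embed(p)), reverse=True)
    return ranked[:k]

def build_prompt(query: str, passages: list[str]) -> str:
    # Ground the model by restricting it to the retrieved context.
    context = "\n".join(f"- {p}" for p in passages)
    return (
        "Answer using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {query}"
    )

index = [
    "The refund window is 30 days from purchase.",
    "Our office is closed on public holidays.",
    "Refunds are issued to the original payment method.",
]
passages = retrieve("how do refunds work", index)
prompt = build_prompt("how do refunds work", passages)
```

In a real deployment the `prompt` string would be sent to an LLM; the key structural point is that the model's answer is constrained to the retrieved passages rather than its parametric memory.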

Quick reference

Proficiency
Beginner
Also known as
RAG, grounded generation, retrieval-grounded LLM
Prerequisites
Embeddings, Vector databases (basics)

Frequently asked questions

What is RAG?

Retrieval-Augmented Generation (RAG) is a pattern where an LLM is grounded on retrieved passages at query time — the system retrieves relevant chunks from an index and puts them in the prompt, so the model answers from real data rather than its parametric memory.

When should I use RAG vs fine-tuning?

Use RAG when knowledge changes frequently, you need citations, or retraining is expensive. Use fine-tuning when you need to change the model's style, tone, or task format rather than its facts. They can be combined.

What's a 'chunking strategy' in RAG?

Chunking is how you split documents before embedding. Common strategies: fixed-size with overlap, sentence/paragraph boundaries, hierarchical / parent-document, and semantic chunking that splits by topic. Chunk size dramatically affects retrieval quality.
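The simplest of these strategies, fixed-size with overlap, can be sketched as follows. The character-based sizes and the function name are illustrative; real pipelines usually chunk by tokens and snap to sentence boundaries.

```python
def chunk_fixed(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    # Fixed-size character chunks with overlap, so content that falls
    # on a chunk boundary still appears whole in at least one chunk.
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]
```

The overlap trades index size for recall: each boundary region is embedded twice, so a sentence split by one chunk edge is still retrievable from its neighbor.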

What are the failure modes of RAG?

Retrieval miss (the right chunk is not surfaced), retrieval noise (irrelevant chunks drown out signal), chunk truncation mid-sentence, over-stuffing the context window, and prompt-injection attacks via retrieved content.
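Two of these failure modes, retrieval noise and over-stuffing, are commonly mitigated by filtering retrieved chunks before prompting. A minimal sketch, with illustrative thresholds (the cutoff values here are assumptions, not recommendations):

```python
def filter_chunks(
    scored: list[tuple[float, str]],
    min_score: float = 0.2,
    max_chunks: int = 5,
) -> list[str]:
    # Drop low-similarity chunks (retrieval noise) and cap how many
    # survive (over-stuffing the context window). Input is a list of
    # (similarity_score, chunk_text) pairs from the retriever.
    kept = [(s, c) for s, c in scored if s >= min_score]
    kept.sort(key=lambda sc: sc[0], reverse=True)
    return [c for _, c in kept[:max_chunks]]
```

Note that no score threshold defends against prompt injection via retrieved content; that requires treating retrieved text as untrusted input at the prompt level.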

Sources

  1. Lewis et al. — RAG paper — accessed 2026-04-20
  2. LlamaIndex — Learning RAG — accessed 2026-04-20