
Few-Shot Prompting vs Fine-Tuning

Few-shot prompting and fine-tuning both customize an LLM to your task, but they operate at opposite ends of the training/inference spectrum. Few-shot prompting puts a handful of demonstrations (typically 3-10, sometimes up to 20) directly in the prompt and lets the model generalize in context. Fine-tuning updates model weights by training on thousands of examples. The choice shapes your whole pipeline: cost, latency, iteration speed, and what kinds of improvement are possible.
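Mechanically, a few-shot prompt is just demonstrations concatenated ahead of the new input. A minimal sketch, with a hypothetical sentiment task and format (adapt the template to your own task):

```python
# Minimal sketch: assembling a few-shot prompt from demonstrations.
# Task, labels, and layout here are hypothetical placeholders.

def build_few_shot_prompt(examples, query, instruction):
    """Concatenate an instruction, worked demonstrations, and the new query."""
    parts = [instruction, ""]
    for inp, out in examples:
        parts.append(f"Input: {inp}")
        parts.append(f"Output: {out}")
        parts.append("")
    # The model completes the final, unanswered "Output:" line.
    parts.append(f"Input: {query}")
    parts.append("Output:")
    return "\n".join(parts)

examples = [
    ("The delivery was late again", "negative"),
    ("Support resolved my issue in minutes", "positive"),
]
prompt = build_few_shot_prompt(
    examples,
    "Great product, clunky app",
    "Classify the sentiment of each input as positive or negative.",
)
print(prompt)
```

The whole string is sent as the prompt; nothing about the model changes between requests, which is exactly why iteration is a matter of editing text and rerunning.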

Side-by-side

| Criterion | Few-Shot Prompting | Fine-Tuning |
|---|---|---|
| Setup time | Minutes — write prompts | Hours to days — data + training + eval |
| Training data required | ~5-20 examples | ~500-10,000+ high-quality examples |
| Per-request cost | Higher — examples sit in every prompt | Lower per request (after training) |
| Iteration speed | Minutes — edit prompt and rerun | Hours — retrain, re-evaluate, redeploy |
| Works with closed API models | Yes — any model | Only where the vendor allows it (OpenAI, Bedrock, Vertex) |
| Latency impact | Slight — longer prompt means longer time to first token | None vs. the base model |
| Style / tone conformance | Good with careful examples | Excellent — deeper imprint on style |
| Handles distribution shift | Edit examples and rerun | Retrain on new data |
| Best fit | Most tasks, especially with frontier models | High-volume production, specific style/format, narrow domain |
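The per-request cost row is the one that flips decisions at scale, and it reduces to simple break-even arithmetic. A sketch with entirely illustrative numbers (token counts, prices, and training cost are placeholders, not any vendor's rates):

```python
# Illustrative break-even arithmetic: few-shot carries example tokens in
# every request; fine-tuning pays a one-time cost but drops that overhead.

def breakeven_requests(example_tokens, price_per_mtok, training_cost):
    """Number of requests at which the one-time training cost equals the
    cumulative cost of shipping few-shot examples in every prompt."""
    per_request_overhead = example_tokens / 1_000_000 * price_per_mtok
    return training_cost / per_request_overhead

# e.g. a 1,500-token example block, $3 per million input tokens,
# $500 total for data prep + training:
n = breakeven_requests(1_500, 3.00, 500.0)
print(round(n))  # 111111
```

Below the break-even point, the flexibility of few-shot is effectively free; above it, fine-tuning starts paying for itself on every call.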

Verdict

In 2026, few-shot prompting should be the default — frontier models (Claude Opus 4.7, GPT-5, Gemini 2.5 Pro) are so strong at in-context learning that fine-tuning rarely beats a well-crafted few-shot prompt. The exceptions are high-volume production, where per-request cost dominates, and narrow style/format tasks, where you need the model to reliably produce a specific output pattern. Fine-tuning also shines for smaller open-weights models — fine-tuning a 7B model on your task can beat few-shot prompting a frontier model for cost-sensitive workloads. Rule of thumb: start with few-shot, measure, and fine-tune only when few-shot plateaus and volume justifies it.

When to choose each

Choose Few-Shot Prompting if…

  • You're early in development and need to iterate fast.
  • Your task benefits from frontier-model reasoning.
  • You have fewer than a few hundred examples available.
  • You don't want to manage training data and ops.

Choose Fine-Tuning if…

  • You have 1000+ labeled examples and a stable task definition.
  • High request volume makes per-request cost matter — omitting in-prompt examples saves tokens on every call.
  • You need the model to absolutely conform to a specific format/style.
  • You're working with a small open-weights model that benefits from task adaptation.
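If you go the fine-tuning route, most of the work is curating training data. A sketch of writing it out in the chat-style JSONL format several fine-tuning APIs accept — the field names below follow OpenAI's published chat format, but check your vendor's documentation, and the extraction task itself is hypothetical:

```python
import json

# Sketch: serializing fine-tuning examples as chat-style JSONL.
# Each line is one training example: system + user + assistant turns.

records = [
    {"messages": [
        {"role": "system", "content": "Extract the invoice total as JSON."},
        {"role": "user", "content": "Invoice #1042 ... Total due: $318.40"},
        {"role": "assistant", "content": '{"total": 318.40}'},
    ]},
    {"messages": [
        {"role": "system", "content": "Extract the invoice total as JSON."},
        {"role": "user", "content": "Amount payable: EUR 92.00 (net)"},
        {"role": "assistant", "content": '{"total": 92.00}'},
    ]},
]

with open("train.jsonl", "w") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")
```

Note how every assistant turn demonstrates the exact output format you want imprinted — this is where fine-tuning's "deeper imprint on style" comes from.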

Frequently asked questions

Isn't RAG better than both?

RAG solves a different problem — it provides external knowledge. Few-shot and fine-tuning shape how the model behaves. In production you often use all three: RAG for facts, few-shot for task structure, fine-tuning for format conformance.

How many examples for few-shot?

Typically 3-10. More is sometimes better but hits diminishing returns quickly with strong base models. What matters more than count is diversity — cover edge cases and failure modes, not just canonical cases.
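That diversity-first selection can be sketched as a greedy pick over a candidate pool — the `category` tags below are hypothetical annotations you would maintain on your own examples:

```python
# Sketch: trim a candidate pool to k few-shot examples, preferring
# coverage of distinct categories over near-duplicates.

def pick_diverse(candidates, k):
    """Greedily take one example per unseen category, then fill up to k."""
    chosen, seen = [], set()
    for ex in candidates:
        if ex["category"] not in seen:
            chosen.append(ex)
            seen.add(ex["category"])
        if len(chosen) == k:
            return chosen
    for ex in candidates:  # pool smaller than k categories: top up
        if len(chosen) == k:
            break
        if ex not in chosen:
            chosen.append(ex)
    return chosen

pool = [
    {"text": "refund request", "category": "billing"},
    {"text": "invoice question", "category": "billing"},
    {"text": "app crashes on login", "category": "bug"},
    {"text": "feature idea", "category": "request"},
]
print([e["category"] for e in pick_diverse(pool, 3)])  # ['billing', 'bug', 'request']
```

Even this crude heuristic beats hand-picking three near-identical canonical examples.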

Does prompt caching change the economics?

Yes. If your few-shot examples are stable, prompt caching (Anthropic, Google, OpenAI) dramatically reduces per-request cost — sometimes 90% off on the cached portion. This tips many decisions toward few-shot over fine-tuning.
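The caching effect on per-request cost is easy to model. A sketch with illustrative numbers — the 90% discount, token counts, and price are placeholders, since vendors price cache reads differently:

```python
# Illustrative arithmetic: cost of a request when the stable few-shot
# example block is served from the prompt cache at a discount.

def request_cost(cached_tokens, fresh_tokens, price_per_mtok, cache_discount):
    """Total input cost: discounted cached prefix + full-price fresh query."""
    cached = cached_tokens / 1_000_000 * price_per_mtok * (1 - cache_discount)
    fresh = fresh_tokens / 1_000_000 * price_per_mtok
    return cached + fresh

# 1,500-token cached example block, 200-token fresh query,
# $3 per million input tokens, 90% discount on cache reads:
with_cache = request_cost(1_500, 200, 3.00, 0.90)
without = request_cost(1_500, 200, 3.00, 0.00)
print(f"${with_cache:.6f} vs ${without:.6f}")  # $0.001050 vs $0.005100
```

With the cache, the example block costs roughly as much as the query itself — which is why caching moves the break-even point for fine-tuning far out.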
