Curiosity · Concept

Supervised Fine-Tuning (SFT)

Supervised Fine-Tuning (SFT), also called instruction tuning when the data consists of instruction-response pairs, is the stage where a pre-trained base LLM learns to follow user prompts. The training objective is ordinary next-token cross-entropy on curated demonstrations: (prompt, ideal response) pairs written by humans or generated by stronger models, with the loss typically computed only on the response tokens. SFT is cheap, stable, and usually a necessary precursor to any preference-based alignment step such as DPO or PPO-RLHF, because a strong SFT policy serves as both the initialisation and the reference model those methods need.
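
The objective can be sketched as masked next-token cross-entropy over a toy batch. Everything below (the function name, the toy logits, the vocabulary size) is illustrative rather than taken from any particular library; the key idea is that prompt positions get mask 0 and contribute nothing to the loss:

```python
import math

def sft_loss(logits, target_ids, loss_mask):
    """Masked next-token cross-entropy: average -log p(target) over
    positions where loss_mask is 1 (response tokens). Prompt positions
    (mask 0) are skipped, as is standard practice in SFT."""
    total, count = 0.0, 0
    for step_logits, target, m in zip(logits, target_ids, loss_mask):
        if not m:
            continue
        # Numerically stable log-sum-exp for the softmax normaliser.
        z = max(step_logits)
        log_norm = z + math.log(sum(math.exp(x - z) for x in step_logits))
        total += log_norm - step_logits[target]  # -log p(target)
        count += 1
    return total / count

# Toy sequence: vocabulary of 4 tokens, 3 positions, first is prompt.
logits  = [[2.0, 0.5, 0.1, 0.0],
           [0.1, 3.0, 0.2, 0.0],
           [0.0, 0.1, 2.5, 0.3]]
targets = [0, 1, 2]
mask    = [0, 1, 1]   # train only on the two response positions
loss = sft_loss(logits, targets, mask)
```

Because position 0 is masked out, changing its logits leaves the loss unchanged, which is exactly the behaviour that keeps the model from being trained to reproduce the prompt.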

Quick reference

Proficiency
Beginner
Also known as
SFT, instruction fine-tuning
Prerequisites
Fine-tuning, Language modelling

Frequently asked questions

What is supervised fine-tuning?

SFT is the stage of LLM post-training where the model learns from curated (prompt, response) pairs using ordinary supervised next-token prediction. It's how a base language model becomes an instruction-following assistant.

How is SFT different from pre-training?

Both use next-token cross-entropy, but pre-training runs on raw web-scale text while SFT runs on a much smaller, carefully curated set of task demonstrations — usually thousands to millions of examples rather than trillions of tokens.

Is SFT the same as instruction tuning?

Instruction tuning is a specific kind of SFT where the training data consists of instruction-response pairs. In modern LLM pipelines the two terms are used interchangeably.

Why do RLHF and DPO still need SFT first?

RLHF, DPO, and GRPO all need a starting policy and a reference model. The SFT checkpoint serves as both — a stronger SFT base gives preference methods more signal to work with and keeps the KL term well-behaved.

Sources

  1. Ouyang et al. — Training language models to follow instructions (InstructGPT) — accessed 2026-04-20
  2. Wei et al. — Finetuned Language Models Are Zero-Shot Learners (FLAN) — accessed 2026-04-20