Curiosity · Concept

LLM-as-Judge

Human evaluation of LLM output is the gold standard but is slow and expensive. LLM-as-judge uses a strong model (often GPT-4, Claude, or Gemini) to rate outputs on criteria like accuracy, helpfulness, or groundedness, or to pick the better of two candidates. It has become the default evaluation loop for RLHF data, leaderboards like MT-Bench and Arena-Hard, and production QA.

Quick reference

Proficiency
Intermediate
Also known as
LLM-as-a-judge, AI judge, model-graded evaluation
Prerequisites
LLM basics, Evaluation metrics

Frequently asked questions

What is LLM-as-judge?

A pattern where an LLM evaluates other LLM outputs. You give the judge a prompt, one or two candidate responses, and a rubric, and it produces a score on a fixed scale (e.g., 1-5 or 1-10) or picks a winner. This lets you run thousands of evaluations cheaply where human labeling would be prohibitive.
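The loop above can be sketched in a few lines. This is a minimal illustration, not a prescribed implementation: the rubric text, the "Score: N" output convention, and the example judge reply are all assumptions, and the actual judge-model API call is stubbed out.

```python
import re

# Hypothetical rubric; real rubrics are usually more detailed and task-specific.
RUBRIC = """Rate the response for accuracy and helpfulness on a 1-5 scale.
Reply with a line 'Score: N' followed by a short justification."""

def build_judge_prompt(question: str, answer: str) -> str:
    # Assemble the judge prompt: rubric, original question, candidate answer.
    return f"{RUBRIC}\n\nQuestion:\n{question}\n\nCandidate response:\n{answer}"

def parse_score(judge_reply: str) -> int:
    # Extract the integer score the judge was instructed to emit.
    m = re.search(r"Score:\s*([1-5])", judge_reply)
    if m is None:
        raise ValueError("judge reply missing 'Score: N' line")
    return int(m.group(1))

# The judge call itself is stubbed; in practice this string would come
# from a strong model such as GPT-4, Claude, or Gemini.
example_reply = "Score: 4\nAccurate, though it omits one edge case."
print(parse_score(example_reply))  # → 4
```

Asking the judge for a fixed, machine-parseable output format is what makes thousands of evaluations cheap to aggregate.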

When does LLM-as-judge work well?

Pairwise comparison ('is A or B better?') is more reliable than absolute scoring. Concrete criteria (groundedness, instruction following, syntax correctness) work better than vague ones ('quality'). Judge correlation with humans is strongest on tasks where the judge model is clearly more capable than the candidate models.

What are the known biases?

  - Position bias: judges often prefer the first response; mitigate by randomizing order or running both orders.
  - Verbosity bias: longer answers score higher regardless of quality.
  - Self-preference: a model rates its own outputs more highly.
  - Style bias: well-formatted responses beat better but plainer ones.

Use multiple judge models and structured rubrics to dampen these.
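The both-orders mitigation for position bias can be sketched as follows. The `judge` callable here is a hypothetical stand-in for a pairwise judge-model call that returns "A" or "B"; the idea is to accept a verdict only when it survives swapping the candidates.

```python
def pairwise_verdict(judge, question, a, b):
    # Run the judge in both orders; accept the verdict only if it is
    # order-invariant, otherwise declare a tie (position bias detected).
    first = judge(question, a, b)    # returns "A" or "B"
    second = judge(question, b, a)   # same comparison, candidates swapped
    # Map the swapped-order verdict back to the original labels.
    second_unswapped = "A" if second == "B" else "B"
    return first if first == second_unswapped else "tie"

# Toy judge that always prefers whichever answer appears first,
# i.e. a maximally position-biased judge. Both orders disagree, so
# the inconsistent verdict collapses to a tie.
biased_judge = lambda q, x, y: "A"
print(pairwise_verdict(biased_judge, "q", "answer 1", "answer 2"))  # → tie
```

A consistent judge passes through unchanged; only order-sensitive verdicts are discarded, which is why this check is cheap insurance on pairwise benchmarks.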

Should I replace human evaluation with LLM-as-judge?

No; use it to augment human evaluation. Use LLM judges in the inner loop (daily iteration, regression tests, CI) and humans in the outer loop (release gates, user studies, adversarial red-teaming). Validate the judge on a small human-labeled set first to ensure it's measuring what you care about.
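That validation step can be as simple as an agreement rate between judge verdicts and human labels on the same items. A minimal sketch, with hypothetical example data; in practice you might also want a chance-corrected statistic such as Cohen's kappa.

```python
def judge_agreement(judge_verdicts, human_labels):
    # Fraction of items where the judge's verdict matches the human label;
    # a cheap sanity check before trusting the judge in CI.
    if len(judge_verdicts) != len(human_labels):
        raise ValueError("mismatched lengths")
    matches = sum(j == h for j, h in zip(judge_verdicts, human_labels))
    return matches / len(human_labels)

# Hypothetical pairwise verdicts on a small human-labeled set.
judge = ["A", "B", "A", "tie", "B"]
human = ["A", "B", "B", "tie", "B"]
print(judge_agreement(judge, human))  # → 0.8
```

If agreement on the held-out human set is low, fix the rubric or the judge model before wiring it into regression tests.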

Sources

  1. Zheng et al. — Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena — accessed 2026-04-20
  2. Liu et al. — G-Eval — accessed 2026-04-20
  3. LMSYS Chatbot Arena — accessed 2026-04-20
  4. Hugging Face — LLM-as-a-Judge — accessed 2026-04-20