
HaluEval — Hallucination Evaluation Benchmark

HaluEval provides 35,000+ annotated hallucination examples that let you measure an agent's factuality: does it answer from evidence, or invent plausible-sounding claims? It has become a go-to benchmark for RAG pipelines and grounded agents, where hallucinations directly erode user trust.

Protocol facts

Sponsor
Academic (Renmin University + collaborators)
Status
stable
Spec
https://github.com/RUCAIBox/HaluEval
Interop with
RAG pipelines, LangChain, LlamaIndex

Frequently asked questions

What does HaluEval test?

It tests whether a model can detect hallucinated content in candidate responses and whether its own generations are grounded in provided context across QA, knowledge-grounded dialogue, and summarization.
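As a concrete illustration, a HaluEval QA record pairs a question and supporting knowledge with both a correct and a hallucinated answer. The field names below mirror the repo's QA data files (an assumption worth verifying against the actual release), and the grounding check is a toy heuristic standing in for a real LLM judge:

```python
import re

# Illustrative HaluEval-style QA record; field names are assumed, not copied from the dataset.
sample = {
    "knowledge": "The Eiffel Tower, completed in 1889, is 330 metres tall.",
    "question": "How tall is the Eiffel Tower?",
    "right_answer": "It is 330 metres tall.",
    "hallucinated_answer": "It is 1,063 feet tall, finished in 1901.",
}

def looks_grounded(answer: str, knowledge: str) -> bool:
    """Toy heuristic: every numeric token in the answer must appear in the knowledge."""
    numbers = re.findall(r"\d[\d,]*", answer)
    return all(n in knowledge for n in numbers)

print(looks_grounded(sample["right_answer"], sample["knowledge"]))         # → True
print(looks_grounded(sample["hallucinated_answer"], sample["knowledge"]))  # → False
```

In the benchmark itself, the judge is the model under test rather than a regex, but the task shape is the same: given context and a candidate response, decide whether the response is supported.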

How do I use HaluEval for an agent?

Run the agent against HaluEval queries with and without retrieval, then score outputs against gold-standard answers. The delta tells you how much your retrieval layer actually reduces hallucination.
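The with/without-retrieval comparison can be sketched as a small harness. Everything here is hypothetical scaffolding: `agent_no_rag`, `agent_with_rag`, and `exact_match` are stand-ins for your own agent and scoring function, and the two toy queries substitute for real HaluEval data:

```python
# Hedged sketch of the retrieval-ablation measurement described above.
# All names (agent_no_rag, agent_with_rag, exact_match) are hypothetical stand-ins.

def exact_match(prediction: str, gold: str) -> bool:
    return prediction.strip().lower() == gold.strip().lower()

def hallucination_rate(queries, answer_fn):
    """Fraction of queries where the agent's answer misses the gold answer."""
    wrong = sum(
        not exact_match(answer_fn(q["question"], q.get("context")), q["gold"])
        for q in queries
    )
    return wrong / len(queries)

queries = [
    {"question": "Capital of France?", "context": "Paris is the capital of France.", "gold": "Paris"},
    {"question": "Capital of Japan?", "context": "Tokyo is the capital of Japan.", "gold": "Tokyo"},
]

def agent_no_rag(question, context=None):
    return "Lyon"  # toy ungrounded guess, ignores evidence

def agent_with_rag(question, context=None):
    # toy grounded agent: lifts the answer from the retrieved context
    return context.split()[0] if context else "unknown"

baseline = hallucination_rate(queries, agent_no_rag)
grounded = hallucination_rate(queries, agent_with_rag)
print(f"baseline={baseline:.2f} grounded={grounded:.2f} delta={baseline - grounded:+.2f}")
# → baseline=1.00 grounded=0.00 delta=+1.00
```

A positive delta means retrieval is doing real work; a delta near zero suggests the retrieval layer isn't changing what the model says.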

Is HaluEval still current?

The original 2023 release is still widely cited, and newer variants (HaluEval-Wild, HaluEval 2.0) extend it with harder, more recent examples to keep pace with frontier-model improvements.

Sources

  1. HaluEval GitHub — accessed 2026-04-20
  2. HaluEval paper (EMNLP 2023) — accessed 2026-04-20