HaluEval — Hallucination Evaluation Benchmark
HaluEval provides 35,000+ annotated hallucination examples that let you measure an agent's factuality: does it answer from evidence or invent plausible-sounding claims? It has become a go-to benchmark for RAG pipelines and grounded agents, where hallucination directly erodes user trust.
Protocol facts
- Sponsor: Academic (Renmin University + collaborators)
- Status: stable
- Spec: https://github.com/RUCAIBox/HaluEval
- Interop with: RAG pipelines, LangChain, LlamaIndex
Frequently asked questions
What does HaluEval test?
It tests whether a model can detect hallucinated content in candidate responses and whether its own generations are grounded in provided context across QA, knowledge-grounded dialogue, and summarization.
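The detection side of the benchmark can be scored with a simple accuracy metric: each sample pairs a grounded answer with a hallucinated one, and the judge should flag only the latter. A minimal sketch, assuming the QA split's JSON-style fields (`question`, `right_answer`, `hallucinated_answer`) and a hypothetical `judge` callable you supply:

```python
def detection_accuracy(samples, judge):
    """Score a judge that labels a candidate answer as hallucinated.

    samples: dicts with "question", "right_answer", "hallucinated_answer"
             (field names assumed from the QA split; verify against the repo).
    judge:   callable (question, answer) -> bool, True = hallucinated.
    """
    correct = 0
    for s in samples:
        # The judge should flag the hallucinated answer...
        correct += judge(s["question"], s["hallucinated_answer"]) is True
        # ...and pass the grounded one.
        correct += judge(s["question"], s["right_answer"]) is False
    # Two judgments per sample.
    return correct / (2 * len(samples))
```

Because every sample contributes one positive and one negative case, a judge that always answers "hallucinated" scores exactly 0.5, which makes chance performance easy to spot.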
How do I use HaluEval for an agent?
Run the agent against HaluEval queries with and without retrieval, then score outputs against gold-standard answers. The delta tells you how much your retrieval layer actually reduces hallucination.
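The with/without-retrieval comparison boils down to a difference of two hallucination rates. A minimal sketch, assuming hypothetical `agent_plain` and `agent_rag` callables (your agent with retrieval off and on) and an `is_grounded` grader you define, such as answer-string matching or an LLM judge:

```python
def retrieval_delta(agent_plain, agent_rag, queries, gold, is_grounded):
    """Hallucination-rate reduction attributable to the retrieval layer.

    agent_plain, agent_rag: callables query -> answer (hypothetical).
    is_grounded: callable (answer, gold_answer) -> bool.
    Returns rate_without_retrieval - rate_with_retrieval; positive means
    retrieval is reducing hallucination.
    """
    def hallucination_rate(agent):
        # Fraction of outputs the grader rejects as ungrounded.
        return sum(
            not is_grounded(agent(q), g) for q, g in zip(queries, gold)
        ) / len(queries)

    return hallucination_rate(agent_plain) - hallucination_rate(agent_rag)
```

A delta near zero is itself informative: it suggests the retrieval layer is either not being consulted or not surfacing the evidence the model needs.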
Is HaluEval still current?
The original 2023 release is still widely cited, and newer variants (HaluEval-Wild, HaluEval 2.0) extend it with harder, more recent examples to keep pace with improving frontier models.
Sources
- HaluEval GitHub — accessed 2026-04-20
- HaluEval paper (EMNLP 2023) — accessed 2026-04-20