GAIA Benchmark for General AI Assistants
GAIA (General AI Assistants) is a benchmark introduced in late 2023 by researchers from Meta AI, Hugging Face, and the AutoGPT project. It evaluates agents on 466 human-verified questions spanning reasoning, multi-modality, tool use, and long-horizon browsing. Humans score ~92%, while the best public agents still trail significantly, making GAIA a durable yardstick for generalist agent capability.
Protocol facts
- Sponsor: Hugging Face + Meta AI
- Status: stable
- Spec: https://huggingface.co/gaia-benchmark
- Interoperates with: AgentBench, WebArena, OpenAI Evals
Frequently asked questions
What makes GAIA different from other benchmarks?
GAIA tasks are conceptually simple for humans but require agents to chain many tools (browsing, file reading, code execution) across long horizons. The benchmark targets real-assistant capability rather than isolated skills.
How is GAIA scored?
Questions have exact-match string answers. Agents submit to a public Hugging Face leaderboard; scoring is automatic and transparent.
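Exact-match scoring typically applies light normalization before comparing strings. A minimal sketch of that idea follows; the specific `normalize` rules here are illustrative assumptions, not the leaderboard's exact logic:

```python
import re

def normalize(answer: str) -> str:
    """Normalize an answer string: lowercase, trim, strip surrounding
    punctuation, and drop thousands separators from plain numbers."""
    s = answer.strip().lower()
    s = s.strip(".,;:!?\"' ")
    # Treat "1,024" and "1024" as the same numeric answer.
    if re.fullmatch(r"[\d,]+(\.\d+)?", s):
        s = s.replace(",", "")
    return s

def exact_match(prediction: str, gold: str) -> bool:
    """Score one GAIA-style question: correct iff normalized strings match."""
    return normalize(prediction) == normalize(gold)

print(exact_match("1,024", "1024"))    # → True
print(exact_match(" Paris. ", "paris"))  # → True
print(exact_match("1025", "1024"))     # → False
```

Because answers are short strings rather than free-form text, this style of scoring stays fully automatic and reproducible, which is what makes the public leaderboard transparent.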
Can I use GAIA to evaluate my own agent?
Yes — the validation set is open and the test set accepts submissions. Many agent frameworks (LangGraph, CrewAI, smolagents) publish GAIA baseline scores.
Sources
- GAIA: a benchmark for General AI Assistants (arXiv 2311.12983) — accessed 2026-04-20
- GAIA Leaderboard on Hugging Face — accessed 2026-04-20