
GAIA Benchmark for General AI Assistants

GAIA (General AI Assistants) is a benchmark introduced in late 2023 by researchers from Meta AI, Hugging Face, and the AutoGPT project. It evaluates agents on 466 human-verified questions spanning reasoning, multi-modality, tool use, and long-horizon web browsing. Humans score roughly 92%, while the original paper reported GPT-4 with plugins at around 15%; the best public agents still trail humans significantly, making GAIA a durable yardstick for generalist agent capability.

Protocol facts

Sponsor: Hugging Face + Meta AI
Status: stable
Spec: https://huggingface.co/gaia-benchmark
Interop with: AgentBench, WebArena, OpenAI Evals

Frequently asked questions

What makes GAIA different from other benchmarks?

GAIA tasks are conceptually simple for humans but require agents to chain many tools (web browsing, file reading, code execution) across long horizons. The benchmark targets real-world assistant capability rather than isolated skills.
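The tool-chaining pattern GAIA stresses can be sketched as a minimal agent loop. The tool names, registry, and fixed-plan dispatch below are illustrative assumptions, not part of GAIA itself; a real agent would let an LLM choose each next step from intermediate observations.

```python
from typing import Callable

# Illustrative tool registry; real GAIA agents wire in a live browser,
# a file reader, a sandboxed code runner, etc.
TOOLS: dict[str, Callable[[str], str]] = {
    "search": lambda q: f"results for {q}",
    "read_file": lambda path: f"contents of {path}",
    "calculate": lambda expr: str(eval(expr)),  # demo only; never eval untrusted input
}

def run_agent(plan: list[tuple[str, str]]) -> str:
    """Execute a fixed plan of (tool, argument) steps, carrying the last
    observation forward; a real agent would plan dynamically."""
    observation = ""
    for tool, arg in plan:
        observation = TOOLS[tool](arg)
    return observation

print(run_agent([("search", "GAIA benchmark"), ("calculate", "40 + 2")]))  # prints "42"
```

The long-horizon difficulty comes from exactly this shape: each step's output conditions the next tool call, so one early mistake derails the whole chain.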

How is GAIA scored?

Each question has a single unambiguous answer (a string, a number, or a comma-separated list) checked by automated exact match after light normalization. Agents submit predictions to a public Hugging Face leaderboard; scoring is automatic and transparent.
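Exact-match scoring of this kind usually normalizes strings before comparing. A minimal sketch of the idea follows; the specific normalization rules (lowercasing, punctuation stripping, whitespace collapsing) are illustrative assumptions, not the leaderboard's actual code.

```python
import re
import string

def normalize(answer: str) -> str:
    """Lowercase, trim, strip punctuation, and collapse whitespace."""
    answer = answer.strip().lower()
    answer = answer.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", answer)

def is_correct(prediction: str, gold: str) -> bool:
    """Exact match after normalization; comma-separated lists compare element-wise."""
    pred_parts = [normalize(p) for p in prediction.split(",")]
    gold_parts = [normalize(g) for g in gold.split(",")]
    return pred_parts == gold_parts

print(is_correct(" Paris. ", "paris"))  # prints "True"
```

Because matching is this strict, agents must emit answers in exactly the requested format; verbose or hedged outputs score zero even when the underlying reasoning was right.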

Can I use GAIA to evaluate my own agent?

Yes — the validation set is open and the test set accepts submissions. Many agent frameworks (LangGraph, CrewAI, smolagents) publish GAIA baseline scores.
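A local evaluation harness over the open validation split can be sketched in a few lines. The record field names (`question`, `answer`) and the stub agent below are illustrative assumptions standing in for real GAIA records and a real agent.

```python
from typing import Callable, Iterable

def evaluate(agent: Callable[[str], str], tasks: Iterable[dict]) -> float:
    """Run an agent over (question, answer) records and return accuracy.
    Field names mimic GAIA-style validation records but are illustrative."""
    results = [
        agent(t["question"]).strip().lower() == t["answer"].strip().lower()
        for t in tasks
    ]
    return sum(results) / len(results) if results else 0.0

# Toy tasks and a stub agent, not real GAIA data:
toy_tasks = [
    {"question": "What is 2 + 2?", "answer": "4"},
    {"question": "Capital of France?", "answer": "Paris"},
]
stub_agent = lambda q: "4" if "2 + 2" in q else "Paris"
print(evaluate(stub_agent, toy_tasks))  # prints "1.0"
```

Running the same loop over the validation split gives a local score to compare against published framework baselines before submitting test-set predictions to the leaderboard.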

Sources

  1. GAIA: a benchmark for General AI Assistants (arXiv 2311.12983) — accessed 2026-04-20
  2. GAIA Leaderboard on Hugging Face — accessed 2026-04-20