
GAIA Benchmark for General AI Assistants

GAIA (General AI Assistants) is a benchmark introduced in late 2023 by researchers from Meta AI, Hugging Face, and the AutoGPT project. It evaluates agents on 466 human-verified questions spanning reasoning, multi-modality, tool use, and long-horizon web browsing. Humans score roughly 92%, while the original paper reported GPT-4 with plugins at around 15%; the best public agents still trail humans significantly, making GAIA a durable yardstick for generalist agent capability.

Protocol facts

Sponsor: Hugging Face + Meta AI
Status: stable
Spec: https://huggingface.co/gaia-benchmark
Interop with: AgentBench, WebArena, OpenAI Evals

Frequently asked questions

What makes GAIA different from other benchmarks?

GAIA tasks are conceptually simple for humans but require agents to chain many tools (web browsing, file reading, code execution) across long horizons. The benchmark targets real-world assistant capability rather than isolated skills.
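The tool-chaining pattern GAIA stresses can be sketched as a minimal agent loop. The tool names, registry, and fixed-plan dispatch below are illustrative assumptions, not part of GAIA itself; a real agent would let an LLM choose each next step from intermediate observations.

```python
from typing import Callable

# Illustrative tool registry; real GAIA agents wire in a live browser,
# a file reader, a sandboxed code runner, etc.
TOOLS: dict[str, Callable[[str], str]] = {
    "search": lambda q: f"results for {q}",
    "read_file": lambda path: f"contents of {path}",
    "calculate": lambda expr: str(eval(expr)),  # demo only; never eval untrusted input
}

def run_agent(plan: list[tuple[str, str]]) -> str:
    """Execute a fixed plan of (tool, argument) steps, carrying the last
    observation forward; a real agent would plan dynamically."""
    observation = ""
    for tool, arg in plan:
        observation = TOOLS[tool](arg)
    return observation

print(run_agent([("search", "GAIA benchmark"), ("calculate", "40 + 2")]))  # prints "42"
```

The long-horizon difficulty comes from exactly this shape: each step's output conditions the next tool call, so one early mistake derails the whole chain.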

How is GAIA scored?

Each question has a single unambiguous answer (a string, a number, or a comma-separated list) checked by automated exact match after light normalization. Agents submit predictions to a public Hugging Face leaderboard; scoring is automatic and transparent.
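Exact-match scoring of this kind usually normalizes strings before comparing. A minimal sketch of the idea follows; the specific normalization rules (lowercasing, punctuation stripping, whitespace collapsing) are illustrative assumptions, not the leaderboard's actual code.

```python
import re
import string

def normalize(answer: str) -> str:
    """Lowercase, trim, strip punctuation, and collapse whitespace."""
    answer = answer.strip().lower()
    answer = answer.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", answer)

def is_correct(prediction: str, gold: str) -> bool:
    """Exact match after normalization; comma-separated lists compare element-wise."""
    pred_parts = [normalize(p) for p in prediction.split(",")]
    gold_parts = [normalize(g) for g in gold.split(",")]
    return pred_parts == gold_parts

print(is_correct(" Paris. ", "paris"))  # prints "True"
```

Because matching is this strict, agents must emit answers in exactly the requested format; verbose or hedged outputs score zero even when the underlying reasoning was right.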

Can I use GAIA to evaluate my own agent?

Yes — the validation set is open and the test set accepts submissions. Many agent frameworks (LangGraph, CrewAI, smolagents) publish GAIA baseline scores.
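A local evaluation harness over the open validation split can be sketched in a few lines. The record field names (`question`, `answer`) and the stub agent below are illustrative assumptions standing in for real GAIA records and a real agent.

```python
from typing import Callable, Iterable

def evaluate(agent: Callable[[str], str], tasks: Iterable[dict]) -> float:
    """Run an agent over (question, answer) records and return accuracy.
    Field names mimic GAIA-style validation records but are illustrative."""
    results = [
        agent(t["question"]).strip().lower() == t["answer"].strip().lower()
        for t in tasks
    ]
    return sum(results) / len(results) if results else 0.0

# Toy tasks and a stub agent, not real GAIA data:
toy_tasks = [
    {"question": "What is 2 + 2?", "answer": "4"},
    {"question": "Capital of France?", "answer": "Paris"},
]
stub_agent = lambda q: "4" if "2 + 2" in q else "Paris"
print(evaluate(stub_agent, toy_tasks))  # prints "1.0"
```

Running the same loop over the validation split gives a local score to compare against published framework baselines before submitting test-set predictions to the leaderboard.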

Sources

  1. GAIA: a benchmark for General AI Assistants (arXiv 2311.12983) — accessed 2026-04-20
  2. GAIA Leaderboard on Hugging Face — accessed 2026-04-20