GPQA for Agents — Graduate-Level Reasoning Benchmark
GPQA (Graduate-Level Google-Proof Q&A) is a set of multiple-choice science questions written by domain PhDs and designed so that experts from other fields can't answer them even with unrestricted web access. For agents, it tests whether tool use — literature search, calculator, code execution — actually helps with real scientific reasoning rather than factual recall. That makes it a key capability benchmark for scientific research agents.
Protocol facts
- Sponsor: Academic (NYU + collaborators)
- Status: stable
- Spec: https://github.com/idavidrein/gpqa
- Interop with: OpenAI Assistants, Claude, Gemini, agent frameworks with tool use
Frequently asked questions
Why 'Google-proof'?
The questions were designed so that skilled non-experts — PhDs in other domains, given unrestricted Google access and around 30 minutes per question — score only about 34%. The answer isn't findable via a single search; it requires synthesizing domain knowledge.
How is GPQA used for agents?
Agents are given tool access (web search, calculator, code execution) and scored on the same questions. The accuracy delta versus a text-only run of the same underlying model shows whether the agent's scaffolding actually adds reasoning capability.
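A minimal sketch of how that tool-use delta might be computed, assuming each run yields per-question correctness records. The `Result` shape here is hypothetical for illustration, not part of the GPQA spec or any particular harness:

```python
from dataclasses import dataclass


@dataclass
class Result:
    """One graded answer from a single evaluation run (hypothetical record shape)."""
    question_id: str
    correct: bool


def accuracy(results: list[Result]) -> float:
    """Fraction of questions answered correctly in a run."""
    return sum(r.correct for r in results) / len(results)


def tool_use_delta(text_only: list[Result], with_tools: list[Result]) -> float:
    """Accuracy gain from tool use, computed only over questions
    present in both runs so the comparison is apples-to-apples."""
    base = {r.question_id: r.correct for r in text_only}
    shared = [r for r in with_tools if r.question_id in base]
    base_acc = sum(base[r.question_id] for r in shared) / len(shared)
    tool_acc = sum(r.correct for r in shared) / len(shared)
    return tool_acc - base_acc
```

A positive delta suggests the scaffolding contributes real capability; a delta near zero suggests the tools are decorative for this task.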
Has it been saturated?
GPQA Main is partially saturated by frontier models in 2026, but the Diamond subset (the hardest 198 questions) remains a discriminating benchmark where the top agents score 75–85%.
Sources
- GPQA GitHub — accessed 2026-04-20
- GPQA paper — accessed 2026-04-20