GPQA for Agents — Graduate-Level Reasoning Benchmark
GPQA (Graduate-Level Google-Proof Q&A) is a set of multiple-choice science questions written by domain PhDs and designed so that experts from other fields can't answer them even with unrestricted web access. For agents, it tests whether tool use — literature search, calculator, code execution — actually helps with real scientific reasoning rather than factual recall. That makes it a key capability benchmark for scientific research agents.
Protocol facts
- Sponsor: Academic (NYU + collaborators)
- Status: stable
- Spec: https://github.com/idavidrein/gpqa
- Interop with: OpenAI Assistants, Claude, Gemini, agent frameworks with tool use
Frequently asked questions
Why 'Google-proof'?
The questions were designed so that skilled non-experts — PhDs in other domains, given unrestricted Google access and around 30 minutes per question — score only about 34%. The answer isn't findable via a single search; it requires synthesizing domain knowledge.
How is GPQA used for agents?
Agents are given tool access (web search, calculator, code execution) and scored on the same questions. The accuracy delta versus a text-only run of the same underlying model shows whether the agent's scaffolding actually adds reasoning capability.
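A minimal sketch of how that tool-use delta might be computed, assuming each run yields per-question correctness records. The `Result` shape here is hypothetical for illustration, not part of the GPQA spec or any particular harness:

```python
from dataclasses import dataclass


@dataclass
class Result:
    """One graded answer from a single evaluation run (hypothetical record shape)."""
    question_id: str
    correct: bool


def accuracy(results: list[Result]) -> float:
    """Fraction of questions answered correctly in a run."""
    return sum(r.correct for r in results) / len(results)


def tool_use_delta(text_only: list[Result], with_tools: list[Result]) -> float:
    """Accuracy gain from tool use, computed only over questions
    present in both runs so the comparison is apples-to-apples."""
    base = {r.question_id: r.correct for r in text_only}
    shared = [r for r in with_tools if r.question_id in base]
    base_acc = sum(base[r.question_id] for r in shared) / len(shared)
    tool_acc = sum(r.correct for r in shared) / len(shared)
    return tool_acc - base_acc
```

A positive delta suggests the scaffolding contributes real capability; a delta near zero suggests the tools are decorative for this task.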
Has it been saturated?
GPQA Main is partially saturated by frontier models in 2026, but the Diamond subset (the hardest 198 questions) remains a discriminating benchmark where the top agents score 75–85%.
Sources
- GPQA GitHub — accessed 2026-04-20
- GPQA paper — accessed 2026-04-20