MAgent / MAgentBench — Multi-Agent Benchmark
MAgentBench evaluates systems of multiple agents working together (or against each other) on tasks like team debugging, market simulation, and negotiation. It surfaces failures unique to multi-agent setups: goal drift, deadlock, echo chambers, and coordination overhead that single-agent benchmarks miss entirely.
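To make the setup concrete, here is a minimal sketch of a multi-agent episode loop with a step budget acting as a deadlock guard. The agent interface (a callable from message history to a message) and the `"SOLVED"` sentinel are illustrative assumptions, not MAgentBench APIs:

```python
# Minimal sketch of a multi-agent episode with a deadlock guard.
# The agent interface and step budget are assumptions for illustration.

def run_episode(agents, task, max_steps=50):
    """Round-robin the agents on a shared message board until one of them
    declares the task solved, or the step budget (deadlock guard) runs out."""
    board = [task]                      # shared message history
    for step in range(max_steps):
        agent = agents[step % len(agents)]
        msg = agent(board)              # each agent maps history -> a message
        if msg is None:                 # agent passes its turn
            continue
        board.append(msg)
        if msg == "SOLVED":
            return {"solved": True, "messages": len(board) - 1}
    # Budget exhausted: treat as deadlock / coordination failure.
    return {"solved": False, "messages": len(board) - 1}
```

The step budget is what turns a silent infinite loop into a measurable failure, which is exactly the kind of outcome single-agent harnesses never have to account for.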
Protocol facts
- Sponsor: Academic consortium
- Status: Proposed
- Interop with: AutoGen, CrewAI, LangGraph, Google A2A
Frequently asked questions
Why is multi-agent evaluation harder?
Because success depends on emergent group behavior. You can have five individually excellent agents that collectively fail due to information hoarding, role confusion, or infinite message loops.
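The infinite-message-loop failure mode can be caught with a simple heuristic over the team transcript. A minimal sketch, where the window size and repeat threshold are illustrative assumptions rather than MAgentBench parameters:

```python
from collections import Counter

def detect_message_loop(transcript, window=6, threshold=3):
    """Flag a likely infinite loop: the same (sender, content) pair
    recurring within a sliding window of recent messages.
    transcript is a list of (sender, content) tuples. Heuristic only."""
    recent = transcript[-window:]
    counts = Counter(recent)
    return any(c >= threshold for c in counts.values())
```

Exact-match repetition is the crudest signal; a real harness would also want near-duplicate detection, since LLM agents tend to loop with small paraphrases rather than verbatim repeats.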
What metrics does MAgentBench report?
Task success rate, coordination efficiency (messages per solved task), robustness to adversarial or faulty teammates, and cost (total tokens / API calls across the team).
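The four metric families above can be aggregated from per-episode records. A sketch, assuming a hypothetical episode schema (the field names are illustrative, not part of any published MAgentBench format):

```python
from dataclasses import dataclass

@dataclass
class Episode:
    solved: bool           # did the team complete the task?
    messages: int          # inter-agent messages exchanged
    tokens: int            # total tokens across the whole team
    faulty_teammate: bool  # was an adversarial/faulty agent injected?

def report(episodes):
    """Aggregate the four metric families: success, coordination
    efficiency, robustness under faults, and total cost."""
    solved = [e for e in episodes if e.solved]
    success_rate = len(solved) / len(episodes)
    # Coordination efficiency: messages per *solved* task (lower is better).
    msgs_per_solved = sum(e.messages for e in solved) / max(len(solved), 1)
    # Robustness: success rate on episodes with a faulty teammate injected.
    faulty = [e for e in episodes if e.faulty_teammate]
    robustness = sum(e.solved for e in faulty) / max(len(faulty), 1)
    return {
        "success_rate": success_rate,
        "messages_per_solved_task": msgs_per_solved,
        "robustness_under_faults": robustness,
        "total_tokens": sum(e.tokens for e in episodes),
    }
```

Normalizing messages by *solved* tasks matters: a team that spams its way to failure should not look efficient just because it sent many messages.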
Does it work for LLM agents specifically?
Yes — newer MAgent-style benchmarks like MultiAgentBench and Collaborative-Coder specifically target LLM agent teams built on AutoGen, CrewAI, and similar frameworks.
Sources
- MAgent original paper — accessed 2026-04-20
- MultiAgentBench (2024) — accessed 2026-04-20