MAgent / MAgentBench — Multi-Agent Benchmark
MAgentBench evaluates systems of multiple agents working together (or against each other) on tasks like team debugging, market simulation, and negotiation. It surfaces failures unique to multi-agent setups: goal drift, deadlock, echo chambers, and coordination overhead that single-agent benchmarks miss entirely.
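To make the setup concrete, here is a minimal sketch of a multi-agent episode loop with a step budget acting as a deadlock guard. The agent interface (a callable from message history to a message) and the `"SOLVED"` sentinel are illustrative assumptions, not MAgentBench APIs:

```python
# Minimal sketch of a multi-agent episode with a deadlock guard.
# The agent interface and step budget are assumptions for illustration.

def run_episode(agents, task, max_steps=50):
    """Round-robin the agents on a shared message board until one of them
    declares the task solved, or the step budget (deadlock guard) runs out."""
    board = [task]                      # shared message history
    for step in range(max_steps):
        agent = agents[step % len(agents)]
        msg = agent(board)              # each agent maps history -> a message
        if msg is None:                 # agent passes its turn
            continue
        board.append(msg)
        if msg == "SOLVED":
            return {"solved": True, "messages": len(board) - 1}
    # Budget exhausted: treat as deadlock / coordination failure.
    return {"solved": False, "messages": len(board) - 1}
```

The step budget is what turns a silent infinite loop into a measurable failure, which is exactly the kind of outcome single-agent harnesses never have to account for.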
Protocol facts
- Sponsor: Academic consortium
- Status: Proposed
- Interop with: AutoGen, CrewAI, LangGraph, Google A2A
Frequently asked questions
Why is multi-agent evaluation harder?
Because success depends on emergent group behavior. You can have five individually excellent agents that collectively fail due to information hoarding, role confusion, or infinite message loops.
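The infinite-message-loop failure mode can be caught with a simple heuristic over the team transcript. A minimal sketch, where the window size and repeat threshold are illustrative assumptions rather than MAgentBench parameters:

```python
from collections import Counter

def detect_message_loop(transcript, window=6, threshold=3):
    """Flag a likely infinite loop: the same (sender, content) pair
    recurring within a sliding window of recent messages.
    transcript is a list of (sender, content) tuples. Heuristic only."""
    recent = transcript[-window:]
    counts = Counter(recent)
    return any(c >= threshold for c in counts.values())
```

Exact-match repetition is the crudest signal; a real harness would also want near-duplicate detection, since LLM agents tend to loop with small paraphrases rather than verbatim repeats.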
What metrics does MAgentBench report?
Task success rate, coordination efficiency (messages per solved task), robustness to adversarial or faulty teammates, and cost (total tokens / API calls across the team).
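The four metric families above can be aggregated from per-episode records. A sketch, assuming a hypothetical episode schema (the field names are illustrative, not part of any published MAgentBench format):

```python
from dataclasses import dataclass

@dataclass
class Episode:
    solved: bool           # did the team complete the task?
    messages: int          # inter-agent messages exchanged
    tokens: int            # total tokens across the whole team
    faulty_teammate: bool  # was an adversarial/faulty agent injected?

def report(episodes):
    """Aggregate the four metric families: success, coordination
    efficiency, robustness under faults, and total cost."""
    solved = [e for e in episodes if e.solved]
    success_rate = len(solved) / len(episodes)
    # Coordination efficiency: messages per *solved* task (lower is better).
    msgs_per_solved = sum(e.messages for e in solved) / max(len(solved), 1)
    # Robustness: success rate on episodes with a faulty teammate injected.
    faulty = [e for e in episodes if e.faulty_teammate]
    robustness = sum(e.solved for e in faulty) / max(len(faulty), 1)
    return {
        "success_rate": success_rate,
        "messages_per_solved_task": msgs_per_solved,
        "robustness_under_faults": robustness,
        "total_tokens": sum(e.tokens for e in episodes),
    }
```

Normalizing messages by *solved* tasks matters: a team that spams its way to failure should not look efficient just because it sent many messages.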
Does it work for LLM agents specifically?
Yes — newer MAgent-style benchmarks like MultiAgentBench and Collaborative-Coder specifically target LLM agent teams built on AutoGen, CrewAI, and similar frameworks.
Sources
- MAgent original paper — accessed 2026-04-20
- MultiAgentBench (2024) — accessed 2026-04-20