A2A Agent Card — Capability Manifest Spec
The Agent Card is A2A's capability manifest — a JSON document an agent publishes describing its name, skills, endpoints, auth requirements, and supported transports.
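A minimal sketch of what such a card might look like, as a Python dict ready to serialize. The field names follow the shape described above (name, skills, auth, transport), but the exact keys, the endpoint URL, and the skill id here are illustrative rather than the normative A2A schema:

```python
import json

# Hypothetical Agent Card; key names approximate the A2A spec's shape.
agent_card = {
    "name": "invoice-processor",
    "description": "Extracts and validates invoice line items.",
    "url": "https://agents.example.com/invoice",   # illustrative endpoint
    "preferredTransport": "JSONRPC",
    "securitySchemes": {"bearer": {"type": "oauth2"}},
    "skills": [
        {
            "id": "extract-line-items",
            "description": "Parse a PDF invoice into structured line items.",
        }
    ],
}

# Serialize for publishing so peer agents can fetch it during discovery.
card_json = json.dumps(agent_card, indent=2)
```

Agents typically publish this document at a well-known HTTPS path so peers can fetch it before opening a task.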
How agents find each other, negotiate tasks, and share context — the emerging stack for multi-agent systems.
A2A leans on standard web auth — primarily OAuth 2.0 bearer tokens — so agents authenticate to one another the same way services do, with API keys and mTLS as alternatives.
Task handoff in A2A describes the full lifecycle of delegating work to another agent: create, assign, run, stream updates, return result — with support for long-running and multi-turn tasks.
Adept's ACT-1 was a 2022 action-transformer model that pioneered browser-controlling foundation models — a major intellectual precursor to modern computer-use agents.
AG-UI is an open event-based protocol for how a running agent streams its thoughts, tool calls, and partial outputs to a user-facing UI — the agent-to-UI counterpart of A2A.
Caching tool-call results and memoizing identical LLM prompts is how production agents cut cost and latency by 50–90% — turning repeated external calls into instant local lookups.
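A minimal sketch of prompt memoization, with a deterministic stand-in for the model call; keying on a hash of the exact prompt turns repeated identical calls into dictionary lookups:

```python
import hashlib

def expensive_llm_call(prompt: str) -> str:
    # Stand-in for a real model call; here it just echoes and counts invocations.
    expensive_llm_call.calls += 1
    return f"answer to: {prompt}"
expensive_llm_call.calls = 0

_cache: dict[str, str] = {}

def cached_llm_call(prompt: str) -> str:
    # Key on a hash of the exact prompt; identical prompts hit the cache.
    key = hashlib.sha256(prompt.encode()).hexdigest()
    if key not in _cache:
        _cache[key] = expensive_llm_call(prompt)
    return _cache[key]
```

Real systems add a TTL and cache only deterministic calls (temperature 0, idempotent tools), since stale or sampled results would otherwise be replayed.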
Agents can burn thousands of dollars in a single run if left unchecked — explicit token and cost budgets, per-step guards, context pruning, and cheaper-model routing are the patterns production teams use to keep spend sane.
The credential-vault pattern stores secrets — API keys, OAuth tokens, passwords — outside the agent's memory and injects them only into specific tool calls, limiting blast radius if the agent is compromised.
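One way to sketch the pattern, with a plain dict standing in for a real secrets manager; `CredentialVault`, `weather_tool`, and the credential names are all hypothetical:

```python
class CredentialVault:
    """Secrets live here, not in the agent's prompt or memory; they are
    injected only at the moment a specific tool call executes."""

    def __init__(self, secrets: dict[str, str]):
        self._secrets = secrets  # in production: a KMS or secrets manager

    def invoke(self, tool, credential_name: str, **kwargs):
        # The agent names the credential; it never sees the raw value.
        return tool(api_key=self._secrets[credential_name], **kwargs)

def weather_tool(api_key: str, city: str) -> str:
    # Stand-in tool; a real one would call an external API with the key.
    return f"forecast for {city} (auth: {'yes' if api_key else 'no'})"
```

The point is that a prompt-injected agent can at worst ask the vault to run an allowed tool; it cannot exfiltrate the key itself, because the key never enters model context.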
Episodic memory stores specific past events — 'on March 3, user asked X, agent did Y' — letting an agent recall concrete past interactions rather than only general facts.
Human-in-the-loop is the design pattern where agents pause for human approval, correction, or input at specific checkpoints — trading some autonomy for safety, accuracy, and regulatory fit in high-stakes workflows.
Agent identity uses OIDC and OAuth 2.1 to give AI agents their own cryptographically-verifiable identities — separate from user identities — with scoped permissions and full audit trails.
Map-reduce for agents: split a large input into chunks, process each in parallel with a 'map' agent, then combine results with a 'reduce' agent — the classic recipe for long-document work.
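The recipe can be sketched in a few lines, with string transforms standing in for the map and reduce agents:

```python
from concurrent.futures import ThreadPoolExecutor

def map_agent(chunk: str) -> str:
    # Stand-in for an LLM working on one chunk (e.g. summarizing it).
    return chunk.upper()

def reduce_agent(partials: list[str]) -> str:
    # Stand-in for an LLM that merges the partial results into one answer.
    return " | ".join(partials)

def map_reduce(document: str, chunk_size: int = 10) -> str:
    chunks = [document[i:i + chunk_size]
              for i in range(0, len(document), chunk_size)]
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(map_agent, chunks))  # parallel 'map' phase
    return reduce_agent(partials)                      # single 'reduce' phase
```

In practice chunking follows document structure (sections, paragraphs) rather than fixed character counts, and the reduce step may itself be hierarchical for very large inputs.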
Agent mesh networking is an architecture where specialized agents discover each other via a registry, call each other directly over a standard protocol (A2A, MCP), and compose dynamically without central orchestration.
ANP is an open agent-to-agent protocol that treats agents as first-class peers on a decentralized network — using DIDs for identity and JSON-LD for capability discovery.
A PII redaction layer sits between an agent and its inputs/outputs, scrubbing personally-identifiable information — names, SSNs, card numbers — before it reaches the LLM or leaves the system.
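An illustrative sketch of the scrubbing step; three regexes stand in for what would be a vetted redaction library or NER model in production:

```python
import re

# Illustrative patterns only; real redaction layers combine curated
# pattern sets with ML-based entity recognition.
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "CARD": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def redact(text: str) -> str:
    # Replace each match with a labeled placeholder before the text
    # reaches the LLM or leaves the system.
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text
```

Keeping the placeholders labeled (rather than deleting matches) preserves enough structure for the LLM to reason about the text without ever seeing the raw values.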
The pipeline pattern chains agents in a fixed sequence, each transforming the previous agent's output — a Unix-pipe style composition that favors determinism over autonomy.
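The composition itself is just a left fold over the stages; here three toy functions stand in for the agents:

```python
from functools import reduce as fold

def extract(text: str) -> list[str]:
    return text.split(",")                        # stand-in extraction agent

def clean(items: list[str]) -> list[str]:
    return [i.strip().lower() for i in items]     # stand-in cleanup agent

def summarize(items: list[str]) -> str:
    return f"{len(items)} items: {', '.join(items)}"  # stand-in summarizer

def run_pipeline(stages, payload):
    # Each stage consumes the previous stage's output, Unix-pipe style.
    return fold(lambda acc, stage: stage(acc), stages, payload)
```

Because the sequence is fixed and each stage's interface is explicit, failures localize to a single stage, which is exactly the determinism the pattern trades autonomy for.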
Procedural memory stores learned how-to knowledge — reusable skill snippets, successful tool-call sequences, corrected mistakes — that the agent can retrieve and apply to future similar tasks.
Prompt-injection defense is a layered set of techniques — input sanitization, instruction hierarchies, capability scoping, output firewalls — used to prevent attackers from hijacking an agent via untrusted text.
Rate limiting and quotas bound an agent's cost, blast radius, and abuse potential by capping tool calls, token spend, and external API use per user, session, or time window.
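A token bucket is one common way to implement such a cap; this sketch refills at a fixed rate and denies calls once the bucket is empty:

```python
import time

class TokenBucket:
    """Cap tool calls per window: spend one token per call, refill
    at `rate` tokens per second up to `capacity`."""

    def __init__(self, capacity: int, rate: float):
        self.capacity = capacity
        self.rate = rate
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, never above capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

Production systems keep one bucket per (user, tool) pair, and layer a hard token-spend budget on top, since a single allowed call can still be expensive.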
Retry-with-backoff is the core resilience pattern for agent tool calls: on transient failure, wait an exponentially growing interval before retrying, with jitter to avoid thundering-herd retries.
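The pattern fits in a few lines; the helper name and defaults below are illustrative, and the sleep function is injectable so the behavior can be tested without waiting:

```python
import random
import time

def retry_with_backoff(call, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Retry `call` on exception, waiting base_delay * 2**attempt
    plus full jitter between attempts."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # exhausted: surface the final failure
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay))  # jitter avoids herds
```

Real implementations retry only on errors known to be transient (timeouts, 429s, 5xx) and never on tool calls with side effects unless those calls are idempotent.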
The router pattern puts a lightweight classifier at the front door of an agent system, dispatching each request to the cheapest model or most specialized sub-agent that can handle it.
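In this sketch a keyword matcher stands in for the lightweight classifier model, and the handler names are hypothetical:

```python
def classify(request: str) -> str:
    # Stand-in for a cheap classifier model; keyword routing for illustration.
    text = request.lower()
    if "refund" in text:
        return "billing"
    if any(w in text for w in ("error", "crash", "bug")):
        return "technical"
    return "general"

# Route each label to the cheapest model or most specialized sub-agent.
HANDLERS = {
    "billing": lambda r: "billing-agent handled: " + r,
    "technical": lambda r: "tech-agent handled: " + r,
    "general": lambda r: "small-model handled: " + r,
}

def route(request: str) -> str:
    return HANDLERS[classify(request)](request)
```

The economics come from the asymmetry: the classifier is tiny and runs on every request, while the expensive specialists run only on the traffic that actually needs them.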
Sandboxing is the foundational safety pattern for agents that run code or browse the web — isolating the agent's execution environment so compromised or hallucinating runs cannot damage host systems or exfiltrate data.
Self-critique is an agent design pattern where the agent reviews and scores its own draft output against a rubric or checklist before returning it, catching errors that slipped past the initial generation.
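The draft-review-revise loop can be sketched generically; the three callables below are hypothetical stand-ins for the generator, rubric reviewer, and revision step:

```python
def self_critique_loop(generate, review, revise, task, max_rounds=3):
    """Draft, review against a rubric, revise; stop when the review
    finds no problems or the round budget is spent."""
    draft = generate(task)
    for _ in range(max_rounds):
        problems = review(draft)
        if not problems:
            break
        draft = revise(draft, problems)
    return draft

# Toy stand-ins: the rubric check is 'the draft must contain a citation'.
def generate(task):
    return f"draft about {task}"

def review(draft):
    return [] if "citation" in draft else ["missing citation"]

def revise(draft, problems):
    return draft + " [added citation]"
```

Bounding the rounds matters: without `max_rounds`, a reviewer that can never be satisfied would loop forever and burn the budget.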
Semantic memory stores generalized facts — 'the user prefers Python', 'our prod DB is Postgres' — as structured knowledge the agent can retrieve and use in future interactions.
Production agents need durable state and checkpoints — snapshots of memory, tool outputs, and plan steps — so long-running tasks survive crashes, timeouts, and human interruptions without starting over.
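A minimal JSON-file sketch of the idea; production systems use a database or workflow engine, but the shape is the same: snapshot after every completed step, reload on construction:

```python
import json
import tempfile
from pathlib import Path

class CheckpointedRun:
    """Persist completed plan steps and their outputs so a crashed or
    interrupted task resumes where it left off."""

    def __init__(self, path: Path):
        self.path = path
        self.state = {"completed": [], "memory": {}}
        if path.exists():  # resume from the last snapshot
            self.state = json.loads(path.read_text())

    def complete_step(self, step: str, output):
        self.state["completed"].append(step)
        self.state["memory"][step] = output
        self.path.write_text(json.dumps(self.state))  # checkpoint every step
```

On restart the planner skips any step already in `completed`, which is what makes long-running tasks safe to kill and resume.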
The streaming pattern surfaces partial agent output — token-by-token text, interim tool results, status events — to the user as it happens, making multi-second agent tasks feel responsive.
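In Python terms the pattern is naturally a generator of typed events; the tiny event shape here (`type`/`data` dicts) is invented for illustration:

```python
from typing import Iterator

def agent_events(task: str) -> Iterator[dict]:
    """Yield status, tool, and token events as they happen instead of
    blocking until the final answer (stand-in event stream)."""
    yield {"type": "status", "data": f"planning: {task}"}
    yield {"type": "tool_result", "data": "search returned 3 hits"}
    for token in ["The", " answer", " is", " 42."]:
        yield {"type": "token", "data": token}  # token-by-token text

def render(stream) -> str:
    # A UI would paint events incrementally; here we just accumulate tokens.
    return "".join(e["data"] for e in stream if e["type"] == "token")
```

Over the wire this usually rides on server-sent events or WebSockets, with the event types letting the UI show status and tool activity separately from the answer text.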
Tool permissioning is the discipline of granting agents the narrowest possible capability set — per-tool allow-lists, confirmation prompts for destructive operations, scoped OAuth, and user-in-the-loop approvals.
The voting-ensemble pattern runs N agents in parallel on the same task and aggregates their answers by majority vote or a judge model, trading cost for robustness on high-stakes decisions.
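The aggregation step is a straightforward majority count; the lambdas below stand in for independent agent runs:

```python
from collections import Counter

def majority_vote(answers: list[str]) -> str:
    """Aggregate N agents' answers; ties resolve to the first-seen answer."""
    return Counter(answers).most_common(1)[0][0]

def voting_ensemble(task, agents) -> str:
    # Run each agent on the same task (sequentially here for clarity;
    # production runs them in parallel).
    return majority_vote([agent(task) for agent in agents])
```

Majority vote only works when answers are comparable (exact strings, normalized numbers); for free-form outputs the aggregator is usually a judge model instead.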
AgentBench from Tsinghua evaluates LLMs as agents across eight distinct environments — OS, database, web shopping, games, and more — producing a single comparable score for agentic capability.
AgentOps is an open-source observability platform for LLM agents that captures every tool call, token, cost, and latency span — giving production teams tracing, session replay, and evals.
MCP (Model Context Protocol) has become the de-facto standard for exposing tools and data to agents — this entry covers how agent frameworks interoperate with MCP servers in practice.
The AI Engineer Foundation Agent Protocol is an open, vendor-neutral REST specification for running and controlling an agent — start a task, stream steps, list artifacts — backed by an open-source reference server.
Anchor Browser provides hosted, persistent, and programmatically-controllable browsers for AI agents — with built-in auth, CAPTCHA handling, session recording, and a standard CDP API.
Computer Use is Anthropic's API capability that lets Claude see the screen, move the mouse, and type — enabling the model to operate general-purpose software GUIs like a human user.
Arize Phoenix is an open-source LLM observability tool that traces agent runs via OpenTelemetry, clusters failures by embedding, and runs LLM-as-judge evals — all locally or self-hosted.
AutoGen's GroupChat puts several specialist agents around a virtual table with a manager that picks the next speaker — a flexible many-agent conversation primitive from Microsoft Research.
AutoGPT, released March 2023, was the first viral autonomous agent framework — a Python script that chained GPT-4 calls with tools to pursue goals without per-step human input, sparking the agent-framework era.
BabyAGI, released April 2023 by Yohei Nakajima, was a ~140-line Python script demonstrating task decomposition + prioritization + execution with GPT-4 — one of the first autonomous agent patterns shared widely.
The blackboard pattern uses a shared workspace where agents read and write partial results — a classical AI architecture now finding new life in LLM-agent systems.
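A sketch of the workspace and two cooperating agents; the entry keys and the stand-in agents are hypothetical:

```python
class Blackboard:
    """Shared workspace: agents post partial results; others read and extend."""

    def __init__(self):
        self.entries: dict[str, object] = {}

    def post(self, key: str, value):
        self.entries[key] = value

    def read(self, key: str, default=None):
        return self.entries.get(key, default)

def extractor(bb: Blackboard):
    # Stand-in agent: posts a partial result for others to build on.
    bb.post("entities", ["Ada Lovelace", "Charles Babbage"])

def linker(bb: Blackboard):
    # A second agent reads the first agent's output and extends it.
    bb.post("links", [(e, "wikipedia") for e in bb.read("entities", [])])
```

The appeal for LLM agents is the same as in classical AI: contributors stay decoupled, coordinating only through what is on the board rather than through direct calls.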
Bolt.new by StackBlitz is a browser-based full-stack coding agent built on WebContainers — it runs Node.js, installs packages, edits files, and previews apps entirely in the browser, then deploys to Netlify.
Browser Use is an open-source Python library that gives LLM agents structured access to a real Playwright browser — they see the DOM, screenshots, and interactive elements, and act via a typed action space.
Browserbase provides headless Chrome browsers in the cloud purpose-built for AI agents, with session recording, stealth mode, file handling, and per-request isolation.
Claude Code's subagent pattern lets the main Claude agent spawn specialised sub-Claudes with their own prompts, tool allowlists, and contexts — a first-class multi-agent workflow in a coding CLI.
Claude subagents have moved from coding-CLI curiosity to production pattern — powering Anthropic's own research agent and an increasing share of real-world agent deployments.
Cognigy is an enterprise conversational AI platform for contact centers that builds voice and chat agents with low-code flows, LLM grounding, and deep telephony and CCaaS integration.
Devin is Cognition's autonomous software-engineer agent that plans long-horizon coding tasks, browses documentation, executes shell commands, and ships pull requests with human review at gates — the first widely-publicised example, and the prototype, of the fully-autonomous SWE agent category.
CrewAI's hierarchical process puts a manager agent in charge of a crew — assigning tasks, reviewing outputs, and iterating — contrasting with its simpler sequential process.
Cursor Composer (Agent mode) is the multi-file, multi-step coding agent inside the Cursor IDE — it plans edits across files, runs shell commands, and iterates on tests without leaving the editor.
Deep research is a now-standard agent pattern — a lead agent plans a research question, dispatches parallel sub-agents to explore, synthesises findings, and cites sources.
A DevOps/SRE agent triages alerts, investigates incidents, proposes (or executes) fixes, and writes postmortems — augmenting on-call engineers with always-on log/metric correlation.
A finance analyst agent pulls data from ERP, data warehouses, and market sources, builds models, and drafts variance and scenario analyses — augmenting FP&A and investment teams.
A recruiting agent sources candidates, screens resumes, drafts outreach, schedules interviews, and summarizes feedback — managing the top of the hiring funnel with bias auditing built in.
A legal research agent searches case law, statutes, and firm documents, drafts memoranda with citations, and flags relevant precedents — augmenting associates on research-heavy workflows.
A marketing campaign agent plans campaigns, drafts creative across channels, segments audiences, launches in ad platforms, and reports on performance — closing the loop on optimization.
An enterprise SDR agent autonomously researches accounts, drafts personalized outreach, books meetings, and updates the CRM — replacing or augmenting the first-line sales-development role.
A Tier-1 support agent autonomously resolves the bulk of inbound customer issues — password resets, billing questions, order status, how-to queries — and cleanly escalates the rest to humans.
FIPA ACL is the late-1990s IEEE/FIPA standard agent communication language — the intellectual ancestor of modern A2A protocols, built on speech-act theory as a direct successor to KQML.
GAIA is a benchmark from Hugging Face and Meta that tests general AI assistants on real-world, multi-step questions requiring reasoning, tool use, and web browsing — designed to be easy for humans and hard for current agents.
Glean is a work-assistant platform that indexes a company's SaaS stack — Google Drive, Slack, Jira, Notion, Salesforce — and provides search and agents grounded in that internal knowledge.
Google's A2A is an open protocol for agent interoperability — how independently-built agents discover each other, describe their capabilities, and exchange task state.
GPQA is a 448-question expert-authored benchmark of graduate-level biology, chemistry, and physics problems used to measure whether agents can reason through genuinely hard, Google-proof scientific questions.
GPT Researcher is an open-source autonomous research agent that drafts a plan, issues web queries across many sources, deduplicates, and writes a cited research report — all without a human in the loop.
HaluEval is a large-scale benchmark of hallucination examples across QA, dialogue, and summarization used to measure how often LLM agents invent facts versus ground them in retrieved sources.
Handoff and delegation look similar but differ: in a handoff, control transfers to another agent; in delegation, the original agent waits for a result and keeps control.
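The distinction fits in a few lines of control flow; `search_agent` here is a hypothetical specialist:

```python
def search_agent(task: str) -> str:
    return f"results for {task}"  # stand-in specialist agent

def handoff(task: str) -> str:
    # Handoff: control transfers; the target agent owns the final answer.
    return search_agent(task)

def delegate(task: str) -> str:
    # Delegation: the caller waits for the result and keeps control,
    # post-processing before answering itself.
    partial = search_agent(task)
    return f"summary of ({partial})"
```

The practical consequence is accountability: after a handoff the new agent's context and policies govern the rest of the task, while a delegating agent remains responsible for the final output.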
The hierarchical pattern stacks orchestrator-worker vertically: a top-level planner delegates to mid-level coordinators, who in turn delegate to leaf worker agents — structured delegation for complex tasks.
IBM's ACP is an open protocol for agent-to-agent messaging, discovery, and orchestration — developed under the BeeAI project and designed for enterprise-grade multi-agent systems.
Jules is Google's asynchronous coding agent, built on Gemini — it clones your repo, plans changes, runs in a cloud VM, and opens a pull request with tests and diffs for review.
LangGraph's supervisor pattern uses a top-level supervisor agent that routes messages to specialised worker agents in a graph — the idiomatic LangGraph way to build multi-agent systems.
LaVague is an open-source web agent framework built around a Large Action Model — a model fine-tuned to translate natural-language web instructions into Selenium/Playwright actions.
LongBench evaluates agents on tasks that span many steps, long documents, and extended time horizons — where short-horizon benchmarks fail to capture the real difficulty of agent work.
Lovable is a chat-driven full-stack app builder that generates React + Tailwind frontends wired to Supabase backends — turning a natural-language brief into a working, deployable SaaS product.
MAgent and its successors benchmark multi-agent systems on cooperative and competitive tasks — negotiation, resource allocation, team coding — where the failure mode is coordination, not individual agent skill.
Manus is a general-purpose agent platform from Monica that gained attention in early 2025 for running long, autonomous browser + compute workflows on behalf of users.
mem0 is an open-source memory layer for AI agents that extracts, deduplicates, and retrieves user- and session-scoped facts across multi-turn conversations with a simple SDK.
MLE-Bench is OpenAI's benchmark of 75 real Kaggle competitions used to measure whether agents can perform end-to-end ML engineering: data exploration, feature engineering, model training, and submission.
In the debate pattern, two or more agents argue different positions on a problem before a judge agent adjudicates — a technique shown to improve reasoning accuracy on hard problems.
A working map of the 2026 agent-interoperability landscape: A2A, ANP (Agent Network Protocol), NLWeb, and how MCP fits as the tool-access layer underneath.
MultiOn is a consumer-facing web-action agent that turns natural-language goals into real browser actions — booking tables, filling forms, placing orders — across any public site.
NLWeb is Microsoft's open protocol for turning websites into agent-accessible endpoints by exposing schema.org-backed content as natural-language APIs queryable by any agent.
OpenAI's Agents SDK and the underlying Responses API form an emerging de-facto agents protocol — typed tool calls, handoffs, tracing, and guardrails with portable concepts across providers.
OpenAI's Evals framework and hosted Evals API let teams define graders, run LLM-as-judge and programmatic evaluations, and track agent quality across prompt, model, and tool changes.
OpenAI Swarm was an educational multi-agent framework focused on lightweight, stateless, peer-to-peer handoffs — the conceptual precursor to the production OpenAI Agents SDK.
The orchestrator-worker pattern assigns a lead agent to plan and route work, while specialised worker agents execute individual steps — the workhorse pattern for most production agent systems.
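Stripped to its skeleton, the pattern is a planner plus a dispatch table; the worker names and the canned plan below are illustrative stand-ins for LLM calls:

```python
def plan(goal: str) -> list[tuple[str, str]]:
    # Stand-in planner: a real lead agent would produce this with an LLM.
    return [("research", f"gather facts about {goal}"),
            ("write", f"draft a report on {goal}")]

# Specialised workers, keyed by capability.
WORKERS = {
    "research": lambda step: f"[notes] {step}",
    "write": lambda step: f"[draft] {step}",
}

def orchestrate(goal: str) -> list[str]:
    # The lead agent plans, then routes each step to the right worker.
    return [WORKERS[worker](step) for worker, step in plan(goal)]
```

Keeping planning and execution in separate agents lets each worker run with a narrow prompt, a narrow tool allowlist, and a fresh context, while only the orchestrator sees the whole task.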
OSWorld is a scalable benchmark that evaluates multi-modal agents on real computer tasks across Ubuntu, Windows, and macOS environments — clicking, typing, and navigating GUIs like a human user.
Perplexity Deep Research is an autonomous multi-step research agent that browses the web for several minutes, synthesizes dozens of sources, and writes a cited long-form report for a single prompt.
Playwright is Microsoft's cross-browser automation library — Chromium, Firefox, WebKit — widely used as the deterministic foundation underneath AI-powered browser agents like Stagehand and browser-use.
Reflection is the pattern where an agent critiques its own output — or has a reviewer agent critique it — before finalising, catching errors that a single forward pass would emit.
Rod is a Go-native Chrome DevTools Protocol library that provides high-performance browser automation without Node dependencies — popular for Go agent backends driving browsers at scale.
SafeBench is a benchmark suite that stress-tests autonomous agents on harmful-instruction compliance, indirect prompt injection, unsafe tool use, and jailbreak robustness across standardized scenarios.
AI Scientist by Sakana AI is an end-to-end agent pipeline that proposes ML research ideas, writes experiment code, runs experiments, analyzes results, and drafts a LaTeX paper — the first demonstration of fully autonomous ML research.
Selenium is the veteran cross-browser automation framework — WebDriver-based, language-agnostic — still used by AI agents operating in enterprise or legacy environments where Playwright isn't an option.
Skyvern is an open-source RPA platform that uses LLMs and vision models to automate browser workflows — form fills, portal logins, document uploads — without writing brittle XPath selectors.
Stagehand is an open-source browser automation framework from Browserbase that combines deterministic Playwright code with AI-powered steps like act(), extract(), and observe() for resilient web agents.
STORM is Stanford's open-source research agent that simulates multi-perspective expert interviews to generate Wikipedia-quality long-form articles with citations.
The swarm pattern, popularised by OpenAI, models multi-agent systems as a flat set of peer agents that hand off to one another via tool calls — no top-down orchestrator required.
SWE-bench evaluates autonomous coding agents on real GitHub issues from popular Python projects — the agent must produce a patch that resolves the issue and passes the project's own tests.
tau-bench is Sierra's benchmark for conversational agents that must use tools to complete real customer-support tasks like airline rebooking and retail returns, scored on policy compliance and task completion.
v0 by Vercel is a generative UI agent specialized in React, Next.js, Tailwind, and shadcn/ui — turning natural-language prompts and screenshots into production-ready components and deployable apps.
WebArena is a reproducible, self-hosted benchmark from Carnegie Mellon featuring four fully-functional websites — e-commerce, forums, GitLab, content management — where agents must complete natural-language tasks end-to-end.
Writer is a full-stack generative AI platform for enterprises, combining its own Palmyra LLM family with an agent builder, knowledge graph, and strict brand-voice and compliance controls.
Zep is a memory platform for AI agents that combines a temporal knowledge graph (Graphiti) with vector search to give agents persistent, queryable memory with fact-level provenance.