Why Multi-Agent Coordination Fails — and What Actually Prevents It
A team ships a three-agent workflow: a researcher that gathers information, a planner that decomposes a task, and an executor that calls the tools. It passes every integration test. It ships to production. Two weeks in, the monthly bill comes back triple what the pilot suggested. Half the excess traces back to a single failure mode: the planner occasionally fans out to sub-tasks that re-invoke the researcher, which re-invokes the planner, and nothing in the system catches the loop until the budget trips.
Multi-agent workflows fail like this so routinely that an industry literature has formed around the pattern. The UC Berkeley MAST study — a NeurIPS 2025 Spotlight paper that annotated 1,642 traces across seven frameworks — reported failure rates between 41% and 86.7% depending on the system (see the paper's Section 4 for the per-framework breakdown). Our earlier post, Multi-Agent Systems Fail Up to 87% of the Time, walks through the cost model for each failure category. This post is about the architectural choices that actually prevent the failures in the first place.
The short version: most multi-agent failures aren't prompt problems, and the fixes teams reach for — better system prompts, stricter role descriptions, explicit handoff instructions — don't prevent the failures because the failures don't live at the prompt layer. They live in how the system models coordination and authority across agents.
Three failure categories, three prevention layers
MAST grouped 14 observed failure modes into three categories, with distinct distributions (roughly 44 / 32 / 24% — see the paper's Section 4 breakdown for the precise per-mode counts):
| Category | Approx. share | Examples | Prevention layer |
|---|---|---|---|
| System design | ~44% | Role overlap, missing escalation, scope ambiguity, unclear termination condition | Architecture (before the agents are written) |
| Inter-agent misalignment | ~32% | Handoff losing context, contradictory plans, infinite loops across agent boundaries | Protocol (coordination contract) |
| Task verification | ~24% | No final check on output, premature completion, verification bypassed under time pressure | Runtime (post-execution checks) |
The categories don't prevent each other. A perfect architecture with no protocol will still produce misalignment failures. A perfect protocol with no verification will still produce task-verification failures. Each layer needs its own pattern.
What the industry has largely been trying to do is patch all three at the prompt layer — better role instructions, "make sure you verify the output before returning," "don't loop." The MAST numbers are what happens when prompt-layer patches are asked to carry structural weight.
Layer 1: structural prevention (system design)
The failures in the 44% bucket — the ones MAST called "system design" — are the ones that happen because the architecture never enforced a boundary that should have existed. The three that show up most:
Role overlap. Two agents both believe they're responsible for the same decision. The planner thinks it's dispatching work; the executor thinks it's deciding what to do next. The result is duplicate work at best, contradictory actions at worst. Prompt-level fixes ("the planner plans; the executor executes") fail because neither agent is wrong in isolation — the system just gave them both authority over the same space.
Missing escalation. An agent gets stuck, and nobody knows. The agent retries the same failing call, or falls back silently to a cheaper model, or returns a best-effort answer the downstream agent treats as definitive. Prompt-level fixes ("if you're unsure, escalate") depend on the agent's own judgment about whether it's unsure, which is exactly the faculty that's already degrading.
Scope ambiguity. An agent has access to a broader set of tools or data than the task requires. A "customer support agent" can reach the refund API. A "code review agent" has write access to the repository. Prompt-level fixes ("only use these tools for these tasks") work until the agent is prompt-injected, or until the task crosses a boundary the instructions didn't anticipate.
The structural answer is to make the boundaries enforceable rather than advisory. In Cycles, this is the scope hierarchy: every reservation derives a scope path like tenant:acme/workspace:support/app:triage/workflow:refund/agent:executor/toolset:refunds. Each node in that path is an independent budget boundary. An agent at one node can't reach budget allocated to a sibling node, no matter what its prompt says. Role overlap becomes materially bounded in the budget model — two agents trying to act on the same refund both need reservations against the agent:executor scope, and the budget ledger caps how much the overlap can cost even if both proceed. What the budget ledger does not do is prove that only one agent ultimately commits the refund against the downstream system; that still needs a workflow-level idempotency key or application-owned lock. Budget atomicity bounds the authority; business-object atomicity is still the application's job. For the full pattern, see Agent Delegation Chains and Authority Attenuation.
The other structural pattern is authority attenuation: the child agent's budget is a strict subset of the parent's, at the moment of delegation. A parent with $100 and access to toolset:refund + toolset:email can hand its child a sub-budget of $10 with access to toolset:refund only. A prompt-injected child can't escalate out of its parent's attenuated authority because the attenuation is enforced in the budget ledger, not the prompt.
Layer 2: protocol prevention (inter-agent misalignment)
The failures in the ~32% bucket are the ones where two agents both did their jobs correctly but disagreed about what the job was. The 2025 SEMAP paper (arXiv:2510.12120) shows that structured protocols can dramatically reduce coordination failures in software-engineering multi-agent systems — specifically a 69.6% reduction on function-level development tasks through explicit contracts, lifecycle-guided execution, and verification gates at each handoff. The paper's evidence is domain-specific, but the pattern likely generalizes: the same structural moves have shown up under different names in distributed-systems work for decades.
The pattern is that handoffs between agents need a contract, not just a message. Three concrete moves:
Typed handoff payloads. An agent passing work to another should pass a structured object with explicit fields, not a free-text summary. A planner handing off to an executor passes {task_id, action, budget_limit, timeout, context_refs}, not "please execute the plan we just discussed." When the contract is structured, missing fields are a failure at the handoff rather than a silent degradation three agents later.
Verification gates. The handoff is a checkpoint: did the previous agent satisfy the pre-conditions the next agent needs? Did it stay within its authority? Did it produce an output whose shape matches the downstream contract? MAST's "premature completion" failure mode — the 23% category — is exactly what happens when verification is the agent's own responsibility. Structural verification, run by the orchestrator rather than the agent, is what breaks that loop.
Bounded coordination budgets. The more agents that touch a task, the higher the coordination overhead: context tokens pass through each agent, each agent adds its own tokens to the running trace, and the total grows super-linearly in the chain depth. A separate budget — capped per workflow — for "coordination tokens" puts an explicit ceiling on how deep a delegation chain can go before the system intervenes. DeepMind's "Towards a Science of Scaling Agent Systems" measured that unstructured multi-agent networks amplify errors up to 17.2× relative to a single-agent baseline, with diminishing returns beyond roughly four agents; the coordination-budget cap is how you operationalize that finding.
Protocols live above the runtime, in the orchestration layer. Cycles provides the scaffolding — each agent's reservation emits events, each commit ties back to a scope path, each handoff can be audited — but the contract itself is the orchestrator's responsibility. The runtime enforces that the contract was possible; the orchestrator enforces that it was met.
Layer 3: runtime prevention (verification + backstop)
The third category — the ~24% of failures that are task-verification issues — is the one where a correctly-architected, correctly-protocoled agent chain still produces a wrong answer, and no internal check catches it. This is the "200 OK" category: the system looks healthy at every layer but the output is wrong.
Prompt-level verification is brittle here for the same reason it's brittle in the first two layers: the verifier is another LLM call, subject to the same failure modes as the agent it's verifying. The Cloud Security Alliance's industry analysis "Fixing AI Agent Delegation for Secure Chains" documents exploit patterns ("Agent Session Smuggling," "Cross-Agent Privilege Escalation") where a compromised sub-agent silently rewrites the verification state — every agent in the chain looks fine in isolation. Industry write-ups like the CSA's aren't peer-reviewed research, but the documented exploit classes are real and worth designing against.
What works at this layer is external enforcement that doesn't trust the agents. Three backstops:
- Pre-execution budget enforcement. The reserve-commit pattern means every tool call, model invocation, or delegation request runs a pre-execution check against the budget ledger. A compromised agent can't just keep calling tools until it runs out of money — each call must reserve first, and reservations are denied once the budget is exhausted. See AI Agent Budget Control: Enforce Hard Spend Limits for the mechanics.
- Action authority via risk points. Not all actions are equal — a read is cheap, a refund is expensive, a
rm -rfis terminal. Cycles tracks a separateRISK_POINTSbudget per toolset, so an agent can run out of "dangerous action" budget before it runs out of "LLM call" budget. This is what prevents a compromised agent from using remaining model budget to rack up catastrophic side effects. - Graceful degradation paths. When a reservation is denied, the agent has a defined fallback: downgrade model, narrow capability, queue for human review, or stop. When Budget Runs Out: Graceful Degradation Patterns covers the decision matrix.
The runtime layer is deliberately not the first layer of defense — architecture and protocol should catch most failures before they hit runtime. But it's the only layer that still works when the first two have been bypassed, which is why it's the one worth building hard enforcement against.
Where current frameworks stand
Every major multi-agent framework ships some subset of these prevention layers. None ships all three by default, and the gaps are worth understanding:
| Framework | Built-in budget attenuation | Built-in handoff contracts | Built-in pre-execution enforcement | Built-in coordination-budget cap |
|---|---|---|---|---|
| CrewAI | Per-agent roles; no built-in hierarchical budget attenuation | Role-based task assignment (loose) | App-defined | App-defined |
| LangGraph | Graph-based typed state; no budget hierarchy | Typed node inputs/outputs (partial) | App-defined | App-defined |
| AutoGen | Conversation model; budget pool is shared | Conversation turns (unstructured) | App-defined | App-defined |
| Semantic Kernel | Per-agent tool scoping (loose) | App-defined | App-defined | App-defined |
| MS Agent Framework | Checkpointing; scope supported, not enforced | Five built-in orchestration patterns | App-defined | App-defined |
All five do the orchestration well. What none of them ship natively is pre-execution budget enforcement at the scope boundary — it's solvable at the application layer in each of them, but nothing stops a prompt-injected sub-agent from draining the parent's budget before the orchestrator notices unless that enforcement has been explicitly wired in. This isn't a criticism of the frameworks — orchestration is a genuinely hard problem and they're solving it well — but it's why a coordination-failure incident at 2 AM usually bottoms out in the budget layer, not the orchestration layer.
The pattern teams converge on in practice is framework plus a runtime authority layer: use CrewAI or LangGraph for the orchestration, delegate the budget boundary enforcement to a dedicated layer like Cycles. See Multi-Agent Budget Control: CrewAI, AutoGen, OpenAI Agents SDK for the integration patterns.
A running example
Put the three layers together with a concrete workflow. A customer-support tenant runs a three-agent pipeline: a triage agent, a research agent, and an action agent that issues refunds.
Architecture (layer 1):
- Scope path:
tenant:acme/workspace:support/app:triage/workflow:refund/agent:{triage,research,action}/toolset:{refunds,kb,email} - Each agent has its own scope node; budgets attenuate downward.
- The
actionagent is the only one with access totoolset:refunds;researchonly seestoolset:kb;triageonly sees model-call budget.
Protocol (layer 2):
- The triage agent passes a typed
{customer_id, issue_category, severity}handoff to research. - Research returns
{findings[], confidence_score}; if confidence is below threshold, the orchestrator routes to a human rather than the action agent. - The action agent receives
{refund_amount, approved_by, customer_id}and its own reservation is pre-capped atrefund_amount.
Runtime (layer 3):
- Every reservation checks scope budget before executing.
- The
toolset:refundsbudget has aRISK_POINTScap independent of model budget; the action agent runs out of refund authority before it runs out of LLM budget if something goes wrong. - A denial triggers graceful degradation: escalate to human, don't retry in a loop.
None of these layers is novel individually. The value is applying all three simultaneously. The MAST distribution implies that omitting any one layer leaves at least one major failure class uncaught.
The takeaway
Multi-agent coordination fails for structural reasons more than prompt reasons, and the failure modes cluster predictably: bad architecture causes role overlap, weak protocols cause misalignment, trusting the agents as their own verifiers causes the "200 OK" category. Prevention isn't a single pattern — it's three layers, each addressing a distinct category. Architecture before the agents are written. Protocols at every handoff. Runtime authority as the backstop that doesn't trust either.
The frameworks do good work at one or two of these layers. Holding the third — the runtime enforcement — is what separates a multi-agent workflow that behaves like a rate limiter from one that behaves like an attack surface.
Related reading
- Multi-Agent Systems Fail Up to 87% of the Time — the MAST taxonomy with per-category cost models
- Agent Delegation Chains Need Authority Attenuation, Not Trust Propagation — the scope-hierarchy and attenuation patterns in depth
- Multi-Agent Budget Control: CrewAI, AutoGen, OpenAI Agents SDK — integrating runtime authority with specific frameworks
- AI Agent Silent Failures: Why 200 OK Is the Most Dangerous Response — the verification-layer failures
- AI Agent Budget Control: Enforce Hard Spend Limits — the reserve-commit mechanics that enforce scope boundaries
- When Budget Runs Out: Graceful Degradation Patterns — fallback patterns when a reservation is denied
- UC Berkeley MAST study — 1,642-trace annotated multi-agent failure taxonomy
- SEMAP: structured protocols for multi-agent coordination — 69.6% failure reduction via explicit contracts
- Cloud Security Alliance: Fixing AI Agent Delegation for Secure Chains — industry analysis with documented delegation-chain exploit patterns
- DeepMind: Towards a Science of Scaling Agent Systems — error amplification across unstructured multi-agent networks