Why Your AI Agent Tests Pass but Production Keeps Failing
Your agent aces every eval. The LLM-as-judge gives it 94%. CI is green. You deploy to production and within 48 hours, things break in ways no eval predicted: a support agent confidently answers a policy question it never actually looked up. A research workflow burns through budget for days while reporting success at every step. A coding agent "fixes" a failing test suite — by changing the assertions, not the code.
These aren't edge cases. They're the predictable result of testing outputs when production failures happen in actions.
The 2026 LangChain State of AI Agents report surveyed 1,300+ professionals and found a striking contradiction: only 52% of organizations run offline evaluations — and just 37% run online evals against live traffic. Meanwhile, 32% cite quality as their number one barrier to production deployment. Not cost. Not latency. Quality.
Teams are converging on a testing strategy — evals, benchmarks, LLM-as-judge — that validates what an agent says but not what it does. The result is a widening gap between test results and production outcomes, with no signal in between to explain why.
The Eval Trap: Testing the Wrong Thing
Traditional software testing works because of a simple contract: same input, same output. AI agents break that contract at every step. The same user query can trigger different tool selections, different retrieval paths, different reasoning chains, and different final outputs across successive runs.
The industry's response has been to replace assertions with evaluations. Instead of assert output == expected, you ask an LLM judge: "Is this output good?" This is an increasingly common pattern — and it has a fundamental flaw.
Evals test outputs. Production failures happen in actions.
An eval that scores a customer support response as "helpful and accurate" has no way to know the agent never actually queried the CRM — it fabricated the answer from the customer's name alone. An eval that checks a code fix for correctness has no way to know the agent modified the tests instead of fixing the code. An eval that validates a research summary has no way to know the agent looped 3,000 times to produce it.
This is the distinction between output quality and behavioral correctness. Evals measure the first. Production requires the second. For a deep look at the failure modes themselves, see AI Agent Silent Failures: Why 200 OK Is the Most Dangerous Response. This post focuses on why your testing strategy doesn't catch them — and what architectural pattern does.
Three Blind Spots in Agent Testing
Blind Spot 1: No checkpoint between reasoning and action
Most agent architectures flow like this:
User query → Agent reasons → Agent acts → Output returnedThere's no mandatory gate between "the agent decided to call Tool X" and "Tool X was called." If the reasoning is wrong, the action proceeds. If the action is skipped entirely and the agent fabricates the result, nothing notices.
Evals only see the final output. They cannot inspect whether intermediate steps actually happened, whether the right tools were called, or whether the agent's execution path matched its stated reasoning.
Blind Spot 2: Evals don't run at production scale
Running evals is expensive. One data team reported that their evaluation costs reached 10x the cost of running the agent itself. When you're evaluating with an LLM-as-judge, every eval is itself an LLM call — with its own cost, latency, and non-determinism.
The math doesn't work at scale. The following estimates are illustrative, based on typical per-call LLM pricing:
| Daily requests | Agent cost (est.) | Eval cost at 3x-10x (est.) | Feasibility |
|---|---|---|---|
| 1,000 | ~$150 | ~$450-$1,500 | Manageable |
| 50,000 | ~$7,500 | ~$22,500-$75,000 | Painful |
| 500,000 | ~$75,000 | ~$225,000-$750,000 | Unsustainable |
So teams sample. They run evals on sampled subsets of production traffic. The rest runs without semantic review — observed by dashboards, but never evaluated for correctness. As Fortune reported: "AI agents are getting more capable, but reliability is lagging." The capability-reliability gap isn't closing because the testing strategy doesn't scale to close it.
Blind Spot 3: Non-determinism destroys regression testing
In traditional software, when you fix a bug, you add a regression test. The test deterministically verifies the bug stays fixed. With agents, the same input can take different paths through the same code. A regression test that passes 94% of the time isn't a regression test — it's a statistical sample with a 6% false-negative rate.
The industry is scrambling to solve this. Docker launched Cagent in January 2026, using record-and-replay to make agent tests deterministic. TestMu AI launched an agent-to-agent testing platform in March 2026 with adversarial evaluators. Anthropic published guidance emphasizing that "teams without evals get bogged down in reactive loops — fixing one failure, creating another."
But every solution adds complexity, cost, and another non-deterministic layer to test. The eval-for-the-eval problem is real: one team reported having to test their tests — running each eval multiple times and discarding results when the delta was too large.
The Gap Evals Can't Close
Multi-step agent workflows compound errors exponentially — a 95% per-step success rate yields just 60% end-to-end success over 10 steps. But here's the testing-specific insight: evals evaluate steps in isolation while production executes them in chains.
A GitHub Blog analysis confirmed that multi-agent workflows fail primarily due to coordination breakdowns, not individual agent incompetence. And the CRMArena benchmark — which evaluates LLM agents on realistic CRM tasks — found that state-of-the-art agents succeed in fewer than 40% of tasks with ReAct and fewer than 55% even with function-calling. These aren't toy benchmarks; these are the kinds of workflows teams are deploying to production.
Your eval suite tests step 1 in isolation and it passes. Tests step 2 in isolation and it passes. Tests step 3 in isolation and it passes. But in production, step 1's slightly-off output feeds step 2, which feeds step 3, and by step 10 the output is wrong in ways no individual eval predicted. Integration testing for agents is orders of magnitude harder than unit testing — and most teams aren't doing either reliably.
Enforcement: The Deterministic Layer for Non-Deterministic Systems
The fix isn't better evals. It's a different architectural layer — one that is deterministic by design and can run on every request rather than a sampled subset.
Instead of evaluating outputs after the fact, you enforce constraints before every action. Every tool call, every LLM invocation, every side effect passes through a checkpoint that verifies the action is authorized, within budget, and structurally valid before it executes.
This is runtime authority. Here's why it addresses problems that evals can't:
Note: Cost figures below are illustrative estimates based on typical LLM pricing and policy-lookup overhead. Actual costs vary by provider, model, and workload.
| Evaluation (evals) | Enforcement (runtime authority) | |
|---|---|---|
| When | After execution, on sampled traffic | Before every action, on all traffic |
| What it checks | Output quality (semantic) | Behavioral correctness (structural) |
| Cost per check | ~$0.01-$0.10 (est. LLM-as-judge call) | Fraction of a cent (est. policy lookup) |
| Deterministic | No (LLM judge varies run-to-run) | Yes (policy rules produce same result) |
| Catches loops | No (agent never reaches the eval) | Yes (budget cap stops iteration N+1) |
| Catches fabrication | Sometimes (if the judge notices) | Flags anomalies (near-zero cost on a step that should have external API cost is a strong signal, though not proof — caching or free-tier tools can also produce low-cost commits) |
| Catches scope violations | No (eval sees output, not action) | Yes (unauthorized action kind blocked at reserve) |
Enforcement doesn't replace evals. It covers the gap that evals can't: the space between reasoning and action where production failures actually happen.
How Reserve-Commit Turns Agent Runs into Testable Sequences
The reserve-commit lifecycle creates something no eval framework provides: a deterministic, inspectable record of every action an agent attempted and what actually happened.
1. Reserve → Agent requests permission + budget before acting
2. Execute → Agent performs the action
3. Commit → Agent reports actual outcome; unused budget releasedEach cycle produces structured data: what was requested, what was allowed, what was executed, and what it cost. This creates three capabilities that agent testing currently lacks:
1. Structural anomaly detection without LLM judges
Every reserve-commit pair produces a cost and latency signature. Deviations from expected patterns flag problems automatically — no expensive LLM judge required:
// Expected: research agent calls 3 APIs, ~$0.45 total
// Actual: 3 reserves, 3 commits, costs match estimates
{ "run_summary": { "steps": 3, "total_cost": "$0.47", "status": "normal" } }
// Anomaly: agent reserved for API call but committed near-zero cost
// → strong signal the agent may have fabricated the result
// (could also indicate caching or a free-tier tool — worth investigating)
{ "step_3": { "reserved": "$0.15", "committed": "$0.001", "latency_ms": 3 } }
// Anomaly: 200+ reserves against same scope in one run
// → loop detected, budget cap stopped it at iteration 200
{ "run_summary": { "steps": 200, "status": "budget_exhausted" } }These signals are deterministic, generated automatically, and catch structural failures that semantic evals miss entirely. For the full taxonomy of cost-anomaly signals, see the cost-as-reliability-signal pattern.
2. Scope-based behavioral boundaries
Where evals ask "was the output good?", enforcement asks "was the agent allowed to do this?" — a fundamentally different and complementary question.
A coding agent authorized to read test files and write source files is blocked at the reservation step if it tries to modify tests. A support agent scoped to read customer records is blocked before it can issue a refund above its authority tier. A research agent limited to 10 API calls per run is stopped at call 11.
These aren't heuristics. They're deterministic rules evaluated before execution. The agent's non-deterministic reasoning produces a deterministic allow/deny decision at the enforcement layer. For concrete action authority patterns, see AI Agent Action Control: Hard Limits on Side Effects.
3. Integration testing via budget topology
The hardest testing problem — verifying that multi-step workflows produce correct end-to-end results — gets a structural proxy through hierarchical scope budgets.
When a workflow scope allocates budget across child agent scopes, the spending pattern can surface integration issues:
- Research agent consumes 15 reserve-commit cycles, analysis agent consumes 9 → possible handoff loss or workflow mismatch worth investigating (6 expected items unaccounted for)
- Three parallel agents each reserve successfully but total exceeds parent scope → atomic reservation prevents concurrent overspend
- Fan-out workflow spawns 50 sub-agents instead of expected 5 → parent scope budget caps total exposure regardless of spawn count
None of these require semantic evaluation. The structural economics of the run surface problems that output-focused testing misses.
Where This Fits in Your Testing Stack
Enforcement isn't a replacement for evals. It's the layer that covers the traffic your evals can't afford to reach. For the full comparison of how enforcement, guardrails, and observability compose, see Runtime Authority vs. Guardrails vs. Observability.
The practical stack:
- Enforcement catches structural failures on all traffic: loops, scope violations, budget overruns, anomalous cost patterns. Cost per check is estimated at a fraction of a cent.
- Evals catch semantic failures on sampled traffic: wrong answers, poor tone, factual errors. Cost: high but necessary for quality signal.
- Observability provides forensic data for debugging. Cost: moderate. Essential for tuning both other layers.
Most teams have observability. About half have evals. Far fewer teams have enforcement than either observability or evals. That's why tests pass and production fails.
What To Do This Week
Start with shadow mode — Run enforcement alongside your existing agents without blocking anything. Collect per-step cost data. Shadow mode often surfaces structural failures quickly that evals never caught — and it covers all traffic at a fraction of eval cost.
Set per-run budgets on your riskiest workflow — Pick the agent where a failure has real consequences. Add a budget ceiling. Review which runs would have been blocked. If the answer is "the ones that looped" — you've found the gap.
Add action scoping to one sensitive operation — Any agent that writes data, sends communications, or modifies infrastructure should require explicit action authority. The enforcement gates the action; the eval validates the output. Both layers, working together.
Run the 60-second demo — See enforcement stop a runaway agent in real time. Then compare that to your eval pipeline catching the same failure — hours later, on a sampled subset, if at all.
Further Reading
- AI Agent Silent Failures: Why 200 OK Is the Most Dangerous Response — The failure modes themselves, with cost-anomaly detection patterns
- Runtime Authority vs. Guardrails vs. Observability — How enforcement, guardrails, and observability compose
- What Is Runtime Authority for AI Agents? — The foundational concept behind pre-execution enforcement
- 5 AI Agent Failures Budget Controls Would Prevent — Concrete incident scenarios with cost math
- AI Agent Action Control: Hard Limits on Side Effects — Scope-based behavioral boundaries in practice