The 429 Classification Gap: Why Your Retry Logic Is Lying to You
The problem is not 429
A 429 is not a single condition.
It is a compressed signal carrying different underlying realities:
- short-term rate pressure
- concurrency saturation
- quota-window exhaustion
But most systems flatten all of them into one decision:
if status == 429:
retry()That is where the run stops understanding what state it is actually in.
Three different realities hiding behind "429"
In real agent stacks, 429s usually land in three buckets.
1. WAIT — transient pressure
Retry-After: 5–30s- the system is temporarily saturated
- waiting is likely to work
- retrying later is reasonable
Correct behavior:
delay → retry
2. CAP — concurrency overshoot
no Retry-After, or repeated short 429s under load- too many parallel requests are already in flight
- the client is amplifying its own pressure
- immediate retries make the condition worse
Correct behavior:
reduce concurrency or reschedule → then retry
3. STOP — quota or long-window exhaustion
Retry-After: minutes to hours
or an explicit quota-window message- this will not resolve inside the current execution window
- retrying is not recovery
- retrying only burns time, budget, and throughput
Correct behavior:
do not retry → surface immediately or route elsewhere
What actually happens today
Instead of:
429 → WAIT / CAP / STOPmost stacks do this:
429 → retryAnd once that happens, everything downstream inherits the wrong decision.
The failure cascade
This is not theoretical. The retry storm incident pattern — and the $1,800 CRM outage retry storm — are exactly this failure mode.
What this looks like in a real run
attempt 1 → 429 (retry-after: 10800)
attempt 2 → 429
attempt 3 → 429
...
attempt 97 → 429
attempt 98 → 429
attempt 99 → 429Once STOP is misclassified as retryable, the rest is predictable:
- retries consume remaining budget
- fallback may never trigger
- concurrency grows under pressure
- sessions look active but do not progress
This shows up as:
- retry loops
- cost spikes
- stuck agent runs
- long-tail latency explosions
Not because the system is missing retries.
Because it is retrying when it should have changed state.
Why enforcement cannot fix this by itself
Cycles is an enforcement layer.
It is built around a clean reserve/commit/release execution model:
- reserve
- commit
- release
That model works.
But it assumes the runtime already knows whether the next action should proceed, wait, or stop.
If an upstream layer sees a quota-window 429 and decides retry(), Cycles cannot repair that classification mistake. It can govern the next attempt correctly. It can reserve correctly. It can commit or release correctly.
But it is still governing the wrong next move.
That is the boundary.
Cycles is not a 429 classifier.
It is budget authority once the runtime has already decided what to do next.
The bug is not in enforcement.
The bug is before enforcement ever runs.
The missing primitive
What is missing is a classification step before retry, fallback, scheduling, or enforcement:
status + headers + provider context
↓
WAIT / CAP / STOP
↓
retry / scheduling / fallback
↓
enforcementThat is the right composition.
Mapped into a Cycles-style model:
- WAIT → hold or extend the in-flight reservation when appropriate
- CAP → reduce concurrency or reschedule before creating the next reservation
- STOP → release the reservation and fail, surface, or reroute
That is a clean division of responsibility.
Real example
A quota-window 429:
"You have exceeded your 5-hour usage quota"
Retry-After: 10800Treat that as transient and the run starts lying to itself:
- the same task gets attempted again
- fallback may never activate
- the session appears active
- no meaningful forward progress is happening
Correct behavior is simple:
- classify it as STOP
- release any live reservation tied to that path
- route elsewhere or fail clearly
The difference is one decision.
The effect is the entire run.
The deeper issue
This is not just about retries.
It is about semantic collapse.
Different failure modes get:
- flattened
- misclassified
- propagated as the same signal
Once that happens, every downstream layer starts doing the wrong thing in its own way:
- retries are wrong
- scheduling is wrong
- fallback is wrong
- enforcement may be locally correct, but it is acting on the wrong next action
- reporting is wrong because activity gets mistaken for progress
You do not just get bad retries.
You get systems that lose track of reality.
Why this keeps showing up
Because responsibility is fragmented:
- SDKs handle retries
- frameworks wrap providers
- orchestration layers schedule tasks
- enforcement layers govern budget and execution
No layer clearly owns classification.
So every layer assumes somebody else already normalized the signal.
Usually, nobody did.
What needs to happen
We do not need more retry loops.
We need a shared primitive:
Classify failure before acting on it
For 429s, that means:
- WAIT → delay
- CAP → reduce pressure
- STOP → terminate, surface, or reroute
Everything downstream gets better once that decision is honest:
- retry logic stops pretending all pressure is transient
- schedulers stop amplifying the problem
- fallback gets a chance to work
- budget authority governs the correct execution path instead of the wrong one
Where this fits
This should not live inside enforcement.
That is too late.
It also should not be buried differently inside every framework wrapper, because then the semantics fragment again.
It should exist as a small upstream layer that reads the real response:
- status
- headers
- provider error text
- execution context
and returns a normalized decision that the rest of the stack can trust.
Closing
Most systems don't fail because they lack retries.
They fail because they act on the wrong decision.
429 is not a retry signal.
It is a classification problem.
Until that is resolved, every layer above it is operating on a lie.
Further reading
- Retry Storms and Idempotency Failures — the dedicated incident pattern this post describes
- 5 AI Agent Failures Budget Controls Would Prevent — includes a $1,800 retry storm incident from a CRM outage
- The State of AI Agent Incidents — Category A4 catalogues retry-storm cost multiplication
- How Reserve-Commit Works in Cycles — the enforcement model referenced throughout this post
- Degradation Paths in Cycles — deny, downgrade, disable, or defer strategies for the STOP case
- What Is Runtime Authority for AI Agents? — the foundational concept this post extends
- Brenton Williams' retry and reliability repo: github.com/SirBrenton/pitstop-truth
Brenton Williams writes about retry semantics, agent reliability, and what happens when distributed systems lie to themselves. Find more of his work at github.com/SirBrenton.