Retry Storms & Idempotency in Agent Budget Systems
An AI agent calls an LLM. The call times out after 30 seconds. The framework retries. The retry also times out. The SDK inside the framework retries too. The orchestrator retries the whole step. Three layers of retry logic compound into 27 LLM calls where you expected 1.
This is a retry storm. It's not unique to AI agents — distributed systems have fought this for decades. But in agent systems, retries amplify three things at once:
- Spend. Every retry that makes it to the provider is a billable token chain. Retries that don't get deduplicated at the budget layer count against your budget twice.
- Risk (side effects). An agent tool call that retries can send the same email twice, write the same record twice, fire the same deploy twice. The retry succeeded — and now there are two of whatever it did.
- Blast radius. One user request can fan out into dozens of retry-amplified tool calls, each with its own duplicate-side-effect potential.
Cost is the visible failure. Risk is the one that hurts more. A retried LLM call costs $0.04 extra. A retried send_email tool call lands in a customer's inbox twice. A retried database mutation runs the update twice. A retried payment processor call charges the customer twice.
This post is about the primitive that prevents retry storms from becoming either kind of incident: idempotency keys.
Why Retries Amplify in Agent Systems
Retries amplify geometrically whenever retry loops are nested. Google's SRE Book warns about this explicitly in its cascading-failures chapter, and AWS uses a concrete illustration: a chain of five layers each retrying three times can drive 243x load on the bottom service. Google caps per-client retry ratios at 10% precisely because amplification beyond that cascades into outages.
Agent systems have more layers than typical request paths:
- SDK retries (OpenAI, Anthropic client libraries) — transparent, 1 retry counted as 1 call by the provider
- Tool-level retries (custom retry wrappers around individual tool calls)
- Step-level retries (LangGraph/CrewAI node retries)
- Orchestration retries (durable execution engines replaying whole runs)
- User-facing retries (users clicking "regenerate" after slow responses)
Stack these: 3 SDK retries × 3 tool retries × 3 step retries = 27 calls per user action. A tool call priced at $0.15 becomes $4.05 before anyone notices. Multiply across 100 concurrent users and you've burned $400 on what should have been $15 worth of work.
The amplification isn't theoretical. The custom rate limiter post-mortem from scalerX describes the TOCTOU race that made concurrent agents overspend by 10-30% of their caps. The competitive landscape post covers why proxy rate limits can't protect against this.
The Two Failures Retries Cause
Retries create two distinct failure modes:
Duplicate intent. The agent retries the same logical action because it couldn't confirm the first attempt succeeded. If the budget system counts this as two separate requests, you've double-charged a single operation. If the tool side doesn't dedupe, you've fired the same action twice — two emails, two writes, two payments.
Duplicate completion. The original request actually succeeded, but the response got lost (network partition, timeout, crashed client). The retry lands, the server processes it, both paths charge the budget and both paths fire the side effect. Same action, two debits, two side effects.
The cost consequence is obvious — budget depletes faster than work warranted. The risk consequence is less obvious and often worse. A duplicate payment processor call isn't just a budget issue; it's a duplicate charge against a customer. A duplicate database mutation isn't just a wasted API call; it's two UPDATE statements running on the same row. A duplicate send_email tool call lands twice in a user's inbox and erodes trust.
Both failure modes require the same solution: a unique key that identifies the logical operation, so every system in the chain — budget layer, tool layer, downstream services — can recognize duplicates and return the original result instead of executing again.
That key is the idempotency key.
Idempotency Keys as Budget and Action Primitives
Payment processors solved this decades ago. Stripe's idempotency documentation is the canonical reference:
"We suggest using V4 UUIDs, or another random string with enough entropy to avoid collisions. Idempotency keys are up to 255 characters long."
"The idempotency layer compares incoming parameters to those of the original request and errors if they're not the same to prevent accidental misuse."
Stripe retains keys for 24 hours. Shopify does the same. The pattern is industry standard: generate a unique key per logical operation, send it with every retry of that operation, and the server deduplicates against the key.
Cycles applies the same pattern to budget reservations. Every write operation (reserve, commit, release, extend) requires an idempotency_key field. Retry a request with the same key, you get the original response — no second reservation, no duplicate commit, no double-charged budget.
The key insight: idempotency isn't a transport-layer optimization. It's the foundational primitive that makes both budget accounting and action execution safe under failure. Without it:
- Concurrent agents racing for budget can both claim the same capacity
- Retries after timeouts double-count spend and double-execute side effects
- Crashed clients leave orphaned reservations that get duplicated on restart
- A retried email tool call sends the message twice
- A retried database mutation runs the update twice
- Recovery from network partitions becomes guesswork
With it, the budget layer has one source of truth per logical operation. If the same key is propagated and honored end-to-end, the tool layer and downstream services can also deduplicate retries instead of executing twice. The agent can retry as aggressively as it needs to — and every layer that participates stays consistent.
Idempotency Key Structure Patterns
A random UUID per request works, but it loses information. Structured idempotency keys let you encode operational signal directly into the key:
| Pattern | Example | Enables |
|---|---|---|
| Random UUID | 550e8400-e29b-41d4-a716-446655440000 | Simplicity, guaranteed uniqueness |
| Run-scoped | run-abc123-step-5 | Tie retries to a run for observability |
| Attempt-numbered | run-abc123-step-5-attempt-3 | Detect retry loops from key patterns |
| Content-hashed | sha256(run_id + tool_name + args)-attempt-1 | Catch identical tool calls across runs |
The attempt-numbered pattern has a specific operational value: querying for keys with high attempt numbers tells you which workflows are retry-storming. If your logs show attempt-8 or attempt-12 suffixes, you have a loop — independent of whatever your retry framework reports.
Design tip: Generate the idempotency key before making the reserve call, and persist it alongside your work unit. If the reserve call fails or times out, you still have the key and can retry the reserve with it. Without persistence, you can't safely retry — you don't know whether the original reserve landed.
TTL Coordination: Where Idempotency Breaks
The subtle failure mode is TTL misalignment.
Three timers interact (coordinate these for your deployment):
- Reservation TTL — how long Cycles holds a reservation before auto-expiring
- Idempotency key retention window — how long your server retains the key for replay (Stripe and Shopify use 24 hours as an industry-standard reference)
- Client retry window — how long your code will keep retrying
If these don't align, you get stale retries. Example: reservation TTL is 5 minutes, idempotency key retention is 24 hours, client retries for 30 minutes. At minute 10, the reservation has expired. The client retries with the same idempotency key. The server returns the original response, which references a reservation that no longer exists. Your client thinks it has a valid reservation; the budget system already released it.
The defensive pattern:
client_retry_window ≤ reservation_ttl < idempotency_retentionYour client should stop retrying before the reservation could expire. The reservation should outlive the retry window so retries still find it live. The idempotency key retention should exceed both, so even late recovery queries can find the original operation.
When these windows fall out of alignment, you get the worst of both worlds: duplicate work and lost reservations.
Detection: Spotting Retry Storms in Production
Idempotency keys aren't just a dedup mechanism — they're a monitoring signal. The key itself tells you what's happening:
High attempt counts. If your key format encodes attempt number, alert when attempt counts exceed expected retry depth. run-*-attempt-10 appearing in logs means something is wrong.
Key reuse rate. Track the ratio of unique keys to total reservation requests. A healthy system sees 1:1 — every reservation has a unique key. A ratio climbing toward 2:1 or 3:1 indicates clients retrying rapidly. Climbing beyond 5:1 means a loop.
Duplicate match rate on reserves. Track duplicate-match rate on reserves in your implementation. A spike means retries are being deduplicated at the budget layer, which is often the clearest signal that a retry storm is underway.
Time between retries. If a key appears with retries spaced at exponential backoff intervals (1s, 2s, 4s, 8s), that's healthy retry behavior. If it appears hammering at sub-second intervals, that's a tight loop with no backoff.
These signals catch retry storms from the budget-layer perspective, independent of what your retry frameworks report.
What Cycles Does Automatically
The reserve-commit lifecycle is designed to make idempotency enforcement cheap:
- Unique reservation IDs. Each successful reserve returns a
reservation_id. Commits and releases reference this ID — retries with the same idempotency key return the original reservation_id, so your client knows it's looking at the same reservation. - Atomic operations. The Lua-scripted reserve-commit check-and-decrement is atomic, preventing the TOCTOU race where two concurrent retries both see "budget has $5 remaining" and both proceed.
- Idempotency-key lookup.
GET /v1/reservations?idempotency_key=...lets crashed clients recover by looking up the original reservation ID, so they can commit or release without creating a new reservation. - Mismatched parameter detection. If a retry arrives with the same idempotency key but different parameters (different scope, different amount), Cycles returns an
IDEMPOTENCY_MISMATCHerror rather than silently using the original.
None of this eliminates retry storms — client code still has to retry responsibly. But it ensures retries that do happen don't compound into duplicate budget charges.
Responsible Retry Defaults: Backoff, Jitter, Circuit Breakers
Idempotency keys prevent double-counting. They don't prevent the storm from happening. For that, your client code needs responsible retry defaults:
Exponential backoff with jitter. AWS has documented this extensively — plain exponential backoff synchronizes retries across clients, causing thundering-herd effects. Jitter spreads retries across time.
Bounded retry counts. Hard cap retries per layer. Google SRE guidance caps at 3 attempts per layer with a 10% retry budget. If retries aren't succeeding at that level, the system is too degraded for retries to help.
Circuit breakers. After N consecutive failures for a given operation class, open the circuit and stop retrying for a cooldown period. Prevents sustained retry storms.
Retry-After respect. When a service returns Retry-After, honor it. Ignoring rate limit guidance accelerates the storm.
These aren't Cycles-specific — they're universal distributed systems hygiene. But combined with idempotency keys at the budget layer, they make retry storms survivable rather than catastrophic.
The Take
Enforcement without idempotency is hopeful accounting. Agent side effects without idempotency are unbounded blast radius. Retries and concurrency are real failure modes that hit every production agent system — timeouts happen, networks partition, clients crash, frameworks double-retry. The question isn't whether your agents will retry; it's whether every layer handles it correctly when they do.
Idempotency keys are the primitive that makes the answer "yes." They tie logical operations to unique keys so the budget layer charges once, the tool layer fires once, and downstream services see one canonical request — regardless of how many retries hit the wire. Combined with atomic reserve-commit and responsible client retry defaults, they turn retry storms from dual-impact incidents (budget and blast radius) into operational noise.
The rule: every write that affects budget or fires a side effect needs an idempotency key. No exceptions.