We Built a Custom Agent Rate Limiter. Here's Why We Stopped.
At scalerX, we built a custom rate limiter to track spend across LLMs, image generation, video generation, and a growing list of paid third-party APIs — stock charts, market data, web search. We built three versions of it over several months. Each version fixed the problem the previous one had. Each version revealed a new problem we hadn't anticipated.
By the time we were planning v4, the realization was hard to avoid: we weren't building a rate limiter anymore. We were building a runtime authority platform, badly.
This is a post-mortem of why that happened, and why I think most teams building custom multi-provider rate limiters are on the same trajectory.
Why Build It Ourselves
The reasoning at the time was unremarkable:
- Our agents called many paid providers — OpenAI, Stable Diffusion, Kling, Google, stock data APIs, web search. No off-the-shelf proxy covered all of them.
- Existing LLM proxies (LiteLLM, Helicone) solved the LLM layer, not the tool-call layer. They wouldn't see the $2 stock chart API call or the $0.30 web search.
- We needed per-user caps — a single customer's runaway agent shouldn't consume another customer's budget.
- "It's just Redis and some counters. How hard could it be?"
We started with a straightforward design, hit a wall, rebuilt, hit another wall, rebuilt again. Here's the story of each wall.
v1: The Naive Counter — TOCTOU Race Conditions
Architecture: Redis hash per user, incremented on each API call. Check before the call, increment after.
# simplified
def check_budget(user_id, cost):
current = redis.get(f"spend:{user_id}")
if current + cost > user_budget:
raise BudgetExceeded()
return True
def record_spend(user_id, cost):
redis.incrby(f"spend:{user_id}", cost)This worked in development. It worked in the first few weeks of production. Then we started seeing users exceed their monthly caps by 10-30%.
The wall: time-of-check-to-time-of-use (TOCTOU) race conditions.
When a user had 10 concurrent agents running, all 10 could read "budget has $5 remaining", all 10 could decide to proceed, and all 10 could execute — spending $50 against a $5 budget. The check and the increment were two separate Redis operations, and the check was non-binding.
We weren't the first team to hit this. Figma publicly documented the same failure: "In a distributed environment, the 'read-and-then-write' behavior creates a race condition, which means the rate limiter can at times be too lenient. If only a single token remains and two servers' Redis operations interleave, both requests would be let through." Redis's own documentation warns about this pattern and recommends Lua scripts or MULTI/EXEC for atomicity.
We fixed it the way everyone eventually does: Lua script for atomic check-and-decrement.
v2: Atomic Check with Lua — Multi-Provider Coordination Breaks
Architecture: v1 plus a Lua script that reads the current balance, checks against the cap, and decrements in a single atomic operation. GitHub's engineering team describes the same progression: they moved their storage logic into Lua "to guarantee atomicity of operations."
v2 fixed the race condition. Concurrent agents could no longer double-spend. We thought we'd solved it.
Then the second wall hit.
The wall: multi-provider coordination.
A single user request often fanned out across multiple providers:
- GPT-4o to generate the plan
- DALL-E or Stable Diffusion to generate an image
- Google/Kling to generate a short video
- A stock data API for context
- A web search API for grounding
Each provider had different pricing models, different rate limits, different response formats, and different failure modes. Our rate limiter checked against a single aggregate dollar budget — but each provider call needed its own cost estimate before the call, because you can't undo a video generation after it's been paid for.
We ended up with a fragmented system:
- Custom integration code for every provider
- Manual cost-estimation logic that diverged from real pricing as providers changed their APIs
- No way to enforce provider-specific limits ("this user can spend $50/month on video but unlimited on search")
- Cost estimates drifted: what we predicted vs. what the provider actually billed diverged over time
Worse, OpenAI's own rate limits are organization- and project-scoped, vary by model, and some model families share limits. That meant our single-budget design couldn't even represent OpenAI's constraints cleanly, let alone the five other providers we were integrating.
Every provider addition was a week of work. Every provider pricing change was a fire drill. We called this the whack-a-mole phase.
v3: The Per-Provider, Per-User Hierarchy
Architecture: v2 plus per-provider budgets, per-user scopes, and a coordination layer that enforced both a provider-specific cap and an aggregate user cap.
We were now maintaining:
- 7+ provider-specific rate limit integrations
- A per-user budget hierarchy
- Atomic decrement Lua scripts per scope
- Cost estimate tables updated manually per provider
- An internal dashboard to track it all
v3 held for a while. Then the third wall showed up — and this one changed how we thought about the whole system.
The wall: risk wasn't cost.
An agent got stuck in a loop and sent 200 emails to customers. Token cost: $1.40. Business damage: much larger.
Our rate limiter never fired. Because it measured dollars, and the emails were cheap. The harm was in the action, not the spend.
We started sketching a v4 that would:
- Track a separate "action budget" alongside the dollar budget
- Score different tool calls by risk (send_email = high, search = low)
- Return something richer than ALLOW/DENY — maybe a "proceed but with these restrictions" response
- Support per-run, per-user, per-tenant scopes atomically
- Emit events so downstream systems could react to budget exhaustion
- Handle delegation — when agent A spawns agent B, B shouldn't inherit A's full budget
That's when we stopped.
The Realization: We Were Building Runtime Authority
Each of those v4 requirements has a name in infrastructure engineering:
| What we were building | What it's actually called |
|---|---|
| Atomic check-and-decrement per scope | Reserve-commit lifecycle |
| ALLOW / restricted-ALLOW / DENY responses | Three-way decision model |
| Risk-scored tool calls with per-tool limits | Action authority with RISK_POINTS |
| Per-run, per-user, per-tenant hierarchies | Hierarchical scopes with attenuation |
| Budget sub-allocation for sub-agents | Authority attenuation |
| Events on budget exhaustion | Webhook event emission on budget state transitions |
Put together, those aren't features of a rate limiter. They're the core primitives of a runtime authority platform — the pre-execution enforcement layer that sits between an agent's decision to act and the action itself.
We were building infrastructure we didn't want to own. Every week we added to it was a week not building product. And the hard parts — the concurrency correctness, the multi-provider coordination, the risk scoring — are problems other people were already solving as general infrastructure.
That's why I started building Cycles. Not because rate limiters are bad. They're fine for what they are. But if what you actually need is pre-execution enforcement across providers, across tenants, across risk tiers, with atomic reservations and delegation-aware scoping — you're not building a rate limiter. You're building a runtime authority platform. And there's no advantage to each team rebuilding it in isolation.
The Build-vs-Buy Pattern for AI Agent Rate Limiters
Looking back, every wall we hit had the same shape:
A rate limiter becomes a runtime authority platform the moment you take it seriously in production.
- Take it seriously enough to survive concurrency → you need atomic operations
- Take it seriously enough to handle multiple providers → you need per-provider scopes with hierarchical aggregation
- Take it seriously enough to handle risk, not just cost → you need action-level authority, not just spend counters
- Take it seriously enough to handle multi-tenant isolation → you need scoped budgets with per-tenant limits
- Take it seriously enough for multi-agent systems → you need attenuation, not trust propagation
Each individual requirement is implementable. The combination is a general infrastructure layer that most product teams don't want to own and shouldn't need to.
When It's Still Fine to Build Your Own
Custom rate limiters are the right choice when:
- You have a single provider. A wrapper around one API with per-user token budgets is a reasonable weekend project.
- You don't need atomic concurrency. Prototypes, internal tools, low-traffic agents — race conditions won't bite you for a while.
- You don't enforce action-level risk. If all failures are purely financial and bounded, a spend tracker works.
- You don't have multi-tenancy. One customer, one budget — the scoping is trivial.
The moment any one of those changes — a second provider, concurrent users, a damaging tool call, a second tenant — the complexity curve bends upward. Not linearly. Categorically.
The Take
If you're building a custom rate limiter for AI agents right now, you're probably in v1 or v2 of the scalerX arc. That's fine. That's where everyone starts.
Just be honest about what you're trending toward. The v3→v4 transition we almost made is where the build vs. buy calculus changes — because v4 isn't a rate limiter anymore. It's a general authority platform that happens to do rate limiting as one of its features. And general authority platforms are worth building once, not once per team.