Your First Week with Cycles Budget Guard for OpenClaw
You ran openclaw plugins install @runcycles/openclaw-budget-guard. You enabled it. You opened openclaw.json to fill in the config and stalled out — the typical examples set failClosed: true with carefully tuned toolBaseCosts and modelBaseCosts, and you don't have those numbers yet because you haven't run anything in production. Picking them blind is how teams end up rolling back enforcement on day one.
This post is the day-2 playbook. The earlier OpenClaw posts cover the why (Your OpenClaw Agent Has No Spending Limit), what graceful degradation looks like (the $5 budget walkthrough), and the plugin-author internals (Five Lessons). They all stop at the moment of install. This one picks up there: five steps, six days, simulated dry-run first and failClosed last, every config value derived from your own session data or provider telemetry rather than guessed.
The principle: don't tune what you haven't observed
The plugin's toolBaseCosts, modelBaseCosts, toolCallLimits, and lowBudgetThreshold aren't independent knobs. They're a curve fitted to your workload. The agent that reads PDFs all day has different toolCallLimits than the agent that drafts emails. The team running mostly Sonnet has a different lowBudgetThreshold than the team running mostly Opus. A config copied from someone else's blog post is a config copied from someone else's traffic.
The cure is to flip the order: run the plugin in dry-run with the event log on, then derive the numbers, then decide whether to turn on hard enforcement. This is the same pattern the shadow-to-enforcement decision tree lays out in general terms, applied here specifically to the OpenClaw plugin's surface.
Day 1: Dry-run with the event log on
Start with the smallest config that produces useful data:
```json
{
"plugins": {
"entries": {
"openclaw-budget-guard": {
"config": {
"tenant": "your-tenant",
"cyclesBaseUrl": "http://unused",
"cyclesApiKey": "unused",
"dryRun": true,
"dryRunBudget": 1000000000,
"enableEventLog": true,
"logLevel": "info",
"defaultModelName": "anthropic/claude-sonnet-4-20250514"
}
}
}
}
}
```

Three things matter here:

- dryRun: true with a large dryRunBudget. This is a simulation path that never needs the Cycles server (hence the unused URL and key above), not a no-op shadow mode. The plugin still classifies budget state, creates simulated reservations, applies fallback and limit logic, and can deny once the simulated budget is exhausted. The high budget is what keeps the observation run from shaping behavior: at this stage, you want to see what natural spend looks like, not what degradation looks like.
- enableEventLog: true. The session summary includes the reserve, commit, downgrade, and decision path. Without it, the summary tells you the totals but not the path that produced them.
- defaultModelName. Per Lesson 1, OpenClaw's before_model_resolve event doesn't include the model name. Set defaultModelName to whatever your agent actually uses, or every model call shows up unattributed.
Run normally for a day. Don't tune. Don't flip switches. Just collect.
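Collecting is easier if the summaries land in one file. The plugin can POST each SessionSummary to an analyticsWebhookUrl (it comes up again on day 2–3), and a minimal local collector is a few lines of Node. A sketch — the port, route, and output file are arbitrary choices for this sketch, not plugin defaults:

```typescript
// collect-summaries.ts — minimal local receiver for posted session summaries.
// Port, route, and output file are arbitrary choices, not plugin defaults.
import { createServer } from "node:http";
import { appendFileSync } from "node:fs";

const OUT_FILE = "session-summaries.jsonl";

createServer((req, res) => {
  let body = "";
  req.on("data", (chunk) => (body += chunk));
  req.on("end", () => {
    try {
      // One SessionSummary per line; the day 2–3 scripts read this file back.
      appendFileSync(OUT_FILE, JSON.stringify(JSON.parse(body)) + "\n");
      res.writeHead(204);
      res.end();
    } catch {
      res.writeHead(400);
      res.end("expected a JSON body");
    }
  });
}).listen(4091, () => console.log("collecting summaries on :4091"));
```

If you go this route, set analyticsWebhookUrl to http://localhost:4091/summaries alongside the config above; every session then appends one line to session-summaries.jsonl.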
Day 2–3: Read the session summary and derive cost estimates
At agent_end, the plugin builds a SessionSummary, attaches the full object to ctx.metadata["openclaw-budget-guard"], and can POST it to analyticsWebhookUrl if you configure one. The ordinary log line is compact (remaining, spent, reservation count); the full JSON comes from metadata, your analytics webhook, or a wrapper that reads the metadata at session end. The shape from a representative session looks like:
```json
{
"remaining": 850000000,
"spent": 150000000,
"costBreakdown": {
"model:anthropic/claude-sonnet-4-20250514": { "count": 22, "totalCost": 66000000 },
"model:anthropic/claude-opus-4-20250514": { "count": 4, "totalCost": 60000000 },
"tool:web_search": { "count": 12, "totalCost": 12000000 },
"tool:code_execution": { "count": 3, "totalCost": 6000000 },
"tool:read_file": { "count": 18, "totalCost": 1800000 }
},
"unconfiguredTools": [
{ "name": "read_file", "callCount": 18, "estimatedTotalCost": 1800000 },
{ "name": "format_markdown", "callCount": 9, "estimatedTotalCost": 900000 }
]
}
```

Two things to do with this:
Promote unconfiguredTools into toolBaseCosts. Every entry in that list is a tool that fell back to the plugin's default estimate (100000 USD_MICROCENTS, per the integration reference). The default is rarely right. For each entry, decide a more accurate estimate:
| Tool category | Reasonable starting estimate (USD_MICROCENTS) |
|---|---|
| Local file read / format / math | 10,000 – 50,000 |
| In-process compute, no I/O | 50,000 – 200,000 |
| External API (search, scrape, single call) | 500,000 – 2,000,000 |
| Code execution sandbox | 500,000 – 1,000,000 baseline; 1,000,000 – 10,000,000 for paid or long-running sandboxes |
| LLM-as-tool (sub-agent, summarizer) | priced like a model call |
The integration guide gives the baseline band: "External API tools (web search, code execution) typically cost 500K-1M. Lightweight tools (text formatting, math) cost 10K-50K." Start there unless your sandbox provider, timeout, or container lifecycle makes code execution materially more expensive. The plugin's session summaries will tell you which tools are being used and how often; provider telemetry or a custom estimator tells you whether the unit price is too low.
Confirm or update modelBaseCosts. The plugin reserves a fixed amount per model call regardless of token count, and the $5 walkthrough flagged that this produces ±20% variance. That's fine for budget enforcement — you're approximating, not billing — but the estimates need to be in the right ballpark relative to each other or downgrade_model won't pick the right fallback. In the JSON-configured OpenClaw path, costBreakdown.totalCost / count usually reflects the configured estimate that was committed, not independent provider billing. Use provider token/billing telemetry, an LLM proxy, or a programmatic modelCostEstimator when you need measured per-call cost. A rough Anthropic-pricing-anchored ratio:
| Model | Starting estimate (USD_MICROCENTS) |
|---|---|
| Claude Opus 4 | 15,000,000 |
| Claude Sonnet 4 | 3,000,000 |
| Claude Haiku 4.5 | 1,000,000 |
These are per-call averages, not per-token. Adjust upward if your prompts run long. After a few sessions, use the summary's model count values to understand call mix, then compare against external/provider cost data. Only treat totalCost / count as an observed average if you have wired in a real estimator; otherwise it is just the estimate you configured being charged back through the summary.
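With a few sessions collected, deriving these numbers doesn't have to be manual. A sketch that aggregates the collected summaries — it assumes the JSONL file from the day-1 collector and the SessionSummary shape shown above, and the totals it echoes for configured entries are your own estimates charged back, not measured spend, unless a real estimator is wired in:

```typescript
// derive-costs.ts — aggregate collected SessionSummary objects.
import { readFileSync } from "node:fs";

type Breakdown = Record<string, { count: number; totalCost: number }>;
type Summary = {
  costBreakdown: Breakdown;
  unconfiguredTools?: { name: string; callCount: number }[];
};

const summaries: Summary[] = readFileSync("session-summaries.jsonl", "utf8")
  .split("\n")
  .filter(Boolean)
  .map((line) => JSON.parse(line));

const counts = new Map<string, number>();       // calls per model/tool key
const unconfigured = new Map<string, number>(); // calls that fell back to the default estimate

for (const s of summaries) {
  for (const [key, { count }] of Object.entries(s.costBreakdown ?? {})) {
    counts.set(key, (counts.get(key) ?? 0) + count);
  }
  for (const t of s.unconfiguredTools ?? []) {
    unconfigured.set(t.name, (unconfigured.get(t.name) ?? 0) + t.callCount);
  }
}

console.log("call mix:", Object.fromEntries(counts));
console.log("still unconfigured (add to toolBaseCosts):", Object.fromEntries(unconfigured));
```

The output is a starting point for the fitted config below; the unit prices still need a sanity check against provider telemetry.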
A concrete fitted config after day 3 looks like:
```json
{
"modelBaseCosts": {
"anthropic/claude-opus-4-20250514": 15000000,
"anthropic/claude-sonnet-4-20250514": 3000000,
"anthropic/claude-haiku-4-5-20251001": 1000000
},
"modelFallbacks": {
"anthropic/claude-opus-4-20250514": ["anthropic/claude-sonnet-4-20250514", "anthropic/claude-haiku-4-5-20251001"]
},
"toolBaseCosts": {
"web_search": 1000000,
"code_execution": 5000000,
"read_file": 50000,
"format_markdown": 10000
}
}
```

Don't enforce yet. Re-run with this config in dry-run for another day and watch unconfiguredTools shrink toward empty.
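One way to watch that, assuming the same collected JSONL: read the newest summary and complain if anything is still riding the default estimate.

```typescript
// check-coverage.ts — after the day-3 config, unconfiguredTools should trend to empty.
import { readFileSync } from "node:fs";

const lines = readFileSync("session-summaries.jsonl", "utf8").split("\n").filter(Boolean);
const latest = JSON.parse(lines[lines.length - 1]);
const leftover: { name: string; callCount: number }[] = latest.unconfiguredTools ?? [];

if (leftover.length === 0) {
  console.log("coverage OK: every observed tool has a configured cost");
} else {
  console.log(
    "still unconfigured:",
    leftover.map((t) => `${t.name} (${t.callCount} calls)`).join(", ")
  );
  process.exitCode = 1;
}
```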
Day 4: Set toolCallLimits from observed call counts
toolBaseCosts controls spend. toolCallLimits controls side-effects. They're independent knobs and you need both — an agent can exhaust budget without hitting call limits, or hit call limits while well under budget. The integration guide's tip on this is worth reading in place; the short version is that send_email: 10 blocks the eleventh email regardless of how cheap email is.
Pull invocation counts from the session summaries you've collected. For each consequential tool — anything that writes, sends, deploys, or calls a paid third-party API — pick a limit at roughly the p95 of observed call counts, not the mean. The mean represents normal behavior; the p95 represents the upper bound of what your healthy sessions actually do. Anything above p95 is more likely a tool loop than legitimate work.
| Tool kind | Suggested limit shape |
|---|---|
| send_email, post_message, notify | Tight — single-digit caps; over-sending is almost always a bug |
| deploy, create_resource, delete_* | Tight — caps at or near 1; rare-by-design |
| web_search, read_url | Generous — 20–50; agents legitimately search a lot, but a runaway hits the cap before the budget |
| read_file, format_*, in-process tools | Usually no limit needed; budget catches loops indirectly |
The pattern is: tight on side-effects, generous on reads, none on cheap utilities. The first two categories are where blast radius lives.
A worked example from the session above (web_search: 12 calls, code_execution: 3 calls) might land on:
```json
{
"toolCallLimits": {
"web_search": 25,
"code_execution": 10,
"send_email": 5,
"deploy": 2
}
}
```

The web_search cap is roughly 2× the observed call count; code_execution is ~3×; the send_email and deploy limits are policy, not data, because those tools didn't show up in your dry-run traffic. As more sessions accumulate, switch to the actual p95 across sessions instead of a single-session multiplier.
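That computation also falls out of the collected summaries. A sketch, assuming the JSONL file from the day-1 collector and the tool:-prefixed keys shown in costBreakdown; the 1.2× headroom on top of the p95 is a judgment call, not plugin behavior:

```typescript
// derive-limits.ts — p95 of per-session call counts, per tool, from collected summaries.
import { readFileSync } from "node:fs";

const summaries = readFileSync("session-summaries.jsonl", "utf8")
  .split("\n")
  .filter(Boolean)
  .map((l) => JSON.parse(l));

// Per-session call counts keyed by tool name ("tool:web_search" -> "web_search").
const perTool = new Map<string, number[]>();
for (const s of summaries) {
  for (const [key, entry] of Object.entries<{ count: number }>(s.costBreakdown ?? {})) {
    if (!key.startsWith("tool:")) continue;
    const name = key.slice("tool:".length);
    perTool.set(name, [...(perTool.get(name) ?? []), entry.count]);
  }
}

// p95 over the per-session counts for one tool.
const p95 = (xs: number[]) => {
  const sorted = [...xs].sort((a, b) => a - b);
  return sorted[Math.min(sorted.length - 1, Math.ceil(0.95 * sorted.length) - 1)];
};

for (const [name, countsPerSession] of perTool) {
  const observed = p95(countsPerSession);
  // Suggested cap: the p95 of healthy sessions plus a little headroom (1.2x is a judgment call).
  console.log(`${name}: p95 ${observed} -> suggested toolCallLimits ~${Math.ceil(observed * 1.2)}`);
}
```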
Day 5: Pick lowBudgetThreshold from your spend curve
lowBudgetThreshold is the inflection point where the plugin switches from "pass everything through" to "apply degradation strategies" (reference). The default is 10,000,000 USD_MICROCENTS ($0.10). For most production workloads the default is too low — by the time you're $0.10 from the wall, there's no runway left for downgrade_model to do meaningful work.
A useful heuristic: set lowBudgetThreshold to roughly the cost of the most expensive 5–10 calls you'd want to gracefully complete under degradation. If your typical Opus call is $0.15, ten of them is $1.50 — that's your threshold. The agent crosses into low-budget mode early enough to actually adapt.
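The arithmetic is short, but the USD_MICROCENTS conversion is where it usually slips, so it is worth writing out once. A sketch using the Opus estimate from the day-3 config; the ten-call headroom is the judgment call from the heuristic, not a plugin default:

```typescript
// USD_MICROCENTS: 1 dollar = 100 cents = 100,000,000 micro-cents.
const MICROCENTS_PER_USD = 100 * 1_000_000;

const opusPerCall = 15_000_000; // ~$0.15, from modelBaseCosts above
const gracefulCalls = 10;       // how many expensive calls should still fit under degradation

const lowBudgetThreshold = opusPerCall * gracefulCalls;
console.log(lowBudgetThreshold);                      // 150000000 micro-cents
console.log(lowBudgetThreshold / MICROCENTS_PER_USD); // 1.5 dollars
```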
Read this off your event log. With enableEventLog: true, every reservation logs the running balance:
```
Model reserved: anthropic/claude-sonnet-4-20250514 (estimate=3000000, remaining=147000000)
```

Look at the remaining values across a representative session. The threshold you want is somewhere between "agent is two thirds done" and "agent has one Opus call left". That's where degradation has time to matter.
Then pick the strategies. The conservative starting set is:
```json
{
"lowBudgetStrategies": ["downgrade_model", "reduce_max_tokens", "disable_expensive_tools"],
"maxTokensWhenLow": 1024,
"expensiveToolThreshold": 5000000
}
```

downgrade_model requires modelFallbacks. disable_expensive_tools requires that toolBaseCosts is populated for the tools you might want to disable — the plugin compares against expensiveToolThreshold, so an unconfigured tool falling back to the default estimate won't be disabled even if it's actually expensive. This is one more reason day 2–3 has to come before day 5.
Day 6: Cutover decision — failClosed: true
Now apply the cutover decision tree to your collected dry-run data. The OpenClaw-specific reading of its four signal categories:
| Category | OpenClaw-specific check | Green when |
|---|---|---|
| Cost calibration | Compare configured toolBaseCosts and modelBaseCosts against provider telemetry, billing logs, or estimator output. Use costBreakdown.totalCost / count only when a real estimator is feeding actuals. | Per-call observations within ~20% of estimates for a representative sample; extend to a steady week before high-risk workflows |
| Policy coverage | unconfiguredTools list across recent session summaries | List is empty (or only contains tools you've explicitly chosen not to budget) |
| Operational readiness | Has anyone on the team run a denial rehearsal and seen the relevant BudgetExhaustedError, ToolBudgetDeniedError, or tool block in logs? | Yes — at least one rehearsed denial |
| Reversion readiness | Can you flip failClosed: false (or dryRun: true) without a deploy? | Yes — config-toggle path tested |
If those are all green for a low-risk canary workflow, flip to:
```json
{
"dryRun": false,
"failClosed": true,
"cyclesBaseUrl": "${CYCLES_BASE_URL}",
"cyclesApiKey": "${CYCLES_API_KEY}"
}
```

Note the env-var interpolation. Per Lesson 4, the plugin no longer reads process.env directly — OpenClaw resolves ${...} before passing config in.
The first 24 hours after cutover, treat any of these as a rollback signal:
- A sustained denial rate noticeably higher than what dry-run predicted. The dry-run data is your baseline; significant deviation means an estimate is wrong, not the policy.
- Per-call observed cost from provider telemetry or estimator output on any specific tool more than 2× its toolBaseCosts estimate (a check like the one sketched after this list). That's estimate drift — fix the number, don't tighten the threshold.
- Any BudgetExhaustedError on a workflow without a graceful degradation path. Add the path before re-enforcing on that workflow.
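The 2× check in the second bullet is mechanical once per-call observations exist. A sketch with the configured estimates from day 3; the observed values are placeholders standing in for your own provider telemetry or estimator output:

```typescript
// drift-check.ts — flag tools whose observed per-call cost has drifted past 2x the configured estimate.
const toolBaseCosts: Record<string, number> = {
  web_search: 1_000_000,
  code_execution: 5_000_000,
  read_file: 50_000,
};

// Observed per-call cost in USD_MICROCENTS, from provider telemetry or an estimator.
// These numbers are placeholders for illustration, not plugin output.
const observedPerCall: Record<string, number> = {
  web_search: 1_200_000,
  code_execution: 11_500_000,
  read_file: 40_000,
};

for (const [tool, estimate] of Object.entries(toolBaseCosts)) {
  const observed = observedPerCall[tool];
  if (observed === undefined) continue;
  if (observed > 2 * estimate) {
    console.log(`${tool}: observed ${observed} vs estimate ${estimate} -> fix toolBaseCosts, then re-evaluate`);
  }
}
```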
The softest rollback is failClosed: false with dryRun: false: the plugin stays connected to real Cycles budget data while budget-exhaustion handling becomes warning-oriented where the plugin supports it. If tool reservation denials, toolCallLimits, or explicit access-list blocks are interrupting traffic, loosen those controls or move the affected workflow back to high-budget dry-run while you recalibrate. The general rollback discussion in the cutover post applies — this is just the OpenClaw-shaped version.
Sidebar: tool call limits as supply-chain protection
OpenClaw's ClawHub marketplace had 1,184 malicious skills flagged in early 2026. The agent-framework market is converging on the same supply-chain risk shape that npm and PyPI have been living with for a decade. Budget enforcement isn't a substitute for skill-vetting, but toolCallLimits and disable_expensive_tools are a meaningful blast-radius limiter when a skill misbehaves.
A compromised skill can't send 10,000 emails when toolCallLimits caps send_email at 10. A skill that's secretly running an expensive sub-agent gets caught by disable_expensive_tools once budget tightens. This is a side benefit, not the primary purpose — but it's a reason to be slightly more aggressive with limits on tools you don't fully trust.
What you now have
After six days, the config in openclaw.json is no longer copy-paste. Every number in it is traceable to session data or provider telemetry you can point at. toolBaseCosts is tied to what tools actually cost in your traffic. toolCallLimits matches your healthy upper bound. lowBudgetThreshold is set to where degradation can still do something useful. And the cutover decision from dry-run to failClosed happens on data, not on a calendar.
The session summary keeps doing this work after cutover, too. Treat it as a weekly tuning ritual: open the latest one, look for new tools showing up in unconfiguredTools, look for count values approaching their toolCallLimits, and compare provider telemetry or estimator output against the configured costs. The numbers move as your agents change. The discipline of letting the data set the config is what keeps enforcement healthy past day six.
Resources
- Integrating Cycles with OpenClaw — full configuration reference
- cycles-openclaw-budget-guard on GitHub — source and issue tracker
- Deploying the Full Cycles Stack — when you're ready to leave dry-run
Related reading
- Your OpenClaw Agent Has No Spending Limit — Here's How to Fix That — the awareness post; the five problems this plugin solves
- We Gave Our OpenClaw Agent a $5 Budget and Watched It Adapt — what graceful degradation looks like once the config is tuned
- Five Lessons from Building a Production OpenClaw Plugin — the engineering field notes referenced throughout this post
- Shadow Mode to Hard Enforcement: The Cutover Decision Tree — the general framework applied here
- Estimate Drift: The Silent Killer of Budget Enforcement — what to do when observed cost diverges from toolBaseCosts
- When Budget Runs Out: Graceful Degradation Patterns — the decision matrix for DENY and ALLOW_WITH_CAPS handling
- Operating Budget Enforcement in Production — what to do when the first denial fires