Why Local-First Agent Runtimes Need Runtime Authority
A twelve-engineer software shop standardizes on local-first agent tools. Three engineers prefer Cline in VS Code. Four use Aider in a terminal. The rest are split between Continue and Cursor's agent mode. Most are running BYOK — each developer supplies an OpenAI, Anthropic, OpenRouter, or local-model key from their own machine. Even on the runtimes that also offer managed account-based setup, agent execution is distributed across local processes, editor windows, and developer-controlled credentials. One Wednesday, an engineer runs a research-mode Cline session that gets stuck on a refactor, calls the API two hundred times overnight, and burns $187 of personal Anthropic budget by morning. The same session also fired a deploy tool — twice — that took down staging until the on-call engineer rolled back. Nobody noticed the spend, and nobody had a way to know the deploy tool was reachable from a research-mode session in the first place. There's no shared dashboard. There's no central operator. There's no rate limiter that aggregates across the twelve developers, because each developer is on their own key, on their own machine, in their own process.
Four days later, the engineer posts a Slack message titled "is this normal?" The CTO realizes the answer they need isn't "fix it for Cline" — it's "what governance layer applies across all four of these runtimes at once?"
The shape of the answer is the topic of this post. Local-first, BYOK, on-device agent runtimes are converging into a recognizable category — and the category has a structural governance gap that no individual runtime, provider cap, or framework limit closes by itself. The gap closes by adding an authority at a layer above the runtime, the same way Agents Are Cross-Cutting closes the gap above any single tool.
What "local-first" means as a category
The term covers a family of agent runtimes that share four structural characteristics:
- Runs on the user's machine. A CLI binary, a VS Code extension, a JetBrains plugin, a terminal-native tool, a desktop app. Not a hosted SaaS. The runtime process is the user's process.
- BYOK economics. The user supplies their own API key for OpenAI, Anthropic, Google, OpenRouter, or a local model. The provider relationship is per-user, not per-organization. Spend hits the user's account.
- No managed control plane. There's no central operator, no admin dashboard above the runtime, no service that sees all users' sessions at once. The runtime is the binary, full stop.
- Often, a plugin or skill marketplace. OpenClaw has ClawHub. Cline has the MCP marketplace. Continue supports cloud-managed configuration flows and local YAML configuration. Cursor has its own MCP catalog. The marketplace is upstream of every individual user's runtime, and once a developer installs a plugin, the local runtime can treat it as trusted enough to participate in the workflow.
Representative members of the category:
| Runtime | Surface | BYOK | Marketplace |
|---|---|---|---|
| OpenClaw | Local agent runtime; plugin lifecycle | Yes — user-supplied keys | ClawHub |
| Cline | VS Code / JetBrains / Cursor / Windsurf / Zed / Neovim sidebar; CLI preview | Supports BYOK across many providers, plus local Ollama / LM Studio; also offers managed setup | MCP Marketplace |
| Aider | Python CLI in the terminal | Yes — Claude / GPT / Gemini / Llama via Ollama / others | None (CLI-only) |
| Continue | VS Code / JetBrains extension and agent/check workflows | Supports local and hosted model configuration | Cloud-managed configs / local YAML |
(Cursor's agent mode and Claude Code sit on the boundary — they support BYOK but route through their respective vendors' infrastructure, which is closer to the MCP-host shape covered in Budget Limits for Claude Code, Cursor, and Windsurf via MCP. The post here is about runtimes that route directly to providers from the user's machine.)
The category's product values — local privacy, no-vendor-lock-in, BYOK economics, free-tier-friendly — are real, and they're why developers and shops adopt these tools. The same values produce the governance gap.
Where each "obvious" control falls short for this category
Take the four controls a CTO would reach for first, and walk through why each one fails in the local-first / BYOK shape.
Provider-side spending caps
Provider controls are improving. Anthropic supports organization- and workspace-level limits — its Tier 4 ceiling is $200,000/month, with finer-grained limits configurable per workspace, and the API blocks further calls once the monthly cap is hit until the next billing period. OpenAI supports organization- and project-level spending limits in the dashboard. Both are useful safety nets against catastrophic monthly bills.
But four structural mismatches keep them blind to the local-first failure mode:
- The cap measures against the calendar, not against the run. A monthly cap of $500 doesn't stop a single-session research loop that burns $187 in eight hours. The cap is checked against accumulated month-to-date spend, not against a per-session or per-task budget.
- BYOK distributes the spend across accounts. Each engineer has their own key, on their own account. The provider's cap sees one engineer's monthly spend — never the team's. There is no API surface for "aggregate spend across our developers' personal Anthropic accounts," even when each developer is using their employer's workspace within the provider, because the spend lands in the account that owns the key.
- The granularity is account- or workspace-wide, not session-wide. The cap can't say "let this user burn $187 on background research but block deploy-related tools." It can only say "stop everything when the month's spend hits X."
- They don't see action risk at all. A provider cap is a cost protector. It has no concept of "this tool deletes data" vs "this tool reads files" — it can't deny a deploy call while allowing twenty read_file calls. The deploy that took down staging in the opening vignette looked perfectly fine from the provider's point of view; the provider doesn't know what the tool does, only what the model call costs. The risk-tier framing is in AI Agent Risk Assessment.
Provider controls primarily protect usage within the provider's account model. They aren't designed to govern an organization's cross-user, cross-runtime, cross-session, cross-provider surface, and they don't make per-action decisions on either spend or risk. A longer treatment of the granularity / scope / delay analysis is in Cycles vs Provider Spending Caps.
Framework-internal limits
Each runtime ships its own internal ceilings. Aider has token budgets per request and chat-mode session settings. Cline has a max-requests-per-task setting. Continue offers per-assistant configuration. These are useful for the simplest runaway shape — an agent stuck calling the same tool forever.
The structural problem: each ceiling is per process, on one machine. A user can spawn three Cline windows. A user can run two Aider terminals at once. A user can switch from Continue to Cline mid-day and reset every counter. A worker crash drops the counter. A user reboots and starts fresh. There is no shared truth about how much this user — let alone this team — has spent today.
The limits also speak in step counts, not in money or risk. They don't know what an iteration costs: the same max_requests=50 might cost $0.50 on Haiku or $50 on Opus. They don't distinguish "read a file" from "deploy a service" — every step is one step, regardless of blast radius. They are local circuit breakers, not budget authority and not action authority. The same shortcoming was named — for hosted-orchestrator frameworks — in Agents Are Cross-Cutting. Your Controls Aren't. The local-first version of the failure is the same failure, with the additional twist that the orchestrator runs in the user's process, so even the per-process counter is split across however many editor windows the user opens.
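The money-blindness of a step counter is easy to quantify. Here is a rough sketch — the model names and per-token prices are assumed, illustrative values chosen to reproduce the $0.50 / $50 contrast above, not real rate cards:

```python
# Rough illustration: a step cap (max_requests=50) is blind to cost.
# Model names and $/million-token prices below are assumed for illustration.
PRICE_PER_M_OUTPUT_TOKENS = {"small-model": 1.25, "large-model": 125.00}

def session_cost(model: str, requests: int, avg_output_tokens: int) -> float:
    """Dollar cost of a session, counting output tokens only for simplicity."""
    per_token = PRICE_PER_M_OUTPUT_TOKENS[model] / 1_000_000
    return requests * avg_output_tokens * per_token

# Two identical 50-step sessions, two very different bills:
cheap = session_cost("small-model", requests=50, avg_output_tokens=8_000)
pricey = session_cost("large-model", requests=50, avg_output_tokens=8_000)
print(f"${cheap:.2f} vs ${pricey:.2f}")  # the step counter sees "50" both times
```

The step counter reads the same number in both sessions; only a money-denominated budget can tell them apart.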
Observability and cost-attribution tools
Helicone, Langfuse, OpenLLMetry, and others give excellent retrospective visibility. They trace, attribute cost, surface anomalies, and produce dashboards.
For local-first / BYOK runtimes, two structural mismatches keep them in the wrong layer:
- They sit on the proxy path. Capturing spend means routing the runtime's API calls through a proxy or wrapper that the observability tool controls. Local-first runtimes hit provider APIs directly from the user's machine. Each user can install a local proxy, but that's per-user infrastructure and produces a per-user dashboard, not a team view. A team-wide proxy means routing every developer's API traffic through a shared service, which violates the BYOK / privacy values the category was chosen for in the first place.
- They are post-hoc by design. Even if every developer agreed to a shared proxy, the dashboard reads from already-completed traces. The runaway is recorded in faithful detail; it isn't stopped at call number 50, when the damage was still $1.50. The same retrospective-vs-pre-execution argument from Runtime Authority vs Guardrails vs Observability applies — the layer is wrong for enforcement, not just for this category.
Marketplace blast-radius controls
This is the local-first-specific failure mode that doesn't have a hosted-SaaS equivalent. Plugin and skill marketplaces — ClawHub, Cline's MCP marketplace, Continue's hosted configs, Cursor's MCP catalog — are upstream of every user's runtime. Once a developer installs a plugin, the local runtime can treat it as trusted enough to participate in the workflow.
This is the npm / PyPI / Docker Hub trajectory, ten years compressed. Our earlier analysis cited 1,184 malicious skills reported on OpenClaw's ClawHub in early 2026; the exact number matters less than the pattern. Marketplace operators do what every package registry eventually does — they scan, they delist, they warn — but the controls are detection-after-publish, not blast-radius-at-runtime. A skill that calls send_email ten thousand times is still going to call it ten thousand times unless something on the runtime side is willing to say no. Cost limits help here only incidentally; the real defense is action-tier classification — knowing that send_email, deploy, and delete_* belong to risk tiers that warrant tighter caps regardless of how cheap each call is.
Agent Skills Are the New Supply Chain makes the broader argument. The local-first version is sharper because each user's runtime is independently trusting the same marketplace, and there's no shared service to enforce per-skill caps across the team.
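What "blast-radius-at-runtime" could look like in miniature — as a hypothetical sketch, not Cycles' actual API: the tier names follow the post's risk-tier framing, and the policy table, tool classifications, and decide function are illustrative assumptions:

```python
# Hypothetical per-session action policy: every tool is classified into a risk
# tier, and each tier gets a cap that applies no matter which skill wraps it.
RISK_TIER = {"read_file": "read-only", "send_email": "write-external", "deploy": "execution"}
SESSION_CAP = {"read-only": None, "write-external": 10, "execution": 0}  # None = uncapped

def decide(tool: str, calls_so_far: int) -> str:
    """Return ALLOW or DENY for this tool call, given per-session history."""
    cap = SESSION_CAP[RISK_TIER[tool]]
    if cap is None or calls_so_far < cap:
        return "ALLOW"
    return "DENY"

print(decide("read_file", 500))   # ALLOW — read-only is uncapped
print(decide("send_email", 10))   # DENY — write-external capped at 10/session
print(decide("deploy", 0))        # DENY — execution capped at 0: never reachable
```

The point of the sketch: the skill that wraps send_email never has to be audited individually, because the cap attaches to the tool's tier, not to the skill.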
What's structurally missing
Each control above fails for a different reason. The reasons share a shape:
| Control | What it can see | What it can't see |
|---|---|---|
| Provider cap | Account/workspace-month spend | Per-session, per-tool, per-team, per-action-risk decisions |
| Framework limit | One process | Other processes, other users, other runtimes |
| Observability | Past traces (if proxied) | Pre-execution decisions |
| Marketplace | Published skills | Per-runtime per-skill blast radius |
Nothing in that table sees the team-level, cross-runtime, cross-user, pre-execution decision surface that local-first agents collectively act on — for either spend or action risk. No individual control evolves into that thing, for the same reason no provider cap evolves into a cross-provider authority — it's a layer mismatch, not a feature gap.
The missing piece has a recognizable shape:
- External to the runtime. Lives outside Cline, Aider, Continue, OpenClaw, so it can see across whichever ones the team uses simultaneously.
- Per-user and per-session scoped. Each developer's spend and action attempts are visible to the team's enforcement layer even though the API key isn't. Spend lands on the user's provider account; the decision lands on the team's authority.
- Pre-execution. A reserve-commit primitive that runs before the action. By the time the bill — or the deploy — lands, it's too late.
- Cost AND risk in the same primitive. A reservation isn't only "do you have $X budget left?" It's also "given this action's tier — read-only, write-local, write-external, mutation, execution — does the policy ALLOW, ALLOW_WITH_CAPS, or DENY?" Spend and side-effect blast-radius are decided in one place. The risk-tier framework is in AI Agent Risk Assessment.
- Three-way decisional. ALLOW, ALLOW_WITH_CAPS, DENY — so a developer crossing a soft threshold sees graceful degradation (cheaper model, fewer tools, shorter responses) on the spend side, and capability narrowing (denylist of high-tier tools, max-steps shrinkage) on the risk side, instead of a hard stop in the middle of a refactor. The general framing is in Caps and the Three-Way Decision Model.
- Marketplace-aware. Plugin / skill / MCP-server invocations are first-class budget and action events, not opaque tool calls. A team can cap send_email at 10/session and DENY deploy outright for every runtime using any skill that wraps those APIs, without auditing each skill individually.
These properties don't fall out of any one runtime's roadmap. They fall out of "the team needs to govern across the four runtimes its developers chose, on both cost and side-effect risk, and none of those runtimes is going to grow into the others."
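The reserve/commit shape described above can be sketched in a few dozen lines. Everything here is a hypothetical illustration of the primitive — the class name, thresholds, and tier handling are assumptions, not Cycles' actual API:

```python
from dataclasses import dataclass

ALLOW, ALLOW_WITH_CAPS, DENY = "ALLOW", "ALLOW_WITH_CAPS", "DENY"

@dataclass
class SessionAuthority:
    """Hypothetical pre-execution authority: reserve before acting, commit after."""
    budget_usd: float
    soft_threshold_usd: float
    denied_tiers: tuple = ("execution",)      # e.g. deploy-class tools
    spent_usd: float = 0.0
    reserved_usd: float = 0.0

    def reserve(self, estimated_usd: float, tier: str) -> str:
        if tier in self.denied_tiers:
            return DENY                       # risk decision, independent of spend
        projected = self.spent_usd + self.reserved_usd + estimated_usd
        if projected > self.budget_usd:
            return DENY                       # hard budget exhausted
        self.reserved_usd += estimated_usd
        if projected > self.soft_threshold_usd:
            return ALLOW_WITH_CAPS            # degrade: cheaper model, fewer tools
        return ALLOW

    def commit(self, reserved_usd: float, actual_usd: float) -> None:
        """Reconcile the reservation against what the action actually cost."""
        self.reserved_usd -= reserved_usd
        self.spent_usd += actual_usd

auth = SessionAuthority(budget_usd=5.00, soft_threshold_usd=4.00)
print(auth.reserve(0.50, tier="read-only"))   # ALLOW
auth.commit(0.50, actual_usd=0.42)
print(auth.reserve(0.10, tier="execution"))   # DENY, regardless of budget left
```

Note that spend and risk are decided in the same call: the execution-tier denial fires before the budget arithmetic ever runs, which is the "one place" property the list above asks for.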
The OpenClaw chapter
OpenClaw was the first runtime in this category where the Cycles team shipped a complete cost-and-action-authority integration end-to-end. Its five-hook plugin lifecycle — before_model_resolve → before_prompt_build → before_tool_call → after_tool_call → agent_end — is well-suited to hosting such an authority: pre-execution decision points on every model and tool invocation, plus a session-end reconciliation hook. The same hooks that decide spend also decide whether a tool with a higher risk tier (send_email, deploy) is allowed in this session at all. (Cline and Continue have MCP support that could host an external authority via the MCP-server pattern; the OpenClaw integration just happened to ship first.)
The four existing posts in this series document what that integration looks like end to end:
- Your OpenClaw Agent Has No Spending Limit — Here's How to Fix That — the awareness post: the five problems the integration solves.
- We Gave Our OpenClaw Agent a $5 Budget and Watched It Adapt — the worked walkthrough: graceful degradation in action.
- Five Lessons from Building a Production OpenClaw Plugin — the plugin-author internals: which hooks block cleanly, which don't, what's missing.
- Your First Week with Cycles Budget Guard for OpenClaw — the operator playbook: dry-run, calibrate, cut over.
That stack is one runtime's chapter of the category-level pattern. The pattern itself — pre-execution decision, scoped budgets, three-way decisional, marketplace-aware — is what every other local-first runtime is going to need. The shape will look different in each runtime.
What the rest of the category looks like at scale
Cline, Aider, and Continue today rely on user-side discipline: your own monthly cap, your own self-imposed session ceiling, your own habits. At single-developer scale this works. At team scale — five developers, twelve developers, fifty developers, all on personal keys — it doesn't, for the same reasons it didn't work for OpenClaw before the plugin existed.
Possible integration shapes when these runtimes need a governance layer:
- Native plugin lifecycle, the OpenClaw shape. Cleanest. Requires the runtime to expose hook points; not all of these runtimes do.
- MCP server, the Claude Code / Cursor / Windsurf shape. Works for any runtime with MCP support — Cline and Continue both speak MCP today. CLI-first runtimes that don't host MCP natively need a different shape.
- CLI wrapper, the Aider-friendly shape. A script around the runtime that intercepts each API call and consults an authority. Works for any binary; trades elegance for portability.
- Shared local proxy, the team-deployed shape. A small process every developer's runtime points at, which talks to a central authority. The proxy routes decision-making, not LLM traffic — it receives action metadata and estimated spend, asks the authority for ALLOW / ALLOW_WITH_CAPS / DENY, and lets the local runtime call the provider directly if allowed. Prompt bodies and provider responses don't pass through it. This resolves the BYOK-vs-team-visibility tension without violating the privacy framing.
None of these is the right answer for every runtime. The constant is the layer: an authority that sees what no individual runtime can see, that decides before the action runs, and that scopes per-user/session/team rather than per-account/month.
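The metadata-only property of the shared-proxy shape can be sketched as follows. The DecisionProxy class, field names, and stub authority are hypothetical; the point is what the proxy does and does not see:

```python
import json

class DecisionProxy:
    """Hypothetical team proxy: sees action metadata, never prompt bodies."""
    def __init__(self, authority):
        self.authority = authority

    def check(self, metadata_json: str) -> str:
        meta = json.loads(metadata_json)
        # Only metadata crosses the wire: user, runtime, tool, tier, estimate.
        assert "prompt" not in meta, "prompt bodies must not reach the proxy"
        return self.authority.reserve(meta["estimated_usd"], meta["tier"])

class _StubAuthority:
    """Stand-in for the central authority, with an arbitrary example policy."""
    def reserve(self, estimated_usd: float, tier: str) -> str:
        return "ALLOW" if estimated_usd < 1.0 and tier != "execution" else "DENY"

proxy = DecisionProxy(_StubAuthority())
request = json.dumps({"user": "dev-07", "runtime": "aider",
                      "tool": "read_file", "tier": "read-only",
                      "estimated_usd": 0.02})
print(proxy.check(request))  # on ALLOW, the runtime calls the provider directly
```

Because the provider call itself never routes through the proxy, each developer's key and prompt traffic stay on their own machine — which is how this shape sidesteps the BYOK-vs-team-visibility tension the observability section ran into.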
The takeaway
Local-first agent runtimes are a category, and the category has a governance gap that the runtimes themselves can't close, the providers can't close, and the observability tools can't close. The gap is on both axes — uncontrolled cost and uncontrolled action risk — and the same primitive (reserve, decide, commit, with risk-tier classification) handles both. OpenClaw's plugin and the four-post series above are one worked example of what the close looks like when a runtime exposes the right hook points. The other runtimes will need their own version, in whatever integration shape fits their architecture. Teams that adopt these tools across more than a handful of developers will discover the gap whether or not the runtimes have shipped a fix yet — usually as a Slack message titled "is this normal?"
The pattern recurs because it's structural. Provider caps are at the wrong granularity. Framework limits are at the wrong scope. Observability is at the wrong moment. Marketplaces are at the wrong layer. None of them sees both spend and side-effect risk on the same decision. The only thing that closes a layer gap is a layer.
Resources
- Cycles overview — the open-source runtime authority for AI agents.
- Integrating Cycles with OpenClaw — the worked integration guide for the runtime that has it.
- Cline, Aider, Continue — official documentation for the comparison points used in this post.
Related reading
- Agents Are Cross-Cutting. Your Controls Aren't. — the same structural argument applied to provider × tool × tenant × worker dimensions.
- Budget Limits for Claude Code, Cursor, and Windsurf via MCP — the MCP-host-shaped sibling of this category.
- Runtime Authority vs Guardrails vs Observability — the lifecycle companion: why enforcement has to happen before the action.
- Cycles vs Provider Spending Caps — granularity, scope, and delay analysis of provider caps.
- AI Agent Risk Assessment — the risk-tier framework (read-only / write-local / write-external / mutation / execution) the post invokes for action-side decisions.
- Agent Skills Are the New Supply Chain — why marketplace ecosystems amplify the supply-chain risk shape.
- MCP Tool Poisoning — the ClawHub incident and what it implies for the category.
- Your First Week with Cycles Budget Guard for OpenClaw — the operator playbook this post points at.