Production Operations Guide
This guide covers what you need to run Cycles reliably in production. It assumes you've already deployed the stack per Deploy the Full Stack and are preparing for production traffic.
INFO
Cycles stores all state in Redis. Redis availability directly determines Cycles availability. Plan your Redis deployment accordingly.
Operations UI for incident response
For incident-response workflows — freeze a runaway budget, suspend a tenant, force-release hung reservations, replay missed webhooks, revoke a leaked API key — deploy the Cycles Admin Dashboard. It's a Vue 3 SPA with one-click actions (capability-gated, with confirmations and blast-radius summaries) that's typically faster than hand-writing curl commands during a live incident. Pair it with the Prometheus alerting in Monitoring and Alerting: alerts page you, the dashboard helps you act.
Redis configuration for production
Always configure Redis authentication in production
Set REDIS_PASSWORD and provide it to all Cycles services. An unauthenticated Redis instance is a critical security vulnerability — anyone with network access can read budget state, modify reservations, and extract API keys. See Security Hardening — Redis Authentication for complete setup including TLS and ACLs.
Persistence
Enable both RDB snapshots and AOF append-only logging:
```
# redis.conf
save 900 1              # Snapshot every 15 min if at least 1 key changed
save 300 10             # Snapshot every 5 min if at least 10 keys changed
appendonly yes          # Enable AOF
appendfsync everysec    # Fsync once per second (good balance of safety and performance)
```
In Docker Compose:
```yaml
redis:
  image: redis:7-alpine
  command: redis-server --appendonly yes --save "900 1" --save "300 10"
  volumes:
    - redis-data:/data
```
Memory management
Set a max memory limit and eviction policy:
```
maxmemory 2gb
maxmemory-policy noeviction   # IMPORTANT: never evict budget data
```
Always use `noeviction`. Evicting budget keys silently loses budget state. It is better for Redis to reject writes (causing reservation failures that can be retried) than to silently drop data.
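To verify what a running instance is actually using (a quick sanity check with standard `redis-cli` commands; assumes AUTH is configured):
```bash
# Confirm the memory limit and eviction policy in effect
redis-cli -a "$REDIS_PASSWORD" CONFIG GET maxmemory
redis-cli -a "$REDIS_PASSWORD" CONFIG GET maxmemory-policy   # should print "noeviction"
```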
High availability
For production, consider:
- Redis Sentinel — automatic failover with a primary + replica setup. Good for most deployments.
- Redis Cluster — sharded across multiple nodes. Required for very large deployments.
Cycles uses Lua scripts for atomic operations. All keys for a single reservation operation are in the same Redis keyspace, so single-instance and Sentinel setups work out of the box. For Redis Cluster, ensure the key prefix strategy keeps related keys on the same shard.
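Hash tags (the `{...}` portion of a key) are the standard Redis Cluster mechanism for pinning related keys to one slot. The key names below are illustrative, not Cycles' actual schema; check the real prefix strategy before relying on this:
```bash
# With a hash tag, only the text inside {...} is hashed, so both keys land in
# the same slot. Key names here are hypothetical examples, not Cycles' schema.
redis-cli CLUSTER KEYSLOT "cycles:{acme-corp}:budget"
redis-cli CLUSTER KEYSLOT "cycles:{acme-corp}:reservation:r-123"
# Both commands print the same slot number.
```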
Backup strategy
- Automated RDB snapshots stored offsite (S3, GCS, etc.) — see the sketch after this list
- AOF backups for point-in-time recovery
- Test restores regularly — untested backups are not backups
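A minimal offsite-backup sketch, assuming the default RDB path of /var/lib/redis/dump.rdb and an S3 bucket of your choosing; adjust paths, credentials, and retention to your environment:
```bash
#!/usr/bin/env bash
# Trigger a snapshot, wait for it to finish, then ship it offsite.
# Paths, bucket name, and schedule are assumptions; adapt to your setup.
set -euo pipefail

last=$(redis-cli -a "$REDIS_PASSWORD" LASTSAVE)
redis-cli -a "$REDIS_PASSWORD" BGSAVE
# LASTSAVE advances once the background save completes
while [ "$(redis-cli -a "$REDIS_PASSWORD" LASTSAVE)" = "$last" ]; do sleep 1; done

aws s3 cp /var/lib/redis/dump.rdb "s3://YOUR_BACKUP_BUCKET/redis/dump-$(date +%F).rdb"
```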
Cycles Server configuration
Running multiple instances
The Cycles Server is stateless. You can run multiple instances behind a load balancer:
```yaml
cycles-server-1:
  image: ghcr.io/runcycles/cycles-server:0.1.25.17
  environment:
    REDIS_HOST: redis-primary
    REDIS_PORT: 6379
    REDIS_PASSWORD: ${REDIS_PASSWORD}
cycles-server-2:
  image: ghcr.io/runcycles/cycles-server:0.1.25.17
  environment:
    REDIS_HOST: redis-primary
    REDIS_PORT: 6379
    REDIS_PASSWORD: ${REDIS_PASSWORD}
```
Any load-balancing strategy works (round-robin, least-connections). No sticky sessions are required.
Health checks
All three services expose Spring Boot Actuator health at `/actuator/health`. Only the Admin Server additionally enables Spring's dedicated Kubernetes liveness/readiness probes (`management.endpoint.health.probes.enabled=true`); the runtime and events services expose only the aggregate endpoint.
```bash
# Cycles Server (aggregate only)
curl http://localhost:7878/actuator/health

# Admin Server (aggregate + Kubernetes probes)
curl http://localhost:7979/actuator/health
curl http://localhost:7979/actuator/health/liveness
curl http://localhost:7979/actuator/health/readiness

# Events Service (aggregate only)
curl http://localhost:7980/actuator/health
```
Configure your load balancer or orchestrator to check these endpoints. On Kubernetes, wire the Admin Server's liveness/readiness probes to `/actuator/health/liveness` and `/actuator/health/readiness`; for the runtime and events services, probe `/actuator/health` directly. All three services rely on Spring Boot's default Redis health indicator — the aggregate `/actuator/health` status turns DOWN when Redis is unreachable. There is no custom queue-consumption health check on the Events Service today; for backlog monitoring, watch `LLEN dispatch:pending` (see Monitoring and Alerting).
JVM tuning
The default JVM settings work for most deployments. For high-throughput environments:
```bash
JAVA_OPTS="-Xms512m -Xmx1g -XX:+UseG1GC"
```
Reservation expiry
The server runs a background sweep to expire stale reservations:
```yaml
cycles:
  expiry:
    interval-ms: 5000   # Default: sweep every 5 seconds
```
Reduce the interval for tighter TTL enforcement; increase it to reduce Redis load when TTL precision is not critical.
For listing and recovering stale or orphaned reservations after client crashes, see Reservation Recovery and Listing.
Events Service configuration
The Cycles Events Service (cycles-server-events, port 7980) delivers webhook notifications asynchronously. It is optional — if not deployed, admin and runtime servers continue operating normally, and events accumulate in Redis with TTL until the service starts.
Configuration
| Variable | Default | Description |
|---|---|---|
| `WEBHOOK_SECRET_ENCRYPTION_KEY` | (empty) | AES-256-GCM key for signing-secret encryption. Base64, 32 bytes. Must be the same across all services. Generate with `openssl rand -base64 32`. |
| `EVENT_TTL_DAYS` | 90 | Redis TTL for event records. |
| `DELIVERY_TTL_DAYS` | 14 | Redis TTL for delivery records. |
| `MAX_DELIVERY_AGE_MS` | 86400000 | Stale deliveries auto-fail after this age (24 h default). |
| `dispatch.retry.poll-interval-ms` | 5000 | How often the retry scheduler scans for ready-to-retry deliveries. |
| `dispatch.retry.batch-size` | 100 | Max deliveries processed per retry-scan tick. |
| `dispatch.http.timeout-seconds` | 30 | HTTP request timeout per delivery attempt. |
| `dispatch.http.connect-timeout-seconds` | 5 | HTTP connect timeout per delivery attempt. |
The per-subscription retry policy (exponential backoff) defaults to `max_retries=5`, `initial_delay_ms=1000`, `backoff_multiplier=2.0`, `max_delay_ms=60000`; with those defaults the retry delays run 1 s, 2 s, 4 s, 8 s, 16 s, well under the 60 s cap. A delivery older than `MAX_DELIVERY_AGE_MS` is failed immediately without further retries. See the Events Service section in the Server Configuration Reference for the full knob list.
Running multiple instances
The Events Service is safe to run as multiple instances. Each instance consumes from the dispatch:pending Redis queue via BRPOP, which is atomic — each delivery job is processed by exactly one instance.
```yaml
cycles-events-1:
  image: ghcr.io/runcycles/cycles-server-events:0.1.25.10
  environment:
    REDIS_HOST: redis-primary
    REDIS_PORT: 6379
    REDIS_PASSWORD: ${REDIS_PASSWORD}
    WEBHOOK_SECRET_ENCRYPTION_KEY: ${WEBHOOK_SECRET_ENCRYPTION_KEY}
cycles-events-2:
  image: ghcr.io/runcycles/cycles-server-events:0.1.25.10
  environment:
    REDIS_HOST: redis-primary
    REDIS_PORT: 6379
    REDIS_PASSWORD: ${REDIS_PASSWORD}
    WEBHOOK_SECRET_ENCRYPTION_KEY: ${WEBHOOK_SECRET_ENCRYPTION_KEY}
```
Events Service down
If the Events Service is unavailable:
- Admin and runtime servers are unaffected — event dispatch is fire-and-forget
- Events accumulate in Redis with their TTLs (90 days for events, 14 days for deliveries)
- On restart, deliveries older than `MAX_DELIVERY_AGE_MS` (default 24 h) auto-fail; fresher ones deliver normally
Network architecture
Recommended topology
Place the Cycles Server behind a load balancer reachable by your applications; keep the Admin Server, Events Service, and Redis on an internal network with no public ingress.
Network isolation
- Cycles Server (port 7878): Accessible to your application. Can be on an internal network or behind an API gateway.
- Admin Server (port 7979): Internal access only. This manages tenants, API keys, and budgets. Never expose to the public internet.
- Events Service (port 7980): Internal access only. Consumes from Redis and delivers webhooks outbound. Never needs inbound traffic from applications.
- Redis (port 6379): Internal access only. Never expose directly. (Example firewall rules follow this list.)
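As one way to enforce this isolation at the host level, here is a sketch using ufw; the CIDRs are placeholders for your application and ops networks, and cloud security groups would express the same rules:
```bash
# Hypothetical host-firewall rules matching the isolation rules above (ufw).
# 10.0.0.0/16 = application network, 10.1.0.0/24 = ops network (placeholders).
ufw allow from 10.0.0.0/16 to any port 7878 proto tcp   # apps -> Cycles Server
ufw allow from 10.1.0.0/24 to any port 7979 proto tcp   # ops  -> Admin Server
ufw deny 7979/tcp   # Admin Server: no other inbound access
ufw deny 7980/tcp   # Events Service: no inbound traffic needed
ufw deny 6379/tcp   # Redis: internal only
```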
TLS termination
Terminate TLS at the load balancer or API gateway. The Cycles Server itself runs plain HTTP. Example with nginx:
```nginx
server {
    listen 443 ssl;
    server_name cycles.internal.example.com;

    ssl_certificate     /etc/ssl/certs/cycles.crt;
    ssl_certificate_key /etc/ssl/private/cycles.key;

    location / {
        proxy_pass http://cycles-server:7878;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}
```
Capacity planning
Rules of thumb
- Redis memory: ~1 KB per active reservation, ~500 bytes per budget ledger. 1 GB of Redis memory supports roughly 500K concurrent reservations.
- Server CPU: Each reservation involves 1 Redis Lua script execution (~1ms). A single server instance can handle thousands of reservations per second.
- Latency: Expect <5 ms for reservation operations on a well-configured setup (server co-located with Redis). The checks below show how to measure memory headroom.
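To check live consumption against these estimates (standard `redis-cli` commands; the key pattern is an assumption, not Cycles' documented schema):
```bash
# Compare live memory use against the configured limit
redis-cli -a "$REDIS_PASSWORD" INFO memory | grep -E '^(used_memory_human|maxmemory_human)'

# Rough count of reservation keys, assuming they share a common prefix
# ("cycles:reservation:*" is a hypothetical pattern; verify the real one)
redis-cli -a "$REDIS_PASSWORD" --scan --pattern 'cycles:reservation:*' | wc -l
```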
Scaling triggers
Add more Cycles Server instances when:
- Response latency exceeds 50ms at p99
- CPU utilization exceeds 70%
Add more Events Service instances when:
- The `dispatch:pending` queue depth grows consistently (`redis-cli LLEN dispatch:pending`)
- Webhook delivery latency exceeds acceptable thresholds
- Multiple instances are safe — BRPOP is atomic, so each delivery is processed exactly once
Scale Redis when:
- Memory utilization exceeds 80%
- Command latency exceeds 5 ms (see the sampler below)
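redis-cli has a built-in latency sampler that is handy for checking the 5 ms threshold from a host near the servers:
```bash
# Continuously sample round-trip command latency (Ctrl-C to stop)
redis-cli -h redis-primary -a "$REDIS_PASSWORD" --latency

# Or aggregate into 15-second buckets for a longer observation window
redis-cli -h redis-primary -a "$REDIS_PASSWORD" --latency-history
```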
Upgrade procedures
Rolling upgrade
All three services (Cycles Server, Admin Server, Events Service) are stateless — all state lives in Redis. You can do rolling upgrades with zero downtime:
- Pull the new image: `docker pull ghcr.io/runcycles/cycles-server:NEW_VERSION`
- Stop one instance at a time
- Start the new version
- Verify the health check passes (`/actuator/health` on ports 7878, 7979, 7980)
- Repeat for the remaining instances (a scripted version follows)
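Scripted, the loop might look like this. It is a sketch for a Compose deployment; the service names come from the examples above, and the single shared health port is an assumption (adjust per-instance ports to your setup):
```bash
#!/usr/bin/env bash
# Rolling-upgrade sketch: recreate one instance at a time, gating on health.
# Assumes you have already bumped the image tag in docker-compose.yml.
set -euo pipefail

docker pull "ghcr.io/runcycles/cycles-server:${NEW_VERSION:?set target version}"
for svc in cycles-server-1 cycles-server-2; do
  docker compose up -d --no-deps "$svc"   # recreate this instance on the new image
  until curl -fsS http://localhost:7878/actuator/health >/dev/null; do
    sleep 2   # wait for the instance to report UP before moving on
  done
  echo "$svc healthy, continuing"
done
```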
The Events Service can be upgraded independently. While it is down, webhook deliveries queue in Redis and are processed when the new version starts.
Version compatibility
The Cycles protocol is versioned (/v1). Minor version upgrades (e.g., 0.1.23 → 0.1.24) are backward-compatible. Check the changelog for breaking changes before major upgrades.
Rollback
If an upgrade causes issues:
- Stop the new version
- Start the previous version
- Redis state is compatible across minor versions
Logging
Log levels
Configure via Spring Boot:
```yaml
logging:
  level:
    io.runcycles: INFO          # Application logs
    org.springframework: WARN   # Framework logs
```
Set `io.runcycles: DEBUG` for troubleshooting (includes full request/response logging).
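Because these are standard Spring Boot properties, you can also flip levels per container via environment variables (Spring's relaxed binding), without editing YAML:
```bash
# LOGGING_LEVEL_IO_RUNCYCLES maps to logging.level.io.runcycles
docker run -e LOGGING_LEVEL_IO_RUNCYCLES=DEBUG \
  ghcr.io/runcycles/cycles-server:0.1.25.17
```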
Structured logging
Add JSON logging for log aggregation systems:
```yaml
logging:
  pattern:
    console: '{"timestamp":"%d","level":"%p","logger":"%c","message":"%m"}%n'
```
Or use the Spring Boot JSON logging starter for fully structured output.
Operational runbooks
Budget exhaustion alert
Symptom: Applications report BUDGET_EXCEEDED errors.
Response:
- Check which scope is exhausted: `GET /v1/balances?tenant=...` (see the example after this list)
- Determine whether this is expected (legitimate traffic) or unexpected (a runaway agent)
- If expected: fund the budget via the admin API (`POST .../fund` with `CREDIT`)
- If unexpected: check active reservations for anomalies (`GET /v1/reservations?status=ACTIVE`)
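Concretely, step 1 might look like this (the endpoint is the one from the step above; the tenant name is an example):
```bash
# Inspect balances for one tenant to find the exhausted scope
curl -s "http://localhost:7878/v1/balances?tenant=acme-corp" \
  -H "X-Cycles-API-Key: $API_KEY" | jq
```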
Reservation leak
Symptom: Budget reserved amount grows but spent stays flat. Reservations are being created but never committed or released.
Response:
- List active reservations: `GET /v1/reservations?status=ACTIVE`
- Check for reservations past their expected TTL (a sample query follows this list)
- The expiry sweep should eventually clean these up; if it is not running, check the server logs
- Investigate the client application — it may be failing to commit or release
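A quick way to surface long-lived reservations, assuming ISO-8601 `created_at` timestamps as in the reconciliation example below (GNU `date` syntax shown; on macOS use `date -u -v-10M`):
```bash
# List ACTIVE reservations created more than 10 minutes ago
cutoff=$(date -u -d '10 minutes ago' +%Y-%m-%dT%H:%M:%SZ)
curl -s "http://localhost:7878/v1/reservations?tenant=acme-corp&status=ACTIVE" \
  -H "X-Cycles-API-Key: $API_KEY" |
  jq --arg cutoff "$cutoff" \
     '.reservations[] | select(.created_at < $cutoff) | {reservation_id, scope_path, created_at}'
```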
Commit failure after successful LLM call
Symptom: An LLM call (or other side-effecting action) completes successfully, but the subsequent commit to Cycles fails. The work happened and incurred real cost, but the budget ledger does not reflect it.
Why this happens:
- Transient network error between client and Cycles Server
- Cycles Server restart or Redis outage at commit time
- Client process crash after the LLM call but before commit
What the retry engine does:
All three clients (Python, TypeScript, Spring Boot) include a commit retry engine enabled by default. When a commit fails with a transport error or 5xx response, the engine retries with exponential backoff (default: 5 attempts over ~30 seconds). This handles most transient failures automatically.
When retry is not enough:
If all retries are exhausted or the client process crashes entirely, the reservation remains in ACTIVE state until it expires (based on TTL + grace period). After expiry, the reserved budget is returned to the pool. The actual cost is unaccounted for — the budget appears more available than it really is.
Response:
- Check for expired reservations that were never committed:

  ```bash
  curl -s "http://localhost:7878/v1/reservations?tenant=acme-corp&status=EXPIRED" \
    -H "X-Cycles-API-Key: $API_KEY" |
    jq '.reservations[] | {reservation_id, scope_path, estimate: .estimate.amount, created_at, expired_at}'
  ```

- Reconcile using events: for each expired reservation that represents real work, record the actual cost as a standalone event:

  ```bash
  curl -s -X POST http://localhost:7878/v1/events \
    -H "Content-Type: application/json" \
    -H "X-Cycles-API-Key: $API_KEY" \
    -d '{
      "idempotency_key": "reconcile-<reservation_id>",
      "subject": { "tenant": "acme-corp" },
      "action": { "kind": "reconciliation", "name": "commit-failure-recovery" },
      "actual": { "unit": "USD_MICROCENTS", "amount": <actual_cost> },
      "overage_policy": "ALLOW_WITH_OVERDRAFT",
      "metadata": { "original_reservation_id": "<reservation_id>" }
    }'
  ```

- Monitor commit failure rates. A sustained increase in commit failures signals infrastructure issues; track the ratio of committed vs. expired reservations.
Prevention:
- Keep retry enabled (default) with aggressive settings for critical workloads
- Use the `ALLOW_WITH_OVERDRAFT` overage policy for must-record actions so reconciliation events are always accepted
- Ensure client processes have graceful shutdown hooks that commit or release active reservations
- Set up alerts on the expired-reservation count (see Monitoring and Alerting); a simple probe follows
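For the last point, a probe an alerting script could poll (same endpoint as step 1 of the response above):
```bash
# Emit the expired-reservation count for a tenant; alert when it rises
curl -s "http://localhost:7878/v1/reservations?tenant=acme-corp&status=EXPIRED" \
  -H "X-Cycles-API-Key: $API_KEY" | jq '.reservations | length'
```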
Redis connection loss
Symptom: All reservation operations fail with 500 errors. Events Service also stops processing deliveries.
Response:
- Check Redis connectivity: `redis-cli ping`
- Check server logs for connection errors on all three services (ports 7878, 7979, 7980)
- Restart services if Redis connection pool is exhausted
- Active reservations with remaining TTL are preserved in Redis and will resume when connectivity returns
- Queued webhook deliveries resume automatically when the Events Service reconnects
Webhook delivery failures
Symptom: Webhook endpoints are not receiving events. Queue depth grows.
Response:
- Check Events Service health: `GET http://localhost:7980/actuator/health`
- Check queue depth: `redis-cli LLEN dispatch:pending`
- Check whether the subscription was auto-disabled: `GET /v1/admin/webhooks/{subscription_id}`
- Re-enable if needed: `PATCH /v1/admin/webhooks/{subscription_id}` with `{"status": "ACTIVE"}` (example after this list)
- Verify `WEBHOOK_SECRET_ENCRYPTION_KEY` matches across all services
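The re-enable step as a curl command, assuming the admin webhook endpoints are served by the Admin Server on port 7979 and use the same API-key header as the other examples; `$SUBSCRIPTION_ID` is a placeholder:
```bash
# Re-enable an auto-disabled webhook subscription
curl -s -X PATCH "http://localhost:7979/v1/admin/webhooks/$SUBSCRIPTION_ID" \
  -H "Content-Type: application/json" \
  -H "X-Cycles-API-Key: $ADMIN_API_KEY" \
  -d '{"status": "ACTIVE"}'
```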
Next steps
- Webhook Integrations — PagerDuty, Slack, ServiceNow examples
- Client Performance Tuning — timeout, retry, and connection pool optimization
- Security Hardening — Redis AUTH, TLS, key rotation, webhook security
- Monitoring and Alerting — metrics and alerting setup
- Server Configuration Reference — all configuration properties