Production Operations Guide
This guide covers what you need to run Cycles reliably in production. It assumes you've already deployed the stack per Deploy the Full Stack and are preparing for production traffic.
INFO
Cycles stores all state in Redis. Redis availability directly determines Cycles availability. Plan your Redis deployment accordingly.
Operations UI for incident response
For incident-response workflows — freeze a runaway budget, suspend a tenant, force-release hung reservations, replay missed webhooks, revoke a leaked API key — deploy the Cycles Admin Dashboard. It's a Vue 3 SPA with one-click actions (capability-gated, with confirmations and blast-radius summaries) that's typically faster than hand-writing curl commands during a live incident. Pair it with the Prometheus alerting in Monitoring and Alerting: alerts page you, the dashboard helps you act.
Redis configuration for production
Always configure Redis authentication in production
Set REDIS_PASSWORD and provide it to all Cycles services. An unauthenticated Redis instance is a critical security vulnerability — anyone with network access can read budget state, modify reservations, and extract API keys. See Security Hardening — Redis Authentication for complete setup including TLS and ACLs.
Persistence
Enable both RDB snapshots and AOF append-only logging:
```
# redis.conf
save 900 1              # Snapshot every 15 min if at least 1 key changed
save 300 10             # Snapshot every 5 min if at least 10 keys changed
appendonly yes          # Enable AOF
appendfsync everysec    # Fsync once per second (good balance of safety and performance)
```
In Docker Compose:
```yaml
redis:
  image: redis:7-alpine
  command: redis-server --appendonly yes --save "900 1" --save "300 10"
  volumes:
    - redis-data:/data
```
Memory management
Set a max memory limit and eviction policy:
```
maxmemory 2gb
maxmemory-policy noeviction   # IMPORTANT: never evict budget data
```
Always use `noeviction`. Evicting budget keys silently loses budget state. It is better for Redis to reject writes (causing reservation failures that can be retried) than to silently drop data.
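To verify what a running instance is actually using (a quick sanity check with standard `redis-cli` commands; assumes AUTH is configured):
```bash
# Confirm the memory limit and eviction policy in effect
redis-cli -a "$REDIS_PASSWORD" CONFIG GET maxmemory
redis-cli -a "$REDIS_PASSWORD" CONFIG GET maxmemory-policy   # should print "noeviction"
```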
High availability
For production, consider:
- Redis Sentinel — automatic failover with a primary + replica setup. Good for most deployments.
- Redis Cluster — sharded across multiple nodes. Required for very large deployments.
Cycles uses Lua scripts for atomic operations. All keys for a single reservation operation are in the same Redis keyspace, so single-instance and Sentinel setups work out of the box. For Redis Cluster, ensure the key prefix strategy keeps related keys on the same shard.
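Hash tags (the `{...}` portion of a key) are the standard Redis Cluster mechanism for pinning related keys to one slot. The key names below are illustrative, not Cycles' actual schema; check the real prefix strategy before relying on this:
```bash
# With a hash tag, only the text inside {...} is hashed, so both keys land in
# the same slot. Key names here are hypothetical examples, not Cycles' schema.
redis-cli CLUSTER KEYSLOT "cycles:{acme-corp}:budget"
redis-cli CLUSTER KEYSLOT "cycles:{acme-corp}:reservation:r-123"
# Both commands print the same slot number.
```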
Backup strategy
- Automated RDB snapshots stored offsite (S3, GCS, etc.) — see the sketch after this list
- AOF backups for point-in-time recovery
- Test restores regularly — untested backups are not backups
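A minimal offsite-backup sketch, assuming the default RDB path of /var/lib/redis/dump.rdb and an S3 bucket of your choosing; adjust paths, credentials, and retention to your environment:
```bash
#!/usr/bin/env bash
# Trigger a snapshot, wait for it to finish, then ship it offsite.
# Paths, bucket name, and schedule are assumptions; adapt to your setup.
set -euo pipefail

last=$(redis-cli -a "$REDIS_PASSWORD" LASTSAVE)
redis-cli -a "$REDIS_PASSWORD" BGSAVE
# LASTSAVE advances once the background save completes
while [ "$(redis-cli -a "$REDIS_PASSWORD" LASTSAVE)" = "$last" ]; do sleep 1; done

aws s3 cp /var/lib/redis/dump.rdb "s3://YOUR_BACKUP_BUCKET/redis/dump-$(date +%F).rdb"
```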
Cycles Server configuration
Running multiple instances
The Cycles Server is stateless. You can run multiple instances behind a load balancer:
```yaml
cycles-server-1:
  image: ghcr.io/runcycles/cycles-server:0.1.25.17
  environment:
    REDIS_HOST: redis-primary
    REDIS_PORT: 6379
    REDIS_PASSWORD: ${REDIS_PASSWORD}
cycles-server-2:
  image: ghcr.io/runcycles/cycles-server:0.1.25.17
  environment:
    REDIS_HOST: redis-primary
    REDIS_PORT: 6379
    REDIS_PASSWORD: ${REDIS_PASSWORD}
```
Any load-balancing strategy works (round-robin, least-connections). No sticky sessions are required.
Health checks
All three services expose Spring Boot Actuator health at `/actuator/health`. Only the Admin Server additionally enables Spring's dedicated Kubernetes liveness/readiness probes (`management.endpoint.health.probes.enabled=true`); the runtime and events services expose only the aggregate endpoint.
```bash
# Cycles Server (aggregate only)
curl http://localhost:7878/actuator/health

# Admin Server (aggregate + Kubernetes probes)
curl http://localhost:7979/actuator/health
curl http://localhost:7979/actuator/health/liveness
curl http://localhost:7979/actuator/health/readiness

# Events Service (aggregate only)
curl http://localhost:7980/actuator/health
```
Configure your load balancer or orchestrator to check these endpoints. On Kubernetes, wire the Admin Server's liveness/readiness probes to `/actuator/health/liveness` and `/actuator/health/readiness`; for the runtime and events services, probe `/actuator/health` directly. All three services rely on Spring Boot's default Redis health indicator — the aggregate `/actuator/health` status turns DOWN when Redis is unreachable. There is no custom queue-consumption health check on the Events Service today; for backlog monitoring, watch `LLEN dispatch:pending` (see Monitoring and Alerting).
JVM tuning
The default JVM settings work for most deployments. For high-throughput environments:
```bash
JAVA_OPTS="-Xms512m -Xmx1g -XX:+UseG1GC"
```
Reservation expiry
The server runs a background sweep to expire stale reservations:
```yaml
cycles:
  expiry:
    interval-ms: 5000   # Default: sweep every 5 seconds
```
Reduce the interval for tighter TTL enforcement; increase it to reduce Redis load when TTL precision is not critical.
For listing and recovering stale or orphaned reservations after client crashes, see Reservation Recovery and Listing.
Events Service configuration
The Cycles Events Service (cycles-server-events, port 7980) delivers webhook notifications asynchronously. It is optional — if not deployed, admin and runtime servers continue operating normally, and events accumulate in Redis with TTL until the service starts.
Configuration
| Variable | Default | Description |
|---|---|---|
| `WEBHOOK_SECRET_ENCRYPTION_KEY` | (empty) | AES-256-GCM key for signing-secret encryption. Base64, 32 bytes. Must be the same across all services. Generate with `openssl rand -base64 32`. |
| `EVENT_TTL_DAYS` | 90 | Redis TTL for event records. |
| `DELIVERY_TTL_DAYS` | 14 | Redis TTL for delivery records. |
| `MAX_DELIVERY_AGE_MS` | 86400000 | Stale deliveries auto-fail after this age (24 h default). |
| `dispatch.retry.poll-interval-ms` | 5000 | How often the retry scheduler scans for ready-to-retry deliveries. |
| `dispatch.retry.batch-size` | 100 | Max deliveries processed per retry-scan tick. |
| `dispatch.http.timeout-seconds` | 30 | HTTP request timeout per delivery attempt. |
| `dispatch.http.connect-timeout-seconds` | 5 | HTTP connect timeout per delivery attempt. |
The per-subscription retry policy (exponential backoff) defaults to `max_retries=5`, `initial_delay_ms=1000`, `backoff_multiplier=2.0`, `max_delay_ms=60000`; with those defaults the retry delays run 1 s, 2 s, 4 s, 8 s, 16 s, well under the 60 s cap. A delivery older than `MAX_DELIVERY_AGE_MS` is failed immediately without further retries. See the Events Service section in the Server Configuration Reference for the full knob list.
Running multiple instances
The Events Service is safe to run as multiple instances. Each instance consumes from the dispatch:pending Redis queue via BRPOP, which is atomic — each delivery job is processed by exactly one instance.
```yaml
cycles-events-1:
  image: ghcr.io/runcycles/cycles-server-events:0.1.25.10
  environment:
    REDIS_HOST: redis-primary
    REDIS_PORT: 6379
    REDIS_PASSWORD: ${REDIS_PASSWORD}
    WEBHOOK_SECRET_ENCRYPTION_KEY: ${WEBHOOK_SECRET_ENCRYPTION_KEY}
cycles-events-2:
  image: ghcr.io/runcycles/cycles-server-events:0.1.25.10
  environment:
    REDIS_HOST: redis-primary
    REDIS_PORT: 6379
    REDIS_PASSWORD: ${REDIS_PASSWORD}
    WEBHOOK_SECRET_ENCRYPTION_KEY: ${WEBHOOK_SECRET_ENCRYPTION_KEY}
```
Events Service down
If the Events Service is unavailable:
- Admin and runtime servers are unaffected — event dispatch is fire-and-forget
- Events accumulate in Redis with their TTLs (90 days for events, 14 days for deliveries)
- On restart, deliveries older than `MAX_DELIVERY_AGE_MS` (default 24 h) auto-fail; fresher ones deliver normally
Network architecture
Recommended topology
Place the Cycles Server behind a load balancer reachable by your applications; keep the Admin Server, Events Service, and Redis on an internal network with no public ingress.
Network isolation
- Cycles Server (port 7878): Accessible to your application. Can be on an internal network or behind an API gateway.
- Admin Server (port 7979): Internal access only. This manages tenants, API keys, and budgets. Never expose to the public internet.
- Events Service (port 7980): Internal access only. Consumes from Redis and delivers webhooks outbound. Never needs inbound traffic from applications.
- Redis (port 6379): Internal access only. Never expose directly. (Example firewall rules follow this list.)
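As one way to enforce this isolation at the host level, here is a sketch using ufw; the CIDRs are placeholders for your application and ops networks, and cloud security groups would express the same rules:
```bash
# Hypothetical host-firewall rules matching the isolation rules above (ufw).
# 10.0.0.0/16 = application network, 10.1.0.0/24 = ops network (placeholders).
ufw allow from 10.0.0.0/16 to any port 7878 proto tcp   # apps -> Cycles Server
ufw allow from 10.1.0.0/24 to any port 7979 proto tcp   # ops  -> Admin Server
ufw deny 7979/tcp   # Admin Server: no other inbound access
ufw deny 7980/tcp   # Events Service: no inbound traffic needed
ufw deny 6379/tcp   # Redis: internal only
```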
TLS termination
Terminate TLS at the load balancer or API gateway. The Cycles Server itself runs plain HTTP. Example with nginx:
```nginx
server {
    listen 443 ssl;
    server_name cycles.internal.example.com;

    ssl_certificate     /etc/ssl/certs/cycles.crt;
    ssl_certificate_key /etc/ssl/private/cycles.key;

    location / {
        proxy_pass http://cycles-server:7878;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}
```
Capacity planning
Rules of thumb
- Redis memory: ~1 KB per active reservation, ~500 bytes per budget ledger. 1 GB of Redis memory supports roughly 500K concurrent reservations.
- Server CPU: Each reservation involves 1 Redis Lua script execution (~1ms). A single server instance can handle thousands of reservations per second.
- Latency: Expect <5 ms for reservation operations on a well-configured setup (server co-located with Redis). The checks below show how to measure memory headroom.
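To check live consumption against these estimates (standard `redis-cli` commands; the key pattern is an assumption, not Cycles' documented schema):
```bash
# Compare live memory use against the configured limit
redis-cli -a "$REDIS_PASSWORD" INFO memory | grep -E '^(used_memory_human|maxmemory_human)'

# Rough count of reservation keys, assuming they share a common prefix
# ("cycles:reservation:*" is a hypothetical pattern; verify the real one)
redis-cli -a "$REDIS_PASSWORD" --scan --pattern 'cycles:reservation:*' | wc -l
```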
Scaling triggers
Add more Cycles Server instances when:
- Response latency exceeds 50ms at p99
- CPU utilization exceeds 70%
Add more Events Service instances when:
- The `dispatch:pending` queue depth grows consistently (`redis-cli LLEN dispatch:pending`)
- Webhook delivery latency exceeds acceptable thresholds
- Multiple instances are safe — BRPOP is atomic, so each delivery is processed exactly once
Scale Redis when:
- Memory utilization exceeds 80%
- Command latency exceeds 5 ms (see the sampler below)
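redis-cli has a built-in latency sampler that is handy for checking the 5 ms threshold from a host near the servers:
```bash
# Continuously sample round-trip command latency (Ctrl-C to stop)
redis-cli -h redis-primary -a "$REDIS_PASSWORD" --latency

# Or aggregate into 15-second buckets for a longer observation window
redis-cli -h redis-primary -a "$REDIS_PASSWORD" --latency-history
```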
Upgrade procedures
Rolling upgrade
All three services (Cycles Server, Admin Server, Events Service) are stateless — all state lives in Redis. You can do rolling upgrades with zero downtime:
- Pull the new image: `docker pull ghcr.io/runcycles/cycles-server:NEW_VERSION`
- Stop one instance at a time
- Start the new version
- Verify the health check passes (`/actuator/health` on ports 7878, 7979, 7980)
- Repeat for the remaining instances (a scripted version follows)
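Scripted, the loop might look like this. It is a sketch for a Compose deployment; the service names come from the examples above, and the single shared health port is an assumption (adjust per-instance ports to your setup):
```bash
#!/usr/bin/env bash
# Rolling-upgrade sketch: recreate one instance at a time, gating on health.
# Assumes you have already bumped the image tag in docker-compose.yml.
set -euo pipefail

docker pull "ghcr.io/runcycles/cycles-server:${NEW_VERSION:?set target version}"
for svc in cycles-server-1 cycles-server-2; do
  docker compose up -d --no-deps "$svc"   # recreate this instance on the new image
  until curl -fsS http://localhost:7878/actuator/health >/dev/null; do
    sleep 2   # wait for the instance to report UP before moving on
  done
  echo "$svc healthy, continuing"
done
```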
The Events Service can be upgraded independently. While it is down, webhook deliveries queue in Redis and are processed when the new version starts.
Version compatibility
The Cycles protocol is versioned (/v1). Minor version upgrades (e.g., 0.1.23 → 0.1.24) are backward-compatible. Check the changelog for breaking changes before major upgrades.
Rollback
If an upgrade causes issues:
- Stop the new version
- Start the previous version
- Redis state is compatible across minor versions
Logging
Log levels
Configure via Spring Boot:
```yaml
logging:
  level:
    io.runcycles: INFO          # Application logs
    org.springframework: WARN   # Framework logs
```
Set `io.runcycles: DEBUG` for troubleshooting (includes full request/response logging).
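Because these are standard Spring Boot properties, you can also flip levels per container via environment variables (Spring's relaxed binding), without editing YAML:
```bash
# LOGGING_LEVEL_IO_RUNCYCLES maps to logging.level.io.runcycles
docker run -e LOGGING_LEVEL_IO_RUNCYCLES=DEBUG \
  ghcr.io/runcycles/cycles-server:0.1.25.17
```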
Structured logging
Add JSON logging for log aggregation systems:
```yaml
logging:
  pattern:
    console: '{"timestamp":"%d","level":"%p","logger":"%c","message":"%m"}%n'
```
Or use the Spring Boot JSON logging starter for fully structured output.
Operational runbooks
Budget exhaustion alert
Symptom: Applications report BUDGET_EXCEEDED errors.
Response:
- Check which scope is exhausted: `GET /v1/balances?tenant=...` (see the example after this list)
- Determine whether this is expected (legitimate traffic) or unexpected (a runaway agent)
- If expected: fund the budget via the admin API (`POST .../fund` with `CREDIT`)
- If unexpected: check active reservations for anomalies (`GET /v1/reservations?status=ACTIVE`)
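Concretely, step 1 might look like this (the endpoint is the one from the step above; the tenant name is an example):
```bash
# Inspect balances for one tenant to find the exhausted scope
curl -s "http://localhost:7878/v1/balances?tenant=acme-corp" \
  -H "X-Cycles-API-Key: $API_KEY" | jq
```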
Reservation leak
Symptom: Budget reserved amount grows but spent stays flat. Reservations are being created but never committed or released.
Response:
- List active reservations: `GET /v1/reservations?status=ACTIVE`
- Check for reservations past their expected TTL (a sample query follows this list)
- The expiry sweep should eventually clean these up; if it is not running, check the server logs
- Investigate the client application — it may be failing to commit or release
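A quick way to surface long-lived reservations, assuming ISO-8601 `created_at` timestamps as in the reconciliation example below (GNU `date` syntax shown; on macOS use `date -u -v-10M`):
```bash
# List ACTIVE reservations created more than 10 minutes ago
cutoff=$(date -u -d '10 minutes ago' +%Y-%m-%dT%H:%M:%SZ)
curl -s "http://localhost:7878/v1/reservations?tenant=acme-corp&status=ACTIVE" \
  -H "X-Cycles-API-Key: $API_KEY" |
  jq --arg cutoff "$cutoff" \
     '.reservations[] | select(.created_at < $cutoff) | {reservation_id, scope_path, created_at}'
```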
Commit failure after successful LLM call
Symptom: An LLM call (or other side-effecting action) completes successfully, but the subsequent commit to Cycles fails. The work happened and incurred real cost, but the budget ledger does not reflect it.
Why this happens:
- Transient network error between client and Cycles Server
- Cycles Server restart or Redis outage at commit time
- Client process crash after the LLM call but before commit
What the retry engine does:
All three clients (Python, TypeScript, Spring Boot) include a commit retry engine enabled by default. When a commit fails with a transport error or 5xx response, the engine retries with exponential backoff (default: 5 attempts over ~30 seconds). This handles most transient failures automatically.
When retry is not enough:
If all retries are exhausted or the client process crashes entirely, the reservation remains in ACTIVE state until it expires (based on TTL + grace period). After expiry, the reserved budget is returned to the pool. The actual cost is unaccounted for — the budget appears more available than it really is.
Response:
- Check for expired reservations that were never committed:

  ```bash
  curl -s "http://localhost:7878/v1/reservations?tenant=acme-corp&status=EXPIRED" \
    -H "X-Cycles-API-Key: $API_KEY" |
    jq '.reservations[] | {reservation_id, scope_path, estimate: .estimate.amount, created_at, expired_at}'
  ```

- Reconcile using events: for each expired reservation that represents real work, record the actual cost as a standalone event:

  ```bash
  curl -s -X POST http://localhost:7878/v1/events \
    -H "Content-Type: application/json" \
    -H "X-Cycles-API-Key: $API_KEY" \
    -d '{
      "idempotency_key": "reconcile-<reservation_id>",
      "subject": { "tenant": "acme-corp" },
      "action": { "kind": "reconciliation", "name": "commit-failure-recovery" },
      "actual": { "unit": "USD_MICROCENTS", "amount": <actual_cost> },
      "overage_policy": "ALLOW_WITH_OVERDRAFT",
      "metadata": { "original_reservation_id": "<reservation_id>" }
    }'
  ```

- Monitor commit failure rates. A sustained increase in commit failures signals infrastructure issues; track the ratio of committed vs. expired reservations.
Prevention:
- Keep retry enabled (default) with aggressive settings for critical workloads
- Use the `ALLOW_WITH_OVERDRAFT` overage policy for must-record actions so reconciliation events are always accepted
- Ensure client processes have graceful shutdown hooks that commit or release active reservations
- Set up alerts on the expired-reservation count (see Monitoring and Alerting); a simple probe follows
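For the last point, a probe an alerting script could poll (same endpoint as step 1 of the response above):
```bash
# Emit the expired-reservation count for a tenant; alert when it rises
curl -s "http://localhost:7878/v1/reservations?tenant=acme-corp&status=EXPIRED" \
  -H "X-Cycles-API-Key: $API_KEY" | jq '.reservations | length'
```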
Redis connection loss
Symptom: All reservation operations fail with 500 errors. Events Service also stops processing deliveries.
Response:
- Check Redis connectivity: `redis-cli ping`
- Check server logs for connection errors on all three services (ports 7878, 7979, 7980)
- Restart services if Redis connection pool is exhausted
- Active reservations with remaining TTL are preserved in Redis and will resume when connectivity returns
- Queued webhook deliveries resume automatically when the Events Service reconnects
Webhook delivery failures
Symptom: Webhook endpoints are not receiving events. Queue depth grows.
Response:
- Check Events Service health: `GET http://localhost:7980/actuator/health`
- Check queue depth: `redis-cli LLEN dispatch:pending`
- Check whether the subscription was auto-disabled: `GET /v1/admin/webhooks/{subscription_id}`
- Re-enable if needed: `PATCH /v1/admin/webhooks/{subscription_id}` with `{"status": "ACTIVE"}` (example after this list)
- Verify `WEBHOOK_SECRET_ENCRYPTION_KEY` matches across all services
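The re-enable step as a curl command, assuming the admin webhook endpoints are served by the Admin Server on port 7979 and use the same API-key header as the other examples; `$SUBSCRIPTION_ID` is a placeholder:
```bash
# Re-enable an auto-disabled webhook subscription
curl -s -X PATCH "http://localhost:7979/v1/admin/webhooks/$SUBSCRIPTION_ID" \
  -H "Content-Type: application/json" \
  -H "X-Cycles-API-Key: $ADMIN_API_KEY" \
  -d '{"status": "ACTIVE"}'
```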
Next steps
- Webhook Integrations — PagerDuty, Slack, ServiceNow examples
- Client Performance Tuning — timeout, retry, and connection pool optimization
- Security Hardening — Redis AUTH, TLS, key rotation, webhook security
- Monitoring and Alerting — metrics and alerting setup
- Server Configuration Reference — all configuration properties