Prometheus Metrics Reference
This page enumerates every custom metric that Cycles' reference servers expose on their Prometheus endpoints, along with tag definitions, cardinality guidance, and a sample scrape config. For the higher-level observability playbook (alert recipes, SLOs) see Monitoring and Alerting.
Cycles' Micrometer instrumentation uses dotted source names (cycles.*) which Prometheus rewrites to underscores with a _total suffix on scrape. Source names below are the raw Micrometer identifier; the Prometheus column is what you actually query and alert on.
Tenant-tag cardinality flag
Every counter tagged with tenant respects a shared toggle: cycles.metrics.tenant-tag.enabled (env CYCLES_METRICS_TENANT_TAG_ENABLED, default true) in both cycles-server and cycles-server-events. Deployments with many thousands of tenants can flip it to false to drop the per-tenant series and keep Prometheus cardinality bounded; drop it consistently across services so dashboards can share the same tag schema.
Null or blank tag values are normalised to the sentinel UNKNOWN. Missing tags would otherwise collapse series — making it look like traffic moved when the upstream data actually just got sparse.
Scrape targets
| Service | Scrape port | Prometheus path |
|---|---|---|
cycles-server (runtime) | 7878 (same as API) | /actuator/prometheus |
cycles-server-events (dispatcher) | 9980 (dedicated management port, split from API 7980 in v0.1.25.9, env MANAGEMENT_PORT) | /actuator/prometheus |
cycles-server-admin | 7979 (same as API) | /actuator/prometheus |
Events-service port split. Starting with cycles-server-events v0.1.25.9, the health, info, and prometheus actuator endpoints moved from the public API port 7980 to a dedicated management port (default 9980, env MANAGEMENT_PORT). Scrape configs, kubelet probes, and Docker HEALTHCHECK commands must target :9980 — the published Docker image's HEALTHCHECK is already updated. Expose 7980 publicly; keep 9980 internal-only.
Runtime server (cycles-server)
Introduced in v0.1.25.10. All counters live under the cycles.* namespace.
| Source | Prometheus | Type | Tags | Description |
|---|---|---|---|---|
cycles.reservations.reserve | cycles_reservations_reserve_total | Counter | tenant, decision, reason, overage_policy | Every POST /v1/reservations outcome. |
cycles.reservations.commit | cycles_reservations_commit_total | Counter | tenant, decision, reason, overage_policy | Every POST /v1/reservations/{id}/commit outcome. |
cycles.reservations.release | cycles_reservations_release_total | Counter | tenant, actor_type, decision, reason | Every successful release. actor_type distinguishes tenant-driven releases from v0.1.25.8 admin-on-behalf-of releases. |
cycles.reservations.extend | cycles_reservations_extend_total | Counter | tenant, decision, reason | Every POST /v1/reservations/{id}/extend outcome. |
cycles.reservations.expired | cycles_reservations_expired_total | Counter | tenant | Each reservation the expiry sweep actually marks EXPIRED. Skipped reservations (still in grace, already finalised) do not increment. |
cycles.events | cycles_events_total | Counter | tenant, decision, reason, overage_policy | Every POST /v1/events outcome. |
cycles.overdraft.incurred | cycles_overdraft_incurred_total | Counter | tenant | Any commit or event that actually accrued non-zero debt. Unit-free signal — debt amount is tracked by the balance store, not here, to avoid leaking user-value distributions into metrics. |
Tag value reference (runtime)
| Tag | Values |
|---|---|
decision | ALLOW, ALLOW_WITH_CAPS, DENY, EXPIRED, RELEASED, COMMITTED, EXTENDED — the outcome of the decision machinery. |
reason | Spec-defined reason codes (INSUFFICIENT_FUNDS, OVER_LIMIT, POLICY_DENIED, etc.). UNKNOWN when the code path doesn't produce one. |
overage_policy | BLOCK, ALLOW_WITH_DEBT, ALLOW_WITH_CAP — which policy was in effect for the scope. |
actor_type | api_key (tenant-driven) or admin_on_behalf_of (admin-driven, v0.1.25.8+). |
Not instrumented (by design)
- HTTP-layer latency histograms — Spring Boot auto-emits
http.server.requestswithuri/method/statuslabels already. Use those for per-endpoint latency. - Lua-script execution time —
EVALSHAtimings would largely duplicate the HTTP timer for request-synchronous scripts. The expiry-sweep counter (cycles.reservations.expired) covers the one non-request-driven path.
Events service (cycles-server-events)
Introduced in v0.1.25.6. Mirrors the runtime's conventions: cycles.webhook.* rewrites to cycles_webhook_*_total (counters) or cycles_webhook_*_seconds (timer). Unlike the runtime, this service emits an explicit latency timer because it's the HTTP client — Spring's auto http.server.requests doesn't cover its primary I/O surface.
| Source | Prometheus | Type | Tags | Description |
|---|---|---|---|---|
cycles.webhook.delivery.attempts | cycles_webhook_delivery_attempts_total | Counter | tenant, event_type | Every outbound delivery attempt (first attempt + every retry). |
cycles.webhook.delivery.success | cycles_webhook_delivery_success_total | Counter | tenant, event_type, status_code_family | Successful deliveries. status_code_family: 2xx. |
cycles.webhook.delivery.failed | cycles_webhook_delivery_failed_total | Counter | tenant, event_type, reason | Failed deliveries. reason carries the transport-level code (timeout, connection_refused, 4xx, 5xx, etc.). |
cycles.webhook.delivery.retried | cycles_webhook_delivery_retried_total | Counter | tenant, event_type | Deliveries that re-entered the retry queue. |
cycles.webhook.delivery.stale | cycles_webhook_delivery_stale_total | Counter | tenant | Deliveries auto-failed on pickup for exceeding dispatch.max-delivery-age-ms (default 24h). |
cycles.webhook.subscription.auto_disabled | cycles_webhook_subscription_auto_disabled_total | Counter | tenant, reason | Subscriptions auto-disabled after consecutive failures crossed the threshold. Reason is typically consecutive_failures_exceeded_threshold. Always emitted together with a webhook.disabled Event (v0.1.25.11). |
cycles.webhook.delivery.latency | cycles_webhook_delivery_latency_seconds | Timer | tenant, event_type, outcome | Round-trip time on deliveries that actually produced a transport response. outcome: success or failure. Upstream failures (event_not_found, etc.) have no meaningful latency and do not record to this timer. |
cycles.webhook.events.payload.invalid | cycles_webhook_events_payload_invalid_total | Counter | type, rule | Non-fatal shape discrepancy found by EventPayloadValidator on an ingested event. No tenant dimension — the discrepancy is about payload shape, not tenant traffic. rule examples: trace_id_shape, correlation_id_shape, timestamp_shape. |
Tag value reference (events)
| Tag | Values |
|---|---|
event_type | Event kind from the Event Payloads Reference (e.g. reservation.reserved, budget.over_limit, webhook.disabled). Up to ~51 distinct values. |
status_code_family | 2xx (success bucket). Non-2xx responses land on cycles_webhook_delivery_failed_total with reason instead. |
reason (on _failed_total) | timeout, connection_refused, connection_reset, ssl_error, 4xx, 5xx, event_not_found, signing_key_unavailable. |
Admin server (cycles-server-admin)
Exposed since admin observability rollout (v0.1.25.9+). Metric names use the cycles_admin_* prefix.
| Prometheus | Type | Tags | Description |
|---|---|---|---|
cycles_admin_audit_writes_total | Counter | path_class, outcome | Audit-trail write accounting. outcome values: written, error (Redis write failed — alert on nonzero), sampled-out (pre-auth sampling dropped the entry per audit.sample.unauthenticated). path_class groups endpoints for coarse-grained triage. Shipped v0.1.25.20 alongside the audit-on-failure coverage. |
cycles_admin_events_emitted_total | Counter | type, result | Admin-emitted Event accounting. result: success or failure. |
cycles_admin_events_payload_invalid_total | Counter | type, expected_class | Jackson round-trip found an Event payload that didn't match its declared schema. Non-fatal — admin continues to accept the event. |
cycles_admin_webhook_dispatched_total | Counter | result | Enqueue-to-dispatcher accounting. The end-to-end delivery metric is cycles_webhook_delivery_* on the events service. |
Sample scrape config
scrape_configs:
- job_name: cycles-runtime
metrics_path: /actuator/prometheus
static_configs:
- targets: ['cycles-server:7878']
- job_name: cycles-events
metrics_path: /actuator/prometheus
static_configs:
- targets: ['cycles-server-events:9980'] # management port, NOT the API port (7980)
- job_name: cycles-admin
metrics_path: /actuator/prometheus
static_configs:
- targets: ['cycles-server-admin:7979']Cardinality guidance
The tenant tag is the dominant cardinality driver. A deployment with 10,000 tenants and all seven runtime counters produces ~70,000 time series just from the tenant dimension. If Prometheus memory / scrape duration becomes a concern:
- Flip
cycles.metrics.tenant-tag.enabledtofalseon runtime and events. Counters drop thetenanttag; you lose per-tenant drill-downs but keep decision / reason / outcome signals. - Aggregate at scrape time with
metric_relabel_configsto drop the tag selectively on high-cardinality metrics while keeping it on the ones you still want tenant-sliced. - Keep per-tenant on Timer, drop on Counters if delivery-latency-per-tenant is the signal you care about most.
event_type is bounded by the spec (51 values today, additive over time). reason and decision are enum-bounded and safe.
Quick alert recipes
| Signal | Query | Typical threshold |
|---|---|---|
| Audit-write error | sum(rate(cycles_admin_audit_writes_total{outcome="error"}[5m])) > 0 | Any nonzero — audit trail has a gap. |
| Webhook auto-disable rate | sum(rate(cycles_webhook_subscription_auto_disabled_total[15m])) > 0 | Any nonzero — a subscription was just auto-disabled. |
| Overdraft rate spike | sum(rate(cycles_overdraft_incurred_total[5m])) / sum(rate(cycles_reservations_commit_total[5m])) > 0.05 | Over 5% of commits are going into overdraft. |
| Dispatch p95 latency | histogram_quantile(0.95, sum(rate(cycles_webhook_delivery_latency_seconds_bucket{outcome="success"}[5m])) by (le)) | Over 10s — something downstream is struggling. |
| Stale-delivery rate | sum(rate(cycles_webhook_delivery_stale_total[1h])) > 0 | Any nonzero — deliveries are sitting in-queue past dispatch.max-delivery-age-ms. |
For the full set of alerts and SLOs see Monitoring and Alerting and Production Operations.