LLM Troubleshooting Guides
Diagnostic guides for common production issues with LLM applications and AI agents. Each page covers what the error means, why it happens, how to fix it now, and how to prevent the class of problem from recurring.
These pages are tactical: start with the symptom, diagnose the failure mode, resolve the incident, then add runtime controls so the same failure class cannot recur.
Provider rate limits
- OpenAI 429 Too Many Requests — TPM and RPM limits, tier-based quotas, and how to handle them under load
- Anthropic API rate limit errors — input/output token-per-minute limits, 529 overload, and graceful degradation
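The common thread in both guides is retrying 429/529 responses with exponential backoff and jitter, honoring a Retry-After hint when the provider supplies one. A minimal sketch of that pattern, using a stand-in `RateLimitError` type (real SDKs raise their own exception classes) and illustrative delay values:

```python
import random
import time


class RateLimitError(Exception):
    """Stand-in for a provider 429/529 error; real SDKs raise their own types."""

    def __init__(self, retry_after=None):
        super().__init__("rate limited")
        self.retry_after = retry_after  # seconds, if the provider sent Retry-After


def call_with_backoff(fn, max_retries=5, base_delay=0.5, max_delay=30.0):
    """Retry fn() on rate-limit errors with exponential backoff and full jitter.

    Honors a Retry-After hint when present; otherwise sleeps a random
    amount in [0, min(max_delay, base_delay * 2**attempt)].
    """
    for attempt in range(max_retries):
        try:
            return fn()
        except RateLimitError as err:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the error to the caller
            if err.retry_after is not None:
                delay = err.retry_after
            else:
                delay = random.uniform(0, min(max_delay, base_delay * 2**attempt))
            time.sleep(delay)
```

Full jitter (a uniform draw rather than a fixed doubling) spreads retries from many concurrent clients apart, which is what prevents the synchronized retry storms these guides warn about.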
Cost and budget incidents
- Debugging sudden LLM cost spikes — agent loops, prompt regressions, model upgrades, retry storms, and tenant leakage
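Several of those failure modes (agent loops, retry storms, tenant leakage) share one mitigation: a hard per-tenant spend cap enforced before each call, not discovered on the invoice. A minimal sketch, with hypothetical class names and illustrative per-token prices (real prices come from your provider's pricing page):

```python
from collections import defaultdict


class BudgetExceeded(Exception):
    """Raised when a call would push a tenant past its hard spend cap."""


class TenantBudget:
    """Track estimated spend per tenant and refuse calls past a hard cap."""

    def __init__(self, cap_usd):
        self.cap_usd = cap_usd
        self.spent = defaultdict(float)  # tenant id -> estimated USD spent

    def charge(self, tenant, input_tokens, output_tokens,
               usd_per_1k_in=0.003, usd_per_1k_out=0.015):
        """Estimate the cost of a call and record it, or refuse it entirely."""
        cost = (input_tokens / 1000 * usd_per_1k_in
                + output_tokens / 1000 * usd_per_1k_out)
        if self.spent[tenant] + cost > self.cap_usd:
            raise BudgetExceeded(
                f"{tenant} would exceed ${self.cap_usd:.2f} cap")
        self.spent[tenant] += cost
        return cost
```

Checking the cap before the call means a looping agent or a leaking tenant stops at the budget boundary rather than running until someone notices the spike.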
Related
- Cycles for Cost Control — what runtime budget enforcement actually prevents
- Incident Patterns — full incident-pattern catalog for production AI systems
- Cycles vs Rate Limiting — why provider rate limits do not prevent cost overruns