LLM Troubleshooting Guides
Diagnostic guides for common production issues with LLM applications and AI agents. Each page covers what the error means, why it happens, how to fix it now, and how to prevent the class of problem from recurring.
These pages are tactical: start with the symptom, diagnose the failure mode, resolve the incident, then add runtime controls so the same failure class cannot recur.
Provider rate limits
- OpenAI 429 Too Many Requests — TPM and RPM limits, tier-based quotas, and how to handle them under load
- Anthropic API rate limit errors — input/output token-per-minute limits, 529 overload, and graceful degradation
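The common thread in both guides is retrying 429/529 responses with exponential backoff and jitter, honoring a Retry-After hint when the provider supplies one. A minimal sketch of that pattern, using a stand-in `RateLimitError` type (real SDKs raise their own exception classes) and illustrative delay values:

```python
import random
import time


class RateLimitError(Exception):
    """Stand-in for a provider 429/529 error; real SDKs raise their own types."""

    def __init__(self, retry_after=None):
        super().__init__("rate limited")
        self.retry_after = retry_after  # seconds, if the provider sent Retry-After


def call_with_backoff(fn, max_retries=5, base_delay=0.5, max_delay=30.0):
    """Retry fn() on rate-limit errors with exponential backoff and full jitter.

    Honors a Retry-After hint when present; otherwise sleeps a random
    amount in [0, min(max_delay, base_delay * 2**attempt)].
    """
    for attempt in range(max_retries):
        try:
            return fn()
        except RateLimitError as err:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the error to the caller
            if err.retry_after is not None:
                delay = err.retry_after
            else:
                delay = random.uniform(0, min(max_delay, base_delay * 2**attempt))
            time.sleep(delay)
```

Full jitter (a uniform draw rather than a fixed doubling) spreads retries from many concurrent clients apart, which is what prevents the synchronized retry storms these guides warn about.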
Cost and budget incidents
- Debugging sudden LLM cost spikes — agent loops, prompt regressions, model upgrades, retry storms, and tenant leakage
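Several of those failure modes (agent loops, retry storms, tenant leakage) share one mitigation: a hard per-tenant spend cap enforced before each call, not discovered on the invoice. A minimal sketch, with hypothetical class names and illustrative per-token prices (real prices come from your provider's pricing page):

```python
from collections import defaultdict


class BudgetExceeded(Exception):
    """Raised when a call would push a tenant past its hard spend cap."""


class TenantBudget:
    """Track estimated spend per tenant and refuse calls past a hard cap."""

    def __init__(self, cap_usd):
        self.cap_usd = cap_usd
        self.spent = defaultdict(float)  # tenant id -> estimated USD spent

    def charge(self, tenant, input_tokens, output_tokens,
               usd_per_1k_in=0.003, usd_per_1k_out=0.015):
        """Estimate the cost of a call and record it, or refuse it entirely."""
        cost = (input_tokens / 1000 * usd_per_1k_in
                + output_tokens / 1000 * usd_per_1k_out)
        if self.spent[tenant] + cost > self.cap_usd:
            raise BudgetExceeded(
                f"{tenant} would exceed ${self.cap_usd:.2f} cap")
        self.spent[tenant] += cost
        return cost
```

Checking the cap before the call means a looping agent or a leaking tenant stops at the budget boundary rather than running until someone notices the spike.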
Related
- Cycles for Cost Control — what runtime budget enforcement actually prevents
- Incident Patterns — full incident-pattern catalog for production AI systems
- Cycles vs Rate Limiting — why provider rate limits do not prevent cost overruns