Quick Start
How It Works
The classification system converts every LLM exception into a structured FailoverDecision that tells retry logic exactly what to do.
Error Categories
All LLM failures are classified into these 11 typed categories:

| Kind | Triggers (examples) | Default Action | Retryable |
|---|---|---|---|
| auth | unauthorized, api key, authentication failed | rotate_profile (if failover enabled) | Yes |
| auth_permanent | invalid api key, incorrect api key | surface_error | No |
| rate_limit | rate limit, 429, resource_exhausted | retry (with parsed/exponential backoff, max 60s) | Yes |
| overloaded | 503, 502, 500, service unavailable | retry (2s→4s→8s, capped at 30s) | Yes |
| context_overflow | maximum context length, context window is too long | surface_error | No |
| idle_timeout | timeout, timed out, deadline exceeded | retry until breaker hits 3, then surface_error | Yes (until breaker) |
| billing | insufficient quota, quota exceeded, payment required | surface_error | No |
| model_not_found | model not found, unknown model | surface_error | No |
| empty_response | empty response, json decode error | retry (limited) | Limited |
| format_error | validation error, invalid json, schema error | surface_error | No |
| unknown | anything else | retry for attempt ≤ 2, then surface_error | Limited |
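The trigger column can be read as an ordered keyword lookup. The following is a minimal sketch of that idea, not the library's actual classifier: `TRIGGERS` and `classify_message` are hypothetical names, the substrings are only the examples listed in the table, and checking the more specific kinds first (so that "invalid api key" is not swallowed by the broader "api key" trigger) is an assumption.

```python
# Hypothetical keyword table built from the "Triggers (examples)" column.
# Order matters: more specific kinds are checked before broader ones.
TRIGGERS = [
    ("auth_permanent", ("invalid api key", "incorrect api key")),
    ("auth", ("unauthorized", "api key", "authentication failed")),
    ("rate_limit", ("rate limit", "429", "resource_exhausted")),
    ("overloaded", ("503", "502", "500", "service unavailable")),
    ("context_overflow", ("maximum context length", "context window is too long")),
    ("idle_timeout", ("timeout", "timed out", "deadline exceeded")),
    ("billing", ("insufficient quota", "quota exceeded", "payment required")),
    ("model_not_found", ("model not found", "unknown model")),
    ("empty_response", ("empty response", "json decode error")),
    ("format_error", ("validation error", "invalid json", "schema error")),
]


def classify_message(message: str) -> str:
    """Return the first kind whose trigger substring appears in the message."""
    text = message.lower()
    for kind, needles in TRIGGERS:
        if any(needle in text for needle in needles):
            return kind
    # Nothing matched: fall back to the catch-all category.
    return "unknown"
```

For example, `classify_message("Invalid API key provided")` yields `"auth_permanent"`, while a message that matches no trigger falls through to `"unknown"`.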
FailoverDecision Structure
Every classification produces a FailoverDecision with these fields:

| Field | Type | Description |
|---|---|---|
| action | "retry" \| "rotate_profile" \| "surface_error" | What action to take |
| reason | AgentErrorKind | The classified error type |
| backoff_ms | int | Milliseconds to wait before retry (0 = immediate) |
| is_retryable | bool | Whether this error is worth retrying |
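To make the field list concrete, here is a rough sketch that models the decision as a frozen dataclass and fills it in for a few rows of the category table. Representing FailoverDecision as a dataclass, typing `reason` as a plain string, and the helpers `overloaded_backoff_ms` and `decide` are all assumptions made for illustration; the zero backoff for `unknown` and the 1-based `attempt` counter are likewise assumed, since the page does not state them.

```python
from dataclasses import dataclass
from typing import Literal

Action = Literal["retry", "rotate_profile", "surface_error"]


@dataclass(frozen=True)
class FailoverDecision:
    action: Action
    reason: str        # an AgentErrorKind value such as "overloaded"
    backoff_ms: int    # 0 means retry immediately
    is_retryable: bool


# Kinds the category table marks as "No" in the Retryable column.
NON_RETRYABLE = frozenset(
    {"auth_permanent", "context_overflow", "billing", "model_not_found", "format_error"}
)


def overloaded_backoff_ms(attempt: int) -> int:
    """2s, 4s, 8s, ... capped at 30s; attempt is 1-based (assumption)."""
    return min(2000 * 2 ** (attempt - 1), 30_000)


def decide(kind: str, attempt: int) -> FailoverDecision:
    """Hypothetical decision helper covering only part of the table."""
    if kind in NON_RETRYABLE:
        return FailoverDecision("surface_error", kind, 0, False)
    if kind == "overloaded":
        return FailoverDecision("retry", kind, overloaded_backoff_ms(attempt), True)
    if kind == "unknown":
        if attempt <= 2:
            return FailoverDecision("retry", kind, 0, True)
        return FailoverDecision("surface_error", kind, 0, False)
    raise ValueError(f"kind {kind!r} is not covered by this sketch")
```

Retry logic can then branch on `decision.action` and sleep for `decision.backoff_ms` instead of inspecting exception text.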
Idle-Timeout Circuit Breaker
The idle-timeout circuit breaker is separate from the per-tool circuit breaker. It protects against LLM provider stalls:
- Default: Stops after 3 consecutive idle_timeout failures
- Auto-resets: On any successful LLM call
- Only triggered by: idle_timeout error kind (not other timeouts)
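The three rules above can be sketched as a small counter. This is an illustrative model only: the class name `IdleTimeoutBreaker` and its methods are hypothetical, and leaving the counter unchanged when a different error kind occurs is an assumption, since the page only says that a successful call resets it.

```python
class IdleTimeoutBreaker:
    """Hypothetical model of the idle-timeout circuit breaker."""

    def __init__(self, max_consecutive: int = 3) -> None:
        self.max_consecutive = max_consecutive
        self.consecutive = 0

    @property
    def is_open(self) -> bool:
        # Open means: stop retrying and surface the error.
        return self.consecutive >= self.max_consecutive

    def record_failure(self, kind: str) -> bool:
        """Count the failure if it is an idle_timeout; return is_open."""
        if kind == "idle_timeout":
            self.consecutive += 1
        # Other kinds leave the count untouched (assumption).
        return self.is_open

    def record_success(self) -> None:
        """Any successful LLM call resets the breaker."""
        self.consecutive = 0
```

A slower self-hosted model could be given more headroom with something like `IdleTimeoutBreaker(max_consecutive=5)`, where 5 is an arbitrary example value.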
Choosing Between Options
Common Patterns
Log Every Classified Failure
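One possible shape for this pattern, using only the standard `logging` module. The logger name `llm.failover` and the helper `log_decision` are hypothetical; the function assumes it receives an object exposing the four FailoverDecision fields described on this page.

```python
import logging

logger = logging.getLogger("llm.failover")  # hypothetical logger name


def log_decision(decision, attempt: int) -> str:
    """Emit one structured line per classified failure and return it."""
    line = (
        f"kind={decision.reason} action={decision.action} "
        f"backoff_ms={decision.backoff_ms} retryable={decision.is_retryable} "
        f"attempt={attempt}"
    )
    logger.warning(line)
    return line
```

Keeping the line in `key=value` form makes it straightforward to filter by `kind` in a log aggregator.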
Custom Breaker for Slow Models
Gate Alerts by Error Type
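A simple way to gate alerts is to notify only on kinds that retrying cannot fix. In the sketch below, `ALERT_KINDS` and `should_alert` are hypothetical names, and the chosen set is an illustrative selection from the kinds the category table marks as not retryable.

```python
# Illustrative selection: kinds that point at configuration rather than
# transient provider trouble. Adjust the set to your own alerting policy.
ALERT_KINDS = frozenset({"auth_permanent", "model_not_found", "context_overflow"})


def should_alert(kind: str) -> bool:
    """Return True when a classified failure should raise an alert."""
    return kind in ALERT_KINDS
```

Transient kinds such as `rate_limit` or `overloaded` then stay in the logs without paging anyone.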
Legacy Migration
The old error_category string values still work but emit a DeprecationWarning. Update to the new typed categories for cleaner code.

| Old error_category | New AgentErrorKind | Migration |
|---|---|---|
| "tool" | "unknown" | Update error classification logic |
| "llm" | "unknown" | More specific classification available |
| "budget" | "billing" | Direct replacement |
| "validation" | "format_error" | Direct replacement |
| "network" | "unknown" | Use specific network error kinds |
| "handoff" | "unknown" | Agent handoff errors are separate |
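For codebases that still pass legacy strings around, the mapping table can be applied mechanically. The sketch below is a hypothetical shim, not part of the library: `LEGACY_TO_KIND` and `migrate_category` are illustrative names, and the warning text is invented.

```python
import warnings

# Straight transcription of the migration table above.
LEGACY_TO_KIND = {
    "tool": "unknown",
    "llm": "unknown",
    "budget": "billing",
    "validation": "format_error",
    "network": "unknown",
    "handoff": "unknown",
}


def migrate_category(value: str) -> str:
    """Map a legacy error_category string to its new kind.

    Values that are not legacy categories are returned unchanged.
    """
    if value in LEGACY_TO_KIND:
        new_kind = LEGACY_TO_KIND[value]
        warnings.warn(
            f"error_category {value!r} is deprecated; use {new_kind!r}",
            DeprecationWarning,
            stacklevel=2,
        )
        return new_kind
    return value
```

Because four of the six legacy values collapse to `"unknown"`, a shim like this is best treated as a stopgap while call sites are updated to the more specific kinds.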
Best Practices
Treat Permanent Errors as Config Issues
Errors classified as auth_permanent, model_not_found, and context_overflow indicate configuration problems, not transient failures. Set up monitoring to catch these during development.
Tune Circuit Breaker by Model Speed
Fast cloud models can use the default max_consecutive=3. Slow self-hosted models should increase this to avoid premature circuit breaking.
Use Typed Categories Over String Matching
Instead of parsing error messages, use the typed error_category field for reliable error handling.
Pair with Model Failover for Cross-Provider Resilience
Combine error classification with Model Failover to automatically switch providers on auth errors.
Related
Structured LLM Errors
Foundation error handling with LLMError structure
Model Failover
Cross-provider failover with FailoverManager
Tool Circuit Breaker
Per-tool circuit breaking for tool execution
Execution Config
Configure max_iter and other execution parameters

