LLM failures use provider-aware error classification with structured recovery routing, enabling intelligent retry policies, context compression, credential rotation, and model fallback strategies.
LLM errors now also carry typed AgentErrorKind classifications for more precise error handling. See LLM Error Classification for the complete taxonomy and failover decision system.
Quick Start
Simple Agent with Structured Errors
from praisonaiagents import Agent

agent = Agent(
    name="Error Handler",
    instructions="Process user requests with automatic error recovery",
    on_error=lambda error: print(f"Error: {error.context['user_message']}"),
)

result = agent.start("Hello world")
Advanced Error Classification
from praisonaiagents import Agent
from praisonaiagents.llm.error_classifier import classify_llm_error


def handle_structured_error(error):
    """Handle errors with structured classification."""
    category = error.context.get("error_category", "unknown")
    user_msg = error.context.get("user_message", "Unknown error")

    print(f"Category: {category}")
    print(f"Message: {user_msg}")

    if category == "rate_limit":
        print("Rate limit hit - will retry with backoff")
    elif category == "context_overflow":
        print("Context too large - will compress and retry")
    elif category == "auth":
        print("Authentication failed - check API keys")


agent = Agent(
    name="Advanced Handler",
    instructions="Process requests with detailed error handling",
    on_error=handle_structured_error,
)
Error Categories
The new classifier recognizes seven distinct error categories with specific recovery actions:
| Category | Error Type | Recovery Action | Retryable |
| --- | --- | --- | --- |
| rate_limit | Too many requests | Jittered backoff with provider-specific delays | ✅ |
| context_overflow | Input exceeds model limits | Compress context to 70% of limit | ✅ |
| auth | Invalid API credentials | Credential rotation (not yet implemented) | ❌ |
| overloaded | Service temporarily unavailable | Model fallback + jittered backoff | ✅ |
| model_error | Malformed request/parameters | Surface to user for correction | ❌ |
| permanent | Unrecoverable error | Surface to user | ❌ |
| unknown | Unclassified error | Default retry with backoff | ✅ |
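The table above can be mirrored as a small lookup so that handler code does not hard-code category strings in several places. This is a minimal sketch: `RECOVERY_PLAN` and `is_retryable_category` are hypothetical names chosen for illustration and are not part of praisonaiagents.

```python
# Hypothetical lookup mirroring the category table; not a library API.
# Each entry is (recovery action, retryable flag) as listed in the table.
RECOVERY_PLAN = {
    "rate_limit": ("Jittered backoff with provider-specific delays", True),
    "context_overflow": ("Compress context to 70% of limit", True),
    "auth": ("Credential rotation (not yet implemented)", False),
    "overloaded": ("Model fallback + jittered backoff", True),
    "model_error": ("Surface to user for correction", False),
    "permanent": ("Surface to user", False),
    "unknown": ("Default retry with backoff", True),
}


def is_retryable_category(category: str) -> bool:
    """Return the table's retryable flag; unrecognized strings are treated as 'unknown'."""
    _action, retryable = RECOVERY_PLAN.get(category, RECOVERY_PLAN["unknown"])
    return retryable
```

Treating unrecognized strings as `unknown` is a design choice of this sketch; it keeps a handler working if a category it has never seen appears.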
LLMErrorClassification Structure
The new structured classification provides detailed recovery routing information:
| Field | Type | Description |
| --- | --- | --- |
| error_category | str | One of: rate_limit, context_overflow, auth, overloaded, model_error, permanent, unknown |
| is_retryable | bool | Whether retry is safe |
| should_compress_context | bool | If True, compress messages then retry |
| should_rotate_credential | bool | If True, credentials should be rotated |
| should_fallback_model | bool | If True, switch to alternate model |
| backoff_seconds | float | Wait before retry (jittered) |
| user_message | str | Friendly explanation for the end user |
from praisonaiagents.llm.error_classifier import classify_llm_error

classification = classify_llm_error(
    exc,                # The exception caught from the failed LLM call
    provider="openai",  # "openai" | "anthropic" | "azure"
    model="gpt-4",
    prompt_tokens=0,    # optional
    context_length=0,   # optional
    retry_depth=0,      # optional
)

print(f"Category: {classification.error_category}")
print(f"Should compress: {classification.should_compress_context}")
print(f"Backoff time: {classification.backoff_seconds}")
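To experiment with routing logic without triggering real LLM failures, the fields listed above can be mimicked with a plain dataclass. The sketch below is an illustrative stand-in, not the library's `LLMErrorClassification` class: the name `ClassificationSketch`, its default values, and the `planned_actions` helper are all assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ClassificationSketch:
    """Stand-in with the same field names as the table above (defaults are assumed)."""
    error_category: str = "unknown"
    is_retryable: bool = True
    should_compress_context: bool = False
    should_rotate_credential: bool = False
    should_fallback_model: bool = False
    backoff_seconds: float = 0.0
    user_message: str = ""


def planned_actions(c: ClassificationSketch) -> List[str]:
    """List the recovery steps implied by the flags, in the order a caller might apply them."""
    if not c.is_retryable:
        return ["surface_to_user"]
    actions = []
    if c.should_compress_context:
        actions.append("compress_context")
    if c.should_rotate_credential:
        actions.append("rotate_credential")
    if c.should_fallback_model:
        actions.append("fallback_model")
    if c.backoff_seconds > 0:
        actions.append("sleep")
    actions.append("retry")
    return actions
```

For example, `planned_actions(ClassificationSketch(should_compress_context=True))` yields `["compress_context", "retry"]`, which makes the flag combinations easy to unit test.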
Provider-Aware Backoff
Different providers have different rate limiting patterns, so the classifier uses provider-specific base delays:
| Provider | Rate Limit Base Delay | Service Unavailable Delay |
| --- | --- | --- |
| openai | 60 seconds | 15 seconds |
| anthropic | 20 seconds | 15 seconds |
| azure | 45 seconds | 15 seconds |
| Default | 30 seconds | 15 seconds |
Backoff times include ±50% jitter to prevent thundering herd problems when multiple agents hit rate limits simultaneously.
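As a rough illustration of how the base delays and the ±50% jitter combine, the sketch below looks up the base delay and scales it by a random factor between 0.5 and 1.5. The function name `provider_backoff` and the constants are hypothetical, and the sketch ignores how the library factors in retry depth, which this page does not specify.

```python
import random

# Base delays in seconds, copied from the table above.
RATE_LIMIT_BASE = {"openai": 60.0, "anthropic": 20.0, "azure": 45.0}
DEFAULT_RATE_LIMIT_BASE = 30.0
SERVICE_UNAVAILABLE_BASE = 15.0


def provider_backoff(provider: str, category: str) -> float:
    """Pick the base delay for the provider and apply +/-50% jitter."""
    if category == "rate_limit":
        base = RATE_LIMIT_BASE.get(provider, DEFAULT_RATE_LIMIT_BASE)
    else:
        # Assumption: every other category uses the service-unavailable delay.
        base = SERVICE_UNAVAILABLE_BASE
    return base * random.uniform(0.5, 1.5)
```

Under these assumptions an OpenAI rate limit waits between 30 and 90 seconds, and an Anthropic one between 10 and 30 seconds.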
Recovery Actions
Jittered Backoff
The retry system uses exponential backoff with ±50% jitter to avoid thundering herd problems:
from praisonaiagents.llm.retry_utils import jittered_backoff

# Calculate delay with jitter
delay = jittered_backoff(attempt=1, base=5.0, cap=120.0)  # ~2.5-7.5 seconds
delay = jittered_backoff(attempt=2, base=5.0, cap=120.0)  # ~5.0-15.0 seconds
delay = jittered_backoff(attempt=3, base=5.0, cap=120.0)  # ~10.0-30.0 seconds
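The ranges in the comments above are consistent with a nominal delay that doubles on each attempt and is then scaled by a random factor between 0.5 and 1.5. The following standalone sketch reproduces that arithmetic; it is not the library's implementation, the name `jittered_backoff_sketch` is hypothetical, and applying the cap after the jitter is an assumption.

```python
import random


def jittered_backoff_sketch(attempt: int, base: float = 5.0, cap: float = 120.0) -> float:
    """Exponential backoff with +/-50% jitter (illustrative only)."""
    nominal = base * (2 ** (attempt - 1))          # 5, 10, 20, ... for base=5
    jittered = nominal * random.uniform(0.5, 1.5)  # spread retries across clients
    return min(jittered, cap)                      # assumption: cap applied after jitter
```

With `base=5.0`, attempt 1 lands in 2.5 to 7.5 seconds, attempt 2 in 5 to 15 seconds, and attempt 3 in 10 to 30 seconds, matching the comments above.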
Advanced Classification Usage
For users who want direct access to the classifier:
import time

from praisonaiagents.llm.error_classifier import classify_llm_error
from praisonaiagents.llm.retry_utils import calculate_backoff_with_retry_after


def custom_retry_loop(exc, provider="openai", model="gpt-4"):
    """Custom retry logic using structured classification.

    Returns True if the caller should retry, False otherwise.
    """
    classification = classify_llm_error(
        exc,
        provider=provider,
        model=model,
        retry_depth=0,
    )

    print(f"Error category: {classification.error_category}")
    print(f"User message: {classification.user_message}")

    if not classification.is_retryable:
        print("Error is not retryable")
        return False

    if classification.should_compress_context:
        print("Context compression needed")
        # Implement context compression logic

    if classification.should_fallback_model:
        print("Model fallback suggested")
        # Implement model fallback logic

    if classification.backoff_seconds > 0:
        print(f"Waiting {classification.backoff_seconds:.1f} seconds...")
        time.sleep(classification.backoff_seconds)

    return True  # Proceed with retry
How It Works
Async Support
Structured error classification works with both sync and async agents:
import asyncio

from praisonaiagents import Agent


async def async_error_example():
    agent = Agent(
        name="Async Agent",
        instructions="Process requests asynchronously",
        on_error=lambda error: print(f"Async error: {error.context['error_category']}"),
    )
    result = await agent.astart("Hello async world")
    return result


asyncio.run(async_error_example())

# The same structured classification applies to async flows
# Rate limiting, context compression, and other recovery actions
# are handled automatically with asyncio.sleep for delays
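To show why `asyncio.sleep` matters here, the sketch below retries an awaitable with jittered delays without blocking the event loop. It is a generic illustration, not the library's retry loop: `retry_async` is a hypothetical helper, and it retries on any exception, whereas real code would consult the classification first.

```python
import asyncio
import random


async def retry_async(call, max_retries: int = 2, base: float = 1.0):
    """Await call(); on failure wait with jitter and retry up to max_retries times."""
    for attempt in range(max_retries + 1):
        try:
            return await call()
        except Exception:
            if attempt == max_retries:
                raise  # retries exhausted: surface the error to the caller
            delay = base * (2 ** attempt) * random.uniform(0.5, 1.5)
            # asyncio.sleep yields control, so other tasks keep running while we wait.
            await asyncio.sleep(delay)
```

Using `time.sleep` in the same place would stall every other coroutine on the loop for the full backoff period.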
Typed Error Classification (AgentErrorKind)
LLM errors are automatically classified into typed AgentErrorKind categories for precise handling. For the complete system, see LLM Error Classification.
Quick Reference
Retryable: rate_limit, overloaded, idle_timeout, auth
Non-retryable: auth_permanent, model_not_found, format_error, context_overflow, billing
Limited retry: unknown, empty_response
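The quick reference above can be encoded as an enum plus three sets, which is one way to turn the grouping into a testable policy. This is a sketch only: `ErrorKindSketch` is a hypothetical stand-in for AgentErrorKind, and making it a `str`-based enum (so that members compare equal to plain strings) is an assumption about how the real type behaves.

```python
from enum import Enum


class ErrorKindSketch(str, Enum):
    """Illustrative stand-in for AgentErrorKind; values follow the list above."""
    RATE_LIMIT = "rate_limit"
    OVERLOADED = "overloaded"
    IDLE_TIMEOUT = "idle_timeout"
    AUTH = "auth"
    AUTH_PERMANENT = "auth_permanent"
    MODEL_NOT_FOUND = "model_not_found"
    FORMAT_ERROR = "format_error"
    CONTEXT_OVERFLOW = "context_overflow"
    BILLING = "billing"
    UNKNOWN = "unknown"
    EMPTY_RESPONSE = "empty_response"


K = ErrorKindSketch
RETRYABLE = frozenset({K.RATE_LIMIT, K.OVERLOADED, K.IDLE_TIMEOUT, K.AUTH})
NON_RETRYABLE = frozenset({K.AUTH_PERMANENT, K.MODEL_NOT_FOUND, K.FORMAT_ERROR,
                           K.CONTEXT_OVERFLOW, K.BILLING})
LIMITED_RETRY = frozenset({K.UNKNOWN, K.EMPTY_RESPONSE})


def retry_policy(kind: ErrorKindSketch) -> str:
    """Map a kind to 'retry', 'limited', or 'fail' according to the quick reference."""
    if kind in RETRYABLE:
        return "retry"
    if kind in LIMITED_RETRY:
        return "limited"
    return "fail"
```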
Legacy Support
The simple two-bucket classification (retryable/non-retryable) remains available for backward compatibility, but typed categories provide much more control.
LLMError Structure
The LLMError class provides structured error information:
| Field | Type | Description |
| --- | --- | --- |
| message | str | Error description |
| model_name | str | LLM model that failed |
| agent_id | str | Agent identifier |
| session_id | str | Session identifier |
| is_retryable | bool | Whether error can be retried |
| error_category | AgentErrorKind | Typed classification; see LLM Error Classification |
Error Context
The on_error handler receives enhanced context information:
from praisonaiagents import Agent


def enhanced_error_handler(error):
    """Access structured error information."""
    context = error.context

    # New structured fields
    category = context.get("error_category", "unknown")
    user_message = context.get("user_message", "")

    # Original fields still available
    model = error.model_name
    agent_id = error.agent_id
    retryable = error.is_retryable

    print(f"Category: {category}")
    print(f"Model: {model}")
    print(f"User-friendly message: {user_message}")
    print(f"Can retry: {retryable}")


agent = Agent(
    name="Enhanced Error Agent",
    instructions="Process with enhanced error context",
    on_error=enhanced_error_handler,
)
Retry Depth Limits
The system limits retry depth to prevent infinite loops:
Maximum retry depth: 2 attempts
Context compression: Triggered on context_overflow category
Bounded recovery: After 2 failed retries, errors become non-retryable
# Retry behavior:
# 1st failure: Classify and retry with recovery action
# 2nd failure: Classify and retry with recovery action
# 3rd failure: Mark as non-retryable and surface to user
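The bounded behavior described above can be sketched as a loop that allows at most two retries before re-raising. This is an illustration of the pattern rather than the library's code; `run_with_bounded_retries`, `MAX_RETRY_DEPTH`, and the `recover` callback are hypothetical names.

```python
MAX_RETRY_DEPTH = 2  # matches the maximum retry depth stated above


def run_with_bounded_retries(call, recover=lambda exc, depth: None):
    """Run call(); after each failure apply recover() and retry, at most MAX_RETRY_DEPTH times."""
    depth = 0
    while True:
        try:
            return call()
        except Exception as exc:
            if depth >= MAX_RETRY_DEPTH:
                raise  # third failure: stop retrying and surface the error
            recover(exc, depth)  # e.g. back off, or compress context
            depth += 1
```

A call that always fails is therefore attempted three times in total: the initial attempt plus two retries.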
Unimplemented Recovery Actions
Some recovery actions are planned but not yet implemented in the core system:
Credential Rotation: When should_rotate_credential=True, the user message includes “Credential rotation is not yet implemented”. Users must handle credential management manually.
Model Fallback: When should_fallback_model=True, the user message includes “Model fallback is not yet implemented”. Users can implement custom fallback logic in their on_error handlers.
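Until model fallback is built in, one possible manual approach is to try an ordered list of models and return the first success. The sketch below is generic and assumes nothing about praisonaiagents internals: `call_with_fallback` is a hypothetical helper, and `call` stands for any function of your own that takes a model name. It falls back on every exception; in practice you would likely fall back only when should_fallback_model is set.

```python
def call_with_fallback(call, models):
    """Try each model in order; return (model, result) for the first that succeeds."""
    last_exc = None
    for model in models:
        try:
            return model, call(model)
        except Exception as exc:
            last_exc = exc  # remember the failure and move on to the next model
    raise RuntimeError("all models failed") from last_exc
```

The same shape works for manual credential rotation by iterating over API keys instead of model names.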
Migration from Binary Classification
If you were previously checking error.is_retryable only, you can now access richer classification:
Before (binary classification):
def simple_handler(error):
    if error.is_retryable:
        print("Will retry")
    else:
        print("Won't retry")
After (structured classification):
def structured_handler(error):
    category = error.context.get("error_category", "unknown")
    user_msg = error.context.get("user_message", "")

    # Still works
    if error.is_retryable:
        print(f"Will retry ({category}): {user_msg}")
    else:
        print(f"Won't retry ({category}): {user_msg}")
Typed Error Categories (New):
from praisonaiagents.errors import LLMError

# `agent`, `messages`, and the handle_* functions are assumed to be defined elsewhere
try:
    response = agent._chat_completion(messages)
except LLMError as e:
    # Use typed categories instead of string parsing
    if e.error_category == "billing":
        handle_quota_exceeded()
    elif e.error_category == "auth_permanent":
        handle_invalid_api_key()
    elif e.error_category == "rate_limit":
        # Auto-retry handles this
        raise
Best Practices
Read error_category Instead of Regex Matching
Use the structured error_category field instead of pattern matching on error messages:

def smart_error_handler(error):
    """Handle errors using structured categories."""
    category = error.context.get("error_category", "unknown")

    # Better: Use structured category
    if category == "rate_limit":
        print("Rate limit detected - will retry with provider-specific backoff")
    elif category == "context_overflow":
        print("Context too large - will compress and retry")
    elif category == "auth":
        print("Authentication failed - check credentials")

    # Avoid: Pattern matching on error message
    # if "rate limit" in error.message.lower():
    #     ...
Monitor Structured Error Categories
Track error patterns using the new categorical data:

error_counts = {
    "rate_limit": 0,
    "context_overflow": 0,
    "auth": 0,
    "overloaded": 0,
    "model_error": 0,
    "permanent": 0,
    "unknown": 0,
}


def track_structured_errors(error):
    category = error.context.get("error_category", "unknown")
    # Use .get so a category missing from the dict does not raise KeyError
    error_counts[category] = error_counts.get(category, 0) + 1

    # Send structured metrics to monitoring
    # (send_metric is a placeholder for your own metrics client)
    send_metric(f"llm.error.{category}", 1, {
        "provider": error.context.get("provider", "unknown"),
        "model": error.model_name,
    })
Implement Custom Recovery Logic
Build on the structured classification for advanced recovery:

# rotate_api_keys, switch_to_backup_model, and log_compression_event
# are placeholders for functions you supply.
def advanced_recovery_handler(error):
    category = error.context.get("error_category", "unknown")
    user_msg = error.context.get("user_message", "")

    if category == "auth":
        # Custom credential rotation
        print("Attempting credential rotation...")
        rotate_api_keys()
    elif category == "overloaded":
        # Custom model fallback
        print("Primary model overloaded, switching to backup...")
        switch_to_backup_model()
    elif category == "context_overflow":
        # Log compression metrics
        log_compression_event(error.context.get("prompt_tokens", 0))

    print(f"User message: {user_msg}")
Provider-Specific Error Handling
Customize handling based on the LLM provider:

def provider_aware_handler(error):
    category = error.context.get("error_category", "unknown")
    provider = error.context.get("provider", "unknown")

    if category == "rate_limit":
        if provider == "openai":
            print("OpenAI rate limit - 60s base delay with jitter")
        elif provider == "anthropic":
            print("Anthropic rate limit - 20s base delay with jitter")
        elif provider == "azure":
            print("Azure rate limit - 45s base delay with jitter")
    elif category == "overloaded" and provider == "anthropic":
        print("Anthropic service overloaded - consider Claude alternatives")
LLM Error Classification: Typed error categories and failover decisions
Task Retry Policy: Configure task-level retry behavior and policies
Hooks: Agent lifecycle hooks and events
Model Failover: Cross-provider failover with FailoverManager