LLM failures are classified with provider-aware rules and routed to a structured recovery action: retry with backoff, context compression, credential rotation, or model fallback.
LLM errors now also carry typed AgentErrorKind classifications for more precise error handling. See LLM Error Classification for the complete taxonomy and failover decision system.

Quick Start

1. Simple Agent with Structured Errors

from praisonaiagents import Agent

agent = Agent(
    name="Error Handler",
    instructions="Process user requests with automatic error recovery",
    on_error=lambda error: print(f"Error: {error.context['user_message']}")
)

result = agent.start("Hello world")

2. Advanced Error Classification

from praisonaiagents import Agent
from praisonaiagents.llm.error_classifier import classify_llm_error

def handle_structured_error(error):
    """Handle errors with structured classification"""
    category = error.context.get("error_category", "unknown")
    user_msg = error.context.get("user_message", "Unknown error")
    
    print(f"Category: {category}")
    print(f"Message: {user_msg}")
    
    if category == "rate_limit":
        print("Rate limit hit - will retry with backoff")
    elif category == "context_overflow":
        print("Context too large - will compress and retry")
    elif category == "auth":
        print("Authentication failed - check API keys")

agent = Agent(
    name="Advanced Handler",
    instructions="Process requests with detailed error handling",
    on_error=handle_structured_error
)

Error Categories

The new classifier recognizes seven distinct error categories with specific recovery actions:
| Category | Error Type | Recovery Action | Retryable |
|---|---|---|---|
| rate_limit | Too many requests | Jittered backoff with provider-specific delays | |
| context_overflow | Input exceeds model limits | Compress context to 70% of limit | |
| auth | Invalid API credentials | Credential rotation (not yet implemented) | |
| overloaded | Service temporarily unavailable | Model fallback + jittered backoff | |
| model_error | Malformed request/parameters | Surface to user for correction | |
| permanent | Unrecoverable error | Surface to user | |
| unknown | Unclassified error | Default retry with backoff | |
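
As a rough illustration of how the seven categories could drive handler logic, the sketch below stores the documented recovery actions in a lookup table. `RecoveryPlan`, `RECOVERY_PLANS`, and `plan_for` are hypothetical names invented for this example and are not part of praisonaiagents; only the category names and action descriptions come from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecoveryPlan:
    """Hypothetical record pairing a category with its documented action."""
    action: str
    surfaces_to_user: bool


# Action text is copied from the category list; the structure is illustrative.
RECOVERY_PLANS = {
    "rate_limit": RecoveryPlan("jittered backoff with provider-specific delays", False),
    "context_overflow": RecoveryPlan("compress context to 70% of limit", False),
    "auth": RecoveryPlan("credential rotation (not yet implemented)", False),
    "overloaded": RecoveryPlan("model fallback + jittered backoff", False),
    "model_error": RecoveryPlan("surface to user for correction", True),
    "permanent": RecoveryPlan("surface to user", True),
    "unknown": RecoveryPlan("default retry with backoff", False),
}


def plan_for(category: str) -> RecoveryPlan:
    """Return the plan for a category, treating unrecognized names as 'unknown'."""
    return RECOVERY_PLANS.get(category, RECOVERY_PLANS["unknown"])
```

A handler could call `plan_for(error.context.get("error_category", "unknown"))` and log the resulting `action` string.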

LLMErrorClassification Structure

The new structured classification provides detailed recovery routing information:
| Field | Type | Description |
|---|---|---|
| error_category | str | One of: rate_limit, context_overflow, auth, overloaded, model_error, permanent, unknown |
| is_retryable | bool | Whether retry is safe |
| should_compress_context | bool | If True, compress messages then retry |
| should_rotate_credential | bool | If True, credentials should be rotated |
| should_fallback_model | bool | If True, switch to alternate model |
| backoff_seconds | float | Wait before retry (jittered) |
| user_message | str | Friendly explanation for the end user |

from praisonaiagents.llm.error_classifier import classify_llm_error

classification = classify_llm_error(
    exc,                  # The exception
    provider="openai",    # "openai" | "anthropic" | "azure"
    model="gpt-4",
    prompt_tokens=0,      # optional
    context_length=0,     # optional
    retry_depth=0,        # optional
)

print(f"Category: {classification.error_category}")
print(f"Should compress: {classification.should_compress_context}")
print(f"Backoff time: {classification.backoff_seconds}")
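
To make the field list concrete, here is a stand-alone sketch of a record with the same fields and a helper that turns the flags into an ordered list of steps. `ClassificationSketch` and `next_steps` are hypothetical and only mirror the documented fields; the real LLMErrorClassification object may differ, and the ordering of steps is an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ClassificationSketch:
    """Hypothetical stand-in mirroring the documented fields."""
    error_category: str
    is_retryable: bool
    should_compress_context: bool = False
    should_rotate_credential: bool = False
    should_fallback_model: bool = False
    backoff_seconds: float = 0.0
    user_message: str = ""


def next_steps(c: ClassificationSketch) -> List[str]:
    """Translate the routing flags into an ordered list of actions (assumed order)."""
    if not c.is_retryable:
        return ["surface to user: " + c.user_message]
    steps = []
    if c.should_compress_context:
        steps.append("compress context")
    if c.should_rotate_credential:
        steps.append("rotate credential")
    if c.should_fallback_model:
        steps.append("fall back to another model")
    if c.backoff_seconds > 0:
        steps.append(f"wait {c.backoff_seconds:.1f}s")
    steps.append("retry")
    return steps
```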

Provider-Aware Backoff

Different providers have different rate limiting patterns, so the classifier uses provider-specific base delays:
| Provider | Rate Limit Base Delay | Service Unavailable Delay |
|---|---|---|
| openai | 60 seconds | 15 seconds |
| anthropic | 20 seconds | 15 seconds |
| azure | 45 seconds | 15 seconds |
| Default | 30 seconds | 15 seconds |
Backoff times include ±50% jitter to prevent thundering herd problems when multiple agents hit rate limits simultaneously.
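
The arithmetic behind these numbers can be sketched as follows. The base delays and the ±50% jitter come from this section; the `delay_bounds` helper, the constant names, and the mapping of the "service unavailable" delay to the overloaded category are assumptions made for illustration.

```python
# Base delays in seconds, taken from the provider list above.
RATE_LIMIT_BASE_DELAY = {"openai": 60.0, "anthropic": 20.0, "azure": 45.0}
DEFAULT_RATE_LIMIT_DELAY = 30.0
SERVICE_UNAVAILABLE_DELAY = 15.0
JITTER = 0.5  # the documented +/-50%


def delay_bounds(provider: str, category: str = "rate_limit"):
    """Return the (min, max) wait in seconds once jitter is applied.

    Hypothetical helper: assumes 'overloaded' uses the service-unavailable delay.
    """
    if category == "overloaded":
        base = SERVICE_UNAVAILABLE_DELAY
    else:
        base = RATE_LIMIT_BASE_DELAY.get(provider, DEFAULT_RATE_LIMIT_DELAY)
    return (base * (1 - JITTER), base * (1 + JITTER))
```

Under these assumptions an OpenAI rate limit waits somewhere between 30 and 90 seconds, while an Anthropic one waits between 10 and 30 seconds.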

Recovery Actions


Jittered Backoff

The retry system uses exponential backoff with ±50% jitter to avoid thundering herd problems:
from praisonaiagents.llm.retry_utils import jittered_backoff

# Calculate delay with jitter
delay = jittered_backoff(attempt=1, base=5.0, cap=120.0)  # ~2.5-7.5 seconds
delay = jittered_backoff(attempt=2, base=5.0, cap=120.0)  # ~5.0-15.0 seconds
delay = jittered_backoff(attempt=3, base=5.0, cap=120.0)  # ~10.0-30.0 seconds
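
The ranges in the comments above are consistent with a delay of base × 2^(attempt − 1), scaled by a random factor between 0.5 and 1.5. The function below is a sketch of that formula, not the library's implementation: the name `jittered_backoff_sketch`, the `rng` parameter, and the choice to apply the cap before the jitter are all assumptions.

```python
import random


def jittered_backoff_sketch(attempt, base=5.0, cap=120.0, jitter=0.5, rng=random):
    """Exponential backoff with symmetric jitter (illustrative only).

    attempt is 1-based. The cap is applied before jitter, which is an
    assumption; the library may order these steps differently.
    """
    exp_delay = min(cap, base * 2 ** (attempt - 1))
    factor = 1.0 + rng.uniform(-jitter, jitter)  # in [0.5, 1.5] for jitter=0.5
    return exp_delay * factor
```

Passing a seeded `random.Random` instance as `rng` makes the delays reproducible in tests.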

Advanced Classification Usage

For users who want direct access to the classifier:
from praisonaiagents.llm.error_classifier import classify_llm_error
from praisonaiagents.llm.retry_utils import calculate_backoff_with_retry_after

def custom_retry_loop(exc, provider="openai", model="gpt-4"):
    """Custom retry logic using structured classification"""
    classification = classify_llm_error(
        exc,
        provider=provider,
        model=model,
        retry_depth=0
    )
    
    print(f"Error category: {classification.error_category}")
    print(f"User message: {classification.user_message}")
    
    if not classification.is_retryable:
        print("Error is not retryable")
        return False
    
    if classification.should_compress_context:
        print("Context compression needed")
        # Implement context compression logic
        
    if classification.should_fallback_model:
        print("Model fallback suggested")
        # Implement model fallback logic
        
    if classification.backoff_seconds > 0:
        print(f"Waiting {classification.backoff_seconds:.1f} seconds...")
        import time
        time.sleep(classification.backoff_seconds)
    
    return True  # Proceed with retry
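
The loop above leaves context compression as a placeholder. One simple way to fill it in, assuming the 70% target from the Error Categories section, is to drop the oldest non-system messages until an estimated token count fits the budget. Both functions below are hypothetical, and the four-characters-per-token estimate is a crude heuristic rather than a real tokenizer.

```python
def estimate_tokens(message):
    """Very rough token estimate: about four characters per token (assumption)."""
    return max(1, len(message["content"]) // 4)


def compress_messages(messages, context_length, target_ratio=0.7):
    """Drop the oldest non-system messages until the estimate fits the budget.

    Illustrative only: system messages are kept and placed first, and at
    least one non-system message is always retained.
    """
    budget = int(context_length * target_ratio)
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]

    def total(msgs):
        return sum(estimate_tokens(m) for m in msgs)

    while len(rest) > 1 and total(system + rest) > budget:
        rest.pop(0)  # discard the oldest conversational turn
    return system + rest
```

Truncation is the bluntest form of compression; summarizing the dropped turns would preserve more information at the cost of an extra model call.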

How It Works

Async Support

Structured error classification works with both sync and async agents:
import asyncio
from praisonaiagents import Agent

async def async_error_example():
    agent = Agent(
        name="Async Agent",
        instructions="Process requests asynchronously",
        on_error=lambda error: print(f"Async error: {error.context['error_category']}")
    )
    
    result = await agent.astart("Hello async world")
    return result

# The same structured classification applies to async flows.
# Rate limiting, context compression, and other recovery actions
# are handled automatically, using asyncio.sleep for delays.
if __name__ == "__main__":
    asyncio.run(async_error_example())
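
The point of using asyncio.sleep rather than time.sleep is that waiting out a backoff does not block other coroutines on the event loop. The sketch below shows the pattern in isolation; `retry_async` and its `delays` parameter are hypothetical and do not reflect how the library schedules its retries internally.

```python
import asyncio


async def retry_async(operation, delays):
    """Await operation(), sleeping between attempts without blocking the loop.

    Hypothetical helper: delays lists the waits (seconds) before each retry,
    so len(delays) is the number of retries allowed.
    """
    last_exc = None
    for delay in [0.0, *delays]:
        if delay:
            await asyncio.sleep(delay)  # yields control to other tasks
        try:
            return await operation()
        except Exception as exc:
            last_exc = exc
    raise last_exc
```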

Typed Error Classification (AgentErrorKind)

LLM errors are automatically classified into typed AgentErrorKind categories for precise handling. For the complete system, see LLM Error Classification.

Quick Reference

  • Retryable: rate_limit, overloaded, idle_timeout, auth
  • Non-retryable: auth_permanent, model_not_found, format_error, context_overflow, billing
  • Limited retry: unknown, empty_response
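
The three groups above translate naturally into set membership checks. The sketch below uses plain strings; the set names and `retry_policy` are hypothetical, and treating unlisted kinds as limited-retry is an assumption rather than documented behavior.

```python
RETRYABLE = frozenset({"rate_limit", "overloaded", "idle_timeout", "auth"})
NON_RETRYABLE = frozenset(
    {"auth_permanent", "model_not_found", "format_error", "context_overflow", "billing"}
)
LIMITED_RETRY = frozenset({"unknown", "empty_response"})


def retry_policy(kind: str) -> str:
    """Map an error kind to 'retry', 'fail', or 'limited' (illustrative)."""
    if kind in RETRYABLE:
        return "retry"
    if kind in NON_RETRYABLE:
        return "fail"
    # LIMITED_RETRY members, plus any kind not listed (assumption)
    return "limited"
```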

Legacy Support

The simple two-bucket classification (retryable/non-retryable) remains available for backward compatibility, but typed categories let handlers branch on the specific cause of a failure.

LLMError Structure

The LLMError class provides structured error information:
| Field | Type | Description |
|---|---|---|
| message | str | Error description |
| model_name | str | LLM model that failed |
| agent_id | str | Agent identifier |
| session_id | str | Session identifier |
| is_retryable | bool | Whether error can be retried |
| error_category | AgentErrorKind | Typed classification; see LLM Error Classification |

Error Context

The on_error handler receives enhanced context information:
def enhanced_error_handler(error):
    """Access structured error information"""
    context = error.context
    
    # New structured fields
    category = context.get("error_category", "unknown")
    user_message = context.get("user_message", "")
    
    # Original fields still available
    model = error.model_name
    agent_id = error.agent_id
    retryable = error.is_retryable
    
    print(f"Category: {category}")
    print(f"Model: {model}")
    print(f"User-friendly message: {user_message}")
    print(f"Can retry: {retryable}")

agent = Agent(
    name="Enhanced Error Agent",
    instructions="Process with enhanced error context",
    on_error=enhanced_error_handler
)

Retry Depth Limits

The system limits retry depth to prevent infinite loops:
  • Maximum retry depth: 2 attempts
  • Context compression: Triggered on context_overflow category
  • Bounded recovery: After 2 failed retries, errors become non-retryable
# Retry behavior:
# 1st failure: Classify and retry with recovery action
# 2nd failure: Classify and retry with recovery action  
# 3rd failure: Mark as non-retryable and surface to user
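
The bounded behavior described above can be sketched as a small loop: two recovery attempts are allowed, and the third failure propagates to the caller. `run_with_bounded_retries`, `operation`, and `recover` are hypothetical names; the library's internal loop is not shown in this document and may be structured differently.

```python
MAX_RETRY_DEPTH = 2  # documented maximum retry depth


def run_with_bounded_retries(operation, recover, max_depth=MAX_RETRY_DEPTH):
    """Call operation(); after each failure apply recover() and retry.

    Once max_depth retries have been used, the next failure is re-raised
    so it can be surfaced to the user. Illustrative sketch only.
    """
    depth = 0
    while True:
        try:
            return operation()
        except Exception as exc:
            if depth >= max_depth:
                raise  # 3rd failure with the default depth of 2
            recover(exc, depth)  # e.g. back off or compress context
            depth += 1
```

With the default depth, `operation` runs at most three times and `recover` at most twice.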

Unimplemented Recovery Actions

Some recovery actions are planned but not yet implemented in the core system:
Credential Rotation: When should_rotate_credential=True, the user message includes “Credential rotation is not yet implemented”. Users must handle credential management manually.

Model Fallback: When should_fallback_model=True, the user message includes “Model fallback is not yet implemented”. Users can implement custom fallback logic in their on_error handlers.
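
Until model fallback is built in, custom logic along these lines is one option. The sketch below tries each model in order and moves on only when the failure looks like an overload. `call_with_fallback`, `call_model`, and `is_overloaded` are hypothetical callables that you would supply; nothing here is a praisonaiagents API.

```python
def call_with_fallback(models, call_model, is_overloaded):
    """Try each model in order, falling back only on overload errors.

    Returns (model_name, result) for the first model that succeeds.
    Errors that are not overloads are re-raised immediately.
    """
    last_exc = None
    for model in models:
        try:
            return model, call_model(model)
        except Exception as exc:
            if not is_overloaded(exc):
                raise
            last_exc = exc  # remember the failure and try the next model
    raise RuntimeError("all candidate models failed") from last_exc
```

In practice `is_overloaded` could inspect the error_category of the exception your client raises.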

Migration from Binary Classification

If you previously checked only error.is_retryable, you can now access a richer classification.

Before (binary classification):
def simple_handler(error):
    if error.is_retryable:
        print("Will retry")
    else:
        print("Won't retry")
After (structured classification):
def structured_handler(error):
    category = error.context.get("error_category", "unknown")
    user_msg = error.context.get("user_message", "")
    
    # Still works
    if error.is_retryable:
        print(f"Will retry ({category}): {user_msg}")
    else:
        print(f"Won't retry ({category}): {user_msg}")
Typed Error Categories (New):
from praisonaiagents.errors import LLMError

try:
    response = agent._chat_completion(messages)
except LLMError as e:
    # Use typed categories instead of string parsing
    if e.error_category == "billing":
        handle_quota_exceeded()
    elif e.error_category == "auth_permanent":
        handle_invalid_api_key()
    elif e.error_category == "rate_limit":
        # Auto-retry handles this
        raise

Best Practices

Use the structured error_category field instead of pattern matching on error messages:
def smart_error_handler(error):
    """Handle errors using structured categories"""
    category = error.context.get("error_category", "unknown")
    
    # Better: Use structured category
    if category == "rate_limit":
        print("Rate limit detected - will retry with provider-specific backoff")
    elif category == "context_overflow":
        print("Context too large - will compress and retry")
    elif category == "auth":
        print("Authentication failed - check credentials")
    
    # Avoid: Pattern matching on error message
    # if "rate limit" in error.message.lower():
    #     ...
Track error patterns using the new categorical data:
error_counts = {
    "rate_limit": 0,
    "context_overflow": 0, 
    "auth": 0,
    "overloaded": 0,
    "model_error": 0,
    "permanent": 0,
    "unknown": 0
}

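def track_structured_errors(error):
    category = error.context.get("error_category", "unknown")
    # Use .get so a category missing from error_counts does not raise KeyError
    error_counts[category] = error_counts.get(category, 0) + 1
    
    # Send structured metrics to monitoring
    # (send_metric is a placeholder for your own metrics client)
    send_metric(f"llm.error.{category}", 1, {
        "provider": error.context.get("provider", "unknown"),
        "model": error.model_name
    })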
Build on the structured classification for advanced recovery:
def advanced_recovery_handler(error):
    category = error.context.get("error_category", "unknown")
    user_msg = error.context.get("user_message", "")
    
    if category == "auth":
        # Custom credential rotation
        print("Attempting credential rotation...")
        rotate_api_keys()
    
    elif category == "overloaded":
        # Custom model fallback
        print("Primary model overloaded, switching to backup...")
        switch_to_backup_model()
    
    elif category == "context_overflow":
        # Log compression metrics
        log_compression_event(error.context.get("prompt_tokens", 0))
    
    print(f"User message: {user_msg}")
Customize handling based on the LLM provider:
def provider_aware_handler(error):
    category = error.context.get("error_category", "unknown")
    provider = error.context.get("provider", "unknown")
    
    if category == "rate_limit":
        if provider == "openai":
            print("OpenAI rate limit - 60s base delay with jitter")
        elif provider == "anthropic":
            print("Anthropic rate limit - 20s base delay with jitter")
        elif provider == "azure":
            print("Azure rate limit - 45s base delay with jitter")
    
    elif category == "overloaded" and provider == "anthropic":
        print("Anthropic service overloaded - consider Claude alternatives")

LLM Error Classification

Typed error categories and failover decisions

Task Retry Policy

Configure task-level retry behavior and policies

Hooks

Agent lifecycle hooks and events

Model Failover

Cross-provider failover with FailoverManager