Failover automatically switches between LLM providers when one fails, keeping agents running during API outages or rate limits.
Failover now integrates with LLM Error Classification through the new FailoverDecision struct, which coordinates profile rotation with typed error handling.

Quick Start

1. Run an agent with failover

import os
from praisonaiagents import Agent, AuthProfile, FailoverConfig, FailoverManager

manager = FailoverManager(FailoverConfig(max_retries=3, exponential_backoff=True))
manager.add_profile(AuthProfile(
    name="openai", provider="openai",
    api_key=os.getenv("OPENAI_API_KEY"), priority=1,
))
manager.add_profile(AuthProfile(
    name="anthropic", provider="anthropic",
    api_key=os.getenv("ANTHROPIC_API_KEY"), priority=2,
))

agent = Agent(
    name="assistant",
    llm={"model": "gpt-4o-mini", "failover_manager": manager},
)
agent.start("Hello!")

2. Add more profiles

import os
from praisonaiagents import AuthProfile, FailoverManager

manager = FailoverManager()
manager.add_profile(AuthProfile(
    name="groq", provider="groq",
    api_key=os.getenv("GROQ_API_KEY"), priority=3,
))

3. Monitor provider health

status = manager.status()
for name, info in status.items():
    print(f"{name}: {info['status']}")

How failover activates during retries

Failover is wired directly into the LLM retry loop:
  • On every LLM call, the system first gets the current profile via get_next_profile() and applies its api_key, base_url, and model settings
  • On success, mark_success(profile) is called to track the working provider
  • On failure, mark_failure(profile, error, is_rate_limit=...) marks the provider as failed, then get_next_profile() fetches the next available provider
  • Switching profiles overrides a non-retryable error classification: one extra attempt is always granted after switching providers
  • The LLM automatically updates request parameters (api_key, base_url, model) when switching between profiles
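
The loop described above can be sketched without the library. This is a minimal illustration, not the PraisonAI implementation: `SketchProfile`, `SketchManager`, `call_with_failover`, and `fake_send` are hypothetical stand-ins, and only the method names `get_next_profile`, `mark_success`, and `mark_failure` are taken from the text.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SketchProfile:
    """Hypothetical stand-in for AuthProfile (illustration only)."""
    name: str
    priority: int
    failed: bool = False


@dataclass
class SketchManager:
    """Hypothetical stand-in for FailoverManager; not the library class."""
    profiles: List[SketchProfile] = field(default_factory=list)

    def get_next_profile(self) -> Optional[SketchProfile]:
        # Lowest priority number wins among profiles not marked as failed.
        healthy = [p for p in self.profiles if not p.failed]
        return min(healthy, key=lambda p: p.priority) if healthy else None

    def mark_success(self, profile: SketchProfile) -> None:
        profile.failed = False

    def mark_failure(self, profile: SketchProfile, error: Exception,
                     is_rate_limit: bool = False) -> None:
        profile.failed = True


def call_with_failover(manager, send, max_retries=3):
    """Call send(profile); on failure, mark the profile and try the next one."""
    last_error = None
    for _ in range(max_retries + 1):
        profile = manager.get_next_profile()
        if profile is None:
            break  # every profile is currently marked as failed
        try:
            result = send(profile)
        except Exception as error:
            last_error = error
            rate_limited = "rate limit" in str(error).lower()
            manager.mark_failure(profile, error, is_rate_limit=rate_limited)
            continue
        manager.mark_success(profile)
        return profile.name, result
    raise RuntimeError("all profiles failed") from last_error


def fake_send(profile):
    # Simulated provider call: the "openai" profile is rate limited.
    if profile.name == "openai":
        raise RuntimeError("Rate limit exceeded")
    return f"hello from {profile.name}"


manager = SketchManager([
    SketchProfile("openai", priority=1),
    SketchProfile("anthropic", priority=2),
])
used, reply = call_with_failover(manager, fake_send)
print(used, reply)
```

In this sketch the first attempt fails on the simulated rate limit, the manager marks that profile as failed, and the second attempt runs against the next profile by priority.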

How It Works

| Component | Role |
| --- | --- |
| AuthProfile | Credentials for a single provider |
| FailoverManager | Orchestrates failover logic |
| FailoverConfig | Retry and backoff settings |
| ProviderStatus | Tracks provider health |

Configuration Options

FailoverManager

Manager class reference

AuthProfile

Provider credential profile
from praisonaiagents import FailoverConfig

config = FailoverConfig(
    max_retries=3,
    retry_delay=1.0,
    exponential_backoff=True,
    max_retry_delay=60.0,
)
| Option | Type | Default | Description |
| --- | --- | --- | --- |
| max_retries | int | 3 | Maximum retry attempts |
| retry_delay | float | 1.0 | Initial retry delay |
| exponential_backoff | bool | True | Use exponential backoff |
| max_retry_delay | float | 60.0 | Maximum retry delay |
| cooldown_on_rate_limit | float | 60.0 | Rate limit cooldown (seconds) |
| cooldown_on_error | float | 30.0 | Error cooldown (seconds) |
| rotate_on_success | bool | False | Rotate profiles on success |
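
To see how `retry_delay`, `exponential_backoff`, and `max_retry_delay` might interact, here is a minimal sketch. It assumes the delay doubles on each attempt and is capped at the maximum. The exact formula used by the library is not stated here, so treat `backoff_delay` as a hypothetical helper rather than the actual implementation.

```python
def backoff_delay(attempt: int, retry_delay: float = 1.0,
                  max_retry_delay: float = 60.0,
                  exponential_backoff: bool = True) -> float:
    """Delay before retry number `attempt` (0-based), under the assumed formula."""
    if not exponential_backoff:
        # Assumed behaviour: a constant delay when backoff is disabled.
        return min(retry_delay, max_retry_delay)
    # Assumed behaviour: double the delay each attempt, capped at the maximum.
    return min(retry_delay * (2 ** attempt), max_retry_delay)


delays = [backoff_delay(n) for n in range(8)]
print(delays)  # [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 60.0, 60.0]
```

With the defaults from the table, the cap takes effect from the seventh attempt onward under this assumption.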

Auth Profiles

Configure credentials for each provider:
import os
from praisonaiagents import AuthProfile

profile = AuthProfile(
    name="openai-primary",
    provider="openai",
    api_key=os.getenv("OPENAI_API_KEY"),
    priority=1,
    rate_limit_rpm=100,
)
| Field | Type | Description |
| --- | --- | --- |
| name | str | Unique profile identifier |
| provider | str | Provider: openai, anthropic, etc. |
| api_key | str | API key (masked in logs) |
| base_url | str | Custom API endpoint |
| model | str | Default model for this profile |
| priority | int | Failover priority (lower = higher priority) |
| rate_limit_rpm | int | Requests per minute limit |
| rate_limit_tpm | int | Tokens per minute limit |
| metadata | dict | Additional provider-specific config |
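
Two of the fields above, `priority` and `api_key`, are easy to illustrate in isolation. The sketch below shows one way "lower = higher priority" ordering and key masking could work. `SketchProfile` and `mask_key` are hypothetical names, the keys are placeholders, and the masking format the library actually uses in logs is not documented here.

```python
from dataclasses import dataclass


@dataclass
class SketchProfile:
    """Hypothetical stand-in for AuthProfile (illustration only)."""
    name: str
    priority: int
    api_key: str


def mask_key(api_key: str, visible: int = 4) -> str:
    """Hide all but the last `visible` characters (assumed masking style)."""
    if len(api_key) <= visible:
        return "*" * len(api_key)
    return "*" * (len(api_key) - visible) + api_key[-visible:]


profiles = [
    SketchProfile("groq", 3, "gsk-example-0003"),
    SketchProfile("openai", 1, "sk-example-0001"),
    SketchProfile("anthropic", 2, "sk-ant-example-0002"),
]

# Lower priority number is tried first.
ordered = sorted(profiles, key=lambda p: p.priority)
names = [p.name for p in ordered]
print(names)

for p in ordered:
    print(p.name, mask_key(p.api_key))
```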

Common Patterns

import os
from praisonaiagents import AuthProfile, FailoverManager

manager = FailoverManager()

# Add multiple providers
manager.add_profile(AuthProfile(
    name="openai", provider="openai",
    api_key=os.getenv("OPENAI_API_KEY"), priority=1,
))

manager.add_profile(AuthProfile(
    name="anthropic", provider="anthropic",
    api_key=os.getenv("ANTHROPIC_API_KEY"), priority=2,
))

manager.add_profile(AuthProfile(
    name="groq", provider="groq",
    api_key=os.getenv("GROQ_API_KEY"), priority=3,
))

Failover Callbacks

React to failover events:
from praisonaiagents import FailoverManager, FailoverConfig

def on_failover(from_profile, to_profile, error):
    print(f"Failing over from {from_profile} to {to_profile}")
    print(f"Reason: {error}")
    # Log to monitoring system
    
config = FailoverConfig(
    on_failover=on_failover
)

manager = FailoverManager(config)
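
For the "log to monitoring system" step, a callback could count events and emit a warning through the standard `logging` module. This is a sketch that reuses the `on_failover(from_profile, to_profile, error)` signature shown above. Whether the arguments arrive as names or profile objects is an assumption, so the code converts them with `str()`. The `failover_counts` counter is a hypothetical addition.

```python
import logging
from collections import Counter

logger = logging.getLogger("failover")

# Hypothetical in-memory tally of (from, to) transitions.
failover_counts = Counter()


def on_failover(from_profile, to_profile, error):
    """Record the transition and log it at WARNING level."""
    failover_counts[(str(from_profile), str(to_profile))] += 1
    logger.warning("failover %s -> %s: %s", from_profile, to_profile, error)


# Simulate two events by calling the callback directly.
on_failover("openai", "anthropic", "429 rate limit")
on_failover("openai", "anthropic", "timeout")
print(failover_counts)
```

A counter like this makes it easy to alert when one provider is abandoned repeatedly within a short window.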

Provider Status

Monitor provider health:
from praisonaiagents import FailoverManager

manager = FailoverManager()

# Get status of all providers
status = manager.status()
for name, info in status.items():
    print(f"{name}: {info['status']}")
    print(f"  Failures: {info['failure_count']}")
    print(f"  Last success: {info['last_success']}")

# Reset a provider after recovery
manager.mark_success("openai")

# Reset all profiles
manager.reset_all()
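
The `cooldown_on_rate_limit` and `cooldown_on_error` settings suggest that a failed provider is skipped for a while before it becomes eligible again. The sketch below models that idea with explicit timestamps. `SketchHealth` is a hypothetical class, and the real ProviderStatus bookkeeping may differ.

```python
from dataclasses import dataclass


@dataclass
class SketchHealth:
    """Hypothetical per-provider health record (illustration only)."""
    cooldown_on_rate_limit: float = 60.0
    cooldown_on_error: float = 30.0
    cooldown_until: float = 0.0
    failure_count: int = 0

    def mark_failure(self, now: float, is_rate_limit: bool = False) -> None:
        # Rate limits get the longer cooldown; other errors the shorter one.
        cooldown = (self.cooldown_on_rate_limit if is_rate_limit
                    else self.cooldown_on_error)
        self.cooldown_until = now + cooldown
        self.failure_count += 1

    def mark_success(self) -> None:
        # A success clears the cooldown and the failure tally.
        self.cooldown_until = 0.0
        self.failure_count = 0

    def is_available(self, now: float) -> bool:
        return now >= self.cooldown_until


health = SketchHealth()
health.mark_failure(now=100.0, is_rate_limit=True)
print(health.is_available(now=130.0))  # False: still cooling down
print(health.is_available(now=160.0))  # True: 60 seconds have passed
```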

Best Practices

Always have at least 2-3 providers configured. This ensures availability even during major outages.
Enable exponential_backoff=True to avoid hammering providers during issues. This helps you stay within rate limits.
Order providers by cost and reliability. Put cheaper/faster providers first, with premium providers as fallback.
Use the on_failover callback to track when failovers occur. This helps identify provider issues early.
Pair failover with LLM Error Classification so FailoverDecision coordinates profile rotation with typed errors.
Load keys from environment variables or a secrets manager, and never commit credentials to version control.
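
One way to enforce the first and last practices together is a startup check that confirms enough provider keys are present in the environment. This is a sketch: the environment variable names come from the examples on this page, while `configured_providers` and `require_redundancy` are hypothetical helpers, not part of the library.

```python
import os
from typing import List, Mapping

# Environment variable names used in the examples on this page.
PROVIDER_ENV_VARS = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "groq": "GROQ_API_KEY",
}


def configured_providers(environ: Mapping[str, str] = os.environ) -> List[str]:
    """Return providers whose API key variable is set and non-empty."""
    return [name for name, var in PROVIDER_ENV_VARS.items() if environ.get(var)]


def require_redundancy(environ: Mapping[str, str] = os.environ,
                       minimum: int = 2) -> List[str]:
    """Raise if fewer than `minimum` providers have keys configured."""
    found = configured_providers(environ)
    if len(found) < minimum:
        raise RuntimeError(
            f"only {len(found)} provider key(s) set; need at least {minimum}"
        )
    return found


# Placeholder values stand in for a real environment.
example_env = {"OPENAI_API_KEY": "sk-example", "GROQ_API_KEY": "gsk-example"}
available = require_redundancy(example_env)
print(available)
```

Running a check like this before building profiles surfaces a missing key at startup instead of during an outage.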

LLM Error Classification

Typed errors that drive failover decisions

Providers

Supported LLM providers