Agent Retry - PraisonAI

Agent retry automatically re-runs failed LLM calls with jittered exponential backoff so transient rate limits, overloads, and network blips don’t break your agent.

Quick Start

Enable with True (simplest)

from praisonaiagents import Agent

agent = Agent(
    name="Researcher",
    instructions="Research topics on the web",
    retry=True,
)
agent.start("Find recent papers on diffusion models")

retry=True enables retry with the defaults: 3 retries on a 5 s → 10 s → 20 s exponential schedule, with each delay capped at 120 s and up to 50% additive jitter on top.

Tune with a dict (no extra import)

from praisonaiagents import Agent

agent = Agent(
    name="Researcher",
    instructions="Research topics on the web",
    retry={
        "max_retries": 5,
        "base_delay": 2.0,
        "max_delay": 60.0,
        "jitter_ratio": 0.3,
    },
)
agent.start("Find recent papers on diffusion models")

Full control with RetryBackoffConfig

from praisonaiagents import Agent, RetryBackoffConfig

agent = Agent(
    name="Researcher",
    instructions="Research topics on the web",
    retry=RetryBackoffConfig(
        base_delay=2.0,
        max_delay=60.0,
        jitter_ratio=0.3,
        max_retries=5,
    ),
)
agent.start("Find recent papers on diffusion models")

How It Works

Aspect	Detail
What gets retried	Only `LLMError` where `is_retryable=True` (rate limits, overloads)
What does NOT get retried	Auth errors, invalid requests, non-retryable `LLMError`, any other exception
Total attempts	`max_retries + 1` (default: 4 total)
Backoff schedule	`delay = min(base_delay × 2^attempt, max_delay)`, then sleep for `delay + uniform(0, jitter_ratio × delay)` (`attempt` is 0-based)
Interruption	Raises `RuntimeError("Agent interrupted during retry backoff")` immediately
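
The backoff formula in the table can be sketched in a few lines. This is a minimal illustration that follows the formula as written, not the library's implementation; the function name backoff_delay and the rng parameter are assumptions made for the example. Read this way, the jitter is added after the cap, so a single sleep may exceed max_delay by up to jitter_ratio × delay.

```python
import random


def backoff_delay(attempt, base_delay=5.0, max_delay=120.0,
                  jitter_ratio=0.5, rng=random):
    """Seconds to sleep before retry number `attempt` (0-based).

    Hypothetical helper that mirrors the documented formula.
    """
    # Exponential term, capped at max_delay.
    delay = min(base_delay * (2 ** attempt), max_delay)
    # Additive jitter drawn uniformly from [0, jitter_ratio * delay].
    return delay + rng.uniform(0, jitter_ratio * delay)


# With jitter disabled, the default schedule is deterministic.
schedule = [backoff_delay(a, jitter_ratio=0.0) for a in range(3)]
print(schedule)  # [5.0, 10.0, 20.0]
```

With the default jitter_ratio of 0.5, each value falls somewhere between the deterministic delay and 1.5 times that delay.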

Configuration Options

RetryBackoffConfig Fields

Option	Type	Default	Description
`base_delay`	`float`	`5.0`	Base delay in seconds for the first retry.
`max_delay`	`float`	`120.0`	Upper cap on any single backoff (after jitter).
`jitter_ratio`	`float`	`0.5`	Adds `uniform(0, jitter_ratio × delay)` on top of exponential delay. Set `0.0` to disable jitter.
`max_retries`	`int`	`3`	Maximum number of retries (so up to 4 total attempts by default).

Validation — the constructor raises ValueError if:

base_delay <= 0
max_delay < base_delay
jitter_ratio outside [0, 1]
max_retries < 0
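
The four validation rules can be illustrated with a small stand-in dataclass. BackoffConfigSketch and is_valid are hypothetical names used only so the example runs without the library installed; the real RetryBackoffConfig may word its error messages differently.

```python
from dataclasses import dataclass


@dataclass
class BackoffConfigSketch:
    """Hypothetical stand-in that applies the documented validation rules."""
    base_delay: float = 5.0
    max_delay: float = 120.0
    jitter_ratio: float = 0.5
    max_retries: int = 3

    def __post_init__(self):
        if self.base_delay <= 0:
            raise ValueError("base_delay must be > 0")
        if self.max_delay < self.base_delay:
            raise ValueError("max_delay must be >= base_delay")
        if not 0 <= self.jitter_ratio <= 1:
            raise ValueError("jitter_ratio must be within [0, 1]")
        if self.max_retries < 0:
            raise ValueError("max_retries must be >= 0")


def is_valid(**fields):
    """Return True if the given fields pass validation, else False."""
    try:
        BackoffConfigSketch(**fields)
        return True
    except ValueError:
        return False


print(is_valid())                  # True
print(is_valid(max_delay=1.0))     # False: below the default base_delay of 5.0
print(is_valid(jitter_ratio=1.5))  # False: outside [0, 1]
```

Note that lowering max_delay below 5.0 without also lowering base_delay fails the second rule, which is easy to trip over when tuning for interactive use.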

Precedence

# Bool — enable with defaults
agent = Agent(name="A", instructions="...", retry=True)

# Dict — constructed from field names
agent = Agent(name="A", instructions="...", retry={"max_retries": 5, "base_delay": 2.0})

# RetryBackoffConfig — used directly
from praisonaiagents import Agent, RetryBackoffConfig

agent = Agent(
    name="A",
    instructions="...",
    retry=RetryBackoffConfig(max_retries=5, base_delay=2.0),
)

# None (default) — no retry
agent = Agent(name="A", instructions="...")
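
One way to picture how the four accepted forms could collapse into a single config is the sketch below. It is an assumption about the shape of the logic, not the library's code: resolve_retry and BackoffConfigSketch are hypothetical names, and treating retry=False the same as None is a guess that the page does not confirm.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BackoffConfigSketch:
    """Hypothetical stand-in carrying the documented defaults."""
    base_delay: float = 5.0
    max_delay: float = 120.0
    jitter_ratio: float = 0.5
    max_retries: int = 3


def resolve_retry(retry) -> Optional[BackoffConfigSketch]:
    """Normalise the `retry` argument to a config object or None."""
    if retry is None or retry is False:  # False handling is an assumption
        return None
    if retry is True:
        return BackoffConfigSketch()  # all defaults
    if isinstance(retry, dict):
        return BackoffConfigSketch(**retry)  # unspecified fields keep defaults
    if isinstance(retry, BackoffConfigSketch):
        return retry  # used directly
    raise TypeError(f"unsupported retry value: {retry!r}")


cfg = resolve_retry({"max_retries": 5, "base_delay": 2.0})
print(cfg.max_retries, cfg.base_delay, cfg.max_delay)  # 5 2.0 120.0
```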

Common Patterns

Rate-limit friendly long jobs

from praisonaiagents import Agent, RetryBackoffConfig

agent = Agent(
    name="Batch Processor",
    instructions="Process a large batch of documents",
    retry=RetryBackoffConfig(
        base_delay=1.0,
        max_delay=300.0,
        jitter_ratio=0.5,
        max_retries=10,
    ),
)
agent.start("Summarise all 500 documents in the queue")

Strict mode — fail fast

from praisonaiagents import Agent, RetryBackoffConfig

agent = Agent(
    name="Fast Checker",
    instructions="Quick health check",
    retry=RetryBackoffConfig(max_retries=1),
)

Reproducible tests — disable jitter

from praisonaiagents import Agent, RetryBackoffConfig

agent = Agent(
    name="Test Agent",
    instructions="Deterministic for testing",
    retry=RetryBackoffConfig(jitter_ratio=0.0),
)

Observe retries with a hook

from praisonaiagents import Agent, RetryBackoffConfig
from praisonaiagents.hooks import HookRegistry, HookEvent

registry = HookRegistry()

@registry.on(HookEvent.ON_RETRY)
def on_retry(event):
    print(
        f"[retry] attempt {event.attempt + 1}/{event.max_retries} "
        f"in {event.delay_seconds:.1f}s: {event.error_message[:80]}"
    )

agent = Agent(
    name="API Caller",
    instructions="Call the API",
    retry=RetryBackoffConfig(max_retries=5),
    hooks=registry,
)
agent.start("Fetch the latest data")

The OnRetry hook receives:

Field	Type	Description
`attempt`	`int`	Current attempt number (0-based)
`max_retries`	`int`	Configured max retries
`delay_seconds`	`float`	Seconds the agent will sleep before the next attempt
`error_message`	`str`	String representation of the failing `LLMError`
`operation`	`str`	`"llm_request"` (sync) or `"async_llm_request"` (async)
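
For structured logging, the fields above map naturally onto a flat record. The helper below is a sketch: retry_log_record is a hypothetical name, and the sample event is a made-up SimpleNamespace standing in for the real hook payload so the snippet runs on its own.

```python
from types import SimpleNamespace


def retry_log_record(event):
    """Build a flat dict from the documented OnRetry fields."""
    return {
        "operation": event.operation,
        "retry": event.attempt + 1,  # attempt is 0-based
        "max_retries": event.max_retries,
        "delay_seconds": round(event.delay_seconds, 1),
        "error": event.error_message[:80],  # keep log lines short
    }


# Made-up sample values, for illustration only.
sample = SimpleNamespace(
    attempt=0,
    max_retries=5,
    delay_seconds=6.27,
    error_message="Rate limit exceeded",
    operation="llm_request",
)
record = retry_log_record(sample)
print(record["retry"], record["delay_seconds"])  # 1 6.3
```

Inside a real hook, the same function could be called on the event argument and the result passed to a logger or metrics client.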

Best Practices

Start with retry=True, tune later

The defaults (base_delay=5.0, max_delay=120.0, jitter_ratio=0.5, max_retries=3) are well-suited to most OpenAI and Anthropic rate-limit patterns. Start with retry=True and only tune when you observe systematic timeouts or excessive waiting.

Don't disable jitter in production

Setting jitter_ratio=0.0 creates a deterministic schedule that is useful for tests but dangerous in production. When many agents share the same API key and all retry at the same second, they hammer the endpoint simultaneously — exactly what jitter prevents. Keep jitter_ratio at 0.3 or higher in production.
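
The effect is easy to see in a small simulation. The sketch below assumes 100 agents that all fail at the same instant and computes each one's first retry delay from the documented formula; first_retry_delays is a hypothetical helper, and the seed is fixed only to make the run repeatable.

```python
import random


def first_retry_delays(n_agents, jitter_ratio, base_delay=5.0, seed=0):
    """First-retry delay for each of n_agents that failed simultaneously."""
    rng = random.Random(seed)
    return [
        base_delay + rng.uniform(0, jitter_ratio * base_delay)
        for _ in range(n_agents)
    ]


no_jitter = first_retry_delays(100, jitter_ratio=0.0)
with_jitter = first_retry_delays(100, jitter_ratio=0.5)

# Without jitter every agent retries at exactly the same moment.
print(len(set(no_jitter)))  # 1

# With jitter the retries are spread across a window of up to 2.5 s.
print(min(with_jitter), max(with_jitter))
```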

Cap max_delay for user-facing flows

A 120-second wait is acceptable for background batch jobs but not when a human is waiting for a response. For interactive agents, set max_delay to something like 20.0 or 30.0, and keep max_retries low (1–2).
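
To judge whether a configuration suits an interactive flow, it helps to add up the sleeps across all retries. The sketch below applies the documented formula; wait_bounds is a hypothetical helper, and the totals cover backoff sleep only, not the time spent in the LLM requests themselves.

```python
def wait_bounds(base_delay=5.0, max_delay=120.0, jitter_ratio=0.5,
                max_retries=3):
    """Return (minimum, maximum) total backoff sleep if every retry is used."""
    capped = [min(base_delay * 2 ** a, max_delay) for a in range(max_retries)]
    lower = sum(capped)                                # no jitter drawn
    upper = sum(d * (1 + jitter_ratio) for d in capped)  # maximum jitter drawn
    return lower, upper


print(wait_bounds())                                # (35.0, 52.5)
print(wait_bounds(max_delay=20.0, max_retries=2))   # (15.0, 22.5)
```

With the defaults, a request that exhausts all retries can keep a user waiting for up to about 52 seconds of sleep alone, while the interactive settings cut that to under 23 seconds.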

Use the OnRetry hook for observability, not control flow

The OnRetry hook is the right place to log metrics and send alerts. Retries are best-effort — if all attempts fail, the original LLMError propagates to your caller. Build your resilience strategy around catching that exception in your application code, not inside the hook.
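
The pattern of retrying inside and catching outside can be sketched without the library. Everything here is a stand-in: LLMErrorSketch mimics an error with an is_retryable flag, call_with_retry is a simplified loop rather than the agent's real one, and the sleep is a no-op by default so the example finishes instantly.

```python
class LLMErrorSketch(Exception):
    """Hypothetical stand-in for an LLM error with an is_retryable flag."""

    def __init__(self, message, is_retryable):
        super().__init__(message)
        self.is_retryable = is_retryable


def call_with_retry(call, max_retries=3, sleep=lambda seconds: None):
    """Try `call` up to max_retries + 1 times, re-raising the last error."""
    for attempt in range(max_retries + 1):
        try:
            return call()
        except LLMErrorSketch as err:
            # Non-retryable errors and the final attempt propagate unchanged.
            if not err.is_retryable or attempt == max_retries:
                raise
            sleep(5.0 * 2 ** attempt)  # jitter omitted for brevity


calls = []


def always_rate_limited():
    calls.append(1)
    raise LLMErrorSketch("rate limit", is_retryable=True)


# Application code: the resilience decision lives here, not in a hook.
try:
    call_with_retry(always_rate_limited)
    result = "ok"
except LLMErrorSketch as err:
    result = f"fallback: {err}"

print(len(calls), result)  # 4 fallback: rate limit
```

The caller sees a single exception after four attempts and decides what to do next, for example returning a cached answer or surfacing a friendly error.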

Related

Tool Retry Policy

Retry tool calls — a different surface from LLM call retry.

Structured LLM Errors

Which LLMError categories are classified as retryable.

Hook Events

The OnRetry event and all other lifecycle hooks.

Agent Retry Strategies

Strategy guidance for production retry patterns.
