Quick Start
How It Works
| Step | What happens |
|---|---|
| 1 | Agent calls tool via _execute_tool_with_circuit_breaker (sync) or execute_tool_async (async) |
| 2 | On error, _classify_error_type tags it: timeout, rate_limit, connection_error, or unknown |
| 3 | _get_tool_retry_policy resolves the active policy (tool > agent > default) |
| 4 | If policy.should_retry(error_type, attempt) is true, wait policy.get_delay_ms(attempt) ms and retry |
| 5 | HookEvent.ON_RETRY fires before each retry with the new OnRetryInput fields |
Precedence Ladder
Tool-level (highest priority):Choosing a Retry Policy
Configuration Options
| Option | Type | Default | Description |
|---|---|---|---|
max_attempts | int | 3 | Total attempts including the first try |
initial_delay_ms | int | 1000 | Delay before first retry, in milliseconds |
backoff_factor | float | 2.0 | Multiplier applied to delay per attempt |
retry_on | set[str] | {"timeout","rate_limit","connection_error"} | Error types that trigger a retry |
jitter | bool | False | Add randomized jitter to delays |
jitter_factor | float | 0.25 | Jitter range as fraction of delay (±25%) |
max_delay_ms | int | 30000 | Maximum delay between retries |
approval_denied,permission_denied,approval_error,circuit_open- Python exceptions:
ValueError,TypeError,AttributeErrorfrom tool code
Common Patterns
Per-tool override for unreliable API
YAML configuration
CLI usage
Hook Integration
Monitor retry attempts with hooks:OnRetryInput:
tool_name: Name of the failing toolattempt: Current attempt number (1-based)max_attempts: Maximum attempts configureddelay_ms: Delay before this retry in millisecondserror_type: Classified error type (timeout,rate_limit, etc.)error: Original exception object
Best Practices
Keep max_attempts small (3-5)
Keep max_attempts small (3-5)
Large retry counts mask real failures. If a tool fails 10+ times, there’s likely a deeper issue that retrying won’t solve. Use monitoring instead.
Always set jitter=True for rate-limited APIs
Always set jitter=True for rate-limited APIs
Without jitter, multiple agents retrying simultaneously create a “thundering herd” that can overwhelm rate-limited services. Jitter spreads out retry attempts.
Set narrower retry_on for expensive tools
Set narrower retry_on for expensive tools
Don’t retry LLM tools on
connection_error if every attempt costs money. Use specific error types that indicate transient failures.Use tool-level override sparingly
Use tool-level override sparingly
Agent-level retry policy keeps configuration DRY. Only override at the tool level for genuinely special cases like unreliable third-party APIs.
Related
Tool Configuration
Consolidated tool configuration with ToolConfig
Concurrency
Parallel tool execution and timeouts
Hooks
Monitor and intercept agent behavior
Hook Events
Complete reference of hook events

