Quick Start
Channel supervision is automatically enabled for all gateway channels configured in gateway.yaml. No additional setup is required.
Basic Gateway Setup
Create a simple gateway with supervision:
The telegram channel is now under supervision with unlimited retry capability.
How It Works
Channel supervision provides resilient error handling through error classification and unlimited retries:
| Component | Responsibility |
|---|---|
| ChannelSupervisor | Manages channel lifecycle and error handling |
| BackoffPolicy | Controls retry timing with capped exponential backoff |
| Error Classification | Determines if errors are recoverable, fatal, or conflict |
| Operator Controls | Provides manual pause/resume/reconnect capabilities |
Channel States
The supervision system tracks four distinct channel states:
| State | Description | Auto-Retry | Operator Actions |
|---|---|---|---|
| RUNNING | Channel is actively connected and serving messages | N/A | pause, reconnect |
| FAILED | Fatal error occurred (e.g., Telegram conflict, invalid token) | ❌ No | reconnect only |
| PAUSED | Manually paused by operator | ❌ No | resume, reconnect |
| STOPPED | Clean shutdown or initial state | ❌ No | Automatic restart |
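The state table above can be restated as a small lookup, which is handy when validating an operator request before sending it. This is a minimal sketch, not the gateway's own code: the names `ChannelState`, `OPERATOR_ACTIONS`, and `is_action_allowed` are hypothetical, and the lowercase state values are taken from the strings the /health endpoint reports.

```python
from enum import Enum


class ChannelState(Enum):
    """The four supervision states (hypothetical enum, values as reported by /health)."""
    RUNNING = "running"
    FAILED = "failed"
    PAUSED = "paused"
    STOPPED = "stopped"


# Operator actions per state, copied from the table above.
# STOPPED has no operator action listed because it restarts automatically.
OPERATOR_ACTIONS = {
    ChannelState.RUNNING: {"pause", "reconnect"},
    ChannelState.FAILED: {"reconnect"},
    ChannelState.PAUSED: {"resume", "reconnect"},
    ChannelState.STOPPED: set(),
}


def is_action_allowed(state: ChannelState, action: str) -> bool:
    """Return True if the table lists `action` for a channel in `state`."""
    return action in OPERATOR_ACTIONS[state]
```

For example, `is_action_allowed(ChannelState.FAILED, "resume")` is false, which matches the rule that a failed channel accepts reconnect only.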
Operator Controls
Pause Channel
Temporarily stop a channel without losing configuration, using either the CLI or the REST API.
The channel enters the PAUSED state and stops processing messages. The supervision loop waits indefinitely until resumed.
Resume Channel
Resume a manually paused channel, using either the CLI or the REST API.
The channel moves from PAUSED to STOPPED, then automatically restarts to RUNNING.
Reconnect Channel
Force a complete reconnection and reset error state, using either the CLI or the REST API.
Reconnect is the only operator action available for a channel in FAILED.
Error Classification
The supervision system classifies errors to determine retry behavior:
| Error Type | Examples | Behavior | Recovery |
|---|---|---|---|
| Recoverable | Network timeouts, DNS failures, temporary API errors | Unlimited retry with exponential backoff | Automatic |
| Conflict | Telegram “Conflict: terminated by other getUpdates” | Immediate failure, no retry | Manual reconnect after stopping duplicate |
| Non-Recoverable | Invalid bot token, missing permissions | Immediate failure, no retry | Manual reconnect after fixing config |
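The classification table above can be illustrated with a small string-matching classifier. This is a sketch under stated assumptions: the page does not show how the gateway actually classifies errors, so the marker substrings, the names `ErrorClass`, `classify_error`, and `should_retry`, and the choice to treat every unmatched error as recoverable are all hypothetical.

```python
from enum import Enum


class ErrorClass(Enum):
    RECOVERABLE = "recoverable"
    CONFLICT = "conflict"
    NON_RECOVERABLE = "non_recoverable"


# Hypothetical substrings chosen to match the examples in the table above;
# the real classifier's rules are not documented here.
CONFLICT_MARKERS = ("terminated by other getupdates",)
NON_RECOVERABLE_MARKERS = ("invalid bot token", "missing permissions")


def classify_error(message: str) -> ErrorClass:
    """Classify an error message by case-insensitive substring match."""
    text = message.lower()
    if any(marker in text for marker in CONFLICT_MARKERS):
        return ErrorClass.CONFLICT
    if any(marker in text for marker in NON_RECOVERABLE_MARKERS):
        return ErrorClass.NON_RECOVERABLE
    # Assumption: anything unmatched (timeouts, DNS failures, temporary
    # API errors) is treated as recoverable in this sketch.
    return ErrorClass.RECOVERABLE


def should_retry(error_class: ErrorClass) -> bool:
    """Only recoverable errors are retried; the other two fail immediately."""
    return error_class is ErrorClass.RECOVERABLE
```

Defaulting unknown errors to recoverable favors availability; a stricter deployment might prefer the opposite default so that unexpected errors surface quickly.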
Recoverable errors are retried under the following backoff policy:
- Initial delay: 5 seconds
- Maximum delay: 300 seconds (5 minutes)
- Unlimited attempts for recoverable errors
- Jitter added to prevent thundering herd
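The backoff settings above can be sketched as a delay function. Only the 5-second initial delay and the 300-second cap come from this page. The doubling factor, the 10% jitter fraction, the choice to subtract jitter so the cap is never exceeded, and the function names are assumptions made for illustration.

```python
import random

INITIAL_DELAY = 5.0    # seconds, from the settings above
MAXIMUM_DELAY = 300.0  # seconds, from the settings above


def base_delay(attempt: int) -> float:
    """Capped exponential delay for a 0-based attempt number (assumes doubling)."""
    # Clamp the exponent so an unlimited attempt counter cannot overflow a float.
    exponent = min(max(attempt, 0), 32)
    return min(INITIAL_DELAY * (2 ** exponent), MAXIMUM_DELAY)


def backoff_delay(attempt: int, jitter_fraction: float = 0.1) -> float:
    """Base delay minus up to `jitter_fraction` of itself (assumed jitter scheme)."""
    delay = base_delay(attempt)
    return delay - random.uniform(0.0, delay * jitter_fraction)
```

Under these assumptions the base delays run 5, 10, 20, 40, 80, 160 seconds and then stay at 300 for every later attempt. Jitter spreads simultaneous retries from many channels over a short window rather than letting them fire at the same instant.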
Monitoring via /health
The enhanced health endpoint includes supervision status for each channel:
The response reports the following fields for each channel:
- state: Current channel state (running, failed, paused, stopped)
- last_error: Most recent error message (if any)
- last_error_time: Unix timestamp of last error
- next_retry_at: Unix timestamp of next retry attempt (if scheduled)
- total_recoveries: Count of successful recoveries from errors
- manual_pause: Whether channel is manually paused by operator
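A consumer of the fields above might pick out the channels that need operator attention. This is a rough sketch only. The field names come from this page, but the response layout (a top-level `channels` object keyed by channel name) is an assumption, the sample values are made up for illustration, and `channels_needing_attention` is a hypothetical helper.

```python
from __future__ import annotations

import json

# Made-up sample payload; the real nesting of the /health response may differ.
SAMPLE_HEALTH = json.loads("""
{
  "channels": {
    "telegram": {
      "state": "failed",
      "last_error": "Conflict: terminated by other getUpdates",
      "last_error_time": 1700000000,
      "next_retry_at": null,
      "total_recoveries": 3,
      "manual_pause": false
    }
  }
}
""")


def channels_needing_attention(health: dict) -> dict[str, str]:
    """Map channel name to last_error for every channel reporting state "failed"."""
    failed = {}
    for name, info in health.get("channels", {}).items():
        if info.get("state") == "failed":
            failed[name] = info.get("last_error") or ""
    return failed
```

With the sample payload, the helper returns a single entry for the telegram channel carrying its conflict message, which is the information an operator needs before deciding to reconnect.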
Best Practices
When to pause vs reconnect
Use pause for temporary investigations while keeping the channel configuration intact. Use reconnect when you need to reset error state after fixing underlying issues like network connectivity or API tokens.
Reading total_recoveries as a churn signal
High total_recoveries counts indicate frequent connection issues. Monitor this metric to identify unstable network conditions or platform-specific problems that may require infrastructure changes.
Hooking /health into monitoring systems
The /health endpoint is designed for integration with Prometheus, Datadog, or other monitoring systems. Set up alerts on state: "failed" and track total_recoveries trends to detect degrading connection quality.
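The alerting advice above can be sketched as a pair of checks run on each scrape of /health. This is illustrative only: the six-recoveries-per-hour threshold, the function names, and the handling of a counter that goes backwards (treated here as zero new recoveries) are assumptions, not documented gateway behavior.

```python
from __future__ import annotations


def recoveries_per_hour(previous_total: int, current_total: int,
                        interval_seconds: float) -> float:
    """Rate of new recoveries between two samples of total_recoveries."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    # Assumption: a smaller current value means the counter was reset,
    # so the interval is counted as having no new recoveries.
    delta = max(current_total - previous_total, 0)
    return delta * 3600.0 / interval_seconds


def alert_reasons(state: str, rate_per_hour: float,
                  max_rate_per_hour: float = 6.0) -> list[str]:
    """Return human-readable reasons to alert; an empty list means healthy."""
    reasons = []
    if state == "failed":
        reasons.append("channel is failed")
    if rate_per_hour > max_rate_per_hour:
        reasons.append("recovery rate above threshold")
    return reasons
```

Alerting on the rate rather than the raw counter keeps a long-lived channel with a large but stable total_recoveries from paging anyone.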
Recovering from FAILED state
Channels in FAILED state require manual intervention. Use reconnect (not resume) to reset the error state and attempt a fresh connection. Always check last_error and address the root cause before reconnecting.
Related
- Gateway CLI: Complete CLI reference for gateway management
- Gateway Error Handling: Error handling strategies for gateway bots

