Channel supervision keeps gateway bots alive through network outages with unlimited retries and operator-level pause / resume / reconnect controls.

Quick Start

Channel supervision is automatically enabled for all gateway channels configured in gateway.yaml. No additional setup is required.
1. Basic Gateway Setup

Create a simple gateway with supervision:
# gateway.yaml
agents:
  assistant:
    instructions: "You are a helpful AI assistant."
    model: "gpt-4o-mini"

channels:
  telegram:
    token: "${TELEGRAM_BOT_TOKEN}"
    platform: telegram
Start the gateway:
praisonai gateway start --config gateway.yaml
The telegram channel is now under supervision with unlimited retry capability.
2. Control Channel Operations

Pause a problematic channel while investigating issues:
praisonai gateway pause telegram
Resume when ready:
praisonai gateway resume telegram
Force reconnect to reset error state:
praisonai gateway reconnect telegram

How It Works

Channel supervision provides resilient error handling through error classification and unlimited retries:
| Component | Responsibility |
|---|---|
| ChannelSupervisor | Manages channel lifecycle and error handling |
| BackoffPolicy | Controls retry timing with capped exponential backoff |
| Error Classification | Determines if errors are recoverable, fatal, or conflict |
| Operator Controls | Provides manual pause/resume/reconnect capabilities |
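How the components fit together can be sketched as a single retry loop. This is a minimal illustration, not the actual ChannelSupervisor: the function name `supervise` and its parameters (`connect`, `is_recoverable`, `delay_for`, `sleep`) are hypothetical and chosen only to show the control flow described above.

```python
import asyncio
from typing import Awaitable, Callable


async def supervise(
    connect: Callable[[], Awaitable[None]],
    is_recoverable: Callable[[Exception], bool],
    delay_for: Callable[[int], float],
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> str:
    """Run `connect` until it exits cleanly or raises a non-recoverable error.

    All names here are illustrative assumptions, not PraisonAI APIs.
    """
    attempt = 0
    while True:
        try:
            await connect()      # blocks for as long as the channel is running
            return "stopped"     # clean shutdown
        except Exception as exc:
            if not is_recoverable(exc):
                return "failed"  # no auto-retry; an operator must reconnect
            # Recoverable error: back off, then retry with no attempt limit.
            await sleep(delay_for(attempt))
            attempt += 1
```

Passing the classifier and the delay function as arguments keeps the loop itself small and makes it easy to exercise with a fake channel.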

Channel States

The supervision system tracks four distinct channel states:
| State | Description | Auto-Retry | Operator Actions |
|---|---|---|---|
| RUNNING | Channel is actively connected and serving messages | N/A | pause, reconnect |
| FAILED | Fatal error occurred (e.g., Telegram conflict, invalid token) | ❌ No | reconnect only |
| PAUSED | Manually paused by operator | ❌ No | resume, reconnect |
| STOPPED | Clean shutdown or initial state | ❌ No | Automatic restart |
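The state table can be encoded as a small lookup, which is handy when building tooling around the gateway. This is a sketch only: `ChannelState`, `OPERATOR_ACTIONS`, and `allowed` are hypothetical names, and the lowercase values mirror the `state` strings that the health endpoint section lists.

```python
from enum import Enum
from typing import Dict, FrozenSet


class ChannelState(Enum):
    RUNNING = "running"
    FAILED = "failed"
    PAUSED = "paused"
    STOPPED = "stopped"


# Operator actions per state, copied from the table above.
# STOPPED lists none because the table says it restarts automatically.
OPERATOR_ACTIONS: Dict[ChannelState, FrozenSet[str]] = {
    ChannelState.RUNNING: frozenset({"pause", "reconnect"}),
    ChannelState.FAILED: frozenset({"reconnect"}),
    ChannelState.PAUSED: frozenset({"resume", "reconnect"}),
    ChannelState.STOPPED: frozenset(),
}


def allowed(state: ChannelState, action: str) -> bool:
    """Return True if the table lists `action` for `state`."""
    return action in OPERATOR_ACTIONS[state]
```

For example, `allowed(ChannelState.FAILED, "resume")` is false, which matches the guidance that a failed channel needs reconnect rather than resume.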

Operator Controls

Pause Channel

Temporarily stop a channel without losing configuration:
praisonai gateway pause telegram --url ws://127.0.0.1:8765
Effect: The channel enters the PAUSED state and stops processing messages. The supervision loop waits indefinitely until the channel is resumed.

Resume Channel

Resume a manually paused channel:
praisonai gateway resume telegram --url ws://127.0.0.1:8765
Effect: The channel transitions from PAUSED to STOPPED, and the supervisor then restarts it automatically so that it returns to RUNNING.

Reconnect Channel

Force a complete reconnection and reset error state:
praisonai gateway reconnect telegram --url ws://127.0.0.1:8765
Effect: Resets the retry counter, clears the error history, and forces a restart. Works from any state, including FAILED.
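The effects of the three operator controls can be modelled on a plain status record. This is an illustrative sketch, not the gateway's internals: `ChannelStatus` and its fields are hypothetical, and two behaviours are assumptions rather than documented facts: that resume is rejected outside PAUSED, and that reconnect leaves the channel in STOPPED so the supervisor's automatic restart picks it up.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ChannelStatus:
    """Hypothetical per-channel record; field names are illustrative."""
    state: str = "stopped"
    retry_count: int = 0
    error_history: List[str] = field(default_factory=list)
    manual_pause: bool = False


def pause(ch: ChannelStatus) -> None:
    # Channel stops processing messages until an operator resumes it.
    ch.state = "paused"
    ch.manual_pause = True


def resume(ch: ChannelStatus) -> None:
    # Assumption: resume only applies to a manually paused channel.
    if ch.state != "paused":
        raise ValueError(f"cannot resume from state {ch.state!r}; use reconnect")
    ch.state = "stopped"  # the supervisor then restarts it automatically
    ch.manual_pause = False


def reconnect(ch: ChannelStatus) -> None:
    # Works from any state: reset counters, clear errors, force a restart.
    ch.retry_count = 0
    ch.error_history.clear()
    ch.manual_pause = False
    ch.state = "stopped"  # assumption: restart happens via the STOPPED path
```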

Error Classification

The supervision system classifies errors to determine retry behavior:
| Error Type | Examples | Behavior | Recovery |
|---|---|---|---|
| Recoverable | Network timeouts, DNS failures, temporary API errors | Unlimited retry with exponential backoff | Automatic |
| Conflict | Telegram “Conflict: terminated by other getUpdates” | Immediate failure, no retry | Manual reconnect after stopping duplicate |
| Non-Recoverable | Invalid bot token, missing permissions | Immediate failure, no retry | Manual reconnect after fixing config |
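A classifier along these lines could be sketched as follows. The source does not describe how the gateway actually recognises each error type, so the string matching, the `ErrorKind` enum, and the `classify` function are all assumptions made for illustration. In particular, treating every unrecognised error as recoverable is a design choice of this sketch, not documented behaviour.

```python
from enum import Enum


class ErrorKind(Enum):
    RECOVERABLE = "recoverable"
    CONFLICT = "conflict"
    NON_RECOVERABLE = "non_recoverable"


def classify(exc: Exception) -> ErrorKind:
    """Map an exception to one of the three error types (illustrative only)."""
    message = str(exc).lower()

    # Telegram reports a second poller as "Conflict: terminated by other getUpdates".
    if "conflict" in message and "getupdates" in message:
        return ErrorKind.CONFLICT

    # Credential and permission problems will not fix themselves on retry.
    if isinstance(exc, PermissionError) or "invalid token" in message:
        return ErrorKind.NON_RECOVERABLE

    # Assumption: anything else (timeouts, DNS failures, temporary API errors)
    # is treated as transient and retried.
    return ErrorKind.RECOVERABLE
```

Only the recoverable kind would feed the retry loop; the other two would move the channel to FAILED and wait for a manual reconnect.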
The retry policy uses capped exponential backoff:
  • Initial delay: 5 seconds
  • Maximum delay: 300 seconds (5 minutes)
  • Unlimited attempts for recoverable errors
  • Jitter added to prevent thundering herd
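The retry timing above can be worked through in a few lines. The 5-second initial delay and 300-second cap come from the list; the doubling factor, the 10% jitter range, and the function name `retry_delay` are assumptions, since the source does not state the multiplier or how jitter is applied. Jitter is subtracted here so that the result never exceeds the documented maximum.

```python
import random

INITIAL_DELAY = 5.0   # seconds, from the policy above
MAX_DELAY = 300.0     # seconds, from the policy above
MAX_EXPONENT = 16     # guard: attempts are unlimited, so bound the power of two


def retry_delay(attempt: int, jitter: float = 0.1, rng=random) -> float:
    """Delay before retry number `attempt` (0-based), in seconds.

    Assumes a doubling factor and subtractive jitter; both are illustrative.
    """
    exponent = min(attempt, MAX_EXPONENT)
    base = min(INITIAL_DELAY * (2 ** exponent), MAX_DELAY)
    # Shave off up to `jitter` of the delay so simultaneous failures spread out.
    return base * (1.0 - rng.uniform(0.0, jitter))
```

With jitter disabled, the sequence is 5, 10, 20, 40, 80, 160, and then 300 seconds for every later attempt.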

Monitoring via /health

The /health endpoint reports supervision status for each channel:
curl http://127.0.0.1:8765/health
Key supervision fields:
  • state: Current channel state (running, failed, paused, stopped)
  • last_error: Most recent error message (if any)
  • last_error_time: Unix timestamp of last error
  • next_retry_at: Unix timestamp of next retry attempt (if scheduled)
  • total_recoveries: Count of successful recoveries from errors
  • manual_pause: Whether channel is manually paused by operator
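A monitoring script might read these fields as sketched below. The JSON layout is an assumption: the source lists the field names but not the response structure, so the top-level `"channels"` mapping keyed by channel name is hypothetical, as are the helper names `fetch_health` and `failed_channels`. Adjust the lookup to match the real response.

```python
import json
from typing import Any, Dict, Optional
from urllib.request import urlopen


def fetch_health(url: str = "http://127.0.0.1:8765/health", timeout: float = 5.0) -> Dict[str, Any]:
    """Fetch and decode the health payload."""
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def failed_channels(payload: Dict[str, Any]) -> Dict[str, Optional[str]]:
    """Return {channel name: last_error} for channels whose state is "failed".

    Assumes payload["channels"] maps names to supervision fields (hypothetical).
    """
    channels = payload.get("channels", {})
    return {
        name: info.get("last_error")
        for name, info in channels.items()
        if info.get("state") == "failed"
    }
```

Calling `failed_channels(fetch_health())` against a running gateway would then give the channels that need a manual reconnect, together with the error to investigate first.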

Best Practices

Use pause for temporary investigations while keeping the channel configuration intact. Use reconnect when you need to reset error state after fixing underlying issues like network connectivity or API tokens.
High total_recoveries counts indicate frequent connection issues. Monitor this metric to identify unstable network conditions or platform-specific problems that may require infrastructure changes.
The /health endpoint is designed for integration with Prometheus, Datadog, or other monitoring systems. Set up alerts on state: "failed" and track total_recoveries trends to detect degrading connection quality.
Channels in FAILED state require manual intervention. Use reconnect (not resume) to reset the error state and attempt a fresh connection. Always investigate the last_error to address root cause issues before reconnecting.

Gateway CLI

Complete CLI reference for gateway management

Gateway Error Handling

Error handling strategies for gateway bots