Inbound Dead-Letter Queue

TL;DR: When agent.chat() fails (LLM 5xx, timeout, rate limit), the user's message is normally lost. Pass dlq=InboundDLQ(...) to BotSessionManager and PraisonAI persists the failed message so you can replay it later.

Why you want this

No silent data loss

Failed inbound messages are persisted to a SQLite file before the exception bubbles up.

Operator-friendly replay

A single CLI command (praisonai bot dlq replay) re-runs failed messages through the agent.

Bounded by design

TTL + max_size keep the queue from growing unbounded; oldest entries evict first.

Zero new dependency

Uses only stdlib sqlite3. Default OFF — your existing bots are untouched.

How it flows

Quick start (3 lines)

from praisonai.bots import BotSessionManager, InboundDLQ

dlq = InboundDLQ(path="~/.praisonai/dlq.sqlite")
mgr = BotSessionManager(platform="telegram", dlq=dlq)
# ↑ that's it — failed agent.chat() now lands in the DLQ

Default behaviour is unchanged when no dlq= is passed. This is fully opt-in.

CLI

# List failed messages (newest first)
praisonai bot dlq list

# List from a custom path
praisonai bot dlq list --path /var/lib/myapp/dlq.sqlite --limit 50

# Replay through your bot's configured agent
praisonai bot dlq replay --config bot.yaml

# Purge everything (asks for confirmation)
praisonai bot dlq purge
praisonai bot dlq purge --yes  # skip confirmation

API reference

path: str | Path, required. Where the SQLite file lives. Parent directories are created automatically.

max_size: int, default 10_000. Maximum number of entries kept. When exceeded, oldest entries are dropped first.

ttl_seconds: int, default 604800 (7 days). Entries older than this are evicted on the next enqueue() or evict_expired().
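
To make the interplay of max_size and ttl_seconds concrete, here is a minimal sketch of how a bounded SQLite queue could evict on every enqueue. It is not PraisonAI's actual implementation: the table schema and the make_queue, enqueue, and size helpers are hypothetical and exist only for illustration.

```python
import sqlite3
import time


def make_queue(path=":memory:"):
    """Create a minimal queue table (hypothetical schema, illustration only)."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS dlq ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "ts REAL NOT NULL, prompt TEXT NOT NULL)"
    )
    return conn


def enqueue(conn, prompt, *, max_size, ttl_seconds, now=None):
    """Insert one entry, then apply TTL eviction and the size cap."""
    now = time.time() if now is None else now
    conn.execute("INSERT INTO dlq (ts, prompt) VALUES (?, ?)", (now, prompt))
    # TTL: drop anything older than the cutoff.
    conn.execute("DELETE FROM dlq WHERE ts < ?", (now - ttl_seconds,))
    # Size cap: keep only the newest max_size rows, so the oldest go first.
    conn.execute(
        "DELETE FROM dlq WHERE id NOT IN "
        "(SELECT id FROM dlq ORDER BY id DESC LIMIT ?)",
        (max_size,),
    )
    conn.commit()


def size(conn):
    """Return the number of entries currently stored."""
    return conn.execute("SELECT COUNT(*) FROM dlq").fetchone()[0]


conn = make_queue()
for i in range(5):
    enqueue(conn, f"msg {i}", max_size=3, ttl_seconds=60, now=1000.0 + i)

# Only the three newest entries survive the size cap.
remaining = [row[0] for row in conn.execute("SELECT prompt FROM dlq ORDER BY id")]
```

Passing now explicitly keeps the sketch deterministic. A later enqueue with a timestamp more than ttl_seconds past the stored entries would remove them all and leave only the new row.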

`DLQEntry`

id: int — primary key, monotonic.
ts: float — UNIX time of failure.
platform: str — bot platform (telegram, discord, etc).
user_id: str — platform user id.
prompt: str — the original user message.
chat_id: str, thread_id: str, user_name: str — metadata when known.
error: str — the error string that caused the failure.
attempts: int — how many times replay has been attempted (and failed).

Methods

Inspect

dlq.size()                  # int
dlq.list(limit=100)         # list[DLQEntry], newest first

Mutate

dlq.enqueue(
    platform="telegram", user_id="12345",
    prompt="hi", error="LLM 503",
)
dlq.purge()                 # delete all
dlq.evict_expired()         # drop entries older than ttl

Replay

async def handler(entry):
    try:
        await mgr.chat(agent, entry.user_id, entry.prompt,
                       chat_id=entry.chat_id,
                       user_name=entry.user_name)
        return True   # success → entry deleted
    except Exception:
        return False  # keep entry, increment attempts

succeeded, failed = await dlq.replay(handler)

Real LLM smoke test

[1] Sending failing message: 'What is 2 plus 2? Answer with a single digit.'
   Caught expected error: simulated LLM 503
   DLQ size after fail: 1  ✅

[2] Replaying DLQ via real LLM …
   succeeded=1, failed=0, remaining=0

[Real LLM reply] 4

PASS: DLQ → replay → real LLM produced expected '4'.

Operational notes

Disk usage — every failed message + its prompt is written to disk. With chronic LLM outages this can grow fast. Tune max_size and ttl_seconds for your retention policy.

Thread safety — every write is guarded by an internal threading.Lock. SQLite WAL is enabled. Safe to share one InboundDLQ instance across threads.
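
As an illustration of the pattern described here (one lock around every write, WAL journal mode), the following sketch shares a single SQLite connection across four threads. LockedWriter is a hypothetical class written for this example and does not reflect InboundDLQ's internals.

```python
import os
import sqlite3
import tempfile
import threading


class LockedWriter:
    """Hypothetical lock-guarded SQLite writer, for illustration only."""

    def __init__(self, path):
        # check_same_thread=False lets several threads share this connection;
        # the lock below is what actually serialises access to it.
        self._conn = sqlite3.connect(path, check_same_thread=False)
        self._lock = threading.Lock()
        # Request write-ahead logging; SQLite reports the mode it ended up in.
        self.journal_mode = self._conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS dlq (id INTEGER PRIMARY KEY, prompt TEXT)"
        )
        self._conn.commit()

    def enqueue(self, prompt):
        with self._lock:
            self._conn.execute("INSERT INTO dlq (prompt) VALUES (?)", (prompt,))
            self._conn.commit()

    def size(self):
        with self._lock:
            return self._conn.execute("SELECT COUNT(*) FROM dlq").fetchone()[0]


path = os.path.join(tempfile.mkdtemp(), "dlq.sqlite")
writer = LockedWriter(path)


def worker(n):
    for i in range(25):
        writer.enqueue(f"worker {n} message {i}")


threads = [threading.Thread(target=worker, args=(n,)) for n in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

total = writer.size()  # 4 workers x 25 messages each
```

The lock matters because a single sqlite3 connection is not meant to be used by several threads at once; WAL mainly helps separate readers (for example a CLI listing the queue) proceed while the bot is writing.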

Backward compatible — BotSessionManager(...) without dlq= behaves exactly as before. No behaviour change for existing bots.

Combining with other features

With Cross-Platform Mirror (W1)

The DLQ records platform and user_id. If W1's IdentityResolver is wired, the same user_id resolves to the same human across platforms. Replay restores the exact session.

With BackoffPolicy (resilience)

For transient failures use praisonai.bots._resilience.BackoffPolicy to retry inline before falling back to the DLQ. The DLQ is the last resort, not the first.
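
The "retry first, dead-letter last" ordering can be sketched as follows. This deliberately does not use BackoffPolicy, whose API is not shown on this page; chat_with_fallback and ListDLQ are hypothetical helpers, and only the enqueue() keyword arguments follow the Methods section above.

```python
import time


class ListDLQ:
    """Stand-in queue with an enqueue() shaped like the one documented above."""

    def __init__(self):
        self.entries = []

    def enqueue(self, *, platform, user_id, prompt, error):
        self.entries.append(
            {"platform": platform, "user_id": user_id, "prompt": prompt, "error": error}
        )


def chat_with_fallback(send, prompt, *, dlq, platform, user_id,
                       retries=3, base_delay=0.5, sleep=time.sleep):
    """Try send(prompt) up to `retries` times with exponential backoff,
    then dead-letter the message if every attempt failed."""
    last_error = None
    for attempt in range(retries):
        try:
            return send(prompt)
        except Exception as exc:
            last_error = exc
            if attempt < retries - 1:
                sleep(base_delay * (2 ** attempt))  # 0.5s, 1.0s, ...
    dlq.enqueue(platform=platform, user_id=user_id,
                prompt=prompt, error=str(last_error))
    return None


calls = {"n": 0}


def flaky(prompt):
    # Fails twice, then answers: a transient outage.
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("LLM 503")
    return "4"


def always_down(prompt):
    raise RuntimeError("LLM 503")


delays = []  # record requested sleeps instead of actually waiting
dlq = ListDLQ()
reply = chat_with_fallback(flaky, "2+2?", dlq=dlq, platform="telegram",
                           user_id="12345", sleep=delays.append)
lost = chat_with_fallback(always_down, "hi", dlq=dlq, platform="telegram",
                          user_id="12345", sleep=delays.append)
```

The transient failure is absorbed by the retries and never touches the queue; only the message that exhausts every attempt is dead-lettered.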

With observability

Wrap dlq.enqueue() with your tracer (e.g. OTEL span) to alert on DLQ growth. A non-zero dlq.size() is a great SLO trip-wire.
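
One way to wire such a trip-wire, sketched with a plain callback instead of a real tracer so it runs without extra packages: ObservedDLQ and FakeDLQ are hypothetical names, and the wrapper assumes only that the wrapped object exposes enqueue() and size() as listed in the Methods section.

```python
class ObservedDLQ:
    """Wraps a queue and reports its size after each enqueue (hypothetical helper)."""

    def __init__(self, inner, on_growth, threshold=0):
        self._inner = inner
        self._on_growth = on_growth
        self._threshold = threshold

    def enqueue(self, **fields):
        result = self._inner.enqueue(**fields)
        current = self._inner.size()
        if current > self._threshold:
            # Replace this callback with a span event, metric, or pager call.
            self._on_growth(current)
        return result

    def __getattr__(self, name):
        # Delegate everything else (list, purge, replay, ...) to the wrapped queue.
        return getattr(self._inner, name)


class FakeDLQ:
    """Minimal stand-in so the sketch runs without PraisonAI installed."""

    def __init__(self):
        self._items = []

    def enqueue(self, **fields):
        self._items.append(fields)

    def size(self):
        return len(self._items)


alerts = []
observed = ObservedDLQ(FakeDLQ(), on_growth=alerts.append)
observed.enqueue(platform="telegram", user_id="12345", prompt="hi", error="LLM 503")
observed.enqueue(platform="telegram", user_id="67890", prompt="yo", error="timeout")
```

With the default threshold of 0 every enqueue fires the callback; raising the threshold turns the same wrapper into an alert that only triggers once the backlog passes a chosen size.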

Paid upgrade path

OSS (now): File-backed SQLite DLQ for single-host deploys.

Cloud (planned): Multi-region replicated DLQ with web dashboard, automatic alerting, and one-click bulk replay.
