Quick Start
Auto-detect (Default)
Choosing the Right Method
| Method | Streams | Display | Best For |
|---|---|---|---|
| start() (auto-detect) | 🎯 Auto | ✅ Auto | Recommended; works everywhere |
| start(stream=True) | ✅ Yes | ✅ Auto | Force streaming, interactive chat |
| iter_stream() | ✅ Always | ❌ No | App integration, custom UIs |
| run() | ❌ No | ❌ No | Production, batch processing |
| chat(stream=True) | Configurable | Configurable | Low-level control |
Common Patterns
Terminal Streaming
App Integration with iter_stream()
Best for integrating into your own application: it yields raw chunks with no display overhead.
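A minimal sketch of consuming raw chunks in application code. It assumes the stream is any iterable of text chunks; `forward_chunks` and `fake_iter_stream` are hypothetical names used only for illustration, and the stand-in generator takes the place of a real `agent.iter_stream(...)` call, whose exact signature is not shown here.

```python
from typing import Callable, Iterable, Iterator


def forward_chunks(chunks: Iterable[str], sink: Callable[[str], object]) -> str:
    """Push each chunk to `sink` as soon as it arrives and return the full text."""
    parts = []
    for chunk in chunks:
        sink(chunk)          # e.g. a WebSocket send, a UI update, or a queue put
        parts.append(chunk)  # keep a copy so the caller also gets the final text
    return "".join(parts)


def fake_iter_stream(prompt: str) -> Iterator[str]:
    """Hypothetical stand-in for a streaming call; yields three fixed chunks."""
    yield from ["Echo", ": ", prompt]


if __name__ == "__main__":
    # Print without newlines so chunks appear as one continuous line.
    full = forward_chunks(fake_iter_stream("hi"), lambda c: print(c, end="", flush=True))
    print()
```

Because the helper only depends on an iterable, the same function can be exercised in tests with a fake stream and pointed at a real stream in production.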
Streaming with Callbacks
Hook into every streaming event for fine-grained control.
FastAPI SSE Integration
Pipe streaming tokens directly to a web client using Server-Sent Events.
Async Streaming
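For an async consumer, piping chunks to a web client as Server-Sent Events can be sketched as an async generator that wraps each chunk in an SSE frame. This is an illustration under stated assumptions: `sse_frames` and `fake_async_stream` are hypothetical helpers, the stand-in stream replaces a real async agent call, and the closing `done` event is a convention chosen for this example rather than something the SDK is known to emit.

```python
import asyncio
from typing import AsyncIterator


async def sse_frames(chunks: AsyncIterator[str]) -> AsyncIterator[str]:
    """Wrap each text chunk in a Server-Sent Events frame."""
    async for chunk in chunks:
        if not chunk:
            continue  # an empty data payload would not be dispatched by clients
        # Every line of a payload needs its own "data:" prefix;
        # a blank line terminates the frame.
        lines = chunk.split("\n")
        yield "".join(f"data: {line}\n" for line in lines) + "\n"
    # Hypothetical end-of-stream marker so the client knows when to stop reading.
    yield "event: done\ndata: [DONE]\n\n"


async def fake_async_stream() -> AsyncIterator[str]:
    """Hypothetical stand-in for an async streaming call."""
    for piece in ["Hel", "", "lo\nworld"]:
        await asyncio.sleep(0)  # yield control, as a real network read would
        yield piece


async def collect_frames() -> list:
    return [frame async for frame in sse_frames(fake_async_stream())]


if __name__ == "__main__":
    for frame in asyncio.run(collect_frames()):
        print(repr(frame))
```

In a FastAPI route, a generator like this could be handed to `StreamingResponse` with `media_type="text/event-stream"`.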
Streaming with Tools
When your agent uses tools, streaming happens in two phases: the initial response that decides to call tools, and a follow-up response that synthesizes the tool results.
Error Handling in the Stream
If the LLM call fails after retries, the stream ends with a visible error sentence instead of silently dropping. The sentinel string you receive contains these parts:
| Part | Meaning |
|---|---|
| ref: followup-<timestamp> | Correlation ID logged server-side; share this when reporting issues |
| Please retry | Retries already ran internally; another attempt may succeed if the root cause was transient |
| reducing prompt size | Common root cause is context-length or provider capacity errors |
The initial LLM call and the follow-up LLM call (after tool execution) now share the same retry and rate-limiting behavior, so users no longer need to add their own retry wrapper around streaming + tools.
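A hedged sketch of detecting the error sentinel when consuming a stream. The regular expression assumes the sentinel has the shape `[Error: <message> (ref: <id>)<optional hint>]` at the very end of the output, inferred from the fragments quoted in this document; the exact wording may differ, and `split_sentinel`, `StreamError`, and the sample ref value are hypothetical.

```python
import re
from typing import Iterable, NamedTuple, Optional, Tuple

# Assumed shape: "[Error: <message> (ref: <id>)<optional hint>]" at end of stream.
SENTINEL = re.compile(
    r"\[Error: (?P<message>.*?)\(ref: (?P<ref>[^)\s]+)\)(?P<hint>[^\]]*)\]\s*$",
    re.DOTALL,
)


class StreamError(NamedTuple):
    message: str
    ref: str


def split_sentinel(chunks: Iterable[str]) -> Tuple[str, Optional[StreamError]]:
    """Join the chunks and separate normal text from a trailing error sentinel."""
    text = "".join(chunks)
    match = SENTINEL.search(text)
    if match is None:
        return text, None
    error = StreamError(match.group("message").strip(), match.group("ref"))
    return text[: match.start()].rstrip(), error


if __name__ == "__main__":
    sample = [
        "Partial answer. ",
        "[Error: Failed to generate final response after tool execution ",
        "(ref: followup-1700000000)]",  # made-up ref value for the demo
    ]
    text, error = split_sentinel(sample)
    if error is not None:
        print(f"stream failed, ref={error.ref}")
```

Matching on the joined text rather than on individual chunks avoids missing a sentinel that is split across chunk boundaries.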
StreamEvent Protocol
Every streaming chunk emits a StreamEvent with full context.
| Event | When |
|---|---|
| REQUEST_START | Before API call |
| HEADERS_RECEIVED | HTTP 200 arrives |
| FIRST_TOKEN | First content delta (TTFT marker) |
| DELTA_TEXT | Each text chunk |
| DELTA_TOOL_CALL | Tool call streaming |
| LAST_TOKEN | Final content delta |
| STREAM_END | Stream completed |
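The event sequence in the table can be illustrated with a small callback that dispatches on event type. This sketch does not use the SDK's own classes: `EventType` and `Event` are local stand-ins that mirror the event names above, and the `timestamp` and `content` fields are assumptions about what an event carries. It also assumes FIRST_TOKEN acts purely as a marker and that text arrives via DELTA_TEXT.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import List, Optional


class EventType(Enum):
    """Local stand-in mirroring the event names in the table above."""
    REQUEST_START = auto()
    HEADERS_RECEIVED = auto()
    FIRST_TOKEN = auto()
    DELTA_TEXT = auto()
    DELTA_TOOL_CALL = auto()
    LAST_TOKEN = auto()
    STREAM_END = auto()


@dataclass
class Event:
    """Hypothetical event shape; the field names are assumptions."""
    type: EventType
    timestamp: float
    content: Optional[str] = None


class TextCollector:
    """Callback that accumulates text deltas and records TTFT."""

    def __init__(self) -> None:
        self.parts: List[str] = []
        self.request_started: Optional[float] = None
        self.ttft: Optional[float] = None
        self.done = False

    def __call__(self, event: Event) -> None:
        if event.type is EventType.REQUEST_START:
            self.request_started = event.timestamp
        elif event.type is EventType.FIRST_TOKEN:
            if self.request_started is not None:
                self.ttft = event.timestamp - self.request_started
        elif event.type is EventType.DELTA_TEXT:
            if event.content:
                self.parts.append(event.content)
        elif event.type is EventType.STREAM_END:
            self.done = True
        # Other event types are ignored by this collector.

    @property
    def text(self) -> str:
        return "".join(self.parts)
```

Keeping the callback a plain callable object makes it easy to feed synthetic events in tests before wiring it to a real stream.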
Metrics
Track Time To First Token (TTFT) and throughput.
| Metric | Description |
|---|---|
| TTFT | Time from request to first token (provider latency) |
| Stream Duration | From first to last token |
| Total Time | End-to-end request time |
| Tokens/s | Token generation rate |
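To make the four metrics concrete, here is a rough sketch that derives them from wall-clock timestamps around any chunk iterator. It is not the SDK's metrics implementation: `time_stream` and `StreamTiming` are hypothetical names, and the rate counts chunks rather than tokens, which is only an approximation because providers may batch several tokens into one chunk.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class StreamTiming:
    ttft: Optional[float]               # request start -> first chunk
    stream_duration: Optional[float]    # first chunk -> last chunk
    total_time: float                   # request start -> stream exhausted
    chunks_per_second: Optional[float]  # chunk count / stream_duration


def time_stream(
    chunks: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> StreamTiming:
    """Consume `chunks` and measure timing; `clock` is injectable for tests."""
    start = clock()
    first = last = None
    count = 0
    for _ in chunks:
        now = clock()
        if first is None:
            first = now
        last = now
        count += 1
    end = clock()
    if first is None:
        # Nothing was streamed, so only the total time is meaningful.
        return StreamTiming(None, None, end - start, None)
    duration = last - first
    rate = count / duration if duration > 0 else None
    return StreamTiming(first - start, duration, end - start, rate)
```

Injecting the clock keeps the measurement logic deterministic under test while defaulting to `time.perf_counter` in normal use.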
Key Concepts
Time To First Token (TTFT)
Streaming vs Non-Streaming
| Mode | Behavior | Use Case |
|---|---|---|
| stream=None (default) | Try streaming, fall back to non-streaming if unsupported | Recommended; works across all providers |
| stream=True | Force streaming (errors on sync adapters that don’t support it) | When you definitely want tokens |
| stream=False | Force non-streaming | Batch jobs, structured output, sync providers |
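The three modes in the table amount to a tri-state flag. The sketch below shows one way such a fallback could be structured; it is not the SDK's actual code. `StreamingUnsupported`, `complete`, and the two stand-in adapter functions are hypothetical, and the real error type raised by a sync-only adapter is not specified in this document.

```python
from typing import Callable, Iterable, Iterator, Optional


class StreamingUnsupported(Exception):
    """Hypothetical error raised by an adapter that cannot stream."""


def complete(
    prompt: str,
    stream: Optional[bool],
    stream_fn: Callable[[str], Iterable[str]],
    full_fn: Callable[[str], str],
) -> str:
    """Tri-state dispatch: None = try then fall back, True = force, False = never."""
    if stream is False:
        return full_fn(prompt)
    try:
        # Consume the stream inside the try block so lazily raised errors are caught.
        return "".join(stream_fn(prompt))
    except StreamingUnsupported:
        if stream is True:
            raise  # the caller explicitly asked for streaming
        return full_fn(prompt)  # stream=None: quiet fallback


def sync_only_stream(prompt: str) -> Iterator[str]:
    """Stand-in for an adapter without sync streaming support."""
    raise StreamingUnsupported("sync streaming not available")
    yield  # unreachable; makes this a generator so the error is raised lazily


def working_stream(prompt: str) -> Iterator[str]:
    """Stand-in for an adapter that streams normally."""
    yield from ["streamed:", prompt]


def full_response(prompt: str) -> str:
    """Stand-in for a non-streaming completion call."""
    return "full:" + prompt
```

With these stand-ins, `complete("hi", None, sync_only_stream, full_response)` takes the fallback path, while passing `stream=True` re-raises the error.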
Sync vs Async Adapters: Async methods (achat, astart, _execute_unified_achat_completion) still default to stream=True because async adapters universally support streaming. Sync methods (chat, start, run) use the new smart-fallback default. Some adapters (e.g., the sync OpenAI/Deepseek adapter) currently do NOT support sync streaming and will trigger the fallback.
CLI Usage
Best Practices
Let the SDK pick streaming mode
Omit the stream argument (or pass stream=None) and the SDK will choose streaming where supported and silently fall back where it isn’t. Only override when you have a specific reason.
Use iter_stream() for app integration
iter_stream() yields raw chunks with zero display overhead, which makes it ideal for piping into FastAPI, WebSocket, or custom UIs.
Use start(stream=True) for terminal
start() handles display automatically. Pass stream=True for real-time token output in interactive sessions.
Monitor TTFT for performance
High TTFT indicates model or network issues. Use StreamMetrics to track and optimize.
Handle errors in callbacks
There are two layers of error handling. Callback exceptions are still caught by the emitter to avoid breaking the stream, so log them inside your callback. LLM call failures, however, are now retried automatically and, on persistent failure, surface as a visible [Error: ... (ref: ...)] sentence at the end of the stream; check for this sentinel when consuming iter_stream().
Troubleshooting
"Streaming seems to buffer before showing anything"
This is TTFT, not buffering. The model is generating the first token. Check:
- Model complexity (larger models have higher TTFT)
- Prompt length (longer prompts take longer to process)
- Network latency to the API
"Tokens appear in chunks, not one at a time"
Normal. Providers may batch tokens for efficiency.
"Stream ends with [Error: Failed to generate final response after tool execution (ref: followup-...)]"
The follow-up LLM call (the one that synthesizes tool results into a final answer) failed after the built-in retries. Common causes:
- Persistent rate limit: pair streaming with a Rate Limiter at higher RPM, or back off the caller.
- Context-length overflow: reduce conversation history or tool-result size.
- Provider outage: include the ref: ID when reporting. The internal log line (ref=..., model=..., error=...) makes it searchable.
"Streaming is not supported in sync OpenAIAdapter" / Deepseek multi-agent crash
Fixed in PraisonAI 4.6.47+ (PR #1734). Earlier versions defaulted sync chat to stream=True, which crashed on sync-only providers like Deepseek. Upgrade, or pass stream=False explicitly if you can’t.
Related
Output & Display
Output formatting options
Async
Async agent execution
Rate Limiter
Control request rates across initial and follow-up LLM calls

