Quick Start
Choosing the Right Method
| Method | Streams | Display | Best For |
|---|---|---|---|
| start(stream=True) | ✅ Yes | ✅ Auto | Terminal, interactive chat |
| iter_stream() | ✅ Always | ❌ No | App integration, custom UIs |
| run() | ❌ No | ❌ No | Production, batch processing |
| chat(stream=True) | Configurable | Configurable | Low-level control |
Common Patterns
Terminal Streaming
App Integration with iter_stream()
Best for integrating into your own application — yields raw chunks with no display overhead.
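A minimal sketch of the consuming side, assuming iter_stream() returns an iterator of text chunks. `fake_iter_stream` is a hypothetical stand-in so the example runs without the library or an API key; `forward_stream` and `sink` are illustrative names, not part of the library.

```python
from typing import Callable, Iterable, Iterator, List


def fake_iter_stream(prompt: str) -> Iterator[str]:
    """Hypothetical stand-in for an iter_stream(prompt) call: yields text chunks."""
    for chunk in ["Hello", ", ", "world", "!"]:
        yield chunk


def forward_stream(chunks: Iterable[str], sink: Callable[[str], None]) -> str:
    """Push each chunk to the application as it arrives and return the full text."""
    parts: List[str] = []
    for chunk in chunks:
        sink(chunk)          # e.g. write to a WebSocket or update a UI widget
        parts.append(chunk)  # keep a copy so the full response is available at the end
    return "".join(parts)


received: List[str] = []
full_text = forward_stream(fake_iter_stream("Say hello"), received.append)
```

Because the consumer only depends on "something iterable that yields strings", the same function should work with the real stream once the stand-in is swapped out.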
Streaming with Callbacks
Hook into every streaming event for fine-grained control.
FastAPI SSE Integration
Pipe streaming tokens directly to a web client using Server-Sent Events.
Async Streaming
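As a sketch of how an async token stream might be re-emitted as Server-Sent Events, the example below wraps each chunk in an SSE frame. `fake_async_stream` is a hypothetical stand-in for the library's async stream, and `sse_frame` and `sse_stream` are illustrative helpers; only the `data:` framing itself comes from the SSE wire format.

```python
import asyncio
from typing import AsyncIterator, List


async def fake_async_stream() -> AsyncIterator[str]:
    """Hypothetical stand-in for an async token stream."""
    for chunk in ["Hi", " there", "\nbye"]:
        await asyncio.sleep(0)  # hand control back to the event loop, as a network read would
        yield chunk


def sse_frame(data: str) -> str:
    """Encode one payload as a Server-Sent Events frame."""
    # Every payload line gets its own "data:" field; a blank line terminates the frame.
    return "".join(f"data: {line}\n" for line in data.split("\n")) + "\n"


async def sse_stream(chunks: AsyncIterator[str]) -> AsyncIterator[str]:
    """Re-emit each incoming chunk as an SSE frame."""
    async for chunk in chunks:
        yield sse_frame(chunk)


async def collect() -> List[str]:
    return [frame async for frame in sse_stream(fake_async_stream())]


frames = asyncio.run(collect())
```

In a FastAPI route, an async generator like `sse_stream(...)` would typically be returned through `StreamingResponse` with `media_type="text/event-stream"`.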
StreamEvent Protocol
Every streaming chunk emits a StreamEvent with full context.
| Event | When |
|---|---|
| REQUEST_START | Before API call |
| HEADERS_RECEIVED | HTTP 200 arrives |
| FIRST_TOKEN | First content delta (TTFT marker) |
| DELTA_TEXT | Each text chunk |
| DELTA_TOOL_CALL | Tool call streaming |
| LAST_TOKEN | Final content delta |
| STREAM_END | Stream completed |
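To illustrate how a callback could dispatch on the events in the table above, here is a self-contained sketch. The `StreamEventType` enum and `StreamEvent` dataclass are hypothetical shapes that only mirror the event names; the real class may expose different fields. The sketch also assumes that text arrives only in DELTA_TEXT events and that FIRST_TOKEN and LAST_TOKEN are markers.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import List, Optional


class StreamEventType(Enum):
    """Hypothetical enum mirroring the event names in the table."""
    REQUEST_START = auto()
    HEADERS_RECEIVED = auto()
    FIRST_TOKEN = auto()
    DELTA_TEXT = auto()
    DELTA_TOOL_CALL = auto()
    LAST_TOKEN = auto()
    STREAM_END = auto()


@dataclass
class StreamEvent:
    """Hypothetical event shape: a type, a timestamp in seconds, optional text."""
    type: StreamEventType
    timestamp: float
    content: Optional[str] = None


class EventCollector:
    """Callback object that accumulates text and records TTFT."""

    def __init__(self) -> None:
        self.chunks: List[str] = []
        self.request_start: Optional[float] = None
        self.ttft: Optional[float] = None
        self.finished = False

    def __call__(self, event: StreamEvent) -> None:
        if event.type is StreamEventType.REQUEST_START:
            self.request_start = event.timestamp
        elif event.type is StreamEventType.FIRST_TOKEN:
            if self.request_start is not None:
                self.ttft = event.timestamp - self.request_start
        elif event.type is StreamEventType.DELTA_TEXT:
            if event.content:
                self.chunks.append(event.content)
        elif event.type is StreamEventType.STREAM_END:
            self.finished = True


# Simulated event sequence in the order the table describes.
T = StreamEventType
simulated = [
    StreamEvent(T.REQUEST_START, 0.0),
    StreamEvent(T.HEADERS_RECEIVED, 0.25),
    StreamEvent(T.FIRST_TOKEN, 0.5),
    StreamEvent(T.DELTA_TEXT, 0.5, "Hel"),
    StreamEvent(T.DELTA_TEXT, 0.75, "lo"),
    StreamEvent(T.LAST_TOKEN, 0.75),
    StreamEvent(T.STREAM_END, 1.0),
]
collector = EventCollector()
for event in simulated:
    collector(event)
text = "".join(collector.chunks)
```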
Metrics
Track Time To First Token (TTFT) and throughput.
| Metric | Description |
|---|---|
| TTFT | Time from request to first token (provider latency) |
| Stream Duration | From first to last token |
| Total Time | End-to-end request time |
| Tokens/s | Token generation rate |
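The four metrics reduce to simple arithmetic over timestamps. The sketch below uses a hypothetical `StreamTimings` dataclass (not the library's StreamMetrics) and assumes Tokens/s is measured over the stream duration rather than the total time, which the table does not specify.

```python
from dataclasses import dataclass


@dataclass
class StreamTimings:
    """Hypothetical container for timestamps (in seconds) captured during one request."""
    request_start: float
    first_token: float
    last_token: float
    stream_end: float
    token_count: int

    @property
    def ttft(self) -> float:
        # Time from request to first token
        return self.first_token - self.request_start

    @property
    def stream_duration(self) -> float:
        # From first to last token
        return self.last_token - self.first_token

    @property
    def total_time(self) -> float:
        # End-to-end request time
        return self.stream_end - self.request_start

    @property
    def tokens_per_second(self) -> float:
        # Assumption: rate is measured over the streaming window only
        if self.stream_duration <= 0:
            return 0.0
        return self.token_count / self.stream_duration


timings = StreamTimings(
    request_start=10.0,
    first_token=10.5,
    last_token=12.5,
    stream_end=12.75,
    token_count=100,
)
```

With these sample values, TTFT is 0.5 s, the stream lasts 2.0 s, the total time is 2.75 s, and the rate is 50 tokens/s.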
Key Concepts
Time To First Token (TTFT)
Streaming vs Non-Streaming
| Mode | Behavior | Use Case |
|---|---|---|
| stream=True | Tokens appear as generated | Interactive chat, real-time display |
| stream=False | Complete response at once | Batch processing, structured output |
CLI Usage
Best Practices
Use iter_stream() for app integration
iter_stream() yields raw chunks with zero display overhead, which makes it ideal for piping into FastAPI, WebSocket, or custom UIs.
Use start(stream=True) for terminal
start() handles display automatically. Pass stream=True for real-time token output in interactive sessions.
Monitor TTFT for performance
High TTFT indicates model or network issues. Use StreamMetrics to track and optimize.
Handle errors in callbacks
The emitter catches callback exceptions silently to avoid breaking the stream. Log errors inside your callback.
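One way to make sure callback failures are not lost is to wrap the callback in a decorator that logs the exception before the emitter can swallow it. This is a sketch using only the standard library; `log_errors` and `on_event` are hypothetical names, and the dict-shaped event is a placeholder for whatever object the emitter actually passes.

```python
import functools
import logging

logger = logging.getLogger("stream_callbacks")


def log_errors(callback):
    """Wrap a callback so exceptions are logged with a traceback instead of vanishing."""
    @functools.wraps(callback)
    def wrapper(event):
        try:
            return callback(event)
        except Exception:
            # logger.exception records the message plus the active traceback
            logger.exception("Streaming callback %s failed", callback.__name__)
            return None
    return wrapper


@log_errors
def on_event(event):
    # Placeholder logic: raises KeyError when the event has no "text" key
    return event["text"].upper()


ok = on_event({"text": "hi"})
failed = on_event({})  # the KeyError is logged; the stream would keep going
```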
Troubleshooting
"Streaming seems to buffer before showing anything"
This is TTFT, not buffering. The model is generating the first token. Check:
- Model complexity (larger models have higher TTFT)
- Prompt length (longer prompts take longer to process)
- Network latency to the API

