Overview
Streaming enables real-time token delivery from LLM providers, displaying responses as they're generated rather than waiting for the complete response. This creates a more responsive user experience.
Key Concepts
Time To First Token (TTFT)
TTFT is the time between sending a request and receiving the first token. This delay is inherent to LLM generation—you cannot stream tokens before the model produces them.
| Metric | Description |
|---|---|
| TTFT | Time from request to first token (provider latency) |
| Stream Duration | Time from first to last token |
| Total Time | End-to-end request time |
Streaming vs Non-Streaming
| Mode | Behavior | Use Case |
|---|---|---|
| `stream=True` | Tokens appear as generated | Interactive chat, real-time display |
| `stream=False` | Complete response returned at once | Batch processing, structured output |
Basic Usage
Enable Streaming
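The exact call shape depends on the client you're using; as a minimal sketch, assuming a hypothetical `Client.chat()` method that accepts a `stream` flag and yields text chunks:

```python
# Placeholder import: substitute your actual client library.
from my_llm_client import Client  # hypothetical

client = Client()

# With stream=True the call returns an iterator of text chunks
# instead of a single complete response.
for chunk in client.chat("Explain streaming in one sentence.", stream=True):
    print(chunk, end="", flush=True)  # flush so each token renders immediately
print()
```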
Using OutputConfig
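A sketch of the config-based route. `OutputConfig` is named by this page, but the import path, the constructor signature, and the `output=` parameter shown below are assumptions for illustration:

```python
from my_llm_client import Client, OutputConfig  # hypothetical import path

# stream and metrics mirror the flags discussed on this page;
# treat the exact constructor signature as an assumption.
config = OutputConfig(stream=True, metrics=True)

client = Client()
for chunk in client.chat("Hello!", output=config):  # hypothetical parameter name
    print(chunk, end="", flush=True)
print()
```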
CLI Usage
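The command and flag names below are hypothetical placeholders; check your tool's `--help` for the real spelling:

```bash
# Hypothetical command and flags, shown for illustration only.
mycli chat "Explain TTFT" --stream            # interactive, tokens as generated
mycli chat "Summarize notes.txt" --no-stream  # complete response at once
```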
Advanced: StreamEvent Protocol
For programmatic access to streaming events, use the `StreamEvent` protocol:
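The attributes below are a structural sketch of what such a protocol typically exposes; check the actual `StreamEvent` definition in your installation for the real names:

```python
from typing import Iterable, Protocol

class StreamEvent(Protocol):
    """Structural sketch; attribute names are assumptions, not the real API."""
    type: str      # e.g. "stream_start", "content_delta", "stream_end"
    content: str   # token text for content deltas

def consume(events: Iterable[StreamEvent]) -> str:
    """Print content deltas as they arrive and return the full response."""
    parts: list[str] = []
    for event in events:
        if event.type == "content_delta":
            print(event.content, end="", flush=True)
            parts.append(event.content)
    print()
    return "".join(parts)
```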
Understanding Perceived Delays
Why does streaming seem slow?
- TTFT is provider-dependent: The model must process your prompt and begin generating before any tokens arrive. This is not buffering—it’s generation time.
- Network latency: Round-trip time to the API adds to TTFT.
- Response length: Longer responses take longer to stream, but you see progress immediately.
What streaming does NOT do
- Stream tokens before the provider generates them
- Eliminate TTFT (this is inherent to LLM generation)
- Make the total response time faster
What streaming DOES do
- Show tokens immediately as they arrive
- Provide visual feedback during generation
- Enable early termination if needed (see the sketch after this list)
- Improve perceived responsiveness
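Early termination is just breaking out of the consumption loop; the rest of the stream is never requested. A self-contained sketch (the stream here is simulated with a plain iterator):

```python
def consume_until(stream, stop_marker: str, limit: int = 500) -> str:
    """Collect chunks, stopping early at a marker or a length limit."""
    buffer = ""
    for chunk in stream:
        buffer += chunk
        if stop_marker in buffer or len(buffer) >= limit:
            break  # abandoning the iterator ends the stream early
    return buffer

# Any iterable of strings works, e.g. a simulated provider stream:
demo = iter(["Step 1... ", "Step 2... ", "DONE", " trailing tokens"])
print(consume_until(demo, stop_marker="DONE"))
```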
Best Practices
Use `stream=True` for chat
Interactive conversations benefit from immediate feedback
Use `stream=False` for batch
Batch processing doesn’t need streaming overhead
Monitor TTFT
High TTFT may indicate model/network issues
Enable metrics
Use `metrics=True` to track streaming performance
Timing Glossary
| Term | Definition |
|---|---|
| Request Start | Timestamp when API call is initiated |
| Headers Received | When HTTP response headers arrive (200 OK) |
| First Token | First content delta received (TTFT marker) |
| Token Cadence | Rate of token delivery (tokens/second) |
| Last Token | Final content delta received |
| Stream End | Stream processing completed |
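These marks can be captured by hand around any chunk iterator. A runnable sketch using `time.perf_counter` (the provider stream is simulated here):

```python
import time
from typing import Iterable, Iterator

def timed_stream(stream: Iterable[str]) -> Iterator[str]:
    """Yield chunks unchanged while reporting TTFT, duration, and cadence."""
    start = time.perf_counter()      # Request Start
    first = last = None
    count = 0
    for chunk in stream:
        now = time.perf_counter()
        if first is None:
            first = now              # First Token (TTFT marker)
            print(f"TTFT: {first - start:.3f}s")
        last = now                   # Last Token, once the loop finishes
        count += 1
        yield chunk
    if first is not None and last > first:
        print(f"\nStream duration: {last - first:.3f}s")
        print(f"Token cadence: {count / (last - first):.1f} chunks/s")

def fake_stream() -> Iterator[str]:
    """Simulated provider stream for demonstration."""
    for token in ["Hel", "lo, ", "wor", "ld!"]:
        time.sleep(0.05)
        yield token

for tok in timed_stream(fake_stream()):
    print(tok, end="", flush=True)
```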
Troubleshooting
"Streaming seems to buffer before showing anything"
This is TTFT, not buffering. The model is generating the first token. Check:
- Model complexity (larger models have higher TTFT)
- Prompt length (longer prompts take longer to process)
- Network latency to the API
"Tokens appear in chunks, not one at a time"
This is normal. Providers may batch tokens for efficiency. The `refresh_per_second` setting in Rich's Live display also affects visual update frequency.
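For example, with Rich's `Live`, a low `refresh_per_second` makes tokens appear in visual bursts even when they are delivered individually:

```python
import time
from rich.live import Live
from rich.text import Text

text = Text()
# refresh_per_second caps how often the terminal repaints; tokens that
# arrive between repaints are displayed together in one visual chunk.
with Live(text, refresh_per_second=4):
    for token in ["Tokens ", "may ", "appear ", "in ", "visual ", "bursts."]:
        text.append(token)  # simulated token arrival
        time.sleep(0.1)
```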

