# Streaming

> Real-time token streaming for responsive AI interactions

Stream AI responses token-by-token as they're generated, instead of waiting for the complete response.

```mermaid theme={"theme":{"light":"vitesse-light","dark":"vitesse-dark"}}
graph LR
    subgraph "Streaming Flow"
        A[💬 Prompt] --> B[🤖 Agent]
        B --> C[⚡ Stream]
        C --> |token by token| D[📺 Your App]
    end
    
    classDef input fill:#6366F1,stroke:#7C90A0,color:#fff
    classDef agent fill:#F59E0B,stroke:#7C90A0,color:#fff
    classDef output fill:#10B981,stroke:#7C90A0,color:#fff
    
    class A input
    class B,C agent
    class D output
```

## Quick Start

<Steps>
  <Step title="Install">
    ```bash theme={"theme":{"light":"vitesse-light","dark":"vitesse-dark"}}
    pip install praisonaiagents
    ```
  </Step>

  <Step title="Stream Responses">
    ```python theme={"theme":{"light":"vitesse-light","dark":"vitesse-dark"}}
    from praisonaiagents import Agent

    agent = Agent(instructions="You are a helpful assistant")

    for chunk in agent.start("Write a short story", stream=True):
        print(chunk, end="", flush=True)
    ```
  </Step>
</Steps>

***

## Choosing the Right Method

```mermaid theme={"theme":{"light":"vitesse-light","dark":"vitesse-dark"}}
graph TB
    Q{"What's your use case?"} --> T[🖥️ Terminal / Interactive]
    Q --> A[📱 App Integration]
    Q --> P[⚙️ Production / Batch]
    
    T --> S["agent.start()"]
    A --> I["agent.iter_stream()"]
    P --> R["agent.run()"]
    
    S --> |"Streams + displays automatically"| Done[✅]
    I --> |"Yields chunks, no display"| Done
    R --> |"Returns complete result"| Done
    
    classDef question fill:#6366F1,stroke:#7C90A0,color:#fff
    classDef method fill:#F59E0B,stroke:#7C90A0,color:#fff
    classDef done fill:#10B981,stroke:#7C90A0,color:#fff
    
    class Q question
    class T,A,P,S,I,R method
    class Done done
```

| Method               | Streams      | Display      | Best For                     |
| -------------------- | ------------ | ------------ | ---------------------------- |
| `start(stream=True)` | ✅ Yes        | ✅ Auto       | Terminal, interactive chat   |
| `iter_stream()`      | ✅ Always     | ❌ No         | App integration, custom UIs  |
| `run()`              | ❌ No         | ❌ No         | Production, batch processing |
| `chat(stream=True)`  | Configurable | Configurable | Low-level control            |

***

## Common Patterns

### Terminal Streaming

```python theme={"theme":{"light":"vitesse-light","dark":"vitesse-dark"}}
from praisonaiagents import Agent

agent = Agent(instructions="You are a helpful assistant")

# Tokens appear as they arrive
for chunk in agent.start("Explain quantum computing", stream=True):
    print(chunk, end="", flush=True)
```

### App Integration with `iter_stream()`

Use `iter_stream()` to integrate streaming into your own application: it yields raw text chunks and adds no display overhead.

```python theme={"theme":{"light":"vitesse-light","dark":"vitesse-dark"}}
from praisonaiagents import Agent

agent = Agent(instructions="You are a helpful assistant")

full_response = ""
for chunk in agent.iter_stream("Write a haiku"):
    full_response += chunk
    # Send to your UI, WebSocket, or processing pipeline

print(full_response)
```

### Streaming with Callbacks

Hook into every streaming event for fine-grained control.

```python theme={"theme":{"light":"vitesse-light","dark":"vitesse-dark"}}
from praisonaiagents import Agent
from praisonaiagents.streaming import StreamEvent, StreamEventType

def on_event(event: StreamEvent):
    if event.type == StreamEventType.DELTA_TEXT:
        print(event.content, end="", flush=True)
    elif event.type == StreamEventType.FIRST_TOKEN:
        print("⚡ First token received!")
    elif event.type == StreamEventType.STREAM_END:
        print("\n✅ Done!")

agent = Agent(instructions="You are a helpful assistant")
agent.stream_emitter.add_callback(on_event)
# start(stream=True) returns an iterator of chunks; drain it so the stream runs
for _ in agent.start("Tell me a joke", stream=True):
    pass
```

### FastAPI SSE Integration

Pipe streaming tokens directly to a web client using Server-Sent Events.

```python theme={"theme":{"light":"vitesse-light","dark":"vitesse-dark"}}
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from praisonaiagents import Agent

app = FastAPI()

@app.get("/stream")
async def stream_response(prompt: str):
    agent = Agent(instructions="You are a helpful assistant")
    
    def generate():
        for chunk in agent.iter_stream(prompt):
            yield f"data: {chunk}\n\n"
        yield "data: [DONE]\n\n"
    
    return StreamingResponse(generate(), media_type="text/event-stream")
```
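
The endpoint above interpolates each chunk directly into a `data:` line. In the SSE wire format a blank line ends an event, so a chunk that itself contains a newline could lose text or end the event early on the client. The following is a minimal sketch of a framing helper, assuming chunks are plain strings; `format_sse` and `sse_stream` are hypothetical names and are not part of praisonaiagents or FastAPI.

```python
from typing import Iterable, Iterator


def format_sse(data: str) -> str:
    """Frame one payload as a Server-Sent Events message.

    Every line of the payload gets its own "data:" field and a blank line
    terminates the event, so embedded newlines survive the round trip.
    """
    # Normalize line endings first; SSE treats "\r" as a line break too.
    normalized = data.replace("\r\n", "\n").replace("\r", "\n")
    lines = normalized.split("\n")
    return "".join(f"data: {line}\n" for line in lines) + "\n"


def sse_stream(chunks: Iterable[str]) -> Iterator[str]:
    """Wrap an iterator of text chunks as SSE messages, ending with [DONE]."""
    for chunk in chunks:
        if chunk:  # skip empty chunks; clients ignore events with empty data
            yield format_sse(chunk)
    yield format_sse("[DONE]")
```

With a helper like this, the body of `generate()` could be replaced by passing `sse_stream(agent.iter_stream(prompt))` to `StreamingResponse`.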

### Async Streaming

```python theme={"theme":{"light":"vitesse-light","dark":"vitesse-dark"}}
import asyncio
from praisonaiagents import Agent

async def main():
    agent = Agent(instructions="You are a helpful assistant")
    result = await agent.astart("Write a poem", stream=True)
    print(result)

asyncio.run(main())
```

***

## Streaming with Tools

When your agent uses tools, streaming happens in two phases: the initial response that decides to call tools, and a follow-up response that synthesizes the tool results.

```mermaid theme={"theme":{"light":"vitesse-light","dark":"vitesse-dark"}}
sequenceDiagram
    participant U as User
    participant A as Agent  
    participant L as LLM
    participant T as Tools
    
    U->>A: Request with stream=True
    A->>L: Phase 1 (streamed)
    L-->>A: "I'll use tool_name..."
    A->>T: Execute tool_name()
    T-->>A: Tool result
    A->>L: Phase 2 follow-up (streamed) 
    L-->>A: Synthesized response
    A-->>U: Combined stream
    
    Note over L: Both phases use retry-wrapped LLM calls
```

```python theme={"theme":{"light":"vitesse-light","dark":"vitesse-dark"}}
from praisonaiagents import Agent, tool

@tool
def get_weather(city: str) -> str:
    """Get weather for a city."""
    return f"Weather in {city}: 72°F, sunny"

agent = Agent(
    instructions="You are a weather assistant",
    tools=[get_weather]
)

for chunk in agent.start("What's the weather in Paris?", stream=True):
    print(chunk, end="", flush=True)
```

Both phases go through the same retry-wrapped LLM path, so transient rate-limit or network errors are retried automatically without any caller intervention.
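
The library's retry logic is internal and is not shown on this page. As a rough illustration of what a retry-wrapped call generally looks like, here is a generic sketch using exponential backoff. `call_with_retry` and its parameters are hypothetical and do not reflect the actual praisonaiagents implementation or its defaults.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(
    fn: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on any exception with exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: let the caller see the failure
            # Wait base_delay, then 2x, 4x, ... before the next attempt
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

A real implementation would normally retry only errors known to be transient (rate limits, timeouts) rather than every exception.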

***

## Error Handling in the Stream

If the LLM call still fails after the built-in retries, the stream does not end silently: its final text is a visible error message.

The message has the following form; the `ref` value differs for each failure:

```
[Error: Failed to generate final response after tool execution (ref: followup-1713957912345). Please retry. If it continues, try reducing prompt size.]
```

| Part                        | Meaning                                                                                     |
| --------------------------- | ------------------------------------------------------------------------------------------- |
| `ref: followup-<timestamp>` | Correlation ID logged server-side — share this when reporting issues                        |
| `Please retry`              | Retries already ran internally; another attempt may succeed if the root cause was transient |
| `reducing prompt size`      | Common root cause is context-length or provider capacity errors                             |

Detect the error sentinel in your stream consumer:

```python theme={"theme":{"light":"vitesse-light","dark":"vitesse-dark"}}
from praisonaiagents import Agent

agent = Agent(instructions="You are a helpful assistant", tools=[...])

full = ""
for chunk in agent.iter_stream("Research and summarize quantum computing"):
    full += chunk
    print(chunk, end="", flush=True)

if "[Error:" in full and "ref:" in full:
    # Surface the ref to your logs, or retry externally
    print("\n⚠️ Error detected, check logs for correlation ID")
```
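
To log or report the correlation ID, you can extract it from the accumulated text rather than only detecting that an error occurred. This is a minimal sketch that assumes the sentinel always matches the format of the example message above; `find_stream_error_ref` is a hypothetical helper, not a library function.

```python
import re
from typing import Optional

# Pattern derived from the example sentinel above (an assumption about its format):
# "[Error: ... (ref: <id>) ...]"
_ERROR_SENTINEL = re.compile(r"\[Error: [^\]]*?\(ref: (?P<ref>[^)\s]+)\)[^\]]*\]")


def find_stream_error_ref(text: str) -> Optional[str]:
    """Return the correlation ID from an error sentinel, or None if absent."""
    match = _ERROR_SENTINEL.search(text)
    return match.group("ref") if match else None
```

Calling `find_stream_error_ref(full)` after the loop returns the ID to attach to your own log line, or `None` when the stream completed normally.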

<Note>
  The **initial** LLM call and the **follow-up** LLM call (after tool execution) share the same retry and rate-limiting behavior, so you do not need to add your own retry wrapper around streaming with tools.
</Note>

***

## StreamEvent Protocol

Each stage of a streamed request, from the API call to the final token, emits a `StreamEvent` to your registered callbacks.

```mermaid theme={"theme":{"light":"vitesse-light","dark":"vitesse-dark"}}
sequenceDiagram
    participant A as Agent
    participant L as LLM
    participant C as Your Callback
    
    A->>L: Request
    L-->>C: REQUEST_START
    L-->>C: HEADERS_RECEIVED
    L-->>C: FIRST_TOKEN
    loop Token by Token
        L-->>C: DELTA_TEXT
    end
    L-->>C: LAST_TOKEN
    L-->>C: STREAM_END
```

| Event              | When                              |
| ------------------ | --------------------------------- |
| `REQUEST_START`    | Before API call                   |
| `HEADERS_RECEIVED` | HTTP 200 arrives                  |
| `FIRST_TOKEN`      | First content delta (TTFT marker) |
| `DELTA_TEXT`       | Each text chunk                   |
| `DELTA_TOOL_CALL`  | Tool call streaming               |
| `LAST_TOKEN`       | Final content delta               |
| `STREAM_END`       | Stream completed                  |

***

## Metrics

Track Time To First Token (TTFT) and throughput.

```python theme={"theme":{"light":"vitesse-light","dark":"vitesse-dark"}}
from praisonaiagents import Agent
from praisonaiagents.streaming import StreamEvent, StreamEventType, StreamMetrics

metrics = StreamMetrics()

def on_event(event: StreamEvent):
    metrics.update_from_event(event)
    if event.type == StreamEventType.DELTA_TEXT:
        print(event.content, end="", flush=True)

agent = Agent(instructions="You are a helpful assistant")
agent.stream_emitter.add_callback(on_event)
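# start(stream=True) returns an iterator of chunks; drain it so the stream runs
for _ in agent.start("Explain AI briefly", stream=True):
    pass

print(metrics.format_summary())
# Example output: TTFT: 245ms | Stream: 1200ms | Total: 1445ms | Tokens: 150 (125.0/s)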
```

| Metric              | Description                                         |
| ------------------- | --------------------------------------------------- |
| **TTFT**            | Time from request to first token (provider latency) |
| **Stream Duration** | From first to last token                            |
| **Total Time**      | End-to-end request time                             |
| **Tokens/s**        | Token generation rate                               |
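
The arithmetic behind these metrics can be sketched from three timestamps and a token count. This is an illustration only: `StreamTimings` is a hypothetical class, not the library's `StreamMetrics`, and it assumes the token rate is measured over the stream duration, which is consistent with the example figures above (150 tokens over 1200 ms).

```python
from dataclasses import dataclass


@dataclass
class StreamTimings:
    """Hypothetical timing record; all timestamps are integer milliseconds."""

    request_start_ms: int
    first_token_ms: int
    last_token_ms: int
    token_count: int

    @property
    def ttft_ms(self) -> int:
        # Time from sending the request to the first content delta
        return self.first_token_ms - self.request_start_ms

    @property
    def stream_ms(self) -> int:
        # Time from the first token to the last token
        return self.last_token_ms - self.first_token_ms

    @property
    def total_ms(self) -> int:
        # End-to-end time: TTFT plus stream duration
        return self.last_token_ms - self.request_start_ms

    @property
    def tokens_per_second(self) -> float:
        # Assumption: rate is measured over the stream duration, not total time
        if self.stream_ms <= 0:
            return 0.0
        return self.token_count * 1000 / self.stream_ms

    def summary(self) -> str:
        return (
            f"TTFT: {self.ttft_ms}ms | Stream: {self.stream_ms}ms | "
            f"Total: {self.total_ms}ms | "
            f"Tokens: {self.token_count} ({self.tokens_per_second:.1f}/s)"
        )


timings = StreamTimings(request_start_ms=0, first_token_ms=245,
                        last_token_ms=1445, token_count=150)
print(timings.summary())
```

With these inputs the summary reproduces the example figures: 245 ms TTFT plus 1200 ms of streaming gives 1445 ms total, and 150 tokens over 1.2 s gives 125.0 tokens per second.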

***

## Key Concepts

### Time To First Token (TTFT)

```
Request → [TTFT] → First Token → [Streaming] → Last Token → Done
```

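TTFT is the time between sending the request and receiving the first token. It is provider latency: the model must process your prompt before it can generate output. Streaming does not reduce TTFT, but it displays each token as soon as it arrives instead of holding everything back until the response is complete.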

### Streaming vs Non-Streaming

| Mode           | Behavior                   | Use Case                            |
| -------------- | -------------------------- | ----------------------------------- |
| `stream=True`  | Tokens appear as generated | Interactive chat, real-time display |
| `stream=False` | Complete response at once  | Batch processing, structured output |

***

## CLI Usage

```bash theme={"theme":{"light":"vitesse-light","dark":"vitesse-dark"}}
# Stream responses in terminal
praisonai chat --stream "Tell me a joke"

# With verbose output
praisonai chat --stream --verbose "Explain quantum computing"
```

***

## Best Practices

<AccordionGroup>
  <Accordion title="Use iter_stream() for app integration">
    `iter_stream()` yields raw chunks with zero display overhead — ideal for piping into FastAPI, WebSocket, or custom UIs.
  </Accordion>

  <Accordion title="Use start(stream=True) for terminal">
    `start()` handles display automatically. Pass `stream=True` for real-time token output in interactive sessions.
  </Accordion>

  <Accordion title="Monitor TTFT for performance">
    High TTFT usually points to a large model, a long prompt, or network latency. Use `StreamMetrics` to measure it before you optimize.
  </Accordion>

  <Accordion title="Handle errors in callbacks">
    There are two layers of error handling. The emitter catches exceptions raised inside callbacks so that they cannot break the stream, so log them inside your callback. LLM call failures are retried automatically and, if they persist, surface as a visible `[Error: ... (ref: ...)]` message at the end of the stream; check for this sentinel when consuming `iter_stream()`.
  </Accordion>
</AccordionGroup>

***

## Troubleshooting

### "Streaming seems to buffer before showing anything"

This is TTFT, not buffering: the model is still processing the prompt and has not produced its first token yet. Check:

* Model complexity (larger models have higher TTFT)
* Prompt length (longer prompts take longer to process)
* Network latency to the API

### "Tokens appear in chunks, not one at a time"

This is normal. Providers may batch several tokens into one chunk for efficiency, so do not assume that each chunk is exactly one token.
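
Because chunk boundaries are arbitrary, a consumer that needs whole units (for example complete lines) has to regroup the text itself. A minimal sketch, assuming chunks are plain strings such as those yielded by `iter_stream()`; `iter_lines` is a hypothetical helper name.

```python
from typing import Iterable, Iterator


def iter_lines(chunks: Iterable[str]) -> Iterator[str]:
    """Regroup arbitrarily sized text chunks into complete lines."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        # Emit every complete line currently held in the buffer
        while "\n" in buffer:
            line, buffer = buffer.split("\n", 1)
            yield line
    if buffer:
        yield buffer  # trailing text without a final newline


# Stand-in for a real stream: chunk boundaries fall mid-word and mid-line
for line in iter_lines(["Hel", "lo\nwor", "ld", "\n", "end"]):
    print(line)
```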

### "Stream ends with `[Error: Failed to generate final response after tool execution (ref: followup-...)]`"

The follow-up LLM call (the one that synthesizes tool results into a final answer) failed after the built-in retries. Common causes:

* Persistent rate limit — pair streaming with a [Rate Limiter](/docs/features/rate-limiter) at higher RPM, or back off the caller.
* Context-length overflow — reduce conversation history or tool-result size.
* Provider outage — include the `ref:` ID when reporting. The internal log line (`ref=..., model=..., error=...`) makes it searchable.

***

## Related

<CardGroup cols={3}>
  <Card title="Output & Display" icon="display" href="/docs/features/display-system">
    Output formatting options
  </Card>

  <Card title="Async" icon="clock" href="/docs/features/async">
    Async agent execution
  </Card>

  <Card title="Rate Limiter" icon="gauge" href="/docs/features/rate-limiter">
    Control request rates across initial and follow-up LLM calls
  </Card>
</CardGroup>
