
Overview

Streaming enables real-time token delivery from LLM providers, displaying responses as they’re generated rather than waiting for the complete response. This creates a more responsive user experience.

Key Concepts

Time To First Token (TTFT)

TTFT is the time between sending a request and receiving the first token. This delay is inherent to LLM generation—you cannot stream tokens before the model produces them.
Request → [TTFT] → First Token → [Streaming] → Last Token → Done

| Metric | Description |
|---|---|
| TTFT | Time from request to first token (provider latency) |
| Stream Duration | Time from first to last token |
| Total Time | End-to-end request time |
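
To make these metrics concrete, here is a minimal timing sketch that calls the OpenAI Python SDK directly rather than going through PraisonAI (it assumes the openai package is installed and OPENAI_API_KEY is set; the model and prompt are placeholders):

import time
from openai import OpenAI

client = OpenAI()

request_start = time.perf_counter()
stream = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Write a short story"}],
    stream=True,
)

first_token_at = None
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_at is None:
            first_token_at = time.perf_counter()  # TTFT marker
        print(chunk.choices[0].delta.content, end="", flush=True)
last_token_at = time.perf_counter()

print(f"\nTTFT: {(first_token_at - request_start) * 1000:.0f}ms")
print(f"Stream duration: {(last_token_at - first_token_at) * 1000:.0f}ms")
print(f"Total time: {(last_token_at - request_start) * 1000:.0f}ms")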

Streaming vs Non-Streaming

| Mode | Behavior | Use Case |
|---|---|---|
| stream=True | Tokens appear as generated | Interactive chat, real-time display |
| stream=False | Complete response returned at once | Batch processing, structured output |
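
In code the switch is a single flag. The sketch below assumes OutputConfig accepts stream=False as well as stream=True; only stream=True appears elsewhere on this page, so treat the stream=False form as an assumption to verify.

from praisonaiagents import Agent, OutputConfig

# Interactive chat: tokens render as they arrive
chat_agent = Agent(
    instructions="You are a helpful assistant",
    output=OutputConfig(stream=True),
)

# Batch processing: stream=False (assumed) returns the complete response at once
batch_agent = Agent(
    instructions="You are a helpful assistant",
    output=OutputConfig(stream=False),
)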

Basic Usage

Enable Streaming

from praisonaiagents import Agent

# Create agent with streaming enabled
agent = Agent(
    instructions="You are a helpful assistant",
    llm="gpt-4o-mini",
    output="verbose"  # Use output= for display settings (includes streaming)
)

# Tokens appear as they arrive
result = agent.start("Write a short story")

Using OutputConfig

from praisonaiagents import Agent, OutputConfig

agent = Agent(
    instructions="You are a helpful assistant",
    output=OutputConfig(
        stream=True,
        output="verbose",
        metrics=True  # Show timing metrics
    )
)

# Run the agent; with metrics=True, timing metrics are also shown
result = agent.start("Explain streaming in one sentence")

CLI Usage

# Stream responses in terminal
praisonai chat --stream "Tell me a joke"

# With verbose output
praisonai chat --stream --verbose "Explain quantum computing"

Advanced: StreamEvent Protocol

For programmatic access to streaming events, use the StreamEvent protocol:
from praisonaiagents.streaming import (
    StreamEvent,
    StreamEventType,
    StreamMetrics,
    create_logging_callback
)

# Create a metrics-tracking callback
stream_logger, callback = create_logging_callback(
    output="verbose",
    metrics=True
)

# Events emitted during streaming:
# - REQUEST_START: Before API call
# - HEADERS_RECEIVED: When HTTP 200 arrives
# - FIRST_TOKEN: First content delta (TTFT marker)
# - DELTA_TEXT: Each text chunk
# - DELTA_TOOL_CALL: Tool call streaming
# - LAST_TOKEN: Final content delta
# - STREAM_END: Stream completed

# After streaming, get metrics
print(stream_logger.get_metrics_summary())
# Output: TTFT: 450ms | Stream: 2100ms | Total: 2550ms | Tokens: 150 (71.4/s)
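
Beyond the built-in logging callback, you can react to individual events yourself. The sketch below is illustrative only: it assumes the callback receives one StreamEvent per emitted event and that each event exposes a type attribute; check the praisonaiagents.streaming source for the actual callback signature before relying on it.

import time
from praisonaiagents.streaming import StreamEvent, StreamEventType

# Illustrative only: assumes the callback is invoked once per emitted event
# and that each event exposes a `type` attribute matching StreamEventType.
class TTFTTracker:
    def __init__(self) -> None:
        self.request_start: float | None = None

    def __call__(self, event: StreamEvent) -> None:
        now = time.perf_counter()
        if event.type == StreamEventType.REQUEST_START:
            self.request_start = now
        elif event.type == StreamEventType.FIRST_TOKEN and self.request_start:
            print(f"TTFT: {(now - self.request_start) * 1000:.0f}ms")
        elif event.type == StreamEventType.STREAM_END and self.request_start:
            print(f"Total: {(now - self.request_start) * 1000:.0f}ms")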

Understanding Perceived Delays

Why does streaming seem slow?

  1. TTFT is provider-dependent: The model must process your prompt and begin generating before any tokens arrive. This is not buffering—it’s generation time.
  2. Network latency: Round-trip time to the API adds to TTFT.
  3. Response length: Longer responses take longer to stream, but you see progress immediately.

What streaming does NOT do

  • Stream tokens before the provider generates them
  • Eliminate TTFT (this is inherent to LLM generation)
  • Make the total response time faster

What streaming DOES do

  • Show tokens immediately as they arrive
  • Provide visual feedback during generation
  • Enable early termination if needed (see the sketch after this list)
  • Improve perceived responsiveness
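
Early termination, for example, is just stopping consumption of the stream. A minimal sketch using the OpenAI Python SDK directly (not PraisonAI-specific; requires an OPENAI_API_KEY):

from openai import OpenAI

client = OpenAI()
stream = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Write a very long story"}],
    stream=True,
)

collected = ""
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        collected += chunk.choices[0].delta.content
    if len(collected) > 500:
        break  # stop reading; the rest of the response is never consumed

Breaking out of the loop only stops the client-side read; it does not make the model generate the earlier tokens any faster.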

Best Practices

  • Use stream=True for chat: interactive conversations benefit from immediate feedback.
  • Use stream=False for batch: batch processing doesn’t need streaming overhead.
  • Monitor TTFT: high TTFT may indicate model or network issues.
  • Enable metrics: use metrics=True to track streaming performance.

Timing Glossary

| Term | Definition |
|---|---|
| Request Start | Timestamp when API call is initiated |
| Headers Received | When HTTP response headers arrive (200 OK) |
| First Token | First content delta received (TTFT marker) |
| Token Cadence | Rate of token delivery (tokens/second) |
| Last Token | Final content delta received |
| Stream End | Stream processing completed |

Troubleshooting

“Streaming seems to buffer before showing anything”

This is TTFT, not buffering. The model is generating the first token. Check:
  • Model complexity (larger models have higher TTFT)
  • Prompt length (longer prompts take longer to process)
  • Network latency to the API

“Tokens appear in chunks, not one at a time”

This is normal. Providers may batch tokens for efficiency, and the refresh_per_second setting of Rich’s Live display limits how often the terminal repaints, so several tokens can land in a single visual update.
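
As a standalone illustration (using the Rich library directly, not anything PraisonAI-specific), the snippet below appends "tokens" faster than the display refreshes, so several of them appear in one visual update:

import time
from rich.live import Live
from rich.text import Text

text = Text()
# Rich repaints at most refresh_per_second times per second, so tokens that
# arrive between repaints show up together in a single visual update.
with Live(text, refresh_per_second=4):
    for token in ["Streaming ", "looks ", "chunky ", "at ", "low ", "refresh ", "rates."]:
        text.append(token)
        time.sleep(0.05)  # tokens arrive every 50ms, faster than the 4 Hz repaint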

“Non-streaming is faster than streaming”

Total time is similar, but streaming shows progress. Non-streaming waits for the complete response, which can feel slower even if the total time is the same.