Specialized Agents

PraisonAI supports specialized agent types that provide domain-specific capabilities for media processing, document handling, and more. These agents can be used in YAML workflows via the agent: field.

Supported Agent Types

| Agent Type | Purpose | Key Methods |
| --- | --- | --- |
| AudioAgent | Text-to-Speech (TTS) and Speech-to-Text (STT) | speech(), transcribe() |
| VideoAgent | Video generation | generate() |
| ImageAgent | Image generation, editing, variations | generate(), edit() |
| OCRAgent | Text extraction from documents/images | extract() |
| DeepResearchAgent | Automated research with citations | research() |
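All of these agent types follow the same declaration shape in YAML. A generic sketch, with placeholder values to fill in (the concrete Quick Start examples below use real models):

```yaml
agents:
  my_agent:
    agent: <AgentType>        # e.g. AudioAgent, ImageAgent, OCRAgent
    llm: <provider/model>     # see Supported Providers below
    role: <short role name>
    goal: <what the agent should do>
```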

Quick Start

Text-to-Speech (TTS)

agents:
  speaker:
    agent: AudioAgent
    llm: openai/tts-1
    role: Text-to-Speech Agent
    goal: Convert text to speech

steps:
  - agent: speaker
    action: speech
    text: "Hello, welcome to PraisonAI!"
    output: "hello.mp3"

Speech-to-Text (STT)

agents:
  transcriber:
    agent: AudioAgent
    llm: openai/whisper-1
    role: Transcriber
    goal: Transcribe audio to text

steps:
  - agent: transcriber
    action: transcribe
    input: "recording.mp3"
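A transcript is often most useful as input to a downstream agent. A hypothetical sketch of that pattern, assuming the {{previous_output}} placeholder described under Best Practices; the summarizer agent and its task key are illustrative, not a documented schema:

```yaml
agents:
  transcriber:
    agent: AudioAgent
    llm: openai/whisper-1
    role: Transcriber
    goal: Transcribe audio to text
  summarizer:
    role: Summarizer          # standard Agent (no specialized agent: field)
    goal: Summarize transcripts

steps:
  - agent: transcriber
    action: transcribe
    input: "recording.mp3"
  - agent: summarizer
    task: "Summarize this transcript: {{previous_output}}"
```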

Image Generation

agents:
  artist:
    agent: ImageAgent
    llm: openai/dall-e-3
    role: Image Creator
    goal: Generate images from prompts

steps:
  - agent: artist
    action: generate
    prompt: "A beautiful sunset over mountains"
    output: "sunset.png"

Video Generation

agents:
  director:
    agent: VideoAgent
    llm: openai/sora-2
    role: Video Creator
    goal: Generate videos from prompts

steps:
  - agent: director
    action: generate
    prompt: "A cat playing with yarn"
    output: "cat.mp4"

Document OCR

agents:
  reader:
    agent: OCRAgent
    llm: mistral/mistral-ocr-latest
    role: Document Reader
    goal: Extract text from documents

steps:
  - agent: reader
    action: extract
    source: "document.pdf"

Python API

You can also use specialized agents directly in Python:
from praisonaiagents import AudioAgent, ImageAgent, VideoAgent, OCRAgent

# Text-to-Speech
audio = AudioAgent(llm="openai/tts-1")
audio.speech("Hello world!", output="hello.mp3")

# Speech-to-Text
audio = AudioAgent(llm="openai/whisper-1")
text = audio.transcribe("recording.mp3")

# Image Generation
image = ImageAgent(llm="openai/dall-e-3")
result = image.generate("A mountain landscape")

# Video Generation
video = VideoAgent(llm="openai/sora-2")
result = video.generate("A sunset timelapse")

# OCR
ocr = OCRAgent(llm="mistral/mistral-ocr-latest")
text = ocr.extract("document.pdf")
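Specialized agents typically write files (audio, images, video), and calls can fail if the target directory does not exist. A minimal, library-agnostic sketch for preparing output paths before invoking an agent; the outputs/ directory name is illustrative:

```python
from pathlib import Path

def prepare_output(path_str: str) -> Path:
    """Ensure the parent directory of an output file exists, then return the path."""
    path = Path(path_str)
    path.parent.mkdir(parents=True, exist_ok=True)
    return path

target = prepare_output("outputs/audio/hello.mp3")
# e.g. audio.speech("Hello world!", output=str(target))
```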

Supported Providers

AudioAgent (TTS)

  • openai/tts-1 - OpenAI TTS
  • openai/tts-1-hd - OpenAI TTS HD
  • elevenlabs/eleven_multilingual_v2 - ElevenLabs
  • gemini/gemini-2.5-flash-preview-tts - Google Gemini

AudioAgent (STT)

  • openai/whisper-1 - OpenAI Whisper
  • groq/whisper-large-v3 - Groq Whisper
  • deepgram/nova-2 - Deepgram

ImageAgent

  • openai/dall-e-3 - DALL-E 3
  • openai/dall-e-2 - DALL-E 2
  • vertex_ai/imagen-3.0-generate-001 - Google Imagen

VideoAgent

  • openai/sora-2 - OpenAI Sora
  • gemini/veo-3.0-generate-preview - Google Veo
  • runwayml/gen4_turbo - RunwayML

OCRAgent

  • mistral/mistral-ocr-latest - Mistral OCR

CLI Usage

Use specialized agents via recipes:
# Text-to-Speech
praisonai recipe run ai-text-to-speech --var text="Hello world"

# Speech-to-Text
praisonai recipe run ai-speech-to-text --var audio=recording.mp3

# Image Generation
praisonai recipe run ai-generate-image --var prompt="A sunset"

# Video Generation
praisonai recipe run ai-generate-video --var prompt="A cat playing"

# Document OCR
praisonai recipe run ai-document-ocr --var source=document.pdf

Best Practices

  1. Use appropriate models - Choose the right model for your use case (e.g., tts-1-hd for higher-quality audio)
  2. Handle file outputs - Specialized agents often produce files; ensure proper output paths
  3. Chain with standard agents - Combine specialized agents with standard Agent for complex workflows
  4. Use context passing - Use {{previous_output}} to pass results between agents
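Practices 3 and 4 can be combined: a standard agent prepares input that a specialized agent consumes. A hypothetical workflow sketch, assuming {{previous_output}} carries the prior step's result; the writer agent and its task key are illustrative:

```yaml
agents:
  writer:
    role: Prompt Writer       # standard Agent
    goal: Write a vivid image prompt
  artist:
    agent: ImageAgent
    llm: openai/dall-e-3
    role: Image Creator
    goal: Generate images from prompts

steps:
  - agent: writer
    task: "Write a one-sentence image prompt about autumn forests"
  - agent: artist
    action: generate
    prompt: "{{previous_output}}"
    output: "forest.png"
```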