# Specialized Agents

> Use specialized agent types (AudioAgent, VideoAgent, ImageAgent, OCRAgent) in YAML workflows

PraisonAI provides specialized agent types for media processing and document handling: speech synthesis and transcription, image and video generation, OCR, and automated research. To use one in a YAML workflow, set the `agent:` field of an agent definition to the agent type.

## Supported Agent Types

| Agent Type          | Purpose                                       | Key Methods                |
| ------------------- | --------------------------------------------- | -------------------------- |
| `AudioAgent`        | Text-to-Speech (TTS) and Speech-to-Text (STT) | `speech()`, `transcribe()` |
| `VideoAgent`        | Video generation                              | `generate()`               |
| `ImageAgent`        | Image generation, editing, variations         | `generate()`, `edit()`     |
| `OCRAgent`          | Text extraction from documents/images         | `extract()`                |
| `DeepResearchAgent` | Automated research with citations             | `research()`               |

## Quick Start

### Text-to-Speech (TTS)

```yaml theme={"theme":{"light":"vitesse-light","dark":"vitesse-dark"}}
agents:
  speaker:
    agent: AudioAgent
    llm: openai/tts-1
    role: Text-to-Speech Agent
    goal: Convert text to speech

steps:
  - agent: speaker
    action: speech
    text: "Hello, welcome to PraisonAI!"
    output: "hello.mp3"
```

### Speech-to-Text (STT)

```yaml theme={"theme":{"light":"vitesse-light","dark":"vitesse-dark"}}
agents:
  transcriber:
    agent: AudioAgent
    llm: openai/whisper-1
    role: Transcriber
    goal: Transcribe audio to text

steps:
  - agent: transcriber
    action: transcribe
    input: "recording.mp3"
```

### Image Generation

```yaml theme={"theme":{"light":"vitesse-light","dark":"vitesse-dark"}}
agents:
  artist:
    agent: ImageAgent
    llm: openai/dall-e-3
    role: Image Creator
    goal: Generate images from prompts

steps:
  - agent: artist
    action: generate
    prompt: "A beautiful sunset over mountains"
    output: "sunset.png"
```

### Video Generation

```yaml theme={"theme":{"light":"vitesse-light","dark":"vitesse-dark"}}
agents:
  director:
    agent: VideoAgent
    llm: openai/sora-2
    role: Video Creator
    goal: Generate videos from prompts

steps:
  - agent: director
    action: generate
    prompt: "A cat playing with yarn"
    output: "cat.mp4"
```

### Document OCR

```yaml theme={"theme":{"light":"vitesse-light","dark":"vitesse-dark"}}
agents:
  reader:
    agent: OCRAgent
    llm: mistral/mistral-ocr-latest
    role: Document Reader
    goal: Extract text from documents

steps:
  - agent: reader
    action: extract
    source: "document.pdf"
```
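The Quick Start workflows above all share one shape: an `agents` map whose entries name an agent type, and `steps` that reference an agent and an `action`. The following is a minimal sketch of checking a parsed workflow against the Supported Agent Types table. `ALLOWED_ACTIONS` and `validate_workflow` are hypothetical helpers written for illustration, not part of PraisonAI, and the workflow is a plain dict of the kind a YAML parser would produce.

```python
# Hypothetical helper, not part of PraisonAI: checks that each step's action
# is one of the key methods listed for its agent type in the table above.
ALLOWED_ACTIONS = {
    "AudioAgent": {"speech", "transcribe"},
    "VideoAgent": {"generate"},
    "ImageAgent": {"generate", "edit"},
    "OCRAgent": {"extract"},
    "DeepResearchAgent": {"research"},
}


def validate_workflow(workflow: dict) -> list[str]:
    """Return a list of problems; an empty list means the steps look consistent."""
    problems = []
    agents = workflow.get("agents", {})
    for index, step in enumerate(workflow.get("steps", [])):
        name = step.get("agent")
        if name not in agents:
            problems.append(f"step {index}: unknown agent '{name}'")
            continue
        agent_type = agents[name].get("agent")
        allowed = ALLOWED_ACTIONS.get(agent_type)
        action = step.get("action")
        if allowed is None:
            problems.append(f"step {index}: unsupported agent type '{agent_type}'")
        elif action not in allowed:
            problems.append(f"step {index}: action '{action}' is not listed for {agent_type}")
    return problems


# The Document OCR workflow from above, as a YAML parser would load it.
ocr_workflow = {
    "agents": {
        "reader": {
            "agent": "OCRAgent",
            "llm": "mistral/mistral-ocr-latest",
            "role": "Document Reader",
            "goal": "Extract text from documents",
        }
    },
    "steps": [{"agent": "reader", "action": "extract", "source": "document.pdf"}],
}

print(validate_workflow(ocr_workflow))  # prints []
```

A check like this catches a mismatched `action` (for example, `generate` on an `OCRAgent`) before any provider call is made.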

## Python API

You can also use specialized agents directly in Python:

```python theme={"theme":{"light":"vitesse-light","dark":"vitesse-dark"}}
from praisonaiagents import AudioAgent, ImageAgent, VideoAgent, OCRAgent

# Text-to-Speech
audio = AudioAgent(llm="openai/tts-1")
audio.speech("Hello world!", output="hello.mp3")

# Speech-to-Text
audio = AudioAgent(llm="openai/whisper-1")
text = audio.transcribe("recording.mp3")

# Image Generation
image = ImageAgent(llm="openai/dall-e-3")
result = image.generate("A mountain landscape")

# Video Generation
video = VideoAgent(llm="openai/sora-2")
result = video.generate("A sunset timelapse")

# OCR
ocr = OCRAgent(llm="mistral/mistral-ocr-latest")
text = ocr.extract("document.pdf")
```
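The calls above can be combined. As a rough sketch, the hypothetical helper `document_to_speech` below passes OCR output to text-to-speech using only the `extract()` and `speech()` calls shown in this section. It assumes the value returned by `extract()` is text or converts cleanly to text with `str()`, which this page does not confirm. The agents are passed in as arguments, so the function can be exercised without API keys.

```python
def document_to_speech(ocr, speaker, source: str, output: str) -> str:
    """Extract text from `source` with `ocr`, then synthesize it to `output` with `speaker`."""
    extracted = ocr.extract(source)
    # Assumption: the extraction result is text or converts cleanly to text.
    text = str(extracted).strip()
    if not text:
        raise ValueError(f"no text extracted from {source}")
    speaker.speech(text, output=output)
    return text


# Usage with real agents (requires praisonaiagents and provider API keys):
#
#   from praisonaiagents import AudioAgent, OCRAgent
#   document_to_speech(
#       OCRAgent(llm="mistral/mistral-ocr-latest"),
#       AudioAgent(llm="openai/tts-1"),
#       source="document.pdf",
#       output="document.mp3",
#   )
```

Taking the agents as parameters keeps the chaining logic separate from provider configuration, so the same function works with any model from the provider lists.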

## Supported Providers

### AudioAgent (TTS)

* `openai/tts-1` - OpenAI TTS
* `openai/tts-1-hd` - OpenAI TTS HD
* `elevenlabs/eleven_multilingual_v2` - ElevenLabs
* `gemini/gemini-2.5-flash-preview-tts` - Google Gemini

### AudioAgent (STT)

* `openai/whisper-1` - OpenAI Whisper
* `groq/whisper-large-v3` - Groq Whisper
* `deepgram/nova-2` - Deepgram

### ImageAgent

* `openai/dall-e-3` - DALL-E 3
* `openai/dall-e-2` - DALL-E 2
* `vertex_ai/imagen-3.0-generate-001` - Google Imagen

### VideoAgent

* `openai/sora-2` - OpenAI Sora
* `gemini/veo-3.0-generate-preview` - Google Veo
* `runwayml/gen4_turbo` - RunwayML

### OCRAgent

* `mistral/mistral-ocr-latest` - Mistral OCR
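Every model identifier listed above follows a `provider/model` pattern. The snippet below is a small illustration of that naming convention. `split_model_id` is a hypothetical helper, not how PraisonAI parses the `llm` value.

```python
def split_model_id(model_id: str) -> tuple[str, str]:
    """Split a 'provider/model' identifier at the first slash."""
    provider, separator, model = model_id.partition("/")
    if not separator or not provider or not model:
        raise ValueError(f"expected 'provider/model', got {model_id!r}")
    return provider, model


print(split_model_id("openai/tts-1-hd"))                    # ('openai', 'tts-1-hd')
print(split_model_id("vertex_ai/imagen-3.0-generate-001"))  # ('vertex_ai', 'imagen-3.0-generate-001')
```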

## CLI Usage

Run specialized agents from the command line using recipes:

```bash theme={"theme":{"light":"vitesse-light","dark":"vitesse-dark"}}
# Text-to-Speech
praisonai recipe run ai-text-to-speech --var text="Hello world"

# Speech-to-Text
praisonai recipe run ai-speech-to-text --var audio=recording.mp3

# Image Generation
praisonai recipe run ai-generate-image --var prompt="A sunset"

# Video Generation
praisonai recipe run ai-generate-video --var prompt="A cat playing"

# Document OCR
praisonai recipe run ai-document-ocr --var source=document.pdf
```
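To call these recipes from a script, one option is to build the same command as an argument list and pass it to `subprocess.run`. This is a minimal sketch: `build_recipe_command` and `run_recipe` are hypothetical wrappers, and they use only the `praisonai recipe run <recipe> --var name=value` form shown above, with a single variable.

```python
import subprocess


def build_recipe_command(recipe: str, name: str, value: str) -> list[str]:
    """Build the argument list for `praisonai recipe run` with one --var."""
    return ["praisonai", "recipe", "run", recipe, "--var", f"{name}={value}"]


def run_recipe(recipe: str, name: str, value: str) -> subprocess.CompletedProcess:
    """Run the recipe; raises CalledProcessError on a non-zero exit code."""
    # Requires the praisonai CLI to be installed and on PATH.
    return subprocess.run(build_recipe_command(recipe, name, value), check=True)


print(build_recipe_command("ai-text-to-speech", "text", "Hello world"))
# ['praisonai', 'recipe', 'run', 'ai-text-to-speech', '--var', 'text=Hello world']
```

Passing a list rather than a shell string means values containing spaces, such as `Hello world`, need no extra quoting.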

## Best Practices

1. **Use appropriate models** - Match the model to the task (e.g., `openai/tts-1-hd` instead of `openai/tts-1` for higher-quality audio)
2. **Handle file outputs** - Specialized agents often produce files; set an explicit `output` path and make sure the location is writable
3. **Chain with standard agents** - Combine specialized agents with the standard `Agent` for complex workflows, such as transcribing audio and then summarizing the transcript
4. **Use context passing** - Use `{{previous_output}}` to pass one step's result to the next agent
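To make the context-passing idea concrete, here is a simplified sketch of how a `{{previous_output}}` placeholder could be filled in before a step runs. `render_step` is a hypothetical stand-in for illustration, not PraisonAI's implementation, and the real substitution rules may differ.

```python
PLACEHOLDER = "{{previous_output}}"


def render_step(step: dict, previous_output: str) -> dict:
    """Return a copy of `step` with the placeholder replaced in every string value."""
    return {
        key: value.replace(PLACEHOLDER, previous_output) if isinstance(value, str) else value
        for key, value in step.items()
    }


# A speech step that reads out whatever the previous step produced.
step = {
    "agent": "speaker",
    "action": "speech",
    "text": "Transcript: {{previous_output}}",
    "output": "transcript.mp3",
}

print(render_step(step, "Hello, welcome to PraisonAI!")["text"])
# Transcript: Hello, welcome to PraisonAI!
```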

## Related

* [Multi-Agent Pipelines](/docs/features/multi-agent-pipelines) - Chain specialized agents together
* [Audio Agents](/docs/audio/overview) - Detailed audio agent documentation
* [Video Agents](/docs/video/overview) - Detailed video agent documentation
* [Image Agents](/docs/image/overview) - Detailed image agent documentation
* [OCR](/docs/ocr/overview) - Detailed OCR documentation
