Specialized Agents

PraisonAI supports specialized agent types that provide domain-specific capabilities for media processing, document handling, and more. These agents can be used in YAML workflows via the agent: field.

Supported Agent Types

| Agent Type | Purpose | Key Methods |
| --- | --- | --- |
| AudioAgent | Text-to-Speech (TTS) and Speech-to-Text (STT) | speech(), transcribe() |
| VideoAgent | Video generation | generate() |
| ImageAgent | Image generation, editing, variations | generate(), edit() |
| OCRAgent | Text extraction from documents/images | extract() |
| DeepResearchAgent | Automated research with citations | research() |
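All of these agent types follow the same declaration shape in YAML. A generic sketch, with placeholder values to fill in (the concrete Quick Start examples below use real models):

```yaml
agents:
  my_agent:
    agent: <AgentType>        # e.g. AudioAgent, ImageAgent, OCRAgent
    llm: <provider/model>     # see Supported Providers below
    role: <short role name>
    goal: <what the agent should do>
```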

Quick Start

Text-to-Speech (TTS)

agents:
  speaker:
    agent: AudioAgent
    llm: openai/tts-1
    role: Text-to-Speech Agent
    goal: Convert text to speech

steps:
  - agent: speaker
    action: speech
    text: "Hello, welcome to PraisonAI!"
    output: "hello.mp3"

Speech-to-Text (STT)

agents:
  transcriber:
    agent: AudioAgent
    llm: openai/whisper-1
    role: Transcriber
    goal: Transcribe audio to text

steps:
  - agent: transcriber
    action: transcribe
    input: "recording.mp3"
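A transcript is often most useful as input to a downstream agent. A hypothetical sketch of that pattern, assuming the {{previous_output}} placeholder described under Best Practices; the summarizer agent and its task key are illustrative, not a documented schema:

```yaml
agents:
  transcriber:
    agent: AudioAgent
    llm: openai/whisper-1
    role: Transcriber
    goal: Transcribe audio to text
  summarizer:
    role: Summarizer          # standard Agent (no specialized agent: field)
    goal: Summarize transcripts

steps:
  - agent: transcriber
    action: transcribe
    input: "recording.mp3"
  - agent: summarizer
    task: "Summarize this transcript: {{previous_output}}"
```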

Image Generation

agents:
  artist:
    agent: ImageAgent
    llm: openai/dall-e-3
    role: Image Creator
    goal: Generate images from prompts

steps:
  - agent: artist
    action: generate
    prompt: "A beautiful sunset over mountains"
    output: "sunset.png"

Video Generation

agents:
  director:
    agent: VideoAgent
    llm: openai/sora-2
    role: Video Creator
    goal: Generate videos from prompts

steps:
  - agent: director
    action: generate
    prompt: "A cat playing with yarn"
    output: "cat.mp4"

Document OCR

agents:
  reader:
    agent: OCRAgent
    llm: mistral/mistral-ocr-latest
    role: Document Reader
    goal: Extract text from documents

steps:
  - agent: reader
    action: extract
    source: "document.pdf"

Python API

You can also use specialized agents directly in Python:
from praisonaiagents import AudioAgent, ImageAgent, VideoAgent, OCRAgent

# Text-to-Speech
audio = AudioAgent(llm="openai/tts-1")
audio.speech("Hello world!", output="hello.mp3")

# Speech-to-Text
audio = AudioAgent(llm="openai/whisper-1")
text = audio.transcribe("recording.mp3")

# Image Generation
image = ImageAgent(llm="openai/dall-e-3")
result = image.generate("A mountain landscape")

# Video Generation
video = VideoAgent(llm="openai/sora-2")
result = video.generate("A sunset timelapse")

# OCR
ocr = OCRAgent(llm="mistral/mistral-ocr-latest")
text = ocr.extract("document.pdf")
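Specialized agents typically write files (audio, images, video), and calls can fail if the target directory does not exist. A minimal, library-agnostic sketch for preparing output paths before invoking an agent; the outputs/ directory name is illustrative:

```python
from pathlib import Path

def prepare_output(path_str: str) -> Path:
    """Ensure the parent directory of an output file exists, then return the path."""
    path = Path(path_str)
    path.parent.mkdir(parents=True, exist_ok=True)
    return path

target = prepare_output("outputs/audio/hello.mp3")
# e.g. audio.speech("Hello world!", output=str(target))
```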

Supported Providers

AudioAgent (TTS)

  • openai/tts-1 - OpenAI TTS
  • openai/tts-1-hd - OpenAI TTS HD
  • elevenlabs/eleven_multilingual_v2 - ElevenLabs
  • gemini/gemini-2.5-flash-preview-tts - Google Gemini

AudioAgent (STT)

  • openai/whisper-1 - OpenAI Whisper
  • groq/whisper-large-v3 - Groq Whisper
  • deepgram/nova-2 - Deepgram

ImageAgent

  • openai/dall-e-3 - DALL-E 3
  • openai/dall-e-2 - DALL-E 2
  • vertex_ai/imagen-3.0-generate-001 - Google Imagen

VideoAgent

  • openai/sora-2 - OpenAI Sora
  • gemini/veo-3.0-generate-preview - Google Veo
  • runwayml/gen4_turbo - RunwayML

OCRAgent

  • mistral/mistral-ocr-latest - Mistral OCR

CLI Usage

Use specialized agents via recipes:
# Text-to-Speech
praisonai recipe run ai-text-to-speech --var text="Hello world"

# Speech-to-Text
praisonai recipe run ai-speech-to-text --var audio=recording.mp3

# Image Generation
praisonai recipe run ai-generate-image --var prompt="A sunset"

# Video Generation
praisonai recipe run ai-generate-video --var prompt="A cat playing"

# Document OCR
praisonai recipe run ai-document-ocr --var source=document.pdf

Best Practices

  1. Use appropriate models - Choose the right model for your use case (e.g., tts-1-hd for higher-quality audio)
  2. Handle file outputs - Specialized agents often produce files; ensure proper output paths
  3. Chain with standard agents - Combine specialized agents with standard Agent for complex workflows
  4. Use context passing - Use {{previous_output}} to pass results between agents
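Practices 3 and 4 can be combined: a standard agent prepares input that a specialized agent consumes. A hypothetical workflow sketch, assuming {{previous_output}} carries the prior step's result; the writer agent and its task key are illustrative:

```yaml
agents:
  writer:
    role: Prompt Writer       # standard Agent
    goal: Write a vivid image prompt
  artist:
    agent: ImageAgent
    llm: openai/dall-e-3
    role: Image Creator
    goal: Generate images from prompts

steps:
  - agent: writer
    task: "Write a one-sentence image prompt about autumn forests"
  - agent: artist
    action: generate
    prompt: "{{previous_output}}"
    output: "forest.png"
```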