Multimodal AI Agents

PraisonAI supports multimodal AI agents that process images and videos alongside text, so a single agent can describe, analyze, and reason about visual content.

Installation

pip install praisonaiagents
export OPENAI_API_KEY=xxxxxxxxxxxxxxxxxxxxxx

Quick Start

Create multimodal_app.py and add the following code:

from praisonaiagents import Agent, Task, PraisonAIAgents

# Create Vision Analysis Agent
vision_agent = Agent(
    name="VisionAnalyst",
    role="Computer Vision Specialist",
    goal="Analyze images and videos to extract meaningful information",
    backstory="""You are an expert in computer vision and image analysis.
    You excel at describing images, detecting objects, and understanding visual content.""",
    llm="gpt-4o-mini",
    self_reflect=False
)

# Create a task with an image attached
task = Task(
    name="analyze_landmark",
    description="Describe this famous landmark and its architectural features.",
    expected_output="Detailed description of the landmark's architecture and significance",
    agent=vision_agent,
    images=["https://upload.wikimedia.org/wikipedia/commons/b/bf/Krakow_-_Kosciol_Mariacki.jpg"]
)

# Run the agents
agents = PraisonAIAgents(
    agents=[vision_agent],
    tasks=[task],
    process="sequential",
    verbose=True
)

result = agents.start()
print(result)

Agent Configuration

Core Attributes

  • All standard agent attributes apply (role, goal, backstory, etc.)
  • llm: Must be a vision-capable model (e.g., "gpt-4o-mini")
  • verbose: Enable detailed logs (default: False)

Task Configuration

Media Support

Tasks accept media through the images parameter, which takes a list of image URLs, local image paths, or video file paths:

  1. Image URLs
task = Task(
    description="Analyze the image",
    images=["https://example.com/image.jpg"]
)
  2. Local Image Files
task = Task(
    description="Analyze the local image",
    images=["path/to/local/image.jpg"]
)
  3. Video Files
task = Task(
    description="Analyze the video content",
    images=["path/to/video.mp4"]
)
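
Because URLs, local images, and videos all go through the same images list, it can help to check what each entry is before building a task. The following is a minimal sketch using only the standard library. The classify_media helper and the extension sets are assumptions made for illustration; they are not part of PraisonAI, and the formats a given model accepts may differ.

```python
from pathlib import Path
from urllib.parse import urlparse

# Assumed extension sets for illustration only; not taken from PraisonAI.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".webp"}
VIDEO_EXTENSIONS = {".mp4", ".mov", ".avi", ".mkv"}


def classify_media(source: str) -> str:
    """Return 'url', 'image', 'video', or 'unknown' for one media source."""
    # Anything with an http(s) scheme is treated as a remote URL.
    if urlparse(source).scheme in ("http", "https"):
        return "url"
    # Otherwise decide by file extension, ignoring case.
    suffix = Path(source).suffix.lower()
    if suffix in IMAGE_EXTENSIONS:
        return "image"
    if suffix in VIDEO_EXTENSIONS:
        return "video"
    return "unknown"


sources = [
    "https://example.com/image.jpg",
    "path/to/local/image.jpg",
    "path/to/video.mp4",
]
# Map each source to its detected kind.
kinds = {source: classify_media(source) for source in sources}
```

Entries classified as "unknown" can be dropped or reported before the list is passed to Task(images=...).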

Task Types

  1. Image Analysis
  • Object detection
  • Scene description
  • Architectural analysis
  • Text extraction from images
  2. Video Analysis
  • Event summarization
  • Object tracking
  • Action recognition
  • Text and caption extraction

Advanced Features

Multiple Media Sources

A single task can take several media sources in one images list:

task = Task(
    description="Compare these images and describe the differences",
    images=[
        "https://example.com/image1.jpg",
        "https://example.com/image2.jpg"
    ]
)

Combining Text and Visual Analysis

Create tasks that leverage both text and visual understanding:

task = Task(
    description="""Analyze this technical diagram and:
    1. Explain the components
    2. Describe their relationships
    3. Identify any text or labels
    4. Provide technical insights""",
    images=["diagram.jpg"]
)
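
When the same kind of multi-step description is reused across tasks, the numbered instructions can be generated rather than typed by hand. This is a small sketch; build_description is a hypothetical helper, not a PraisonAI function, and it only formats the string that would be passed as description.

```python
def build_description(intro: str, steps: list[str]) -> str:
    """Join an intro line and numbered steps into one task description."""
    numbered = [f"{index}. {step}" for index, step in enumerate(steps, start=1)]
    return "\n".join([f"{intro} and:", *numbered])


description = build_description(
    "Analyze this technical diagram",
    [
        "Explain the components",
        "Describe their relationships",
        "Identify any text or labels",
        "Provide technical insights",
    ],
)
```

The resulting string can then be used as the description argument of a Task alongside images=["diagram.jpg"].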

Best Practices

  1. Media Handling

    • Ensure images are in supported formats (JPEG, PNG, etc.)
    • Keep video files within reasonable size limits
    • Provide clear, high-quality media for best results
  2. Task Design

    • Write clear, specific descriptions
    • Break complex analyses into subtasks
    • Include expected output format
  3. Performance Optimization

    • Use appropriate model versions for your needs
    • Consider caching for repeated analyses
    • Monitor token usage with visual content
  4. Error Handling

    • Validate media files before processing
    • Handle missing or corrupted files gracefully
    • Implement retry logic for failed analyses
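
The error-handling points above could be sketched as follows, using only the standard library. Both helpers are hypothetical and not part of PraisonAI: validate_local_media checks only that a file exists and is non-empty (it does not verify that the file decodes as an image or video), and run_with_retry retries any callable with exponential backoff.

```python
import time
from pathlib import Path
from typing import Callable, TypeVar

T = TypeVar("T")


def validate_local_media(paths: list[str]) -> tuple[list[str], list[str]]:
    """Split paths into (usable, rejected); usable files exist and are non-empty."""
    usable, rejected = [], []
    for path in paths:
        file = Path(path)
        if file.is_file() and file.stat().st_size > 0:
            usable.append(path)
        else:
            rejected.append(path)
    return usable, rejected


def run_with_retry(run: Callable[[], T], attempts: int = 3, base_delay: float = 1.0) -> T:
    """Call run(); on an exception, wait with exponential backoff and try again."""
    for attempt in range(1, attempts + 1):
        try:
            return run()
        except Exception:
            # Re-raise the last error once all attempts are used up.
            if attempt == attempts:
                raise
            # Delay doubles each time: base_delay, 2x, 4x, ...
            time.sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("attempts must be at least 1")
```

With the Quick Start objects, a run could be wrapped as run_with_retry(agents.start). Catching every Exception is a simplification; in practice it is usually better to retry only errors known to be transient, such as timeouts or rate limits.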

Example Use Cases

  1. Document Analysis
document_task = Task(
    description="Extract and summarize text from this document image",
    expected_output="Structured text content with key information highlighted",
    agent=vision_agent,
    images=["document.jpg"]
)
  2. Security Monitoring
security_task = Task(
    description="Monitor this security camera feed and report any suspicious activity",
    expected_output="Detailed report of detected events and anomalies",
    agent=vision_agent,
    images=["security_feed.mp4"]
)
  3. Medical Imaging
medical_task = Task(
    description="Analyze this medical scan and identify any abnormalities",
    expected_output="Professional medical image analysis report",
    agent=vision_agent,
    images=["medical_scan.jpg"]
)
