> ## Documentation Index
> Fetch the complete documentation index at: https://docs.praison.ai/llms.txt
> Use this file to discover all available pages before exploring further.

# Multimodal Agents

> Guide for creating and using multimodal AI agents in PraisonAI for processing images, videos, and other media types

## Quick Start

<Tabs>
  <Tab title="Code">
    <Steps>
      <Step title="Install Package">
        First, install the PraisonAI Agents package:

        ```bash theme={"theme":{"light":"vitesse-light","dark":"vitesse-dark"}}
        pip install praisonaiagents opencv-python moviepy
        ```
      </Step>

      <Step title="Set API Key">
        Set your OpenAI API key as an environment variable in your terminal:

        ```bash theme={"theme":{"light":"vitesse-light","dark":"vitesse-dark"}}
        export OPENAI_API_KEY=xxxxxxxxxxxxxxxxxxxxxx
        ```
      </Step>

      <Step title="Create a file">
        Create a new file `app.py` with the basic setup:

        ```python theme={"theme":{"light":"vitesse-light","dark":"vitesse-dark"}}
        from praisonaiagents import Agent, Task, AgentTeam

        # Create Vision Analysis Agent
        vision_agent = Agent(
            name="VisionAnalyst",
            role="Computer Vision Specialist",
            goal="Analyze images and videos to extract meaningful information",
            backstory="""You are an expert in computer vision and image analysis.
            You excel at describing images, detecting objects, and understanding visual content.""",
            llm="gpt-4o-mini",
            reflection=False
        )

        # Create tasks with different media types
        task = Task(
            name="analyze_landmark",
            description="Describe this famous landmark and its architectural features.",
            expected_output="Detailed description of the landmark's architecture and significance",
            agent=vision_agent,
            images=["https://upload.wikimedia.org/wikipedia/commons/b/bf/Krakow_-_Kosciol_Mariacki.jpg"]
        )

        # Run the agents
        agents = AgentTeam(
            agents=[vision_agent],
            tasks=[task],
            process="sequential",
            
        )

        agents.start()
        ```
      </Step>

      <Step title="Start Agents">
        Type this in your terminal to run your agents:

        ```bash theme={"theme":{"light":"vitesse-light","dark":"vitesse-dark"}}
        python app.py
        ```
      </Step>
    </Steps>
  </Tab>

  <Tab title="No Code">
    <Steps>
      <Step title="Install Package">
        Install the PraisonAI package:

        ```bash theme={"theme":{"light":"vitesse-light","dark":"vitesse-dark"}}
        pip install praisonai opencv-python moviepy
        ```
      </Step>

      <Step title="Set API Key">
        Set your OpenAI API key as an environment variable in your terminal:

        ```bash theme={"theme":{"light":"vitesse-light","dark":"vitesse-dark"}}
        export OPENAI_API_KEY=xxxxxxxxxxxxxxxxxxxxxx
        ```
      </Step>

      <Step title="Create a file">
        Create a new file `agents.yaml` with the basic setup:

        ```yaml theme={"theme":{"light":"vitesse-light","dark":"vitesse-dark"}}
        framework: praisonai
        process: sequential
        topic: analyze landmark image
        agents:  # Canonical: use 'agents' instead of 'roles'
          vision_analyst:
            name: VisionAnalyst
            role: Computer Vision Specialist
            goal: Analyze images and videos to extract meaningful information
            instructions:  # Canonical: use 'instructions' instead of 'backstory' |
              You are an expert in computer vision and image analysis.
              You excel at describing images, detecting objects, and understanding visual content.
            llm: gpt-4o-mini
            self_reflect: false
            tasks:
              analyze_landmark:
                description: Describe this famous landmark and its architectural features.
                expected_output: Detailed description of the landmark's architecture and significance
                images:
                  - https://upload.wikimedia.org/wikipedia/commons/b/bf/Krakow_-_Kosciol_Mariacki.jpg
        ```
      </Step>

      <Step title="Start Agents">
        Type this in your terminal to run your agents:

        ```bash theme={"theme":{"light":"vitesse-light","dark":"vitesse-dark"}}
        praisonai agents.yaml
        ```
      </Step>
    </Steps>
  </Tab>
</Tabs>

<Note>
  **Requirements**

  * Python 3.10 or higher
  * OpenAI API key with vision model access
  * Basic understanding of Python and media handling
</Note>

<div className="relative w-full aspect-video">
  <iframe className="absolute top-0 left-0 w-full h-full" src="https://www.youtube.com/embed/hjAWmUT1qqY" title="YouTube video player" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowFullScreen />
</div>

## Understanding Multimodal Agents

<Card title="What are Multimodal Agents?" icon="question">
  Multimodal agents are designed to:

  * Process multiple types of data (text, images, videos)
  * Understand context across different modalities
  * Generate insights from diverse media sources
  * Handle complex multimedia tasks
</Card>

## Features

<CardGroup cols={2}>
  <Card title="Vision Processing" icon="eye">
    Analyze images, detect objects, and understand visual content.
  </Card>

  <Card title="Video Analysis" icon="video">
    Process video content for events and actions.
  </Card>

  <Card title="Text Extraction" icon="font">
    Extract and analyze text from images and documents.
  </Card>

  <Card title="Cross-Modal Understanding" icon="arrows-repeat">
    Integrate insights across different media types.
  </Card>
</CardGroup>

## Multi-Agent Media Processing

<Tabs>
  <Tab title="Code">
    ```python theme={"theme":{"light":"vitesse-light","dark":"vitesse-dark"}}
    from praisonaiagents import Agent, Task, AgentTeam

    # Create first agent for image analysis
    vision_agent = Agent(
        role="Image Analyst",
        goal="Analyze visual content and extract key information",
        backstory="Expert in visual analysis and image understanding",
        llm="gpt-4o-mini",
        reflection=False
    )

    # Create second agent for content writing
    writer_agent = Agent(
        role="Content Writer",
        goal="Create engaging content based on image analysis",
        backstory="Expert in creating compelling content from visual insights",
        llm="gpt-4o-mini"
    )

    # Create tasks for different media types
    document_task = Task(
        description="Extract and summarize text from this document image",
        expected_output="Structured text content with key information highlighted",
        agent=vision_agent,
        images=["document.jpg"]
    )

    writing_task = Task(
        description="Create engaging content based on image analysis",
        expected_output="Compelling article incorporating visual insights",
        agent=writer_agent
    )

    # Create and start the agents
    agents = AgentTeam(
        agents=[vision_agent, writer_agent],
        tasks=[document_task, writing_task],
        process="sequential"
    )

    result = agents.start()
    ```
  </Tab>

  <Tab title="No Code">
    ```yaml theme={"theme":{"light":"vitesse-light","dark":"vitesse-dark"}}
    framework: praisonai
    process: sequential
    topic: document analysis and content creation
    agents:  # Canonical
      vision_analyst:
        role: Image Analyst
        goal: Analyze visual content and extract key information
        instructions:  # Canonical: use 'instructions' instead of 'backstory' Expert in visual analysis and image understanding
        llm: gpt-4o-mini
        self_reflect: false
        tasks:
          document_task:
            description: Extract and summarize text from this document image
            expected_output: Structured text content with key information highlighted
            images:
              - document.jpg

      content_writer:
        role: Content Writer
        goal: Create engaging content based on image analysis
        instructions:  # Canonical: use 'instructions' instead of 'backstory' Expert in creating compelling content from visual insights
        llm: gpt-4o-mini
        tasks:
          writing_task:
            description: Create engaging content based on image analysis
            expected_output: Compelling article incorporating visual insights
    ```
  </Tab>
</Tabs>

### Configuration Options

```python theme={"theme":{"light":"vitesse-light","dark":"vitesse-dark"}}
# Create an agent with multimodal configuration
agent = Agent(
    role="Media Analyst",
    goal="Process multiple types of media",
    backstory="Expert in multimedia analysis",
    llm="gpt-4o-mini",  # Must support vision capabilities
      # Enable detailed logging
    reflection=False  # Optional: disable self-reflection
)

# Task with media requirements
task = Task(
    description="Analyze media content",
    expected_output="Comprehensive analysis",
    agent=agent,
    images=[  # Support multiple media sources
        "https://example.com/image1.jpg",
        "path/to/local/image.jpg",
        "path/to/video.mp4"
    ]
)
```

## Best Practices

<CardGroup cols={2}>
  <Card title="Media Handling" icon="image">
    * Use supported formats (JPEG, PNG)
    * Keep reasonable file sizes
    * Provide high-quality media
    * Validate files before processing
  </Card>

  <Card title="Task Design" icon="list-check">
    * Write clear descriptions
    * Break down complex analyses
    * Specify expected outputs
    * Handle errors gracefully
  </Card>
</CardGroup>

## Ephemeral Attachments

Send images to the agent for analysis without storing them in chat history. Essential for preventing context window overflow when processing multiple images.

<Tabs>
  <Tab title="Attachments Parameter">
    ```python theme={"theme":{"light":"vitesse-light","dark":"vitesse-dark"}}
    from praisonaiagents import Agent

    agent = Agent(
        instructions="You analyze images and remember context",
        memory=True
    )

    # Image is analyzed but NOT stored in history
    response = agent.chat(
        prompt="What's in this image?",     # ← Stored in history
        attachments=["photo.jpg"],           # ← NOT stored (ephemeral)
    )

    # Agent remembers the question, not the image data
    response = agent.chat("What did I ask about earlier?")
    # Agent: "You asked 'What's in this image?' and I told you..."
    ```
  </Tab>

  <Tab title="Ephemeral Context Manager">
    ```python theme={"theme":{"light":"vitesse-light","dark":"vitesse-dark"}}
    from praisonaiagents import Agent

    agent = Agent(instructions="Analyze images", memory=True)

    # Pre-image conversation
    agent.chat("Hello, I have some photos to show you")

    # Ephemeral block - nothing stored permanently
    with agent.ephemeral():
        response = agent.chat(
            "Analyze this",
            attachments=["image1.jpg"]
        )
        followup = agent.chat("What about the colors?")

    # After block, history is restored - images NOT persisted
    agent.chat("What have we discussed?")  # Only remembers pre-image chat
    ```
  </Tab>
</Tabs>

### History Management Methods

Clean up chat history after image analysis sessions:

```python theme={"theme":{"light":"vitesse-light","dark":"vitesse-dark"}}
from praisonaiagents import Agent

agent = Agent(instructions="Image analyst", memory=True)

# After image analysis, clean up history
agent.prune_history(keep_last=5)           # Keep only last 5 messages
agent.delete_history(-1)                    # Delete last message
agent.delete_history_matching("[IMAGE]")   # Delete all image-related messages

# Check history size
print(f"History size: {agent.get_history_size()}")
```

| Method                             | Description                                 |
| ---------------------------------- | ------------------------------------------- |
| `prune_history(keep_last=N)`       | Keep only last N messages                   |
| `delete_history(index)`            | Delete message by index                     |
| `delete_history_matching(pattern)` | Delete messages containing pattern          |
| `get_history_size()`               | Get current history length                  |
| `ephemeral()`                      | Context manager for temporary conversations |

<Note>
  Use `attachments=` for one-time image analysis, or `ephemeral()` for multi-turn image conversations that shouldn't persist.
</Note>

## Example Use Cases

<CardGroup cols={2}>
  <Card title="Document Analysis" icon="file-lines">
    Extract and analyze text from document images.
  </Card>

  <Card title="Security Monitoring" icon="camera-cctv">
    Monitor security feeds for suspicious activity.
  </Card>

  <Card title="Medical Imaging" icon="microscope">
    Analyze medical scans for abnormalities.
  </Card>

  <Card title="Architectural Analysis" icon="building">
    Study architectural features and designs.
  </Card>
</CardGroup>

## Next Steps

<CardGroup cols={2}>
  <Card title="AutoAgents" icon="robot" href="./autoagents">
    Learn about automatically created and managed AI agents
  </Card>

  <Card title="Mini Agents" icon="microchip" href="./mini">
    Explore lightweight, focused AI agents
  </Card>
</CardGroup>

<Note>
  For optimal results, ensure your media files are in supported formats and sizes for processing.
</Note>
