Multi-Modal Agent
Build agents that can process and understand images, PDFs, audio, and other file types.Quick Start
Image Analysis
From URL
From Base64
From File Path
PDF Processing
File Attachments
Multi-Modal Messages
Combine multiple content types in a single message:Image Generation
Generate images with DALL-E or other models:With Agent
Supported Models
| Model | Provider | Capabilities |
|---|---|---|
gpt-4o | OpenAI | Vision, Text |
gpt-4o-mini | OpenAI | Vision, Text |
claude-3.5-sonnet | Anthropic | Vision, Text, PDFs |
claude-3-opus | Anthropic | Vision, Text, PDFs |
gemini-1.5-pro | Vision, Text, Video | |
gemini-1.5-flash | Vision, Text |
Best Practices
- Use appropriate models - Not all models support vision
- Optimize image size - Resize large images to reduce tokens
- Be specific - Provide clear instructions for image analysis
- Handle errors - Some images may fail to process
Environment Variables
| Variable | Required | Description |
|---|---|---|
OPENAI_API_KEY | Yes | For GPT-4o vision |
ANTHROPIC_API_KEY | For Claude | Claude vision |
GOOGLE_API_KEY | For Gemini | Gemini vision |
Related
- Image Agent - Dedicated image agent
- Generate Image - Image generation

