Multi-Modal Agent

Build agents that can process and understand images, PDFs, audio, and other file types.

Quick Start

import { Agent, createImagePart } from 'praisonai-ts';

const agent = new Agent({
  name: 'VisionAgent',
  instructions: 'You analyze images and documents.',
  model: 'gpt-4o', // Vision-capable model
});

// Analyze an image
const response = await agent.chat([
  { type: 'text', text: 'What do you see in this image?' },
  createImagePart('https://example.com/image.jpg'),
]);

console.log(response);

Image Analysis

From URL

import { createImagePart } from 'praisonai-ts';

const response = await agent.chat([
  { type: 'text', text: 'Describe this image' },
  createImagePart('https://example.com/photo.jpg'),
]);

From Base64

import { createImagePart } from 'praisonai-ts';
import fs from 'fs';

const imageData = fs.readFileSync('./image.png');
const base64 = imageData.toString('base64');

const response = await agent.chat([
  { type: 'text', text: 'What is in this image?' },
  createImagePart(`data:image/png;base64,${base64}`),
]);
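
The Base64 example above hardcodes `image/png` in the data URL. The sketch below shows one way to derive the MIME type from the file name instead. `mimeFromPath`, `toDataUrl`, and the extension map are hypothetical helpers written for illustration, not part of praisonai-ts.

```typescript
import { Buffer } from 'node:buffer';

// Hypothetical extension-to-MIME map; extend it for the formats you use.
const MIME_BY_EXTENSION: Record<string, string> = {
  png: 'image/png',
  jpg: 'image/jpeg',
  jpeg: 'image/jpeg',
  gif: 'image/gif',
  webp: 'image/webp',
};

// Guess the MIME type from the file extension; throws if it is not in the map.
function mimeFromPath(path: string): string {
  const ext = path.split('.').pop()?.toLowerCase() ?? '';
  const mime = MIME_BY_EXTENSION[ext];
  if (!mime) throw new Error(`Unsupported image extension: ${path}`);
  return mime;
}

// Encode raw bytes as a data URL of the form data:<mime>;base64,<payload>.
function toDataUrl(bytes: Uint8Array, mime: string): string {
  return `data:${mime};base64,${Buffer.from(bytes).toString('base64')}`;
}
```

The resulting string has the same shape as the template literal passed to `createImagePart` above, so it should be usable in its place, for example `createImagePart(toDataUrl(fs.readFileSync(path), mimeFromPath(path)))`.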

From File Path

import { createImagePart } from 'praisonai-ts';

const response = await agent.chat([
  { type: 'text', text: 'Analyze this screenshot' },
  createImagePart('./screenshot.png'), // Local file path
]);

PDF Processing

import { Agent, createPdfPart } from 'praisonai-ts';

const agent = new Agent({
  name: 'DocumentAgent',
  instructions: 'You analyze PDF documents.',
  model: 'gpt-4o',
});

const response = await agent.chat([
  { type: 'text', text: 'Summarize this document' },
  createPdfPart('./report.pdf'),
]);

File Attachments

import { createFilePart } from 'praisonai-ts';

// Text file
const response = await agent.chat([
  { type: 'text', text: 'Review this code' },
  createFilePart('./code.ts', 'text/typescript'),
]);

// CSV data
const response2 = await agent.chat([
  { type: 'text', text: 'Analyze this data' },
  createFilePart('./data.csv', 'text/csv'),
]);

Multi-Modal Messages

Combine multiple content types in a single message:

import { createMultimodalMessage } from 'praisonai-ts';

const message = createMultimodalMessage([
  { type: 'text', text: 'Compare these two images:' },
  { type: 'image', url: 'https://example.com/image1.jpg' },
  { type: 'image', url: 'https://example.com/image2.jpg' },
]);

const response = await agent.chat(message);
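
When the list of images is only known at runtime, the parts array can be built programmatically. This is a minimal sketch: the `TextPart` and `ImagePart` shapes are copied from the example above, and `buildImageParts` is a hypothetical helper. The library's actual exported types may differ.

```typescript
// Part shapes mirrored from the example above (an assumption about the real types).
type TextPart = { type: 'text'; text: string };
type ImagePart = { type: 'image'; url: string };
type Part = TextPart | ImagePart;

// Build one text part followed by one image part per URL.
function buildImageParts(prompt: string, urls: string[]): Part[] {
  if (urls.length === 0) {
    throw new Error('At least one image URL is required');
  }
  return [
    { type: 'text', text: prompt },
    ...urls.map((url): ImagePart => ({ type: 'image', url })),
  ];
}
```

Under that assumption, the returned array could be passed to `createMultimodalMessage` in place of the literal array shown above.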

Image Generation

Generate images with DALL-E or other models:

import { aiGenerateImage } from 'praisonai-ts';

const result = await aiGenerateImage({
  model: 'dall-e-3',
  prompt: 'A futuristic city with flying cars',
  size: '1024x1024',
  quality: 'hd',
});

console.log(result.images[0].url);

With Agent

import { createImageAgent } from 'praisonai-ts';

const imageAgent = createImageAgent({
  model: 'dall-e-3',
  defaultSize: '1024x1024',
});

const result = await imageAgent.generate('A sunset over mountains');
console.log(result.url);

Supported Models

| Model | Provider | Capabilities |
| --- | --- | --- |
| gpt-4o | OpenAI | Vision, Text |
| gpt-4o-mini | OpenAI | Vision, Text |
| claude-3.5-sonnet | Anthropic | Vision, Text, PDFs |
| claude-3-opus | Anthropic | Vision, Text, PDFs |
| gemini-1.5-pro | Google | Vision, Text, Video |
| gemini-1.5-flash | Google | Vision, Text |
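
The table above can be encoded as a lookup so that code can check a model's capabilities before sending it a file. This is a sketch only: `MODEL_CAPABILITIES` and `supports` are hypothetical names, and the data simply restates the table rather than querying the library or the providers.

```typescript
type Capability = 'Vision' | 'Text' | 'PDFs' | 'Video';

// Capabilities copied from the Supported Models table.
const MODEL_CAPABILITIES: Record<string, Capability[]> = {
  'gpt-4o': ['Vision', 'Text'],
  'gpt-4o-mini': ['Vision', 'Text'],
  'claude-3.5-sonnet': ['Vision', 'Text', 'PDFs'],
  'claude-3-opus': ['Vision', 'Text', 'PDFs'],
  'gemini-1.5-pro': ['Vision', 'Text', 'Video'],
  'gemini-1.5-flash': ['Vision', 'Text'],
};

// Returns false for models that are not listed in the table.
function supports(model: string, capability: Capability): boolean {
  return MODEL_CAPABILITIES[model]?.includes(capability) ?? false;
}
```

A caller might check `supports(model, 'PDFs')` before attaching a PDF and fall back to a different model when it returns false.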

Best Practices

  1. Use an appropriate model - Not all models support vision; choose one listed under Supported Models
  2. Optimize image size - Resize large images before sending them to reduce token usage
  3. Be specific - Give clear instructions about what to look for in the image
  4. Handle errors - Some images may fail to process, so wrap calls in error handling
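
For the error-handling point, one common pattern is to retry a failed call a few times before giving up. The helper below is a generic sketch: `withRetry` is a hypothetical name, it is not provided by praisonai-ts, and it retries on any error, which may not suit failures that are permanent (such as an unsupported file type).

```typescript
// Retry an async operation up to `attempts` times, waiting `delayMs` between tries.
async function withRetry<T>(
  fn: () => Promise<T>,
  attempts = 3,
  delayMs = 500,
): Promise<T> {
  if (attempts < 1) throw new RangeError('attempts must be at least 1');
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      // Wait before the next try, but not after the final failure.
      if (i < attempts - 1) {
        await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
      }
    }
  }
  throw lastError;
}
```

A call such as `agent.chat([...])` could then be wrapped as `await withRetry(() => agent.chat(parts))`, with a surrounding `try`/`catch` to report the final failure.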

Environment Variables

| Variable | Required | Description |
| --- | --- | --- |
| OPENAI_API_KEY | Yes | For GPT-4o vision |
| ANTHROPIC_API_KEY | For Claude | Claude vision |
| GOOGLE_API_KEY | For Gemini | Gemini vision |
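
A quick startup check can catch a missing key before the first request fails. The sketch below maps each provider to the variable named in the table above. `missingKeys` and the `Provider` type are hypothetical, and the environment is passed in as a plain object so the function stays easy to test.

```typescript
type Provider = 'openai' | 'anthropic' | 'google';

// Variable names taken from the Environment Variables table.
const KEY_BY_PROVIDER: Record<Provider, string> = {
  openai: 'OPENAI_API_KEY',
  anthropic: 'ANTHROPIC_API_KEY',
  google: 'GOOGLE_API_KEY',
};

// Return the names of the variables that are unset or empty for the given providers.
function missingKeys(
  providers: Provider[],
  env: Record<string, string | undefined>,
): string[] {
  return providers
    .map((provider) => KEY_BY_PROVIDER[provider])
    .filter((key) => !env[key]);
}
```

In a Node.js application this would typically be called as `missingKeys(['openai'], process.env)`, failing fast if the returned list is not empty.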