Agents can see and understand images: they can describe content, read text, and answer questions about what they see.

Quick Start

1. Analyze an Image

import { Agent } from 'praisonai';

const agent = new Agent({
  instructions: 'You describe images in detail',
  llm: 'gpt-4o'  // Vision-capable model
});

await agent.chat([
  { role: 'user', content: [
    { type: 'text', text: 'What is in this image?' },
    { type: 'image', url: 'https://example.com/photo.jpg' }
  ]}
]);
2. From Local File

await agent.chat([
  { role: 'user', content: [
    { type: 'text', text: 'Describe this image' },
    { type: 'image', path: './my-photo.jpg' }
  ]}
]);
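
If you'd prefer to load the file yourself, here is a minimal sketch that embeds the image as a base64 data URL. One assumption: that the library forwards data URLs in the url field to the model, as OpenAI-style vision APIs accept.

import { readFileSync } from 'fs';

// Read the local image and encode it as a base64 data URL.
const base64 = readFileSync('./my-photo.jpg').toString('base64');

await agent.chat([
  { role: 'user', content: [
    { type: 'text', text: 'Describe this image' },
    { type: 'image', url: `data:image/jpeg;base64,${base64}` }
  ]}
]);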

User Interaction Flow

[Diagram: user interaction flow]

Configuration Levels

// Level 1: Boolean - Enable with a vision-capable model
const agent = new Agent({
  llm: 'gpt-4o',  // Vision-capable
  vision: true
});

// Level 2: String - Specify detail level
const agent = new Agent({
  llm: 'gpt-4o',
  vision: 'high'  // 'low', 'auto', 'high'
});

// Level 3: Object - Full options
const agent = new Agent({
  llm: 'gpt-4o',
  vision: {
    detail: 'high',
    maxImages: 5
  }
});
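
Taken together, these levels suggest a union type along the following lines. This is a hypothetical sketch for illustration; VisionDetail and VisionOptions are assumed names, not necessarily the library's exports.

// Hypothetical types implied by the three configuration levels above.
type VisionDetail = 'low' | 'auto' | 'high';

interface VisionOptions {
  detail?: VisionDetail;  // resolution at which images are sent to the model
  maxImages?: number;     // cap on images per message
}

// vision accepts a boolean, a detail string, or a full options object.
type VisionConfig = boolean | VisionDetail | VisionOptions;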

What You Can Do

| Task | Example |
|------|---------|
| Describe images | "What is in this photo?" |
| Read text (OCR) | "What does the sign say?" |
| Compare images | "What changed between these?" |
| Identify objects | "List everything you see" |
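
Comparing images, for instance, just means sending more than one image part in the same message. A sketch, with placeholder URLs:

await agent.chat([
  { role: 'user', content: [
    { type: 'text', text: 'What changed between these?' },
    { type: 'image', url: 'https://example.com/before.jpg' },
    { type: 'image', url: 'https://example.com/after.jpg' }
  ]}
]);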

API Reference

- VisionConfig: Complete configuration options
- VisionAgent: Full class documentation

Best Practices

- Use a vision-capable model: GPT-4o, Claude 3, or Gemini Pro Vision all support image analysis.
- Ask specific questions: "What text is on the document?" works better than "What is this?"
- Set detail: 'high' when reading small text or documents (see the sketch below).
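
Putting these practices together, a sketch of an OCR-focused agent (the file path is a placeholder):

const ocrAgent = new Agent({
  instructions: 'You transcribe text from documents accurately',
  llm: 'gpt-4o',              // vision-capable model
  vision: { detail: 'high' }  // high detail for small text
});

await ocrAgent.chat([
  { role: 'user', content: [
    { type: 'text', text: 'What text is on the document?' },
    { type: 'image', path: './scanned-invoice.jpg' }
  ]}
]);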