Agents can see and understand images: they can describe content, read text, and answer questions about what they see.

Quick Start

1. Analyze an Image

import { Agent } from 'praisonai';

const agent = new Agent({
  instructions: 'You describe images in detail',
  llm: 'gpt-4o'  // Vision-capable model
});

await agent.chat([
  { role: 'user', content: [
    { type: 'text', text: 'What is in this image?' },
    { type: 'image', url: 'https://example.com/photo.jpg' }
  ]}
]);
2. From Local File

await agent.chat([
  { role: 'user', content: [
    { type: 'text', text: 'Describe this image' },
    { type: 'image', path: './my-photo.jpg' }
  ]}
]);
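
If you'd prefer to load the file yourself, here is a minimal sketch that embeds the image as a base64 data URL. One assumption: that the library forwards data URLs in the url field to the model, as OpenAI-style vision APIs accept.

import { readFileSync } from 'fs';

// Read the local image and encode it as a base64 data URL.
const base64 = readFileSync('./my-photo.jpg').toString('base64');

await agent.chat([
  { role: 'user', content: [
    { type: 'text', text: 'Describe this image' },
    { type: 'image', url: `data:image/jpeg;base64,${base64}` }
  ]}
]);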

User Interaction Flow

[Diagram: user interaction flow]

Configuration Levels

// Level 1: Boolean - Enable with a vision-capable model
const agent = new Agent({
  llm: 'gpt-4o',  // Vision-capable
  vision: true
});

// Level 2: String - Specify detail level
const agent = new Agent({
  llm: 'gpt-4o',
  vision: 'high'  // 'low', 'auto', 'high'
});

// Level 3: Object - Full options
const agent = new Agent({
  llm: 'gpt-4o',
  vision: {
    detail: 'high',
    maxImages: 5
  }
});
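
Taken together, these levels suggest a union type along the following lines. This is a hypothetical sketch for illustration; VisionDetail and VisionOptions are assumed names, not necessarily the library's exports.

// Hypothetical types implied by the three configuration levels above.
type VisionDetail = 'low' | 'auto' | 'high';

interface VisionOptions {
  detail?: VisionDetail;  // resolution at which images are sent to the model
  maxImages?: number;     // cap on images per message
}

// vision accepts a boolean, a detail string, or a full options object.
type VisionConfig = boolean | VisionDetail | VisionOptions;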

What You Can Do

| Task | Example |
|------|---------|
| Describe images | "What is in this photo?" |
| Read text (OCR) | "What does the sign say?" |
| Compare images | "What changed between these?" |
| Identify objects | "List everything you see" |
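
Comparing images, for instance, just means sending more than one image part in the same message. A sketch, with placeholder URLs:

await agent.chat([
  { role: 'user', content: [
    { type: 'text', text: 'What changed between these?' },
    { type: 'image', url: 'https://example.com/before.jpg' },
    { type: 'image', url: 'https://example.com/after.jpg' }
  ]}
]);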

API Reference

- VisionConfig: Complete configuration options
- VisionAgent: Full class documentation

Best Practices

- Use a vision-capable model: GPT-4o, Claude 3, or Gemini Pro Vision all support image analysis.
- Ask specific questions: "What text is on the document?" works better than "What is this?"
- Set detail: 'high' when reading small text or documents (see the sketch below).
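
Putting these practices together, a sketch of an OCR-focused agent (the file path is a placeholder):

const ocrAgent = new Agent({
  instructions: 'You transcribe text from documents accurately',
  llm: 'gpt-4o',              // vision-capable model
  vision: { detail: 'high' }  // high detail for small text
});

await ocrAgent.chat([
  { role: 'user', content: [
    { type: 'text', text: 'What text is on the document?' },
    { type: 'image', path: './scanned-invoice.jpg' }
  ]}
]);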