OCR and text extraction agent using vision models.

Simple

Agents: 1 — Single agent with vision capabilities extracts text from images.

Workflow

  1. Receive image with text
  2. Process with vision model
  3. Extract and return text content

Setup

pip install praisonaiagents praisonai
export OPENAI_API_KEY="your-key"

Run — Python

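from praisonaiagents import Agent, Task, PraisonAIAgents

# Vision-capable agent that performs the extraction
agent = Agent(
    name="OCRAgent",
    instructions="Extract all text from images preserving layout.",
    llm="gpt-4o-mini"
)

# The image is attached to the task; a relative path is resolved
# against the current working directory
task = Task(
    description="Extract all text from this document",
    expected_output="Extracted text",
    agent=agent,
    images=["document.jpg"]
)

# Run the single-task workflow and print the result
agents = PraisonAIAgents(agents=[agent], tasks=[task])
result = agents.start()
print(result)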
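Before handing paths to `images=[...]`, it can help to confirm that each file exists and looks like an image. The following is a minimal sketch using only the standard library; `check_images` and the extension list are hypothetical helpers for illustration, not part of PraisonAI.

```python
from pathlib import Path

# Extensions assumed for this sketch; adjust to what your vision model accepts.
SUPPORTED_EXTENSIONS = {".jpg", ".jpeg", ".png", ".webp", ".gif"}


def check_images(paths):
    """Split paths into (usable, problems) before building a Task."""
    usable, problems = [], []
    for raw in paths:
        path = Path(raw)
        if path.suffix.lower() not in SUPPORTED_EXTENSIONS:
            problems.append(f"{raw}: unsupported extension")
        elif not path.is_file():
            problems.append(f"{raw}: file not found")
        else:
            usable.append(str(path))
    return usable, problems


if __name__ == "__main__":
    usable, problems = check_images(["document.jpg"])
    for problem in problems:
        print("skipping", problem)
    print("usable images:", usable)
```

The `usable` list can then be passed as the `images` argument of the task, so a missing file is reported before any model call is made.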

Run — CLI

praisonai "Extract text from this document" --image document.jpg

Run — agents.yaml

framework: praisonai
topic: Text Extraction
roles:
  ocr_agent:
    role: OCR Specialist
    goal: Extract text from images
    backstory: You are an expert in text extraction
    llm: gpt-4o-mini
    tasks:
      extract:
        description: Extract all text from this document
        expected_output: Extracted text
        images:
          - document.jpg

Run the workflow from the directory containing agents.yaml:

praisonai agents.yaml

Serve API

from praisonaiagents import Agent

agent = Agent(
    name="OCRAgent",
    instructions="You are an OCR expert.",
    llm="gpt-4o-mini"
)

agent.launch(port=8080)

Query the running server from another terminal:

curl -X POST http://localhost:8080/chat \
  -H "Content-Type: application/json" \
  -d '{"message": "Extract text from: https://example.com/doc.jpg"}'
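The same request can be issued from Python. This sketch mirrors the curl call above with the standard library's `urllib`; the `/chat` path and port are taken from that call, while `build_chat_request` and `send` are hypothetical helper names. The response format is not documented here, so the body is returned as raw text.

```python
import json
import urllib.request


def build_chat_request(message, base_url="http://localhost:8080"):
    """Build (but do not send) a POST request mirroring the curl call."""
    payload = {"message": message}
    return urllib.request.Request(
        url=f"{base_url}/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=60):
    """Send the request and return the response body as text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    req = build_chat_request("Extract text from: https://example.com/doc.jpg")
    print(req.get_method(), req.full_url)
    # Uncomment once the server started by agent.launch() is running:
    # print(send(req))
```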

Advanced Workflow (All Features)

Agents: 1 — Single agent with memory, persistence, structured output, and session resumability.

Workflow

  1. Initialize session for document tracking
  2. Configure SQLite persistence for extraction history
  3. Extract text with structured output
  4. Store results in memory for search
  5. Resume session for document comparison
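To make steps 2, 4, and 5 concrete, the sketch below keeps an extraction history in SQLite and searches it by session. It uses only the standard `sqlite3` module. The `extractions` table, its columns, and the helper names are assumptions made for illustration; they are not PraisonAI's internal storage schema, and the sample rows are made up.

```python
import sqlite3


def open_history(db_path=":memory:"):
    """Open (or create) a history database with a hypothetical schema."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS extractions (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               session_id TEXT NOT NULL,
               filename TEXT NOT NULL,
               text TEXT NOT NULL
           )"""
    )
    return conn


def save_extraction(conn, session_id, filename, text):
    """Store one extraction result under a session."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO extractions (session_id, filename, text) VALUES (?, ?, ?)",
            (session_id, filename, text),
        )


def search_session(conn, session_id, keyword):
    """Return filenames in a session whose text contains the keyword.

    The keyword is used inside a LIKE pattern, so % and _ act as wildcards.
    """
    rows = conn.execute(
        "SELECT filename FROM extractions "
        "WHERE session_id = ? AND text LIKE ? ORDER BY id",
        (session_id, f"%{keyword}%"),
    )
    return [row[0] for row in rows]


if __name__ == "__main__":
    conn = open_history()  # pass "ocr.db" to persist across runs
    save_extraction(conn, "ocr-001", "document.jpg", "Invoice total: 42 USD")
    save_extraction(conn, "ocr-001", "letter.jpg", "Dear customer")
    print(search_session(conn, "ocr-001", "invoice"))
```

Keying every row by `session_id` is what allows a later run to reopen the same database and compare new documents against earlier ones from the same session.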

Setup

pip install praisonaiagents praisonai pydantic
export OPENAI_API_KEY="your-key"

Run — Python

from praisonaiagents import Agent, Task, PraisonAIAgents, Session
from pydantic import BaseModel

# Schema for the structured result of one extraction
class ExtractedDocument(BaseModel):
    filename: str
    text: str
    sections: list[str]
    word_count: int

# Session identifiers for this extraction run; reuse the same
# session_id later to resume
session = Session(session_id="ocr-001", user_id="user-1")

# Agent with memory enabled
agent = Agent(
    name="OCRAgent",
    instructions="Extract text and return structured results.",
    llm="gpt-4o-mini",
    memory=True
)

# Request output shaped like ExtractedDocument
task = Task(
    description="Extract all text from this document",
    expected_output="Structured extraction",
    agent=agent,
    images=["document.jpg"],
    output_pydantic=ExtractedDocument
)

# Memory is backed by a local SQLite file (ocr.db)
agents = PraisonAIAgents(
    agents=[agent],
    tasks=[task],
    memory=True,
    memory_config={"provider": "sqlite", "db_path": "ocr.db"},
    verbose=1
)

result = agents.start()
print(result)
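A model-reported `word_count` may not match the extracted text, so a quick consistency check can be useful. The sketch below assumes the structured result is available as a JSON string with the `ExtractedDocument` fields; how to obtain that string from `result` is not shown here. A standard-library dataclass, hypothetically named `ExtractedDocumentCheck`, stands in for the Pydantic model so the example runs without extra dependencies, and the sample payload is made up.

```python
import json
from dataclasses import dataclass


@dataclass
class ExtractedDocumentCheck:
    """Stand-in mirroring the fields of ExtractedDocument."""
    filename: str
    text: str
    sections: list
    word_count: int


def parse_and_check(raw_json):
    """Parse a JSON payload and recompute the word count from the text.

    Returns (document, recomputed_count, counts_match).
    """
    data = json.loads(raw_json)
    doc = ExtractedDocumentCheck(
        filename=data["filename"],
        text=data["text"],
        sections=list(data["sections"]),
        word_count=int(data["word_count"]),
    )
    actual = len(doc.text.split())  # whitespace-separated tokens
    return doc, actual, actual == doc.word_count


if __name__ == "__main__":
    sample = json.dumps({
        "filename": "document.jpg",
        "text": "Quarterly report for the board",
        "sections": ["Summary"],
        "word_count": 5,
    })
    doc, actual, ok = parse_and_check(sample)
    print(doc.filename, actual, ok)
```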

Run — CLI

praisonai "Extract text" --image document.jpg --memory --verbose

Run — agents.yaml

framework: praisonai
topic: Text Extraction
memory: true
memory_config:
  provider: sqlite
  db_path: ocr.db
roles:
  ocr_agent:
    role: OCR Specialist
    goal: Extract text with structured output
    backstory: You are an expert in text extraction
    llm: gpt-4o-mini
    memory: true
    tasks:
      extract:
        description: Extract all text from this document
        expected_output: Structured extraction
        images:
          - document.jpg
        output_json:
          filename: string
          text: string
          sections: array
          word_count: number

Run the workflow with verbose output:

praisonai agents.yaml --verbose

Serve API

from praisonaiagents import Agent

agent = Agent(
    name="OCRAgent",
    instructions="Extract text and return structured results.",
    llm="gpt-4o-mini",
    memory=True
)

agent.launch(port=8080)

Query the running server, passing the session ID:

curl -X POST http://localhost:8080/chat \
  -H "Content-Type: application/json" \
  -d '{"message": "Extract text", "session_id": "ocr-001"}'

Monitor / Verify

praisonai "test ocr" --image test.jpg --verbose

Cleanup

rm -f ocr.db

Features Demonstrated

Feature            | Implementation
-------------------|----------------------------------
Workflow           | Vision-based text extraction
DB Persistence     | SQLite via memory_config
Observability      | --verbose flag
Resumability       | Session with session_id
Structured Output  | Pydantic ExtractedDocument model
