EvaluationLoop implements the “Ralph Loop” pattern: run an agent, judge the output, provide feedback, and repeat until the quality threshold is met.

How It Works
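
Each cycle runs the agent on the task, asks a Judge to score the output against your criteria, and feeds the Judge's findings back into the next attempt. The loop stops when the score meets the threshold (in optimize mode) or when max_iterations is reached. A rough sketch of the control flow (illustrative pseudocode, not the library internals):
# Illustrative pseudocode of the Ralph Loop; run_agent and judge are placeholders
findings = []
for i in range(max_iterations):
    output = run_agent(task, findings)             # re-run the agent with prior feedback
    score, new_findings = judge(output, criteria)  # Judge scores the output on a 1-10 scale
    findings.extend(new_findings)                  # accumulate feedback for the next pass
    if mode == "optimize" and score >= threshold:  # optimize mode stops early on success
        break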

Quick Start

from praisonaiagents import Agent

agent = Agent(name="analyzer", instructions="Analyze systems thoroughly")

result = agent.run_until(
    "Analyze the authentication flow",
    criteria="Analysis is thorough and actionable",
    threshold=8.0,
)

print(f"Score: {result.final_score}/10")
print(f"Success: {result.success}")

Modes

Optimize Mode

Stops as soon as the threshold is met. Best for production use.
from praisonaiagents.eval import EvaluationLoop

loop = EvaluationLoop(
    agent=agent,
    criteria="...",
    mode="optimize"  # default
)

Review Mode

Runs all iterations regardless of score. Best for analysis.
loop = EvaluationLoop(
    agent=agent,
    criteria="...",
    mode="review"
)
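
In review mode the loop runs every iteration regardless of score, so the full trajectory is available afterwards. A minimal sketch using the run_async entry point documented under Async Support:
import asyncio
from praisonaiagents import Agent
from praisonaiagents.eval import EvaluationLoop

agent = Agent(name="analyzer", instructions="Analyze systems thoroughly")
loop = EvaluationLoop(
    agent=agent,
    criteria="Analysis is thorough and actionable",
    mode="review",
    max_iterations=3,
)

# Run all iterations, then inspect how the score evolved
result = asyncio.run(loop.run_async("Analyze the authentication flow"))
print(result.score_history)  # every iteration's score, e.g. [6.0, 7.5, 8.2]
print(result.final_score)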

Configuration

agent (Agent, required): The Agent instance to evaluate
criteria (string, required): Evaluation criteria for the Judge (e.g., “Response is thorough and accurate”)
threshold (float, default 8.0): Score threshold for success (1-10 scale)
max_iterations (int, default 5): Maximum number of iterations before stopping
mode (string, default "optimize"): "optimize" (stop on success) or "review" (run all iterations)
on_iteration (Callable, optional): Callback called after each iteration with an IterationResult
verbose (bool, default false): Enable verbose logging
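
Putting the options together, a fully configured loop might look like this (the values are illustrative, not recommendations):
from praisonaiagents import Agent
from praisonaiagents.eval import EvaluationLoop

agent = Agent(name="analyzer", instructions="Analyze systems thoroughly")

loop = EvaluationLoop(
    agent=agent,
    criteria="Analysis covers security, performance, and maintainability",
    threshold=8.5,     # stop once the Judge scores at least 8.5
    max_iterations=4,  # never run more than 4 iterations
    mode="optimize",   # stop early on success (the default)
    on_iteration=lambda r: print(f"[{r.iteration}] {r.score}"),  # progress callback
    verbose=True,
)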

Results

EvaluationLoopResult

success (bool): Whether the loop achieved the threshold
final_score (float): Score from the last iteration (1-10)
score_history (list[float]): All scores across iterations
final_output (string): Output from the last iteration
accumulated_findings (list[string]): All findings/suggestions collected
num_iterations (int): Number of iterations completed
total_duration_seconds (float): Total time taken

result = agent.run_until("Analyze the codebase", criteria="...")

# Access results
print(result.success)              # True
print(result.final_score)          # 8.5
print(result.score_history)        # [6.0, 7.2, 8.5]
print(result.num_iterations)       # 3
print(result.accumulated_findings) # ["Consider edge cases", ...]

# Generate report
print(result.final_report)         # Markdown report

# Serialize
print(result.to_json())            # JSON string
print(result.to_dict())            # Dictionary

IterationResult

Each iteration produces an IterationResult:
for iteration in result.iterations:
    print(f"Iteration {iteration.iteration}")
    print(f"  Score: {iteration.score}/10")
    print(f"  Reasoning: {iteration.reasoning}")
    print(f"  Findings: {iteration.findings}")
    print(f"  Output: {iteration.output[:100]}...")

Async Support

import asyncio
from praisonaiagents import Agent
from praisonaiagents.eval import EvaluationLoop

async def main():
    agent = Agent(name="analyzer", instructions="Analyze systems")
    
    # Using EvaluationLoop directly
    loop = EvaluationLoop(agent=agent, criteria="Analysis is thorough")
    result = await loop.run_async("Analyze the auth flow")
    
    # Or using Agent method
    result = await agent.run_until_async(
        "Analyze the auth flow",
        criteria="Analysis is thorough",
    )
    
    print(result.final_score)

asyncio.run(main())

Best Practices

Be specific in your criteria to get consistent results:
# ❌ Vague
criteria="Response is good"

# ✅ Specific
criteria="Response includes: 1) Clear problem statement, 2) Step-by-step solution, 3) Code examples"

Pick a threshold that matches the quality you need (see the example after this list):
  • 8.0 (default): Good for most use cases
  • 9.0+: High quality, may require more iterations
  • 7.0: Acceptable quality, faster completion
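For example, a stricter bar for a final deliverable (the same run_until call as in the Quick Start, with only the threshold raised):
result = agent.run_until(
    "Analyze the authentication flow",
    criteria="Analysis is thorough and actionable",
    threshold=9.0,  # stricter bar; expect more iterations
)
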
Each iteration makes LLM calls. Set max_iterations based on your budget:
loop = EvaluationLoop(
    agent=agent,
    criteria="...",
    max_iterations=3,  # Limit for cost control
)

Track progress in real-time:
def on_iteration(r):
    print(f"[{r.iteration}] Score: {r.score} - {r.reasoning[:50]}...")
    if r.score < 6:
        print("  ⚠️ Low score, may need more iterations")

result = agent.run_until("...", criteria="...", on_iteration=on_iteration)

EvaluationLoop uses lazy loading: the Judge is only imported when you actually run an evaluation, ensuring zero performance impact when it is not in use.