EvaluationLoop implements the “Ralph Loop” pattern: run an agent, judge the output, provide feedback, and repeat until the quality threshold is met.

How It Works
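
Each cycle runs the agent on the task, asks a Judge to score the output against your criteria, and feeds the Judge's findings back into the next attempt. The loop stops when the score meets the threshold (in optimize mode) or when max_iterations is reached. A rough sketch of the control flow (illustrative pseudocode, not the library internals):
# Illustrative pseudocode of the Ralph Loop; run_agent and judge are placeholders
findings = []
for i in range(max_iterations):
    output = run_agent(task, findings)             # re-run the agent with prior feedback
    score, new_findings = judge(output, criteria)  # Judge scores the output on a 1-10 scale
    findings.extend(new_findings)                  # accumulate feedback for the next pass
    if mode == "optimize" and score >= threshold:  # optimize mode stops early on success
        break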

Quick Start

from praisonaiagents import Agent

agent = Agent(name="analyzer", instructions="Analyze systems thoroughly")

result = agent.run_until(
    "Analyze the authentication flow",
    criteria="Analysis is thorough and actionable",
    threshold=8.0,
)

print(f"Score: {result.final_score}/10")
print(f"Success: {result.success}")

Modes

Optimize Mode

Stops as soon as the threshold is met. Best for production use.
from praisonaiagents.eval import EvaluationLoop

loop = EvaluationLoop(
    agent=agent,
    criteria="...",
    mode="optimize"  # default
)

Review Mode

Runs all iterations regardless of score. Best for analysis.
loop = EvaluationLoop(
    agent=agent,
    criteria="...",
    mode="review"
)
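
In review mode the loop runs every iteration regardless of score, so the full trajectory is available afterwards. A minimal sketch using the run_async entry point documented under Async Support:
import asyncio
from praisonaiagents import Agent
from praisonaiagents.eval import EvaluationLoop

agent = Agent(name="analyzer", instructions="Analyze systems thoroughly")
loop = EvaluationLoop(
    agent=agent,
    criteria="Analysis is thorough and actionable",
    mode="review",
    max_iterations=3,
)

# Run all iterations, then inspect how the score evolved
result = asyncio.run(loop.run_async("Analyze the authentication flow"))
print(result.score_history)  # every iteration's score, e.g. [6.0, 7.5, 8.2]
print(result.final_score)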

Configuration

agent (Agent, required): The Agent instance to evaluate
criteria (string, required): Evaluation criteria for the Judge (e.g., “Response is thorough and accurate”)
threshold (float, default 8.0): Score threshold for success (1-10 scale)
max_iterations (int, default 5): Maximum number of iterations before stopping
mode (string, default "optimize"): "optimize" (stop on success) or "review" (run all iterations)
on_iteration (Callable, optional): Callback called after each iteration with an IterationResult
verbose (bool, default false): Enable verbose logging
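
Putting the options together, a fully configured loop might look like this (the values are illustrative, not recommendations):
from praisonaiagents import Agent
from praisonaiagents.eval import EvaluationLoop

agent = Agent(name="analyzer", instructions="Analyze systems thoroughly")

loop = EvaluationLoop(
    agent=agent,
    criteria="Analysis covers security, performance, and maintainability",
    threshold=8.5,     # stop once the Judge scores at least 8.5
    max_iterations=4,  # never run more than 4 iterations
    mode="optimize",   # stop early on success (the default)
    on_iteration=lambda r: print(f"[{r.iteration}] {r.score}"),  # progress callback
    verbose=True,
)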

Results

EvaluationLoopResult

success (bool): Whether the loop achieved the threshold
final_score (float): Score from the last iteration (1-10)
score_history (list[float]): All scores across iterations
final_output (string): Output from the last iteration
accumulated_findings (list[string]): All findings/suggestions collected
num_iterations (int): Number of iterations completed
total_duration_seconds (float): Total time taken

result = agent.run_until("Analyze the codebase", criteria="...")

# Access results
print(result.success)              # True
print(result.final_score)          # 8.5
print(result.score_history)        # [6.0, 7.2, 8.5]
print(result.num_iterations)       # 3
print(result.accumulated_findings) # ["Consider edge cases", ...]

# Generate report
print(result.final_report)         # Markdown report

# Serialize
print(result.to_json())            # JSON string
print(result.to_dict())            # Dictionary

IterationResult

Each iteration produces an IterationResult:
for iteration in result.iterations:
    print(f"Iteration {iteration.iteration}")
    print(f"  Score: {iteration.score}/10")
    print(f"  Reasoning: {iteration.reasoning}")
    print(f"  Findings: {iteration.findings}")
    print(f"  Output: {iteration.output[:100]}...")

Async Support

import asyncio
from praisonaiagents import Agent
from praisonaiagents.eval import EvaluationLoop

async def main():
    agent = Agent(name="analyzer", instructions="Analyze systems")
    
    # Using EvaluationLoop directly
    loop = EvaluationLoop(agent=agent, criteria="Analysis is thorough")
    result = await loop.run_async("Analyze the auth flow")
    
    # Or using Agent method
    result = await agent.run_until_async(
        "Analyze the auth flow",
        criteria="Analysis is thorough",
    )
    
    print(result.final_score)

asyncio.run(main())

Best Practices

Be specific in your criteria to get consistent results:
# ❌ Vague
criteria="Response is good"

# ✅ Specific
criteria="Response includes: 1) Clear problem statement, 2) Step-by-step solution, 3) Code examples"

Pick a threshold that matches the quality you need (see the example after this list):
  • 8.0 (default): Good for most use cases
  • 9.0+: High quality, may require more iterations
  • 7.0: Acceptable quality, faster completion
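For example, a stricter bar for a final deliverable (the same run_until call as in the Quick Start, with only the threshold raised):
result = agent.run_until(
    "Analyze the authentication flow",
    criteria="Analysis is thorough and actionable",
    threshold=9.0,  # stricter bar; expect more iterations
)
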
Each iteration makes LLM calls. Set max_iterations based on your budget:
loop = EvaluationLoop(
    agent=agent,
    criteria="...",
    max_iterations=3,  # Limit for cost control
)

Track progress in real-time:
def on_iteration(r):
    print(f"[{r.iteration}] Score: {r.score} - {r.reasoning[:50]}...")
    if r.score < 6:
        print("  ⚠️ Low score, may need more iterations")

result = agent.run_until("...", criteria="...", on_iteration=on_iteration)

EvaluationLoop uses lazy loading: the Judge is only imported when you actually run an evaluation, ensuring zero performance impact when it is not in use.