Use an LLM to evaluate and score agent outputs for accuracy, quality, and custom criteria.

• Accuracy Mode: compare output against an expected result
• Criteria Mode: evaluate against custom criteria
• Recipe Mode: evaluate workflow execution traces
• Extensible: register custom judge types

Quick Start

from praisonaiagents.eval import Judge

# Simple accuracy check
result = Judge().run(output="4", expected="4", input="What is 2+2?")
print(f"Score: {result.score}/10, Passed: {result.passed}")

Evaluation Modes

Accuracy Evaluation

Compare agent output against an expected result:
from praisonaiagents.eval import Judge

judge = Judge()
result = judge.run(
    output="Python is a high-level programming language",
    expected="Python is a programming language",
    input="What is Python?"
)

print(f"Score: {result.score}/10")
print(f"Reasoning: {result.reasoning}")

Criteria Evaluation

Evaluate output against custom criteria:
from praisonaiagents.eval import Judge

judge = Judge(criteria="Response is helpful, accurate, and concise")
result = judge.run(output="Hello! I'm here to help you with any questions.")

if result.passed:
    print("✅ Output meets criteria")
else:
    print("❌ Output needs improvement")
    for suggestion in result.suggestions:
        print(f"  • {suggestion}")

Recipe/Workflow Evaluation

Evaluate multi-agent workflow execution:
from praisonaiagents.eval import RecipeJudge

judge = RecipeJudge(mode="context")  # or "memory", "knowledge"
result = judge.run(
    output="Final workflow output...",
    expected="Complete analysis with citations"
)

Configuration

JudgeConfig

• model (string, default "gpt-4o-mini"): LLM model to use for evaluation
• temperature (number, default 0.1): temperature for scoring; lower values give more consistent scores
• maxTokens (number, default 500): maximum tokens for the LLM response
• threshold (number, default 7.0): score threshold for passing (1-10 scale)
• criteria (string, optional): custom evaluation criteria

from praisonaiagents.eval import Judge, JudgeConfig

config = JudgeConfig(
    model="gpt-4o",
    temperature=0.0,
    threshold=8.0,
    criteria="Response must be technically accurate"
)

judge = Judge(config=config)

JudgeResult

The result object contains:
• score (number): quality score (1-10)
• passed (boolean): whether score >= threshold
• reasoning (string): explanation for the score
• output (string): the judged output
• expected (string, optional): expected output, if provided
• criteria (string, optional): criteria used, if provided
• suggestions (string[]): improvement suggestions
• timestamp (number): when the evaluation occurred
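
Reading these fields follows the earlier examples. The sketch below only uses attributes already shown above (score, passed, reasoning, suggestions); the example output string is illustrative:

from praisonaiagents.eval import Judge

judge = Judge(criteria="Response is concise and accurate")
result = judge.run(output="Paris is the capital of France.")

# Fields from the list above
print(f"Score: {result.score}/10")        # quality score (1-10)
print(f"Passed: {result.passed}")         # score >= threshold
print(f"Reasoning: {result.reasoning}")   # explanation for the score
for suggestion in result.suggestions:     # improvement suggestions
    print(f"  • {suggestion}")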

Judge with Agent

Evaluate an agent’s response directly:
from praisonaiagents import Agent
from praisonaiagents.eval import Judge

agent = Agent(
    name="Math Helper",
    instructions="You solve math problems"
)

judge = Judge()
result = judge.run(
    agent=agent,
    input="What is 15 * 7?",
    expected="105"
)

print(f"Agent scored: {result.score}/10")

Custom Judges

Register Custom Judge

from praisonaiagents.eval import Judge, add_judge, get_judge, list_judges

class CodeQualityJudge(Judge):
    """Judge for evaluating code quality."""
    
    def __init__(self, **kwargs):
        super().__init__(
            criteria="Code is clean, efficient, and well-documented",
            **kwargs
        )

# Register
add_judge("code_quality", CodeQualityJudge)

# Use
JudgeClass = get_judge("code_quality")
judge = JudgeClass()

# List all judges
print(list_judges())  # ['accuracy', 'criteria', 'recipe', 'code_quality']
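
Once registered, a custom judge runs like any other criteria judge. A minimal usage sketch continuing the example above (the code snippet being scored is illustrative only):

# Evaluate a code snippet with the registered judge
result = judge.run(output="def add(a, b):\n    return a + b")
print(f"Score: {result.score}/10, Passed: {result.passed}")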

Domain-Agnostic Evaluation

Use JudgeCriteriaConfig for any domain:
from praisonaiagents.eval import Judge, JudgeCriteriaConfig

# Water flow optimization
config = JudgeCriteriaConfig(
    name="water_flow",
    description="Evaluate water flow optimization",
    prompt_template="""Evaluate the water flow configuration:
{output}

Score based on:
- Flow rate efficiency
- Pressure optimization  
- Resource conservation

SCORE: [1-10]
REASONING: [explanation]
SUGGESTIONS: [improvements]""",
    scoring_dimensions=["flow_rate", "pressure", "efficiency"],
    threshold=7.0
)

judge = Judge(criteria_config=config)
result = judge.run(output="Flow rate: 50L/min, Pressure: 2.5 bar")

Async Evaluation

import asyncio
from praisonaiagents.eval import Judge

async def evaluate_outputs():
    judge = Judge(criteria="Response is helpful")
    
    outputs = [
        "Hello! How can I help?",
        "I don't know.",
        "Let me help you with that!"
    ]
    
    results = await asyncio.gather(*[
        judge.run_async(output=output)
        for output in outputs
    ])
    
    for output, result in zip(outputs, results):
        print(f"{output[:30]}... → {result.score}/10")

asyncio.run(evaluate_outputs())

CLI Reference

# Basic judge
praisonai eval judge --input "Output to evaluate"

# With expected output (accuracy mode)
praisonai eval judge --input "4" --expected "4"

# With criteria
praisonai eval judge --input "Hello!" --criteria "Response is friendly"

# Custom threshold
praisonai eval judge --input "Test" --threshold 8.0

# JSON output
praisonai eval judge --input "Test" --json

# Custom model
praisonai eval judge --input "Test" --model gpt-4o

Best Practices

Set temperature to 0.1 or lower for consistent scoring across evaluations.
Be specific about what constitutes a good output; vague criteria lead to inconsistent scores.
Pick a threshold that matches your quality bar:
  • 7.0: Standard quality bar
  • 8.0: High quality requirement
  • 6.0: Lenient evaluation
The suggestions array provides actionable improvements. Use them to iterate on agent prompts, as sketched below.
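
A minimal sketch of that iteration loop, reusing only calls shown earlier (Agent(name, instructions) and judge.run(agent=...)); the retry budget and the way suggestions are folded back into the instructions are illustrative choices:

from praisonaiagents import Agent
from praisonaiagents.eval import Judge

judge = Judge(criteria="Answer is correct and clearly explained")
instructions = "You solve math problems"

for attempt in range(3):  # illustrative retry budget
    agent = Agent(name="Math Helper", instructions=instructions)
    result = judge.run(agent=agent, input="What is 15 * 7?", expected="105")
    if result.passed:
        break
    # Fold the judge's suggestions back into the prompt for the next attempt
    instructions += "\nAlso: " + "; ".join(result.suggestions)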