Evaluation uses LLM-as-Judge to assess AI outputs with human-like reasoning, providing scores, feedback, and improvement suggestions.

How Evaluation Works


LLM as Judge

The Judge class uses an LLM to evaluate outputs with human-like reasoning. This is the recommended approach for most evaluations.
from praisonaiagents.eval import Judge

# Evaluate any output
result = Judge().run(
    output="The capital of France is Paris.",
    expected="Paris is the capital of France."
)

print(f"Score: {result.score}/10")
print(f"Reasoning: {result.reasoning}")

Judge Types

AccuracyJudge

Compares output against expected output.
from praisonaiagents.eval import AccuracyJudge

judge = AccuracyJudge()
result = judge.run(
    output="Paris",
    expected="Paris",
    input="What is the capital of France?"
)
# Score: 10/10 - Perfect match

CriteriaJudge

Evaluates against custom criteria.
from praisonaiagents.eval import CriteriaJudge

judge = CriteriaJudge(criteria="Response is professional and helpful")
result = judge.run(output="Hello! How can I assist you today?")
# Score: 9/10 - Professional and helpful

RecipeJudge

Evaluates multi-agent workflow outputs.
from praisonaiagents.eval import RecipeJudge

judge = RecipeJudge()
result = judge.run(
    output=workflow_output,  # output produced by a multi-agent workflow run
    expected="Complete research report"
)

Judge Registry

Register and retrieve custom judges:
from praisonaiagents.eval import add_judge, get_judge, list_judges

# Register a custom judge
add_judge("my_judge", MyCustomJudge)

# List all judges
print(list_judges())  # ['accuracy', 'criteria', 'recipe', 'my_judge']

# Get a judge by name
judge = get_judge("my_judge")
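
The registry expects the judge class to exist before it is registered. The snippet below is a hedged sketch of what MyCustomJudge could look like; subclassing Judge and overriding run are assumptions for illustration, not the library's documented extension API.
from praisonaiagents.eval import Judge, add_judge

# Hypothetical custom judge: assumes Judge can be subclassed and that
# run(output, expected) is the interface callers use after get_judge().
class MyCustomJudge(Judge):
    def run(self, output, expected=None, **kwargs):
        # Example pre-check before delegating to the LLM-based judgement.
        if output and len(output) > 2000:
            print("Warning: output exceeds 2000 characters")
        return super().run(output=output, expected=expected, **kwargs)

add_judge("my_judge", MyCustomJudge)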

Evaluation Types

Accuracy Evaluation

Compare agent output against expected output using LLM-as-Judge.
from praisonaiagents.eval import AccuracyEvaluator

evaluator = AccuracyEvaluator(
    agent=my_agent,
    input_text="What is 2+2?",
    expected_output="4"
)

result = evaluator.run(print_summary=True)
print(f"Average Score: {result.avg_score}/10")

Performance Evaluation

Measure runtime and memory usage.
from praisonaiagents.eval import PerformanceEvaluator

evaluator = PerformanceEvaluator(
    agent=my_agent,
    input_text="Hello!",
    num_iterations=10,
    warmup_runs=2
)

result = evaluator.run(print_summary=True)
print(f"Avg Time: {result.avg_run_time:.3f}s")
print(f"Avg Memory: {result.avg_memory_usage:.2f}MB")
Metric              Description
avg_run_time        Average execution time
min_run_time        Fastest execution
max_run_time        Slowest execution
std_dev_run_time    Standard deviation
median_run_time     Median execution time
p95_run_time        95th percentile
avg_memory_usage    Average memory (MB)
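
The snippet below continues the PerformanceEvaluator example above and assumes each metric name in the table is exposed as an attribute on the result object, as avg_run_time and avg_memory_usage are in the prints shown earlier.
# Continues the PerformanceEvaluator example above; assumes the metric
# names in the table are attributes of the result object.
print(f"Median Time: {result.median_run_time:.3f}s")
print(f"P95 Time: {result.p95_run_time:.3f}s")
print(f"Std Dev: {result.std_dev_run_time:.3f}s")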

Reliability Evaluation

Verify that expected tools are called.
from praisonaiagents.eval import ReliabilityEvaluator

evaluator = ReliabilityEvaluator(
    agent=my_agent,
    input_text="Search for AI news",
    expected_tools=["search_web", "summarize"]
)

result = evaluator.run(print_summary=True)
print(f"Status: {result.status}")  # PASSED or FAILED
print(f"Pass Rate: {result.pass_rate}%")

Criteria Evaluation

Evaluate against custom criteria with numeric or binary scoring.
from praisonaiagents.eval import CriteriaEvaluator

evaluator = CriteriaEvaluator(
    criteria="Response is helpful, accurate, and professional",
    agent=my_agent,
    input_text="How do I reset my password?",
    scoring_type="numeric",
    threshold=7.0
)

result = evaluator.run(print_summary=True)
print(f"Score: {result.avg_score}/10")
print(f"Passed: {result.all_passed}")

Evaluation Flow


Async Evaluation

All evaluators support async execution:
import asyncio
from praisonaiagents.eval import AccuracyEvaluator

async def evaluate():
    evaluator = AccuracyEvaluator(
        agent=my_agent,
        input_text="Hello",
        expected_output="Hi there!"
    )
    
    result = await evaluator.run_async(print_summary=True)
    return result

result = asyncio.run(evaluate())
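
Because run_async returns an awaitable, several evaluations can run concurrently with asyncio.gather. The sketch below reuses the AccuracyEvaluator arguments shown above; batching evaluators this way is an illustration, not a documented feature.
import asyncio
from praisonaiagents.eval import AccuracyEvaluator

async def evaluate_all():
    evaluators = [
        AccuracyEvaluator(agent=my_agent, input_text="What is 2+2?", expected_output="4"),
        AccuracyEvaluator(agent=my_agent, input_text="Capital of France?", expected_output="Paris"),
    ]
    # Run the evaluations concurrently rather than one after another.
    return await asyncio.gather(*(e.run_async() for e in evaluators))

results = asyncio.run(evaluate_all())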

Saving Results

Save evaluation results for later analysis:
from praisonaiagents.eval import AccuracyEvaluator

evaluator = AccuracyEvaluator(
    agent=my_agent,
    input_text="Test",
    expected_output="Expected",
    save_results_path="./eval_results/accuracy_{timestamp}.json"
)

result = evaluator.run()
# Results automatically saved to file
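
Saved files can later be reloaded for analysis. A minimal sketch, assuming plain JSON files are written to save_results_path; the exact schema depends on the evaluator.
import glob
import json

# Reload previously saved evaluation results (assumes plain JSON files).
for path in glob.glob("./eval_results/*.json"):
    with open(path) as f:
        data = json.load(f)
    print(path, data)  # exact schema depends on the evaluator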

Evaluation Packages

Run multiple test cases as a batch:
from praisonaiagents.eval import EvalPackage, EvalCase, Judge

# Define test cases
cases = [
    EvalCase(name="math", input="2+2", expected="4"),
    EvalCase(name="geography", input="Capital of France?", expected="Paris"),
    EvalCase(name="greeting", input="Hello", expected="Hi"),
]

# Create package
package = EvalPackage(
    name="Math and Geography Tests",
    cases=cases
)

# Run cases with Judge
judge = Judge()
for case in package.cases:
    result = judge.run(
        agent=my_agent,
        input=case.input,
        expected=case.expected
    )
    print(f"{case.name}: {result.score}/10")

Quick Reference

Judge

from praisonaiagents.eval import Judge
result = Judge().run(output="...", expected="...")

Accuracy

from praisonaiagents.eval import AccuracyEvaluator
result = AccuracyEvaluator(agent=a, input_text="...", expected_output="...").run()

Performance

from praisonaiagents.eval import PerformanceEvaluator
result = PerformanceEvaluator(agent=a, input_text="...", num_iterations=10).run()

Reliability

from praisonaiagents.eval import ReliabilityEvaluator
result = ReliabilityEvaluator(agent=a, input_text="...", expected_tools=["..."]).run()

Best Practices

Use Judge for most evaluations - it provides human-like reasoning.
Be specific: “Response is under 100 words and includes a greeting” is better than “Response is good”.
Run evaluations multiple times to account for LLM non-determinism (see the sketch after this list).
Use save_results_path to track evaluation history over time.
Use Accuracy for correctness, Performance for speed, Reliability for tool usage.
LLM Costs: Each evaluation makes LLM API calls. Use num_iterations wisely and consider caching for repeated evaluations.
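
A minimal sketch of repeated runs, assuming the Judge().run call shown earlier and that averaging the numeric scores is an acceptable way to smooth out non-determinism:
from praisonaiagents.eval import Judge

judge = Judge()
scores = []
for _ in range(3):  # repeat to smooth out LLM non-determinism
    result = judge.run(
        output="The capital of France is Paris.",
        expected="Paris is the capital of France."
    )
    scores.append(result.score)

print(f"Average over 3 runs: {sum(scores) / len(scores):.1f}/10")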