Agent Evaluation

PraisonAI provides a comprehensive evaluation framework for testing and benchmarking AI agents. The evaluation system supports multiple evaluation types with zero performance impact when not in use.

Evaluation Types

Type          Description                                                  Use Case
Accuracy      Compare output against expected output using LLM-as-judge   Verify correctness
Performance   Measure runtime and memory usage                             Benchmark speed
Reliability   Verify expected tool calls are made                          Test tool usage
Criteria      Evaluate against custom criteria                             Quality assessment
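
Each evaluator lives in praisonaiagents.eval and follows the same construct-then-run pattern shown in the sections below. As a quick orientation, the shared imports look like this:
from praisonaiagents.eval import (
    AccuracyEvaluator,
    PerformanceEvaluator,
    ReliabilityEvaluator,
    CriteriaEvaluator,
)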

Installation

The evaluation framework is included in praisonaiagents:
pip install praisonaiagents

Python Usage

Accuracy Evaluation

Compare agent outputs against expected results using an LLM judge:
from praisonaiagents import Agent
from praisonaiagents.eval import AccuracyEvaluator

# Create agent
agent = Agent(instructions="You are a math tutor. Answer concisely.")

# Create evaluator
evaluator = AccuracyEvaluator(
    agent=agent,
    input_text="What is 2 + 2?",
    expected_output="4",
    num_iterations=3,  # Run multiple times for statistical significance
    verbose=True
)

# Run evaluation
result = evaluator.run(print_summary=True)

print(f"Average Score: {result.avg_score}/10")
print(f"Passed: {result.passed}")

Performance Evaluation

Benchmark agent runtime and memory usage:
from praisonaiagents import Agent
from praisonaiagents.eval import PerformanceEvaluator

agent = Agent(instructions="You are a helpful assistant.")

evaluator = PerformanceEvaluator(
    agent=agent,
    input_text="What is the capital of France?",
    num_iterations=10,  # Number of benchmark runs
    warmup_runs=2,      # Warmup runs before measurement
    track_memory=True,  # Track memory usage
    verbose=True
)

result = evaluator.run(print_summary=True)

print(f"Average Time: {result.avg_run_time:.4f}s")
print(f"Min Time: {result.min_run_time:.4f}s")
print(f"Max Time: {result.max_run_time:.4f}s")
print(f"P95 Time: {result.p95_run_time:.4f}s")
print(f"Avg Memory: {result.avg_memory:.2f} MB")

Reliability Evaluation

Verify that agents call the expected tools:
from praisonaiagents import Agent
from praisonaiagents.eval import ReliabilityEvaluator

def search_web(query: str) -> str:
    """Search the web."""
    return f"Results for: {query}"

def calculate(expression: str) -> str:
    """Calculate expression."""
    return str(eval(expression))  # demo only; avoid eval() on untrusted input

agent = Agent(
    instructions="You have search and calculator tools.",
    tools=[search_web, calculate]
)

evaluator = ReliabilityEvaluator(
    agent=agent,
    input_text="Search for weather and calculate 25 * 4",
    expected_tools=["search_web", "calculate"],
    forbidden_tools=["delete_file"],  # Should NOT be called
    verbose=True
)

result = evaluator.run(print_summary=True)

print(f"Passed: {result.passed}")
print(f"Pass Rate: {result.pass_rate:.1%}")

Criteria Evaluation

Evaluate outputs against custom criteria using LLM-as-judge:
from praisonaiagents import Agent
from praisonaiagents.eval import CriteriaEvaluator

agent = Agent(instructions="You are a customer service agent.")

# Numeric scoring (1-10)
evaluator = CriteriaEvaluator(
    criteria="Response is helpful, empathetic, and provides a clear solution",
    agent=agent,
    input_text="My order hasn't arrived yet.",
    scoring_type="numeric",  # Score 1-10
    threshold=7.0,           # Pass if score >= 7
    num_iterations=2,
    verbose=True
)

result = evaluator.run(print_summary=True)

print(f"Average Score: {result.avg_score}/10")
print(f"Pass Rate: {result.pass_rate:.1%}")

# Binary scoring (pass/fail)
binary_evaluator = CriteriaEvaluator(
    criteria="Response does not contain offensive language",
    agent=agent,
    input_text="Tell me a joke",
    scoring_type="binary",
    verbose=True
)

binary_result = binary_evaluator.run(print_summary=True)

Failure Callbacks

Handle evaluation failures with callbacks:
from praisonaiagents import Agent
from praisonaiagents.eval import CriteriaEvaluator

agent = Agent(instructions="You are a customer service agent.")

def handle_failure(score):
    print(f"ALERT: Evaluation failed with score {score.score}")
    print(f"Reasoning: {score.reasoning}")
    # Send alert, log to monitoring system, etc.

evaluator = CriteriaEvaluator(
    criteria="Response is professional",
    agent=agent,
    input_text="Help me",
    on_fail=handle_failure,
    threshold=8.0
)

evaluator.run()

Evaluate Pre-generated Outputs

Evaluate outputs without running the agent:
from praisonaiagents.eval import AccuracyEvaluator, CriteriaEvaluator

# Accuracy evaluation of pre-generated output
accuracy_eval = AccuracyEvaluator(
    func=lambda x: "unused",  # Placeholder
    input_text="What is 2+2?",
    expected_output="4"
)

result = accuracy_eval.evaluate_output("The answer is 4")

# Criteria evaluation of pre-generated output
criteria_eval = CriteriaEvaluator(
    criteria="Response is helpful and accurate",
    func=lambda x: "unused"
)

result = criteria_eval.evaluate_output("Here's how to solve that...")
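
This pattern is useful for scoring responses you have already collected, for example from production logs. A small sketch that loops over stored outputs with the accuracy_eval defined above (the stored responses are illustrative):
stored_outputs = [
    "The answer is 4",
    "2 + 2 equals 4",
]

for output in stored_outputs:
    print(accuracy_eval.evaluate_output(output))  # score each stored response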

Saving Results

Save evaluation results to files:
evaluator = AccuracyEvaluator(
    agent=agent,
    input_text="Test input",
    expected_output="Expected output",
    save_results_path="results/{name}_{eval_id}.json"  # Supports placeholders
)

result = evaluator.run()
# Results automatically saved to file
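
The saved file is plain JSON, so it can be reloaded later for inspection or regression comparison. A minimal sketch using the standard library (the file name is a hypothetical example of what the placeholder pattern above might produce):
import json

with open("results/accuracy_eval_1.json") as f:  # hypothetical output of save_results_path
    saved = json.load(f)

print(saved)  # the stored evaluation result as a dictionary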

CLI Usage

Accuracy Evaluation

# Direct prompt (no agents.yaml needed)
praisonai eval accuracy \
    --prompt "What is 2+2?" \
    --expected "4"

# Or with agents.yaml:
praisonai eval accuracy \
    --agent agents.yaml \
    --input "What is 2+2?" \
    --expected "4" \
    --iterations 3 \
    --output results.json \
    --verbose

Performance Evaluation

praisonai eval performance \
    --agent agents.yaml \
    --input "Hello" \
    --iterations 10 \
    --warmup 2 \
    --memory \
    --output perf_results.json

Reliability Evaluation

praisonai eval reliability \
    --agent agents.yaml \
    --input "Search for weather" \
    --expected-tools "search_web,calculate" \
    --forbidden-tools "delete_file" \
    --output reliability.json

Criteria Evaluation

praisonai eval criteria \
    --agent agents.yaml \
    --input "Help me with my order" \
    --criteria "Response is helpful and professional" \
    --scoring numeric \
    --threshold 7.0 \
    --iterations 2 \
    --output criteria.json

Batch Evaluation

Run multiple test cases from a JSON file:
praisonai eval batch \
    --agent agents.yaml \
    --test-file tests.json \
    --batch-type accuracy \
    --output batch_results.json
Test file format (tests.json):
[
    {
        "input": "What is 2+2?",
        "expected": "4"
    },
    {
        "input": "What is the capital of France?",
        "expected": "Paris"
    }
]
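
If your test cases live in Python, the same file can be generated with the standard library. A small sketch that writes the format shown above:
import json

test_cases = [
    {"input": "What is 2+2?", "expected": "4"},
    {"input": "What is the capital of France?", "expected": "Paris"},
]

# Write the batch test file consumed by `praisonai eval batch --test-file tests.json`
with open("tests.json", "w") as f:
    json.dump(test_cases, f, indent=4)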

CLI Options Reference

Common Options

Option      Short   Description
--agent     -a      Path to agents.yaml file
--output    -o      Output file for results
--verbose   -v      Enable verbose output
--quiet     -q      Suppress JSON output

Accuracy Options

Option         Short   Description
--input        -i      Input text for the agent
--expected     -e      Expected output
--iterations   -n      Number of iterations
--model        -m      LLM model for judging

Performance Options

Option         Short   Description
--input        -i      Input text for the agent
--iterations   -n      Number of benchmark iterations
--warmup       -w      Number of warmup runs
--memory               Track memory usage

Reliability Options

Option              Short   Description
--input             -i      Input text for the agent
--expected-tools    -t      Expected tools (comma-separated)
--forbidden-tools   -f      Forbidden tools (comma-separated)

Criteria Options

Option         Short   Description
--input        -i      Input text for the agent
--criteria     -c      Evaluation criteria
--scoring      -s      Scoring type (numeric/binary)
--threshold            Pass threshold for numeric scoring
--iterations   -n      Number of iterations
--model        -m      LLM model for judging

Result Data Structures

AccuracyResult

result.evaluations  # List of individual scores
result.avg_score    # Average score (0-10)
result.min_score    # Minimum score
result.max_score    # Maximum score
result.std_dev      # Standard deviation
result.passed       # True if avg_score >= 7
result.to_dict()    # Convert to dictionary
result.to_json()    # Convert to JSON string

PerformanceResult

result.metrics          # List of PerformanceMetrics
result.avg_run_time     # Average runtime in seconds
result.min_run_time     # Minimum runtime
result.max_run_time     # Maximum runtime
result.median_run_time  # Median runtime
result.p95_run_time     # 95th percentile runtime
result.avg_memory       # Average memory usage (MB)
result.max_memory       # Peak memory usage (MB)
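
These fields make it straightforward to enforce a latency budget around the PerformanceEvaluator example above. A minimal sketch (the 2-second budget is an arbitrary illustration, not a recommendation):
BUDGET_SECONDS = 2.0  # arbitrary example budget

result = evaluator.run()
if result.p95_run_time > BUDGET_SECONDS:
    print(f"P95 runtime {result.p95_run_time:.2f}s exceeds the {BUDGET_SECONDS}s budget")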

ReliabilityResult

result.tool_results  # List of ToolCallResult
result.passed_calls  # Tools that passed
result.failed_calls  # Tools that failed
result.pass_rate     # Pass rate (0-1)
result.passed        # True if all checks passed
result.status        # "PASSED" or "FAILED"

CriteriaResult

result.evaluations  # List of CriteriaScore
result.criteria     # The evaluation criteria
result.scoring_type # "numeric" or "binary"
result.threshold    # Pass threshold
result.avg_score    # Average score
result.pass_rate    # Pass rate (0-1)
result.passed       # True if passed threshold

Best Practices

  1. Use Multiple Iterations: Run evaluations multiple times for statistical significance
  2. Warmup Runs: Use warmup runs for performance benchmarks to avoid cold-start effects
  3. Save Results: Always save results for tracking and comparison
  4. Custom Criteria: Write specific, measurable criteria for criteria evaluations
  5. Batch Testing: Use batch evaluation for regression testing
  6. CI/CD Integration: Integrate evaluations into your CI/CD pipeline (see the pytest sketch after this list)
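
For item 6, one lightweight approach is to wrap an evaluator in a test that your pipeline already runs. A sketch using pytest (the test name and pass criteria are illustrative choices, not part of the framework):
# test_agent_eval.py - run with `pytest` in CI
from praisonaiagents import Agent
from praisonaiagents.eval import AccuracyEvaluator

def test_math_accuracy():
    agent = Agent(instructions="You are a math tutor. Answer concisely.")
    evaluator = AccuracyEvaluator(
        agent=agent,
        input_text="What is 2 + 2?",
        expected_output="4",
        num_iterations=3,
    )
    result = evaluator.run()
    # Fail the CI job if the average judge score falls below the pass mark
    assert result.passed, f"Accuracy eval failed: avg score {result.avg_score}/10"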

Examples

See the examples directory for complete, runnable examples.