Agent Evaluation
PraisonAI provides a comprehensive evaluation framework for testing and benchmarking AI agents. The evaluation system supports multiple evaluation types with zero performance impact when not in use.
Evaluation Types
| Type | Description | Use Case |
|---|---|---|
| Accuracy | Compare output against expected output using LLM-as-judge | Verify correctness |
| Performance | Measure runtime and memory usage | Benchmark speed |
| Reliability | Verify expected tool calls are made | Test tool usage |
| Criteria | Evaluate against custom criteria | Quality assessment |
Installation
The evaluation framework is included in praisonaiagents:
pip install praisonaiagents
Python Usage
Accuracy Evaluation
Compare agent outputs against expected results using an LLM judge:
from praisonaiagents import Agent
from praisonaiagents.eval import AccuracyEvaluator
# Create agent
agent = Agent(instructions="You are a math tutor. Answer concisely.")
# Create evaluator
evaluator = AccuracyEvaluator(
    agent=agent,
    input_text="What is 2 + 2?",
    expected_output="4",
    num_iterations=3,  # Run multiple times for statistical significance
)
# Run evaluation
result = evaluator.run(print_summary=True)
print(f"Average Score: {result.avg_score}/10")
print(f"Passed: {result.passed}")
Performance Evaluation
Benchmark agent runtime and memory usage:
from praisonaiagents import Agent
from praisonaiagents.eval import PerformanceEvaluator
agent = Agent(instructions="You are a helpful assistant.")
evaluator = PerformanceEvaluator(
    agent=agent,
    input_text="What is the capital of France?",
    num_iterations=10,  # Number of benchmark runs
    warmup_runs=2,      # Warmup runs before measurement
    track_memory=True,  # Track memory usage
)
result = evaluator.run(print_summary=True)
print(f"Average Time: {result.avg_run_time:.4f}s")
print(f"Min Time: {result.min_run_time:.4f}s")
print(f"Max Time: {result.max_run_time:.4f}s")
print(f"P95 Time: {result.p95_run_time:.4f}s")
print(f"Avg Memory: {result.avg_memory:.2f} MB")
Reliability Evaluation
Verify that agents call the expected tools:
from praisonaiagents import Agent
from praisonaiagents.eval import ReliabilityEvaluator
def search_web(query: str) -> str:
    """Search the web."""
    return f"Results for: {query}"

def calculate(expression: str) -> str:
    """Calculate the expression (demo only: eval() is unsafe on untrusted input)."""
    return str(eval(expression))

agent = Agent(
    instructions="You have search and calculator tools.",
    tools=[search_web, calculate]
)

evaluator = ReliabilityEvaluator(
    agent=agent,
    input_text="Search for weather and calculate 25 * 4",
    expected_tools=["search_web", "calculate"],
    forbidden_tools=["delete_file"],  # Should NOT be called
)
result = evaluator.run(print_summary=True)
print(f"Passed: {result.passed}")
print(f"Pass Rate: {result.pass_rate:.1%}")
Criteria Evaluation
Evaluate outputs against custom criteria using LLM-as-judge:
from praisonaiagents import Agent
from praisonaiagents.eval import CriteriaEvaluator
agent = Agent(instructions="You are a customer service agent.")
# Numeric scoring (1-10)
evaluator = CriteriaEvaluator(
    criteria="Response is helpful, empathetic, and provides a clear solution",
    agent=agent,
    input_text="My order hasn't arrived yet.",
    scoring_type="numeric",  # Score 1-10
    threshold=7.0,           # Pass if score >= 7
    num_iterations=2,
)
result = evaluator.run(print_summary=True)
print(f"Average Score: {result.avg_score}/10")
print(f"Pass Rate: {result.pass_rate:.1%}")
# Binary scoring (pass/fail)
binary_evaluator = CriteriaEvaluator(
    criteria="Response does not contain offensive language",
    agent=agent,
    input_text="Tell me a joke",
    scoring_type="binary",
)
binary_result = binary_evaluator.run(print_summary=True)
Failure Callbacks
Handle evaluation failures with callbacks:
from praisonaiagents.eval import CriteriaEvaluator
def handle_failure(score):
    print(f"ALERT: Evaluation failed with score {score.score}")
    print(f"Reasoning: {score.reasoning}")
    # Send alert, log to monitoring system, etc.

evaluator = CriteriaEvaluator(
    criteria="Response is professional",
    agent=agent,
    input_text="Help me",
    on_fail=handle_failure,
    threshold=8.0
)
evaluator.run()
Evaluate Pre-generated Outputs
Evaluate outputs without running the agent:
from praisonaiagents.eval import AccuracyEvaluator, CriteriaEvaluator
# Accuracy evaluation of pre-generated output
accuracy_eval = AccuracyEvaluator(
    func=lambda x: "unused",  # Placeholder
    input_text="What is 2+2?",
    expected_output="4"
)
result = accuracy_eval.evaluate_output("The answer is 4")
# Criteria evaluation of pre-generated output
criteria_eval = CriteriaEvaluator(
    criteria="Response is helpful and accurate",
    func=lambda x: "unused"
)
result = criteria_eval.evaluate_output("Here's how to solve that...")
Saving Results
Save evaluation results to files:
evaluator = AccuracyEvaluator(
    agent=agent,
    input_text="Test input",
    expected_output="Expected output",
    save_results_path="results/{name}_{eval_id}.json"  # Supports placeholders
)
result = evaluator.run()
# Results automatically saved to file
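Because results are saved as plain JSON, they can be reloaded later for comparison. A minimal sketch, assuming the saved file mirrors the fields of result.to_dict(); the path and field names below are illustrative:
import json
# Hypothetical file produced by the save_results_path pattern above
with open("results/accuracy_1234.json") as f:
    saved = json.load(f)
# Field names assume the file mirrors result.to_dict()
print(saved.get("avg_score"), saved.get("passed"))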
CLI Usage
Accuracy Evaluation
# Direct prompt (no agents.yaml needed)
praisonai eval accuracy \
--prompt "What is 2+2?" \
--expected "4"
# Or with agents.yaml:
praisonai eval accuracy \
--agent agents.yaml \
--input "What is 2+2?" \
--expected "4" \
--iterations 3 \
--output results.json \
--verbose
Performance Evaluation
praisonai eval performance \
--agent agents.yaml \
--input "Hello" \
--iterations 10 \
--warmup 2 \
--memory \
--output perf_results.json
Reliability Evaluation
praisonai eval reliability \
--agent agents.yaml \
--input "Search for weather" \
--expected-tools "search_web,calculate" \
--forbidden-tools "delete_file" \
--output reliability.json
Criteria Evaluation
praisonai eval criteria \
--agent agents.yaml \
--input "Help me with my order" \
--criteria "Response is helpful and professional" \
--scoring numeric \
--threshold 7.0 \
--iterations 2 \
--output criteria.json
Batch Evaluation
Run multiple test cases from a JSON file:
praisonai eval batch \
--agent agents.yaml \
--test-file tests.json \
--batch-type accuracy \
--output batch_results.json
Test file format (tests.json):
[
  {
    "input": "What is 2+2?",
    "expected": "4"
  },
  {
    "input": "What is the capital of France?",
    "expected": "Paris"
  }
]
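To drive the same test file from Python, the sketch below simply loops over the cases with AccuracyEvaluator. This is an illustrative loop rather than a dedicated batch API, and it assumes the tests.json format shown above:
import json
from praisonaiagents import Agent
from praisonaiagents.eval import AccuracyEvaluator

agent = Agent(instructions="You are a helpful assistant.")

with open("tests.json") as f:
    cases = json.load(f)

# Run an accuracy evaluation for each test case
results = [
    AccuracyEvaluator(
        agent=agent,
        input_text=case["input"],
        expected_output=case["expected"],
    ).run()
    for case in cases
]

passed = sum(1 for r in results if r.passed)
print(f"{passed}/{len(results)} cases passed")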
CLI Options Reference
Common Options
| Option | Short | Description |
|---|---|---|
| --agent | -a | Path to agents.yaml file |
| --output | -o | Output file for results |
| --verbose | -v | Enable verbose output |
| --quiet | -q | Suppress JSON output |
Accuracy Options
| Option | Short | Description |
|---|---|---|
| --input | -i | Input text for the agent |
| --expected | -e | Expected output |
| --iterations | -n | Number of iterations |
| --model | -m | LLM model for judging |
Performance Options
| Option | Short | Description |
|---|---|---|
| --input | -i | Input text for the agent |
| --iterations | -n | Number of benchmark iterations |
| --warmup | -w | Number of warmup runs |
| --memory | | Track memory usage |
Reliability Options
| Option | Short | Description |
|---|---|---|
| --input | -i | Input text for the agent |
| --expected-tools | -t | Expected tools (comma-separated) |
| --forbidden-tools | -f | Forbidden tools (comma-separated) |
Criteria Options
| Option | Short | Description |
|---|---|---|
| --input | -i | Input text for the agent |
| --criteria | -c | Evaluation criteria |
| --scoring | -s | Scoring type (numeric/binary) |
| --threshold | | Pass threshold for numeric scoring |
| --iterations | -n | Number of iterations |
| --model | -m | LLM model for judging |
Result Data Structures
AccuracyResult
result.evaluations # List of individual scores
result.avg_score # Average score (0-10)
result.min_score # Minimum score
result.max_score # Maximum score
result.std_dev # Standard deviation
result.passed # True if avg_score >= 7
result.to_dict() # Convert to dictionary
result.to_json() # Convert to JSON string
PerformanceResult
result.metrics # List of PerformanceMetrics
result.avg_run_time # Average runtime in seconds
result.min_run_time # Minimum runtime
result.max_run_time # Maximum runtime
result.median_run_time # Median runtime
result.p95_run_time # 95th percentile runtime
result.avg_memory # Average memory usage (MB)
result.max_memory # Peak memory usage (MB)
ReliabilityResult
result.tool_results # List of ToolCallResult
result.passed_calls # Tools that passed
result.failed_calls # Tools that failed
result.pass_rate # Pass rate (0-1)
result.passed # True if all checks passed
result.status # "PASSED" or "FAILED"
CriteriaResult
result.evaluations # List of CriteriaScore
result.criteria # The evaluation criteria
result.scoring_type # "numeric" or "binary"
result.threshold # Pass threshold
result.avg_score # Average score
result.pass_rate # Pass rate (0-1)
result.passed # True if passed threshold
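To see how these fields fit together, the sketch below inspects a CriteriaResult and prints the reasoning behind low-scoring iterations. It assumes evaluator is a numeric-scoring CriteriaEvaluator from the examples above and that each entry in result.evaluations exposes score and reasoning, as in the failure-callback example:
result = evaluator.run()

print(f"{result.scoring_type} evaluation of: {result.criteria}")
print(f"avg={result.avg_score}, pass rate={result.pass_rate:.1%}, passed={result.passed}")

# Each CriteriaScore is assumed to expose .score and .reasoning
for score in result.evaluations:
    if score.score < result.threshold:
        print(f"Low score {score.score}: {score.reasoning}")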
LLM Judge in Interactive Tests
The interactive test runner integrates LLM-as-judge evaluation for automated response quality assessment. This allows you to validate not just tool calls and file outputs, but also the quality of agent responses.
Using Judge in CSV Tests
Add a judge_rubric column to your CSV test file:
id,name,prompts,judge_rubric,judge_threshold,judge_model
test_01,Helpful Response,"Explain Python decorators",Response is clear and accurate,7.0,gpt-4o-mini
test_02,Code Quality,"Create a function to sort a list",Code is correct and well-documented,8.0,gpt-4o-mini
Judge Configuration
| Option | Default | Description |
|---|---|---|
| judge_rubric | (empty) | Evaluation criteria for the judge |
| judge_threshold | 7.0 | Minimum score to pass (1-10 scale) |
| judge_model | gpt-4o-mini | Model used for evaluation |
CLI Options for Judge
# Run with judge evaluation
praisonai test interactive --csv tests.csv
# Skip judge even if rubric is present
praisonai test interactive --csv tests.csv --no-judge
# Use a different judge model
praisonai test interactive --csv tests.csv --judge-model gpt-4o
Judge Output
When judge evaluation is enabled, results include:
- Score: 1-10 rating based on rubric
- Passed: Whether score meets threshold
- Reasoning: Detailed explanation of the score
Example artifact (judge_result.json):
{
  "score": 8.5,
  "passed": true,
  "reasoning": "SCORE: 8.5\nREASONING: The response clearly explains...",
  "threshold": 7.0,
  "model": "gpt-4o-mini"
}
Writing Effective Rubrics
Good rubrics are:
- Specific: “Response includes code example” vs “Response is good”
- Measurable: “Explains at least 3 benefits” vs “Comprehensive”
- Relevant: Focus on what matters for the test case
Examples:
# Good rubrics
"Response contains working Python code with proper error handling"
"Explanation covers syntax, use cases, and at least one example"
"File was created with correct content and proper formatting"
# Avoid vague rubrics
"Response is helpful"
"Code is good"
"Answer is correct"
Best Practices
- Use Multiple Iterations: Run evaluations multiple times for statistical significance
- Warmup Runs: Use warmup runs for performance benchmarks to avoid cold-start effects
- Save Results: Always save results for tracking and comparison
- Custom Criteria: Write specific, measurable criteria for criteria evaluations
- Batch Testing: Use batch evaluation for regression testing
- CI/CD Integration: Integrate evaluations into your CI/CD pipeline (see the sketch after this list)
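For the CI/CD bullet above, one common pattern is to run an evaluation as a pipeline step and fail the build when it does not pass. A minimal sketch, assuming the CI environment provides the LLM API key the agent needs:
import sys
from praisonaiagents import Agent
from praisonaiagents.eval import AccuracyEvaluator

agent = Agent(instructions="You are a math tutor. Answer concisely.")

evaluator = AccuracyEvaluator(
    agent=agent,
    input_text="What is 2 + 2?",
    expected_output="4",
    num_iterations=3,
    save_results_path="results/{name}_{eval_id}.json",  # keep artifacts for later comparison
)

result = evaluator.run(print_summary=True)

# A non-zero exit code fails the CI job when the evaluation does not pass
sys.exit(0 if result.passed else 1)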
Examples
Complete, end-to-end examples are available in the examples directory.
GitHub Advanced Test Rubrics
The github-advanced test suite uses specialized LLM judge rubrics for evaluating GitHub workflow quality.
Available Rubrics
| Rubric | Description | Key Criteria |
|---|---|---|
| PR Quality | Evaluates pull request quality | Title clarity, body completeness, issue reference, branch naming |
| Code Quality | Evaluates code changes | Correctness, tests pass, coverage, type hints, no regressions |
| Workflow Correctness | Evaluates GitHub workflow | Repo created, issue created, PR links issue |
| CI/CD Quality | Evaluates CI configuration | Valid YAML, checkout step, setup step, triggers |
| Documentation | Evaluates docs changes | Links valid, content accurate, formatting correct |
| Multi-Agent | Evaluates agent collaboration | Handoff, task completion, context preservation |
Rubric Structure
Each rubric contains weighted criteria:
from tests.live.interactive.github_advanced.judge_rubric import (
    PR_QUALITY_RUBRIC,
    evaluate_with_rubric,
)
# Get evaluation prompt
prompt = PR_QUALITY_RUBRIC.get_prompt()
# Evaluate with context
result = evaluate_with_rubric(
    rubric=PR_QUALITY_RUBRIC,
    context={
        "pr_title": "Fix subtract sign bug",
        "pr_body": "Closes #1. Fixed the subtract function.",
        "branch": "fix/subtract-sign",
    },
    judge_model="gpt-4o-mini",
)
print(result["overall_score"]) # 0-10
print(result["passed"]) # True/False
Scenario to Rubric Mapping
| Scenario | Rubrics Applied |
|---|---|
| GH_01 | PR Quality, Code Quality, Workflow Correctness |
| GH_02 | PR Quality, CI/CD Quality, Workflow Correctness |
| GH_03 | PR Quality, Code Quality, Workflow Correctness |
| GH_04 | PR Quality, Documentation, Workflow Correctness |
| GH_05 | PR Quality, Multi-Agent, Workflow Correctness |