# Agent Evaluation

> Comprehensive evaluation framework for testing and benchmarking AI agents

PraisonAI's evaluation framework tests and benchmarks AI agents. It supports four evaluation types (accuracy, performance, reliability, and criteria) and has zero performance impact when not in use.

## Evaluation Types

<img src="https://mintcdn.com/praisonai/SX0Y8_-DRBjzOTnt/docs/cli/eval-eval-mode-for-testing.gif?s=d3a6c1f727c25ac8c678116fd550e325" alt="Eval Mode for Testing" width="1497" height="1104" data-path="docs/cli/eval-eval-mode-for-testing.gif" />

| Type            | Description                                               | Use Case           |
| --------------- | --------------------------------------------------------- | ------------------ |
| **Accuracy**    | Compare output against expected output using LLM-as-judge | Verify correctness |
| **Performance** | Measure runtime and memory usage                          | Benchmark speed    |
| **Reliability** | Verify expected tool calls are made                       | Test tool usage    |
| **Criteria**    | Evaluate against custom criteria                          | Quality assessment |

## Installation

The evaluation framework is included in `praisonaiagents`:

```bash theme={"theme":{"light":"vitesse-light","dark":"vitesse-dark"}}
pip install praisonaiagents
```

## Python Usage

### Accuracy Evaluation

Compare agent outputs against expected results using an LLM judge:

```python theme={"theme":{"light":"vitesse-light","dark":"vitesse-dark"}}
from praisonaiagents import Agent
from praisonaiagents.eval import AccuracyEvaluator

# Create agent
agent = Agent(instructions="You are a math tutor. Answer concisely.")

# Create evaluator
evaluator = AccuracyEvaluator(
    agent=agent,
    input_text="What is 2 + 2?",
    expected_output="4",
    num_iterations=3,  # Run several times to average out run-to-run variation
)

# Run evaluation
result = evaluator.run(print_summary=True)

print(f"Average Score: {result.avg_score}/10")
print(f"Passed: {result.passed}")
```

### Performance Evaluation

Benchmark agent runtime and memory usage:

```python theme={"theme":{"light":"vitesse-light","dark":"vitesse-dark"}}
from praisonaiagents import Agent
from praisonaiagents.eval import PerformanceEvaluator

agent = Agent(instructions="You are a helpful assistant.")

evaluator = PerformanceEvaluator(
    agent=agent,
    input_text="What is the capital of France?",
    num_iterations=10,  # Number of benchmark runs
    warmup_runs=2,      # Warmup runs before measurement
    track_memory=True,  # Track memory usage
)

result = evaluator.run(print_summary=True)

print(f"Average Time: {result.avg_run_time:.4f}s")
print(f"Min Time: {result.min_run_time:.4f}s")
print(f"Max Time: {result.max_run_time:.4f}s")
print(f"P95 Time: {result.p95_run_time:.4f}s")
print(f"Avg Memory: {result.avg_memory:.2f} MB")
```
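
To make the reported statistics concrete, the sketch below computes the same kinds of summaries from a list of raw timings using only the standard library. It is illustrative: the `summarize_run_times` helper and the nearest-rank percentile rule are assumptions, and `PerformanceEvaluator` may compute `p95_run_time` differently.

```python
import math
import statistics


def summarize_run_times(run_times: list[float]) -> dict[str, float]:
    """Summarize benchmark timings in seconds (hypothetical helper)."""
    if not run_times:
        raise ValueError("run_times must not be empty")
    ordered = sorted(run_times)
    # Nearest-rank P95 (assumed rule): the smallest sample with at least
    # 95% of all samples at or below it.
    rank = math.ceil(0.95 * len(ordered))
    return {
        "avg": statistics.mean(ordered),
        "min": ordered[0],
        "max": ordered[-1],
        "median": statistics.median(ordered),
        "p95": ordered[rank - 1],
    }


# Ten made-up timings with one slow outlier.
timings = [0.80, 0.82, 0.85, 0.90, 0.95, 1.00, 1.05, 1.10, 1.20, 2.50]
summary = summarize_run_times(timings)
print(summary)
```

With only ten samples the P95 is the slowest run, which is one reason to raise `num_iterations` when tail latency matters.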

### Reliability Evaluation

Verify that agents call the expected tools:

```python theme={"theme":{"light":"vitesse-light","dark":"vitesse-dark"}}
from praisonaiagents import Agent
from praisonaiagents.eval import ReliabilityEvaluator

def search_web(query: str) -> str:
    """Search the web."""
    return f"Results for: {query}"

def calculate(expression: str) -> str:
    """Calculate expression."""
    # Demo only: eval() executes arbitrary code, so never pass it untrusted input.
    return str(eval(expression))

agent = Agent(
    instructions="You have search and calculator tools.",
    tools=[search_web, calculate]
)

evaluator = ReliabilityEvaluator(
    agent=agent,
    input_text="Search for weather and calculate 25 * 4",
    expected_tools=["search_web", "calculate"],
    forbidden_tools=["delete_file"],  # Should NOT be called
)

result = evaluator.run(print_summary=True)

print(f"Passed: {result.passed}")
print(f"Pass Rate: {result.pass_rate:.1%}")
```
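
The check itself can be pictured as set membership over the names of the tools the agent called. The sketch below is a simplified model under that assumption; `check_tool_calls` is a hypothetical helper, not the library's implementation, and the real evaluator may weigh or report checks differently.

```python
def check_tool_calls(
    called: list[str], expected: list[str], forbidden: list[str]
) -> dict:
    """Return one pass/fail check per expected or forbidden tool (assumed logic)."""
    called_set = set(called)
    checks: dict[str, bool] = {}
    for name in expected:
        checks[name] = name in called_set      # expected tool must be called
    for name in forbidden:
        checks[name] = name not in called_set  # forbidden tool must not be called
    pass_rate = sum(checks.values()) / len(checks) if checks else 1.0
    return {
        "checks": checks,
        "pass_rate": pass_rate,
        "passed": all(checks.values()),
    }


# Simulated run in which the agent searched but never used the calculator.
report = check_tool_calls(
    called=["search_web"],
    expected=["search_web", "calculate"],
    forbidden=["delete_file"],
)
print(report)
```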

### Criteria Evaluation

Evaluate outputs against custom criteria using LLM-as-judge:

```python theme={"theme":{"light":"vitesse-light","dark":"vitesse-dark"}}
from praisonaiagents import Agent
from praisonaiagents.eval import CriteriaEvaluator

agent = Agent(instructions="You are a customer service agent.")

# Numeric scoring (1-10)
evaluator = CriteriaEvaluator(
    criteria="Response is helpful, empathetic, and provides a clear solution",
    agent=agent,
    input_text="My order hasn't arrived yet.",
    scoring_type="numeric",  # Score 1-10
    threshold=7.0,           # Pass if score >= 7
    num_iterations=2,
)

result = evaluator.run(print_summary=True)

print(f"Average Score: {result.avg_score}/10")
print(f"Pass Rate: {result.pass_rate:.1%}")

# Binary scoring (pass/fail)
binary_evaluator = CriteriaEvaluator(
    criteria="Response does not contain offensive language",
    agent=agent,
    input_text="Tell me a joke",
    scoring_type="binary",
)

binary_result = binary_evaluator.run(print_summary=True)
```
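
The average score and the pass rate answer different questions: one iteration can fail the threshold while the average still clears it. The sketch below illustrates the distinction, assuming the pass rate is the fraction of iterations whose score meets the threshold; the `pass_rate` function and the scores are made up for illustration.

```python
def pass_rate(scores: list[float], threshold: float = 7.0) -> float:
    """Fraction of iterations whose score meets the threshold (assumed rule)."""
    if not scores:
        return 0.0
    return sum(score >= threshold for score in scores) / len(scores)


# Two made-up iteration scores, mirroring num_iterations=2.
numeric_scores = [8.0, 6.5]
average = sum(numeric_scores) / len(numeric_scores)
print(f"Average Score: {average}/10")
print(f"Pass Rate: {pass_rate(numeric_scores):.1%}")
```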

### Failure Callbacks

Handle evaluation failures with callbacks:

```python theme={"theme":{"light":"vitesse-light","dark":"vitesse-dark"}}
from praisonaiagents import Agent
from praisonaiagents.eval import CriteriaEvaluator

agent = Agent(instructions="You are a customer service agent.")

def handle_failure(score):
    print(f"ALERT: Evaluation failed with score {score.score}")
    print(f"Reasoning: {score.reasoning}")
    # Send alert, log to monitoring system, etc.

evaluator = CriteriaEvaluator(
    criteria="Response is professional",
    agent=agent,
    input_text="Help me",
    on_fail=handle_failure,
    threshold=8.0
)

evaluator.run()
```
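
A callback does not have to print. The sketch below collects failures into a list so they can be reported together after a run. `FakeScore` is a hypothetical stand-in that exposes only the two attributes the callback above reads (`score` and `reasoning`); the object the library actually passes may carry more fields.

```python
from dataclasses import dataclass


@dataclass
class FakeScore:
    """Hypothetical stand-in for the object passed to on_fail."""
    score: float
    reasoning: str


failures: list[dict] = []


def collect_failure(score) -> None:
    """Record a failed evaluation instead of printing it."""
    failures.append({"score": score.score, "reasoning": score.reasoning})


# Simulate the evaluator invoking the callback on a failing score.
collect_failure(FakeScore(score=5.0, reasoning="Tone was too casual."))
print(failures)
```

Passing `on_fail=collect_failure` would then fill `failures` during a run, assuming the callback contract shown above.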

### Evaluate Pre-generated Outputs

Evaluate outputs without running the agent:

```python theme={"theme":{"light":"vitesse-light","dark":"vitesse-dark"}}
from praisonaiagents.eval import AccuracyEvaluator, CriteriaEvaluator

# Accuracy evaluation of pre-generated output
accuracy_eval = AccuracyEvaluator(
    func=lambda x: "unused",  # Placeholder
    input_text="What is 2+2?",
    expected_output="4"
)

result = accuracy_eval.evaluate_output("The answer is 4")

# Criteria evaluation of pre-generated output
criteria_eval = CriteriaEvaluator(
    criteria="Response is helpful and accurate",
    func=lambda x: "unused"
)

result = criteria_eval.evaluate_output("Here's how to solve that...")
```

### Saving Results

Save evaluation results to files:

```python theme={"theme":{"light":"vitesse-light","dark":"vitesse-dark"}}
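from praisonaiagents import Agent
from praisonaiagents.eval import AccuracyEvaluator

agent = Agent(instructions="You are a helpful assistant.")

evaluator = AccuracyEvaluator(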
    agent=agent,
    input_text="Test input",
    expected_output="Expected output",
    save_results_path="results/{name}_{eval_id}.json"  # Supports placeholders
)

result = evaluator.run()
# Results automatically saved to file
```
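
The placeholders behave like named fields in a path template. The sketch below shows one plausible expansion using `str.format`; the `resolve_results_path` helper, the example `name` value, and the short random `eval_id` are assumptions for illustration, not the library's actual behavior.

```python
import json
import tempfile
import uuid
from pathlib import Path


def resolve_results_path(template: str, name: str, eval_id: str) -> Path:
    """Fill {name} and {eval_id} in a path template (assumed semantics)."""
    return Path(template.format(name=name, eval_id=eval_id))


with tempfile.TemporaryDirectory() as tmp:
    template = str(Path(tmp) / "results" / "{name}_{eval_id}.json")
    path = resolve_results_path(
        template, name="accuracy", eval_id=uuid.uuid4().hex[:8]
    )
    path.parent.mkdir(parents=True, exist_ok=True)  # create results/ if missing
    path.write_text(json.dumps({"avg_score": 9.0, "passed": True}))
    saved = json.loads(path.read_text())
    print(path.name, saved)
```

Including a unique `{eval_id}` in the file name keeps repeated runs from overwriting each other.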

## CLI Usage

### Accuracy Evaluation

```bash theme={"theme":{"light":"vitesse-light","dark":"vitesse-dark"}}
# Direct prompt (no agents.yaml needed)
praisonai eval accuracy \
    --prompt "What is 2+2?" \
    --expected "4"

# Or with agents.yaml:
praisonai eval accuracy \
    --agent agents.yaml \
    --input "What is 2+2?" \
    --expected "4" \
    --iterations 3 \
    --output results.json \
    --verbose
```

### Performance Evaluation

```bash theme={"theme":{"light":"vitesse-light","dark":"vitesse-dark"}}
praisonai eval performance \
    --agent agents.yaml \
    --input "Hello" \
    --iterations 10 \
    --warmup 2 \
    --memory \
    --output perf_results.json
```

### Reliability Evaluation

```bash theme={"theme":{"light":"vitesse-light","dark":"vitesse-dark"}}
praisonai eval reliability \
    --agent agents.yaml \
    --input "Search for weather" \
    --expected-tools "search_web,calculate" \
    --forbidden-tools "delete_file" \
    --output reliability.json
```

### Criteria Evaluation

```bash theme={"theme":{"light":"vitesse-light","dark":"vitesse-dark"}}
praisonai eval criteria \
    --agent agents.yaml \
    --input "Help me with my order" \
    --criteria "Response is helpful and professional" \
    --scoring numeric \
    --threshold 7.0 \
    --iterations 2 \
    --output criteria.json
```

### Batch Evaluation

Run multiple test cases from a JSON file:

```bash theme={"theme":{"light":"vitesse-light","dark":"vitesse-dark"}}
praisonai eval batch \
    --agent agents.yaml \
    --test-file tests.json \
    --batch-type accuracy \
    --output batch_results.json
```

**Test file format (tests.json):**

```json theme={"theme":{"light":"vitesse-light","dark":"vitesse-dark"}}
[
    {
        "input": "What is 2+2?",
        "expected": "4"
    },
    {
        "input": "What is the capital of France?",
        "expected": "Paris"
    }
]
```
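
A malformed test file is easier to catch before a batch run than during one. The sketch below validates the format shown above: a JSON array of objects with string `input` and `expected` fields. `validate_test_cases` is a hypothetical helper, and the CLI may accept additional fields that this check does not know about.

```python
import json


def validate_test_cases(raw: str) -> list[dict]:
    """Parse a batch test file and check each case has string 'input' and 'expected'."""
    cases = json.loads(raw)
    if not isinstance(cases, list):
        raise ValueError("test file must contain a JSON array")
    for index, case in enumerate(cases):
        if not isinstance(case, dict):
            raise ValueError(f"case {index}: expected a JSON object")
        for key in ("input", "expected"):
            if not isinstance(case.get(key), str):
                raise ValueError(f"case {index}: missing or non-string '{key}'")
    return cases


# Same two cases as the tests.json example above.
raw_tests = json.dumps([
    {"input": "What is 2+2?", "expected": "4"},
    {"input": "What is the capital of France?", "expected": "Paris"},
])
cases = validate_test_cases(raw_tests)
print(f"{len(cases)} valid test cases")
```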

## CLI Options Reference

### Common Options

| Option      | Short | Description              |
| ----------- | ----- | ------------------------ |
| `--agent`   | `-a`  | Path to agents.yaml file |
| `--output`  | `-o`  | Output file for results  |
| `--verbose` | `-v`  | Enable verbose output    |
| `--quiet`   | `-q`  | Suppress JSON output     |

### Accuracy Options

| Option         | Short | Description              |
| -------------- | ----- | ------------------------ |
| `--input`      | `-i`  | Input text for the agent |
| `--expected`   | `-e`  | Expected output          |
| `--iterations` | `-n`  | Number of iterations     |
| `--model`      | `-m`  | LLM model for judging    |

### Performance Options

| Option         | Short | Description                    |
| -------------- | ----- | ------------------------------ |
| `--input`      | `-i`  | Input text for the agent       |
| `--iterations` | `-n`  | Number of benchmark iterations |
| `--warmup`     | `-w`  | Number of warmup runs          |
| `--memory`     |       | Track memory usage             |

### Reliability Options

| Option              | Short | Description                       |
| ------------------- | ----- | --------------------------------- |
| `--input`           | `-i`  | Input text for the agent          |
| `--expected-tools`  | `-t`  | Expected tools (comma-separated)  |
| `--forbidden-tools` | `-f`  | Forbidden tools (comma-separated) |

### Criteria Options

| Option         | Short | Description                        |
| -------------- | ----- | ---------------------------------- |
| `--input`      | `-i`  | Input text for the agent           |
| `--criteria`   | `-c`  | Evaluation criteria                |
| `--scoring`    | `-s`  | Scoring type (numeric/binary)      |
| `--threshold`  |       | Pass threshold for numeric scoring |
| `--iterations` | `-n`  | Number of iterations               |
| `--model`      | `-m`  | LLM model for judging              |

## Result Data Structures

### AccuracyResult

```python theme={"theme":{"light":"vitesse-light","dark":"vitesse-dark"}}
result.evaluations  # List of individual scores
result.avg_score    # Average score (0-10)
result.min_score    # Minimum score
result.max_score    # Maximum score
result.std_dev      # Standard deviation
result.passed       # True if avg_score >= 7
result.to_dict()    # Convert to dictionary
result.to_json()    # Convert to JSON string
```
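
As a rough guide to how these fields relate, the sketch below derives them from a list of per-iteration scores. The scores are made up, and the use of the sample standard deviation is an assumption; the library may use the population form.

```python
import statistics

# Made-up judge scores from three iterations.
scores = [9.0, 8.0, 10.0]

avg_score = statistics.mean(scores)
min_score = min(scores)
max_score = max(scores)
# Sample standard deviation (assumed); undefined for a single score, so use 0.0.
std_dev = statistics.stdev(scores) if len(scores) > 1 else 0.0
passed = avg_score >= 7  # same rule as result.passed above

print(avg_score, min_score, max_score, std_dev, passed)
```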

### PerformanceResult

```python theme={"theme":{"light":"vitesse-light","dark":"vitesse-dark"}}
result.metrics          # List of PerformanceMetrics
result.avg_run_time     # Average runtime in seconds
result.min_run_time     # Minimum runtime
result.max_run_time     # Maximum runtime
result.median_run_time  # Median runtime
result.p95_run_time     # 95th percentile runtime
result.avg_memory       # Average memory usage (MB)
result.max_memory       # Peak memory usage (MB)
```

### ReliabilityResult

```python theme={"theme":{"light":"vitesse-light","dark":"vitesse-dark"}}
result.tool_results  # List of ToolCallResult
result.passed_calls  # Tools that passed
result.failed_calls  # Tools that failed
result.pass_rate     # Pass rate (0-1)
result.passed        # True if all checks passed
result.status        # "PASSED" or "FAILED"
```

### CriteriaResult

```python theme={"theme":{"light":"vitesse-light","dark":"vitesse-dark"}}
result.evaluations  # List of CriteriaScore
result.criteria     # The evaluation criteria
result.scoring_type # "numeric" or "binary"
result.threshold    # Pass threshold
result.avg_score    # Average score
result.pass_rate    # Pass rate (0-1)
result.passed       # True if passed threshold
```

## LLM Judge in Interactive Tests

The interactive test runner integrates LLM-as-judge evaluation for automated response quality assessment. This allows you to validate not just tool calls and file outputs, but also the quality of agent responses.

### Using Judge in CSV Tests

Add a `judge_rubric` column to your CSV test file:

```csv theme={"theme":{"light":"vitesse-light","dark":"vitesse-dark"}}
id,name,prompts,judge_rubric,judge_threshold,judge_model
test_01,Helpful Response,"Explain Python decorators",Response is clear and accurate,7.0,gpt-4o-mini
test_02,Code Quality,"Create a function to sort a list",Code is correct and well-documented,8.0,gpt-4o-mini
```

### Judge Configuration

| Option            | Default     | Description                        |
| ----------------- | ----------- | ---------------------------------- |
| `judge_rubric`    | (empty)     | Evaluation criteria for the judge  |
| `judge_threshold` | 7.0         | Minimum score to pass (1-10 scale) |
| `judge_model`     | gpt-4o-mini | Model used for evaluation          |
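
The defaults in this table can be applied when reading a test file. The sketch below is a minimal illustration using `csv.DictReader`; `load_judge_config` is a hypothetical helper, and treating an empty rubric as "skip the judge" is an assumption about the runner's behavior.

```python
import csv
import io

# Defaults taken from the Judge Configuration table.
DEFAULTS = {
    "judge_rubric": "",
    "judge_threshold": "7.0",
    "judge_model": "gpt-4o-mini",
}


def load_judge_config(csv_text: str) -> list[dict]:
    """Read test rows, filling blank or missing judge columns with defaults."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        config = {key: (row.get(key) or default) for key, default in DEFAULTS.items()}
        rows.append({
            "id": row["id"],
            "rubric": config["judge_rubric"],
            "threshold": float(config["judge_threshold"]),
            "model": config["judge_model"],
            # Assumption: the judge only runs when a rubric is provided.
            "use_judge": bool(config["judge_rubric"]),
        })
    return rows


sample = """id,name,prompts,judge_rubric,judge_threshold,judge_model
test_01,Helpful Response,"Explain Python decorators",Response is clear and accurate,8.0,
test_02,No Judge,"Say hello",,,
"""
configs = load_judge_config(sample)
for config in configs:
    print(config)
```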

### CLI Options for Judge

```bash theme={"theme":{"light":"vitesse-light","dark":"vitesse-dark"}}
# Run with judge evaluation
praisonai test interactive --csv tests.csv

# Skip judge even if rubric is present
praisonai test interactive --csv tests.csv --no-judge

# Use a different judge model
praisonai test interactive --csv tests.csv --judge-model gpt-4o
```

### Judge Output

When judge evaluation is enabled, results include:

* **Score**: 1-10 rating based on rubric
* **Passed**: Whether score meets threshold
* **Reasoning**: Detailed explanation of the score

Example artifact (`judge_result.json`):

```json theme={"theme":{"light":"vitesse-light","dark":"vitesse-dark"}}
{
  "score": 8.5,
  "passed": true,
  "reasoning": "SCORE: 8.5\nREASONING: The response clearly explains...",
  "threshold": 7.0,
  "model": "gpt-4o-mini"
}
```
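
The `reasoning` field in this artifact suggests the judge replies in a `SCORE:` / `REASONING:` layout. The sketch below parses such a reply into the artifact's fields. The reply format is inferred from the example above rather than documented, and `parse_judge_reply` is a hypothetical helper, not part of the library.

```python
import re


def parse_judge_reply(reply: str, threshold: float = 7.0) -> dict:
    """Extract score and reasoning from a 'SCORE: x / REASONING: y' reply (assumed format)."""
    score_match = re.search(r"SCORE:\s*(\d+(?:\.\d+)?)", reply)
    if score_match is None:
        raise ValueError("no SCORE line found in judge reply")
    score = float(score_match.group(1))
    reasoning_match = re.search(r"REASONING:\s*(.*)", reply, flags=re.DOTALL)
    reasoning = reasoning_match.group(1).strip() if reasoning_match else ""
    return {
        "score": score,
        "passed": score >= threshold,
        "reasoning": reasoning,
        "threshold": threshold,
    }


parsed = parse_judge_reply(
    "SCORE: 8.5\nREASONING: The response clearly explains decorators."
)
print(parsed)
```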

### Writing Effective Rubrics

Good rubrics are:

* **Specific**: "Response includes code example" vs "Response is good"
* **Measurable**: "Explains at least 3 benefits" vs "Comprehensive"
* **Relevant**: Focus on what matters for the test case

Examples:

```csv theme={"theme":{"light":"vitesse-light","dark":"vitesse-dark"}}
# Good rubrics
"Response contains working Python code with proper error handling"
"Explanation covers syntax, use cases, and at least one example"
"File was created with correct content and proper formatting"

# Avoid vague rubrics
"Response is helpful"
"Code is good"
"Answer is correct"
```

## Best Practices

1. **Use Multiple Iterations**: Run each evaluation several times so that averages are less sensitive to run-to-run variation in agent and judge output
2. **Warmup Runs**: Use warmup runs for performance benchmarks to avoid cold-start effects
3. **Save Results**: Always save results for tracking and comparison
4. **Custom Criteria**: Write specific, measurable criteria for criteria evaluations
5. **Batch Testing**: Use batch evaluation for regression testing
6. **CI/CD Integration**: Integrate evaluations into your CI/CD pipeline
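
For CI/CD integration, one simple pattern is to turn a saved result into a process exit code so that a failing evaluation fails the pipeline step. The sketch below assumes the saved JSON has a top-level boolean `passed` field, as in the `judge_result.json` example; other result files may be shaped differently, and `ci_exit_code` is a hypothetical helper.

```python
import json


def ci_exit_code(result_json: str) -> int:
    """Return 0 when the stored result passed, 1 otherwise (missing field counts as failure)."""
    result = json.loads(result_json)
    return 0 if result.get("passed") is True else 1


# In a pipeline step this would typically be something like:
#   sys.exit(ci_exit_code(Path("judge_result.json").read_text()))
print(ci_exit_code('{"score": 8.5, "passed": true, "threshold": 7.0}'))
print(ci_exit_code('{"score": 5.0, "passed": false, "threshold": 7.0}'))
```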

## Examples

See the [examples directory](https://github.com/MervinPraison/PraisonAI/tree/main/examples/python/eval) for complete examples:

* [Accuracy Evaluation](https://github.com/MervinPraison/PraisonAI/blob/main/examples/python/eval/accuracy_example.py)
* [Performance Evaluation](https://github.com/MervinPraison/PraisonAI/blob/main/examples/python/eval/performance_example.py)
* [Reliability Evaluation](https://github.com/MervinPraison/PraisonAI/blob/main/examples/python/eval/reliability_example.py)
* [Criteria Evaluation](https://github.com/MervinPraison/PraisonAI/blob/main/examples/python/eval/criteria_example.py)
* [Batch Evaluation](https://github.com/MervinPraison/PraisonAI/blob/main/examples/python/eval/batch_example.py)

## GitHub Advanced Test Rubrics

The `github-advanced` test suite uses specialized LLM judge rubrics to evaluate GitHub workflow quality.

### Available Rubrics

| Rubric               | Description                    | Key Criteria                                                     |
| -------------------- | ------------------------------ | ---------------------------------------------------------------- |
| PR Quality           | Evaluates pull request quality | Title clarity, body completeness, issue reference, branch naming |
| Code Quality         | Evaluates code changes         | Correctness, tests pass, coverage, type hints, no regressions    |
| Workflow Correctness | Evaluates GitHub workflow      | Repo created, issue created, PR links issue                      |
| CI/CD Quality        | Evaluates CI configuration     | Valid YAML, checkout step, setup step, triggers                  |
| Documentation        | Evaluates docs changes         | Links valid, content accurate, formatting correct                |
| Multi-Agent          | Evaluates agent collaboration  | Handoff, task completion, context preservation                   |

### Rubric Structure

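Each rubric contains weighted criteria. A rubric exposes its judge prompt through `get_prompt()` and is scored with `evaluate_with_rubric`: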

```python theme={"theme":{"light":"vitesse-light","dark":"vitesse-dark"}}
from tests.live.interactive.github_advanced.judge_rubric import (
    PR_QUALITY_RUBRIC,
    evaluate_with_rubric,
)

# Get evaluation prompt
prompt = PR_QUALITY_RUBRIC.get_prompt()

# Evaluate with context
result = evaluate_with_rubric(
    rubric=PR_QUALITY_RUBRIC,
    context={
        "pr_title": "Fix subtract sign bug",
        "pr_body": "Closes #1. Fixed the subtract function.",
        "branch": "fix/subtract-sign",
    },
    judge_model="gpt-4o-mini",
)

print(result["overall_score"])  # 0-10
print(result["passed"])  # True/False
```
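
One common way to combine weighted criteria into a single score is a weighted average. The sketch below illustrates that idea only: the `Criterion` class, the weights, and the per-criterion scores are all hypothetical, and the actual rubric classes in the test suite may aggregate differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    """Hypothetical criterion with a relative weight."""
    name: str
    weight: float


def overall_score(criteria: list[Criterion], scores: dict[str, float]) -> float:
    """Weighted average of per-criterion scores on a 0-10 scale."""
    total_weight = sum(criterion.weight for criterion in criteria)
    if total_weight <= 0:
        raise ValueError("total weight must be positive")
    weighted = sum(criterion.weight * scores[criterion.name] for criterion in criteria)
    return weighted / total_weight


# Criterion names follow the PR Quality row above; the weights are made up.
pr_quality = [
    Criterion("title_clarity", 3),
    Criterion("body_completeness", 3),
    Criterion("issue_reference", 2),
    Criterion("branch_naming", 2),
]
criterion_scores = {
    "title_clarity": 9,
    "body_completeness": 7,
    "issue_reference": 10,
    "branch_naming": 8,
}
overall = overall_score(pr_quality, criterion_scores)
print(overall)
```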

### Scenario to Rubric Mapping

| Scenario | Rubrics Applied                                 |
| -------- | ----------------------------------------------- |
| GH\_01   | PR Quality, Code Quality, Workflow Correctness  |
| GH\_02   | PR Quality, CI/CD Quality, Workflow Correctness |
| GH\_03   | PR Quality, Code Quality, Workflow Correctness  |
| GH\_04   | PR Quality, Documentation, Workflow Correctness |
| GH\_05   | PR Quality, Multi-Agent, Workflow Correctness   |
