Agent Evaluation
PraisonAI provides a comprehensive evaluation framework for testing and benchmarking AI agents. The evaluation system supports multiple evaluation types and has zero performance impact when not in use.

Evaluation Types

| Type | Description | Use Case |
|---|---|---|
| Accuracy | Compare output against expected output using LLM-as-judge | Verify correctness |
| Performance | Measure runtime and memory usage | Benchmark speed |
| Reliability | Verify expected tool calls are made | Test tool usage |
| Criteria | Evaluate against custom criteria | Quality assessment |
Installation
The evaluation framework is included in `praisonaiagents`:
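No separate install is required; the core package includes the evaluation tooling:

```bash
pip install praisonaiagents
```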
Python Usage
Accuracy Evaluation
Compare agent outputs against expected results using an LLM judge.
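A minimal sketch of the LLM-as-judge pattern, written against the core `Agent` class rather than the built-in evaluator API; the `judge_accuracy` helper and the 0-10 scale are illustrative choices, not part of the library:

```python
from praisonaiagents import Agent

def judge_accuracy(task: str, expected: str, judge_model: str = "gpt-4o-mini") -> float:
    """Run an agent on the task, then have a judge agent score the answer from 0 to 10."""
    agent = Agent(instructions="You are a helpful assistant.")
    output = agent.start(task)

    judge = Agent(
        instructions="You are a strict evaluator. Reply with a single number from 0 to 10.",
        llm=judge_model,  # assumes the core Agent class accepts an `llm` model name
    )
    verdict = judge.start(
        f"Expected answer:\n{expected}\n\nActual answer:\n{output}\n\n"
        "How accurate is the actual answer? Reply with only a number from 0 to 10."
    )
    return float(str(verdict).strip())

score = judge_accuracy("What is the capital of France?", "Paris")
print(f"Accuracy score: {score}/10")
```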
Performance Evaluation
Benchmark agent runtime and memory usage.
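An illustrative benchmark built on the standard library (`time.perf_counter` and `tracemalloc`); warmup runs and iteration counts mirror the CLI options documented below, and the `benchmark` helper is not part of the library:

```python
import time
import tracemalloc
from statistics import mean

from praisonaiagents import Agent

def benchmark(task: str, iterations: int = 5, warmup: int = 1) -> dict:
    """Time repeated agent runs and record peak Python memory use of the measured runs."""
    agent = Agent(instructions="You are a helpful assistant.")

    for _ in range(warmup):  # warmup runs avoid cold-start effects
        agent.start(task)

    timings = []
    tracemalloc.start()
    for _ in range(iterations):
        start = time.perf_counter()
        agent.start(task)
        timings.append(time.perf_counter() - start)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()

    return {
        "avg_seconds": mean(timings),
        "min_seconds": min(timings),
        "max_seconds": max(timings),
        "peak_memory_mb": peak / 1_000_000,
    }

print(benchmark("Summarize the benefits of unit testing.", iterations=3, warmup=1))
```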
Reliability Evaluation
Verify that agents call the expected tools.
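One way to check tool usage without relying on evaluator internals is to have each tool record its own name when called; the `tools=` parameter of the core `Agent` class is assumed here, and the bookkeeping is illustrative:

```python
from praisonaiagents import Agent

called_tools = []

def get_weather(city: str) -> str:
    """Toy tool that records its own name so the test can verify it was invoked."""
    called_tools.append("get_weather")
    return f"It is sunny in {city}."

def get_stock_price(symbol: str) -> str:
    called_tools.append("get_stock_price")
    return f"{symbol} is trading at $100."

agent = Agent(
    instructions="Answer questions using the available tools.",
    tools=[get_weather, get_stock_price],
)
agent.start("What is the weather in Paris?")

expected = {"get_weather"}
forbidden = {"get_stock_price"}
passed = expected.issubset(called_tools) and not forbidden.intersection(called_tools)
print(f"Tools called: {called_tools} -> {'PASS' if passed else 'FAIL'}")
```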
Criteria Evaluation
Evaluate outputs against custom criteria using LLM-as-judge.
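A sketch of criteria scoring with a judge agent; numeric scoring against a pass threshold mirrors the `--criteria`, `--scoring`, and `--threshold` CLI options, and the `judge_criteria` helper is illustrative rather than part of the library:

```python
from praisonaiagents import Agent

def judge_criteria(output: str, criteria: str, threshold: float = 7.0) -> dict:
    """Ask a judge agent to score the output against the criteria on a 0-10 scale."""
    judge = Agent(
        instructions="You are a strict evaluator. Reply with a single number from 0 to 10."
    )
    verdict = judge.start(
        f"Criteria: {criteria}\n\nOutput to evaluate:\n{output}\n\n"
        "Score the output against the criteria. Reply with only a number from 0 to 10."
    )
    score = float(str(verdict).strip())
    return {"score": score, "threshold": threshold, "passed": score >= threshold}

agent = Agent(instructions="You are a concise technical writer.")
answer = agent.start("Explain what a mutex is in two sentences.")
print(judge_criteria(answer, "Clear, correct, and at most two sentences", threshold=7.0))
```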
Failure Callbacks
Handle evaluation failures with callbacks.
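The callback itself is just a function that receives details about the failed check; the `run_eval`/`on_failure` wiring below is illustrative plumbing, not a documented hook:

```python
def on_failure(name: str, details: dict) -> None:
    """Called whenever an evaluation fails; log, alert, or fail the CI job here."""
    print(f"[EVAL FAILED] {name}: {details}")

def run_eval(name: str, passed: bool, details: dict, failure_callback=on_failure) -> bool:
    """Invoke the callback only when the check did not pass."""
    if not passed:
        failure_callback(name, details)
    return passed

# Example with a hard-coded criteria result that falls below its threshold.
result = {"score": 5.0, "threshold": 7.0, "passed": False}
run_eval("criteria: clarity", result["passed"], result)
```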
Evaluate Pre-generated Outputs
Evaluate outputs without running the agent.
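The same judge pattern works on text you already have, for example output captured from a log or another system, so no agent run is needed (prompt wording and scale are illustrative):

```python
from praisonaiagents import Agent

# Output produced earlier -- from a log, a previous run, or another system.
pregenerated_output = "The capital of France is Paris."
expected = "Paris"

judge = Agent(
    instructions="You are a strict evaluator. Reply with a single number from 0 to 10."
)
verdict = judge.start(
    f"Expected answer:\n{expected}\n\nActual answer:\n{pregenerated_output}\n\n"
    "How accurate is the actual answer? Reply with only a number from 0 to 10."
)
print(f"Score: {float(str(verdict).strip())}/10")
```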
Saving Results
Save evaluation results to files.
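A simple way to persist results is timestamped JSON; the file layout here is a matter of preference, and the CLI's `--output` option serves the same purpose:

```python
import json
from datetime import datetime, timezone
from pathlib import Path

def save_results(results: dict, path: str = "eval_results.json") -> None:
    """Append timestamped results to a JSON file so runs can be compared over time."""
    record = {"timestamp": datetime.now(timezone.utc).isoformat(), **results}
    file = Path(path)
    history = json.loads(file.read_text()) if file.exists() else []
    history.append(record)
    file.write_text(json.dumps(history, indent=2))

save_results({"type": "accuracy", "score": 9.0, "iterations": 3})
```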
CLI Usage
Accuracy Evaluation
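The flags below come from the options tables later on this page; the `eval accuracy` subcommand spelling is an assumption, so confirm the exact name with `praisonai --help`:

```bash
# Assumed invocation; flags match the accuracy options table below.
praisonai eval accuracy --agent agents.yaml \
  --input "What is the capital of France?" \
  --expected "Paris" \
  --iterations 3 --model gpt-4o-mini \
  --output accuracy_results.json
```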
Performance Evaluation
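A benchmark run with warmup and memory tracking; again, only the flags are documented and the subcommand name is assumed:

```bash
# Assumed invocation; flags match the performance options table below.
praisonai eval performance --agent agents.yaml \
  --input "Summarize this repository." \
  --iterations 5 --warmup 2 --memory \
  --output perf_results.json
```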
Reliability Evaluation
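Expected and forbidden tools are passed as comma-separated lists (subcommand name assumed):

```bash
# Assumed invocation; flags match the reliability options table below.
praisonai eval reliability --agent agents.yaml \
  --input "What is the weather in Paris?" \
  --expected-tools get_weather \
  --forbidden-tools get_stock_price,send_email
```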
Criteria Evaluation
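Numeric scoring against a pass threshold (subcommand name assumed):

```bash
# Assumed invocation; flags match the criteria options table below.
praisonai eval criteria --agent agents.yaml \
  --input "Explain what a mutex is." \
  --criteria "Clear, correct, and concise" \
  --scoring numeric --threshold 7 \
  --iterations 3 --model gpt-4o-mini
```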
Batch Evaluation
Run multiple test cases from a JSON file.
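Both the layout of the test-case file and the way it is passed to the command are assumptions here, shown only to convey the idea of a batch run:

```bash
# test_cases.json (illustrative layout):
# [
#   {"input": "What is 2+2?", "expected": "4"},
#   {"input": "What is the capital of France?", "expected": "Paris"}
# ]
praisonai eval batch --agent agents.yaml test_cases.json \
  --output batch_results.json
```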
CLI Options Reference
Common Options
| Option | Short | Description |
|---|---|---|
--agent | -a | Path to agents.yaml file |
--output | -o | Output file for results |
--verbose | -v | Enable verbose output |
--quiet | -q | Suppress JSON output |
Accuracy Options
| Option | Short | Description |
|---|---|---|
--input | -i | Input text for the agent |
--expected | -e | Expected output |
--iterations | -n | Number of iterations |
--model | -m | LLM model for judging |
Performance Options
| Option | Short | Description |
|---|---|---|
--input | -i | Input text for the agent |
--iterations | -n | Number of benchmark iterations |
--warmup | -w | Number of warmup runs |
--memory | | Track memory usage |
Reliability Options
| Option | Short | Description |
|---|---|---|
--input | -i | Input text for the agent |
--expected-tools | -t | Expected tools (comma-separated) |
--forbidden-tools | -f | Forbidden tools (comma-separated) |
Criteria Options
| Option | Short | Description |
|---|---|---|
--input | -i | Input text for the agent |
--criteria | -c | Evaluation criteria |
--scoring | -s | Scoring type (numeric/binary) |
--threshold | | Pass threshold for numeric scoring |
--iterations | -n | Number of iterations |
--model | -m | LLM model for judging |
Result Data Structures
Each evaluation type returns a dedicated result object: AccuracyResult, PerformanceResult, ReliabilityResult, and CriteriaResult.
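The exact field names are not reproduced here; the dataclasses below are an illustrative guess at the kind of information each result carries, derived from the options above rather than from the library's source:

```python
from dataclasses import dataclass, field
from typing import Optional

# Illustrative shapes only -- actual field names in praisonaiagents may differ.

@dataclass
class AccuracyResult:
    avg_score: float                      # mean judge score across iterations
    scores: list = field(default_factory=list)
    iterations: int = 1

@dataclass
class PerformanceResult:
    avg_seconds: float
    min_seconds: float
    max_seconds: float
    peak_memory_mb: Optional[float] = None  # populated only when memory tracking is on

@dataclass
class ReliabilityResult:
    expected_tools: list
    called_tools: list
    missing_tools: list
    passed: bool

@dataclass
class CriteriaResult:
    score: float
    threshold: Optional[float]
    scoring: str                          # "numeric" or "binary"
    passed: bool
```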
Best Practices
- Use Multiple Iterations: Run evaluations multiple times for statistical significance
- Warmup Runs: Use warmup runs for performance benchmarks to avoid cold-start effects
- Save Results: Always save results for tracking and comparison
- Custom Criteria: Write specific, measurable criteria for criteria evaluations
- Batch Testing: Use batch evaluation for regression testing
- CI/CD Integration: Integrate evaluations into your CI/CD pipeline

