# Agent Evaluation
Evaluate your Agents before deploying them to production: test accuracy, measure performance, and verify tool usage to catch regressions and keep quality high.

## Evaluate Agent Accuracy
Test whether your Agent gives correct answers:

```typescript
import { Agent, accuracyEval } from 'praisonai';

const agent = new Agent({
  name: 'Math Tutor',
  instructions: 'You are a math tutor. Answer math questions accurately.'
});

// Test the Agent
const response = await agent.chat('What is 2+2?');

// Evaluate accuracy against the expected answer
const result = await accuracyEval({
  input: 'What is 2+2?',
  expectedOutput: '4',
  actualOutput: response,
  threshold: 0.8
});

console.log('Passed:', result.passed);
console.log('Score:', result.score);

// Assert in tests
if (!result.passed) {
  throw new Error(`Agent accuracy test failed: ${result.score}`);
}
```
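The same check slots into a standard test runner so accuracy failures surface in CI like any other test. A minimal sketch, assuming you use Vitest (any runner with async tests works the same way):

```typescript
import { describe, it, expect } from 'vitest';
import { Agent, accuracyEval } from 'praisonai';

describe('Math Tutor', () => {
  it('answers basic arithmetic', async () => {
    const agent = new Agent({
      name: 'Math Tutor',
      instructions: 'You are a math tutor. Answer math questions accurately.'
    });
    const response = await agent.chat('What is 2+2?');
    const result = await accuracyEval({
      input: 'What is 2+2?',
      expectedOutput: '4',
      actualOutput: response,
      threshold: 0.8
    });
    // A failing expectation reports the score in the test output
    expect(result.passed, `score: ${result.score}`).toBe(true);
  });
});
```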
## Benchmark Agent Performance
Measure Agent response times:

```typescript
import { Agent, performanceEval } from 'praisonai';

const agent = new Agent({
  name: 'Fast Agent',
  instructions: 'You respond quickly and concisely.'
});

// Benchmark the Agent, discarding two warmup runs
const result = await performanceEval({
  func: async () => await agent.chat('Hello'),
  iterations: 10,
  warmupRuns: 2
});

console.log('Agent Performance:');
console.log(`  Avg: ${result.avgTime}ms`);
console.log(`  Min: ${result.minTime}ms`);
console.log(`  Max: ${result.maxTime}ms`);
console.log(`  P95: ${result.p95Time}ms`);

// Assert performance requirements
if (result.avgTime > 2000) {
  console.warn('Agent is slower than expected');
}
```
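Averages can hide tail latency, so it is often worth gating on P95 as well. A sketch that builds on the `result` above and fails the process instead of only warning; the budget values are illustrative, not recommendations:

```typescript
// Illustrative latency budgets -- tune these to your own requirements
const AVG_BUDGET_MS = 2000;
const P95_BUDGET_MS = 5000;

if (result.avgTime > AVG_BUDGET_MS || result.p95Time > P95_BUDGET_MS) {
  console.error(
    `Latency budget exceeded: avg ${result.avgTime}ms (limit ${AVG_BUDGET_MS}ms), ` +
      `p95 ${result.p95Time}ms (limit ${P95_BUDGET_MS}ms)`
  );
  process.exit(1); // fail the CI job rather than just warning
}
```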
## Verify Agent Tool Usage
Ensure the Agent calls the right tools:

```typescript
import { Agent, createTool, reliabilityEval } from 'praisonai';

const calculatorTool = createTool({
  name: 'calculator',
  description: 'Perform calculations',
  parameters: { type: 'object', properties: { expression: { type: 'string' } } },
  // eval() is for demo purposes only -- never evaluate untrusted input in production
  execute: async ({ expression }) => eval(expression).toString()
});

const agent = new Agent({
  name: 'Calculator Agent',
  instructions: 'Use the calculator tool for math.',
  tools: [calculatorTool]
});

// Track tool calls
const toolCalls: string[] = [];
agent.onToolCall((name) => toolCalls.push(name));

await agent.chat('What is 15 * 7?');

// Verify the correct tools were called
const result = await reliabilityEval({
  expectedToolCalls: ['calculator'],
  actualToolCalls: toolCalls
});

console.log('Tool Usage:');
console.log(`  Passed: ${result.passed}`);
console.log(`  Missing: ${result.details?.missing}`);
console.log(`  Extra: ${result.details?.extra}`);
```
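When checking several prompts, a shared array lets calls from one chat leak into the next assertion. One way around this, assuming `onToolCall` registers an additional listener as in the example above, is a small hypothetical helper that snapshots the calls for a single chat:

```typescript
// Hypothetical helper: capture only the tool calls triggered by one prompt.
async function toolCallsFor(agent: Agent, prompt: string): Promise<string[]> {
  const calls: string[] = [];
  agent.onToolCall((name) => calls.push(name));
  await agent.chat(prompt);
  // Return a copy: the listener stays registered and would otherwise
  // keep mutating this array during later chats.
  return [...calls];
}

const calls = await toolCallsFor(agent, 'What is 15 * 7?');
const check = await reliabilityEval({
  expectedToolCalls: ['calculator'],
  actualToolCalls: calls
});
console.log(`Passed: ${check.passed}`);
```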
## Agent Test Suite
Run comprehensive Agent evaluations:

```typescript
import { Agent, EvalSuite } from 'praisonai';

const agent = new Agent({
  name: 'Support Agent',
  instructions: 'You help customers with their questions.'
});

const suite = new EvalSuite();

// Test accuracy across multiple scenarios
const testCases = [
  { input: 'What are your hours?', expected: '9am to 5pm' },
  { input: 'How do I return an item?', expected: 'return policy' },
  { input: 'What is your phone number?', expected: '555-1234' }
];

for (const test of testCases) {
  const response = await agent.chat(test.input);
  await suite.runAccuracy(`accuracy-${test.input}`, {
    input: test.input,
    expectedOutput: test.expected,
    actualOutput: response,
    threshold: 0.7
  });
}

// Test performance
await suite.runPerformance('response-time', {
  func: async () => await agent.chat('Hello'),
  iterations: 5
});

// Print results
suite.printSummary();
const summary = suite.getSummary();
console.log(`\nResults: ${summary.passed}/${summary.total} passed`);
console.log(`Average Score: ${summary.avgScore.toFixed(2)}`);
```
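Beyond per-case pass/fail, you can gate on the aggregate score so a broad quality dip fails the run even when every case individually clears its threshold. A short sketch continuing from the `summary` above (the 0.75 bar is illustrative):

```typescript
// Illustrative aggregate bar -- pick a value that matches your baseline
const MIN_AVG_SCORE = 0.75;

if (summary.avgScore < MIN_AVG_SCORE) {
  console.error(`Average score ${summary.avgScore.toFixed(2)} is below ${MIN_AVG_SCORE}`);
  process.exit(1);
}
```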
## Multi-Agent Evaluation
Evaluate Agent teams:

```typescript
import { Agent, PraisonAIAgents, performanceEval } from 'praisonai';

const researcher = new Agent({
  name: 'Researcher',
  instructions: 'Research topics thoroughly.'
});

const writer = new Agent({
  name: 'Writer',
  instructions: 'Write clear summaries.'
});

const agents = new PraisonAIAgents({
  agents: [researcher, writer],
  tasks: [
    { agent: researcher, description: 'Research: {topic}' },
    { agent: writer, description: 'Summarize the research' }
  ]
});

// Benchmark the entire pipeline
const result = await performanceEval({
  func: async () => await agents.start({ topic: 'AI trends' }),
  iterations: 3,
  warmupRuns: 1
});

console.log('Multi-Agent Pipeline Performance:');
console.log(`  Avg: ${result.avgTime}ms`);
console.log(`  P95: ${result.p95Time}ms`);
```
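`performanceEval` measures only latency; to also score what the pipeline produces, you can run it once and feed the output to `accuracyEval`. A sketch continuing the example above, under the assumption that `agents.start()` resolves to the final task's text output:

```typescript
import { accuracyEval } from 'praisonai'; // add to the imports above

// Assumption: start() resolves to the final task's output as text
const output = await agents.start({ topic: 'AI trends' });

const quality = await accuracyEval({
  input: 'Summarize current AI trends',
  expectedOutput: 'key developments in artificial intelligence',
  actualOutput: String(output),
  threshold: 0.6
});
console.log(`Pipeline quality score: ${quality.score}`);
```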
## Agent Regression Testing
Catch quality regressions:

```typescript
import { readFileSync } from 'node:fs';
import { Agent, EvalSuite } from 'praisonai';

const agent = new Agent({
  name: 'Production Agent',
  instructions: 'You are a helpful assistant.'
});

async function runRegressionTests() {
  const suite = new EvalSuite();

  // Load test cases from file (readFileSync works in both ESM and CJS,
  // unlike require() alongside import statements)
  const testCases = JSON.parse(readFileSync('./agent-test-cases.json', 'utf8'));

  for (const test of testCases) {
    const response = await agent.chat(test.input);
    await suite.runAccuracy(test.name, {
      input: test.input,
      expectedOutput: test.expected,
      actualOutput: response,
      // ?? keeps an explicit threshold of 0 from being replaced by the default
      threshold: test.threshold ?? 0.8
    });
  }

  const summary = suite.getSummary();

  // Fail CI if tests don't pass
  if (summary.failed > 0) {
    console.error(`❌ ${summary.failed} tests failed`);
    process.exit(1);
  }
  console.log(`✅ All ${summary.total} tests passed`);
}

runRegressionTests();
```
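The loader above expects each case to carry a `name`, an `input`, an `expected` answer, and an optional `threshold`. A minimal `agent-test-cases.json` might look like this (the contents are illustrative):

```json
[
  { "name": "greeting", "input": "Hello", "expected": "Hi", "threshold": 0.6 },
  { "name": "capabilities", "input": "What can you do?", "expected": "helpful assistant" }
]
```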
## A/B Testing Agents
Compare Agent versions:

```typescript
import { Agent, accuracyEval, performanceEval } from 'praisonai';

const agentV1 = new Agent({
  name: 'Agent v1',
  instructions: 'You are a helpful assistant.',
  model: 'gpt-3.5-turbo'
});

const agentV2 = new Agent({
  name: 'Agent v2',
  instructions: 'You are a helpful assistant. Be concise.',
  model: 'gpt-4o-mini'
});

async function compareAgents(query: string, expected: string) {
  // Query both versions in parallel
  const [r1, r2] = await Promise.all([
    agentV1.chat(query),
    agentV2.chat(query)
  ]);

  // Score both responses against the same expected answer
  const [acc1, acc2] = await Promise.all([
    accuracyEval({ input: query, expectedOutput: expected, actualOutput: r1 }),
    accuracyEval({ input: query, expectedOutput: expected, actualOutput: r2 })
  ]);

  console.log('Comparison:');
  console.log(`  v1 Score: ${acc1.score.toFixed(2)}`);
  console.log(`  v2 Score: ${acc2.score.toFixed(2)}`);
  // Report ties explicitly instead of defaulting to v2
  console.log(`  Winner: ${acc1.score === acc2.score ? 'tie' : acc1.score > acc2.score ? 'v1' : 'v2'}`);
}

await compareAgents('What is TypeScript?', 'typed JavaScript');
```
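Accuracy is only one axis; the imported `performanceEval` lets you compare latency as well. A sketch that benchmarks the two versions sequentially so the runs don't contend with each other:

```typescript
const perf1 = await performanceEval({
  func: async () => await agentV1.chat('What is TypeScript?'),
  iterations: 5
});
const perf2 = await performanceEval({
  func: async () => await agentV2.chat('What is TypeScript?'),
  iterations: 5
});

console.log(`v1 avg: ${perf1.avgTime}ms, v2 avg: ${perf2.avgTime}ms`);
```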
## Evaluation Types
| Type | Purpose |
|---|---|
| Accuracy | Test if Agent gives correct answers |
| Performance | Measure response time and latency |
| Reliability | Verify correct tool usage |
## Related
- Guardrails - Validate Agent outputs
- Observability - Monitor Agent behavior
- Testing - Unit testing Agents

