Agent Evaluation

Evaluate your Agents before deploying them to production: test accuracy, measure performance, verify tool usage, and catch regressions to keep quality high.

Evaluate Agent Accuracy

Test if your Agent gives correct answers:
import { Agent, accuracyEval } from 'praisonai';

const agent = new Agent({
  name: 'Math Tutor',
  instructions: 'You are a math tutor. Answer math questions accurately.'
});

// Test the Agent
const response = await agent.chat('What is 2+2?');

// Evaluate accuracy
const result = await accuracyEval({
  input: 'What is 2+2?',
  expectedOutput: '4',
  actualOutput: response,
  threshold: 0.8
});

console.log('Passed:', result.passed);
console.log('Score:', result.score);

// Assert in tests
if (!result.passed) {
  throw new Error(`Agent accuracy test failed: ${result.score}`);
}
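
The same check drops naturally into a test runner so a low score fails CI automatically. A minimal sketch using Vitest (any runner works; the evaluation call is identical to the one above):

import { describe, it, expect } from 'vitest';
import { Agent, accuracyEval } from 'praisonai';

describe('Math Tutor accuracy', () => {
  it('answers basic arithmetic', async () => {
    const agent = new Agent({
      name: 'Math Tutor',
      instructions: 'You are a math tutor. Answer math questions accurately.'
    });

    const response = await agent.chat('What is 2+2?');
    const result = await accuracyEval({
      input: 'What is 2+2?',
      expectedOutput: '4',
      actualOutput: response,
      threshold: 0.8
    });

    // Fails the test (and CI) when the score falls below the threshold
    expect(result.passed).toBe(true);
  });
});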

Benchmark Agent Performance

Measure Agent response times:
import { Agent, performanceEval } from 'praisonai';

const agent = new Agent({
  name: 'Fast Agent',
  instructions: 'You respond quickly and concisely.'
});

// Benchmark the Agent
const result = await performanceEval({
  func: async () => await agent.chat('Hello'),
  iterations: 10,
  warmupRuns: 2
});

console.log('Agent Performance:');
console.log(`  Avg: ${result.avgTime}ms`);
console.log(`  Min: ${result.minTime}ms`);
console.log(`  Max: ${result.maxTime}ms`);
console.log(`  P95: ${result.p95Time}ms`);

// Assert performance requirements
if (result.avgTime > 2000) {
  console.warn('Agent is slower than expected');
}
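
If you want the benchmark to fail a build instead of only warning, assert against an explicit latency budget, typically on the P95 rather than the average (a minimal sketch; the 2500 ms budget is an arbitrary example):

// Hard latency gate for CI (the budget value is illustrative)
const P95_BUDGET_MS = 2500;

if (result.p95Time > P95_BUDGET_MS) {
  throw new Error(`P95 latency ${result.p95Time}ms exceeds the ${P95_BUDGET_MS}ms budget`);
}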

Verify Agent Tool Usage

Ensure the Agent calls the right tools:
import { Agent, createTool, reliabilityEval } from 'praisonai';

const calculatorTool = createTool({
  name: 'calculator',
  description: 'Perform calculations',
  parameters: { type: 'object', properties: { expression: { type: 'string' } } },
  // eval() keeps the example short; avoid evaluating untrusted input in real tools
  execute: async ({ expression }) => eval(expression).toString()
});

const agent = new Agent({
  name: 'Calculator Agent',
  instructions: 'Use the calculator tool for math.',
  tools: [calculatorTool]
});

// Track tool calls
const toolCalls: string[] = [];
agent.onToolCall((name) => toolCalls.push(name));

await agent.chat('What is 15 * 7?');

// Verify correct tools were called
const result = await reliabilityEval({
  expectedToolCalls: ['calculator'],
  actualToolCalls: toolCalls
});

console.log('Tool Usage:');
console.log(`  Passed: ${result.passed}`);
console.log(`  Missing: ${result.details?.missing}`);
console.log(`  Extra: ${result.details?.extra}`);
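
As with accuracy, the reliability result can act as a hard gate in a test run (a short sketch building on the result above):

// Fail loudly when the Agent skipped an expected tool or called an unexpected one
if (!result.passed) {
  throw new Error(
    `Tool usage check failed. Missing: ${result.details?.missing}, Extra: ${result.details?.extra}`
  );
}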

Agent Test Suite

Run comprehensive Agent evaluations:
import { Agent, EvalSuite } from 'praisonai';

const agent = new Agent({
  name: 'Support Agent',
  instructions: 'You help customers with their questions.'
});

const suite = new EvalSuite();

// Test accuracy across multiple scenarios
const testCases = [
  { input: 'What are your hours?', expected: '9am to 5pm' },
  { input: 'How do I return an item?', expected: 'return policy' },
  { input: 'What is your phone number?', expected: '555-1234' }
];

for (const test of testCases) {
  const response = await agent.chat(test.input);
  await suite.runAccuracy(`accuracy-${test.input}`, {
    input: test.input,
    expectedOutput: test.expected,
    actualOutput: response,
    threshold: 0.7
  });
}

// Test performance
await suite.runPerformance('response-time', {
  func: async () => await agent.chat('Hello'),
  iterations: 5
});

// Print results
suite.printSummary();

const summary = suite.getSummary();
console.log(`\nResults: ${summary.passed}/${summary.total} passed`);
console.log(`Average Score: ${summary.avgScore.toFixed(2)}`);
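
If your CI pipeline collects artifacts, the same summary can be written out for later comparison (a minimal sketch; the eval-summary.json filename is just an example):

import { writeFileSync } from 'node:fs';

// Persist the suite summary so CI can archive it or diff it against a previous run
writeFileSync('eval-summary.json', JSON.stringify(summary, null, 2));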

Multi-Agent Evaluation

Evaluate Agent teams:
import { Agent, PraisonAIAgents, EvalSuite, performanceEval } from 'praisonai';

const researcher = new Agent({
  name: 'Researcher',
  instructions: 'Research topics thoroughly.'
});

const writer = new Agent({
  name: 'Writer',
  instructions: 'Write clear summaries.'
});

const agents = new PraisonAIAgents({
  agents: [researcher, writer],
  tasks: [
    { agent: researcher, description: 'Research: {topic}' },
    { agent: writer, description: 'Summarize the research' }
  ]
});

// Benchmark the entire pipeline
const result = await performanceEval({
  func: async () => await agents.start({ topic: 'AI trends' }),
  iterations: 3,
  warmupRuns: 1
});

console.log('Multi-Agent Pipeline Performance:');
console.log(`  Avg: ${result.avgTime}ms`);
console.log(`  P95: ${result.p95Time}ms`);
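
To see where the time goes, run the same benchmark against a single agent in isolation and compare it with the full pipeline (a minimal sketch; the prompt passed to researcher.chat is illustrative):

// Benchmark the researcher on its own to compare against the pipeline numbers above
const researcherOnly = await performanceEval({
  func: async () => await researcher.chat('Research: AI trends'),
  iterations: 3,
  warmupRuns: 1
});

console.log(`Researcher alone: Avg ${researcherOnly.avgTime}ms, P95 ${researcherOnly.p95Time}ms`);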

Agent Regression Testing

Catch quality regressions:
import { Agent, EvalSuite } from 'praisonai';
import { readFileSync } from 'node:fs';

const agent = new Agent({
  name: 'Production Agent',
  instructions: 'You are a helpful assistant.'
});

async function runRegressionTests() {
  const suite = new EvalSuite();
  
  // Load test cases from file (an array of { name, input, expected, threshold? })
  const testCases = JSON.parse(readFileSync('./agent-test-cases.json', 'utf-8'));
  
  for (const test of testCases) {
    const response = await agent.chat(test.input);
    await suite.runAccuracy(test.name, {
      input: test.input,
      expectedOutput: test.expected,
      actualOutput: response,
      threshold: test.threshold || 0.8
    });
  }
  
  const summary = suite.getSummary();
  
  // Fail CI if tests don't pass
  if (summary.failed > 0) {
    console.error(`❌ ${summary.failed} tests failed`);
    process.exit(1);
  }
  
  console.log(`✅ All ${summary.total} tests passed`);
}

await runRegressionTests();
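
The loader above expects agent-test-cases.json to contain an array of cases with the fields used in the loop; threshold is optional and defaults to 0.8. A minimal example file (the data itself is hypothetical):

[
  { "name": "store-hours", "input": "What are your hours?", "expected": "9am to 5pm", "threshold": 0.8 },
  { "name": "returns", "input": "How do I return an item?", "expected": "return policy" }
]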

A/B Testing Agents

Compare Agent versions:
import { Agent, accuracyEval, performanceEval } from 'praisonai';

const agentV1 = new Agent({
  name: 'Agent v1',
  instructions: 'You are a helpful assistant.',
  model: 'gpt-3.5-turbo'
});

const agentV2 = new Agent({
  name: 'Agent v2',
  instructions: 'You are a helpful assistant. Be concise.',
  model: 'gpt-4o-mini'
});

async function compareAgents(query: string, expected: string) {
  const [r1, r2] = await Promise.all([
    agentV1.chat(query),
    agentV2.chat(query)
  ]);
  
  const [acc1, acc2] = await Promise.all([
    accuracyEval({ input: query, expectedOutput: expected, actualOutput: r1 }),
    accuracyEval({ input: query, expectedOutput: expected, actualOutput: r2 })
  ]);
  
  console.log('Comparison:');
  console.log(`  v1 Score: ${acc1.score.toFixed(2)}`);
  console.log(`  v2 Score: ${acc2.score.toFixed(2)}`);
  console.log(`  Winner: ${acc1.score === acc2.score ? 'tie' : acc1.score > acc2.score ? 'v1' : 'v2'}`);
}

await compareAgents('What is TypeScript?', 'typed JavaScript');
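
A single prompt rarely decides a winner. One way to extend the comparison is to run both versions over a small test set and average the scores (a minimal sketch; the test set and aggregation are illustrative, not part of the library):

// Compare average accuracy across several prompts instead of a single query
const cases = [
  { query: 'What is TypeScript?', expected: 'typed JavaScript' },
  { query: 'What is Node.js?', expected: 'JavaScript runtime' }
];

let totalV1 = 0;
let totalV2 = 0;

for (const c of cases) {
  const [r1, r2] = await Promise.all([agentV1.chat(c.query), agentV2.chat(c.query)]);
  const [a1, a2] = await Promise.all([
    accuracyEval({ input: c.query, expectedOutput: c.expected, actualOutput: r1 }),
    accuracyEval({ input: c.query, expectedOutput: c.expected, actualOutput: r2 })
  ]);
  totalV1 += a1.score;
  totalV2 += a2.score;
}

console.log(`v1 average score: ${(totalV1 / cases.length).toFixed(2)}`);
console.log(`v2 average score: ${(totalV2 / cases.length).toFixed(2)}`);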

Evaluation Types

Type          Purpose
Accuracy      Test whether the Agent gives correct answers
Performance   Measure response time and latency
Reliability   Verify correct tool usage