# Evaluation Framework

> Test and benchmark your Agents for quality and performance

# Agent Evaluation

Evaluate your Agents before deploying to production. Test accuracy, measure performance, and verify tool usage. Catch regressions and ensure quality.

## Evaluate Agent Accuracy

Test if your Agent gives correct answers:

```typescript
import { Agent, accuracyEval } from 'praisonai';

const agent = new Agent({
  name: 'Math Tutor',
  instructions: 'You are a math tutor. Answer math questions accurately.'
});

// Test the Agent
const response = await agent.chat('What is 2+2?');

// Evaluate accuracy
const result = await accuracyEval({
  input: 'What is 2+2?',
  expectedOutput: '4',
  actualOutput: response,
  threshold: 0.8
});

console.log('Passed:', result.passed);
console.log('Score:', result.score);

// Assert in tests
if (!result.passed) {
  throw new Error(`Agent accuracy test failed: ${result.score}`);
}
```
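The page does not say how `accuracyEval` derives its score, so the following is only a stand-in to illustrate how a score between 0 and 1 and a `threshold` might interact. The helpers `tokenize`, `overlapScore`, and `judge` are hypothetical and are not part of `praisonai`; the rule that a score equal to the threshold passes is also an assumption.

```typescript
// Hypothetical stand-in scorer: NOT how accuracyEval computes its score.

// Lowercase the text and split it into alphanumeric tokens.
function tokenize(text: string): string[] {
  return text.toLowerCase().split(/[^a-z0-9]+/).filter((t) => t.length > 0);
}

// Fraction of expected tokens that also appear in the actual output (0..1).
function overlapScore(expected: string, actual: string): number {
  const expectedTokens = tokenize(expected);
  if (expectedTokens.length === 0) return 0;
  const actualTokens = tokenize(actual);
  const hits = expectedTokens.filter((t) => actualTokens.indexOf(t) !== -1).length;
  return hits / expectedTokens.length;
}

// Assumption: a result passes when its score meets or exceeds the threshold.
function judge(expected: string, actual: string, threshold: number) {
  const score = overlapScore(expected, actual);
  return { score, passed: score >= threshold };
}

console.log(judge('4', 'The answer is 4.', 0.8));
```

A stricter threshold rejects partially matching answers, which is why looser values tend to suit free-form responses better than exact answers such as `'4'`.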

## Benchmark Agent Performance

Measure Agent response times:

```typescript
import { Agent, performanceEval } from 'praisonai';

const agent = new Agent({
  name: 'Fast Agent',
  instructions: 'You respond quickly and concisely.'
});

// Benchmark the Agent
const result = await performanceEval({
  func: async () => await agent.chat('Hello'),
  iterations: 10,
  warmupRuns: 2
});

console.log('Agent Performance:');
console.log(`  Avg: ${result.avgTime}ms`);
console.log(`  Min: ${result.minTime}ms`);
console.log(`  Max: ${result.maxTime}ms`);
console.log(`  P95: ${result.p95Time}ms`);

// Warn when the average response time exceeds a 2-second budget
if (result.avgTime > 2000) {
  console.warn('Agent is slower than expected');
}
```
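To make the reported figures concrete, here is a minimal sketch of how average, minimum, maximum, and P95 values could be derived from raw timings once warm-up runs are discarded. `summarize` and `percentile` are hypothetical helpers, and the nearest-rank percentile method is an assumption; the library may compute P95 differently.

```typescript
// Hypothetical helpers, not part of praisonai.
interface LatencyStats {
  avgTime: number;
  minTime: number;
  maxTime: number;
  p95Time: number;
}

// Nearest-rank percentile over an ascending-sorted, non-empty array.
function percentile(sorted: number[], p: number): number {
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// Drop the first `warmupRuns` samples, then summarize the remainder.
function summarize(samplesMs: number[], warmupRuns = 0): LatencyStats {
  const measured = samplesMs.slice(warmupRuns);
  if (measured.length === 0) {
    throw new Error('no samples left after discarding warm-up runs');
  }
  const sorted = measured.slice().sort((a, b) => a - b);
  const total = sorted.reduce((sum, t) => sum + t, 0);
  return {
    avgTime: total / sorted.length,
    minTime: sorted[0],
    maxTime: sorted[sorted.length - 1],
    p95Time: percentile(sorted, 95),
  };
}

// The two slow warm-up samples (900, 850) are excluded from the statistics.
console.log(summarize([900, 850, 100, 200, 300, 400], 2));
```

With only a handful of iterations, P95 sits at or near the maximum, so a larger `iterations` value gives a more meaningful tail measurement.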

## Verify Agent Tool Usage

Ensure the Agent calls the right tools:

```typescript
import { Agent, createTool, reliabilityEval } from 'praisonai';

const calculatorTool = createTool({
  name: 'calculator',
  description: 'Perform calculations',
  parameters: { type: 'object', properties: { expression: { type: 'string' } } },
  // eval() is used for brevity only; never evaluate untrusted input this way in production
  execute: async ({ expression }) => eval(expression).toString()
});

const agent = new Agent({
  name: 'Calculator Agent',
  instructions: 'Use the calculator tool for math.',
  tools: [calculatorTool]
});

// Track tool calls
const toolCalls: string[] = [];
agent.onToolCall((name) => toolCalls.push(name));

await agent.chat('What is 15 * 7?');

// Verify correct tools were called
const result = await reliabilityEval({
  expectedToolCalls: ['calculator'],
  actualToolCalls: toolCalls
});

console.log('Tool Usage:');
console.log(`  Passed: ${result.passed}`);
console.log(`  Missing: ${result.details?.missing}`);
console.log(`  Extra: ${result.details?.extra}`);
```
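The `missing` and `extra` fields above suggest a comparison between the expected and observed tool names. A minimal sketch of such a comparison follows; `diffToolCalls` is a hypothetical helper, and treating only missing tools as a failure is an assumption about policy, not a statement of how `reliabilityEval` behaves.

```typescript
// Hypothetical helper, not part of praisonai.
interface ToolCallDiff {
  passed: boolean;
  missing: string[]; // expected tools that were never called
  extra: string[];   // called tools that were not expected
}

// Remove duplicate names while keeping first-seen order.
function unique(names: string[]): string[] {
  return names.filter((name, index) => names.indexOf(name) === index);
}

function diffToolCalls(expected: string[], actual: string[]): ToolCallDiff {
  const expectedNames = unique(expected);
  const actualNames = unique(actual);
  const missing = expectedNames.filter((name) => actualNames.indexOf(name) === -1);
  const extra = actualNames.filter((name) => expectedNames.indexOf(name) === -1);
  // Assumption: extra calls are reported but do not fail the check.
  return { passed: missing.length === 0, missing, extra };
}

console.log(diffToolCalls(['calculator'], ['calculator', 'search']));
```

If unexpected tool calls should also fail the check, the `passed` condition can additionally require `extra.length === 0`.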

## Agent Test Suite

Run comprehensive Agent evaluations:

```typescript
import { Agent, EvalSuite } from 'praisonai';

const agent = new Agent({
  name: 'Support Agent',
  instructions: 'You help customers with their questions.'
});

const suite = new EvalSuite();

// Test accuracy across multiple scenarios
const testCases = [
  { input: 'What are your hours?', expected: '9am to 5pm' },
  { input: 'How do I return an item?', expected: 'return policy' },
  { input: 'What is your phone number?', expected: '555-1234' }
];

for (const test of testCases) {
  const response = await agent.chat(test.input);
  await suite.runAccuracy(`accuracy-${test.input}`, {
    input: test.input,
    expectedOutput: test.expected,
    actualOutput: response,
    threshold: 0.7
  });
}

// Test performance
await suite.runPerformance('response-time', {
  func: async () => await agent.chat('Hello'),
  iterations: 5
});

// Print results
suite.printSummary();

const summary = suite.getSummary();
console.log(`\nResults: ${summary.passed}/${summary.total} passed`);
console.log(`Average Score: ${summary.avgScore.toFixed(2)}`);
```

## Multi-Agent Evaluation

Evaluate Agent teams:

```typescript
import { Agent, AgentTeam, performanceEval } from 'praisonai';

const researcher = new Agent({
  name: 'Researcher',
  instructions: 'Research topics thoroughly.'
});

const writer = new Agent({
  name: 'Writer',
  instructions: 'Write clear summaries.'
});

const agents = new AgentTeam({
  agents: [researcher, writer],
  tasks: [
    { agent: researcher, description: 'Research: {topic}' },
    { agent: writer, description: 'Summarize the research' }
  ]
});

// Benchmark the entire pipeline
const result = await performanceEval({
  func: async () => await agents.start({ topic: 'AI trends' }),
  iterations: 3,
  warmupRuns: 1
});

console.log('Multi-Agent Pipeline Performance:');
console.log(`  Avg: ${result.avgTime}ms`);
console.log(`  P95: ${result.p95Time}ms`);
```

## Agent Regression Testing

Catch quality regressions:

```typescript
import { Agent, EvalSuite } from 'praisonai';

const agent = new Agent({
  name: 'Production Agent',
  instructions: 'You are a helpful assistant.'
});

async function runRegressionTests() {
  const suite = new EvalSuite();
  
  // Load test cases from file
  const testCases = require('./agent-test-cases.json');
  
  for (const test of testCases) {
    const response = await agent.chat(test.input);
    await suite.runAccuracy(test.name, {
      input: test.input,
      expectedOutput: test.expected,
      actualOutput: response,
      threshold: test.threshold || 0.8
    });
  }
  
  const summary = suite.getSummary();
  
  // Fail CI if tests don't pass
  if (summary.failed > 0) {
    console.error(`❌ ${summary.failed} tests failed`);
    process.exit(1);
  }
  
  console.log(`✅ All ${summary.total} tests passed`);
}

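// Fail CI on unexpected errors as well (e.g. a missing test file or network failure)
runRegressionTests().catch((error) => {
  console.error(error);
  process.exit(1);
});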
```
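The regression script reads `name`, `input`, `expected`, and an optional `threshold` from each entry in `agent-test-cases.json`. A minimal sketch of validating that shape before running any evaluations is shown below. `AgentTestCase` and `parseTestCases` are hypothetical names, and the requirement that `threshold` lies between 0 and 1 is an assumption based on the values used on this page.

```typescript
// Hypothetical types and helpers, not part of praisonai.
interface AgentTestCase {
  name: string;
  input: string;
  expected: string;
  threshold?: number;
}

// Read a required, non-empty string field or fail with a descriptive error.
function requireString(record: Record<string, unknown>, key: string, index: number): string {
  const value = record[key];
  if (typeof value !== 'string' || value.length === 0) {
    throw new Error(`test case ${index}: "${key}" must be a non-empty string`);
  }
  return value;
}

function parseTestCases(raw: unknown): AgentTestCase[] {
  if (!Array.isArray(raw)) {
    throw new Error('test cases file must contain a JSON array');
  }
  return raw.map((item, index) => {
    if (typeof item !== 'object' || item === null) {
      throw new Error(`test case ${index}: must be an object`);
    }
    const record = item as Record<string, unknown>;
    const testCase: AgentTestCase = {
      name: requireString(record, 'name', index),
      input: requireString(record, 'input', index),
      expected: requireString(record, 'expected', index),
    };
    const rawThreshold = record['threshold'];
    if (rawThreshold !== undefined) {
      // Assumption: thresholds are fractions between 0 and 1.
      if (typeof rawThreshold !== 'number' || rawThreshold < 0 || rawThreshold > 1) {
        throw new Error(`test case ${index}: "threshold" must be a number between 0 and 1`);
      }
      testCase.threshold = rawThreshold;
    }
    return testCase;
  });
}

const sample = '[{"name":"hours","input":"What are your hours?","expected":"9am to 5pm","threshold":0.7}]';
console.log(parseTestCases(JSON.parse(sample)));
```

Validating the file up front means a malformed entry fails the run immediately with a clear message, rather than surfacing later as a confusing evaluation failure.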

## A/B Testing Agents

Compare Agent versions:

```typescript
import { Agent, accuracyEval } from 'praisonai';

const agentV1 = new Agent({
  name: 'Agent v1',
  instructions: 'You are a helpful assistant.',
  model: 'gpt-3.5-turbo'
});

const agentV2 = new Agent({
  name: 'Agent v2',
  instructions: 'You are a helpful assistant. Be concise.',
  model: 'gpt-4o-mini'
});

async function compareAgents(query: string, expected: string) {
  const [r1, r2] = await Promise.all([
    agentV1.chat(query),
    agentV2.chat(query)
  ]);
  
  const [acc1, acc2] = await Promise.all([
    accuracyEval({ input: query, expectedOutput: expected, actualOutput: r1 }),
    accuracyEval({ input: query, expectedOutput: expected, actualOutput: r2 })
  ]);
  
  // Report a tie explicitly instead of defaulting to v2
  const winner = acc1.score === acc2.score ? 'tie' : acc1.score > acc2.score ? 'v1' : 'v2';
  
  console.log('Comparison:');
  console.log(`  v1 Score: ${acc1.score.toFixed(2)}`);
  console.log(`  v2 Score: ${acc2.score.toFixed(2)}`);
  console.log(`  Winner: ${winner}`);
}

await compareAgents('What is TypeScript?', 'typed JavaScript');
```
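A single query is a thin basis for choosing between versions. As a sketch under stated assumptions, the scores collected from several queries could be averaged per version before declaring a winner. `compareScores` and `mean` are hypothetical helpers, not part of `praisonai`, and the score arrays are assumed to hold one accuracy score per query in the same order for both versions.

```typescript
// Hypothetical helpers, not part of praisonai.
interface VersionComparison {
  v1Mean: number;
  v2Mean: number;
  winner: 'v1' | 'v2' | 'tie';
}

function mean(values: number[]): number {
  if (values.length === 0) throw new Error('cannot average an empty list');
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Compare two versions scored on the same queries, in the same order.
function compareScores(v1Scores: number[], v2Scores: number[]): VersionComparison {
  if (v1Scores.length !== v2Scores.length) {
    throw new Error('both versions must be scored on the same queries');
  }
  const v1Mean = mean(v1Scores);
  const v2Mean = mean(v2Scores);
  const winner = v1Mean > v2Mean ? 'v1' : v2Mean > v1Mean ? 'v2' : 'tie';
  return { v1Mean, v2Mean, winner };
}

console.log(compareScores([0.5, 1], [0.25, 0.75]));
```

Averaging hides per-query differences, so it can be worth logging individual scores alongside the means when the two versions are close.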

## Evaluation Types

| Type            | Purpose                             |
| --------------- | ----------------------------------- |
| **Accuracy**    | Test if Agent gives correct answers |
| **Performance** | Measure response time and latency   |
| **Reliability** | Verify correct tool usage           |

## Related

* [Guardrails](/docs/js/guardrails) - Validate Agent outputs
* [Observability](/docs/js/observability) - Monitor Agent behavior
* [Testing](/docs/js/testing) - Unit testing Agents
