Judge provides a simple, unified API for evaluating agent outputs using LLM-as-judge. It supports accuracy evaluation, criteria-based evaluation, and custom judges.

How It Works

Judge sends the output you want to evaluate, together with either an expected answer or evaluation criteria, to an LLM. The LLM returns a 1-10 score with reasoning, and the result counts as passed when the score meets the configured threshold.

Quick Start

from praisonaiagents.eval import Judge

result = Judge().run(output="4", expected="4")
print(f"Score: {result.score}/10")
print(f"Passed: {result.passed}")

Judge Types

AccuracyJudge

Compares output against expected output
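
Accuracy mode is what the Quick Start above uses: pass both output and expected and the judge scores how closely they match. A slightly longer sketch (the answer strings here are illustrative):

from praisonaiagents.eval import Judge

result = Judge().run(
    output="The capital of France is Paris.",
    expected="Paris",
)
print(result.score, result.passed)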

CriteriaJudge

Evaluates against custom criteria
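
Criteria mode is selected by constructing Judge with a criteria string, as the async example later on this page does. A minimal sketch (the criteria and output text are illustrative):

from praisonaiagents.eval import Judge

judge = Judge(criteria="Response politely greets the user and offers help")
result = judge.run(output="Hello! How can I help you today?")
print(result.score, result.reasoning)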

Configuration

model (string, default: "gpt-4o-mini")
LLM model to use for judging

temperature (float, default: 0.1)
Temperature for LLM calls (lower = more consistent)

threshold (float, default: 7.0)
Score threshold for passing (1-10 scale)

criteria (string)
Custom criteria for evaluation

from praisonaiagents.eval import Judge, JudgeConfig

config = JudgeConfig(
    model="gpt-4o",
    temperature=0.1,
    threshold=8.0,
    criteria="Response is accurate and well-formatted"
)

judge = Judge(config=config)
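
The configured judge is then run like any other; a brief sketch (the output string is illustrative):

# The stricter 8.0 threshold now determines result.passed
result = judge.run(output="2 + 2 = 4, formatted as requested.")
print(result.passed)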

Custom Judges

1. Create Custom Judge

from praisonaiagents.eval import Judge

class RecipeJudge(Judge):
    """Judge for evaluating recipe quality."""
    
    # Prompt template; {output} is filled with the text being judged
    CRITERIA_PROMPT = """Evaluate this recipe:

CRITERIA: Recipe is complete with ingredients and steps

RECIPE:
{output}

Score 1-10 based on completeness and clarity.

SCORE: [1-10]
REASONING: [explanation]
"""

2. Register Judge

from praisonaiagents.eval import add_judge

add_judge("recipe", RecipeJudge)

3. Use Judge

from praisonaiagents.eval import get_judge

RecipeJudge = get_judge("recipe")
judge = RecipeJudge()
result = judge.run(output=recipe_text)  # recipe_text: the recipe string to evaluate

Registry Functions

Function                 Description
add_judge(name, class)   Register a custom judge
get_judge(name)          Get a judge class by name
list_judges()            List all registered judges
remove_judge(name)       Remove a registered judge
from praisonaiagents.eval import add_judge, get_judge, list_judges

# List available judges
print(list_judges())  # ['accuracy', 'criteria']

# Get a judge
AccuracyJudge = get_judge("accuracy")
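
remove_judge undoes a registration; a brief sketch, assuming the "recipe" judge from the previous section is still registered:

from praisonaiagents.eval import list_judges, remove_judge

print(list_judges())    # e.g. ['accuracy', 'criteria', 'recipe']

remove_judge("recipe")  # unregister the custom judge
print(list_judges())    # back to ['accuracy', 'criteria']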

JudgeResult

score (float)
Quality score from 1-10

passed (bool)
Whether score >= threshold

reasoning (string)
Explanation for the score

suggestions (list)
Improvement suggestions

result = judge.run(output="Hello!")

print(result.score)        # 8.5
print(result.passed)       # True
print(result.reasoning)    # "Good greeting..."
print(result.suggestions)  # ["Could add more context"]
print(result.to_dict())    # Full dictionary

CLI Usage

praisonai eval judge --output "4" --expected "4"

Async Support

import asyncio
from praisonaiagents.eval import Judge

async def evaluate():
    judge = Judge(criteria="Response is helpful")
    result = await judge.run_async(output="Hello!")
    return result

result = asyncio.run(evaluate())
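
Because run_async is a coroutine, several outputs can be scored concurrently with standard asyncio tooling. A sketch, assuming a single Judge instance can be shared across concurrent calls (the outputs list is illustrative):

import asyncio
from praisonaiagents.eval import Judge

async def evaluate_all(outputs):
    judge = Judge(criteria="Response is helpful")
    # Run one evaluation per output and wait for all of them
    return await asyncio.gather(*(judge.run_async(output=o) for o in outputs))

results = asyncio.run(evaluate_all(["Hello!", "Go away."]))
for r in results:
    print(r.score, r.passed)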

Best Practices

  • Use accuracy mode when you have a known expected output
  • Use criteria mode for subjective quality evaluation
  • Create custom judges for domain-specific evaluation
  • The default threshold is 7.0 (70% on the 1-10 scale); increase it for critical evaluations and lower it for exploratory testing
  • Be specific in your criteria: include measurable aspects and avoid vague terms like “good” or “nice” (see the sketch after this list)
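
For example, a vague criterion can be tightened into measurable checks; a sketch (the criteria strings are illustrative):

from praisonaiagents.eval import Judge

# Vague: leaves the judge too much room for interpretation
vague = Judge(criteria="Response is good")

# Specific: names measurable aspects the judge can score
specific = Judge(criteria=(
    "Response answers the question directly, includes at least one concrete "
    "example, and stays under 100 words"
))
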
Judge uses lazy loading: litellm is only imported when you actually run an evaluation, so importing Judge adds no overhead when it is not in use.