Judge provides a simple, unified API for evaluating agent outputs using LLM-as-judge. It supports accuracy evaluation, criteria-based evaluation, and custom judges.

How It Works

Judge sends the output you want to evaluate, together with either an expected answer or evaluation criteria, to an LLM. The LLM returns a 1-10 score with reasoning, and the result counts as passed when the score meets the configured threshold.

Quick Start

from praisonaiagents.eval import Judge

result = Judge().run(output="4", expected="4")
print(f"Score: {result.score}/10")
print(f"Passed: {result.passed}")

Judge Types

AccuracyJudge

Compares output against expected output
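
Accuracy mode is what the Quick Start above uses: pass both output and expected and the judge scores how closely they match. A slightly longer sketch (the answer strings here are illustrative):

from praisonaiagents.eval import Judge

result = Judge().run(
    output="The capital of France is Paris.",
    expected="Paris",
)
print(result.score, result.passed)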

CriteriaJudge

Evaluates against custom criteria
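
Criteria mode is selected by constructing Judge with a criteria string, as the async example later on this page does. A minimal sketch (the criteria and output text are illustrative):

from praisonaiagents.eval import Judge

judge = Judge(criteria="Response politely greets the user and offers help")
result = judge.run(output="Hello! How can I help you today?")
print(result.score, result.reasoning)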

Configuration

model (string, default: "gpt-4o-mini")
LLM model to use for judging

temperature (float, default: 0.1)
Temperature for LLM calls (lower = more consistent)

threshold (float, default: 7.0)
Score threshold for passing (1-10 scale)

criteria (string)
Custom criteria for evaluation

from praisonaiagents.eval import Judge, JudgeConfig

config = JudgeConfig(
    model="gpt-4o",
    temperature=0.1,
    threshold=8.0,
    criteria="Response is accurate and well-formatted"
)

judge = Judge(config=config)
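
The configured judge is then run like any other; a brief sketch (the output string is illustrative):

# The stricter 8.0 threshold now determines result.passed
result = judge.run(output="2 + 2 = 4, formatted as requested.")
print(result.passed)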

Custom Judges

1. Create Custom Judge

from praisonaiagents.eval import Judge

class RecipeJudge(Judge):
    """Judge for evaluating recipe quality."""
    
    # Prompt template; {output} is filled with the text being judged
    CRITERIA_PROMPT = """Evaluate this recipe:

CRITERIA: Recipe is complete with ingredients and steps

RECIPE:
{output}

Score 1-10 based on completeness and clarity.

SCORE: [1-10]
REASONING: [explanation]
"""

2. Register Judge

from praisonaiagents.eval import add_judge

add_judge("recipe", RecipeJudge)

3. Use Judge

from praisonaiagents.eval import get_judge

RecipeJudge = get_judge("recipe")
judge = RecipeJudge()
result = judge.run(output=recipe_text)  # recipe_text: the recipe string to evaluate

Registry Functions

Function                 Description
add_judge(name, class)   Register a custom judge
get_judge(name)          Get a judge class by name
list_judges()            List all registered judges
remove_judge(name)       Remove a registered judge
from praisonaiagents.eval import add_judge, get_judge, list_judges

# List available judges
print(list_judges())  # ['accuracy', 'criteria']

# Get a judge
AccuracyJudge = get_judge("accuracy")
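
remove_judge undoes a registration; a brief sketch, assuming the "recipe" judge from the previous section is still registered:

from praisonaiagents.eval import list_judges, remove_judge

print(list_judges())    # e.g. ['accuracy', 'criteria', 'recipe']

remove_judge("recipe")  # unregister the custom judge
print(list_judges())    # back to ['accuracy', 'criteria']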

JudgeResult

score (float)
Quality score from 1-10

passed (bool)
Whether score >= threshold

reasoning (string)
Explanation for the score

suggestions (list)
Improvement suggestions

result = judge.run(output="Hello!")

print(result.score)        # 8.5
print(result.passed)       # True
print(result.reasoning)    # "Good greeting..."
print(result.suggestions)  # ["Could add more context"]
print(result.to_dict())    # Full dictionary

CLI Usage

praisonai eval judge --output "4" --expected "4"

Async Support

import asyncio
from praisonaiagents.eval import Judge

async def evaluate():
    judge = Judge(criteria="Response is helpful")
    result = await judge.run_async(output="Hello!")
    return result

result = asyncio.run(evaluate())
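
Because run_async is a coroutine, several outputs can be scored concurrently with standard asyncio tooling. A sketch, assuming a single Judge instance can be shared across concurrent calls (the outputs list is illustrative):

import asyncio
from praisonaiagents.eval import Judge

async def evaluate_all(outputs):
    judge = Judge(criteria="Response is helpful")
    # Run one evaluation per output and wait for all of them
    return await asyncio.gather(*(judge.run_async(output=o) for o in outputs))

results = asyncio.run(evaluate_all(["Hello!", "Go away."]))
for r in results:
    print(r.score, r.passed)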

Best Practices

  • Use accuracy mode when you have a known expected output
  • Use criteria mode for subjective quality evaluation
  • Create custom judges for domain-specific evaluation
  • The default threshold is 7.0 (70% on the 1-10 scale); increase it for critical evaluations and lower it for exploratory testing
  • Be specific in your criteria: include measurable aspects and avoid vague terms like “good” or “nice” (see the sketch after this list)
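
For example, a vague criterion can be tightened into measurable checks; a sketch (the criteria strings are illustrative):

from praisonaiagents.eval import Judge

# Vague: leaves the judge too much room for interpretation
vague = Judge(criteria="Response is good")

# Specific: names measurable aspects the judge can score
specific = Judge(criteria=(
    "Response answers the question directly, includes at least one concrete "
    "example, and stays under 100 words"
))
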
Judge uses lazy loading: litellm is only imported when you actually run an evaluation, so importing Judge adds no overhead when it is not in use.