Recipe Judge analyzes execution traces from multi-agent workflows to evaluate context flow, memory usage, and knowledge retrieval effectiveness.
How It Works
Quick Start
- CLI
- Python
Evaluation Modes
Context Mode
Evaluates context flow between agents (default)
Memory Mode
Evaluates memory store/search effectiveness
Knowledge Mode
Evaluates knowledge retrieval quality
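As a rough mental model, the three modes simply point the judge at different evaluation targets. The sketch below is illustrative only; the `EvalMode` name and string values are assumptions, not identifiers from the library.

```python
from enum import Enum

# Illustrative sketch: the class name and values are assumptions,
# not the library's actual identifiers.
class EvalMode(Enum):
    CONTEXT = "context"      # context flow between agents (default)
    MEMORY = "memory"        # memory store/search effectiveness
    KNOWLEDGE = "knowledge"  # knowledge retrieval quality
```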
CLI Commands
Auto-Chunking (Default)
Auto-chunking is enabled by default. The judge automatically detects when agent outputs exceed the model's context window and intelligently chunks them for evaluation. Use `--no-auto-chunk` to disable it.
- Default (Auto-Chunk): Enabled out of the box; the judge will automatically chunk large outputs.
- Disable Auto-Chunk: Pass `--no-auto-chunk`.
- Force Chunked
How Auto-Chunking Works
- Token Estimation: Estimates token count using character/word heuristics
- Context Check: Compares the estimate against the model's context window with an 80% safety margin (see the sketch after this list)
- Smart Chunking: If needed, splits output into optimal chunks
- Parallel Evaluation: Each chunk is evaluated independently
- Score Aggregation: Chunk scores are combined using a weighted average
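A minimal sketch of steps 1-2, assuming a heuristic of roughly four characters or 1.3 words per token and a 128K-token context window; the judge's actual heuristics and limits are internal and may differ.

```python
# Illustrative only: the real judge's heuristics and thresholds may differ.
def estimate_tokens(text: str) -> int:
    # Blend a chars/4 estimate with a words*1.3 estimate and take the larger.
    by_chars = len(text) / 4
    by_words = len(text.split()) * 1.3
    return int(max(by_chars, by_words))

def needs_chunking(text: str, context_window: int = 128_000,
                   safety_margin: float = 0.8) -> bool:
    # Chunk only when the estimate exceeds 80% of the model's context window.
    return estimate_tokens(text) > context_window * safety_margin
```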
When to Disable Auto-Chunking
- Speed: Auto-chunking adds latency for token counting
- Simple Tasks: Short outputs that never exceed context
- Cost: Chunked evaluation uses more API calls
- Debugging: To see raw truncated behavior
Manual Chunked Evaluation
For large agent outputs that exceed LLM context limits, chunked evaluation splits the output into multiple chunks, evaluates each separately, and aggregates the scores. This preserves ALL content instead of truncating.
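A rough sketch of the splitting step, using the documented defaults of 8,000 characters per chunk and at most 5 chunks per agent. Plain character slicing is an assumption here; the judge may split on smarter boundaries such as sentences or sections.

```python
# Illustrative only: plain slicing; the real splitter may respect sentence boundaries.
def split_into_chunks(text: str, chunk_size: int = 8000, max_chunks: int = 5) -> list[str]:
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    # Cap at max_chunks: with the defaults this preserves up to 40,000 characters.
    return chunks[:max_chunks]
```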
When to Use Chunked Evaluation
- Agent outputs exceed 3,000 characters
- You're seeing `[TRUNCATED]` in evaluation results
- Important information in the middle of outputs is being lost
- You need comprehensive evaluation of long-form content
Aggregation Strategies
| Strategy | Description | Use Case |
|---|---|---|
| `weighted_average` | Weight by chunk size (default) | Balanced evaluation |
| `average` | Simple average | Equal weight to all chunks |
| `min` | Use minimum score | Conservative/strict evaluation |
| `max` | Use maximum score | Optimistic evaluation |
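The strategies in the table above could be implemented roughly as follows; weighting `weighted_average` by character counts is an assumption about how "chunk size" is measured.

```python
# Illustrative implementations of the four aggregation strategies.
def aggregate(scores: list[float], chunk_sizes: list[int],
              strategy: str = "weighted_average") -> float:
    if strategy == "weighted_average":
        # Larger chunks contribute proportionally more to the final score.
        return sum(s * w for s, w in zip(scores, chunk_sizes)) / sum(chunk_sizes)
    if strategy == "average":
        return sum(scores) / len(scores)
    if strategy == "min":
        return min(scores)
    if strategy == "max":
        return max(scores)
    raise ValueError(f"Unknown aggregation strategy: {strategy}")
```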
Chunk Configuration
| Option | Default | Description |
|---|---|---|
| `--chunk-size` | 8000 | Max characters per chunk (optimized for 128K context models) |
| `--max-chunks` | 5 | Max chunks per agent (allows up to 40K chars total) |
| `--aggregation` | `weighted_average` | Score aggregation strategy |
Scoring Criteria
Task Achievement (1-10)
Did the agent accomplish what it was asked to do?
- 10: Perfect task completion
- 6-9: Mostly complete with minor issues
- 3-5: Partial completion
- 1-2: Failed to complete task
Context Utilization (1-10)
How well did the agent use provided context?
- 10: Fully utilized all relevant context
- 6-9: Good use of context
- 3-5: Partial context usage
- 1-2: Ignored important context
Output Quality (1-10)
Does the output match expected format and quality?
- 10: Perfect output quality
- 6-9: Good quality with minor issues
- 3-5: Acceptable but needs improvement
- 1-2: Poor quality output
Instruction Following (1-10)
Did the agent follow specific instructions?
- 10: Followed all instructions precisely
- 6-9: Mostly followed instructions
- 3-5: Partially followed
- 1-2: Ignored instructions
Hallucination Score (1-10)
Did the agent make up facts? (10 = no hallucination)
- 10: No hallucination
- 6-9: Minor inaccuracies
- 3-5: Some fabricated content
- 1-2: Severe hallucination
Error Handling (1-10)
How well did the agent handle errors?
- 10: Excellent error recovery
- 6-9: Good error handling
- 3-5: Basic error handling
- 1-2: Poor error handling
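One way to picture a per-agent result is a record with one 1-10 score per criterion. The field names and the unweighted mean below are assumptions, not the judge's actual schema.

```python
from dataclasses import dataclass, fields

# Hypothetical per-agent score record; field names are assumptions.
@dataclass
class AgentScores:
    task_achievement: float       # did the agent accomplish the task?
    context_utilization: float    # how well was provided context used?
    output_quality: float         # format and quality of the output
    instruction_following: float  # adherence to specific instructions
    hallucination: float          # 10 = no fabricated facts
    error_handling: float         # quality of error recovery

    def overall(self) -> float:
        # Unweighted mean of the six criteria (an assumed aggregation).
        values = [getattr(self, f.name) for f in fields(self)]
        return sum(values) / len(values)
```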
Judge Report
- Trace session identifier
- Average score across all agents (1-10)
- Per-agent evaluation scores
- Actionable improvement suggestions
- Number of agents with detected failures
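A hypothetical shape for the report, mirroring the fields listed above and reusing the `AgentScores` record from the previous sketch; every name here is an assumption, not the real schema.

```python
from dataclasses import dataclass

# Hypothetical report structure; field names are assumptions.
@dataclass
class JudgeReport:
    session_id: str                          # trace session identifier
    overall_score: float                     # average score across all agents (1-10)
    agent_scores: dict[str, "AgentScores"]   # per-agent evaluation scores
    recommendations: list[str]               # actionable improvement suggestions
    failed_agents: int                       # number of agents with detected failures
```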
Fix Workflow
Python API
Integration with Core Judge
Recipe Judge is also available through the unified Judge registry.
Best Practices
Always Save Traces
Use the `--save` flag when running recipes to enable judging.
Use Named Traces
Name your traces for easy reference.
Include YAML for Fixes
Provide the YAML file to get actionable fix recommendations.
Set Clear Goals
Override the recipe goal for better evaluation.
Recipe Judge uses lazy loading - litellm is only imported when you actually run an evaluation, ensuring zero performance impact when not in use.
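A minimal sketch of that lazy-loading pattern, assuming a hypothetical judge class and model name; only the placement of the `import litellm` statement is the point here.

```python
# Illustrative sketch of lazy loading; class, method, and model names are assumptions.
class LazyJudge:
    def __init__(self, model: str = "gpt-4o-mini"):
        self.model = model

    def evaluate(self, prompt: str) -> str:
        # litellm is imported here, inside the method, so constructing the judge
        # (or importing this module) never pays the import cost.
        import litellm

        response = litellm.completion(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content
```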

