Recipe Judge analyzes execution traces from multi-agent workflows to evaluate context flow, memory usage, and knowledge retrieval effectiveness.

How It Works

The judge reads a saved execution trace, scores each agent against the criteria described below, and produces a report with actionable recommendations; when a YAML file is provided, it can also generate a fix plan.

Quick Start

# Run a recipe with trace saving
praisonai recipe run my-recipe --save --name my-test-run

# Judge the execution
praisonai recipe judge my-test-run

# Judge with fix recommendations
praisonai recipe judge my-test-run --yaml agents.yaml

Evaluation Modes

  • Context Mode (default): Evaluates context flow between agents
  • Memory Mode: Evaluates memory store/search effectiveness
  • Knowledge Mode: Evaluates knowledge retrieval quality
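
Each mode maps to the mode argument on the Python judge (see the Python API section below):

from praisonai.replay import ContextEffectivenessJudge

# Evaluate memory store/search effectiveness instead of the default context flow
judge = ContextEffectivenessJudge(mode="memory")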

CLI Commands

# Judge a saved trace by name or session ID
praisonai recipe judge run-abc123

Auto-Chunking (Default)

Auto-chunking is enabled by default. The judge automatically detects when agent outputs exceed the model's context window and intelligently chunks them for evaluation. Use --no-auto-chunk to disable.

# Large outputs are chunked automatically, no extra flags needed
praisonai recipe judge my-trace-id

How it works:
  1. Token Estimation: Estimates token count using character/word heuristics
  2. Context Check: Compares against model’s context window (with 80% safety margin)
  3. Smart Chunking: If needed, splits output into optimal chunks
  4. Parallel Evaluation: Each chunk is evaluated independently
  5. Score Aggregation: Chunk scores are combined using weighted average

When to disable auto-chunking (--no-auto-chunk):
  • Speed: Auto-chunking adds latency for token counting
  • Simple Tasks: Short outputs that never exceed the context window
  • Cost: Chunked evaluation uses more API calls
  • Debugging: To see raw truncation behavior
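
The flow roughly corresponds to the sketch below; the function names, token heuristic, and chunk boundaries are illustrative assumptions, not the actual implementation.

# Illustrative sketch of the auto-chunking flow (hypothetical helpers,
# not praisonai internals)
def estimate_tokens(text: str) -> int:
    # Rough character-based heuristic: ~4 characters per token
    return len(text) // 4

def chunk_text(text: str, chunk_size: int = 8000, max_chunks: int = 5) -> list[str]:
    # Split into fixed-size character chunks, capped at max_chunks
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    return chunks[:max_chunks]

def evaluate_chunk(chunk: str) -> float:
    # Placeholder for a single LLM evaluation call returning a 1-10 score
    return 7.5

def judge_output(output: str, context_window: int = 128_000) -> float:
    budget = int(context_window * 0.8)                 # 80% safety margin
    if estimate_tokens(output) <= budget:
        return evaluate_chunk(output)                  # fits: single evaluation
    chunks = chunk_text(output)                        # smart chunking
    scores = [evaluate_chunk(c) for c in chunks]       # each chunk judged independently
    weights = [len(c) for c in chunks]                 # weight by chunk size
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)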

Manual Chunked Evaluation

For large agent outputs that exceed LLM context limits, chunked evaluation splits the output into multiple chunks, evaluates each separately, and aggregates the scores. This preserves ALL content instead of truncating.
Use chunked evaluation when:
  • Agent outputs exceed 3,000 characters
  • You're seeing [TRUNCATED] in evaluation results
  • Important information in the middle of outputs is being lost
  • You need comprehensive evaluation of long-form content

Score aggregation strategies:

Strategy           Description                      Use Case
weighted_average   Weight by chunk size (default)   Balanced evaluation
average            Simple average                   Equal weight to all chunks
min                Use minimum score                Conservative/strict evaluation
max                Use maximum score                Optimistic evaluation
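
In score terms, the strategies combine per-chunk scores as in this small sketch (illustrative numbers, plain Python rather than the actual implementation):

# How each aggregation strategy combines per-chunk scores
scores = [7.0, 8.5, 6.0]     # one score per chunk
sizes = [8000, 8000, 2500]   # characters per chunk

weighted_average = sum(s * w for s, w in zip(scores, sizes)) / sum(sizes)  # default
average = sum(scores) / len(scores)                                        # equal weight
minimum = min(scores)                                                      # conservative/strict
maximum = max(scores)                                                      # optimistic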

Chunked evaluation options:

Option          Default            Description
--chunk-size    8000               Max characters per chunk (optimized for 128K context models)
--max-chunks    5                  Max chunks per agent (allows up to 40K chars total)
--aggregation   weighted_average   Score aggregation strategy
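
For example, combining these options explicitly (the values shown are the defaults):

praisonai recipe judge my-trace-id --chunk-size 8000 --max-chunks 5 --aggregation weighted_average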

Scoring Criteria

Task Achievement: Did the agent accomplish what it was asked to do?
  • 10: Perfect task completion
  • 6-9: Mostly complete with minor issues
  • 3-5: Partial completion
  • 1-2: Failed to complete task

Context Utilization: How well did the agent use provided context?
  • 10: Fully utilized all relevant context
  • 6-9: Good use of context
  • 3-5: Partial context usage
  • 1-2: Ignored important context

Output Quality: Does the output match expected format and quality?
  • 10: Perfect output quality
  • 6-9: Good quality with minor issues
  • 3-5: Acceptable but needs improvement
  • 1-2: Poor quality output

Instruction Following: Did the agent follow specific instructions?
  • 10: Followed all instructions precisely
  • 6-9: Mostly followed instructions
  • 3-5: Partially followed
  • 1-2: Ignored instructions

Hallucination: Did the agent make up facts? (10 = no hallucination)
  • 10: No hallucination
  • 6-9: Minor inaccuracies
  • 3-5: Some fabricated content
  • 1-2: Severe hallucination

Error Handling: How well did the agent handle errors?
  • 10: Excellent error recovery
  • 6-9: Good error handling
  • 3-5: Basic error handling
  • 1-2: Poor error handling

Judge Report

  • session_id (string): Trace session identifier
  • overall_score (float): Average score across all agents (1-10)
  • agent_scores (list): Per-agent evaluation scores
  • recommendations (list): Actionable improvement suggestions
  • failures_detected (int): Number of agents with detected failures

# Example report structure
report = {
    "session_id": "run-abc123",
    "overall_score": 7.5,
    "agent_scores": [
        {
            "agent_name": "Researcher",
            "task_achievement_score": 8.0,
            "context_utilization_score": 7.0,
            "output_quality_score": 8.0,
            "failure_detected": False,
        }
    ],
    "recommendations": [
        "Researcher: Improve context utilization",
        "Writer: Add expected output format"
    ]
}
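
A report like this can be inspected programmatically; the attribute names below follow the structure above and the Python API example, but treat them as an assumption rather than the definitive interface.

# Hedged sketch: reading fields from a JudgeReport object
if report.failures_detected:
    print(f"Overall score: {report.overall_score}/10")
    for rec in report.recommendations:
        print(f"- {rec}")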

Fix Workflow

1. Run Recipe with Trace

praisonai recipe run my-recipe --save --name test-run

2. Judge the Trace

praisonai recipe judge test-run --yaml agents.yaml --output plan.yaml

3. Review Plan

cat plan.yaml

4. Apply Fixes

praisonai recipe apply plan.yaml --confirm

5. Re-run and Verify

praisonai recipe run my-recipe --save --name test-run-v2
praisonai recipe judge test-run-v2

Python API

from praisonai.replay import (
    ContextEffectivenessJudge,
    ContextTraceReader,
    JudgeReport,
    format_judge_report,
    generate_plan_from_report,
)

# Initialize judge with mode
judge = ContextEffectivenessJudge(
    model="gpt-4o-mini",
    temperature=0.1,
    mode="context",  # or "memory", "knowledge"
)

# Load and judge trace
reader = ContextTraceReader("my-trace-id")
events = reader.get_all()

report = judge.judge_trace(
    events,
    session_id="my-trace-id",
    yaml_file="agents.yaml",  # Optional: for fix recommendations
    evaluate_tools=True,
    evaluate_context_flow=True,
)

# Display report
print(format_judge_report(report))

# Generate fix plan
if report.overall_score < 7.0:
    plan = generate_plan_from_report(report, "agents.yaml")
    plan.save("fixes.yaml")

Integration with Core Judge

Recipe Judge is also available through the unified Judge registry.

from praisonaiagents.eval import get_judge, list_judges

# List available judges
print(list_judges())  # ['accuracy', 'criteria', 'recipe']

# Get RecipeJudge
RecipeJudge = get_judge("recipe")
judge = RecipeJudge(mode="context")

# For simple output evaluation
result = judge.run(output="Recipe output text")
print(f"Score: {result.score}/10")

Best Practices

Use the --save flag when running recipes to enable judging:

praisonai recipe run my-recipe --save

Name your traces for easy reference:

praisonai recipe run my-recipe --save --name experiment-v1

Provide the YAML file to get actionable fix recommendations:

praisonai recipe judge trace-id --yaml agents.yaml

Override the recipe goal for better evaluation:

praisonai recipe judge trace-id --goal "Generate blog post from URL"

Recipe Judge uses lazy loading: litellm is only imported when you actually run an evaluation, so there is zero performance impact when the judge is not in use.