Recipe Judge analyzes execution traces from multi-agent workflows to evaluate context flow, memory usage, and knowledge retrieval effectiveness.
How It Works
Quick Start
- CLI
- Python
Evaluation Modes
Context Mode
Evaluates context flow between agents (default)
Memory Mode
Evaluates memory store/search effectiveness
Knowledge Mode
Evaluates knowledge retrieval quality
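As a rough mental model, the three modes simply point the judge at different evaluation targets. The sketch below is illustrative only; the `EvalMode` name and string values are assumptions, not identifiers from the library.

```python
from enum import Enum

# Illustrative sketch: the class name and values are assumptions,
# not the library's actual identifiers.
class EvalMode(Enum):
    CONTEXT = "context"      # context flow between agents (default)
    MEMORY = "memory"        # memory store/search effectiveness
    KNOWLEDGE = "knowledge"  # knowledge retrieval quality
```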
CLI Commands
Auto-Chunking (Default)
Auto-chunking is enabled by default. The judge automatically detects when agent outputs exceed the model's context window and intelligently chunks them for evaluation. Use `--no-auto-chunk` to disable it.
- Default (Auto-Chunk): Enabled out of the box; the judge will automatically chunk large outputs.
- Disable Auto-Chunk: Pass `--no-auto-chunk`.
- Force Chunked
How Auto-Chunking Works
- Token Estimation: Estimates token count using character/word heuristics
- Context Check: Compares the estimate against the model's context window with an 80% safety margin (see the sketch after this list)
- Smart Chunking: If needed, splits output into optimal chunks
- Parallel Evaluation: Each chunk is evaluated independently
- Score Aggregation: Chunk scores are combined using a weighted average
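A minimal sketch of steps 1-2, assuming a heuristic of roughly four characters or 1.3 words per token and a 128K-token context window; the judge's actual heuristics and limits are internal and may differ.

```python
# Illustrative only: the real judge's heuristics and thresholds may differ.
def estimate_tokens(text: str) -> int:
    # Blend a chars/4 estimate with a words*1.3 estimate and take the larger.
    by_chars = len(text) / 4
    by_words = len(text.split()) * 1.3
    return int(max(by_chars, by_words))

def needs_chunking(text: str, context_window: int = 128_000,
                   safety_margin: float = 0.8) -> bool:
    # Chunk only when the estimate exceeds 80% of the model's context window.
    return estimate_tokens(text) > context_window * safety_margin
```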
When to Disable Auto-Chunking
- Speed: Auto-chunking adds latency for token counting
- Simple Tasks: Short outputs that never exceed context
- Cost: Chunked evaluation uses more API calls
- Debugging: To see raw truncated behavior
Manual Chunked Evaluation
For large agent outputs that exceed LLM context limits, chunked evaluation splits the output into multiple chunks, evaluates each separately, and aggregates the scores. This preserves ALL content instead of truncating.
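A rough sketch of the splitting step, using the documented defaults of 8,000 characters per chunk and at most 5 chunks per agent. Plain character slicing is an assumption here; the judge may split on smarter boundaries such as sentences or sections.

```python
# Illustrative only: plain slicing; the real splitter may respect sentence boundaries.
def split_into_chunks(text: str, chunk_size: int = 8000, max_chunks: int = 5) -> list[str]:
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    # Cap at max_chunks: with the defaults this preserves up to 40,000 characters.
    return chunks[:max_chunks]
```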
When to Use Chunked Evaluation
- Agent outputs exceed 3,000 characters
- You're seeing `[TRUNCATED]` in evaluation results
- Important information in the middle of outputs is being lost
- You need comprehensive evaluation of long-form content
Aggregation Strategies
| Strategy | Description | Use Case |
|---|---|---|
| `weighted_average` | Weight by chunk size (default) | Balanced evaluation |
| `average` | Simple average | Equal weight to all chunks |
| `min` | Use minimum score | Conservative/strict evaluation |
| `max` | Use maximum score | Optimistic evaluation |
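The strategies in the table above could be implemented roughly as follows; weighting `weighted_average` by character counts is an assumption about how "chunk size" is measured.

```python
# Illustrative implementations of the four aggregation strategies.
def aggregate(scores: list[float], chunk_sizes: list[int],
              strategy: str = "weighted_average") -> float:
    if strategy == "weighted_average":
        # Larger chunks contribute proportionally more to the final score.
        return sum(s * w for s, w in zip(scores, chunk_sizes)) / sum(chunk_sizes)
    if strategy == "average":
        return sum(scores) / len(scores)
    if strategy == "min":
        return min(scores)
    if strategy == "max":
        return max(scores)
    raise ValueError(f"Unknown aggregation strategy: {strategy}")
```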
Chunk Configuration
| Option | Default | Description |
|---|---|---|
| `--chunk-size` | 8000 | Max characters per chunk (optimized for 128K context models) |
| `--max-chunks` | 5 | Max chunks per agent (allows up to 40K chars total) |
| `--aggregation` | `weighted_average` | Score aggregation strategy |
Scoring Criteria
Task Achievement (1-10)
Did the agent accomplish what it was asked to do?
- 10: Perfect task completion
- 6-9: Mostly complete with minor issues
- 3-5: Partial completion
- 1-2: Failed to complete task
Context Utilization (1-10)
How well did the agent use provided context?
- 10: Fully utilized all relevant context
- 6-9: Good use of context
- 3-5: Partial context usage
- 1-2: Ignored important context
Output Quality (1-10)
Does the output match expected format and quality?
- 10: Perfect output quality
- 6-9: Good quality with minor issues
- 3-5: Acceptable but needs improvement
- 1-2: Poor quality output
Instruction Following (1-10)
Did the agent follow specific instructions?
- 10: Followed all instructions precisely
- 6-9: Mostly followed instructions
- 3-5: Partially followed
- 1-2: Ignored instructions
Hallucination Score (1-10)
Did the agent make up facts? (10 = no hallucination)
- 10: No hallucination
- 6-9: Minor inaccuracies
- 3-5: Some fabricated content
- 1-2: Severe hallucination
Error Handling (1-10)
How well did the agent handle errors?
- 10: Excellent error recovery
- 6-9: Good error handling
- 3-5: Basic error handling
- 1-2: Poor error handling
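One way to picture a per-agent result is a record with one 1-10 score per criterion. The field names and the unweighted mean below are assumptions, not the judge's actual schema.

```python
from dataclasses import dataclass, fields

# Hypothetical per-agent score record; field names are assumptions.
@dataclass
class AgentScores:
    task_achievement: float       # did the agent accomplish the task?
    context_utilization: float    # how well was provided context used?
    output_quality: float         # format and quality of the output
    instruction_following: float  # adherence to specific instructions
    hallucination: float          # 10 = no fabricated facts
    error_handling: float         # quality of error recovery

    def overall(self) -> float:
        # Unweighted mean of the six criteria (an assumed aggregation).
        values = [getattr(self, f.name) for f in fields(self)]
        return sum(values) / len(values)
```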
Judge Report
- Trace session identifier
- Average score across all agents (1-10)
- Per-agent evaluation scores
- Actionable improvement suggestions
- Number of agents with detected failures
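A hypothetical shape for the report, mirroring the fields listed above and reusing the `AgentScores` record from the previous sketch; every name here is an assumption, not the real schema.

```python
from dataclasses import dataclass

# Hypothetical report structure; field names are assumptions.
@dataclass
class JudgeReport:
    session_id: str                          # trace session identifier
    overall_score: float                     # average score across all agents (1-10)
    agent_scores: dict[str, "AgentScores"]   # per-agent evaluation scores
    recommendations: list[str]               # actionable improvement suggestions
    failed_agents: int                       # number of agents with detected failures
```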
Fix Workflow
Python API
Integration with Core Judge
Recipe Judge is also available through the unified Judge registry.
Best Practices
Always Save Traces
Use the `--save` flag when running recipes to enable judging.
Use Named Traces
Name your traces for easy reference.
Include YAML for Fixes
Provide the YAML file to get actionable fix recommendations.
Set Clear Goals
Override the recipe goal for better evaluation.
Recipe Judge uses lazy loading - litellm is only imported when you actually run an evaluation, ensuring zero performance impact when not in use.
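A minimal sketch of that lazy-loading pattern, assuming a hypothetical judge class and model name; only the placement of the `import litellm` statement is the point here.

```python
# Illustrative sketch of lazy loading; class, method, and model names are assumptions.
class LazyJudge:
    def __init__(self, model: str = "gpt-4o-mini"):
        self.model = model

    def evaluate(self, prompt: str) -> str:
        # litellm is imported here, inside the method, so constructing the judge
        # (or importing this module) never pays the import cost.
        import litellm

        response = litellm.completion(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content
```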

