Overview
What is a Dynamic Judge?
A Dynamic Judge is a configurable evaluation system that can assess outputs against custom criteria. Unlike hardcoded judges that only evaluate agent recipes, dynamic judges accept:
- Custom criteria - Define what “good” means for your domain
- Custom prompt templates - Control how the LLM evaluates outputs
- Custom optimization rules - Pluggable fix patterns for your domain
Quick Start
CLI Usage
The simplest way to use custom criteria is via the CLI.
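Exact flags depend on your installation; the invocation below is a hypothetical sketch (the entry point and flag names are illustrative, not confirmed, so check your CLI's --help):

```bash
# Hypothetical entry point and flags, for illustration only
judge evaluate output.txt \
  --criteria "Responses must cite at least one source" \
  --threshold 8.0
```

Python API

The same configuration is available programmatically. A minimal sketch, assuming the package exposes JudgeCriteriaConfig and a DynamicJudge class at the top level (the import path, class name, and result shape are assumptions):

```python
# Import path and DynamicJudge name are assumptions; adjust to your package layout.
from dynamic_judge import DynamicJudge, JudgeCriteriaConfig

judge = DynamicJudge(
    criteria=JudgeCriteriaConfig(
        name="citation_check",
        description="Checks that responses cite their sources",
    )
)
result = judge.evaluate("The sky is blue [1].")  # result shape is illustrative
```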
JudgeCriteriaConfig
The JudgeCriteriaConfig dataclass enables full control over evaluation:
| Field | Type | Description |
|---|---|---|
| name | str | Name of the criteria configuration |
| description | str | Description of what is being evaluated |
| prompt_template | str | Custom prompt with {output} placeholder |
| scoring_dimensions | List[str] | Dimensions to score (e.g., ["accuracy", "efficiency"]) |
| threshold | float | Score threshold for passing (default: 7.0) |
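For example, a fully specified config might look like this (the field values are illustrative; only the field names and the 7.0 default come from the table above, and the import path is an assumption):

```python
from dynamic_judge import JudgeCriteriaConfig  # import path is an assumption

config = JudgeCriteriaConfig(
    name="sql_review",
    description="Evaluates generated SQL for correctness and efficiency",
    prompt_template="Rate the following SQL from 1-10 on each dimension:\n{output}",
    scoring_dimensions=["accuracy", "efficiency"],
    threshold=7.0,  # the documented default
)
```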
Template Placeholders
Your prompt_template can use these placeholders:
- {output} - The output being evaluated
- {input} or {input_text} - The original input
- {expected} - Expected output (if provided)
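A template can combine all three; here is a hypothetical example (only the placeholder names are documented, the wording is illustrative):

```python
# {expected} is only meaningful when a reference output is provided.
template = (
    "Input:\n{input}\n\n"
    "Candidate output:\n{output}\n\n"
    "Expected output (if any):\n{expected}\n\n"
    "Score the candidate from 1-10 on faithfulness to the input."
)
```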
OptimizationRuleProtocol
Create custom optimization rules for your domain.
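The protocol's required method names are not shown here, so the sketch below assumes a simple matches/apply shape; verify the required methods against OptimizationRuleProtocol before relying on this:

```python
class UppercaseSqlKeywordsRule:
    """Hypothetical rule: normalize lowercase SQL keywords to uppercase.

    The method names here are assumptions about OptimizationRuleProtocol.
    """

    name = "uppercase_sql_keywords"

    def matches(self, output: str) -> bool:
        # Fire when common SQL keywords appear in lowercase.
        return any(kw in output for kw in ("select ", "from ", "where "))

    def apply(self, output: str) -> str:
        # Naive replacement; a real rule would use a SQL-aware tokenizer.
        for kw in ("select", "from", "where"):
            output = output.replace(kw, kw.upper())
        return output
```

Rule Registry Functions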
| Function | Description |
|---|---|
| add_optimization_rule(name, rule_class) | Register a custom rule |
| get_optimization_rule(name) | Get a rule by name |
| list_optimization_rules() | List all registered rules |
| remove_optimization_rule(name) | Remove a rule |
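Putting the registry together with the rule sketched above (the import path is an assumption):

```python
from dynamic_judge import (  # import path is an assumption
    add_optimization_rule,
    get_optimization_rule,
    list_optimization_rules,
    remove_optimization_rule,
)

# UppercaseSqlKeywordsRule is the hypothetical class defined in the sketch above.
add_optimization_rule("uppercase_sql_keywords", UppercaseSqlKeywordsRule)
print(list_optimization_rules())  # should now include "uppercase_sql_keywords"
rule_cls = get_optimization_rule("uppercase_sql_keywords")
remove_optimization_rule("uppercase_sql_keywords")
```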
Architecture
Domain Examples
Water Flow Optimization
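As a sketch, criteria for this domain might look like the following (the dimension names and wording are illustrative, not taken from a shipped example):

```python
from dynamic_judge import JudgeCriteriaConfig  # import path is an assumption

water_flow_criteria = JudgeCriteriaConfig(
    name="water_flow",
    description="Evaluates proposed valve schedules for a water network",
    prompt_template="Assess this valve schedule for safety and throughput:\n{output}",
    scoring_dimensions=["safety", "throughput", "pressure_stability"],
    threshold=8.0,
)
```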
Data Pipeline Optimization
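A comparable sketch for pipelines (again, values are illustrative only):

```python
from dynamic_judge import JudgeCriteriaConfig  # import path is an assumption

pipeline_criteria = JudgeCriteriaConfig(
    name="data_pipeline",
    description="Evaluates generated ETL steps for correctness and cost",
    prompt_template="Review this pipeline definition:\n{output}",
    scoring_dimensions=["correctness", "idempotency", "cost_efficiency"],
    threshold=7.0,
)
```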
Manufacturing Quality
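And a sketch for a quality-critical manufacturing setting (illustrative; note the higher threshold, in line with the best practices below):

```python
from dynamic_judge import JudgeCriteriaConfig  # import path is an assumption

manufacturing_criteria = JudgeCriteriaConfig(
    name="manufacturing_quality",
    description="Evaluates process adjustments against quality targets",
    prompt_template="Score this process adjustment plan:\n{output}",
    scoring_dimensions=["defect_rate_impact", "throughput", "safety"],
    threshold=9.0,  # critical system, so a higher bar
)
```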
Backward Compatibility
The dynamic judge system is fully backward compatible.
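Existing call sites that never pass custom criteria keep working unchanged; omitting the config presumably falls back to the built-in evaluation. A minimal sketch (the constructor shape is an assumption):

```python
from dynamic_judge import DynamicJudge  # import path is an assumption

judge = DynamicJudge()  # no custom criteria: built-in evaluation applies
result = judge.evaluate("some output")
```

Best Practices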
Be Specific
Define clear, measurable criteria. Vague criteria like “good output” lead to inconsistent scores.
Use Dimensions
Break evaluation into scoring dimensions for more granular feedback.
Set Appropriate Thresholds
Use higher thresholds (8-9) for critical systems and lower thresholds (6-7) for exploratory work.
Test Your Criteria
Run your judge on known good/bad outputs to calibrate scoring.

