## Quick Start

## Commands

### tracker run
Execute a task with full step tracking.
| Option | Description |
|---|---|
| `--max-iterations`, `-n` | Maximum iterations (default: 20) |
| `--model`, `-m` | LLM model to use |
| `--tools`, `-t` | Comma-separated tool names |
| `--extended`, `-e` | Include extended tools (may require API keys) |
| `--verbose`, `-v` | Show full agent output |
| `--live`/`--no-live` | Live step updates (default: on) |
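For example, a capped, verbose run restricted to a few file tools might look like this; the positional task-string argument is an assumption about the invocation shape, and the tool names come from the Default Tools list:

```shell
# Cap the agent at 10 iterations and show full output.
# Passing the task as a positional string is an assumption.
tracker run "Summarize the contents of notes.txt" \
  --max-iterations 10 \
  --tools read_file,list_files \
  --verbose
```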
### tracker judge
Execute a task, then evaluate the execution trace with an LLM judge. Returns a score (1–10), pass/fail verdict, reasoning, and suggestions.
| Option | Description |
|---|---|
| `--criteria`, `-c` | Custom evaluation criteria |
| `--expected`, `-e` | Expected output for accuracy check |
| `--threshold` | Pass/fail score threshold, 1–10 (default: 7.0) |
| `--max-iterations`, `-n` | Maximum iterations (default: 20) |
| `--model`, `-m` | LLM model for the agent |
| `--judge-model` | LLM model for the judge (defaults to agent model) |
| `--tools`, `-t` | Comma-separated tool names |
| `--extended` | Include extended tools |
| `--verbose`, `-v` | Show full agent output |
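For a deterministic task, `--expected` enables the accuracy check and `--threshold` tightens the gate; a sketch, assuming the task is passed positionally:

```shell
# Judge a math task against a known answer with a stricter pass bar.
tracker judge "What is 17 * 23?" \
  --expected "391" \
  --threshold 8.0 \
  --judge-model gpt-4o
```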
### tracker tools
List all available tools.
### tracker batch
Run multiple tasks from a JSON file and compare results.
| Option | Description |
|---|---|
| `--max-iterations`, `-n` | Max iterations per task |
| `--model`, `-m` | LLM model |
| `--output`, `-o` | Output JSON file |
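The schema of the tasks file isn't specified here; assuming a simple JSON array of task objects, a batch run might look like:

```shell
# Hypothetical tasks file: the exact schema (an array of objects with
# a "task" field) is an assumption, not documented above.
cat > tasks.json <<'EOF'
[
  {"task": "What is 12 * 12?"},
  {"task": "List the files in the current directory"}
]
EOF

# Run every task and write a comparison report.
tracker batch tasks.json --max-iterations 10 --output results.json
```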
## Default Tools

The tracker includes 31 built-in tools, no API keys required:

| Category | Tools |
|---|---|
| Web Search | search_web, internet_search |
| Web Crawl | web_crawl, scrape_page, extract_text, extract_links |
| Files | read_file, write_file, list_files, copy_file, move_file, delete_file, get_file_info |
| Shell | execute_command, list_processes, get_system_info |
| Python | execute_code, analyze_code, format_code, lint_code |
| ACP | acp_create_file, acp_edit_file, acp_delete_file, acp_execute_command |
| LSP | lsp_list_symbols, lsp_find_definition, lsp_find_references, lsp_get_diagnostics |
| Scheduling | schedule_add, schedule_list, schedule_remove |
Pass `--extended` to also load tools that need API keys (Tavily, Exa, Crawl4AI, You.com).
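A sketch of restricting and expanding the toolset (tool names taken from the table above; task strings are illustrative):

```shell
# Restrict the agent to a few file tools.
tracker run "Copy notes.txt into backup/" --tools read_file,copy_file,list_files

# Load the extended, API-key-backed tools as well.
tracker run "Research recent Rust async runtimes" --extended
```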
## How the Judge Works

The judge evaluates five dimensions by default:

- Task Completion — Did the agent finish the task?
- Tool Selection — Were the right tools used?
- Efficiency — Minimal unnecessary steps?
- Error Handling — Graceful error recovery?
- Output Quality — Accurate and useful result?
Use `--criteria` for domain-specific evaluation.
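Taken together, a judge result might be shaped roughly like the following; the field names are illustrative, since the text above only promises a score, a pass/fail verdict, reasoning, and suggestions:

```json
{
  "score": 8.5,
  "verdict": "pass",
  "reasoning": "The agent completed the task with appropriate tool choices and no wasted steps.",
  "suggestions": ["Consolidate the two redundant web searches into one."]
}
```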
## Best Practices

**Use judge for CI/CD quality gates.** Run `tracker judge` with `--threshold 8.0` in your pipeline to catch regressions in agent behavior.

**Set --max-iterations low for testing.** Use `--max-iterations 5` during development to get fast feedback loops.

**Use --expected for deterministic tasks.** For math, code execution, or factual queries, pass `--expected` to enable accuracy scoring.

**Separate agent and judge models.** Use `--judge-model gpt-4o` with a cheaper agent model to get high-quality evaluation without increasing agent costs.
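Combining these practices, a CI-oriented invocation might look like this; the task string and the cheaper agent model name are illustrative choices, not defaults of the tool:

```shell
# A cheaper model drives the agent, gpt-4o judges, and the CI gate is 8.0.
tracker judge "Run the test suite and report failures" \
  --model gpt-4o-mini \
  --judge-model gpt-4o \
  --threshold 8.0
```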
