## Quick Start

## Commands

### tracker run
Execute a task with full step tracking.
| Option | Description |
|---|---|
| `--max-iterations`, `-n` | Maximum iterations (default: 20) |
| `--model`, `-m` | LLM model to use |
| `--tools`, `-t` | Comma-separated tool names |
| `--extended`, `-e` | Include extended tools (may require API keys) |
| `--verbose`, `-v` | Show full agent output |
| `--live`/`--no-live` | Live step updates (default: on) |
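For example, a capped, verbose run restricted to a few file tools might look like this; the positional task-string argument is an assumption about the invocation shape, and the tool names come from the Default Tools list:

```shell
# Cap the agent at 10 iterations and show full output.
# Passing the task as a positional string is an assumption.
tracker run "Summarize the contents of notes.txt" \
  --max-iterations 10 \
  --tools read_file,list_files \
  --verbose
```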
### tracker judge
Execute a task, then evaluate the execution trace with an LLM judge. Returns a score (1–10), pass/fail verdict, reasoning, and suggestions.
| Option | Description |
|---|---|
| `--criteria`, `-c` | Custom evaluation criteria |
| `--expected`, `-e` | Expected output for accuracy check |
| `--threshold` | Pass/fail score threshold, 1–10 (default: 7.0) |
| `--max-iterations`, `-n` | Maximum iterations (default: 20) |
| `--model`, `-m` | LLM model for the agent |
| `--judge-model` | LLM model for the judge (defaults to agent model) |
| `--tools`, `-t` | Comma-separated tool names |
| `--extended` | Include extended tools |
| `--verbose`, `-v` | Show full agent output |
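For a deterministic task, `--expected` enables the accuracy check and `--threshold` tightens the gate; a sketch, assuming the task is passed positionally:

```shell
# Judge a math task against a known answer with a stricter pass bar.
tracker judge "What is 17 * 23?" \
  --expected "391" \
  --threshold 8.0 \
  --judge-model gpt-4o
```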
### tracker tools
List all available tools.
### tracker batch
Run multiple tasks from a JSON file and compare results.
| Option | Description |
|---|---|
| `--max-iterations`, `-n` | Max iterations per task |
| `--model`, `-m` | LLM model |
| `--output`, `-o` | Output JSON file |
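The schema of the tasks file isn't specified here; assuming a simple JSON array of task objects, a batch run might look like:

```shell
# Hypothetical tasks file: the exact schema (an array of objects with
# a "task" field) is an assumption, not documented above.
cat > tasks.json <<'EOF'
[
  {"task": "What is 12 * 12?"},
  {"task": "List the files in the current directory"}
]
EOF

# Run every task and write a comparison report.
tracker batch tasks.json --max-iterations 10 --output results.json
```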
## Default Tools

The tracker includes 31 built-in tools, no API keys required:

| Category | Tools |
|---|---|
| Web Search | search_web, internet_search |
| Web Crawl | web_crawl, scrape_page, extract_text, extract_links |
| Files | read_file, write_file, list_files, copy_file, move_file, delete_file, get_file_info |
| Shell | execute_command, list_processes, get_system_info |
| Python | execute_code, analyze_code, format_code, lint_code |
| ACP | acp_create_file, acp_edit_file, acp_delete_file, acp_execute_command |
| LSP | lsp_list_symbols, lsp_find_definition, lsp_find_references, lsp_get_diagnostics |
| Scheduling | schedule_add, schedule_list, schedule_remove |
Pass `--extended` to also load tools that need API keys (Tavily, Exa, Crawl4AI, You.com).
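A sketch of restricting and expanding the toolset (tool names taken from the table above; task strings are illustrative):

```shell
# Restrict the agent to a few file tools.
tracker run "Copy notes.txt into backup/" --tools read_file,copy_file,list_files

# Load the extended, API-key-backed tools as well.
tracker run "Research recent Rust async runtimes" --extended
```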
## How the Judge Works

The judge evaluates five dimensions by default:

- Task Completion — Did the agent finish the task?
- Tool Selection — Were the right tools used?
- Efficiency — Minimal unnecessary steps?
- Error Handling — Graceful error recovery?
- Output Quality — Accurate and useful result?
Use `--criteria` for domain-specific evaluation.
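Taken together, a judge result might be shaped roughly like the following; the field names are illustrative, since the text above only promises a score, a pass/fail verdict, reasoning, and suggestions:

```json
{
  "score": 8.5,
  "verdict": "pass",
  "reasoning": "The agent completed the task with appropriate tool choices and no wasted steps.",
  "suggestions": ["Consolidate the two redundant web searches into one."]
}
```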
## Best Practices

**Use judge for CI/CD quality gates.** Run `tracker judge` with `--threshold 8.0` in your pipeline to catch regressions in agent behavior.

**Set --max-iterations low for testing.** Use `--max-iterations 5` during development to get fast feedback loops.

**Use --expected for deterministic tasks.** For math, code execution, or factual queries, pass `--expected` to enable accuracy scoring.

**Separate agent and judge models.** Use `--judge-model gpt-4o` with a cheaper agent model to get high-quality evaluation without increasing agent costs.
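Combining these practices, a CI-oriented invocation might look like this; the task string and the cheaper agent model name are illustrative choices, not defaults of the tool:

```shell
# A cheaper model drives the agent, gpt-4o judges, and the CI gate is 8.0.
tracker judge "Run the test suite and report failures" \
  --model gpt-4o-mini \
  --judge-model gpt-4o \
  --threshold 8.0
```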
