Track every step an autonomous agent takes — tool calls, decisions, errors — then optionally judge execution quality with an LLM.

Quick Start

1. Run a Tracked Task

```bash
praisonai tracker run "Search for Python best practices and summarize"
```

2. Run and Judge

```bash
praisonai tracker judge "What is 2+2? Use execute_code" --expected "4"
```

Commands

tracker run

Execute a task with full step tracking.
praisonai tracker run "Read config.yaml and explain its structure" -v
OptionDescription
--max-iterations, -nMaximum iterations (default: 20)
--model, -mLLM model to use
--tools, -tComma-separated tool names
--extended, -eInclude extended tools (may require API keys)
--verbose, -vShow full agent output
--live/--no-liveLive step updates (default: on)
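
These flags can be combined. As a sketch, the run below caps iterations and restricts the agent to two of the built-in file tools; the model name is only an example, not a requirement:

```bash
# Run with a restricted tool set and a capped iteration budget.
# The model name is an example only -- substitute whatever your setup supports.
praisonai tracker run "Read config.yaml and explain its structure" \
  --model gpt-4o-mini \
  --tools read_file,list_files \
  --max-iterations 10 \
  --verbose
```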

tracker judge

Execute a task, then evaluate the execution trace with an LLM judge. Returns a score (1–10), pass/fail verdict, reasoning, and suggestions.
```bash
# Default criteria (task completion, tool selection, efficiency, error handling, output quality)
praisonai tracker judge "Calculate fibonacci(10) using execute_code"

# Custom criteria
praisonai tracker judge "Search for AI news" --criteria "Must use search_web tool"

# Accuracy mode with expected output
praisonai tracker judge "What is 2+2?" --expected "4" --threshold 8.0

# Use a different model for judging
praisonai tracker judge "List files in /tmp" --judge-model gpt-4o
```

| Option | Description |
| --- | --- |
| --criteria, -c | Custom evaluation criteria |
| --expected, -e | Expected output for accuracy check |
| --threshold | Pass/fail score threshold, 1–10 (default: 7.0) |
| --max-iterations, -n | Maximum iterations (default: 20) |
| --model, -m | LLM model for the agent |
| --judge-model | LLM model for the judge (defaults to agent model) |
| --tools, -t | Comma-separated tool names |
| --extended | Include extended tools |
| --verbose, -v | Show full agent output |
Output:
```text
⚖️ Agent Tracker + Judge

Phase 1: Executing task...
  [1] ✅ tool_call: execute_code (0.16s)
  [2] ✅ chat: thinking (3.21s)

Phase 2: Judging execution...
╭── ⚖️ Judge Verdict ──────────────────────────────────╮
│ ✅ Score: 9.0/10  ██████████████████░░                │
│ Threshold: 7.0  |  Verdict: PASS                     │
│                                                       │
│ Reasoning:                                            │
│ Correctly calculated 2+2=4 using execute_code tool.   │
╰───────────────────────────────────────────────────────╯
💡 Suggestions:
  • Streamline output to focus on final result
```
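
The judge flags compose as well. The sketch below pairs custom criteria with a restricted tool set and a separate judge model; the model name, criteria wording, and threshold are illustrative values, not defaults:

```bash
# Custom criteria plus a separate judge model; the concrete values are illustrative.
praisonai tracker judge "Search for AI news and summarize the top 3 stories" \
  --criteria "Must use search_web and cite at least 3 sources" \
  --tools search_web,scrape_page \
  --judge-model gpt-4o \
  --threshold 7.5
```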

tracker tools

List all available tools.
```bash
praisonai tracker tools
```

tracker batch

Run multiple tasks from a JSON file and compare results.
```bash
praisonai tracker batch tasks.json -o results.json
```

| Option | Description |
| --- | --- |
| --max-iterations, -n | Max iterations per task |
| --model, -m | LLM model |
| --output, -o | Output JSON file |
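
The exact schema of the tasks file isn't documented on this page. As an assumption, a minimal file might be a JSON array of task strings, as in the sketch below; verify the format against your PraisonAI version before relying on it:

```bash
# ASSUMPTION: tasks.json as a plain JSON array of task strings.
# Check the actual schema expected by your PraisonAI version.
cat > tasks.json <<'EOF'
[
  "What is 2+2? Use execute_code",
  "List files in /tmp"
]
EOF

# Run the batch with a per-task iteration cap and write results to a file.
praisonai tracker batch tasks.json --max-iterations 10 -o results.json
```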

Default Tools

The tracker includes 31 built-in tools — no API keys required:
| Category | Tools |
| --- | --- |
| Web Search | search_web, internet_search |
| Web Crawl | web_crawl, scrape_page, extract_text, extract_links |
| Files | read_file, write_file, list_files, copy_file, move_file, delete_file, get_file_info |
| Shell | execute_command, list_processes, get_system_info |
| Python | execute_code, analyze_code, format_code, lint_code |
| ACP | acp_create_file, acp_edit_file, acp_delete_file, acp_execute_command |
| LSP | lsp_list_symbols, lsp_find_definition, lsp_find_references, lsp_get_diagnostics |
| Scheduling | schedule_add, schedule_list, schedule_remove |
ACP (Agent-Centric Protocol) tools provide plan/approve/apply/verify workflows for safe file and command operations. LSP tools provide code intelligence features like symbol listing, go-to-definition, and diagnostics.
Use --extended to also load tools that need API keys (Tavily, Exa, Crawl4AI, You.com).
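
To keep runs predictable, you can restrict the agent to a subset of these tools with --tools, or opt into the API-key tools with --extended. A sketch using tool names from the table above:

```bash
# Restrict the agent to two file tools only.
praisonai tracker run "List files in /tmp and summarize them" --tools list_files,read_file

# Opt into extended tools that require API keys (Tavily, Exa, Crawl4AI, You.com).
praisonai tracker run "Search for recent AI papers" --extended
```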

How the Judge Works

The judge evaluates five dimensions by default:
  1. Task Completion — Did the agent finish the task?
  2. Tool Selection — Were the right tools used?
  3. Efficiency — Minimal unnecessary steps?
  4. Error Handling — Graceful error recovery?
  5. Output Quality — Accurate and useful result?
Override with --criteria for domain-specific evaluation.
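
For example, a domain-specific rubric can be passed as free text; the criteria wording below is only an illustration:

```bash
# Custom criteria replace the five default dimensions for this run.
praisonai tracker judge "Scrape example.com and extract all links" \
  --criteria "Must use web_crawl or scrape_page, and the output must list every link found"
```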

Best Practices

- Run tracker judge with --threshold 8.0 in your pipeline to catch regressions in agent behavior.
- Use --max-iterations 5 during development for fast feedback loops.
- For math, code execution, or factual queries, pass --expected to enable accuracy scoring.
- Use --judge-model gpt-4o with a cheaper agent model for high-quality evaluation without increasing agent costs.
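
A pipeline check might combine several of these tips. The sketch below gates on the judge's PASS/FAIL verdict; whether a FAIL also produces a non-zero exit code isn't stated on this page, so verify that before wiring it into CI. Model names are examples only:

```bash
# Regression gate: cheap agent model, stronger judge, accuracy check, strict threshold.
# Confirm how PASS/FAIL is surfaced (printed verdict vs. exit code) for your version.
praisonai tracker judge "What is 2+2? Use execute_code" \
  --expected "4" \
  --threshold 8.0 \
  --model gpt-4o-mini \
  --judge-model gpt-4o \
  --max-iterations 5
```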