Benchmark PraisonAI agents against Terminal-Bench 2.0, the Stanford/Laude Institute standard for evaluating AI coding agents in realistic terminal environments.

Quick Start

1. Install Dependencies

Install the Harbor framework and PraisonAI with shell tools (the quotes keep shells like zsh from expanding the brackets):
pip install harbor "praisonaiagents[tools]"
Set your API key:
export OPENAI_API_KEY="your-api-key"
2. Run Benchmark

Execute Terminal-Bench with the PraisonAI external agent on a subset of tasks.
harbor run -d terminal-bench/terminal-bench-2 \
  --agent-import-path examples.terminal_bench.praisonai_external_agent:PraisonAIExternalAgent \
  --model openai/gpt-4o \
  --ae OPENAI_API_KEY=$OPENAI_API_KEY \
  -n 4

How It Works

Terminal-Bench provides standardized evaluation of AI agents on realistic coding tasks like compiling code, training models, and system administration.
Component            Purpose
Terminal-Bench 2.0   89 curated tasks covering compilation, ML, and servers
Harbor Framework     Container orchestration and parallel execution
Docker Container     Isolated environment for safe code execution
PraisonAI Agent      Intelligent agent using bash tools
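
The bash tool at the heart of the agent can be sketched as a plain Python function passed to PraisonAI as a tool. This is an illustrative sketch only; the actual tool in examples/terminal_bench/praisonai_external_agent.py may use a different signature, and inside Terminal-Bench the command runs in the task's Docker container rather than locally.

```python
import subprocess

def run_bash(command: str, timeout: int = 60) -> str:
    """Run a shell command and return combined stdout/stderr.

    A function like this is what gets registered as a PraisonAI tool,
    e.g. Agent(instructions=..., tools=[run_bash]).
    """
    result = subprocess.run(
        command, shell=True, capture_output=True, text=True, timeout=timeout
    )
    return result.stdout + result.stderr
```

Returning stderr alongside stdout matters: compiler errors and stack traces are exactly what the agent needs to see to recover from a failed step.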

Integration Approaches


YAML Configuration

Configure benchmark runs using Harbor’s YAML format for reproducible experiments.
# job.yaml - Terminal-Bench configuration
dataset: terminal-bench/terminal-bench-2

agent:
  import_path: examples.terminal_bench.praisonai_external_agent:PraisonAIExternalAgent
  model_name: openai/gpt-4o
  env:
    OPENAI_API_KEY: "${OPENAI_API_KEY}"

n_concurrent: 8
n_attempts: 1

# Optional: Filter to specific tasks
# task_filter:
#   task_names: ["compile_simple_c", "install_python_package"]
Run with configuration:
harbor run -c examples/terminal_bench/job.yaml
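
The ${OPENAI_API_KEY} placeholder in job.yaml is resolved from your environment at run time. The substitution idea can be illustrated with Python's standard library (the DEMO_API_KEY variable and sk-example value below are hypothetical):

```python
import os

# A fragment mirroring the env section of job.yaml (hypothetical variable name)
raw = 'env:\n  DEMO_API_KEY: "${DEMO_API_KEY}"\n'

os.environ["DEMO_API_KEY"] = "sk-example"

# expandvars replaces shell-style ${VAR} placeholders with exported values
expanded = os.path.expandvars(raw)
print(expanded)
```

This is why a forgotten `export` produces an unresolved placeholder instead of a key: the substitution can only see variables present in the environment.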

Task Filtering & Selection

Run specific tasks for targeted testing or debugging.
harbor run -d terminal-bench/terminal-bench-2 \
  --agent-import-path examples.terminal_bench.praisonai_external_agent:PraisonAIExternalAgent \
  --model openai/gpt-4o \
  --ae OPENAI_API_KEY=$OPENAI_API_KEY \
  -i "terminal-bench/compile-cython-ext" \
  -i "terminal-bench/bn-fit-modify"
Run a subset for quick testing with the -l flag.
harbor run -d terminal-bench/terminal-bench-2 \
  --agent-import-path examples.terminal_bench.praisonai_external_agent:PraisonAIExternalAgent \
  --model openai/gpt-4o \
  --ae OPENAI_API_KEY=$OPENAI_API_KEY \
  -l 5 -n 2
Scale to higher concurrency using cloud providers.
harbor run -d terminal-bench/terminal-bench-2 \
  --agent-import-path examples.terminal_bench.praisonai_external_agent:PraisonAIExternalAgent \
  --model openai/gpt-4o \
  --env daytona -n 32 \
  --ae OPENAI_API_KEY=$OPENAI_API_KEY

Interpreting Results

Terminal-Bench uses binary scoring where each task either passes (1.0) or fails (0.0).
# Example output
     praisonai (gpt-4o) on terminal-bench-2
┏━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓
┃ Metric              ┃ Value             ┃
┡━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩
│ Agent               │ praisonai (gpt-4o)│
│ Dataset             │ terminal-bench-2  │
│ Trials              │ 89                │
│ Errors              │ 0                 │
│                     │                   │
│ Mean                │ 0.73              │
│                     │                   │
│ Reward Distribution │                   │
│   reward = 1.0      │ 65                │
│   reward = 0.0      │ 24                │
└─────────────────────┴───────────────────┘
Score  Meaning
1.0    Task passed: verification script succeeded
0.0    Task failed: verification script failed or agent error
Mean   Overall success rate across all tasks
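
Because scoring is binary, the mean is simply the pass count divided by the number of trials. For the distribution shown above:

```python
# Binary scoring: mean = passed / total trials
passed, failed = 65, 24        # reward = 1.0 and reward = 0.0 counts
trials = passed + failed       # 89
mean = passed / trials
print(f"{mean:.2f}")           # 0.73
```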
Model Performance: gpt-4o-mini typically scores near 0.0 on hard tasks. Use openai/gpt-4o or anthropic/claude-3-7-sonnet-20250219 for meaningful scores.

Example Output

Real benchmark session showing PraisonAI external agent results:
$ harbor run -d terminal-bench/terminal-bench-2 \
    --agent-import-path examples.terminal_bench.praisonai_external_agent:PraisonAIExternalAgent \
    --model openai/gpt-4o --ae OPENAI_API_KEY=$OPENAI_API_KEY -l 5 -n 2

  5/5 Mean: 0.000 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0:10:18 0:00:00
Results written to /tmp/harbor_results/2026-04-12__03-00-00/result.json

        praisonai (gpt-4o) on adhoc
┏━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓
┃ Metric              ┃ Value             ┃
┡━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩
│ Agent               │ praisonai (gpt-4o)│
│ Dataset             │ adhoc             │
│ Trials              │ 5                 │
│ Errors              │ 0                 │
│                     │                   │
│ Mean                │ 0.000             │
│                     │                   │
│ Reward Distribution │                   │
│   reward = 0.0      │ 5                 │
└─────────────────────┴───────────────────┘
Tasks included: Cython compilation, Bayesian network fitting, C source build, adaptive sampling, JavaScript filtering.

Best Practices

Always verify the benchmark works by testing with the oracle agent first.
harbor run -d terminal-bench/terminal-bench-2 -a oracle -l 1
This should achieve a perfect score (1.0) and confirm your setup is correct.
Choose models based on your goals:
  • Testing integration: openai/gpt-4o-mini (fast, cheap, low scores)
  • Real benchmarking: openai/gpt-4o or anthropic/claude-3-7-sonnet-20250219
  • Cost optimization: Start with 5-10 tasks before running full benchmark
Terminal-Bench tasks can be resource intensive:
  • Start with -n 2 concurrency for testing
  • Scale to -n 8 for serious benchmarking
  • Use cloud providers (Daytona, E2B, Modal) for -n 32+ concurrency
When tasks fail, examine the execution logs:
# Results are saved with timestamps
cat /tmp/harbor_results/2026-04-12__03-00-00/task-name/agent/output.txt
Common failure modes: timeouts, missing dependencies, and incorrect file paths.

Troubleshooting

Error                                                Solution
"Object of type coroutine is not JSON serializable"  Fixed in current PraisonAI version; update to latest
Docker not found                                     Install Docker Desktop and ensure it's running
Harbor import error                                  Install Harbor: pip install harbor
API key not forwarded                                Use the --ae OPENAI_API_KEY=$OPENAI_API_KEY flag
Permission denied in container                       Ensure Docker has proper permissions
The coroutine serialization error mentioned in early Terminal-Bench integration docs has been fixed in the current SDK version. If you encounter it, update PraisonAI to the latest version.

Sandbox Execution

Safe code execution in isolated environments

Real API Testing

Testing agents with real API integrations