Benchmark PraisonAI agents against Terminal-Bench 2.0, the Stanford/Laude Institute standard for evaluating AI coding agents in realistic terminal environments.
Quick Start
Install Dependencies
Install the Harbor framework and PraisonAI with shell tools:

```bash
pip install harbor "praisonaiagents[tools]"
```

Set your API key:

```bash
export OPENAI_API_KEY="your-api-key"
```
Run Benchmark
Execute Terminal-Bench with the PraisonAI external agent on a subset of tasks:

```bash
harbor run -d terminal-bench/terminal-bench-2 \
  --agent-import-path examples.terminal_bench.praisonai_external_agent:PraisonAIExternalAgent \
  --model openai/gpt-4o \
  --ae OPENAI_API_KEY=$OPENAI_API_KEY \
  -n 4
```
How It Works
Terminal-Bench provides standardized evaluation of AI agents on realistic coding tasks like compiling code, training models, and system administration.
| Component | Purpose |
|---|---|
| Terminal-Bench 2.0 | 89 curated tasks covering compilation, ML, and servers |
| Harbor Framework | Container orchestration and parallel execution |
| Docker Container | Isolated environment for safe code execution |
| PraisonAI Agent | Intelligent agent using bash tools |
Integration Approaches
The external agent approach uses PraisonAI's Agent class directly with Harbor's container environment:

```python
from harbor.agents.base import BaseAgent
from harbor.environments.base import BaseEnvironment
from harbor.models.agent.context import AgentContext
from praisonaiagents import Agent
from praisonaiagents.approval import get_approval_registry, AutoApproveBackend


class PraisonAIExternalAgent(BaseAgent):
    @staticmethod
    def name() -> str:
        return "praisonai"

    async def run(self, instruction: str, environment: BaseEnvironment,
                  context: AgentContext) -> None:
        # Enable auto-approval for container-isolated execution
        registry = get_approval_registry()
        registry.set_backend(AutoApproveBackend(), agent_name="terminal-agent")

        async def bash_tool(command: str) -> str:
            """Execute a bash command in the Harbor container."""
            result = await environment.exec(command=command, timeout_sec=300)
            output_parts = []
            if result.stdout:
                output_parts.append(result.stdout.strip())
            if result.stderr:
                output_parts.append(f"[stderr]: {result.stderr.strip()}")
            if result.return_code != 0:
                output_parts.append(f"[exit_code]: {result.return_code}")
            return "\n".join(output_parts) if output_parts else "(no output)"

        agent = Agent(
            name="terminal-agent",
            instructions="You are an expert terminal agent. Use bash_tool to execute shell commands.",
            tools=[bash_tool],
            llm=self.model_name or "openai/gpt-4o",
        )
        result = await agent.achat(instruction)
```
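The `bash_tool` above folds stdout, stderr, and a nonzero exit code into a single string for the model. That formatting logic can be exercised standalone; the `ExecResult` dataclass below is a hypothetical stand-in for the object Harbor's `environment.exec()` returns, not Harbor's actual type:

```python
from dataclasses import dataclass


@dataclass
class ExecResult:
    """Hypothetical stand-in for the result returned by environment.exec()."""
    stdout: str
    stderr: str
    return_code: int


def format_result(result: ExecResult) -> str:
    # Mirror bash_tool's output formatting: stdout first, then tagged
    # stderr, then the exit code if the command failed.
    parts = []
    if result.stdout:
        parts.append(result.stdout.strip())
    if result.stderr:
        parts.append(f"[stderr]: {result.stderr.strip()}")
    if result.return_code != 0:
        parts.append(f"[exit_code]: {result.return_code}")
    return "\n".join(parts) if parts else "(no output)"


print(format_result(ExecResult("hello\n", "", 0)))  # → hello
print(format_result(ExecResult("", "", 0)))         # → (no output)
```

Returning the exit code only on failure keeps successful tool outputs short, which saves tokens over a long agent run.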
Run command:

```bash
harbor run -d terminal-bench/terminal-bench-2 \
  --agent-import-path examples.terminal_bench.praisonai_external_agent:PraisonAIExternalAgent \
  --model openai/gpt-4o \
  --ae OPENAI_API_KEY=$OPENAI_API_KEY \
  -n 4
```
The wrapper agent uses PraisonAI's CLI interface inside the container:

```python
import os

from harbor.agents.base import BaseAgent
from harbor.environments.base import BaseEnvironment
from harbor.models.agent.context import AgentContext


class PraisonAIWrapperAgent(BaseAgent):
    @staticmethod
    def name() -> str:
        return "praisonai-wrapper"

    async def setup(self, environment: BaseEnvironment) -> None:
        # Install PraisonAI inside the container
        await environment.exec(command="pip install praisonai --quiet")

    async def run(self, instruction: str, environment: BaseEnvironment,
                  context: AgentContext) -> None:
        model = self.model_name or "openai/gpt-4o"
        # Build the CLI command
        command = f'praisonai "{instruction}" --model {model}'
        result = await environment.exec(
            command=command,
            timeout_sec=600,
            env={"OPENAI_API_KEY": os.environ.get("OPENAI_API_KEY")},
        )
```
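One caveat with the f-string above: an instruction containing double quotes or shell metacharacters would break the command. A safer sketch (not the shipped implementation) quotes the instruction with `shlex`:

```python
import shlex


def build_cli_command(instruction: str, model: str) -> str:
    # shlex.quote wraps the instruction so quotes and metacharacters
    # inside it cannot terminate or extend the shell command.
    return f"praisonai {shlex.quote(instruction)} --model {model}"


print(build_cli_command('Say "hello" && exit', "openai/gpt-4o"))
# → praisonai 'Say "hello" && exit' --model openai/gpt-4o
```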
Run command:

```bash
harbor run -d terminal-bench/terminal-bench-2 \
  --agent-import-path examples.terminal_bench.praisonai_wrapper_agent:PraisonAIWrapperAgent \
  --model openai/gpt-4o \
  --ae OPENAI_API_KEY=$OPENAI_API_KEY \
  -n 4
```
YAML Configuration
Configure benchmark runs using Harbor’s YAML format for reproducible experiments.
```yaml
# job.yaml - Terminal-Bench configuration
dataset: terminal-bench/terminal-bench-2
agent:
  import_path: examples.terminal_bench.praisonai_external_agent:PraisonAIExternalAgent
  model_name: openai/gpt-4o
  env:
    OPENAI_API_KEY: "${OPENAI_API_KEY}"
n_concurrent: 8
n_attempts: 1
# Optional: filter to specific tasks
# task_filter:
#   task_names: ["compile_simple_c", "install_python_package"]
```
Run with the configuration file:

```bash
harbor run -c examples/terminal_bench/job.yaml
```
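The `${OPENAI_API_KEY}` placeholder in the job file is expanded from your shell environment. To sanity-check what a job file resolves to before running it, here is a small sketch of that substitution, assuming the common `${VAR}` convention (Harbor's own expansion logic may differ):

```python
import os
import re


def expand_env(text: str) -> str:
    # Replace each ${VAR} placeholder with its value from the environment,
    # leaving unset variables as empty strings.
    return re.sub(r"\$\{(\w+)\}", lambda m: os.environ.get(m.group(1), ""), text)


os.environ["OPENAI_API_KEY"] = "sk-test"
print(expand_env('OPENAI_API_KEY: "${OPENAI_API_KEY}"'))  # → OPENAI_API_KEY: "sk-test"
```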
Task Filtering & Selection
Run specific tasks for targeted testing or debugging:

```bash
harbor run -d terminal-bench/terminal-bench-2 \
  --agent-import-path examples.terminal_bench.praisonai_external_agent:PraisonAIExternalAgent \
  --model openai/gpt-4o \
  --ae OPENAI_API_KEY=$OPENAI_API_KEY \
  -i "terminal-bench/compile-cython-ext" \
  -i "terminal-bench/bn-fit-modify"
```
Run a subset for quick testing with the `-l` flag:

```bash
harbor run -d terminal-bench/terminal-bench-2 \
  --agent-import-path examples.terminal_bench.praisonai_external_agent:PraisonAIExternalAgent \
  --model openai/gpt-4o \
  --ae OPENAI_API_KEY=$OPENAI_API_KEY \
  -l 5 -n 2
```
Scale to higher concurrency using cloud providers:

```bash
harbor run -d terminal-bench/terminal-bench-2 \
  --agent-import-path examples.terminal_bench.praisonai_external_agent:PraisonAIExternalAgent \
  --model openai/gpt-4o \
  --env daytona -n 32 \
  --ae OPENAI_API_KEY=$OPENAI_API_KEY
```
Interpreting Results
Terminal-Bench uses binary scoring where each task either passes (1.0) or fails (0.0).
```text
# Example output
praisonai (gpt-4o) on terminal-bench-2
┏━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓
┃ Metric              ┃ Value             ┃
┡━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩
│ Agent               │ praisonai (gpt-4o)│
│ Dataset             │ terminal-bench-2  │
│ Trials              │ 89                │
│ Errors              │ 0                 │
│                     │                   │
│ Mean                │ 0.73              │
│                     │                   │
│ Reward Distribution │                   │
│ reward = 1.0        │ 65                │
│ reward = 0.0        │ 24                │
└─────────────────────┴───────────────────┘
```
| Score | Meaning |
|---|---|
| 1.0 | Task passed: verification script succeeded |
| 0.0 | Task failed: verification script failed or agent error |
| Mean | Overall success rate across all tasks |
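With binary scoring, the mean is simply the pass fraction. Checking it against the sample output above (65 passes, 24 failures, 89 trials):

```python
passes, fails = 65, 24  # reward distribution from the sample output
mean = passes / (passes + fails)
print(round(mean, 2))  # → 0.73
```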
Model Performance: `gpt-4o-mini` typically scores near 0.0 on hard tasks. Use `openai/gpt-4o` or `anthropic/claude-3-7-sonnet-20250219` for meaningful scores.
Example Output
Real benchmark session showing PraisonAI external agent results:
```bash
$ harbor run -d terminal-bench/terminal-bench-2 \
    --agent-import-path examples.terminal_bench.praisonai_external_agent:PraisonAIExternalAgent \
    --model openai/gpt-4o --ae OPENAI_API_KEY=$OPENAI_API_KEY -l 5 -n 2

5/5 Mean: 0.000 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0:10:18 0:00:00
Results written to /tmp/harbor_results/2026-04-12__03-00-00/result.json

praisonai (gpt-4o) on adhoc
┏━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓
┃ Metric              ┃ Value             ┃
┡━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩
│ Agent               │ praisonai (gpt-4o)│
│ Dataset             │ adhoc             │
│ Trials              │ 5                 │
│ Errors              │ 0                 │
│                     │                   │
│ Mean                │ 0.000             │
│                     │                   │
│ Reward Distribution │                   │
│ reward = 0.0        │ 5                 │
└─────────────────────┴───────────────────┘
```
Tasks included: Cython compilation, Bayesian network fitting, C source build, adaptive sampling, JavaScript filtering.
Best Practices
Always verify the benchmark works by testing with the oracle agent first:

```bash
harbor run -d terminal-bench/terminal-bench-2 -a oracle -l 1
```

This should achieve a perfect score (1.0) and confirms your setup is correct.
Choose models based on your goals:

- Testing integration: `openai/gpt-4o-mini` (fast, cheap, low scores)
- Real benchmarking: `openai/gpt-4o` or `anthropic/claude-3-7-sonnet-20250219`
- Cost optimization: start with 5-10 tasks before running the full benchmark
Terminal-Bench tasks can be resource intensive:

- Start with `-n 2` concurrency for testing
- Scale to `-n 8` for serious benchmarking
- Use cloud providers (Daytona, E2B, Modal) for `-n 32` and beyond
When tasks fail, examine the execution logs:

```bash
# Results are saved with timestamps
cat /tmp/harbor_results/2026-04-12__03-00-00/task-name/agent/output.txt
```

Common failure modes: timeouts, missing dependencies, incorrect file paths.
Troubleshooting
| Error | Solution |
|---|---|
| "Object of type coroutine is not JSON serializable" | Fixed in the current PraisonAI version; update to the latest release |
| Docker not found | Install Docker Desktop and ensure it is running |
| Harbor import error | Install Harbor: `pip install harbor` |
| API key not forwarded | Use the `--ae OPENAI_API_KEY=$OPENAI_API_KEY` flag |
| Permission denied in container | Ensure Docker has proper permissions |
The coroutine serialization error mentioned in early Terminal-Bench integration docs has been fixed in the current SDK version. If you encounter it, update PraisonAI to the latest version.
- Sandbox Execution: safe code execution in isolated environments
- Real API Testing: testing agents with real API integrations