OpenThoughts-TBLite Benchmark

Leaderboard Comparison

OpenThoughts-TBLite scores compared to published results

Note: Published leaderboard models were evaluated with the terminus-2 agent harness, whereas our Composer 2 result uses the cursor-cli agent. Because the choice of agent harness can affect scores, direct comparisons should be interpreted with caution.

Score Distribution

Breakdown of task outcomes across 100 tasks

Agent Execution Time

Time spent by Composer 2 solving each task

Task-by-Task Results

All 100 tasks with scores, status, and execution time

Methodology

Caveats
  • Single run per task (no multi-attempt averaging); results may vary with repeated runs
  • Local Docker execution; networking and resource constraints may differ from the Daytona cloud sandboxes used for the official leaderboard
  • Uses the cursor-cli agent harness, not the terminus-2 harness used for published leaderboard scores
  • 5 tasks timed out and are excluded from the mean score, which may slightly inflate the reported result
  • The oracle baseline achieves a mean of only 62.2%, because 26 oracle solutions are missing and some oracle scripts are buggy
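
To make the timeout caveat concrete, the sketch below compares two ways of averaging: dropping timed-out tasks from the mean (as described above) versus counting them as zero. The record layout (the "status" and "score" fields), the function names, and the five sample records are all hypothetical and are not taken from the actual benchmark output.

```python
from statistics import mean

def mean_excluding_timeouts(results):
    """Mean score over tasks that did not time out (timeouts are dropped)."""
    scores = [r["score"] for r in results if r["status"] != "timeout"]
    return mean(scores) if scores else 0.0

def mean_counting_timeouts_as_zero(results):
    """Mean score over all tasks, scoring each timed-out task as 0."""
    scores = [0.0 if r["status"] == "timeout" else r["score"] for r in results]
    return mean(scores) if scores else 0.0

# Hypothetical records: 3 passed, 1 failed, 1 timed out (illustrative only).
sample_results = [
    {"task": "task-a", "status": "passed", "score": 1.0},
    {"task": "task-b", "status": "passed", "score": 1.0},
    {"task": "task-c", "status": "passed", "score": 1.0},
    {"task": "task-d", "status": "failed", "score": 0.0},
    {"task": "task-e", "status": "timeout", "score": None},
]

excluded = mean_excluding_timeouts(sample_results)        # 3 / 4
as_zero = mean_counting_timeouts_as_zero(sample_results)  # 3 / 5
print(f"excluding timeouts: {excluded:.2f}, timeouts as zero: {as_zero:.2f}")
```

With n tasks and k timeouts, the two figures are related by mean_as_zero = mean_excluding × (n − k) / n, so for 100 tasks and 5 timeouts the zero-counted mean would be 0.95 times the mean that excludes them.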