Leaderboard Comparison
OpenThoughts-TBLite scores compared to published results
Note: Published leaderboard models were evaluated using the terminus-2 agent harness.
Our Composer 2 result uses the cursor-cli agent.
Different agent harnesses may affect scores, so direct comparison should be interpreted with caution.
Score Distribution
Breakdown of task outcomes across 100 tasks
Agent Execution Time
Time spent by Composer 2 solving each task
Task-by-Task Results
All 100 tasks with scores, status, and execution time
Methodology
Caveats
- Single run per task (no multi-attempt averaging); results may vary with repeated runs
- Local Docker execution; networking and resource constraints may differ from the Daytona cloud sandboxes used for the official leaderboard
- Uses the cursor-cli agent harness, not the terminus-2 harness used for published leaderboard scores
- 5 tasks timed out and are excluded from the mean score, which may slightly inflate the reported result
- Oracle baseline achieves only 62.2% mean due to 26 missing oracle solutions and some buggy oracle scripts
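The effect of the timeout caveat above can be illustrated with a minimal sketch. It compares the mean over completed tasks only with a stricter mean that counts every timed-out task as a score of 0. The `summarize` helper and the four-task example data are hypothetical and are not taken from the actual run; the sketch only assumes that each task has a score in [0, 1] and a timeout flag.

```python
from statistics import mean

def summarize(results):
    """Return (mean excluding timeouts, mean counting timeouts as 0).

    results: list of (score, timed_out) pairs; score is ignored
    when timed_out is True. Hypothetical helper for illustration.
    """
    completed = [score for score, timed_out in results if not timed_out]
    # Mean over tasks that finished: timed-out tasks are dropped entirely.
    mean_excluding = mean(completed) if completed else 0.0
    # Stricter variant: every timed-out task contributes a score of 0.
    mean_as_zero = sum(completed) / len(results) if results else 0.0
    return mean_excluding, mean_as_zero

# Hypothetical data, NOT the real results: 4 tasks, one of which timed out.
example = [(1.0, False), (0.0, False), (1.0, False), (None, True)]
excl, zero = summarize(example)
print(f"excluding timeouts: {excl:.3f}, timeouts as failures: {zero:.3f}")
```

Since scores are non-negative, the first value can never be lower than the second, and the gap between the two grows with the share of tasks that timed out.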