Benchmarks, Accountability, and What Matters
Our Terminal Bench submission did not meet the standard we set for ourselves. We made real improvements to our agent harness for the benchmark, but we also resorted to methods that compromised our results. The methodology was wrong, and we take full accountability.
Benchmarks have become a huge focus for our industry, driving launch posts and informing which agents get adopted. At the same time, benchmaxxing is rampant. Several of the highest-ranked submissions on Terminal Bench today actively inject task-specific guidance, cherry-pick trials, and refuse to publish trajectories. We anticipate many more submissions will be removed soon, but the deeper issue is systemic. We got caught up in the race, and that was a mistake.
This is a turning point for us, and maybe for others, to focus on what matters outside of benchmarks: building a product people love. We’ve built an incredible, high-caliber team that has spent the last six months heads-down building a frontier agent, and the work speaks for itself:
- Cloud sandboxes that run your code in isolation
- Auto-generated skills and hooks based on your past sessions
- Fine-tuned subagent models purpose-built for subtasks
- Session sharing so your team can pick up where you left off
- Hands-off mode with built-in safety controls
- Support for 300+ models
- PM Mode for planning specs
- and much more
We’re committed to doing better going forward, which means focusing on transparency and verifiability. It’s been an important week to reflect, but it’s time to get back to building.
— Daljeet & Tejpal