What is Terminal-Bench?

Terminal-Bench measures how reliably a model can complete chained tasks in a terminal: running commands, editing files, recovering from errors, and finishing workflows end to end.

What does it evaluate?

It focuses on task completion, instruction adherence, and error recovery across multiple steps, not just single-turn answer quality.
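To make the distinction concrete, here is a minimal sketch of why end-to-end completion and per-step success are different numbers. It is not Terminal-Bench's actual harness or scoring code; the `TaskRun` record, the task names, and both rate functions are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TaskRun:
    """Hypothetical record of one agent attempt at a multi-step terminal task."""
    name: str
    step_ok: List[bool]    # whether each individual command or edit succeeded
    final_check_ok: bool   # whether the end state passed the task's check


def step_success_rate(runs: List[TaskRun]) -> float:
    """Fraction of individual steps that succeeded, pooled over all runs."""
    steps = [ok for run in runs for ok in run.step_ok]
    return sum(steps) / len(steps)


def task_completion_rate(runs: List[TaskRun]) -> float:
    """Fraction of tasks whose final state passed, regardless of the path taken."""
    return sum(run.final_check_ok for run in runs) / len(runs)


# Illustrative data only, not real benchmark results.
runs = [
    TaskRun("fix-build", [True, True, True], True),
    # One step failed, but the agent recovered and still finished the task.
    TaskRun("migrate-config", [True, False, True], True),
    # The last step failed and the task was left incomplete.
    TaskRun("rotate-logs", [True, True, False], False),
]

print(f"step success rate:    {step_success_rate(runs):.2f}")
print(f"task completion rate: {task_completion_rate(runs):.2f}")
```

In this toy data the second run shows error recovery (a failed step, but a passing end state), while the third shows how a single late failure sinks the whole task even though most of its steps succeeded.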

Why does it matter?

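Coding agents and operational automation depend on every step of a workflow succeeding, because one failed command or bad edit can derail everything that follows. Terminal-Bench is a practical signal of how consistently a model holds up across those steps.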

Related terms