Natural Language Processing

Terminal-Bench

An agent-style benchmark that evaluates multi-step execution in terminal environments

Tags: Terminal-Bench, terminal bench, agent benchmark, terminal task evaluation

What is Terminal-Bench?

Terminal-Bench measures how reliably a model can complete chained tasks in a terminal: running commands, editing files, recovering from errors, and finishing workflows end to end.

What does it evaluate?

It focuses on task completion, instruction adherence, and error recovery across multiple steps, not just single-turn answer quality.
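As a rough illustration of outcome-based, multi-step scoring, the sketch below runs a scripted sequence of shell commands in a scratch directory and judges only the final state. It is not Terminal-Bench's actual harness: `Task`, `run_task`, `resolution_rate`, and the two sample tasks are hypothetical names invented for this example, and the fixed command lists stand in for whatever an agent would choose to run.

```python
import subprocess
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, List


@dataclass
class Task:
    name: str
    commands: List[str]            # stand-in for the commands an agent would issue
    check: Callable[[Path], bool]  # inspects the final state of the working directory


def run_task(task: Task) -> bool:
    """Run every command in a fresh directory, then judge the final state only."""
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp)
        for cmd in task.commands:
            # A failing step does not end the episode; a later step may recover.
            subprocess.run(cmd, shell=True, cwd=workdir,
                           capture_output=True, timeout=30)
        return task.check(workdir)


def resolution_rate(tasks: List[Task]) -> float:
    """Fraction of tasks whose final check passes."""
    results = [run_task(t) for t in tasks]
    return sum(results) / len(results)


def status_is_ok(workdir: Path) -> bool:
    status = workdir / "build" / "status.txt"
    return status.is_file() and status.read_text().strip() == "ok"


# Hypothetical tasks: the first hits an error (missing directory) and recovers,
# the second stops after the same error.
TASKS = [
    Task("recovers",
         ["echo ok > build/status.txt", "mkdir build", "echo ok > build/status.txt"],
         status_is_ok),
    Task("gives_up", ["echo ok > build/status.txt"], status_is_ok),
]

if __name__ == "__main__":
    print(f"resolved {resolution_rate(TASKS):.0%} of tasks")
```

Scoring the end state rather than each command is what lets error recovery count: the first task still passes even though its opening command fails.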

Why does it matter?

For coding agents and operational automation, one failed step can derail an entire workflow, so consistency across execution steps is critical. Terminal-Bench offers a practical signal of that capability.
