SWE-bench
A software engineering benchmark that measures whether a model can fix real GitHub issues
Tags: SWE-bench, SWE-bench Verified, SWE-Bench Pro, coding benchmark
What is SWE-bench?
SWE-bench evaluates whether a model can resolve real GitHub issues from open-source repositories. Instead of abstract coding quizzes, it tests repository understanding, patch generation, and whether the resulting code actually passes the project's tests.
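To make the task concrete, here is an illustrative sketch of what one task instance contains; the field names and values are simplified for this article, and the published dataset defines the exact schema.

```python
# An illustrative SWE-bench-style task instance. Field names and values
# are simplified for illustration; the published dataset defines the
# exact schema.
task = {
    "repo": "astropy/astropy",    # a real open-source repository
    "base_commit": "abc123",      # the snapshot the model must work from
    "problem_statement": "FITS reader crashes on an empty header ...",
    # Tests that fail before the fix and must pass after it:
    "fail_to_pass": ["astropy/io/fits/tests/test_header.py::test_empty"],
}
```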
How is it measured?
A model receives the issue context and the repository at a fixed commit, generates a patch, and is scored on whether the patch applies and the associated tests pass. This makes SWE-bench closer to practical software maintenance than syntax-only evaluation.
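Below is a minimal sketch of that scoring step, assuming a checked-out repository and a candidate diff. The function name, paths, and test command are illustrative, not the official SWE-bench harness, which pins commits and runs curated fail-to-pass and pass-to-pass tests in isolated environments.

```python
import subprocess

def evaluate_patch(repo_dir: str, patch_file: str, test_cmd: list[str]) -> bool:
    """Apply a model-generated patch and report whether the tests pass.

    A simplified stand-in for SWE-bench-style scoring: a submission only
    counts as resolved if the patch applies cleanly AND the associated
    tests succeed.
    """
    # Step 1: apply the candidate patch to the repository checkout.
    applied = subprocess.run(
        ["git", "apply", patch_file], cwd=repo_dir, capture_output=True
    )
    if applied.returncode != 0:
        return False  # The patch does not even apply; automatic failure.

    # Step 2: run the associated tests; exit code 0 means they pass.
    tests = subprocess.run(test_cmd, cwd=repo_dir, capture_output=True)
    return tests.returncode == 0

# Illustrative usage (paths and test selection are hypothetical):
# resolved = evaluate_patch(
#     repo_dir="/tmp/astropy",
#     patch_file="model_patch.diff",
#     test_cmd=["pytest", "astropy/io/fits/tests/test_header.py"],
# )
```

The key property this sketch preserves is binary, execution-based grading: there is no partial credit for a plausible-looking diff, which is what separates SWE-bench from text-similarity metrics.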
Why does it matter?
In production coding workflows, "looks correct" is not enough. Teams need fixes that actually run and pass tests. SWE-bench helps compare that capability.
Related terms
AGI (Artificial General Intelligence)
A hypothetical AI system capable of performing any intellectual task a human can
AI Agent
An autonomous AI system that can plan, use tools, and take actions to achieve goals
Attention
A mechanism that allows AI models to focus on the most relevant parts of the input when producing output
BigLaw Bench
A benchmark for legal-task performance, focusing on document interpretation and reasoning consistency
Chain-of-Thought Elicitation
A prompting method that asks a model to reveal intermediate reasoning steps before the final answer
Chunk
A text segment created by splitting long documents into meaningful units for retrieval and generation