Evals (AI Evaluation)
A structured framework for measuring AI agent and model outputs against quantified criteria and detecting regressions
#evals, #AI evaluation, #agent evaluation, #LLM evaluation, #benchmark, #regression detection
What are Evals?
Evals (Evaluations) are structured frameworks for measuring AI model or agent outputs against defined criteria. Beyond simple pass/fail testing, evals track each step of multi-step tasks and automatically detect quality regressions compared to prior versions.
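Regression detection can be sketched concretely. Assuming a simple per-case score structure (the function and field names here are illustrative, not part of any particular eval framework), comparing a new version's scores against a stored baseline might look like:

```python
# Minimal regression-detection sketch (illustrative, not a real framework):
# compare per-case eval scores of a new version against a stored baseline
# and flag cases whose score dropped by more than a tolerance.

def detect_regressions(baseline: dict[str, float],
                       current: dict[str, float],
                       tolerance: float = 0.05) -> list[str]:
    """Return case IDs whose score fell by more than `tolerance`."""
    regressed = []
    for case_id, old_score in baseline.items():
        new_score = current.get(case_id)
        if new_score is not None and old_score - new_score > tolerance:
            regressed.append(case_id)
    return regressed

baseline = {"summarize-001": 0.92, "extract-002": 0.88, "plan-003": 0.75}
current  = {"summarize-001": 0.93, "extract-002": 0.71, "plan-003": 0.74}
print(detect_regressions(baseline, current))  # ['extract-002']
```

The tolerance keeps normal run-to-run noise from triggering false alarms; only `extract-002`, which dropped by 0.17, is flagged.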
How do evals differ from regular tests?
Traditional software tests check whether an output matches a fixed expected value. AI evals require different approaches because outputs are not deterministic.
- Criteria-based scoring: Does the output satisfy specific conditions (format, required information, etc.)?
- LLM-as-a-Judge: A more capable model scores the output against a rubric
- Trajectory analysis: Evaluates not just the final answer but the reasoning path taken to reach it
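The first approach, criteria-based scoring, can be sketched in a few lines. Assuming each criterion is a predicate over the output text (the specific criteria below are made-up examples), the score is simply the fraction of criteria satisfied:

```python
# Criteria-based scoring sketch: each criterion is a predicate over the
# output; the score is the fraction of criteria that pass. The criteria
# shown (JSON validity, required field, length budget) are examples.
import json

def score_output(output: str, criteria: list) -> float:
    passed = sum(1 for check in criteria if check(output))
    return passed / len(criteria)

def is_valid_json(text: str) -> bool:
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

criteria = [
    is_valid_json,           # format: must parse as JSON
    lambda t: "total" in t,  # required information present
    lambda t: len(t) < 500,  # length budget
]

print(score_output('{"total": 42}', criteria))   # 1.0
print(score_output('not json at all', criteria)) # 1/3 of criteria pass
```

An LLM-as-a-Judge check fits the same shape: replace a predicate with a call that asks a stronger model to grade the output against a rubric and return pass/fail.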
Why do evals matter?
If an agent reaches the right answer via flawed logic, that is a reliability problem waiting to surface. Evals assess both output quality and reasoning soundness.
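A minimal trajectory check makes this concrete. Assuming the agent's trajectory is recorded as a list of step records (the step format and rubric below are assumptions for illustration), a rubric can require that a retrieval step precede the answer, so a lucky guess with a flawed path still fails:

```python
# Trajectory-analysis sketch (assumed step format): verify the path taken,
# not just the answer. The rubric here requires a "retrieve" step before
# the "answer" step, so a correct answer without retrieval still fails.

def evaluate_trajectory(steps: list[dict], final_answer: str, expected: str) -> dict:
    answer_ok = final_answer == expected
    seen_retrieval = False
    retrieved_before_answer = False
    for step in steps:
        if step["action"] == "retrieve":
            seen_retrieval = True
        if step["action"] == "answer":
            retrieved_before_answer = seen_retrieval
    return {"answer_ok": answer_ok,
            "trajectory_ok": retrieved_before_answer,
            "passed": answer_ok and retrieved_before_answer}

good  = [{"action": "retrieve"}, {"action": "answer"}]
lucky = [{"action": "answer"}]  # right answer, flawed path
print(evaluate_trajectory(good, "42", "42")["passed"])   # True
print(evaluate_trajectory(lucky, "42", "42")["passed"])  # False
```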
Related Terms
operations
LLM-as-a-Judge
An evaluation methodology where a capable LLM scores another model's or agent's outputs against a predefined rubric
operations
Minimum Viable Agent (MVA)
An agent design pattern that starts with the smallest possible scope, validating one core task first with single-input, single-output execution
operations
Verification Loop
An operational pattern that converges quality by repeatedly testing, reviewing, and retrying AI-generated outputs