LLM-as-a-Judge
An evaluation methodology where a capable LLM scores another model's or agent's outputs against a predefined rubric
#LLM-as-a-Judge #LLM evaluator #AI evaluation #evals #rubric #automated evaluation
What is LLM-as-a-Judge?
LLM-as-a-Judge is an evaluation methodology where a powerful language model — such as GPT-4o or Claude — scores the outputs of another model or agent against a predefined rubric. Instead of human reviewers, the model serves as the evaluator.
Why is it used?
AI agent outputs often have no fixed correct answer, making simple keyword-matching approaches insufficient. LLM-as-a-Judge evaluates open-ended outputs against flexible criteria, making it well-suited for large-scale automated evaluation pipelines.
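To make the workflow concrete, here is a minimal sketch of one evaluation step: a rubric is turned into a grading prompt for the judge model, and the judge's reply is parsed into per-criterion scores. The judge call itself is left out (`reply` stands in for whatever a chat API would return); the rubric criteria and the 1-5 scale are illustrative assumptions, not a fixed standard.

```python
import json

# Illustrative rubric; real rubrics should be specific to the task being judged.
RUBRIC = {
    "accuracy": "Is the answer factually correct?",
    "relevance": "Does the answer address the question asked?",
    "clarity": "Is the answer easy to follow?",
}

def build_judge_prompt(question: str, answer: str, rubric: dict) -> str:
    """Assemble a grading prompt that asks the judge for JSON scores per criterion."""
    criteria = "\n".join(f"- {name}: {desc}" for name, desc in rubric.items())
    return (
        "You are an impartial evaluator. Score the answer on each criterion "
        "from 1 (poor) to 5 (excellent).\n"
        f"Criteria:\n{criteria}\n\n"
        f"Question: {question}\nAnswer: {answer}\n\n"
        'Reply with JSON only, e.g. {"accuracy": 4, "relevance": 5, "clarity": 3}.'
    )

def parse_scores(raw_reply: str, rubric: dict) -> dict:
    """Parse the judge's JSON reply and validate it against the rubric."""
    scores = json.loads(raw_reply)
    for name in rubric:
        value = scores.get(name)
        if not isinstance(value, int) or not 1 <= value <= 5:
            raise ValueError(f"invalid score for {name!r}: {value!r}")
    return scores

# In a real pipeline, `reply` would come from sending the prompt to a judge model.
reply = '{"accuracy": 5, "relevance": 4, "clarity": 5}'
scores = parse_scores(reply, RUBRIC)
average = sum(scores.values()) / len(scores)
```

Requesting structured JSON and validating it before use keeps one malformed judge reply from silently corrupting aggregate metrics, which matters once this runs over thousands of outputs.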
Limitations to watch for
- Bias: The evaluating model may score outputs that resemble its own style more favorably.
- Rubric quality: Vague or poorly defined criteria produce unreliable scores.
- Cost: Every evaluation requires an LLM call, which compounds at scale.
Related Terms
operations
Evals (AI Evaluation)
A structured framework for measuring AI agent and model outputs against quantified criteria and detecting regressions
operations
Minimum Viable Agent (MVA)
The smallest possible agent design, which validates one core task first with single-input, single-output execution
operations
Verification Loop
An operational pattern that converges quality by repeatedly testing, reviewing, and retrying AI-generated outputs