
LLM-as-a-Judge

An evaluation methodology where a capable LLM scores another model's or agent's outputs against a predefined rubric

Tags: LLM-as-a-Judge, LLM evaluator, AI evaluation, evals, rubric, automated evaluation

What is LLM-as-a-Judge?

LLM-as-a-Judge is an evaluation methodology where a powerful language model — such as GPT-4o or Claude — scores the outputs of another model or agent against a predefined rubric. Instead of human reviewers, the model serves as the evaluator.

Why is it used?

AI agent outputs often have no single correct answer, so simple keyword matching falls short. LLM-as-a-Judge scores open-ended outputs against flexible criteria, which makes it well suited to large-scale automated evaluation pipelines.
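The core loop is simple: assemble a prompt containing the rubric and the candidate output, send it to the judge model, and parse a structured score from the reply. A minimal sketch in Python, where `call_judge` is a hypothetical stand-in for a real LLM API call (stubbed here with a canned reply so the example is self-contained):

```python
import json

# Rubric given verbatim to the judge model; criteria and scale are assumptions
# for illustration — real rubrics should be tailored to the task.
RUBRIC = """Score the candidate answer from 1 to 5 on each criterion:
- accuracy: factually correct with respect to the question
- completeness: addresses all parts of the question
Return JSON only: {"accuracy": <int>, "completeness": <int>}"""


def build_judge_prompt(question: str, answer: str) -> str:
    """Assemble the evaluation prompt sent to the judge model."""
    return (
        f"{RUBRIC}\n\n"
        f"Question:\n{question}\n\n"
        f"Candidate answer:\n{answer}"
    )


def parse_scores(judge_reply: str) -> dict:
    """Extract and validate the rubric scores from the judge's JSON reply."""
    scores = json.loads(judge_reply)
    for criterion, value in scores.items():
        if not 1 <= int(value) <= 5:
            raise ValueError(f"{criterion} score out of range: {value}")
    return scores


def call_judge(prompt: str) -> str:
    # Hypothetical: in a real pipeline this would call an LLM API
    # (e.g. GPT-4o or Claude). Stubbed with a fixed reply for the sketch.
    return '{"accuracy": 4, "completeness": 5}'


prompt = build_judge_prompt("What causes tides?", "The Moon's gravity.")
scores = parse_scores(call_judge(prompt))
```

Requesting JSON output and validating the score range keeps the pipeline robust when the judge model occasionally replies in an unexpected format.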

Limitations to watch for

  • Bias: The evaluating model may score outputs that resemble its own style more favorably.
  • Rubric quality: Vague or poorly defined criteria produce unreliable scores.
  • Cost: Every evaluation requires an LLM call, which compounds at scale.

Related Terms