LLM-as-a-Judge

LLM-as-a-Judge is an evaluation approach where a capable language model assesses the outputs of your LLM application. It combines the nuance of human judgment with the scalability of automated evaluation — you can score thousands of traces per hour without human reviewers.

How It Works

The judge model receives three inputs:

  1. Evaluation criteria — a rubric defining what good and bad look like for this dimension
  2. Input context — the original user query or input
  3. Output to evaluate — your application’s response

The judge returns a score (numeric, boolean, or categorical) and optionally an explanation.
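The flow above can be sketched in a few lines of Python. This is an illustrative skeleton, not the XeroML API: the prompt layout, the JSON response shape, and both helper functions are assumptions about how a judge call is typically structured.

```python
import json

# Hypothetical rubric for one evaluation dimension (input 1).
RUBRIC = "Score 1-5 for helpfulness. 5 = fully answers the question; 1 = unhelpful."


def build_judge_prompt(criteria: str, user_input: str, output: str) -> str:
    # Combine the three judge inputs into a single prompt.
    return (
        f"Evaluation criteria:\n{criteria}\n\n"
        f"Input context:\n{user_input}\n\n"
        f"Output to evaluate:\n{output}\n\n"
        'Respond as JSON: {"score": <int>, "explanation": "<why>"}'
    )


def parse_judge_response(raw: str) -> tuple[int, str]:
    # The judge returns a score and, optionally, an explanation.
    data = json.loads(raw)
    return data["score"], data.get("explanation", "")


prompt = build_judge_prompt(
    RUBRIC,
    "How do I reset my password?",
    "Click 'Forgot password' on the login page.",
)
# In practice the prompt is sent to the judge model; here we parse a canned reply.
score, why = parse_judge_response(
    '{"score": 5, "explanation": "Directly answers the question."}'
)
```

Returning structured JSON rather than free text makes the score machine-readable, which is what lets the platform aggregate results across thousands of observations.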

Key Advantages

Scalability
A well-configured LLM judge can process your entire trace history continuously, which is not feasible with human annotation alone.

Nuance
LLMs can assess qualities that simple metrics can’t: helpfulness, relevance, factual accuracy relative to a given context, tone adherence, and instruction following.

Repeatability
A fixed rubric keeps scoring consistent: the same trace evaluated twice with the same rubric and judge model typically receives the same score, which makes regression detection reliable.

Evaluation Targets

LLM-as-a-Judge evaluators in XeroML can run against:

Live production data (online)

  • Observations (recommended) — scores individual LLM calls or pipeline steps. Runs in seconds per observation.
  • Traces (legacy) — scores complete workflow executions.

Offline testing

  • Experiments — runs against your test datasets for pre-deployment validation.

Observation-level evaluators are preferred for production monitoring because they complete faster and are more granular.

Setup Requirements

To use LLM-as-a-Judge:

  1. Configure an LLM Connection in your project settings (supports OpenAI, Anthropic, Azure OpenAI, Google Vertex, and others)
  2. Create an Evaluator template — a prompt that instructs the judge model
  3. Configure variable mapping — which trace fields map to which prompt variables
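The three setup steps translate into a configuration roughly like the following. All field names and values here are illustrative assumptions, not the actual XeroML schema:

```python
# Hypothetical shape of an evaluator configuration covering the three setup
# steps. Field names ("llm_connection", "variable_mapping", the "trace.*"
# paths) are illustrative, not the real XeroML schema.
evaluator_config = {
    # Step 1: which configured LLM Connection acts as the judge.
    "llm_connection": {"provider": "openai", "model": "gpt-4o"},
    # Step 2: which evaluator template (built-in or custom) to use.
    "template": "helpfulness",
    # Step 3: which trace fields fill which prompt variables.
    "variable_mapping": {
        "user_message": "trace.input",
        "assistant_response": "trace.output",
    },
}
```

Keeping the mapping separate from the template lets one rubric be reused across applications whose traces store inputs and outputs under different fields.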

Evaluator Templates

XeroML provides built-in templates for common evaluation dimensions:

  • Helpfulness
  • Factual accuracy
  • Relevance to user intent
  • Toxicity / safety
  • Instruction following
  • Conciseness

You can also create fully custom templates for domain-specific quality criteria.
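A custom template might look like the sketch below. The `{{variable}}` placeholders correspond to the variable-mapping step; the domain, criteria text, and the `render` helper are all illustrative, not part of the XeroML SDK:

```python
# Illustrative custom rubric for a domain-specific criterion. The {{variable}}
# placeholders are filled from the evaluator's variable mapping; the render
# helper below is a sketch of that substitution, not the XeroML SDK.
TEMPLATE = """You are evaluating a customer-support chatbot.

Criteria: The response must only use information present in the retrieved
context, and must escalate to a human agent when the context is insufficient.

User question: {{user_message}}
Retrieved context: {{context}}
Response: {{assistant_response}}

Score 0 (violates criteria) or 1 (meets criteria), then explain briefly."""


def render(template: str, variables: dict[str, str]) -> str:
    # Substitute each {{name}} placeholder with its mapped trace value.
    for name, value in variables.items():
        template = template.replace("{{" + name + "}}", value)
    return template
```

A binary 0/1 rubric like this one is often easier for a judge to apply consistently than a fine-grained scale when the criterion is pass/fail in nature.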

Configuring an Evaluator

  1. Go to Evaluations → Evaluators in your XeroML project
  2. Click New Evaluator and select LLM-as-a-Judge
  3. Choose a template or write a custom rubric
  4. Map your trace fields to the template variables (e.g., map input to {{user_message}})
  5. Configure the trigger — run on all new observations, or filter by tag, environment, or sampling rate
  6. Select the judge model from your configured LLM Connections
  7. Save and activate

Once active, the evaluator runs automatically on new observations as they arrive.

Sampling

For high-traffic applications, configure a sampling rate (e.g., 10%) to reduce evaluation costs while still getting representative quality signals. Sampled evaluators score a random subset of matching observations.
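The sampling behavior amounts to an independent coin flip per matching observation. A minimal sketch, assuming a simple per-observation random draw (the actual sampling mechanism inside XeroML may differ):

```python
import random

def should_evaluate(sampling_rate: float, rng: random.Random) -> bool:
    # Score a random subset of matching observations at the configured rate.
    return rng.random() < sampling_rate

# With a 10% rate, roughly 1 in 10 observations is sent to the judge,
# cutting evaluation cost by ~90% while keeping a representative sample.
rng = random.Random(42)  # fixed seed so the sketch is reproducible
sampled = sum(should_evaluate(0.10, rng) for _ in range(10_000))
```

Per-observation sampling keeps the scored subset statistically representative; at high volumes even a 10% sample yields tight confidence intervals on aggregate quality scores.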

SDK Version Requirements

  • Python SDK v3 or later
  • TypeScript SDK v4 or later

Older SDK versions only support trace-level evaluators.