LLM-as-a-Judge

LLM-as-a-Judge is an evaluation approach where a capable language model assesses the outputs of your LLM application. It combines the nuance of human judgment with the scalability of automated evaluation — you can score thousands of traces per hour without human reviewers.

How It Works

The judge model receives three inputs:

  1. Evaluation criteria — a rubric defining what good and bad look like for this dimension
  2. Input context — the original user query or input
  3. Output to evaluate — your application’s response

The judge returns a score (numeric, boolean, or categorical) and optionally an explanation.
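The flow above can be sketched in a few lines of Python. This is an illustrative skeleton, not the XeroML API: the prompt layout, the JSON response shape, and both helper functions are assumptions about how a judge call is typically structured.

```python
import json

# Hypothetical rubric for one evaluation dimension (input 1).
RUBRIC = "Score 1-5 for helpfulness. 5 = fully answers the question; 1 = unhelpful."


def build_judge_prompt(criteria: str, user_input: str, output: str) -> str:
    # Combine the three judge inputs into a single prompt.
    return (
        f"Evaluation criteria:\n{criteria}\n\n"
        f"Input context:\n{user_input}\n\n"
        f"Output to evaluate:\n{output}\n\n"
        'Respond as JSON: {"score": <int>, "explanation": "<why>"}'
    )


def parse_judge_response(raw: str) -> tuple[int, str]:
    # The judge returns a score and, optionally, an explanation.
    data = json.loads(raw)
    return data["score"], data.get("explanation", "")


prompt = build_judge_prompt(
    RUBRIC,
    "How do I reset my password?",
    "Click 'Forgot password' on the login page.",
)
# In practice the prompt is sent to the judge model; here we parse a canned reply.
score, why = parse_judge_response(
    '{"score": 5, "explanation": "Directly answers the question."}'
)
```

Returning structured JSON rather than free text makes the score machine-readable, which is what lets the platform aggregate results across thousands of observations.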

Key Advantages

Scalability
A well-configured LLM judge can process your entire trace history continuously, which is not feasible with human annotation alone.

Nuance
LLMs can assess qualities that simple metrics can’t: helpfulness, relevance, factual accuracy relative to a given context, tone adherence, and instruction following.

Repeatability
A fixed rubric keeps scoring consistent: the same trace evaluated twice with the same rubric and judge model typically receives the same score, which makes regression detection reliable.

Evaluation Targets

LLM-as-a-Judge evaluators in XeroML can run against:

Live production data (online)

  • Observations (recommended) — scores individual LLM calls or pipeline steps. Runs in seconds per observation.
  • Traces (legacy) — scores complete workflow executions.

Offline testing

  • Experiments — runs against your test datasets for pre-deployment validation.

Observation-level evaluators are preferred for production monitoring because they complete faster and are more granular.

Setup Requirements

To use LLM-as-a-Judge:

  1. Configure an LLM Connection in your project settings (supports OpenAI, Anthropic, Azure OpenAI, Google Vertex, and others)
  2. Create an Evaluator template — a prompt that instructs the judge model
  3. Configure variable mapping — which trace fields map to which prompt variables
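The three setup steps translate into a configuration roughly like the following. All field names and values here are illustrative assumptions, not the actual XeroML schema:

```python
# Hypothetical shape of an evaluator configuration covering the three setup
# steps. Field names ("llm_connection", "variable_mapping", the "trace.*"
# paths) are illustrative, not the real XeroML schema.
evaluator_config = {
    # Step 1: which configured LLM Connection acts as the judge.
    "llm_connection": {"provider": "openai", "model": "gpt-4o"},
    # Step 2: which evaluator template (built-in or custom) to use.
    "template": "helpfulness",
    # Step 3: which trace fields fill which prompt variables.
    "variable_mapping": {
        "user_message": "trace.input",
        "assistant_response": "trace.output",
    },
}
```

Keeping the mapping separate from the template lets one rubric be reused across applications whose traces store inputs and outputs under different fields.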

Evaluator Templates

XeroML provides built-in templates for common evaluation dimensions:

  • Helpfulness
  • Factual accuracy
  • Relevance to user intent
  • Toxicity / safety
  • Instruction following
  • Conciseness

You can also create fully custom templates for domain-specific quality criteria.
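A custom template might look like the sketch below. The `{{variable}}` placeholders correspond to the variable-mapping step; the domain, criteria text, and the `render` helper are all illustrative, not part of the XeroML SDK:

```python
# Illustrative custom rubric for a domain-specific criterion. The {{variable}}
# placeholders are filled from the evaluator's variable mapping; the render
# helper below is a sketch of that substitution, not the XeroML SDK.
TEMPLATE = """You are evaluating a customer-support chatbot.

Criteria: The response must only use information present in the retrieved
context, and must escalate to a human agent when the context is insufficient.

User question: {{user_message}}
Retrieved context: {{context}}
Response: {{assistant_response}}

Score 0 (violates criteria) or 1 (meets criteria), then explain briefly."""


def render(template: str, variables: dict[str, str]) -> str:
    # Substitute each {{name}} placeholder with its mapped trace value.
    for name, value in variables.items():
        template = template.replace("{{" + name + "}}", value)
    return template
```

A binary 0/1 rubric like this one is often easier for a judge to apply consistently than a fine-grained scale when the criterion is pass/fail in nature.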

Configuring an Evaluator

  1. Go to Evaluations → Evaluators in your XeroML project
  2. Click New Evaluator and select LLM-as-a-Judge
  3. Choose a template or write a custom rubric
  4. Map your trace fields to the template variables (e.g., map input to {{user_message}})
  5. Configure the trigger — run on all new observations, or filter by tag, environment, or sampling rate
  6. Select the judge model from your configured LLM Connections
  7. Save and activate

Once active, the evaluator runs automatically on new observations as they arrive.

Sampling

For high-traffic applications, configure a sampling rate (e.g., 10%) to reduce evaluation costs while still getting representative quality signals. Sampled evaluators score a random subset of matching observations.
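The sampling behavior amounts to an independent coin flip per matching observation. A minimal sketch, assuming a simple per-observation random draw (the actual sampling mechanism inside XeroML may differ):

```python
import random

def should_evaluate(sampling_rate: float, rng: random.Random) -> bool:
    # Score a random subset of matching observations at the configured rate.
    return rng.random() < sampling_rate

# With a 10% rate, roughly 1 in 10 observations is sent to the judge,
# cutting evaluation cost by ~90% while keeping a representative sample.
rng = random.Random(42)  # fixed seed so the sketch is reproducible
sampled = sum(should_evaluate(0.10, rng) for _ in range(10_000))
```

Per-observation sampling keeps the scored subset statistically representative; at high volumes even a 10% sample yields tight confidence intervals on aggregate quality scores.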

SDK Version Requirements

  • Python SDK v3 or later
  • TypeScript SDK v4 or later

Older SDK versions only support trace-level evaluators.