LLM-as-a-Judge
LLM-as-a-Judge is an evaluation approach where a capable language model assesses the outputs of your LLM application. It combines the nuance of human judgment with the scalability of automated evaluation — you can score thousands of traces per hour without human reviewers.
How It Works
The judge model receives three inputs:
- Evaluation criteria — a rubric defining what good and bad look like for this dimension
- Input context — the original user query or input
- Output to evaluate — your application’s response
The judge returns a score (numeric, boolean, or categorical) and optionally an explanation.
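The flow above can be sketched in a few lines of plain Python. This is a generic illustration of prompt assembly and score parsing, not XeroML's internal implementation; the JSON reply format and the function names are assumptions, and the actual model call is omitted:

```python
import json

def build_judge_prompt(rubric: str, user_input: str, output: str) -> str:
    """Assemble the three judge inputs into a single evaluation prompt."""
    return (
        "You are an evaluation judge.\n\n"
        f"Rubric:\n{rubric}\n\n"
        f"User input:\n{user_input}\n\n"
        f"Output to evaluate:\n{output}\n\n"
        'Reply with JSON: {"score": <0 to 1>, "explanation": "<why>"}'
    )

def parse_judge_reply(reply: str) -> tuple[float, str]:
    """Extract the numeric score and optional explanation from the judge's JSON reply."""
    data = json.loads(reply)
    return float(data["score"]), data.get("explanation", "")

# Parsing an example judge reply (the LLM call itself is omitted here)
score, why = parse_judge_reply('{"score": 0.8, "explanation": "Mostly accurate."}')
```

Requesting a structured (JSON) reply from the judge is what makes the explanation optional and the score machine-readable.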
Key Advantages
**Scalability**: A well-configured LLM judge can process your entire trace history continuously. This is not feasible with human annotation alone.
**Nuance**: LLMs can assess qualities that simple metrics can't: helpfulness, relevance, factual accuracy relative to a given context, tone adherence, and instruction following.
**Repeatability**: A fixed rubric and a deterministic judge configuration produce consistent scoring. The same trace evaluated twice with the same rubric typically receives the same score, making regression detection reliable.
Evaluation Targets
LLM-as-a-Judge evaluators in XeroML can run against:
Live production data (online)
- Observations (recommended) — scores individual LLM calls or pipeline steps. Runs in seconds per observation.
- Traces (legacy) — scores complete workflow executions.
Offline testing
- Experiments — runs against your test datasets for pre-deployment validation.
Observation-level evaluators are preferred for production monitoring because they complete faster and are more granular.
Setup Requirements
To use LLM-as-a-Judge:
- Configure an LLM Connection in your project settings (supports OpenAI, Anthropic, Azure OpenAI, Google Vertex, and others)
- Create an Evaluator template — a prompt that instructs the judge model
- Configure variable mapping — which trace fields map to which prompt variables
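Variable mapping is just substitution of trace fields into the template's `{{variable}}` slots. A minimal sketch of that mechanic, assuming a mapping of prompt-variable name to trace-field name (the exact mapping shape in XeroML may differ):

```python
import re

def render_template(template: str, trace: dict, mapping: dict) -> str:
    """Fill {{variable}} slots in an evaluator template from trace fields."""
    # mapping: prompt variable name -> trace field name (illustrative shape)
    values = {var: str(trace[field]) for var, field in mapping.items()}
    return re.sub(r"\{\{(\w+)\}\}", lambda m: values[m.group(1)], template)

trace = {"input": "What is RAG?", "output": "Retrieval-augmented generation..."}
mapping = {"user_message": "input", "response": "output"}
prompt = render_template("Q: {{user_message}}\nA: {{response}}", trace, mapping)
```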
Evaluator Templates
XeroML provides built-in templates for common evaluation dimensions:
- Helpfulness
- Factual accuracy
- Relevance to user intent
- Toxicity / safety
- Instruction following
- Conciseness
You can also create fully custom templates for domain-specific quality criteria.
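For example, a custom rubric for a domain-specific dimension might look like the following. The variable names `{{user_message}}` and `{{response}}` and the JSON reply format are illustrative, not a fixed XeroML schema:

```
You are evaluating a customer-support reply for refund-policy compliance.

Rubric:
- Score 1 if the reply follows the refund policy and cites it when relevant.
- Score 0 if the reply promises refunds outside policy or invents policy terms.

User message: {{user_message}}
Reply to evaluate: {{response}}

Return JSON: {"score": 0 or 1, "explanation": "<one sentence>"}
```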
Configuring an Evaluator
- Go to Evaluations → Evaluators in your XeroML project
- Click New Evaluator and select LLM-as-a-Judge
- Choose a template or write a custom rubric
- Map your trace fields to the template variables (e.g., map `input` to `{{user_message}}`)
- Configure the trigger: run on all new observations, or filter by tag, environment, or sampling rate
- Select the judge model from your configured LLM Connections
- Save and activate
Once active, the evaluator runs automatically on new observations as they arrive.
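Taken together, an evaluator definition covers a rubric, a variable mapping, a trigger, and a judge model. As a mental model only, this JSON shape is illustrative and is not XeroML's actual configuration schema:

```json
{
  "name": "helpfulness-judge",
  "type": "llm-as-a-judge",
  "template": "helpfulness",
  "variable_mapping": { "user_message": "input", "response": "output" },
  "trigger": { "target": "observations", "environment": "production", "sampling_rate": 0.1 },
  "judge_model": "my-openai-connection/gpt-4o"
}
```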
Sampling
For high-traffic applications, configure a sampling rate (e.g., 10%) to reduce evaluation costs while still getting representative quality signals. Sampled evaluators score a random subset of matching observations.
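Per-observation random sampling at a fixed rate can be sketched as follows; the function name is illustrative, but the coin-flip mechanic is the standard approach:

```python
import random

def should_evaluate(sampling_rate: float, rng: random.Random) -> bool:
    """Decide whether a given observation is scored, at the configured rate."""
    return rng.random() < sampling_rate

# At a 10% rate, roughly one in ten matching observations is scored
rng = random.Random(42)
sampled = sum(should_evaluate(0.10, rng) for _ in range(10_000))
```

Because each observation is sampled independently, the scored subset stays representative of overall traffic rather than clustering on particular users or time windows.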
SDK Version Requirements
- Python SDK v3 or later
- TypeScript SDK v4 or later
Older SDK versions only support trace-level evaluators.