Evaluation

Evaluation replaces guesswork with data. Instead of asking “does this prompt feel better?” you can run systematic checks that give you a repeatable, objective measure of your application’s behavior.

XeroML supports evaluation at two stages of the development lifecycle:

Offline evaluation — test your application against a fixed dataset before deploying a change. Run a new prompt version against your test cases, review the scores, iterate until results look good, then deploy.

Online evaluation — score live production traces as they arrive. Catch regressions in real time, build ground truth from production edge cases, and feed findings back into your offline datasets.

These two loops reinforce each other: production traces reveal edge cases your dataset didn’t cover; adding them to your dataset makes the next round of offline testing more robust.
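The offline loop above can be sketched in a few lines. This is a minimal, hypothetical example (none of these function names come from the XeroML SDK): each test case pairs an input with an expected output, a scorer maps (output, expected) to a number, and the loop reports the mean score for a prompt version.

```python
def exact_match(output: str, expected: str) -> float:
    """Deterministic scorer: 1.0 on an exact match, else 0.0."""
    return 1.0 if output.strip() == expected.strip() else 0.0

def run_prompt(prompt_version: str, case_input: str) -> str:
    """Stand-in for a model call; replace with your LLM client."""
    return case_input.upper()  # placeholder behavior for the sketch

def evaluate(prompt_version: str, dataset: list[dict]) -> float:
    """Run every test case, score it, and return the mean score."""
    scores = [
        exact_match(run_prompt(prompt_version, case["input"]), case["expected"])
        for case in dataset
    ]
    return sum(scores) / len(scores)

dataset = [
    {"input": "hello", "expected": "HELLO"},
    {"input": "world", "expected": "world"},
]
print(evaluate("v2-prompt", dataset))  # → 0.5 (one of two cases passes)
```

The key property is repeatability: the same dataset and scorer give the same number for every prompt version, so scores are comparable across iterations.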

Core Concepts

Core Concepts — understand scores, datasets, experiments, and the evaluation loop

Evaluation Methods

| Method | Best for |
|---|---|
| LLM-as-a-Judge | Scalable automated scoring of nuanced qualities (helpfulness, accuracy, tone) |
| Annotation Queues | Structured human review workflows for ground truth creation |
| Custom Scores | Programmatic scoring via API/SDK — deterministic checks, user feedback, external signals |
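To make LLM-as-a-Judge concrete, here is a hedged sketch of the pattern: prompt a judge model for a rating, parse and validate its reply, and normalize it into a score. The judge call is stubbed (`call_judge` is not a XeroML API; swap it for your actual LLM client).

```python
JUDGE_PROMPT = (
    "Rate the helpfulness of this answer from 1 to 5.\n"
    "Question: {question}\nAnswer: {answer}\n"
    "Reply with a single integer."
)

def call_judge(prompt: str) -> str:
    """Stand-in for an LLM call; returns a canned rating in this sketch."""
    return "4"

def judge_helpfulness(question: str, answer: str) -> float:
    """Ask the judge for a 1–5 rating and normalize it to 0.0–1.0."""
    raw = call_judge(JUDGE_PROMPT.format(question=question, answer=answer))
    rating = int(raw.strip())
    if not 1 <= rating <= 5:
        raise ValueError(f"judge returned out-of-range rating: {rating}")
    return (rating - 1) / 4

print(judge_helpfulness("What is 2+2?", "4"))  # → 0.75
```

Validating and normalizing the judge's raw reply matters in practice: judges occasionally return prose or out-of-range values, and a strict parse keeps bad replies from silently polluting your scores.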

Experiments

Run controlled experiments to compare prompt versions, models, or pipeline configurations:

| Resource | Description |
|---|---|
| Datasets | Create and manage test case collections |
| Experiments via SDK | Run experiments programmatically against XeroML or local datasets |
| Prompt Experiments | Run prompt variant experiments directly from the UI |
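A programmatic experiment boils down to running each variant over the same dataset and comparing aggregate scores. The sketch below uses illustrative names, not the real SDK: prompt variants are plain templates, and the scorer is a simple substring check standing in for whatever scoring method you configure.

```python
def run_variant(template: str, case_input: str) -> str:
    """Stand-in for rendering a prompt variant and calling the model."""
    return template.format(input=case_input)

def substring_score(output: str, expected: str) -> float:
    """Deterministic scorer: did the expected string appear in the output?"""
    return 1.0 if expected in output else 0.0

def experiment(variants: dict[str, str], dataset: list[dict]) -> dict[str, float]:
    """Score every variant on every case; return mean score per variant."""
    results = {}
    for name, template in variants.items():
        scores = [
            substring_score(run_variant(template, case["input"]), case["expected"])
            for case in dataset
        ]
        results[name] = sum(scores) / len(scores)
    return results

variants = {
    "baseline": "Answer briefly: {input}",
    "v2": "Answer briefly: {input} (cite sources)",
}
dataset = [{"input": "capital of France", "expected": "France"}]
print(experiment(variants, dataset))
```

Because every variant sees the identical dataset and scorer, the per-variant means are directly comparable, which is what makes the run a controlled experiment rather than two unrelated test runs.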

Getting Started

The recommended sequence:

  1. Read Core Concepts to understand scores and datasets
  2. Create a dataset from production traces or from scratch
  3. Run your first experiment with the current prompt version as a baseline
  4. Set up a live evaluator to score production traces automatically

Once you have baseline scores, every prompt change is a data question: did the scores improve?
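That data question can itself be made mechanical. A minimal sketch (hypothetical helper, not part of any SDK): compare a candidate's mean score to the stored baseline, and require a small margin so run-to-run noise doesn't count as a win.

```python
def is_improvement(baseline: float, candidate: float, min_delta: float = 0.02) -> bool:
    """True only if the candidate beats the baseline by at least min_delta."""
    return candidate - baseline >= min_delta

print(is_improvement(0.78, 0.85))  # → True  (clear gain)
print(is_improvement(0.78, 0.79))  # → False (within noise margin)
```

The right `min_delta` depends on your dataset size and scorer variance; treat the threshold as something to tune, not a constant.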