Evaluation

Evaluation replaces guesswork with data. Instead of asking “does this prompt feel better?” you can run systematic checks that give you a repeatable, objective measure of your application’s behavior.

XeroML supports evaluation at two stages of the development lifecycle:

Offline evaluation — test your application against a fixed dataset before deploying a change. Run a new prompt version against your test cases, review the scores, iterate until results look good, then deploy.

Online evaluation — score live production traces as they arrive. Catch regressions in real time, build ground truth from production edge cases, and feed findings back into your offline datasets.

These two loops reinforce each other: production traces reveal edge cases your dataset didn’t cover; adding them to your dataset makes the next round of offline testing more robust.
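The offline loop above can be sketched in a few lines. This is a minimal, hypothetical example (none of these function names come from the XeroML SDK): each test case pairs an input with an expected output, a scorer maps (output, expected) to a number, and the loop reports the mean score for a prompt version.

```python
def exact_match(output: str, expected: str) -> float:
    """Deterministic scorer: 1.0 on an exact match, else 0.0."""
    return 1.0 if output.strip() == expected.strip() else 0.0

def run_prompt(prompt_version: str, case_input: str) -> str:
    """Stand-in for a model call; replace with your LLM client."""
    return case_input.upper()  # placeholder behavior for the sketch

def evaluate(prompt_version: str, dataset: list[dict]) -> float:
    """Run every test case, score it, and return the mean score."""
    scores = [
        exact_match(run_prompt(prompt_version, case["input"]), case["expected"])
        for case in dataset
    ]
    return sum(scores) / len(scores)

dataset = [
    {"input": "hello", "expected": "HELLO"},
    {"input": "world", "expected": "world"},
]
print(evaluate("v2-prompt", dataset))  # → 0.5 (one of two cases passes)
```

The key property is repeatability: the same dataset and scorer give the same number for every prompt version, so scores are comparable across iterations.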

Core Concepts

Core Concepts — understand scores, datasets, experiments, and the evaluation loop

Evaluation Methods

| Method | Best for |
|---|---|
| LLM-as-a-Judge | Scalable automated scoring of nuanced qualities (helpfulness, accuracy, tone) |
| Annotation Queues | Structured human review workflows for ground truth creation |
| Custom Scores | Programmatic scoring via API/SDK — deterministic checks, user feedback, external signals |
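To make LLM-as-a-Judge concrete, here is a hedged sketch of the pattern: prompt a judge model for a rating, parse and validate its reply, and normalize it into a score. The judge call is stubbed (`call_judge` is not a XeroML API; swap it for your actual LLM client).

```python
JUDGE_PROMPT = (
    "Rate the helpfulness of this answer from 1 to 5.\n"
    "Question: {question}\nAnswer: {answer}\n"
    "Reply with a single integer."
)

def call_judge(prompt: str) -> str:
    """Stand-in for an LLM call; returns a canned rating in this sketch."""
    return "4"

def judge_helpfulness(question: str, answer: str) -> float:
    """Ask the judge for a 1–5 rating and normalize it to 0.0–1.0."""
    raw = call_judge(JUDGE_PROMPT.format(question=question, answer=answer))
    rating = int(raw.strip())
    if not 1 <= rating <= 5:
        raise ValueError(f"judge returned out-of-range rating: {rating}")
    return (rating - 1) / 4

print(judge_helpfulness("What is 2+2?", "4"))  # → 0.75
```

Validating and normalizing the judge's raw reply matters in practice: judges occasionally return prose or out-of-range values, and a strict parse keeps bad replies from silently polluting your scores.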

Experiments

Run controlled experiments to compare prompt versions, models, or pipeline configurations:

| Resource | Description |
|---|---|
| Datasets | Create and manage test case collections |
| Experiments via SDK | Run experiments programmatically against XeroML or local datasets |
| Prompt Experiments | Run prompt variant experiments directly from the UI |
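A programmatic experiment boils down to running each variant over the same dataset and comparing aggregate scores. The sketch below uses illustrative names, not the real SDK: prompt variants are plain templates, and the scorer is a simple substring check standing in for whatever scoring method you configure.

```python
def run_variant(template: str, case_input: str) -> str:
    """Stand-in for rendering a prompt variant and calling the model."""
    return template.format(input=case_input)

def substring_score(output: str, expected: str) -> float:
    """Deterministic scorer: did the expected string appear in the output?"""
    return 1.0 if expected in output else 0.0

def experiment(variants: dict[str, str], dataset: list[dict]) -> dict[str, float]:
    """Score every variant on every case; return mean score per variant."""
    results = {}
    for name, template in variants.items():
        scores = [
            substring_score(run_variant(template, case["input"]), case["expected"])
            for case in dataset
        ]
        results[name] = sum(scores) / len(scores)
    return results

variants = {
    "baseline": "Answer briefly: {input}",
    "v2": "Answer briefly: {input} (cite sources)",
}
dataset = [{"input": "capital of France", "expected": "France"}]
print(experiment(variants, dataset))
```

Because every variant sees the identical dataset and scorer, the per-variant means are directly comparable, which is what makes the run a controlled experiment rather than two unrelated test runs.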

Getting Started

The recommended sequence:

  1. Read Core Concepts to understand scores and datasets
  2. Create a dataset from production traces or from scratch
  3. Run your first experiment with the current prompt version as a baseline
  4. Set up a live evaluator to score production traces automatically

Once you have baseline scores, every prompt change is a data question: did the scores improve?
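That data question can itself be made mechanical. A minimal sketch (hypothetical helper, not part of any SDK): compare a candidate's mean score to the stored baseline, and require a small margin so run-to-run noise doesn't count as a win.

```python
def is_improvement(baseline: float, candidate: float, min_delta: float = 0.02) -> bool:
    """True only if the candidate beats the baseline by at least min_delta."""
    return candidate - baseline >= min_delta

print(is_improvement(0.78, 0.85))  # → True  (clear gain)
print(is_improvement(0.78, 0.79))  # → False (within noise margin)
```

The right `min_delta` depends on your dataset size and scorer variance; treat the threshold as something to tune, not a constant.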