Evaluation
Evaluation replaces guesswork with data. Instead of asking “does this prompt feel better?” you can run systematic checks that give you a repeatable, objective measure of your application’s behavior.
XeroML supports evaluation at two stages of the development lifecycle:
Offline evaluation — test your application against a fixed dataset before deploying a change. Run a new prompt version against your test cases, review the scores, iterate until results look good, then deploy.
Online evaluation — score live production traces as they arrive. Catch regressions in real time, build ground truth from production edge cases, and feed findings back into your offline datasets.
These two loops reinforce each other: production traces reveal edge cases your dataset didn’t cover; adding them to your dataset makes the next round of offline testing more robust.
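
The offline loop and the feedback from production can be sketched in plain Python. This is a minimal illustration, not the XeroML SDK: `EvalCase`, `exact_match`, `run_offline_eval`, and `toy_app` are hypothetical names invented for this example, and the dataset is a local list rather than a XeroML dataset.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    """One test case: an input and the answer we expect (hypothetical shape)."""
    input: str
    expected: str


def exact_match(output: str, expected: str) -> float:
    """Score 1.0 when the normalized output equals the expected answer, else 0.0."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def run_offline_eval(app: Callable[[str], str], dataset: list[EvalCase]) -> float:
    """Run the application on every test case and return the mean score."""
    scores = [exact_match(app(case.input), case.expected) for case in dataset]
    return sum(scores) / len(scores) if scores else 0.0


def toy_app(question: str) -> str:
    """Stand-in for the application under test (assumption for this sketch)."""
    answers = {"capital of France?": "Paris", "2 + 2?": "4"}
    return answers.get(question, "unknown")


dataset = [
    EvalCase("capital of France?", "paris"),
    EvalCase("2 + 2?", "4"),
]
baseline = run_offline_eval(toy_app, dataset)

# An edge case observed in production is added to the dataset,
# so the next offline run covers it.
dataset.append(EvalCase("capital of Australia?", "Canberra"))
after_feedback = run_offline_eval(toy_app, dataset)

print(f"before: {baseline:.2f}, after adding the edge case: {after_feedback:.2f}")
```

The score drops once the uncovered case joins the dataset, which is the point of the loop: the dataset now measures a failure it previously could not see.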
Core Concepts
See Core Concepts for an explanation of scores, datasets, experiments, and the evaluation loop.
Evaluation Methods
| Method | Best for |
|---|---|
| LLM-as-a-Judge | Scalable automated scoring of nuanced qualities (helpfulness, accuracy, tone) |
| Annotation Queues | Structured human review workflows for ground truth creation |
| Custom Scores | Programmatic scoring via API/SDK — deterministic checks, user feedback, external signals |
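
As an illustration of the Custom Scores row, the sketch below computes two programmatic scores: a deterministic structure check and a mapping of user feedback to a number. The `Score` record and both function names are assumptions made for this example; how a score is actually submitted through the XeroML API or SDK is not shown here, because this page does not specify those calls.

```python
import json
from dataclasses import dataclass


@dataclass
class Score:
    """Hypothetical score record: a metric name, a numeric value, an optional note."""
    name: str
    value: float
    comment: str = ""


def score_json_structure(output: str, required_keys: set[str]) -> Score:
    """Deterministic check: is the output a JSON object containing the required keys?"""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return Score("json_structure", 0.0, "output is not valid JSON")
    if not isinstance(parsed, dict):
        return Score("json_structure", 0.0, "output is not a JSON object")
    missing = required_keys - set(parsed)
    if missing:
        return Score("json_structure", 0.0, f"missing keys: {sorted(missing)}")
    return Score("json_structure", 1.0)


def score_user_feedback(thumbs_up: bool) -> Score:
    """Map a thumbs-up/down signal to a numeric score."""
    return Score("user_feedback", 1.0 if thumbs_up else 0.0)


print(score_json_structure('{"answer": "42", "sources": []}', {"answer", "sources"}))
print(score_json_structure('{"answer": "42"}', {"answer", "sources"}))
print(score_user_feedback(False))
```

Checks like these are cheap and repeatable, so they pair well with LLM-as-a-Judge scoring for qualities that cannot be verified mechanically.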
Experiments
Run controlled experiments to compare prompt versions, models, or pipeline configurations:
| Resource | Description |
|---|---|
| Datasets | Create and manage test case collections |
| Experiments via SDK | Run experiments programmatically against XeroML or local datasets |
| Prompt Experiments | Run prompt variant experiments directly from the UI |
Getting Started
The recommended sequence:
- Read Core Concepts to understand scores and datasets
- Create a dataset from production traces or from scratch
- Run your first experiment with the current prompt version as a baseline
- Set up a live evaluator to score production traces automatically
Once you have baseline scores, every prompt change is a data question: did the scores improve?
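
One way to answer that question is to compare mean scores per metric between the baseline run and the candidate run. The sketch below assumes each run's scores have already been exported as plain lists keyed by metric name; `compare_runs` is a hypothetical helper, and the numbers are illustrative, not real results.

```python
from statistics import mean


def compare_runs(
    baseline: dict[str, list[float]],
    candidate: dict[str, list[float]],
) -> dict[str, float]:
    """Return the change in mean score per shared metric (candidate minus baseline)."""
    return {
        metric: mean(candidate[metric]) - mean(baseline[metric])
        for metric in baseline
        if metric in candidate
    }


# Illustrative numbers only.
baseline_scores = {"accuracy": [1.0, 0.0, 1.0, 1.0], "tone": [0.5, 0.5, 1.0, 1.0]}
candidate_scores = {"accuracy": [1.0, 1.0, 1.0, 1.0], "tone": [0.5, 0.5, 0.5, 1.0]}

deltas = compare_runs(baseline_scores, candidate_scores)
for metric, delta in deltas.items():
    verdict = "improved" if delta > 0 else "regressed" if delta < 0 else "unchanged"
    print(f"{metric}: {delta:+.3f} ({verdict})")
```

Reporting each metric separately keeps a gain in one quality from hiding a regression in another. With small datasets, treat small differences with caution, since a handful of test cases can move the mean noticeably.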