# Evaluation Core Concepts
## The Evaluation Loop
LLM evaluation works best as a cycle between offline testing and online monitoring:
**Offline evaluation.** Before deploying a change, run your updated prompt or model against a fixed set of test cases. Review scores, iterate on the prompt, and repeat until results are satisfactory. Then deploy.

**Online evaluation.** After deployment, run evaluators on live production traces. This catches regressions that your test set didn't cover: real users find edge cases faster than any synthetic dataset.

**Closing the loop.** When production evaluation surfaces a failure, add that trace to your dataset. The next offline experiment covers that case, making future deployments safer.
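The offline half of this loop can be sketched in a few lines. This is a minimal illustration, not XeroML's SDK: the `task` function and the exact-match scorer stand in for your real application and evaluator.

```python
# Minimal sketch of offline evaluation: run a task over a fixed test set,
# score each output, and gate deployment on the aggregate score.

def task(item_input: str) -> str:
    # Placeholder application under test; a real task would call your
    # prompt or model here.
    return item_input.strip().lower()

def exact_match(output: str, expected: str) -> float:
    return 1.0 if output == expected else 0.0

test_set = [
    {"input": "  Hello ", "expected": "hello"},
    {"input": "WORLD", "expected": "world"},
]

scores = [exact_match(task(item["input"]), item["expected"]) for item in test_set]
mean_score = sum(scores) / len(scores)
deploy = mean_score >= 0.9  # iterate on the prompt until this gate passes
```

When the gate fails, you change the prompt and rerun; only once it passes do you deploy and switch to online monitoring.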
## Scores
A score is the fundamental output of any evaluation. Scores attach to a trace or observation and carry a name, a value, and optional metadata.
| Field | Description |
|---|---|
| `name` | The dimension being measured (e.g., accuracy, helpfulness, toxicity) |
| `value` | Numeric (any float), boolean (`true`/`false`), or categorical (string) |
| `comment` | Optional explanation of the score |
| `source` | Where the score came from: `llm`, `human`, or `api` |
Multiple scores can be attached to a single trace, covering different dimensions.
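As a sketch, a score record with these fields might look like the following. The `Score` class is illustrative only, not XeroML's actual SDK type.

```python
# Illustrative score record matching the fields above: name, value,
# source, and an optional comment.
from dataclasses import dataclass
from typing import Optional, Union

@dataclass
class Score:
    name: str                       # dimension, e.g. "accuracy"
    value: Union[float, bool, str]  # numeric, boolean, or categorical
    source: str                     # "llm", "human", or "api"
    comment: Optional[str] = None   # optional explanation

# Multiple scores attached to one trace, each covering a different dimension:
trace_scores = [
    Score(name="accuracy", value=0.85, source="llm"),
    Score(name="toxicity", value=False, source="api"),
    Score(name="quality", value="pass", source="human", comment="spot check"),
]
```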
## Evaluation Methods
XeroML supports four scoring mechanisms:
| Method | Source | Scale |
|---|---|---|
| LLM-as-a-Judge | Automated, model-based | High — thousands per hour |
| Scores via UI | Human, manual | Low — spot checks |
| Annotation Queues | Human, structured | Medium — systematic workflows |
| Scores via API/SDK | Automated or human | Any |
## Datasets
A dataset is a collection of test cases (dataset items). Each item contains:
- Input — the scenario to test (user message, document, query)
- Expected output (optional) — the ideal response, for reference-based scoring
Datasets are versioned. Every modification creates a new version timestamp, so experiments are reproducible against the exact dataset state at the time they ran.
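The versioning behavior described above can be sketched as follows. This illustrates the concept only; it is not XeroML's storage model, and the class and field names are assumptions.

```python
# Sketch of dataset versioning: every modification records a new version
# timestamp, so an experiment can pin the exact state it ran against.
import time

class VersionedDataset:
    def __init__(self):
        self.items = []
        self.version = time.time()  # refreshed on every modification

    def add_item(self, input, expected_output=None):
        self.items.append({"input": input, "expected_output": expected_output})
        self.version = time.time()

ds = VersionedDataset()
v0 = ds.version
ds.add_item("What is 2+2?", expected_output="4")
# ds.version now reflects the modification; an experiment run would
# record this value so its results stay reproducible.
```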
### Dataset Items
A dataset item is a single test case. Items can be created:
- Manually in the UI
- Via SDK
- By adding production traces to a dataset (one-click in the trace UI)
Items can include structured inputs (JSON) or plain text, and optionally include expected outputs for comparison.
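A dataset item's shape can be sketched like this; the `DatasetItem` class and its field names are illustrative, not XeroML's SDK.

```python
# Sketch of a dataset item: the input may be plain text or a structured
# JSON-like dict, and the expected output is optional.
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class DatasetItem:
    input: Any                          # str or structured dict
    expected_output: Optional[Any] = None

# One plain-text item (no reference answer) and one structured item:
items = [
    DatasetItem(input="Summarize: the cat sat on the mat."),
    DatasetItem(
        input={"question": "What is 2+2?", "context": "arithmetic"},
        expected_output="4",
    ),
]
```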
## Experiments
An experiment is a single run of your application against all items in a dataset. The workflow:
- Define a task: a function that takes a dataset item’s input and returns your application’s output
- Run the task against every item in the dataset
- Score the outputs using one or more evaluation methods
- Review aggregate scores in the dashboard
Multiple experiment runs on the same dataset create a comparison view: you can see exactly which items regressed or improved between runs.
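The run-and-compare workflow can be sketched with a simple exact-match scorer. All names here are illustrative, and the two lambdas stand in for two versions of your application.

```python
# Sketch of two experiment runs on the same dataset, followed by a
# per-item comparison showing which items regressed or improved.

def run_experiment(task, dataset):
    """Run `task` on every item; score 1.0 for an exact match, else 0.0."""
    return {
        i: 1.0 if task(item["input"]) == item["expected_output"] else 0.0
        for i, item in enumerate(dataset)
    }

dataset = [
    {"input": "2+2", "expected_output": "4"},
    {"input": "3+3", "expected_output": "6"},
]

# Baseline "application" actually adds; the candidate always answers "4".
baseline = run_experiment(lambda q: str(sum(int(x) for x in q.split("+"))), dataset)
candidate = run_experiment(lambda q: "4", dataset)

regressed = [i for i in baseline if candidate[i] < baseline[i]]
improved = [i for i in baseline if candidate[i] > baseline[i]]
```

Here item 1 regressed and nothing improved, which is exactly the per-item signal the comparison view surfaces.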
### Experiment Runs
An experiment run is one complete execution of the task against a dataset. Each run produces:
- One trace per dataset item (in XeroML’s trace store)
- Scores for each trace (from evaluators)
- Aggregate metrics across the run
You can run experiments from the XeroML UI (for prompt experiments) or via the SDK (for full application experiments).
## Score Types
XeroML supports three value types for scores:
| Type | Example values |
|---|---|
| Numeric | 0.85, 7.2, -1.0 |
| Boolean | true, false |
| Categorical | "pass", "fail", "needs-review" |
Use numeric scores for ranking and regression detection, boolean for pass/fail checks, and categorical for multi-class quality labels.
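A small helper makes the three value types concrete. This is an illustration of the table above, not a XeroML API; note that in Python, `bool` must be checked before numeric types because it is an `int` subtype.

```python
# Sketch: classify a score value into the three supported types.

def score_type(value) -> str:
    if isinstance(value, bool):        # check bool first: bool subclasses int
        return "boolean"
    if isinstance(value, (int, float)):
        return "numeric"
    if isinstance(value, str):
        return "categorical"
    raise TypeError(f"unsupported score value: {value!r}")

assert score_type(0.85) == "numeric"      # ranking, regression detection
assert score_type(True) == "boolean"      # pass/fail checks
assert score_type("pass") == "categorical" # multi-class quality labels
```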