# Evaluation Core Concepts
## The Evaluation Loop
LLM evaluation works best as a cycle between offline testing and online monitoring:
**Offline evaluation.** Before deploying a change, run your updated prompt or model against a fixed set of test cases. Review scores, iterate on the prompt, and repeat until results are satisfactory. Then deploy.

**Online evaluation.** After deployment, run evaluators on live production traces. This catches regressions that your test set didn't cover: real users find edge cases faster than any synthetic dataset.

**Closing the loop.** When production evaluation surfaces a failure, add that trace to your dataset. The next offline experiment covers that case, making future deployments safer.
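The offline half of this loop can be sketched in a few lines. This is a minimal illustration, not XeroML's SDK: the `task` function and the exact-match scorer stand in for your real application and evaluator.

```python
# Minimal sketch of offline evaluation: run a task over a fixed test set,
# score each output, and gate deployment on the aggregate score.

def task(item_input: str) -> str:
    # Placeholder application under test; a real task would call your
    # prompt or model here.
    return item_input.strip().lower()

def exact_match(output: str, expected: str) -> float:
    return 1.0 if output == expected else 0.0

test_set = [
    {"input": "  Hello ", "expected": "hello"},
    {"input": "WORLD", "expected": "world"},
]

scores = [exact_match(task(item["input"]), item["expected"]) for item in test_set]
mean_score = sum(scores) / len(scores)
deploy = mean_score >= 0.9  # iterate on the prompt until this gate passes
```

When the gate fails, you change the prompt and rerun; only once it passes do you deploy and switch to online monitoring.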
## Scores
A score is the fundamental output of any evaluation. Scores attach to a trace or observation and carry a name, a value, and optional metadata.
| Field | Description |
|---|---|
| `name` | The dimension being measured (e.g., accuracy, helpfulness, toxicity) |
| `value` | Numeric (any float), boolean (`true`/`false`), or categorical (string) |
| `comment` | Optional explanation of the score |
| `source` | Where the score came from: `llm`, `human`, or `api` |
Multiple scores can be attached to a single trace, covering different dimensions.
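As a sketch, a score record with these fields might look like the following. The `Score` class is illustrative only, not XeroML's actual SDK type.

```python
# Illustrative score record matching the fields above: name, value,
# source, and an optional comment.
from dataclasses import dataclass
from typing import Optional, Union

@dataclass
class Score:
    name: str                       # dimension, e.g. "accuracy"
    value: Union[float, bool, str]  # numeric, boolean, or categorical
    source: str                     # "llm", "human", or "api"
    comment: Optional[str] = None   # optional explanation

# Multiple scores attached to one trace, each covering a different dimension:
trace_scores = [
    Score(name="accuracy", value=0.85, source="llm"),
    Score(name="toxicity", value=False, source="api"),
    Score(name="quality", value="pass", source="human", comment="spot check"),
]
```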
## Evaluation Methods
XeroML supports four scoring mechanisms:
| Method | Source | Scale |
|---|---|---|
| LLM-as-a-Judge | Automated, model-based | High — thousands per hour |
| Scores via UI | Human, manual | Low — spot checks |
| Annotation Queues | Human, structured | Medium — systematic workflows |
| Scores via API/SDK | Automated or human | Any |
## Datasets
A dataset is a collection of test cases (dataset items). Each item contains:
- Input — the scenario to test (user message, document, query)
- Expected output (optional) — the ideal response, for reference-based scoring
Datasets are versioned. Every modification creates a new version timestamp, so experiments are reproducible against the exact dataset state at the time they ran.
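The versioning behavior described above can be sketched as follows. This illustrates the concept only; it is not XeroML's storage model, and the class and field names are assumptions.

```python
# Sketch of dataset versioning: every modification records a new version
# timestamp, so an experiment can pin the exact state it ran against.
import time

class VersionedDataset:
    def __init__(self):
        self.items = []
        self.version = time.time()  # refreshed on every modification

    def add_item(self, input, expected_output=None):
        self.items.append({"input": input, "expected_output": expected_output})
        self.version = time.time()

ds = VersionedDataset()
v0 = ds.version
ds.add_item("What is 2+2?", expected_output="4")
# ds.version now reflects the modification; an experiment run would
# record this value so its results stay reproducible.
```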
### Dataset Items
A dataset item is a single test case. Items can be created:
- Manually in the UI
- Via SDK
- By adding production traces to a dataset (one-click in the trace UI)
Items can include structured inputs (JSON) or plain text, and optionally include expected outputs for comparison.
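A dataset item's shape can be sketched like this; the `DatasetItem` class and its field names are illustrative, not XeroML's SDK.

```python
# Sketch of a dataset item: the input may be plain text or a structured
# JSON-like dict, and the expected output is optional.
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class DatasetItem:
    input: Any                          # str or structured dict
    expected_output: Optional[Any] = None

# One plain-text item (no reference answer) and one structured item:
items = [
    DatasetItem(input="Summarize: the cat sat on the mat."),
    DatasetItem(
        input={"question": "What is 2+2?", "context": "arithmetic"},
        expected_output="4",
    ),
]
```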
## Experiments
An experiment is a single run of your application against all items in a dataset. The workflow:
- Define a task: a function that takes a dataset item’s input and returns your application’s output
- Run the task against every item in the dataset
- Score the outputs using one or more evaluation methods
- Review aggregate scores in the dashboard
Multiple experiment runs on the same dataset create a comparison view: you can see exactly which items regressed or improved between runs.
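The run-and-compare workflow can be sketched with a simple exact-match scorer. All names here are illustrative, and the two lambdas stand in for two versions of your application.

```python
# Sketch of two experiment runs on the same dataset, followed by a
# per-item comparison showing which items regressed or improved.

def run_experiment(task, dataset):
    """Run `task` on every item; score 1.0 for an exact match, else 0.0."""
    return {
        i: 1.0 if task(item["input"]) == item["expected_output"] else 0.0
        for i, item in enumerate(dataset)
    }

dataset = [
    {"input": "2+2", "expected_output": "4"},
    {"input": "3+3", "expected_output": "6"},
]

# Baseline "application" actually adds; the candidate always answers "4".
baseline = run_experiment(lambda q: str(sum(int(x) for x in q.split("+"))), dataset)
candidate = run_experiment(lambda q: "4", dataset)

regressed = [i for i in baseline if candidate[i] < baseline[i]]
improved = [i for i in baseline if candidate[i] > baseline[i]]
```

Here item 1 regressed and nothing improved, which is exactly the per-item signal the comparison view surfaces.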
### Experiment Runs
An experiment run is one complete execution of the task against a dataset. Each run produces:
- One trace per dataset item (in XeroML’s trace store)
- Scores for each trace (from evaluators)
- Aggregate metrics across the run
You can run experiments from the XeroML UI (for prompt experiments) or via the SDK (for full application experiments).
## Score Types
XeroML supports three value types for scores:
| Type | Example values |
|---|---|
| Numeric | 0.85, 7.2, -1.0 |
| Boolean | true, false |
| Categorical | "pass", "fail", "needs-review" |
Use numeric scores for ranking and regression detection, boolean for pass/fail checks, and categorical for multi-class quality labels.
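A small helper makes the three value types concrete. This is an illustration of the table above, not a XeroML API; note that in Python, `bool` must be checked before numeric types because it is an `int` subtype.

```python
# Sketch: classify a score value into the three supported types.

def score_type(value) -> str:
    if isinstance(value, bool):        # check bool first: bool subclasses int
        return "boolean"
    if isinstance(value, (int, float)):
        return "numeric"
    if isinstance(value, str):
        return "categorical"
    raise TypeError(f"unsupported score value: {value!r}")

assert score_type(0.85) == "numeric"      # ranking, regression detection
assert score_type(True) == "boolean"      # pass/fail checks
assert score_type("pass") == "categorical" # multi-class quality labels
```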