Evaluation Core Concepts

The Evaluation Loop

LLM evaluation works best as a cycle between offline testing and online monitoring:

Offline evaluation: Before deploying a change, run your updated prompt or model against a fixed set of test cases. Review scores, iterate on the prompt, and repeat until results are satisfactory. Then deploy.

Online evaluation: After deployment, run evaluators on live production traces. This catches regressions that your test set didn’t cover — real users find edge cases faster than any synthetic dataset.

Closing the loop: When production evaluation surfaces a failure, add that trace to your dataset. The next offline experiment covers that case, making future deployments safer.

Scores

A score is the fundamental output of any evaluation. Scores attach to a trace or observation and carry a name, a value, and optional metadata.

| Field | Description |
| --- | --- |
| name | The dimension being measured (e.g., accuracy, helpfulness, toxicity) |
| value | Numeric (any float), boolean (true/false), or categorical (string) |
| comment | Optional explanation of the score |
| source | Where the score came from: llm, human, api |

Multiple scores can be attached to a single trace, covering different dimensions.
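The fields above can be sketched as a plain Python record. This is an illustrative shape following the table, not an actual XeroML SDK class:

```python
from dataclasses import dataclass
from typing import Optional, Union

# Illustrative score record; field names mirror the table above.
@dataclass
class Score:
    name: str                       # dimension measured, e.g. "accuracy"
    value: Union[float, bool, str]  # numeric, boolean, or categorical
    comment: Optional[str] = None   # optional explanation of the score
    source: str = "api"             # "llm", "human", or "api"

# Multiple scores attach to one trace, each covering a different dimension.
trace_scores = [
    Score(name="accuracy", value=0.85, source="llm"),
    Score(name="toxicity", value=False, source="llm"),
    Score(name="quality", value="pass", source="human", comment="spot-checked"),
]
```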

Evaluation Methods

XeroML supports four scoring mechanisms:

| Method | Source | Scale |
| --- | --- | --- |
| LLM-as-a-Judge | Automated, model-based | High — thousands per hour |
| Scores via UI | Human, manual | Low — spot checks |
| Annotation Queues | Human, structured | Medium — systematic workflows |
| Scores via API/SDK | Automated or human | Any |

Datasets

A dataset is a collection of test cases (dataset items). Each item contains:

  • Input — the scenario to test (user message, document, query)
  • Expected output (optional) — the ideal response, for reference-based scoring

Datasets are versioned. Every modification creates a new version timestamp, so experiments are reproducible against the exact dataset state at the time they ran.

Dataset Items

A dataset item is a single test case. Items can be created:

  • Manually in the UI
  • Via SDK
  • By adding production traces to a dataset (one-click in the trace UI)

Items can include structured inputs (JSON) or plain text, and optionally include expected outputs for comparison.
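As a sketch, dataset items can be represented as plain dicts with the two fields described above. The key names here are illustrative, not the XeroML SDK's own schema:

```python
# Two items: one with a structured JSON input and a reference answer,
# one with a plain-text input and no expected output.
items = [
    {
        "input": {"question": "What is the capital of France?"},
        "expected_output": "Paris",   # reference for reference-based scoring
    },
    {
        "input": "Summarize the attached document.",
        "expected_output": None,      # no reference: judge-only scoring
    },
]

# Reference-based scorers should skip items that have no expected output.
scorable = [it for it in items if it["expected_output"] is not None]
```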

Experiments

An experiment is a single run of your application against all items in a dataset. The workflow:

  1. Define a task — the application function being evaluated (a function that takes an item’s input and returns an output)
  2. Run the task against every item in the dataset
  3. Score the outputs using one or more evaluation methods
  4. Review aggregate scores in the dashboard

Multiple experiment runs on the same dataset create a comparison view: you can see exactly which items regressed or improved between runs.
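The four-step workflow above can be sketched in plain Python. The task, dataset, and scorer here are all illustrative stand-ins, not XeroML SDK calls:

```python
def task(item_input: str) -> str:
    """Step 1: the application function being evaluated."""
    return item_input.upper()  # stand-in for a real LLM call

def exact_match(output: str, expected: str) -> float:
    """Step 3: a reference-based scorer returning 1.0 or 0.0."""
    return 1.0 if output == expected else 0.0

dataset = [
    {"input": "hello", "expected_output": "HELLO"},
    {"input": "world", "expected_output": "world"},
]

# Step 2: run the task against every item; step 3: score each output.
scores = [
    exact_match(task(item["input"]), item["expected_output"])
    for item in dataset
]

# Step 4: aggregate for the dashboard view.
mean_score = sum(scores) / len(scores)
print(f"exact_match mean: {mean_score:.2f}")  # → exact_match mean: 0.50
```

Running the same loop again after a prompt change produces a second set of per-item scores, which is what the comparison view diffs.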

Experiment Runs

An experiment run is one complete execution of the task against a dataset. Each run produces:

  • One trace per dataset item (in XeroML’s trace store)
  • Scores for each trace (from evaluators)
  • Aggregate metrics across the run

You can run experiments from the XeroML UI (for prompt experiments) or via the SDK (for full application experiments).

Score Types

XeroML supports three value types for scores:

| Type | Example values |
| --- | --- |
| Numeric | 0.85, 7.2, -1.0 |
| Boolean | true, false |
| Categorical | "pass", "fail", "needs-review" |

Use numeric scores for ranking and regression detection, boolean for pass/fail checks, and categorical for multi-class quality labels.
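A small sketch of classifying a score value into one of the three types (the function is illustrative, not part of the XeroML API):

```python
def score_type(value) -> str:
    """Map a score value to its XeroML type name."""
    # Check bool before numeric: in Python, bool is a subclass of int.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, (int, float)):
        return "numeric"
    if isinstance(value, str):
        return "categorical"
    raise TypeError(f"unsupported score value: {value!r}")

assert score_type(0.85) == "numeric"                 # ranking, regressions
assert score_type(True) == "boolean"                 # pass/fail checks
assert score_type("needs-review") == "categorical"   # multi-class labels
```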