Datasets
Datasets are collections of test cases used to evaluate your LLM application systematically. Each item in a dataset provides an input scenario (and optionally an expected output) that your application is run against during an experiment.
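Conceptually, a dataset item is just a structured record. As a plain-Python illustration (not an API call), using the field names this page works with:

```python
# Illustrative only: a dataset item as a plain dict, with the
# input / expected_output / metadata fields described on this page.
item = {
    "input": {"question": "How do I reset my API key?"},
    "expected_output": {"answer": "Go to Settings → API Keys and click Regenerate."},
    "metadata": {"source": "support-ticket-1234"},
}

# During an experiment, your application receives item["input"] and its
# response is compared against item["expected_output"].
print(sorted(item.keys()))
```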
Creating a Dataset
- Navigate to Evaluation → Datasets
- Click New Dataset
- Enter a name. Use `/` separators to organize datasets into virtual folders, e.g. `evaluation/qa-dataset`
- Optionally add a description and metadata
- Click Create
```python
from xeroml import get_client

xeroml = get_client()

dataset = xeroml.create_dataset(
    name="evaluation/qa-dataset",
    description="Q&A pairs from customer support tickets",
    metadata={"author": "eval-team", "date": "2025-01", "type": "qa"},
)
```

```typescript
import { XeroMLClient } from "@xeroml/client";

const xeroml = new XeroMLClient();

const dataset = await xeroml.createDataset({
  name: "evaluation/qa-dataset",
  description: "Q&A pairs from customer support tickets",
  metadata: { author: "eval-team", date: "2025-01", type: "qa" },
});
```

Adding Dataset Items
```python
dataset = xeroml.get_dataset("evaluation/qa-dataset")

dataset.upsert_item(
    input={"question": "How do I reset my API key?"},
    expected_output={"answer": "Go to Settings → API Keys and click Regenerate."},
    metadata={"source": "support-ticket-1234"},
)
```

Add multiple items at once:
```python
items = [
    {"input": {"question": q}, "expected_output": {"answer": a}}
    for q, a in qa_pairs
]

for item in items:
    dataset.upsert_item(**item)
```

Adding Items from Traces

In the Traces list view:
- Filter traces by tag, score, or date range
- Select the traces you want to add
- Click Add to Dataset
- Map trace fields to dataset item fields (input → input, output → expected_output)
This is the fastest way to build a dataset that reflects real production usage.
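The field mapping in step 4 is also easy to express in code if you export traces and add them via the SDK. A sketch, assuming a simple trace dict shape (the `id`/`input`/`output` keys below are an assumption about your export format, not part of the XeroML API):

```python
def trace_to_item(trace: dict) -> dict:
    """Map a trace record to a dataset item:
    input -> input, output -> expected_output,
    keeping the trace id as provenance metadata.
    The trace dict shape here is assumed, not a XeroML API contract."""
    return {
        "input": trace["input"],
        "expected_output": trace["output"],
        "metadata": {"trace_id": trace.get("id")},
    }

trace = {"id": "tr-42", "input": {"question": "Hi?"}, "output": {"answer": "Hello!"}}
item = trace_to_item(trace)
# item can then be passed to dataset.upsert_item(**item)
print(item["metadata"])
```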
Dataset Organization
Use / in dataset names to create virtual folder hierarchies:
```
evaluation/
  qa-dataset
  rag-evaluation
  safety-checks
experiments/
  prompt-v2-baseline
  model-comparison
```

The UI displays these as nested folders automatically.
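The grouping is purely name-based; a small sketch of how `/`-separated names collapse into folders:

```python
from collections import defaultdict

names = [
    "evaluation/qa-dataset",
    "evaluation/rag-evaluation",
    "evaluation/safety-checks",
    "experiments/prompt-v2-baseline",
    "experiments/model-comparison",
]

# Group each dataset under the folder portion of its name.
folders = defaultdict(list)
for name in names:
    folder, _, dataset = name.rpartition("/")
    folders[folder or "(root)"].append(dataset)

print(dict(folders))
# {'evaluation': ['qa-dataset', 'rag-evaluation', 'safety-checks'],
#  'experiments': ['prompt-v2-baseline', 'model-comparison']}
```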
Schema Validation
Optionally enforce a JSON Schema on your dataset items to ensure consistency:
```python
dataset = xeroml.create_dataset(
    name="structured-qa",
    input_schema={
        "type": "object",
        "properties": {
            "question": {"type": "string"},
            "context": {"type": "string"},
        },
        "required": ["question"],
    },
    expected_output_schema={
        "type": "object",
        "properties": {
            "answer": {"type": "string"},
        },
        "required": ["answer"],
    },
)
```

Items that don't match the schema are rejected with a detailed error message.
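To see what "rejected with a detailed error message" means in practice, here is a toy validator covering just the `required`-keys and string-`type` parts of the schema above. It is a simplified stand-in for real JSON Schema validation (in practice you would rely on the server-side check, or use a library such as `jsonschema` to pre-validate locally):

```python
input_schema = {
    "type": "object",
    "properties": {
        "question": {"type": "string"},
        "context": {"type": "string"},
    },
    "required": ["question"],
}

def check_item(item_input: dict, schema: dict) -> list:
    """Toy subset of JSON Schema validation: checks required keys
    and string-typed properties only. Illustrative, not the real check."""
    errors = []
    for key in schema.get("required", []):
        if key not in item_input:
            errors.append(f"missing required property: {key}")
    for key, spec in schema.get("properties", {}).items():
        if key in item_input and spec.get("type") == "string":
            if not isinstance(item_input[key], str):
                errors.append(f"property {key!r} must be a string")
    return errors

errors = check_item({"context": "no question here"}, input_schema)
print(errors)  # ['missing required property: question']
```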
Dataset Versioning
Every modification to a dataset (add, update, delete, archive item) creates a new version timestamp. When running experiments, XeroML records which dataset version was used, making results reproducible.
To retrieve a dataset as it was at a specific point in time:
```python
dataset = xeroml.get_dataset(
    "evaluation/qa-dataset",
    as_of="2025-01-15T10:00:00Z",
)
```

Next Steps
Once you have a dataset, run experiments against it:
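Conceptually, an experiment runs your application over every dataset item and compares each output against `expected_output`. A minimal local sketch of that loop, where the `app` function and the exact-match scoring are stand-ins (real experiments typically use richer evaluators, and the XeroML experiment API is not shown here):

```python
def app(item_input: dict) -> dict:
    """Stand-in for your LLM application."""
    if "reset" in item_input["question"].lower():
        return {"answer": "Go to Settings → API Keys and click Regenerate."}
    return {"answer": "I'm not sure."}

dataset_items = [
    {
        "input": {"question": "How do I reset my API key?"},
        "expected_output": {"answer": "Go to Settings → API Keys and click Regenerate."},
    },
]

# Exact-match scoring: 1 point if the app's output equals expected_output.
results = [app(item["input"]) == item["expected_output"] for item in dataset_items]
accuracy = sum(results) / len(results)
print(accuracy)  # 1.0
```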