Datasets
Datasets are collections of test cases used to evaluate your LLM application systematically. Each item in a dataset provides an input scenario (and optionally an expected output) that your application is run against during an experiment.
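Conceptually, a dataset item is just a structured record. As a plain-Python illustration (not an API call), using the field names this page works with:

```python
# Illustrative only: a dataset item as a plain dict, with the
# input / expected_output / metadata fields described on this page.
item = {
    "input": {"question": "How do I reset my API key?"},
    "expected_output": {"answer": "Go to Settings → API Keys and click Regenerate."},
    "metadata": {"source": "support-ticket-1234"},
}

# During an experiment, your application receives item["input"] and its
# response is compared against item["expected_output"].
print(sorted(item.keys()))
```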
Creating a Dataset
- Navigate to Evaluation → Datasets
- Click New Dataset
- Enter a name. Use `/` separators to organize datasets into virtual folders, e.g. `evaluation/qa-dataset`
- Optionally add a description and metadata
- Click Create
```python
from xeroml import get_client

xeroml = get_client()

dataset = xeroml.create_dataset(
    name="evaluation/qa-dataset",
    description="Q&A pairs from customer support tickets",
    metadata={"author": "eval-team", "date": "2025-01", "type": "qa"},
)
```

```typescript
import { XeroMLClient } from "@xeroml/client";

const xeroml = new XeroMLClient();

const dataset = await xeroml.createDataset({
  name: "evaluation/qa-dataset",
  description: "Q&A pairs from customer support tickets",
  metadata: { author: "eval-team", date: "2025-01", type: "qa" },
});
```

Adding Dataset Items
```python
dataset = xeroml.get_dataset("evaluation/qa-dataset")

dataset.upsert_item(
    input={"question": "How do I reset my API key?"},
    expected_output={"answer": "Go to Settings → API Keys and click Regenerate."},
    metadata={"source": "support-ticket-1234"},
)
```

Add multiple items at once:
```python
items = [
    {"input": {"question": q}, "expected_output": {"answer": a}}
    for q, a in qa_pairs
]

for item in items:
    dataset.upsert_item(**item)
```

Adding Items from Traces

In the Traces list view:
- Filter traces by tag, score, or date range
- Select the traces you want to add
- Click Add to Dataset
- Map trace fields to dataset item fields (input → input, output → expected_output)
This is the fastest way to build a dataset that reflects real production usage.
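The field mapping in step 4 is also easy to express in code if you export traces and add them via the SDK. A sketch, assuming a simple trace dict shape (the `id`/`input`/`output` keys below are an assumption about your export format, not part of the XeroML API):

```python
def trace_to_item(trace: dict) -> dict:
    """Map a trace record to a dataset item:
    input -> input, output -> expected_output,
    keeping the trace id as provenance metadata.
    The trace dict shape here is assumed, not a XeroML API contract."""
    return {
        "input": trace["input"],
        "expected_output": trace["output"],
        "metadata": {"trace_id": trace.get("id")},
    }

trace = {"id": "tr-42", "input": {"question": "Hi?"}, "output": {"answer": "Hello!"}}
item = trace_to_item(trace)
# item can then be passed to dataset.upsert_item(**item)
print(item["metadata"])
```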
Dataset Organization
Use / in dataset names to create virtual folder hierarchies:
```
evaluation/
  qa-dataset
  rag-evaluation
  safety-checks
experiments/
  prompt-v2-baseline
  model-comparison
```

The UI displays these as nested folders automatically.
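The grouping is purely name-based; a small sketch of how `/`-separated names collapse into folders:

```python
from collections import defaultdict

names = [
    "evaluation/qa-dataset",
    "evaluation/rag-evaluation",
    "evaluation/safety-checks",
    "experiments/prompt-v2-baseline",
    "experiments/model-comparison",
]

# Group each dataset under the folder portion of its name.
folders = defaultdict(list)
for name in names:
    folder, _, dataset = name.rpartition("/")
    folders[folder or "(root)"].append(dataset)

print(dict(folders))
# {'evaluation': ['qa-dataset', 'rag-evaluation', 'safety-checks'],
#  'experiments': ['prompt-v2-baseline', 'model-comparison']}
```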
Schema Validation
Optionally enforce a JSON Schema on your dataset items to ensure consistency:
```python
dataset = xeroml.create_dataset(
    name="structured-qa",
    input_schema={
        "type": "object",
        "properties": {
            "question": {"type": "string"},
            "context": {"type": "string"},
        },
        "required": ["question"],
    },
    expected_output_schema={
        "type": "object",
        "properties": {
            "answer": {"type": "string"},
        },
        "required": ["answer"],
    },
)
```

Items that don't match the schema are rejected with a detailed error message.
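To see what "rejected with a detailed error message" means in practice, here is a toy validator covering just the `required`-keys and string-`type` parts of the schema above. It is a simplified stand-in for real JSON Schema validation (in practice you would rely on the server-side check, or use a library such as `jsonschema` to pre-validate locally):

```python
input_schema = {
    "type": "object",
    "properties": {
        "question": {"type": "string"},
        "context": {"type": "string"},
    },
    "required": ["question"],
}

def check_item(item_input: dict, schema: dict) -> list:
    """Toy subset of JSON Schema validation: checks required keys
    and string-typed properties only. Illustrative, not the real check."""
    errors = []
    for key in schema.get("required", []):
        if key not in item_input:
            errors.append(f"missing required property: {key}")
    for key, spec in schema.get("properties", {}).items():
        if key in item_input and spec.get("type") == "string":
            if not isinstance(item_input[key], str):
                errors.append(f"property {key!r} must be a string")
    return errors

errors = check_item({"context": "no question here"}, input_schema)
print(errors)  # ['missing required property: question']
```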
Dataset Versioning
Every modification to a dataset (add, update, delete, archive item) creates a new version timestamp. When running experiments, XeroML records which dataset version was used, making results reproducible.
To retrieve a dataset as it was at a specific point in time:
```python
dataset = xeroml.get_dataset(
    "evaluation/qa-dataset",
    as_of="2025-01-15T10:00:00Z",
)
```

Next Steps
Once you have a dataset, run experiments against it:
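Conceptually, an experiment runs your application over every dataset item and compares each output against `expected_output`. A minimal local sketch of that loop, where the `app` function and the exact-match scoring are stand-ins (real experiments typically use richer evaluators, and the XeroML experiment API is not shown here):

```python
def app(item_input: dict) -> dict:
    """Stand-in for your LLM application."""
    if "reset" in item_input["question"].lower():
        return {"answer": "Go to Settings → API Keys and click Regenerate."}
    return {"answer": "I'm not sure."}

dataset_items = [
    {
        "input": {"question": "How do I reset my API key?"},
        "expected_output": {"answer": "Go to Settings → API Keys and click Regenerate."},
    },
]

# Exact-match scoring: 1 point if the app's output equals expected_output.
results = [app(item["input"]) == item["expected_output"] for item in dataset_items]
accuracy = sum(results) / len(results)
print(accuracy)  # 1.0
```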