Define datasets | Opik Documentation

In Opik 2.0, datasets and experiments are project-scoped. Make sure to specify a project_name when creating datasets and running experiments so they are associated with the correct project.

The optimizer evaluates candidate prompts against datasets stored in Opik. If you are new to datasets in Opik, start with Manage datasets; this page covers the details that are specific to the optimizer.

Datasets are central to the optimizer SDK: the optimizer runs each candidate prompt against the dataset items and scores the outputs, then uses those scores to search for a better prompt. Without a dataset, the optimizer has no signal for what counts as a good or bad output.

Dataset schema

Every item is a JSON object. Required keys depend on your prompt template; optional keys help with analysis. Schemas are optional: define only the fields your prompt or metrics actually consume.

Field	Purpose
`inputs` (e.g., `question`, `context`)	Values substituted into your `ChatPrompt` placeholders.
`answer` / `label`	Ground truth used by metrics.
`metadata`	Arbitrary dict for tagging scenario, split, or difficulty.
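
To make the table concrete, here is a minimal sketch of one dataset item and how its input fields could fill a prompt template. The field values, the `template` string, and the `input_keys` tuple are illustrative assumptions, and plain `str.format` stands in for whatever substitution `ChatPrompt` performs internally.

```python
# Hypothetical dataset item: the keys are illustrative, not a required schema.
item = {
    "question": "What does the optimizer evaluate?",
    "context": "The optimizer evaluates candidate prompts against datasets stored in Opik.",
    "answer": "Candidate prompts.",
    "metadata": {"split": "train", "difficulty": "easy"},
}

# Assumed template with single-brace placeholders such as {context}.
template = "Context: {context}\nQuestion: {question}"

# Only the input fields are substituted; `answer` and `metadata` stay out of the prompt.
input_keys = ("question", "context")
rendered = template.format(**{key: item[key] for key in input_keys})
print(rendered)
```

The ground-truth `answer` never reaches the prompt; it is only read by the metric when scoring the model output.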

Create or load datasets

Create via SDK

import opik

client = opik.Opik()
dataset = client.get_or_create_dataset(name="agent-opt-support", project_name="my-project")
dataset.insert([
    {"question": "Summarize Opik.", "answer": "Opik is an LLM observability platform."},
    {"question": "List two optimizer types.", "answer": "MetaPrompt and HRPO."},
])

Upload from file

Prepare a CSV or Parquet file with column headers that match your prompt variables.
Load the file via Python (e.g., pandas) and call dataset.insert(...) or related helpers from the Dataset SDK.
Verify in the UI that rows include metadata if you plan to filter by scenario.
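
The file-upload steps above can be sketched as follows. This version uses the standard-library `csv` module instead of pandas so it runs without extra dependencies; the CSV content, the `scenario` column, and the choice to move it under `metadata` are assumptions for illustration.

```python
import csv
import io

# Stand-in for a file on disk; with a real file, use open("data.csv", newline="").
csv_text = """question,answer,scenario
What is Opik?,Opik is an LLM observability platform.,faq
List two optimizer types.,MetaPrompt and HRPO.,optimizers
"""

records = []
for row in csv.DictReader(io.StringIO(csv_text)):
    records.append(
        {
            "question": row["question"],
            "answer": row["answer"],
            # Move the hypothetical `scenario` column into metadata for later filtering.
            "metadata": {"scenario": row["scenario"]},
        }
    )

print(len(records), records[0]["metadata"])
# With an Opik dataset handle, the rows would then be inserted with:
# dataset.insert(records)
```

Building the list of dicts first makes it easy to inspect a few rows locally before anything is written to Opik.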

Use built-in samples

The optimizer SDK provides ready-made datasets for quick experiments:

from opik_optimizer import datasets
hotpot = datasets.hotpot(count=300)
tiny = datasets.tiny_test()

These datasets live in sdks/opik_optimizer/src/opik_optimizer/datasets and mirror the notebook examples. Most helpers accept common slice controls like:

split (e.g., "train", "validation")
count and start (slice size + offset after shuffling)
seed (deterministic shuffle)
filter_by (filter rows before slicing)

Some helpers also expose prefer_presets to override dataset-defined presets.
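
The interaction between `seed`, `start`, and `count` can be illustrated with a small standalone sketch. It mirrors the behavior described above (deterministic shuffle, then offset and slice) but is not the library's implementation; `slice_rows` is a hypothetical helper.

```python
import random

def slice_rows(rows, count=None, start=0, seed=42):
    """Illustrative only: shuffle deterministically, then take `count` rows from `start`."""
    shuffled = list(rows)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    end = None if count is None else start + count
    return shuffled[start:end]

rows = [{"id": i} for i in range(10)]
first = slice_rows(rows, count=3, start=0, seed=7)
again = slice_rows(rows, count=3, start=0, seed=7)
next_page = slice_rows(rows, count=3, start=3, seed=7)

print(first == again)  # same seed, same slice
print([r["id"] for r in first], [r["id"] for r in next_page])
```

Because the shuffle is seeded, two calls with the same seed see the same ordering, so different `start` offsets yield non-overlapping slices.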

Filter dataset rows

Use filter_by to select a subset of rows before slicing. Filters support:

exact match ({"task_id": "e57337a4"})
membership ({"type": {"bridge", "comparison"}})
callables ({"task_id": lambda value: value.startswith("e57")})

from opik_optimizer.datasets import arc_agi2, hotpot

dataset = arc_agi2(
    split="train",
    count=1,
    prefer_presets=False,
    filter_by={"task_id": "e57337a4"},
)

dataset = hotpot(
    split="validation",
    count=100,
    filter_by={"type": {"bridge", "comparison"}},
)
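
To show how the three filter forms behave, here is a rough standalone sketch of the matching logic. The `matches` helper and the sample rows are hypothetical, and treating multiple keys as a logical AND is an assumption rather than documented behavior.

```python
def matches(row, filter_by):
    """Illustrative matcher: every key in `filter_by` must accept the row's value."""
    for key, condition in filter_by.items():
        value = row.get(key)
        if callable(condition):
            ok = condition(value)  # callable: arbitrary predicate on the value
        elif isinstance(condition, (set, frozenset, list, tuple)):
            ok = value in condition  # membership: value must be one of the options
        else:
            ok = value == condition  # exact match
        if not ok:
            return False
    return True

rows = [
    {"task_id": "e57337a4", "type": "bridge"},
    {"task_id": "e57aaaaa", "type": "comparison"},
    {"task_id": "0a1b2c3d", "type": "other"},
]

exact = [r for r in rows if matches(r, {"task_id": "e57337a4"})]
member = [r for r in rows if matches(r, {"type": {"bridge", "comparison"}})]
by_prefix = [r for r in rows if matches(r, {"task_id": lambda v: v.startswith("e57")})]
print(len(exact), len(member), len(by_prefix))
```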

Use Hugging Face datasets

If you already have a Hugging Face dataset, you can ingest it into Opik and use it with any optimizer:

from datasets import load_dataset
import opik

hf = load_dataset("your-org/your-dataset", split="train")
records = [
    {"question": row["question"], "answer": row["answer"]}
    for row in hf.select(range(200))
]

client = opik.Opik()
dataset = client.get_or_create_dataset("your-hf-train", project_name="my-project")
dataset.insert(records)

You can also wrap Hugging Face datasets with a custom helper by following the patterns in sdks/opik_optimizer/src/opik_optimizer/datasets (using DatasetSpec + DatasetHandle) if you want the same split/count/filter_by interface as the built-in datasets.

Train/validation splits

Overfitting occurs when an optimized prompt performs well on the examples it was trained on but fails to generalize to new, unseen data. To prevent this, split your dataset into separate sets:

Training dataset (70-80%): Used by the optimizer to generate prompt improvements
Validation dataset (20-30%): Used to evaluate and rank candidate prompts during optimization, helping select prompts that generalize well
Test dataset (optional, separate): Held out completely until after optimization to measure final real-world performance

The optimizer uses the training set for learning and the validation set for selection, ensuring the best prompt works beyond the training examples.

import opik

client = opik.Opik()

# Create training dataset (70-80% of your data)
training_dataset = client.get_or_create_dataset(name="agent-opt-train", project_name="my-project")
training_dataset.insert([
    {"question": "What is Opik?", "answer": "Opik is an LLM observability platform."},
    {"question": "List optimizer types.", "answer": "MetaPrompt, Evolutionary, etc."},
    # ... more training examples
])

# Create validation dataset (20-30% of your data)
validation_dataset = client.get_or_create_dataset(name="agent-opt-val", project_name="my-project")
validation_dataset.insert([
    {"question": "Explain Opik's purpose.", "answer": "Opik helps monitor LLMs."},
    {"question": "Name two optimizers.", "answer": "GEPA and Few-Shot Bayesian."},
    # ... more validation examples
])

# Use both during optimization.
# `optimizer`, `my_prompt`, and `my_metric` are assumed to be defined elsewhere
# (see Optimize prompts and Define metrics).
result = optimizer.optimize_prompt(
    prompt=my_prompt,
    dataset=training_dataset,
    validation_dataset=validation_dataset,
    metric=my_metric,
)

Split recommendations:

70/30 or 80/20 is standard for training/validation splits
Ensure diversity in both sets to cover different scenarios
Keep validation data unseen during prompt development
Use the same distribution in both sets to ensure valid evaluation
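
One simple way to produce such a split from a single list of records is a seeded shuffle followed by a cut, sketched below. The `split_records` helper and the sample records are hypothetical; the resulting lists would then be passed to `insert(...)` on the training and validation datasets.

```python
import random

def split_records(records, train_fraction=0.8, seed=42):
    """Shuffle a copy of `records` deterministically and cut it into train/validation."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

records = [{"question": f"Q{i}", "answer": f"A{i}"} for i in range(10)]
train_records, val_records = split_records(records, train_fraction=0.8)
print(len(train_records), len(val_records))
```

Shuffling before the cut helps both sets follow the same distribution, and the fixed seed keeps the split reproducible across runs.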

Testing on held-out data

After optimization completes, evaluate the final prompt on a completely held-out test dataset to confirm it generalizes to production scenarios:

import opik
from opik.evaluation import evaluate_prompt

# Reuse the client from the earlier examples, or create one here.
client = opik.Opik()

# After optimization, test on unseen data
test_dataset = client.get_dataset(name="agent-opt-test")

# `result` and `my_metric` come from the optimization run above.
test_results = evaluate_prompt(
    prompt=result.prompt,  # Best prompt from optimization
    dataset=test_dataset,
    scoring_metrics=[my_metric],
    task_threads=4,
    project_name="my-project",
)

print(f"Test score: {test_results.mean_scores}")

Because the test set played no part in generating or selecting prompts, this final score is the best available estimate of how well the improvements will transfer to real-world usage.

Best practices

Keep datasets immutable during an optimization run; create a new dataset version if you need to add rows.
Use validation datasets to avoid overfitting: split your data 70/30 or 80/20 between training and validation sets.
Log context fields if you run RAG-style prompts so failure analyses can surface missing passages.
Track splits via metadata (e.g., metadata["split"] = "eval") for additional organization beyond separate datasets.
Document ownership using dataset descriptions so teams know who curates each collection.
Keep schema and prompt in sync: if your prompt expects {context}, ensure every dataset row defines that key or provide defaults in the optimizer.
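
The schema and prompt sync check in the last item can be automated before an optimization run. The sketch below assumes single-brace placeholders such as {context}; `missing_placeholders`, the template, and the rows are hypothetical and independent of the Opik SDK.

```python
import re

def missing_placeholders(template, rows):
    """Return {row_index: [missing keys]} for rows that lack a placeholder the template needs."""
    required = set(re.findall(r"\{(\w+)\}", template))
    problems = {}
    for index, row in enumerate(rows):
        missing = required - set(row)
        if missing:
            problems[index] = sorted(missing)
    return problems

template = "Context: {context}\nQuestion: {question}"
rows = [
    {"question": "What is Opik?", "context": "Opik docs", "answer": "A platform."},
    {"question": "List optimizer types.", "answer": "MetaPrompt, Evolutionary, etc."},
]

problems = missing_placeholders(template, rows)
print(problems)
```

An empty result means every row defines every placeholder; otherwise the mapping points at the rows to fix before uploading.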

Validation checklist

Confirm row counts in the Opik Datasets tab (or by running len(dataset.get_items()) in Python) before and after uploads.
Spot-check rows in the dashboard’s Dataset viewer.
If rows include multimodal assets or tool payloads, confirm they appear in the trace tree once you run an optimization.
Run an initial small-batch optimization with a few rows of data to validate everything end to end.

Next steps

Define how you will score results with Define metrics, then follow Optimize prompts to launch experiments. For domain-specific scoring, extend the dataset with extra fields and reference them inside Custom metrics.