In Opik 2.0, datasets and experiments are project-scoped. Make sure to specify a project_name when creating datasets and running experiments so they are associated with the correct project.
The optimizer evaluates candidate prompts against datasets stored in Opik. If you are brand new to datasets in Opik, start with Manage datasets; this page highlights specific tips to get you started.
Datasets are central to the optimizer SDK: each optimizer runs candidate prompts against the dataset items and scores the outputs, and those scores steer the search toward better prompts. Without a dataset, the optimizer has no signal for what counts as a good or bad result.
Every item is a JSON object. Required keys depend on your prompt template; optional keys help with analysis. Schemas are optional—define only the fields your prompt or metrics actually consume.
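As a minimal sketch of "required keys depend on your prompt template", the snippet below extracts the placeholders from a template and reports which items lack them. It assumes `str.format`-style placeholders such as `{context}`; the helper names `template_fields` and `find_incomplete_items` and the sample items are hypothetical and not part of the Opik SDK.

```python
from string import Formatter


def template_fields(template: str) -> set:
    """Return the placeholder names used in a str.format-style template."""
    return {name for _, name, _, _ in Formatter().parse(template) if name}


def find_incomplete_items(template: str, items: list) -> list:
    """Return (index, missing_keys) for every item lacking a required key."""
    required = template_fields(template)
    problems = []
    for index, item in enumerate(items):
        missing = required - set(item)
        if missing:
            problems.append((index, missing))
    return problems


# Hypothetical dataset items: plain JSON-like dictionaries.
items = [
    {"question": "What is the capital of France?", "context": "Geography quiz", "answer": "Paris"},
    {"question": "Who wrote Hamlet?", "answer": "Shakespeare"},  # no "context" key
]
template = "Use the context to answer.\nContext: {context}\nQuestion: {question}"

# The second item (index 1) is reported because it has no "context" key.
print(find_incomplete_items(template, items))
```

Running a check like this before uploading makes missing fields visible early, instead of surfacing later as empty prompt variables.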
- Upload items with dataset.insert(...) or related helpers from the Dataset SDK.
- Include metadata if you plan to filter by scenario.

The optimizer SDK provides ready-made datasets for quick experiments:
These datasets live in sdks/opik_optimizer/src/opik_optimizer/datasets and mirror the notebook examples. Most helpers accept common slice controls like:
- split (e.g., "train", "validation")
- count and start (slice size and offset after shuffling)
- seed (deterministic shuffle)
- filter_by (filter rows before slicing)
Some helpers also expose prefer_presets to override dataset-defined presets.

Use filter_by to select a subset of rows before slicing. Filters support:
- Exact matches (e.g., {"task_id": "e57337a4"})
- Set membership (e.g., {"type": {"bridge", "comparison"}})
- Callable predicates (e.g., {"task_id": lambda value: value.startswith("e57")})

If you already have a Hugging Face dataset, you can ingest it into Opik and use it with any optimizer:
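A rough sketch of the conversion step is shown below: it renames the columns of each source row to the keys your prompt expects, producing a list of dictionaries suitable for dataset.insert(...). The helper `rows_to_items`, the column names, and the `"source"` metadata field are hypothetical; the `rows` list stands in for what you would get by iterating a Hugging Face dataset split.

```python
def rows_to_items(rows, field_map, source=None):
    """Rename source columns to the keys your prompt template expects.

    field_map maps a source column name to the target item key.
    Columns not listed in field_map are dropped.
    """
    items = []
    for row in rows:
        item = {target: row[column] for column, target in field_map.items()}
        if source is not None:
            # Optional metadata, useful if you later filter by scenario.
            item["metadata"] = {"source": source}
        items.append(item)
    return items


# Stand-in for rows obtained by iterating a Hugging Face dataset split.
rows = [
    {"query": "What is 2 + 2?", "gold": "4", "id": "a1"},
    {"query": "Name the largest planet.", "gold": "Jupiter", "id": "a2"},
]

items = rows_to_items(rows, {"query": "question", "gold": "answer"}, source="hf-demo")
print(items[0])

# With an Opik dataset object in hand, the upload step would then be:
# dataset.insert(items)
```

Keeping the mapping explicit means the stored items contain only the fields your prompt or metrics actually consume.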
You can also wrap Hugging Face datasets with a custom helper by following the patterns in sdks/opik_optimizer/src/opik_optimizer/datasets (using DatasetSpec + DatasetHandle) if you want the same split/count/filter_by interface as the built-in datasets.
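To make the split/count/filter_by semantics concrete, here is a plain-Python sketch of how such slice controls could behave: filter first, then shuffle with a seed, then apply the offset and size. This is an illustration under those assumptions, not the library's implementation; `matches` and `select_rows` are hypothetical names, and treating a missing key as a non-match is a choice made for this sketch.

```python
import random


def matches(row, filter_by):
    """Return True if the row satisfies every condition in filter_by."""
    for key, condition in filter_by.items():
        if key not in row:
            return False  # assumption: rows missing the key never match
        value = row[key]
        if callable(condition):
            if not condition(value):
                return False
        elif isinstance(condition, (set, frozenset, list, tuple)):
            if value not in condition:
                return False
        elif value != condition:
            return False
    return True


def select_rows(rows, filter_by=None, seed=42, start=0, count=None):
    """Filter rows, shuffle deterministically, then slice by start and count."""
    selected = [row for row in rows if matches(row, filter_by or {})]
    random.Random(seed).shuffle(selected)
    end = None if count is None else start + count
    return selected[start:end]


rows = [
    {"task_id": "e57337a4", "type": "bridge"},
    {"task_id": "e57aa001", "type": "comparison"},
    {"task_id": "b1200000", "type": "bridge"},
    {"task_id": "c9900000", "type": "other"},
]

exact = select_rows(rows, {"task_id": "e57337a4"})
by_type = select_rows(rows, {"type": {"bridge", "comparison"}})
by_prefix = select_rows(rows, {"task_id": lambda value: value.startswith("e57")}, count=1)
print(len(exact), len(by_type), len(by_prefix))
```

Because the shuffle uses a fixed seed, repeated calls with the same arguments return the same rows in the same order, which keeps experiments comparable.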
Overfitting occurs when an optimized prompt performs well on the examples it was trained on but fails to generalize to new, unseen data. To prevent this, split your dataset into separate sets:
The optimizer uses the training set for learning and the validation set for selection, ensuring the best prompt works beyond the training examples.
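One way to produce the two sets is a seeded shuffle followed by a cut, sketched below in plain Python. The helper `split_items` is hypothetical, and the 20% validation fraction is an arbitrary default chosen for illustration, not a recommendation from this page; each resulting list would then be uploaded as its own dataset.

```python
import random


def split_items(items, validation_fraction=0.2, seed=42):
    """Return (train_items, validation_items) using a deterministic shuffle."""
    if not 0 < validation_fraction < 1:
        raise ValueError("validation_fraction must be between 0 and 1")
    if len(items) < 2:
        raise ValueError("need at least two items to split")
    shuffled = list(items)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    # Keep at least one item on each side of the cut.
    n_validation = min(len(shuffled) - 1, max(1, round(len(shuffled) * validation_fraction)))
    return shuffled[n_validation:], shuffled[:n_validation]


items = [{"question": f"q{i}", "answer": f"a{i}"} for i in range(10)]
train_items, validation_items = split_items(items)
print(len(train_items), len(validation_items))
```

Fixing the seed keeps the split stable across runs, so score changes reflect the prompt rather than a different partition of the data.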
Split recommendations:
After optimization completes, evaluate the final prompt on a completely held-out test dataset to confirm it generalizes to production scenarios:
This final test score gives you confidence that improvements will transfer to real-world usage.
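The comparison itself can be as simple as averaging a metric over the held-out items for the baseline and the optimized prompt. The sketch below uses a hypothetical exact-match metric and canned outputs in place of real model calls; `exact_match`, `mean_score`, and the sample data are illustrative names, not Opik APIs.

```python
def exact_match(output: str, expected: str) -> float:
    """Score 1.0 when output and expected agree, ignoring case and outer whitespace."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def mean_score(items, predict, metric=exact_match) -> float:
    """Average the metric over all items; predict maps an item to a model output."""
    if not items:
        raise ValueError("items must not be empty")
    scores = [metric(predict(item), item["answer"]) for item in items]
    return sum(scores) / len(scores)


# Held-out items that were never used for training or validation.
test_items = [
    {"question": "2 + 2", "answer": "4"},
    {"question": "Capital of France", "answer": "Paris"},
    {"question": "Opposite of hot", "answer": "cold"},
    {"question": "Color of the sky", "answer": "blue"},
]

# Canned outputs standing in for calls to a model with each prompt.
baseline_outputs = {"2 + 2": "4", "Capital of France": "paris ", "Opposite of hot": "warm", "Color of the sky": "green"}
optimized_outputs = {"2 + 2": "4", "Capital of France": "Paris", "Opposite of hot": "cold", "Color of the sky": "green"}

baseline = mean_score(test_items, lambda item: baseline_outputs[item["question"]])
optimized = mean_score(test_items, lambda item: optimized_outputs[item["question"]])
print(baseline, optimized)
```

If the optimized prompt scores higher on validation but not on the held-out items, that gap is a sign of overfitting to the optimization data.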
- Tag items with metadata (e.g., metadata["split"] = "eval") for additional organization beyond separate datasets.
- If your prompt references a variable such as {context}, ensure every dataset row defines that key or provide defaults in the optimizer.
- Check item counts (e.g., len(dataset.get_items()) in Python) before and after uploads.

Define how you will score results with Define metrics, then follow Optimize prompts to launch experiments. For domain-specific scoring, extend the dataset with extra fields and reference them inside Custom metrics.