Define datasets
The optimizer evaluates candidate prompts against datasets stored in Opik. If you are brand new to datasets in Opik, start with Manage datasets; this page highlights specific tips to get you started.
Datasets are a crucial component of the optimizer SDK: the optimizer runs and scores each dataset item to judge how well a candidate prompt performs. Without a dataset, there is nothing to tell the optimizer which outputs are good and which are bad.
Dataset schema
Every item is a JSON object. Required keys depend on your prompt template; optional keys help with analysis. Schemas are optional—define only the fields your prompt or metrics actually consume.
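For example, a hypothetical item for a question-answering prompt that expects a `{question}` variable might look like the following; the `expected_answer` and `metadata` keys are consumed by metrics and filtering, not by the prompt itself.

```python
# Hypothetical dataset item for a prompt template that uses {question}.
item = {
    "question": "How do I reset my password?",                                 # required by the prompt
    "expected_answer": "Use the 'Forgot password' link on the sign-in page.",  # used by metrics
    "metadata": {"split": "train", "scenario": "account-recovery"},            # optional, for filtering/analysis
}
```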
Create or load datasets
Upload from file
- Prepare a CSV or Parquet file with column headers that match your prompt variables.
- Load the file via Python (e.g., pandas) and call `dataset.insert(...)` or related helpers from the Dataset SDK, as sketched below.
- Verify in the UI that rows include `metadata` if you plan to filter by scenario.
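A minimal sketch of the upload flow, assuming a local `support_questions.csv` file whose column headers match your prompt variables and a dataset name of your choosing:

```python
import pandas as pd
from opik import Opik

# Column headers ("question", "expected_answer", ...) should match your prompt variables.
df = pd.read_csv("support_questions.csv")

client = Opik()
dataset = client.get_or_create_dataset(name="support-questions")

# Each row becomes one dataset item (a JSON object).
dataset.insert(df.to_dict(orient="records"))
```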
Train/validation splits
Overfitting occurs when an optimized prompt performs well on the examples it was trained on but fails to generalize to new, unseen data. To prevent this, split your dataset into separate sets:
- Training dataset (70-80%): Used by the optimizer to generate prompt improvements
- Validation dataset (20-30%): Used to evaluate and rank candidate prompts during optimization, helping select prompts that generalize well
- Test dataset (optional, separate): Held out completely until after optimization to measure final real-world performance
The optimizer uses the training set for learning and the validation set for selection, ensuring the best prompt works beyond the training examples.
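A minimal sketch of an 80/20 split using the SDK, assuming an existing `support-questions` dataset; any `id` keys returned by `get_items()` are dropped so the new datasets assign their own item IDs:

```python
import random
from opik import Opik

client = Opik()
full_dataset = client.get_or_create_dataset(name="support-questions")

# Shuffle with a fixed seed so the split is reproducible across runs.
items = full_dataset.get_items()
random.seed(42)
random.shuffle(items)

split_index = int(len(items) * 0.8)  # 80/20 train/validation split

train_dataset = client.get_or_create_dataset(name="support-questions-train")
validation_dataset = client.get_or_create_dataset(name="support-questions-validation")

def strip_id(item: dict) -> dict:
    # Drop any "id" key so each new dataset assigns fresh item IDs.
    return {k: v for k, v in item.items() if k != "id"}

train_dataset.insert([strip_id(item) for item in items[:split_index]])
validation_dataset.insert([strip_id(item) for item in items[split_index:]])
```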
Split recommendations:
- 70/30 or 80/20 is standard for training/validation splits
- Ensure diversity in both sets to cover different scenarios
- Keep validation data unseen during prompt development
- Use the same distribution in both sets to ensure valid evaluation
Testing on held-out data
After optimization completes, evaluate the final prompt on a completely held-out test dataset to confirm it generalizes to production scenarios:
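The sketch below uses Opik's `evaluate` helper and assumes a held-out `support-questions-test` dataset, `question` / `expected_answer` keys, and a hypothetical `run_optimized_prompt` function that calls your model with the final prompt; swap in whichever metric you used during optimization.

```python
from opik import Opik
from opik.evaluation import evaluate
from opik.evaluation.metrics import LevenshteinRatio

client = Opik()
test_dataset = client.get_or_create_dataset(name="support-questions-test")

def evaluation_task(item: dict) -> dict:
    # run_optimized_prompt is a hypothetical helper that calls your model
    # with the optimized prompt and returns the text output.
    output = run_optimized_prompt(item["question"])
    return {"output": output, "reference": item["expected_answer"]}

evaluation = evaluate(
    dataset=test_dataset,
    task=evaluation_task,
    scoring_metrics=[LevenshteinRatio()],
    experiment_name="optimized-prompt-holdout-test",
)
```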
This final test score gives you confidence that improvements will transfer to real-world usage.
Best practices
- Keep datasets immutable during an optimization run; create a new dataset version if you need to add rows.
- Use validation datasets to avoid overfitting—split your data 70/30 or 80/20 between training and validation sets.
- Log context fields if you run RAG-style prompts so failure analyses can surface missing passages.
- Track splits via metadata (e.g., `metadata["split"] = "eval"`) for additional organization beyond separate datasets.
- Document ownership using dataset descriptions so teams know who curates each collection.
- Keep schema and prompt in sync: if your prompt expects `{context}`, ensure every dataset row defines that key or provide defaults in the optimizer.
Validation checklist
- Confirm row counts in the Opik Datasets tab (or by running `len(dataset.get_items())` in Python) before and after uploads.
- Spot-check rows in the dashboard’s Dataset viewer.
- If rows include multimodal assets or tool payloads, confirm they appear in the trace tree once you run an optimization.
- Run an initial small-batch optimization with a few rows of data to validate everything end to end.
Next steps
Define how you will score results with Define metrics, then follow Optimize prompts to launch experiments. For domain-specific scoring, extend the dataset with extra fields and reference them inside Custom metrics.