Optimizer Datasets & Testing

Configure and understand Opik Datasets for optimization

Overview

Datasets and testing are critical to the effectiveness of the Opik Agent Optimizer. This section outlines the requirements for datasets and the methodologies for testing to ensure optimal performance.

This page focuses on dataset requirements specifically for Opik Agent Optimizer. For a general guide on creating, managing, and versioning your own datasets within the Opik platform (using the SDK or UI), please refer to the Manage Datasets documentation.

Dataset Requirements for Optimization

Format

For use with Opik Agent Optimizer, datasets should generally be a list of dictionaries. Each dictionary represents a data point (an “item”) and should contain key-value pairs that your chosen optimizer and configuration will use. Typically, this includes:

  • At least one field for input to the LLM (e.g., a question, a statement to complete).
  • At least one field for the expected output or reference (e.g., the ground truth answer, a desired summary).
```python
# Example of a dataset item format
dataset = [
    {
        "question": "What is the capital of France?",  # Input field
        "expected_answer": "Paris",  # Output/reference field
        "category": "geography",  # Optional metadata
    },
    {
        "user_query": "Translate 'hello' to Spanish.",
        "ground_truth_translation": "Hola",
        "difficulty": "easy",
    },
    # ... more examples
]
```

The exact field names ("question", "expected_answer", etc.) are up to you. You can reference any of these fields in your ChatPrompt and metric function.
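
For instance, a prompt and metric wired to the first item format above might look like the following minimal sketch. It assumes Opik's built-in LevenshteinRatio heuristic metric; the `{question}` placeholder is filled from each dataset item's "question" field, and "expected_answer" is the reference field chosen above:

```python
from opik.evaluation.metrics import LevenshteinRatio
from opik.evaluation.metrics.score_result import ScoreResult
from opik_optimizer import ChatPrompt

# The {question} placeholder is populated from each dataset item
prompt = ChatPrompt(
    messages=[
        {"role": "system", "content": "Answer concisely."},
        {"role": "user", "content": "{question}"},
    ]
)

# The metric compares the LLM output against the item's "expected_answer" field
def metric(dataset_item: dict, llm_output: str) -> ScoreResult:
    return LevenshteinRatio().score(
        reference=dataset_item["expected_answer"],
        output=llm_output,
    )
```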

Size Recommendations

  • Minimum: 50 examples
    • Provides basic coverage for optimization.
    • Suitable for simple use cases or initial experimentation.
  • Optimal: 100-500 examples
    • Offers a better representation of real-world scenarios.
    • Leads to more reliable and generalizable optimization results.
  • Maximum: Context-window and budget dependent
    • Limited by the LLM’s maximum context length (especially for few-shot examples).
    • Larger datasets increase evaluation time and cost during optimization.

Quality Guidelines

  1. Diversity
    • Cover a wide range of scenarios, topics, and complexities relevant to your task.
    • Include different types of inputs and potential edge cases.
    • Represent real-world usage patterns accurately.
  2. Accuracy
    • Ensure ground truth outputs/references are correct, consistent, and unambiguous.
    • Validate all examples carefully.
    • Maintain consistency in formatting if the LLM is expected to learn it.
  3. Representation
    • Balance different types of examples if your task has distinct sub-categories.
    • Include both simple and complex cases to test robustness.

Demo Datasets (opik_optimizer.datasets)

To help you get started quickly and for testing purposes, the opik-optimizer package includes a demo module, opik_optimizer.datasets, that provides a number of example datasets.

```python
from opik_optimizer.datasets import tiny_test, hotpot_300

# Load a small demo dataset for quick testing
tiny_test_dataset = tiny_test()

# Load a subset of the HotpotQA dataset
hotpot_dataset = hotpot_300()

print(f"Loaded {len(tiny_test_dataset.get_items())} items from tiny-test.")
```

When you load a demo dataset, it first checks if a dataset with that name (or {name}_test if test_mode=True) already exists in your Opik project. If so, and it has data, it returns the existing dataset. Otherwise, it loads the data and creates/populates the dataset in Opik.
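
For instance, assuming the loader functions accept the test_mode flag described above, a quick smoke-test load might look like:

```python
from opik_optimizer.datasets import tiny_test

# Assumes the test_mode flag described above: this would create/load a
# "{name}_test" dataset in Opik instead of the full demo dataset
smoke_test_dataset = tiny_test(test_mode=True)
```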

Some available demo datasets include:

  • tiny_test: A micro dataset, also known as the Tiny QA Benchmark, useful for quick smoke tests.
  • hotpot_300, hotpot_500: Subsets of the HotpotQA dataset for question answering.
  • halu_eval_300: A subset of the HaluEval dataset for hallucination detection.
  • gsm8k: A subset of the GSM8K dataset for grade-school math problems.
  • ai2_arc: A subset of the AI2 ARC dataset for science questions.
  • truthful_qa: A subset of the TruthfulQA dataset for evaluating truthfulness.
  • And several others (see sdks/opik_optimizer/src/opik_optimizer/datasets/__init__.py in the Opik repository for the full list and their structures).

These demo datasets provide ready-to-use examples with various structures and are used throughout our documentation examples.

Relationship to Manage Datasets:

  • The opik_optimizer.datasets module is for loading pre-defined example datasets provided with the SDK for demonstration and testing the optimizers.
  • The general Manage Datasets guide, on the other hand, explains how you can create, upload, and manage your own custom datasets using the Opik SDK (client.create_dataset(), dataset.insert(), etc.) or the Opik UI. These are the datasets you would typically curate with your own data for optimizing prompts for your specific applications.
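
For comparison, creating and populating your own custom dataset with the SDK calls named above might look like the following sketch (the dataset name and items are placeholders):

```python
import opik

client = opik.Opik()

# Create a custom dataset and populate it with your own items;
# "my-support-questions" is a placeholder name
dataset = client.create_dataset(name="my-support-questions")
dataset.insert([
    {"question": "How do I reset my password?", "expected_answer": "..."},
    {"question": "Where can I find my invoices?", "expected_answer": "..."},
])
```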

Testing Methodology with Opik Agent Optimizer

While the optimizers primarily use a single dataset (passed to optimize_prompt) for both guiding the optimization (e.g., selecting few-shot examples or generating new prompts) and evaluating candidate prompts, a robust testing methodology often involves more.

  1. Optimization Dataset: The dataset used during the optimize_prompt call. The optimizer will use this to score prompts and make decisions. Performance on this dataset shows how well the optimizer adapted to it.

  2. Hold-out / Test Dataset (Recommended): After optimization, it’s crucial to evaluate the best prompt found by the optimizer on a separate dataset that was not seen during optimization. This gives a more realistic measure of how well the prompt generalizes. Note that the optimizers will not carry out an automatic hold-out split for you; you must set this data aside yourself.

   ```python
   from opik_optimizer import ChatPrompt

   # After running optimizer.optimize_prompt(dataset=optimization_dataset, ...)
   best_prompt = results.prompt

   # Evaluate on a separate test_dataset
   test_score = optimizer.evaluate_prompt(
       dataset=my_holdout_test_dataset,
       metric=metric,
       prompt=ChatPrompt(messages=best_prompt),
   )
   print(f"Score on hold-out test set: {test_score}")
   ```
  3. Metrics: Use a comprehensive set of metrics relevant to your task. While a single metric (passed as metric) guides the optimization, evaluating the final prompt with multiple metrics can provide a fuller picture.

  4. Baseline Comparison: Always compare the performance of the optimized prompt against your original, unoptimized prompt (and potentially other baselines) on the same test dataset, as in the sketch after this list.
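
A sketch of that baseline comparison, assuming evaluate_prompt returns a numeric score (as in the snippet above) and that original_prompt is your unoptimized ChatPrompt:

```python
from opik_optimizer import ChatPrompt

# Score the original and optimized prompts on the same hold-out set;
# my_holdout_test_dataset and metric are the hypothetical names used above
baseline_score = optimizer.evaluate_prompt(
    dataset=my_holdout_test_dataset,
    metric=metric,
    prompt=original_prompt,
)
optimized_score = optimizer.evaluate_prompt(
    dataset=my_holdout_test_dataset,
    metric=metric,
    prompt=ChatPrompt(messages=results.prompt),
)
print(f"Baseline: {baseline_score:.3f} -> Optimized: {optimized_score:.3f}")
```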

The opik_optimizer itself does not enforce a strict train/validation/test split within the optimize_prompt call in the same way some traditional ML model training frameworks do. You are responsible for managing your datasets appropriately for robust evaluation (e.g. by having a separate test set as described above).
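
One simple way to manage this split yourself, as a sketch (the dataset names are placeholders, and client is an opik.Opik instance as in the earlier snippet):

```python
import random

# items: your full list of dataset dictionaries
random.seed(42)
random.shuffle(items)
split = int(len(items) * 0.8)

# 80% of items guide the optimization; 20% are held out for final evaluation
optimization_dataset = client.create_dataset(name="my-task-optimization")
optimization_dataset.insert(items[:split])

my_holdout_test_dataset = client.create_dataset(name="my-task-holdout")
my_holdout_test_dataset.insert(items[split:])
```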

How Data is Used in Optimization

  1. Guidance & Candidate Generation:

    • FewShotBayesianOptimizer: Selects examples from the dataset to form few-shot prompts.
    • MetaPromptOptimizer / EvolutionaryOptimizer (with LLM operators): May use context from dataset items (e.g., sample inputs/outputs) when instructing an LLM to generate or modify prompt candidates.
    • MiproOptimizer: Uses the dataset to bootstrap demonstrations for DSPy modules.
  2. Evaluation:

    • All optimizers evaluate candidate prompts by running them against (a subset of) the provided dataset.
    • The variables in the ChatPrompt are populated with the values from the dataset items.
    • The LLM generates an output.
    • This output is scored using the configured metric.
    • The resulting score guides the optimizer’s search process (see the conceptual sketch after this list).
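
Conceptually, that evaluation step reduces to a loop like the following sketch. This is not the actual implementation (which handles batching, subsampling, and error handling); llm_call is a placeholder for the configured model, and the metric may return a float or an Opik ScoreResult:

```python
# Conceptual sketch of how a candidate prompt is scored against a dataset
def score_candidate(candidate_messages, dataset_items, metric):
    scores = []
    for item in dataset_items:
        # 1. Populate the prompt's {placeholders} with the item's fields
        messages = [
            {"role": m["role"], "content": m["content"].format(**item)}
            for m in candidate_messages
        ]
        # 2. The LLM generates an output (llm_call is a placeholder)
        llm_output = llm_call(messages)
        # 3. The output is scored against the item's reference field(s);
        #    the metric may return a float or a ScoreResult with a .value
        result = metric(item, llm_output)
        scores.append(getattr(result, "value", result))
    # 4. The average score guides the optimizer's search
    return sum(scores) / len(scores)
```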

Common Questions

How many examples do I need in my dataset?

As mentioned in “Size Recommendations”: aim for at least 50 examples for basic testing, and 100-500 for more robust and generalizable results. The ideal number depends on task complexity, data diversity, and available resources.

Are both the input and output fields of my dataset used?

Yes. The input fields are used to generate the actual prompts that are sent to the LLM. The output fields (ground truth) are used by the evaluation metric (defined in metric) to score the LLM’s responses, and this score is what the optimizer tries to maximize.

Can I evaluate the optimized prompt on a dataset other than the one used for optimization?

Yes, and this is highly recommended for a more objective assessment of the prompt’s generalization ability. First, run optimize_prompt with your main optimization dataset. Then, use the optimizer’s evaluate_prompt method (or a general evaluation utility) with the best prompt found and a separate, unseen test dataset.

Next Steps

  • Review the Manage Datasets guide for creating and managing your own custom datasets in Opik.
  • Check the FAQ for more questions and troubleshooting tips.
  • Explore the API Reference for technical details on ChatPrompt and optimizer methods.
  • See Example Projects & Cookbooks for runnable Colab notebooks demonstrating dataset usage in optimization workflows.