Sampling controls
Balance dataset subsampling and multi-completion evaluation
When optimizing prompts, there are two independent sampling layers you can control:
- Dataset subsampling (n_samples, n_samples_minibatch, n_samples_strategy).
- Multi-completion evaluation (n in model_parameters).
Use both to balance cost, stability, and exploration.
Available in Opik Optimizer v3.0.0+.
n_samples limits how many dataset rows are evaluated per trial. It applies to the evaluation dataset (the
validation_dataset if provided, otherwise dataset).
Notes:
- n_samples accepts an integer, a fractional float, a percent string (e.g. "10%"), or the special values "all", "full", or None.
- If n_samples is larger than the evaluation dataset size, the optimizer falls back to the full dataset and logs a warning.
n_samples_strategy controls how dataset rows are selected when n_samples is set. The default strategy is "random_sorted", which selects n_samples IDs. If your dataset items do not include IDs, the optimizer falls back to the dataset order.
Only "random_sorted" is supported today. Passing another strategy will raise a ValueError.
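The accepted n_samples forms described in the notes above can be sketched as a small resolver. This is an illustration only, not the optimizer's implementation: the function name resolve_n_samples is hypothetical, and the rounding rule and the reading of a float as a fraction of the dataset are assumptions.

```python
import logging

logger = logging.getLogger(__name__)


def resolve_n_samples(n_samples, dataset_size):
    """Return how many rows to evaluate (hypothetical helper, not library code)."""
    # None, "all", and "full" all mean: use every row.
    if n_samples is None or n_samples in ("all", "full"):
        return dataset_size
    if isinstance(n_samples, str):
        if not n_samples.endswith("%"):
            raise ValueError(f"Unsupported n_samples value: {n_samples!r}")
        # Percent string such as "10%".
        count = round(dataset_size * float(n_samples[:-1]) / 100)
    elif isinstance(n_samples, float):
        # Assumption: a float is a fraction of the dataset (0.25 -> 25%).
        count = round(dataset_size * n_samples)
    else:
        count = int(n_samples)
    if count > dataset_size:
        # Mirrors the documented fallback: full dataset plus a warning.
        logger.warning("n_samples=%r exceeds dataset size %d; using all rows",
                       n_samples, dataset_size)
        return dataset_size
    # Assumption: always evaluate at least one row.
    return max(count, 1)
```

For a 200-row evaluation dataset, "10%" and 20 would both resolve to 20 rows under these assumptions, while 500 would fall back to all 200.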
Some optimizers run inner-loop evaluations (for example, HRPO and GEPA). Use n_samples_minibatch to cap
those inner evaluations without reducing the outer evaluation size.
If n_samples_minibatch is not set, it defaults to n_samples.
For fully deterministic evaluations, you can pass an explicit list of dataset item IDs to evaluate_prompt.
This bypasses the sampling strategy and is mutually exclusive with n_samples.
When debug logging is enabled, evaluation logs include the sampling mode and resolved dataset size.
Single-sample evaluation can be noisy. The n parameter lets you generate multiple candidate outputs per
example and select the best one, introducing variety and reducing evaluation variance.
When you set n > 1 in your prompt’s model_parameters, the optimizer:
- generates n candidate outputs per example,
- scores each candidate with the metric,
- selects the best one (best_by_metric policy).
In optimizers that already generate multiple prompt variants per round, n is applied per evaluation, so total candidate evaluations scale by prompts_per_round * n.
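To make the scaling concrete, here is a rough count of candidate outputs per round. The helper name is hypothetical, and multiplying by the number of sampled rows is an assumption about how each evaluation covers the dataset.

```python
def candidates_per_round(prompts_per_round, n, rows_per_evaluation):
    """Estimate scored candidates in one round (illustrative arithmetic only)."""
    # Each prompt variant is evaluated on every sampled row,
    # and each row yields n candidate completions.
    return prompts_per_round * n * rows_per_evaluation


# 4 prompt variants, n=3 completions, 20 sampled rows
print(candidates_per_round(4, 3, 20))  # 240
```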
For tasks that execute generated code (like ARC-AGI or tool-driven agents), this means each prompt produces multiple candidate programs that are executed and scored, and the best candidate is used for optimization feedback.
Set the n parameter in your ChatPrompt.model_parameters.
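As a minimal sketch, the parameters might look like the dictionary below. Only the n key comes from this page; the temperature value is illustrative, and how the dictionary is attached to a ChatPrompt is not shown here.

```python
# Plain dictionary of the kind stored in ChatPrompt.model_parameters.
model_parameters = {
    "n": 3,              # request three candidate completions per example
    "temperature": 0.8,  # illustrative value; higher values diversify candidates
}
```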
Higher temperature values increase diversity among the n candidates. Consider using a temperature of 0.7-1.0 with n > 1 to maximize variety.
The low-level call_model and call_model_async helpers return a single
response unless you pass return_all=True. Optimizers handle n internally,
so you only need return_all when calling those helpers directly.
Single-sample evaluation is noisy. With n=3, the optimizer scores each candidate and uses the best result, which makes optimization more robust to stochastic failures.
Inspired by code generation benchmarks (pass@k), this approach measures whether a prompt can produce correct output, not just whether it usually does.
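The difference between "can produce" and "usually produces" shows up when comparing the best candidate score with the average. The scores below are made-up values for one example with n=3, and both helper names are hypothetical.

```python
def best_of_n(scores):
    """Score credited under a best-candidate policy."""
    return max(scores)


def mean_of_n(scores):
    """Average score, i.e. what a typical single sample would earn."""
    return sum(scores) / len(scores)


# Hypothetical metric scores for three candidates on one example.
candidate_scores = [0.0, 1.0, 0.0]
best = best_of_n(candidate_scores)     # one candidate solved the task
typical = mean_of_n(candidate_scores)  # a single sample succeeds about a third of the time
```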
This is useful when:
Some tasks naturally have multiple valid answers. Using n > 1 helps the optimizer find prompts that can generate any valid answer.
Currently, the optimizer supports these selection policies:
- best_by_metric (default): score each candidate with the metric and pick the best.
- first: pick the first candidate (fast, deterministic, but ignores scoring).
- concat: join all candidates into one output string.
- random: pick a random candidate (seeded if provided).
- max_logprob: pick the candidate with the highest average token logprob (provider support required; logprobs must be enabled in model kwargs).
Use the selection_policy key in model_parameters to override. The optimizer routes these policies through a shared candidate-selection utility so behavior is consistent across optimizers.
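The policies can be pictured as one dispatch function. This is a sketch under stated assumptions, not the optimizer's shared utility: the name select_candidate, its signature, and the blank-line separator used for concat are all hypothetical.

```python
import random


def select_candidate(candidates, policy="best_by_metric",
                     metric=None, logprobs=None, seed=None):
    """Pick one output from a list of candidate strings (illustrative only)."""
    if not candidates:
        raise ValueError("candidates must not be empty")
    if policy == "best_by_metric":
        if metric is None:
            raise ValueError("best_by_metric needs a metric callable")
        scores = [metric(c) for c in candidates]
        return candidates[scores.index(max(scores))]
    if policy == "first":
        return candidates[0]
    if policy == "concat":
        # Assumption: candidates are joined with a blank line between them.
        return "\n\n".join(candidates)
    if policy == "random":
        # A private Random instance keeps the choice reproducible for a given seed.
        return random.Random(seed).choice(candidates)
    if policy == "max_logprob":
        if logprobs is None:
            raise ValueError("max_logprob needs per-token logprobs per candidate")
        # Average token logprob per candidate; each list is assumed non-empty.
        averages = [sum(lp) / len(lp) for lp in logprobs]
        return candidates[averages.index(max(averages))]
    raise ValueError(f"Unknown selection policy: {policy!r}")
```

A call such as select_candidate(outputs, policy="first") ignores scoring entirely, while the default policy needs a metric callable that maps a candidate to a number.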
For max_logprob, enable logprobs in your model kwargs (provider support varies).
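A hedged sketch of such a configuration follows. The n and selection_policy keys come from this page; the "logprobs" key name follows the OpenAI-style chat parameter and is an assumption here, so check what your provider expects.

```python
# Parameters requesting three candidates, chosen by average token logprob.
model_parameters = {
    "n": 3,
    "selection_policy": "max_logprob",
    "logprobs": True,  # assumed OpenAI-style flag; provider support varies
}
```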
When selection_policy=best_by_metric, the optimizer scores every candidate with the metric and selects the highest-scoring one.
The trace metadata includes:
- n_requested: Number of completions requested
- candidates_scored: Number of candidates evaluated
- candidate_scores: List of all scores (best_by_metric only)
- candidate_logprobs: List of logprob scores (max_logprob only)
- chosen_index: Index of the selected candidate
Using n > 1 increases API costs proportionally. With n=3, you pay roughly 3x the completion tokens per evaluation call.
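For the trace metadata fields listed above, a best_by_metric run could be pictured as a plain dictionary. The field names are the documented ones; the builder function and the score values are hypothetical.

```python
def build_trace_metadata(n_requested, candidate_scores):
    """Assemble best_by_metric metadata from per-candidate scores (illustrative)."""
    return {
        "n_requested": n_requested,
        "candidates_scored": len(candidate_scores),
        "candidate_scores": list(candidate_scores),
        # Index of the highest score; ties resolve to the earliest candidate.
        "chosen_index": candidate_scores.index(max(candidate_scores)),
    }


metadata = build_trace_metadata(3, [0.2, 0.9, 0.5])
```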