Sampling controls

Balance dataset subsampling and multi-completion evaluation

When optimizing prompts, there are two independent sampling layers you can control:

  • Dataset subsampling: choose which dataset rows are evaluated (n_samples, n_samples_minibatch, n_samples_strategy).
  • Model sampling: request multiple completions per row (n in model_parameters).

Use both to balance cost, stability, and exploration.

Available in Opik Optimizer v3.0.0+.

Dataset subsampling (n_samples)

n_samples limits how many dataset rows are evaluated per trial. It applies to the evaluation dataset (the validation_dataset if provided, otherwise dataset).

result = optimizer.optimize_prompt(
    prompt=prompt,
    dataset=dataset,
    metric=metric,
    n_samples=50,
)

Notes:

  • n_samples accepts an integer, a fractional float, a percent string (e.g. "10%"), or the special values "all", "full", or None (see the example after this list).
  • If n_samples is larger than the evaluation dataset size, the optimizer falls back to the full dataset and logs a warning.
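
For illustration, the accepted value forms look like this (the fractional and percent forms are presumably resolved against the evaluation dataset size):

# All of the following are valid values for n_samples:
result = optimizer.optimize_prompt(prompt=prompt, dataset=dataset, metric=metric, n_samples=50)      # absolute row count
result = optimizer.optimize_prompt(prompt=prompt, dataset=dataset, metric=metric, n_samples=0.25)    # fractional float
result = optimizer.optimize_prompt(prompt=prompt, dataset=dataset, metric=metric, n_samples="10%")   # percent string
result = optimizer.optimize_prompt(prompt=prompt, dataset=dataset, metric=metric, n_samples="all")   # evaluate every row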

Deterministic subsampling (n_samples_strategy)

n_samples_strategy controls how dataset rows are selected when n_samples is set. The default strategy is "random_sorted", which:

  1. Sorts dataset item IDs.
  2. Shuffles them deterministically using the optimizer seed and evaluation phase.
  3. Takes the first n_samples IDs.

If your dataset items do not include IDs, the optimizer falls back to the dataset order.

result = optimizer.optimize_prompt(
    prompt=prompt,
    dataset=dataset,
    metric=metric,
    n_samples=50,
    n_samples_strategy="random_sorted",
)

Only "random_sorted" is supported today. Passing another strategy will raise a ValueError.

Minibatch sampling (n_samples_minibatch)

Some optimizers run inner-loop evaluations (for example, HRPO and GEPA). Use n_samples_minibatch to cap those inner evaluations without reducing the outer evaluation size.

result = optimizer.optimize_prompt(
    prompt=prompt,
    dataset=dataset,
    metric=metric,
    n_samples=200,
    n_samples_minibatch=25,
)

If n_samples_minibatch is not set, it defaults to n_samples.

Explicit item selection (dataset_item_ids)

For fully deterministic evaluations, you can pass an explicit list of dataset item IDs to evaluate_prompt. This bypasses the sampling strategy and is mutually exclusive with n_samples.

score = optimizer.evaluate_prompt(
    prompt=prompt,
    dataset=dataset,
    metric=metric,
    dataset_item_ids=["item-1", "item-2", "item-3"],
)

When debug logging is enabled, evaluation logs include the sampling mode and resolved dataset size.

Multiple completions per example (n parameter)

Single-sample evaluation can be noisy. The n parameter lets you generate multiple candidate outputs per example and select the best one, introducing variety and reducing evaluation variance.

How It Works

When you set n > 1 in your prompt’s model_parameters, the optimizer:

  1. Requests N completions from the LLM in a single API call (pass@N)
  2. Scores each candidate output using your metric
  3. Selects the best candidate (best_by_metric policy)
  4. Logs all scores and selection info to the Opik trace

In optimizers that already generate multiple prompt variants per round, n is applied per evaluation, so total candidate evaluations scale by prompts_per_round * n. For example, 4 prompt variants per round with n=3 means 12 candidate completions are scored each round.

For tasks that execute generated code (like ARC-AGI or tool-driven agents), this means each prompt produces multiple candidate programs that are executed and scored, and the best candidate is used for optimization feedback.
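
As a rough illustration of that loop, the sketch below takes a set of candidate programs plus task-specific execute and grade callables (both hypothetical stand-ins, not Opik Optimizer APIs) and returns the best-scoring candidate:

def best_candidate(candidate_programs, execute_program, grade_output, test_input, expected):
    # Best-of-n for a code-execution task: run every candidate and keep the highest-scoring one.
    scores = []
    for program in candidate_programs:
        output = execute_program(program, test_input)   # run the generated code
        scores.append(grade_output(output, expected))   # score it against the expected result
    best_idx = scores.index(max(scores))
    return candidate_programs[best_idx], scores         # best program drives optimization feedback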

Configuration

Set the n parameter in your ChatPrompt.model_parameters:

from opik_optimizer import ChatPrompt

# Generate 3 candidates per evaluation, select best
prompt = ChatPrompt(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Answer: {question}"},
    ],
    model_parameters={
        "n": 3,  # Generate 3 completions per call
        "temperature": 0.7,  # Higher temp = more variety between candidates
    },
)

Higher temperature values increase diversity between the N candidates. Consider using temperature: 0.7-1.0 with n > 1 to maximize variety.

The low-level call_model and call_model_async helpers return a single response unless you pass return_all=True. Optimizers handle n internally, so you only need return_all when calling those helpers directly.

Use Cases

Single-sample evaluation is noisy. With n=3, the optimizer scores each candidate and uses the best result, which makes optimization more robust to stochastic failures.

# Before: Single sample - noisy evaluation
prompt = ChatPrompt(model="gpt-4o-mini", messages=[...])
# Score might be 0.6 or 0.9 depending on luck

# After: Best-of-3 - more stable evaluation
prompt = ChatPrompt(
    model="gpt-4o-mini",
    messages=[...],
    model_parameters={"n": 3, "temperature": 0.8},
)
# Score reflects best achievable output

Inspired by code generation benchmarks (pass@k), this approach measures whether a prompt can produce correct output, not just whether it usually does.

# Optimize for "can this prompt ever get it right?"
prompt = ChatPrompt(
    model="gpt-4o-mini",
    messages=[...],
    model_parameters={"n": 5},  # pass@5 style
)

This is useful when:

  • Correctness matters more than consistency
  • You’ll use majority voting or best-of-k at inference time
  • Tasks have high variance (creative writing, complex reasoning)

Some tasks naturally have multiple valid answers. Using n > 1 helps the optimizer find prompts that can generate any valid answer.

# Creative task: multiple valid outputs
prompt = ChatPrompt(
    model="gpt-4o-mini",
    messages=[
        {"role": "user", "content": "Write a haiku about {topic}"},
    ],
    model_parameters={"n": 3, "temperature": 1.0},
)

Selection Policy

Currently, the optimizer supports these selection policies:

  • best_by_metric (default): score each candidate with the metric and pick the best.
  • first: pick the first candidate (fast, deterministic, but ignores scoring).
  • concat: join all candidates into one output string.
  • random: pick a random candidate (seeded if provided).
  • max_logprob: pick the candidate with the highest average token logprob (provider support required; logprobs must be enabled in model kwargs).

Use the selection_policy key in model_parameters to override the default. The optimizer routes these policies through a shared candidate-selection utility, so behavior is consistent across optimizers:

prompt = ChatPrompt(
    model="gpt-4o-mini",
    messages=[...],
    model_parameters={
        "n": 3,
        "selection_policy": "first",
    },
)

For max_logprob, enable logprobs in your model kwargs (provider support varies):

prompt = ChatPrompt(
    model="gpt-4o-mini",
    messages=[...],
    model_parameters={
        "n": 3,
        "selection_policy": "max_logprob",
        "logprobs": True,
        "top_logprobs": 1,
    },
)

When selection_policy=best_by_metric, the optimizer:

  1. Scores each candidate independently using your metric function
  2. Selects the candidate with the highest score as the final output
  3. Logs all scores and the chosen index to the trace metadata

# What happens internally (illustrative):
candidates = ["output_1", "output_2", "output_3"]
scores = [metric(item, c) for c in candidates]  # e.g. [0.7, 0.9, 0.6]
best_idx = scores.index(max(scores))            # 1
final_output = candidates[best_idx]             # "output_2"

The trace metadata includes the following fields (an illustrative example follows the list):

  • n_requested: Number of completions requested
  • candidates_scored: Number of candidates evaluated
  • candidate_scores: List of all scores (best_by_metric only)
  • candidate_logprobs: List of logprob scores (max_logprob only)
  • chosen_index: Index of the selected candidate
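
For example, the candidate-selection portion of the trace metadata for an n=3, best_by_metric evaluation might look like this (illustrative values; the variable name and exact trace structure are assumptions):

selection_metadata = {
    "n_requested": 3,                     # completions requested
    "candidates_scored": 3,               # candidates evaluated
    "candidate_scores": [0.7, 0.9, 0.6],  # best_by_metric only
    "chosen_index": 1,                    # index of the selected candidate
}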

Cost Considerations

Using n > 1 increases API costs proportionally. With n=3, you pay roughly 3x the completion tokens per evaluation call.

n value | Relative cost | Variance reduction
1       | 1x            | Baseline
3       | ~3x           | Significant