Multiple Completions (n parameter)

Introduce variety at each trial with pass@k evaluation

When optimizing prompts, single-sample evaluation can be noisy - a good prompt might fail on a particular trial due to LLM stochasticity. The n parameter lets you generate multiple candidate outputs per evaluation and select the best one, introducing variety and reducing evaluation variance.

Available in Opik Optimizer v3.0.0+.

How It Works

When you set n > 1 in your prompt’s model_parameters, the optimizer requests N completions per evaluation, scores each candidate, selects the best one, and logs all scores to the trace. The full explanation of how the n parameter works is maintained in Sampling controls.
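
For intuition, here is a minimal sketch of what an n=3 evaluation request looks like at the provider level. The optimizer routes calls through LiteLLM and issues this request for you, so you never write this yourself; the model name and question text are made-up examples:

# Illustrative only - roughly what the optimizer sends when n=3.
import litellm

response = litellm.completion(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Answer: What is the capital of France?"},
    ],
    n=3,             # ask the provider for 3 completions in a single call
    temperature=0.7,
)

# The provider returns 3 candidate outputs; the optimizer scores each one,
# keeps the best, and logs all scores to the trace.
candidates = [choice.message.content for choice in response.choices]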

Configuration

Set the n parameter in your ChatPrompt.model_parameters:

from opik_optimizer import ChatPrompt

# Generate 3 candidates per evaluation, select best
prompt = ChatPrompt(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Answer: {question}"},
    ],
    model_parameters={
        "n": 3,             # Generate 3 completions per call
        "temperature": 0.7, # Higher temp = more variety between candidates
    },
)

Higher temperature values increase diversity between the N candidates. Consider using temperature: 0.7-1.0 with n > 1 to maximize variety.

The low-level call_model and call_model_async helpers return a single response unless you pass return_all=True. Optimizers handle n internally, so you only need return_all when calling those helpers directly.

Use Cases

Single-sample evaluation is noisy. With n=3, the optimizer scores each candidate and uses the best result, which makes optimization more robust to stochastic failures.

# Before: Single sample - noisy evaluation
prompt = ChatPrompt(model="gpt-4o-mini", messages=[...])
# Score might be 0.6 or 0.9 depending on luck

# After: Best-of-3 - more stable evaluation
prompt = ChatPrompt(
    model="gpt-4o-mini",
    messages=[...],
    model_parameters={"n": 3, "temperature": 0.8},
)
# Score reflects best achievable output

Inspired by code generation benchmarks (pass@k), this approach measures whether a prompt can produce correct output, not just whether it usually does.

# Optimize for "can this prompt ever get it right?"
prompt = ChatPrompt(
    model="gpt-4o-mini",
    messages=[...],
    model_parameters={"n": 5},  # pass@5 style
)

This is useful when:

  • Correctness matters more than consistency
  • You’ll use majority voting or best-of-k at inference time (a majority-voting sketch follows this list)
  • Tasks have high variance (creative writing, complex reasoning)
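
If you plan to use majority voting at inference time, one possible pattern is to request several completions and keep the most frequent answer. This is not part of the Opik Optimizer API, just a sketch assuming a LiteLLM-compatible provider and a short-answer task:

# Majority voting over n completions at inference time (illustrative sketch).
from collections import Counter
import litellm

response = litellm.completion(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Answer with one word: which planet is known as the Red Planet?"}],
    n=5,
    temperature=0.8,
)

answers = [choice.message.content.strip().lower() for choice in response.choices]
final_answer, votes = Counter(answers).most_common(1)[0]  # most frequent answer wins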

Some tasks naturally have multiple valid answers. Using n > 1 helps the optimizer find prompts that can generate any valid answer.

# Creative task: multiple valid outputs
prompt = ChatPrompt(
    model="gpt-4o-mini",
    messages=[
        {"role": "user", "content": "Write a haiku about {topic}"},
    ],
    model_parameters={"n": 3, "temperature": 1.0},
)

Selection Policy

Currently, the optimizer supports these selection policies:

  • best_by_metric (default): score each candidate with the metric and pick the best.
  • first: pick the first candidate (fast, deterministic, but ignores scoring).
  • concat: join all candidates into one output string.
  • random: pick a random candidate (seeded if provided).
  • max_logprob: pick the candidate with the highest average token logprob (provider support required; logprobs must be enabled in model kwargs).

Use the selection_policy key in model_parameters to override the default. The optimizer routes these policies through a shared candidate-selection utility, so behavior is consistent across optimizers:

prompt = ChatPrompt(
    model="gpt-4o-mini",
    messages=[...],
    model_parameters={
        "n": 3,
        "selection_policy": "first",
    },
)

For max_logprob, enable logprobs in your model kwargs (provider support varies):

prompt = ChatPrompt(
    model="gpt-4o-mini",
    messages=[...],
    model_parameters={
        "n": 3,
        "selection_policy": "max_logprob",
        "logprobs": True,
        "top_logprobs": 1,
    },
)

When selection_policy=best_by_metric, the optimizer:

  1. Scores each candidate independently using your metric function
  2. Selects the candidate with the highest score as the final output
  3. Logs all scores and the chosen index to the trace metadata

# What happens internally:
candidates = ["output_1", "output_2", "output_3"]
scores = [metric(item, c) for c in candidates]  # e.g. [0.7, 0.9, 0.6]
best_idx = scores.index(max(scores))            # 1
final_output = candidates[best_idx]             # "output_2"

The trace metadata includes:

  • n_requested: Number of completions requested
  • candidates_scored: Number of candidates evaluated
  • candidate_scores: List of all scores (best_by_metric only)
  • candidate_logprobs: List of logprob scores (max_logprob only)
  • chosen_index: Index of the selected candidate
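
For example, a trial with n=3 and the default best_by_metric policy might record metadata along these lines (the values are illustrative and the exact layout can vary by optimizer version):

{
    "n_requested": 3,
    "candidates_scored": 3,
    "candidate_scores": [0.7, 0.9, 0.6],
    "chosen_index": 1,
}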

Cost Considerations

Using n > 1 increases API costs proportionally. With n=3, you pay roughly 3x the completion tokens per evaluation call.

n value    Relative cost    Variance reduction
1          1x               Baseline
3          ~3x              Significant
5          ~5x              High
10         ~10x             Very high

Recommendations:

  • Start with n=3 for most use cases
  • Use n=5-10 only for high-variance tasks
  • Consider the total optimization budget when choosing n (a rough cost estimate is sketched below)
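
As a rough back-of-envelope estimate of how n scales cost, multiply the per-trial evaluation volume by n. All numbers below are hypothetical placeholders; substitute your own dataset size, trial count, token usage, and provider pricing:

# Rough cost estimate for an optimization run with n > 1 (all values hypothetical).
dataset_size = 200            # items evaluated per trial
trials = 10                   # candidate prompts the optimizer tries
n = 3                         # completions per evaluation call
avg_completion_tokens = 150   # average tokens per completion
price_per_1k_completion_tokens = 0.0006  # example price in USD

total_completion_tokens = dataset_size * trials * n * avg_completion_tokens
estimated_cost = total_completion_tokens / 1000 * price_per_1k_completion_tokens
print(f"~{total_completion_tokens:,} completion tokens, ~${estimated_cost:.2f}")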

Limitations

When allow_tool_use=True and tools are defined, the optimizer forces n=1. This is because tool-calling requires maintaining a coherent message thread, which isn’t compatible with multiple independent completions.

# Tool-calling prompt - n will be forced to 1
prompt = ChatPrompt(
    model="gpt-4o-mini",
    messages=[...],
    tools=[...],
    model_parameters={"n": 3},  # Ignored when tools are used
)

Prompt synthesis steps that expect a single structured response (such as few-shot and parameter optimizers) ignore n to avoid returning multiple conflicting templates.

Some LLM providers don’t support the n parameter. Check your provider’s documentation. LiteLLM will drop unsupported parameters automatically.
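
If you want to check ahead of time whether a model advertises support for n, one option is LiteLLM's parameter lookup (a convenience utility; the exact result depends on your installed LiteLLM version and its provider mappings):

# Check whether the provider mapping for a model lists "n" as supported.
from litellm import get_supported_openai_params

supported = get_supported_openai_params(model="gpt-4o-mini") or []
print("n" in supported)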

Full Example

from opik_optimizer import ChatPrompt, MetaPromptOptimizer
from opik.evaluation.metrics import LevenshteinRatio

# Create prompt with n=3 for variety
prompt = ChatPrompt(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "Extract the key entities from the text."},
        {"role": "user", "content": "{text}"},
    ],
    model_parameters={
        "n": 3,             # Generate 3 candidates
        "temperature": 0.7, # Moderate variety
    },
)

# Define metric
def extraction_accuracy(dataset_item, llm_output):
    expected = dataset_item["expected_entities"]
    return LevenshteinRatio().score(expected, llm_output)

# Optimize - each trial evaluates 3 candidates, picks best
optimizer = MetaPromptOptimizer(model="gpt-4o")
result = optimizer.optimize_prompt(
    prompt=prompt,
    dataset=my_dataset,
    metric=extraction_accuracy,
)

print(f"Best prompt score: {result.score}")