Multiple Completions (n parameter)

Introduce variety at each trial with pass@k evaluation

When optimizing prompts, single-sample evaluation can be noisy - a good prompt might fail on a particular trial due to LLM stochasticity. The n parameter lets you generate multiple candidate outputs per evaluation and select the best one, introducing variety and reducing evaluation variance.

Available in Opik Optimizer v3.0.0+.

How It Works

When you set n > 1 in your prompt’s model_parameters, the optimizer requests N completions per evaluation, scores each candidate, selects the best one, and logs all scores to the trace. The full explanation of how the n parameter works is maintained in Sampling controls.
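
For intuition, here is a minimal sketch of what an n=3 evaluation request looks like at the provider level. The optimizer routes calls through LiteLLM and issues this request for you, so you never write this yourself; the model name and question text are made-up examples:

# Illustrative only - roughly what the optimizer sends when n=3.
import litellm

response = litellm.completion(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Answer: What is the capital of France?"},
    ],
    n=3,             # ask the provider for 3 completions in a single call
    temperature=0.7,
)

# The provider returns 3 candidate outputs; the optimizer scores each one,
# keeps the best, and logs all scores to the trace.
candidates = [choice.message.content for choice in response.choices]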

Configuration

Set the n parameter in your ChatPrompt.model_parameters:

from opik_optimizer import ChatPrompt

# Generate 3 candidates per evaluation, select best
prompt = ChatPrompt(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Answer: {question}"},
    ],
    model_parameters={
        "n": 3,             # Generate 3 completions per call
        "temperature": 0.7, # Higher temp = more variety between candidates
    },
)

Higher temperature values increase diversity between the N candidates. Consider using temperature: 0.7-1.0 with n > 1 to maximize variety.

The low-level call_model and call_model_async helpers return a single response unless you pass return_all=True. Optimizers handle n internally, so you only need return_all when calling those helpers directly.

Use Cases

Single-sample evaluation is noisy. With n=3, the optimizer scores each candidate and uses the best result, which makes optimization more robust to stochastic failures.

# Before: Single sample - noisy evaluation
prompt = ChatPrompt(model="gpt-4o-mini", messages=[...])
# Score might be 0.6 or 0.9 depending on luck

# After: Best-of-3 - more stable evaluation
prompt = ChatPrompt(
    model="gpt-4o-mini",
    messages=[...],
    model_parameters={"n": 3, "temperature": 0.8},
)
# Score reflects best achievable output

Inspired by code generation benchmarks (pass@k), this approach measures whether a prompt can produce correct output, not just whether it usually does.

# Optimize for "can this prompt ever get it right?"
prompt = ChatPrompt(
    model="gpt-4o-mini",
    messages=[...],
    model_parameters={"n": 5},  # pass@5 style
)

This is useful when:

  • Correctness matters more than consistency
  • You’ll use majority voting or best-of-k at inference time (a majority-voting sketch follows this list)
  • Tasks have high variance (creative writing, complex reasoning)
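
If you plan to use majority voting at inference time, one possible pattern is to request several completions and keep the most frequent answer. This is not part of the Opik Optimizer API, just a sketch assuming a LiteLLM-compatible provider and a short-answer task:

# Majority voting over n completions at inference time (illustrative sketch).
from collections import Counter
import litellm

response = litellm.completion(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Answer with one word: which planet is known as the Red Planet?"}],
    n=5,
    temperature=0.8,
)

answers = [choice.message.content.strip().lower() for choice in response.choices]
final_answer, votes = Counter(answers).most_common(1)[0]  # most frequent answer wins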

Some tasks naturally have multiple valid answers. Using n > 1 helps the optimizer find prompts that can generate any valid answer.

# Creative task: multiple valid outputs
prompt = ChatPrompt(
    model="gpt-4o-mini",
    messages=[
        {"role": "user", "content": "Write a haiku about {topic}"},
    ],
    model_parameters={"n": 3, "temperature": 1.0},
)

Selection Policy

Currently, the optimizer supports these selection policies:

  • best_by_metric (default): score each candidate with the metric and pick the best.
  • first: pick the first candidate (fast, deterministic, but ignores scoring).
  • concat: join all candidates into one output string.
  • random: pick a random candidate (seeded if provided).
  • max_logprob: pick the candidate with the highest average token logprob (provider support required; logprobs must be enabled in model kwargs).

Use the selection_policy key in model_parameters to override the default. The optimizer routes these policies through a shared candidate-selection utility, so behavior is consistent across optimizers:

prompt = ChatPrompt(
    model="gpt-4o-mini",
    messages=[...],
    model_parameters={
        "n": 3,
        "selection_policy": "first",
    },
)

For max_logprob, enable logprobs in your model kwargs (provider support varies):

prompt = ChatPrompt(
    model="gpt-4o-mini",
    messages=[...],
    model_parameters={
        "n": 3,
        "selection_policy": "max_logprob",
        "logprobs": True,
        "top_logprobs": 1,
    },
)

When selection_policy=best_by_metric, the optimizer:

  1. Scores each candidate independently using your metric function
  2. Selects the candidate with the highest score as the final output
  3. Logs all scores and the chosen index to the trace metadata

# What happens internally:
candidates = ["output_1", "output_2", "output_3"]
scores = [metric(item, c) for c in candidates]  # e.g. [0.7, 0.9, 0.6]
best_idx = scores.index(max(scores))            # 1
final_output = candidates[best_idx]             # "output_2"

The trace metadata includes:

  • n_requested: Number of completions requested
  • candidates_scored: Number of candidates evaluated
  • candidate_scores: List of all scores (best_by_metric only)
  • candidate_logprobs: List of logprob scores (max_logprob only)
  • chosen_index: Index of the selected candidate
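
For example, a trial with n=3 and the default best_by_metric policy might record metadata along these lines (the values are illustrative and the exact layout can vary by optimizer version):

{
    "n_requested": 3,
    "candidates_scored": 3,
    "candidate_scores": [0.7, 0.9, 0.6],
    "chosen_index": 1,
}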

Cost Considerations

Using n > 1 increases API costs proportionally. With n=3, you pay roughly 3x the completion tokens per evaluation call.

n value    Relative cost    Variance reduction
1          1x               Baseline
3          ~3x              Significant
5          ~5x              High
10         ~10x             Very high

Recommendations:

  • Start with n=3 for most use cases
  • Use n=5-10 only for high-variance tasks
  • Consider the total optimization budget when choosing n (a rough cost estimate is sketched below)
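
As a rough back-of-envelope estimate of how n scales cost, multiply the per-trial evaluation volume by n. All numbers below are hypothetical placeholders; substitute your own dataset size, trial count, token usage, and provider pricing:

# Rough cost estimate for an optimization run with n > 1 (all values hypothetical).
dataset_size = 200            # items evaluated per trial
trials = 10                   # candidate prompts the optimizer tries
n = 3                         # completions per evaluation call
avg_completion_tokens = 150   # average tokens per completion
price_per_1k_completion_tokens = 0.0006  # example price in USD

total_completion_tokens = dataset_size * trials * n * avg_completion_tokens
estimated_cost = total_completion_tokens / 1000 * price_per_1k_completion_tokens
print(f"~{total_completion_tokens:,} completion tokens, ~${estimated_cost:.2f}")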

Limitations

When allow_tool_use=True and tools are defined, the optimizer forces n=1. This is because tool-calling requires maintaining a coherent message thread, which isn’t compatible with multiple independent completions.

# Tool-calling prompt - n will be forced to 1
prompt = ChatPrompt(
    model="gpt-4o-mini",
    messages=[...],
    tools=[...],
    model_parameters={"n": 3},  # Ignored when tools are used
)

Prompt synthesis steps that expect a single structured response (such as few-shot and parameter optimizers) ignore n to avoid returning multiple conflicting templates.

Some LLM providers don’t support the n parameter. Check your provider’s documentation. LiteLLM will drop unsupported parameters automatically.
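
If you want to check ahead of time whether a model advertises support for n, one option is LiteLLM's parameter lookup (a convenience utility; the exact result depends on your installed LiteLLM version and its provider mappings):

# Check whether the provider mapping for a model lists "n" as supported.
from litellm import get_supported_openai_params

supported = get_supported_openai_params(model="gpt-4o-mini") or []
print("n" in supported)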

Full Example

from opik_optimizer import ChatPrompt, MetaPromptOptimizer
from opik.evaluation.metrics import LevenshteinRatio

# Create prompt with n=3 for variety
prompt = ChatPrompt(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "Extract the key entities from the text."},
        {"role": "user", "content": "{text}"},
    ],
    model_parameters={
        "n": 3,             # Generate 3 candidates
        "temperature": 0.7, # Moderate variety
    },
)

# Define metric
def extraction_accuracy(dataset_item, llm_output):
    expected = dataset_item["expected_entities"]
    return LevenshteinRatio().score(expected, llm_output)

# Optimize - each trial evaluates 3 candidates, picks best
optimizer = MetaPromptOptimizer(model="gpt-4o")
result = optimizer.optimize_prompt(
    prompt=prompt,
    dataset=my_dataset,
    metric=extraction_accuracy,
)

print(f"Best prompt score: {result.score}")