Optimize prompts effectively with Opik

Use this playbook whenever you need to improve a prompt (single-turn or agentic) and want a repeatable process rather than manual tweaks.

1. Establish baselines

Record the current prompt and score using your production metric.
Log at least 10 representative dataset rows so the optimizer can generalize.
Capture latency and token costs so that an optimized prompt does not regress them unnoticed.
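The baseline steps above can be sketched in plain Python. This is a minimal illustration, not part of the Opik API: `record_baseline`, `run_prompt`, and `score` are hypothetical names, and the stub model in the usage lines stands in for a real LLM call and your production metric.

```python
import statistics
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class Baseline:
    mean_score: float
    mean_latency_s: float
    total_tokens: int


def record_baseline(
    rows: list[dict],
    run_prompt: Callable[[dict], tuple[str, int]],
    score: Callable[[dict, str], float],
) -> Baseline:
    """Run the current prompt over `rows` and summarize score, latency, and tokens.

    `run_prompt` and `score` are hypothetical callables: the first returns
    (output_text, tokens_used) for a row, the second returns a metric value.
    """
    if len(rows) < 10:
        raise ValueError("log at least 10 representative rows")
    scores, latencies, tokens = [], [], 0
    for row in rows:
        start = time.perf_counter()
        output, used = run_prompt(row)
        latencies.append(time.perf_counter() - start)
        scores.append(score(row, output))
        tokens += used
    return Baseline(statistics.mean(scores), statistics.mean(latencies), tokens)


# Usage with a stub model that echoes the question and reports 7 tokens per call.
rows = [{"question": f"q{i}", "expected": f"q{i}"} for i in range(10)]
baseline = record_baseline(
    rows,
    run_prompt=lambda row: (row["question"], 7),
    score=lambda row, out: float(out == row["expected"]),
)
```

Storing the three numbers together makes it easy to compare an optimized prompt against the same rows later.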

2. Choose an optimizer

Scenario	Recommended optimizer
General prompt copy edits	MetaPrompt
Complex failure analysis	HRPO
Need diverse candidates	Evolutionary
Few-shot heavy prompts	Few-Shot Bayesian
Tune sampling params	Parameter optimizer

3. Configure the run

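from opik_optimizer import HRPO

# my_prompt, my_dataset, and answer_quality are assumed to be defined earlier.
optimizer = HRPO(
    model="openai/gpt-4o",
    max_parallel_batches=4,
    seed=42,  # fixed seed for reproducible runs
)
result = optimizer.optimize_prompt(
    prompt=my_prompt,
    dataset=my_dataset,
    metric=answer_quality,
    max_trials=5,   # start small; see the notes below
    n_samples=50,   # evaluate on a subset to limit cost
)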

Set project_name on the optimizer to group runs by team or initiative.
Start with max_trials = 3–5. Increase once you confirm the metric is reliable.
Use n_samples to limit cost during early exploration; rerun on the full dataset before promoting a prompt.
For optimizers with inner-loop evaluations (HRPO, GEPA), set n_samples_minibatch to keep those steps lightweight.
Use n_samples_strategy to keep subsampling deterministic (default: "random_sorted").
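To see why deterministic subsampling matters, here is a small standalone sketch. It is not Opik's implementation of "random_sorted"; it only illustrates the general idea, under the assumption that a fixed seed plus a stable ordering yields the same subset on every run. The function name `deterministic_sample` and the row IDs are hypothetical.

```python
import random


def deterministic_sample(row_ids: list[str], n_samples: int, seed: int = 42) -> list[str]:
    """Pick the same subset on every run: seeded sampling over a sorted input."""
    if n_samples >= len(row_ids):
        return sorted(row_ids)
    rng = random.Random(seed)  # local RNG, so global random state is untouched
    # Sort first so the result does not depend on the order rows arrive in.
    chosen = rng.sample(sorted(row_ids), n_samples)
    return sorted(chosen)


# Usage: the same 50 rows are selected regardless of input order.
ids = [f"row-{i:03d}" for i in range(200)]
subset_a = deterministic_sample(ids, 50)
subset_b = deterministic_sample(list(reversed(ids)), 50)
```

When two runs score the same rows, a difference in score reflects the prompt change rather than sampling noise.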

Optimize tools (MCP)

Tool optimization is now documented separately. Use it when you want to improve MCP tool descriptions without changing prompt text.

Target specific sections inside a prompt (advanced)

If you need finer control than roles (for example, only optimize a specific assistant message), use prompt_segments to extract and update parts by segment ID.

Intent/Trigger: use segment-level updates when you need to constrain changes to exact message segments.

Required parameters: prompt, dataset, metric
Optional parameters: segment update args (updates passed to prompt_segments.apply_segment_updates)
Minimal valid payload: optimizer.optimize_prompt(prompt=updated_prompt, dataset=my_dataset, metric=answer_quality)

from opik_optimizer.utils import prompt_segments

segments = prompt_segments.extract_prompt_segments(my_prompt)
for segment in segments:
    print(segment.segment_id, segment.role)

# Update only message:1 (second message)
updates = {"message:1": "User question: {user_query}"}
updated_prompt = prompt_segments.apply_segment_updates(my_prompt, updates)

# Use the updated prompt in optimization (the original prompt is unchanged)
result = optimizer.optimize_prompt(
    prompt=updated_prompt,
    dataset=my_dataset,
    metric=answer_quality,
)

Optimize multiple prompts together

You can pass a dict of ChatPrompt objects to optimize a coordinated prompt bundle (for example, a multi-agent setup or system/user prompt pair that must stay in sync). Each key names a prompt and is preserved through optimization.

from opik_optimizer import MetaPromptOptimizer, ChatPrompt

prompts = {
    "researcher": ChatPrompt(
        name="researcher",
        messages=[
            {"role": "system", "content": "Gather facts and cite sources."},
            {"role": "user", "content": "{question}"},
        ],
    ),
    "synthesizer": ChatPrompt(
        name="synthesizer",
        messages=[
            {"role": "system", "content": "Summarize findings clearly."},
            {"role": "user", "content": "{question}"},
        ],
    ),
}

optimizer = MetaPromptOptimizer(model="openai/gpt-4o-mini", prompts_per_round=2)
result = optimizer.optimize_prompt(
    prompt=prompts,
    dataset=my_dataset,
    metric=answer_quality,
    max_trials=3,
)

result.prompt returns a dict keyed by the same names so you can update each agent prompt together.

4. Evaluate outcomes

Compare result.score vs. result.initial_score to ensure material improvement.
Review the result's history attribute to see why individual trials regressed.
Use Dashboard results to visualize per-trial performance.
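One way to make "material improvement" concrete is a simple threshold check. The sketch below is an assumption-laden example: the helper `is_material_improvement` and the 0.02 default are hypothetical and should be tuned to the noise level of your own metric.

```python
def is_material_improvement(
    initial_score: float,
    final_score: float,
    min_abs_gain: float = 0.02,  # assumed noise threshold, not an Opik default
) -> bool:
    """Return True when the gain is at least `min_abs_gain`."""
    return (final_score - initial_score) >= min_abs_gain


# Usage with illustrative numbers; in practice pass
# result.initial_score and result.score from the optimization run.
accept = is_material_improvement(0.70, 0.74)
reject = is_material_improvement(0.70, 0.71)
```

If repeated runs of the same prompt vary by more than the threshold, the threshold is too low to separate real gains from noise.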

5. Ship safely

Export the prompt

result.prompt returns the best-performing ChatPrompt. Serialize it as JSON and check it into your repo.
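A minimal sketch of the serialization step, assuming you already have the prompt's messages as a plain list of role/content dicts. How you obtain that list from the ChatPrompt depends on your Opik version, so that part is left out. The helper names `export_messages` and `load_messages` and the file path are hypothetical.

```python
import json
from pathlib import Path


def export_messages(messages: list[dict], path: Path) -> None:
    """Write chat messages to a stable, diff-friendly JSON file."""
    text = json.dumps({"messages": messages}, indent=2, sort_keys=True, ensure_ascii=False)
    path.write_text(text + "\n", encoding="utf-8")


def load_messages(path: Path) -> list[dict]:
    """Read the messages back, for example at application startup."""
    return json.loads(path.read_text(encoding="utf-8"))["messages"]


# Illustrative messages; replace with the messages of your optimized prompt.
example_messages = [
    {"role": "system", "content": "Summarize findings clearly."},
    {"role": "user", "content": "{question}"},
]
```

Sorted keys and fixed indentation keep the file byte-stable, so a repository diff shows only genuine prompt changes.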

Automate regression tests

Wire the optimizer run into CI with a smaller dataset so future prompt edits have guardrails.
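A CI guardrail can be as small as a single assertion against the recorded baseline. The following sketch assumes your CI job has already computed a score on the smaller dataset; `assert_no_regression` and the 0.01 tolerance are hypothetical choices, not Opik features.

```python
def assert_no_regression(score: float, baseline: float, tolerance: float = 0.01) -> None:
    """Raise AssertionError when `score` falls more than `tolerance` below `baseline`."""
    if score < baseline - tolerance:
        raise AssertionError(
            f"prompt regressed: score {score:.3f} is below "
            f"baseline {baseline:.3f} minus tolerance {tolerance}"
        )


# Usage inside a test function: passes silently when the score holds up.
assert_no_regression(score=0.80, baseline=0.80)
```

A small tolerance keeps the check from failing on ordinary run-to-run variation while still catching real drops.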

Monitor in production

Trace the new prompt with Opik tracing to confirm real-world performance matches experiment results.

Optimization Studio
Define datasets
Define metrics
Chaining optimizers
Avoiding overfitting – Prevent your prompt from memorizing the training data by using separate validation datasets