Optimize prompts

Use this playbook whenever you need to improve a prompt (single-turn or agentic) and want a repeatable process rather than manual tweaks.

1. Establish baselines

  • Record the current prompt and score using your production metric.
  • Log at least 10 representative dataset rows so the optimizer can generalize.
  • Capture latency and token costs; optimizations should not regress them unexpectedly (a baseline snapshot sketch follows this list).
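If you do not already record these numbers somewhere durable, a minimal sketch is below; the file name, metric name, and values are placeholders for whatever your evaluation harness actually reports.

```python
import json

# Hypothetical baseline snapshot: the prompt version, production metric, and
# cost/latency figures are placeholders for your own measurements.
baseline = {
    "prompt_version": "support-bot-v3",
    "metric": "answer_quality",
    "score": 0.71,
    "p95_latency_ms": 1840,
    "avg_tokens_per_call": 950,
    "dataset_rows": 25,
}

with open("baseline.json", "w") as f:
    json.dump(baseline, f, indent=2)
```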

2. Choose an optimizer

| Scenario | Recommended optimizer |
| --- | --- |
| General prompt copy edits | MetaPrompt |
| Complex failure analysis | Hierarchical Reflective |
| Need diverse candidates | Evolutionary |
| Few-shot heavy prompts | Few-Shot Bayesian |
| Tune sampling params | Parameter optimizer |
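In code, switching optimizers is mostly a matter of which class you construct. The sketch below mirrors the table, but the exact class names and constructor arguments should be confirmed against the opik_optimizer version you have installed.

```python
# Class names follow the table above; verify them against your installed
# opik_optimizer release before relying on this sketch.
from opik_optimizer import (
    MetaPromptOptimizer,              # general prompt copy edits
    HierarchicalReflectiveOptimizer,  # complex failure analysis
    EvolutionaryOptimizer,            # diverse candidates
    FewShotBayesianOptimizer,         # few-shot heavy prompts
)

# All optimizers share the same basic construction pattern.
optimizer = MetaPromptOptimizer(model="openai/gpt-4o")
```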

3. Configure the run

```python
from opik_optimizer import HierarchicalReflectiveOptimizer

optimizer = HierarchicalReflectiveOptimizer(
    model="openai/gpt-4o",
    max_parallel_batches=4,
    seed=42,
)
result = optimizer.optimize_prompt(
    prompt=my_prompt,
    dataset=my_dataset,
    metric=answer_quality,
    max_trials=5,
    n_samples=50,
)
```
  • Set project_name on the ChatPrompt to group runs by team or initiative; see the prompt definition sketch after this list.
  • Start with max_trials = 3–5. Increase once you confirm the metric is reliable.
  • Use n_samples to limit cost during early exploration; rerun on the full dataset before promoting a prompt.
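As a rough illustration, a ChatPrompt grouped under a project might look like the sketch below; the project name, messages, and the {question} placeholder (mapped to a dataset column) are examples, and field names can vary between opik_optimizer versions.

```python
from opik_optimizer import ChatPrompt

# Illustrative prompt definition; the project name, messages, and the
# {question} placeholder are examples only.
my_prompt = ChatPrompt(
    project_name="support-bot-optimizations",
    messages=[
        {"role": "system", "content": "You are a concise support assistant."},
        {"role": "user", "content": "{question}"},
    ],
)
```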

4. Evaluate outcomes

  • Compare result.score vs. result.initial_score to confirm a material improvement; the sketch after this list shows a minimal check.
  • Review the history attribute to understand why individual trials regressed.
  • Use the dashboard results to visualize per-trial performance.
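A minimal post-run check might look like the sketch below; it assumes the result object returned by optimize_prompt above, and the improvement threshold is a placeholder.

```python
# Compare the optimized score against the baseline captured by the run.
improvement = result.score - result.initial_score
print(
    f"baseline={result.initial_score:.3f} "
    f"optimized={result.score:.3f} delta={improvement:+.3f}"
)

# If the gain is marginal, inspect per-trial records; the exact shape of the
# history entries depends on the optimizer and SDK version.
if improvement < 0.02:  # placeholder threshold for "material improvement"
    for trial in result.history:
        print(trial)
```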

5. Ship safely

  1. Export the prompt: result.prompt returns the best-performing ChatPrompt. Serialize it as JSON and check it into your repo (see the sketch after this list).
  2. Automate regression tests: wire the optimizer run into CI with a smaller dataset so future prompt edits have guardrails.
  3. Monitor in production: trace the new prompt with Opik tracing to confirm real-world performance matches the experiment results.
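The export and CI steps can be sketched as follows; the output path, the to_dict-style serializer on the ChatPrompt, smoke_test_dataset, and the score floor are all assumptions to adapt to your own setup and SDK version.

```python
import json

# Persist the winning prompt for review. The to_dict() serializer is an
# assumption; check the ChatPrompt API in your opik_optimizer version.
best_prompt = result.prompt
with open("prompts/support_bot.json", "w") as f:
    json.dump(best_prompt.to_dict(), f, indent=2)


# Hypothetical pytest guard for CI: re-run a small optimization and fail the
# build if quality drops below an agreed floor.
def test_prompt_quality_floor():
    ci_result = optimizer.optimize_prompt(
        prompt=my_prompt,
        dataset=smoke_test_dataset,  # small, fast subset kept for CI
        metric=answer_quality,
        max_trials=1,
        n_samples=10,
    )
    assert ci_result.score >= 0.70  # placeholder quality floor
```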