Optimize prompts

Use this playbook whenever you need to improve a prompt (single-turn or agentic) and want a repeatable process rather than manual tweaks.

1. Establish baselines

  • Record the current prompt and score using your production metric.
  • Log at least 10 representative dataset rows so the optimizer can generalize.
  • Capture latency and token costs; optimizations should not regress them unexpectedly (a baseline snapshot sketch follows this list).
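If you do not already record these numbers somewhere durable, a minimal sketch is below; the file name, metric name, and values are placeholders for whatever your evaluation harness actually reports.

```python
import json

# Hypothetical baseline snapshot: the prompt version, production metric, and
# cost/latency figures are placeholders for your own measurements.
baseline = {
    "prompt_version": "support-bot-v3",
    "metric": "answer_quality",
    "score": 0.71,
    "p95_latency_ms": 1840,
    "avg_tokens_per_call": 950,
    "dataset_rows": 25,
}

with open("baseline.json", "w") as f:
    json.dump(baseline, f, indent=2)
```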

2. Choose an optimizer

| Scenario | Recommended optimizer |
| --- | --- |
| General prompt copy edits | MetaPrompt |
| Complex failure analysis | Hierarchical Reflective |
| Need diverse candidates | Evolutionary |
| Few-shot heavy prompts | Few-Shot Bayesian |
| Tune sampling params | Parameter optimizer |
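In code, switching optimizers is mostly a matter of which class you construct. The sketch below mirrors the table, but the exact class names and constructor arguments should be confirmed against the opik_optimizer version you have installed.

```python
# Class names follow the table above; verify them against your installed
# opik_optimizer release before relying on this sketch.
from opik_optimizer import (
    MetaPromptOptimizer,              # general prompt copy edits
    HierarchicalReflectiveOptimizer,  # complex failure analysis
    EvolutionaryOptimizer,            # diverse candidates
    FewShotBayesianOptimizer,         # few-shot heavy prompts
)

# All optimizers share the same basic construction pattern.
optimizer = MetaPromptOptimizer(model="openai/gpt-4o")
```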

3. Configure the run

```python
from opik_optimizer import HierarchicalReflectiveOptimizer

optimizer = HierarchicalReflectiveOptimizer(
    model="openai/gpt-4o",
    max_parallel_batches=4,
    seed=42,
)
result = optimizer.optimize_prompt(
    prompt=my_prompt,
    dataset=my_dataset,
    metric=answer_quality,
    max_trials=5,
    n_samples=50,
)
```
  • Set project_name on the ChatPrompt to group runs by team or initiative; see the prompt definition sketch after this list.
  • Start with max_trials = 3–5. Increase once you confirm the metric is reliable.
  • Use n_samples to limit cost during early exploration; rerun on the full dataset before promoting a prompt.
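As a rough illustration, a ChatPrompt grouped under a project might look like the sketch below; the project name, messages, and the {question} placeholder (mapped to a dataset column) are examples, and field names can vary between opik_optimizer versions.

```python
from opik_optimizer import ChatPrompt

# Illustrative prompt definition; the project name, messages, and the
# {question} placeholder are examples only.
my_prompt = ChatPrompt(
    project_name="support-bot-optimizations",
    messages=[
        {"role": "system", "content": "You are a concise support assistant."},
        {"role": "user", "content": "{question}"},
    ],
)
```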

4. Evaluate outcomes

  • Compare result.score vs. result.initial_score to confirm a material improvement; the sketch after this list shows a minimal check.
  • Review the history attribute to understand why individual trials regressed.
  • Use the dashboard results to visualize per-trial performance.
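A minimal post-run check might look like the sketch below; it assumes the result object returned by optimize_prompt above, and the improvement threshold is a placeholder.

```python
# Compare the optimized score against the baseline captured by the run.
improvement = result.score - result.initial_score
print(
    f"baseline={result.initial_score:.3f} "
    f"optimized={result.score:.3f} delta={improvement:+.3f}"
)

# If the gain is marginal, inspect per-trial records; the exact shape of the
# history entries depends on the optimizer and SDK version.
if improvement < 0.02:  # placeholder threshold for "material improvement"
    for trial in result.history:
        print(trial)
```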

5. Ship safely

  1. Export the prompt: result.prompt returns the best-performing ChatPrompt. Serialize it as JSON and check it into your repo (see the sketch after this list).
  2. Automate regression tests: wire the optimizer run into CI with a smaller dataset so future prompt edits have guardrails.
  3. Monitor in production: trace the new prompt with Opik tracing to confirm real-world performance matches the experiment results.
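The export and CI steps can be sketched as follows; the output path, the to_dict-style serializer on the ChatPrompt, smoke_test_dataset, and the score floor are all assumptions to adapt to your own setup and SDK version.

```python
import json

# Persist the winning prompt for review. The to_dict() serializer is an
# assumption; check the ChatPrompt API in your opik_optimizer version.
best_prompt = result.prompt
with open("prompts/support_bot.json", "w") as f:
    json.dump(best_prompt.to_dict(), f, indent=2)


# Hypothetical pytest guard for CI: re-run a small optimization and fail the
# build if quality drops below an agreed floor.
def test_prompt_quality_floor():
    ci_result = optimizer.optimize_prompt(
        prompt=my_prompt,
        dataset=smoke_test_dataset,  # small, fast subset kept for CI
        metric=answer_quality,
        max_trials=1,
        n_samples=10,
    )
    assert ci_result.score >= 0.70  # placeholder quality floor
```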