Agent Optimization

Opik Agent Optimizer is a comprehensive toolkit designed to enhance the performance and efficiency of your Large Language Model (LLM) applications. Rather than manually editing prompts and running evaluations, you can use Opik Agent Optimizer to optimize your prompts automatically.

[Screenshot: Opik Agent Optimizer Dashboard]

Getting Started

Here’s a simple step-by-step guide to get you up and running with Opik Agent Optimizer:

1. Set up your account and API key

First, you’ll need an Opik account and API key:

# Install Opik and the optimizer package
pip install opik opik-optimizer

# Configure your API key
opik configure

If you don’t have an account yet, sign up here.
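You can also configure the SDK directly from Python instead of the CLI. A minimal sketch, assuming the default cloud deployment; the api_key and workspace values below are placeholders for your own credentials:

import opik

# Equivalent to running `opik configure` in the shell; replace the placeholder
# values with your own API key and workspace name.
opik.configure(api_key="YOUR_OPIK_API_KEY", workspace="YOUR_WORKSPACE")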

2. Run the optimization code

The goal of Opik Agent Optimizer is to let you optimize your existing agent: simply pass your prompt or agent to the optimizer:

import opik
from opik.evaluation.metrics import score_result
import opik_optimizer

# Define the prompt to optimize
prompt = opik_optimizer.ChatPrompt(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "{question}"},
    ],
    model="gpt-4o"
)

# Define the metric to optimize for
def short_response(dataset_item, llm_output):
    if len(llm_output) < 100:
        return score_result.ScoreResult(name="short_response", value=1, reason="Response is short as expected.")
    else:
        return score_result.ScoreResult(name="short_response", value=0, reason="Response is too long.")

# Define the dataset to optimize on
client = opik.Opik()
dataset = client.get_or_create_dataset(name='prompt_optimization')
dataset.insert([
    {"question": "What is agent optimization?"},
    {"question": "What are best practices for prompt optimization?"},
])

# Run the optimizer
optimizer = opik_optimizer.HierarchicalReflectiveOptimizer()
result = optimizer.optimize_prompt(
    prompt=prompt,
    dataset=dataset,
    metric=short_response,
    max_trials=1
)

result.display()

3. Analyze your results

After optimization completes, you can view results in multiple ways:

Call result.display() to see a summary in your terminal:

result.display()

[Screenshot: CLI optimization results]
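Beyond the terminal summary, the result object can also be inspected programmatically, for example to retrieve the optimized prompt for use in your application. A minimal sketch, assuming the result exposes score and prompt attributes (attribute names may differ between versions; see the API reference):

# Attribute names below are assumptions; check the opik_optimizer API
# reference for the OptimizationResult fields in your installed version.
print(result.score)   # best metric score achieved during optimization
print(result.prompt)  # the optimized chat messages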

Optimization Algorithms

The optimizer implements both proprietary and open-source optimization algorithms. Each one has its strengths and weaknesses; as a first step, we recommend trying either GEPA or the Hierarchical Reflective Optimizer:

| Algorithm | Description |
| --- | --- |
| MetaPrompt Optimization | Uses an LLM (“reasoning model”) to critique and iteratively refine an initial instruction prompt. Good for general prompt wording, clarity, and structural improvements. Supports MCP tool calling optimization. |
| Hierarchical Reflective Optimization | Uses hierarchical root cause analysis to systematically improve prompts by analyzing failures in batches, synthesizing findings, and addressing identified failure modes. Best for complex prompts requiring systematic refinement based on understanding why they fail. |
| Few-shot Bayesian Optimization | Specifically for chat models, this optimizer uses Bayesian optimization (Optuna) to find the optimal number and combination of few-shot examples (demonstrations) to accompany a system prompt. |
| Evolutionary Optimization | Employs genetic algorithms to evolve a population of prompts. Can discover novel prompt structures and supports multi-objective optimization (e.g., score vs. length). Can use LLMs for advanced mutation/crossover. |
| GEPA Optimization | Wraps the external GEPA package to optimize a single system prompt for single-turn tasks using a reflection model. Requires pip install gepa. |
| Parameter Optimization | Optimizes LLM call parameters (temperature, top_p, etc.) using Bayesian optimization. Uses Optuna for efficient parameter search with global and local search phases. Best for tuning model behavior without changing the prompt. |
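All optimizers expose the same optimize_prompt interface, so trying a different algorithm is usually just a matter of instantiating a different class. A minimal sketch, assuming class names such as FewShotBayesianOptimizer and MetaPromptOptimizer and a model constructor argument (check the API reference for the exact class names and parameters in your installed version):

import opik_optimizer

# Swap in a different algorithm by changing the optimizer class; the class
# names and constructor arguments here are assumptions based on the table above.
optimizer = opik_optimizer.FewShotBayesianOptimizer(model="gpt-4o")
# optimizer = opik_optimizer.MetaPromptOptimizer(model="gpt-4o")
# optimizer = opik_optimizer.EvolutionaryOptimizer(model="gpt-4o")

# The optimize_prompt call stays the same regardless of the algorithm chosen.
result = optimizer.optimize_prompt(
    prompt=prompt,
    dataset=dataset,
    metric=short_response,
)
result.display()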

If you would like us to implement another optimization algorithm, reach out to us on GitHub, or feel free to contribute by extending optimizers.

Benchmark results

We are still working on the benchmarks; the results below are early, preliminary, and subject to change. You can learn more about our benchmarks here.

Each optimization algorithm is evaluated against different use-cases and datasets:

  1. Arc: The ai2_arc dataset contains a set of multiple-choice science questions.
  2. GSM8K: The gsm8k dataset contains a set of math problems.
  3. MedHallu: The medhallu dataset contains a set of medical questions.
  4. RagBench: The ragbench dataset contains a set of retrieval (RAG) examples.

Our latest benchmarks show the following results:

| Rank | Algorithm/Optimizer | Average Score | Arc | GSM8K | RagBench |
| --- | --- | --- | --- | --- | --- |
| 1 | Hierarchical Reflective Optimization | 67.83% | 92.70% | 28.00% | 82.8% |
| 2 | Few-shot Bayesian Optimization | 59.17% | 28.09% | 59.26% | 90.15% |
| 3 | Evolutionary Optimization | 52.51% | 40.00% | 25.53% | 92.00% |
| 4 | MetaPrompt Optimization | 38.75% | 25.00% | 26.93% | 64.31% |
| 5 | GEPA Optimization | 32.27% | 6.55% | 26.08% | 64.17% |
| 6 | No optimization | 11.85% | 1.69% | 24.06% | 9.81% |

The results above were benchmarked against gpt-4o-mini, using various metrics depending on the dataset, including Levenshtein Ratio, Answer Relevance, and Hallucination. The results may change if you use a different model, configuration, dataset, or starting prompt(s).
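Because the metric passed to the optimizer is simply a function returning a ScoreResult, you can reuse Opik's built-in evaluation metrics in your own optimizations. A minimal sketch wrapping Levenshtein Ratio, assuming each dataset item stores a reference answer under an answer key (that field name is a hypothetical choice for illustration):

from opik.evaluation.metrics import LevenshteinRatio

levenshtein = LevenshteinRatio()

def levenshtein_ratio(dataset_item, llm_output):
    # Compare the model output against the reference answer on the dataset
    # item; the "answer" key is a hypothetical field name for this example.
    return levenshtein.score(output=llm_output, reference=dataset_item["answer"])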

Next Steps

  1. Explore different optimization algorithms to choose the best one for your use case
  2. Understand prompt engineering best practices
  3. Set up your own evaluation datasets
  4. Review the API reference for detailed configuration options

🚀 Want to see Opik Agent Optimizer in action? Check out our Example Projects & Cookbooks for runnable Colab notebooks covering real-world optimization workflows, including HotPotQA and synthetic data generation.