Opik Agent Optimization Overview
Opik Agent Optimizer, including the optimizers described below, is currently in Public Beta. We are actively working on improving these features and welcome your feedback on GitHub!

Opik Agent Optimizer is a comprehensive toolkit designed to enhance the performance and efficiency of your Large Language Model (LLM) applications. Rather than manually editing prompts and re-running evaluations, you can use Opik Agent Optimizer to optimize your prompts automatically.

Why optimize prompts?
Prompt engineering is a skill that can be difficult to master, as highlighted by the Anthropic, OpenAI, and Google prompt engineering guides. There are many techniques you can use to help LLMs generate the desired output.
Prompt optimization solves many of the issues that come with prompt engineering:
- Prompt engineering on its own is not easily repeatable or scalable
- Performance can degrade when you switch models, so prompts need to be tuned for each model
- Optimization may unlock performance, cost and reliability improvements
- As systems evolve, manually tuning multiple prompts becomes increasingly difficult
So when should you use prompt optimization?
Optimizing a prompt
You can optimize any prompt in just a few lines of code, as shown in the sketch below.
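For example, an optimization run can look like the following. This is a minimal sketch based on the Quickstart: the MetaPromptOptimizer and ChatPrompt classes, the hotpot_300 demo dataset, and the exact parameter names reflect the SDK at the time of writing and may differ in your installed version.

```python
from opik.evaluation.metrics import LevenshteinRatio
from opik_optimizer import ChatPrompt, MetaPromptOptimizer
from opik_optimizer.datasets import hotpot_300  # small demo dataset bundled with the library

# Dataset of question/answer pairs used to score candidate prompts
dataset = hotpot_300()

# Metric: how close the model output is to the reference answer
def levenshtein_ratio(dataset_item, llm_output):
    return LevenshteinRatio().score(reference=dataset_item["answer"], output=llm_output)

# The starting prompt to improve; {question} is filled in from each dataset item
prompt = ChatPrompt(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "{question}"},
    ]
)

# Run the optimization and inspect the best prompt found
optimizer = MetaPromptOptimizer(model="openai/gpt-4o-mini")
result = optimizer.optimize_prompt(prompt=prompt, dataset=dataset, metric=levenshtein_ratio)
result.display()
```

The optimizer repeatedly evaluates candidate prompts against the dataset using the metric and returns the best-scoring one.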
You can learn more about running your first optimization in the Quickstart guide.
Optimization Algorithms
Supported algorithms
The Opik Agent Optimizer is an experimental Python library that aims to implement prompt and agent optimization algorithms in a consistent format.
All optimizers leverage LiteLLM for broad model compatibility. This means you can use models from OpenAI, Azure, Anthropic, Google, local Ollama instances, and many more. For details on how to specify different models, see the LiteLLM Support for Optimizers guide.
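For instance, you typically select a model by passing a LiteLLM-style model string when constructing an optimizer. The snippet below is a sketch assuming a MetaPromptOptimizer constructor that accepts a model parameter; the provider prefixes follow LiteLLM's naming conventions.

```python
from opik_optimizer import MetaPromptOptimizer

# LiteLLM model strings use the "<provider>/<model-name>" convention
optimizer_openai = MetaPromptOptimizer(model="openai/gpt-4o-mini")                    # OpenAI
optimizer_claude = MetaPromptOptimizer(model="anthropic/claude-3-5-sonnet-20241022")  # Anthropic
optimizer_local = MetaPromptOptimizer(model="ollama/llama3")                          # local Ollama instance
```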
The following algorithms have been implemented:
If you would like us to implement another optimization algorithm, reach out to us on GitHub, or feel free to contribute by extending the optimizers.
Benchmark results
We are still working on benchmarking; these are early, preliminary results and are subject to change. You can learn more about our benchmarks here.
Each optimization algorithm is evaluated against different use-cases and datasets:
- Arc: The ai2_arc dataset contains a set of multiple-choice science questions
- GSM8K: The gsm8k dataset contains a set of grade-school math word problems
- MedHallu: The medhallu dataset contains a set of medical questions
- RagBench: The ragbench dataset contains a set of retrieval-augmented generation (RAG) examples
The results above were benchmarked against gpt-4o-mini. We use various metrics depending on the dataset, including Levenshtein Ratio, Answer Relevance, and Hallucination. The results may differ if you use a different model, configuration, dataset, or starting prompt(s).
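As a rough illustration of how such metrics are wired into an evaluation, the sketch below wraps two of Opik's built-in heuristic metrics as scoring functions; the (dataset_item, llm_output) signature and the "answer" field name are assumptions here and will vary by dataset.

```python
from opik.evaluation.metrics import Equals, LevenshteinRatio

# Exact-match scoring, e.g. for multiple-choice answers such as Arc
def exact_match(dataset_item: dict, llm_output: str):
    return Equals().score(reference=dataset_item["answer"], output=llm_output)

# Fuzzy string similarity, e.g. for short free-text answers
def levenshtein(dataset_item: dict, llm_output: str):
    return LevenshteinRatio().score(reference=dataset_item["answer"], output=llm_output)
```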