Opik Agent Optimization Overview

Automate enhancing LLM prompts and agent performance.

Opik Agent Optimizer, including the optimizers described below, is currently in Public Beta. We are actively working on improving these features and welcome your feedback on GitHub!

Introducing Opik Agent Optimizer

Opik Agent Optimizer is a comprehensive toolkit designed to enhance the performance and efficiency of your Large Language Model (LLM) applications. Rather than manually editing prompts and re-running evaluations, you can use Opik Agent Optimizer to optimize your prompts automatically.

Opik Agent Optimizer Dashboard

Why optimize prompts?

Prompt engineering is a skill that can be difficult to master, as highlighted by the Anthropic, OpenAI, and Google prompt engineering guides. There are many techniques that can help LLMs generate the desired output.

Prompt optimization addresses many of the issues that come with prompt engineering:

  1. Manual prompt engineering is not easily repeatable or scalable on its own
  2. Performance can degrade across models, so prompts need to be tuned for each model
  3. Optimization may unlock performance, cost, and reliability improvements
  4. As systems evolve, manually tuning multiple prompts becomes increasingly difficult

So when should you use prompt optimization?

| Aspect | Prompt Engineering | Prompt Optimization |
| --- | --- | --- |
| Scope | Broad, includes designing, experimenting, refining | Narrow, improving already existing prompts |
| Goal | Create a working prompt for a specific task | Maximize performance (accuracy, efficiency, etc.) |
| Involves | Initial drafting, understanding model behavior | Tweaking wording, structure, or context |
| When used | Early in task setup or experimentation | After a baseline prompt is in place |

Optimizing a prompt

You can optimize any prompt in just a few lines of code:

```python
import opik_optimizer
from opik.evaluation.metrics import LevenshteinRatio

# Load a demo dataset
dataset = opik_optimizer.datasets.tiny_test()

# Define a metric
def levenshtein_ratio(dataset_item, llm_output):
    metric = LevenshteinRatio()
    return metric.score(reference=dataset_item["label"], output=llm_output)

# Define the prompt to optimize
prompt = opik_optimizer.ChatPrompt(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "{text}"},
    ]
)

# Run the optimization
optimizer = opik_optimizer.MetaPromptOptimizer(model="openai/gpt-4")
opt_result = optimizer.optimize_prompt(
    prompt=prompt,
    dataset=dataset,
    metric=levenshtein_ratio,
)

print(opt_result.prompt)
```
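The metric can be any callable that accepts a dataset item and the LLM output and returns a score. As a minimal sketch, you could swap in an exact-match metric instead of `LevenshteinRatio` (this assumes Opik's `Equals` heuristic metric and a dataset whose items expose a `label` field, as in the demo dataset above):

```python
from opik.evaluation.metrics import Equals

# Scores 1.0 when the output matches the reference label exactly, 0.0 otherwise
def exact_match(dataset_item, llm_output):
    metric = Equals()
    return metric.score(reference=dataset_item["label"], output=llm_output)
```

You would then pass `metric=exact_match` to `optimize_prompt` in place of `levenshtein_ratio`.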

You can learn more about running your first optimization in the Quickstart guide.

Optimization Algorithms

Supported algorithms

The Opik Agent Optimizer is an experimental Python library that aims to implement prompt and agent optimization algorithms in a consistent format.

All optimizers leverage LiteLLM for broad model compatibility. This means you can use models from OpenAI, Azure, Anthropic, Google, local Ollama instances, and many more. For details on how to specify different models, see the LiteLLM Support for Optimizers guide.
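For example, switching providers is a matter of changing the `model` string. This is a minimal sketch; the identifiers below are standard LiteLLM-style names and assume the corresponding API keys or a local Ollama server are configured:

```python
import opik_optimizer

# OpenAI model (LiteLLM identifier: "openai/<model-name>")
optimizer = opik_optimizer.MetaPromptOptimizer(model="openai/gpt-4o-mini")

# Anthropic model
optimizer = opik_optimizer.MetaPromptOptimizer(model="anthropic/claude-3-5-sonnet-20241022")

# Local Ollama model
optimizer = opik_optimizer.MetaPromptOptimizer(model="ollama/llama3")
```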

The following algorithms have been implemented:

| Algorithm | Description |
| --- | --- |
| MetaPrompt Optimization | Uses an LLM (“reasoning model”) to critique and iteratively refine an initial instruction prompt. Good for general prompt wording, clarity, and structural improvements. |
| Few-shot Bayesian Optimization | Specifically for chat models, this optimizer uses Bayesian optimization (Optuna) to find the optimal number and combination of few-shot examples (demonstrations) to accompany a system prompt. |
| MIPRO Optimization | A prompt engineering algorithm that uses a MIPRO (Multi-Instance Prompt Refinement) approach to generate a set of candidate prompts and then uses a Bayesian optimization algorithm to identify the best prompt. |
| Evolutionary Optimization | Employs genetic algorithms to evolve a population of prompts. Can discover novel prompt structures and supports multi-objective optimization (e.g., score vs. length). Can use LLMs for advanced mutation/crossover. |
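All of these optimizers follow the same `optimize_prompt` pattern shown in the example above, so switching algorithms is typically just a matter of instantiating a different class. A hedged sketch, reusing the `prompt`, `dataset`, and `levenshtein_ratio` defined earlier (class names other than `MetaPromptOptimizer` are assumptions based on the algorithm names; check the API reference for the exact constructors and parameters):

```python
import opik_optimizer

# MetaPrompt optimization (used in the example above)
optimizer = opik_optimizer.MetaPromptOptimizer(model="openai/gpt-4")

# Few-shot Bayesian optimization -- assumed class name, see the API reference
optimizer = opik_optimizer.FewShotBayesianOptimizer(model="openai/gpt-4")

# Evolutionary optimization -- assumed class name, see the API reference
optimizer = opik_optimizer.EvolutionaryOptimizer(model="openai/gpt-4")

# All optimizers expose the same entry point
opt_result = optimizer.optimize_prompt(prompt=prompt, dataset=dataset, metric=levenshtein_ratio)
```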

If you would like us to implement another optimization algorithm, reach out to us on GitHub or feel free to contribute by extending the optimizers.

Benchmark results

We are still working on the benchmarks; these are early, preliminary results and are subject to change. You can learn more about our benchmarks here.

Each optimization algorithm is evaluated against different use-cases and datasets:

  1. Arc: The ai2_arc dataset contains a set of multiple-choice science questions
  2. GSM8K: The gsm8k dataset contains a set of math problems
  3. medhallu: The medhallu dataset contains a set of medical questions
  4. RagBench: The ragbench dataset contains a set of retrieval (RAG) examples
| Rank | Algorithm/Optimizer | Average Score | Arc | GSM8K | medhallu | RagBench |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | Few-shot Bayesian Optimization | 67.33% | 28.09% | 59.26% | 91.80% | 90.15% |
| 2 | Evolutionary Optimization | 63.13% | 40.00% | 25.53% | 95.00% | 92.00% |
| 3 | Mipro Optimization (w/ no tools) | 55.91% | 19.70% | 39.70% | 92.70% | 89.28% |
| 4 | MetaPrompt Optimization | 52.01% | 25.00% | 26.93% | 91.79% | 64.31% |
| 5 | No optimization | 11.99% | 1.69% | 24.06% | 12.38% | 9.81% |

The results above were benchmarked against gpt-4o-mini, using different metrics depending on the dataset, including Levenshtein Ratio, Answer Relevance, and Hallucination. Results may differ if you use a different model, configuration, dataset, or starting prompt(s).