Opik Agent Optimization Overview
Opik Agent Optimizer, including the optimizers described below, is currently in Public Beta. We are actively working on improving these features and welcome your feedback on GitHub!

Opik Agent Optimizer is a comprehensive toolkit designed to enhance the performance and efficiency of your Large Language Model (LLM) applications. Rather than manually editing prompts and re-running evaluations, you can use Opik Agent Optimizer to optimize your prompts automatically.

Why optimize prompts?
Prompt engineering is a skill that can be difficult to master, as highlighted by the Anthropic, OpenAI, and Google prompt engineering guides. There are many techniques you can use to help LLMs generate the desired output.
Prompt optimization solves many of the issues that come with prompt engineering:
- Prompt engineering on its own is not easily repeatable or scalable
- Performance can degrade when you switch models, so prompts need to be tuned for each model
- Optimization may unlock performance, cost and reliability improvements
- As systems evolve, manually tuning multiple prompts becomes increasingly difficult
So when should you use prompt optimization?
Optimizing a prompt
You can optimize any prompt in just a few lines of code, as shown in the sketch below.
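For example, an optimization run can look like the following. This is a minimal sketch based on the Quickstart: the MetaPromptOptimizer and ChatPrompt classes, the hotpot_300 demo dataset, and the exact parameter names reflect the SDK at the time of writing and may differ in your installed version.

```python
from opik.evaluation.metrics import LevenshteinRatio
from opik_optimizer import ChatPrompt, MetaPromptOptimizer
from opik_optimizer.datasets import hotpot_300  # small demo dataset bundled with the library

# Dataset of question/answer pairs used to score candidate prompts
dataset = hotpot_300()

# Metric: how close the model output is to the reference answer
def levenshtein_ratio(dataset_item, llm_output):
    return LevenshteinRatio().score(reference=dataset_item["answer"], output=llm_output)

# The starting prompt to improve; {question} is filled in from each dataset item
prompt = ChatPrompt(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "{question}"},
    ]
)

# Run the optimization and inspect the best prompt found
optimizer = MetaPromptOptimizer(model="openai/gpt-4o-mini")
result = optimizer.optimize_prompt(prompt=prompt, dataset=dataset, metric=levenshtein_ratio)
result.display()
```

The optimizer repeatedly evaluates candidate prompts against the dataset using the metric and returns the best-scoring one.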
You can learn more about running your first optimization in the Quickstart guide.
Optimization Algorithms
Supported algorithms
The Opik Agent Optimizer is an experimental Python library that aims to implement prompt and agent optimization algorithms in a consistent format.
All optimizers leverage LiteLLM for broad model compatibility. This means you can use models from OpenAI, Azure, Anthropic, Google, local Ollama instances, and many more. For details on how to specify different models, see the LiteLLM Support for Optimizers guide.
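For instance, you typically select a model by passing a LiteLLM-style model string when constructing an optimizer. The snippet below is a sketch assuming a MetaPromptOptimizer constructor that accepts a model parameter; the provider prefixes follow LiteLLM's naming conventions.

```python
from opik_optimizer import MetaPromptOptimizer

# LiteLLM model strings use the "<provider>/<model-name>" convention
optimizer_openai = MetaPromptOptimizer(model="openai/gpt-4o-mini")                    # OpenAI
optimizer_claude = MetaPromptOptimizer(model="anthropic/claude-3-5-sonnet-20241022")  # Anthropic
optimizer_local = MetaPromptOptimizer(model="ollama/llama3")                          # local Ollama instance
```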
The following algorithms have been implemented:
If you would like us to implement another optimization algorithm, reach out to us on GitHub, or feel free to contribute by extending the optimizers.
Benchmark results
We are still working on benchmarking; these are early, preliminary results and are subject to change. You can learn more about our benchmarks here.
Each optimization algorithm is evaluated against different use-cases and datasets:
- Arc: The ai2_arc dataset contains a set of multiple-choice science questions
- GSM8K: The gsm8k dataset contains a set of grade-school math word problems
- MedHallu: The medhallu dataset contains a set of medical questions
- RagBench: The ragbench dataset contains a set of retrieval-augmented generation (RAG) examples
The results above were benchmarked against gpt-4o-mini. We use various metrics depending on the dataset, including Levenshtein Ratio, Answer Relevance, and Hallucination. The results may differ if you use a different model, configuration, dataset, or starting prompt(s).
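As a rough illustration of how such metrics are wired into an evaluation, the sketch below wraps two of Opik's built-in heuristic metrics as scoring functions; the (dataset_item, llm_output) signature and the "answer" field name are assumptions here and will vary by dataset.

```python
from opik.evaluation.metrics import Equals, LevenshteinRatio

# Exact-match scoring, e.g. for multiple-choice answers such as Arc
def exact_match(dataset_item: dict, llm_output: str):
    return Equals().score(reference=dataset_item["answer"], output=llm_output)

# Fuzzy string similarity, e.g. for short free-text answers
def levenshtein(dataset_item: dict, llm_output: str):
    return LevenshteinRatio().score(reference=dataset_item["answer"], output=llm_output)
```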