Overview

Prompt engineering is the practice of designing and refining prompts to help LLMs generate the desired output. It is typically a manual process that involves editing the prompt, evaluating it, reviewing the results, and trying again.

Prompt optimization is the automation of this process: instead of a human iterating on a prompt by hand, an algorithm generates candidate prompts, evaluates them, and keeps the best one.
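
As an illustration, here is a minimal sketch of that automated loop. The `llm` callable, the dataset shape, and the exact-match metric are all assumptions made for this example, not the API of the Opik Optimizer or any other library:

```python
from typing import Callable

def score_prompt(
    prompt: str,
    dataset: list[dict],
    llm: Callable[[str], str],
) -> float:
    """Fraction of dataset items the prompt answers correctly (exact match)."""
    correct = sum(
        llm(prompt.format(**item["input"])) == item["expected"]
        for item in dataset
    )
    return correct / len(dataset)

def optimize(
    candidates: list[str],
    dataset: list[dict],
    llm: Callable[[str], str],
) -> str:
    # The manual edit/evaluate/review cycle becomes a search: score every
    # candidate prompt on the dataset and keep the highest-scoring one.
    return max(candidates, key=lambda p: score_prompt(p, dataset, llm))
```

In practice the candidate set is not fixed up front; the algorithms described below generate new candidates based on previous results.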


Why optimize prompts?

Prompt engineering is a skill that can be difficult to master, as highlighted by the Anthropic, OpenAI, and Google prompt engineering guides. There are many techniques that can be used to help LLMs generate the desired output.

Prompt optimization solves many of the issues that come with prompt engineering:

  1. Manual prompt engineering on its own is not easily repeatable or scalable.
  2. Performance varies across models, so a prompt tuned for one model can degrade on another and must be re-tuned for each.
  3. Optimization may unlock performance, cost, and reliability improvements.
  4. As systems evolve to become more interdependent, manually tuning multiple prompts becomes increasingly difficult.

So when should you use prompt optimization?

| Aspect | Prompt Engineering | Prompt Optimization |
| --- | --- | --- |
| Scope | Broad — includes designing, experimenting, refining | Narrow — improving already existing prompts |
| Goal | Create a working prompt for a specific task | Maximize performance (accuracy, efficiency, etc.) |
| Involves | Initial drafting, understanding model behavior | Tweaking wording, structure, or context |
| When used | Early in task setup or experimentation | After a baseline prompt is in place |

Optimization algorithms

Supported algorithms

The Opik Optimizer is an experimental Python library that aims to implement Prompt and Agent Optimization algorithms in a consistent format.

The following algorithms have been implemented:

| Algorithm | Description | Guide |
| --- | --- | --- |
| MetaPrompt Optimization | A prompt engineering algorithm that uses a meta-prompt to generate a set of candidate prompts and then uses a Bayesian optimization algorithm to identify the best prompt. | MetaPrompt Optimization |
| Few-shot Bayesian Optimization | A prompt engineering algorithm that uses a few-shot learning approach to generate a set of candidate prompts and then uses a Bayesian optimization algorithm to identify the best prompt. | Few-shot Bayesian Optimization |
| MIPRO Optimization | A prompt engineering algorithm that uses a MIPRO (Multi-Instance Prompt Refinement) approach to generate a set of candidate prompts and then uses a Bayesian optimization algorithm to identify the best prompt. | MIPRO Optimization |
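
All three algorithms share the same broad skeleton: propose candidate prompts, score them against a dataset, and keep the best. The sketch below illustrates that skeleton with an LLM-driven rewriter as the candidate generator; `llm`, `score`, and `META_PROMPT` are assumptions made for this example and do not reflect the library's actual API, and the real algorithms use Bayesian optimization rather than this greedy selection:

```python
from typing import Callable

# Shared skeleton of the optimizers above: propose candidates, score them,
# keep the best. `llm` (a text-in/text-out model call) and `score` (a
# dataset-backed metric) are placeholder callables, not library APIs.

META_PROMPT = (
    "You are a prompt engineer. Rewrite the prompt below to make it "
    "clearer and more likely to produce correct answers.\n\nPrompt:\n{prompt}"
)

def optimize_prompt(
    seed_prompt: str,
    llm: Callable[[str], str],
    score: Callable[[str], float],
    rounds: int = 5,
    candidates_per_round: int = 4,
) -> str:
    best_prompt, best_score = seed_prompt, score(seed_prompt)
    for _ in range(rounds):
        # Candidate generation: ask an LLM to rewrite the current best
        # prompt (the "meta-prompt" idea; the algorithms above use more
        # sophisticated proposal and Bayesian selection strategies).
        candidates = [
            llm(META_PROMPT.format(prompt=best_prompt))
            for _ in range(candidates_per_round)
        ]
        for candidate in candidates:
            candidate_score = score(candidate)
            if candidate_score > best_score:
                best_prompt, best_score = candidate, candidate_score
    return best_prompt
```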

If you would like us to implement another optimization algorithm, reach out to us on GitHub.

Benchmark results

We are still working on the benchmarks; the results below are early, preliminary, and subject to change.

Each optimization algorithm is evaluated against different use-cases and datasets:

  1. Arc: the ai2_arc dataset contains a set of multiple-choice science questions.
  2. GSM8K: the gsm8k dataset contains a set of math problems.
  3. Medhallucination: the medhallucination dataset contains a set of medical questions.
  4. RagBench: the ragbench dataset contains a set of questions and answers.

| Rank | Algorithm | Average Score | Arc dataset | GSM8K dataset | Medhallucination | RagBench |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | Few-shot Bayesian Optimization | 0.6733 | 0.2809 | 0.5926 | 0.9180 | 0.9015 |
| 2 | MIPRO Optimization | 0.5591 | 0.0197 | 0.3970 | 0.9270 | 0.8928 |
| 3 | MetaPrompt Optimization | 0.5201 | 0.2500 | 0.2693 | 0.9179 | 0.6431 |
| 4 | No optimization | 0.1199 | 0.0169 | 0.2406 | 0.1238 | 0.0981 |

The results above are for gpt-4o-mini; the results might change if you use a different model.

Note: These results are preliminary and subject to change; you can learn more about our benchmarks here.