Optimizer Frequently Asked Questions

Common questions about the Opik Agent Optimizer

Getting started

The Agent Optimizer provides a unified interface for optimizing your existing prompts and agents with state-of-the-art optimization algorithms. In addition to giving you access to cutting-edge academic optimizers like GEPA, it also provides a set of algorithms developed in-house based on production applications.

The optimizer lets you improve the performance of your agents without manual prompt engineering. You can also use it to reduce the size of the prompts in your agents, cutting cost and latency while maintaining performance.

To get started, you will need:

  1. The prompt you want to optimize
  2. A dataset of examples to optimize on; you can start with as few as 10
  3. A metric to evaluate the performance of the prompt

Once you have these, check out the Quickstart Guide to run your first optimization. The sketch below previews what such a run looks like.
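A hedged preview (the optimizer class, the dataset name, and the exact optimize_prompt() signature here are illustrative and may differ by version; the Quickstart Guide has the canonical example):

from opik_optimizer import ChatPrompt, MetaPromptOptimizer
from opik.evaluation.metrics.score_result import ScoreResult
import opik

# 1. The prompt you want to optimize
prompt = ChatPrompt(user="Answer: {question}", model="gpt-4o")

# 2. A dataset of examples (as few as 10 to start)
client = opik.Opik()
dataset = client.get_or_create_dataset(name="my-examples")
dataset.insert([{"question": "What is AI?", "output": "..."}])

# 3. A metric that scores each output
def my_metric(dataset_item, llm_output):
    # Simple deterministic check: is the expected output contained in the answer?
    score = 1.0 if dataset_item["output"] in llm_output else 0.0
    return ScoreResult(name="exact-match", value=score, reason="Substring match against the expected output")

# Run the optimization
optimizer = MetaPromptOptimizer(model="gpt-4o")
result = optimizer.optimize_prompt(prompt=prompt, dataset=dataset, metric=my_metric)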

Can you help me set up the optimizer for my use case?

Yes, we would be more than happy to help you set up the Opik Optimizer for your use case! You can join our Slack community and ask for help there.

Optimization Algorithms

Opik Agent Optimizer supports a wide range of optimization algorithms, including:

  • HierarchicalReflectiveOptimizer
  • MetaPromptOptimizer
  • FewShotBayesianOptimizer
  • EvolutionaryOptimizer
  • GepaOptimizer

If you would like us to add a new optimization algorithm, simply create an issue on our GitHub repository and we will be happy to add it!

Which optimizer should I use?

Which optimizer to use depends on your specific needs. As a rule of thumb, we recommend starting with the Hierarchical Reflective Optimizer, as it has been shown to be a strong baseline for most tasks.

You can also try to use:

  1. GEPA: This is one of the top performing academic optimizers and is a good option if you have a complex task.
  2. FewShotBayesianOptimizer: If your task is quite repetitive in the formatting of prompts and responses, this is a good option.

How long does an optimization take?

While some optimizations run for short periods of time, it is common for optimizations to take a couple of hours to complete. As you are starting out, we recommend setting the max_trials parameter to a modest number and increasing or decreasing it as you go.

How many samples do I need?

The number of samples you need depends on your task. As a rule of thumb, we recommend starting with at least 10 samples. The more samples you have, the more accurate the optimization will be. The sketch below shows where both of these knobs fit.
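A hedged sketch, reusing the objects from the earlier preview; the exact placement of these parameters may differ between optimizers and versions:

result = optimizer.optimize_prompt(
    prompt=prompt,
    dataset=dataset,
    metric=my_metric,
    n_samples=10,   # dataset items evaluated per candidate prompt
    max_trials=5,   # candidate prompts tried before stopping
)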

It’s important to note that there are two models at play when optimizing prompts (see the sketch after this list):

  1. The ChatPrompt model: the model your prompt runs against; it should be the same as the one in your application.
  2. The optimizer model: the model the optimizer uses to improve your prompt; you will get the best results by using the most powerful model you can for the optimization.
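A minimal sketch of where each model is configured, assuming an optimizer constructor that accepts a model argument (check the API of your optimizer and version):

from opik_optimizer import ChatPrompt, MetaPromptOptimizer

# The ChatPrompt model: should match the model your application uses
prompt = ChatPrompt(user="Answer: {question}", model="gpt-4o-mini")

# The optimizer model: use the most powerful model you can afford here
optimizer = MetaPromptOptimizer(model="gpt-4o")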

Which models can I use?

The Opik Agent Optimizer uses LiteLLM to support a wide range of models (see the model-string sketch after this list), including:

  • OpenAI (e.g., GPT-5, GPT-4.5, GPT-4)
  • Anthropic (e.g., Claude 3.7 Sonnet, Claude 3.5 Haiku)
  • Google (e.g., Gemini 2.0 Flash, Gemini 2.0 Pro)
  • Cohere (e.g., Command R, Command R+)
  • Mistral AI (e.g., Mistral-7B-Instruct, Mixtral-8x7B-Instruct)
  • Locally hosted models (e.g., via Ollama, Hugging Face Inference Endpoints)
  • And many others supported by LiteLLM.
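LiteLLM addresses models with provider-prefixed strings, passed wherever a model is expected. The identifiers below are illustrative; check the LiteLLM documentation for the current list:

from opik_optimizer import ChatPrompt

prompt = ChatPrompt(user="Answer: {question}", model="openai/gpt-4o")                       # OpenAI
prompt = ChatPrompt(user="Answer: {question}", model="anthropic/claude-3-7-sonnet-latest")  # Anthropic
prompt = ChatPrompt(user="Answer: {question}", model="gemini/gemini-2.0-flash")             # Google
prompt = ChatPrompt(user="Answer: {question}", model="ollama/llama3")                       # locally hosted via Ollama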

How can I improve the results of my optimization?

There are a few things you can try:

  1. Review your optimization metric: ideally it should give the model an insightful reason for each score, as sketched below. The better the metric’s feedback, the better the optimizer’s edits.
  2. Review your dataset: ideally it should be a diverse set of examples covering the different scenarios you want to optimize for.
  3. Use more powerful models for both the ChatPrompt model and the optimizer; more capable models generate better candidate prompts.
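For point 1, a hedged sketch of a metric whose reason explains the failure mode rather than only returning a number (the metric name and dataset fields are illustrative):

from opik.evaluation.metrics.score_result import ScoreResult

def grounded_answer_metric(dataset_item, llm_output):
    expected = dataset_item["output"]
    score = 1.0 if expected in llm_output else 0.0
    reason = (
        "Answer contains the expected output."
        if score == 1.0
        else f"Answer is missing the expected output '{expected}'; it may have drifted off-topic."
    )
    return ScoreResult(name="grounded-answer", value=score, reason=reason)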

Common Errors

Why does optimize_prompt() raise an error about the prompt type?

This error occurs when you pass an incorrect type to the optimizer’s optimize_prompt() method.

Solution: Ensure you’re using the ChatPrompt class to define your prompt:

from opik_optimizer import ChatPrompt

prompt = ChatPrompt(
    messages=[
        {"role": "system", "content": "Your system prompt here"},
        {"role": "user", "content": "Your user prompt with {variable}"}
    ],
    model="gpt-4"
)

Why does the optimizer reject my dataset?

This error occurs when the dataset passed to the optimizer is not a proper Dataset object.

Solution: Use the Dataset class to create your dataset:

import opik

client = opik.Opik()
dataset = client.get_or_create_dataset(name='your-dataset-name')
dataset.insert([
    {"input": "example 1", "output": "expected 1"},
    {"input": "example 2", "output": "expected 2"}
])

Why does the optimizer complain about my metric?

This error occurs when the metric parameter is not callable or doesn’t have the correct signature.

Solution: Ensure your metric is a function that takes dataset_item and llm_output as arguments and returns a ScoreResult:

from opik.evaluation.metrics.score_result import ScoreResult

def my_metric(dataset_item, llm_output):
    # Your scoring logic here; calculate_score stands in for it
    score = calculate_score(dataset_item, llm_output)
    return ScoreResult(
        name="my-metric",
        value=score,
        reason="Explanation for the score"
    )

Why am I getting errors about missing placeholder variables?

This error occurs when your prompt template contains placeholders (e.g., {variable}) that don’t match your dataset fields.

Solution: Ensure all placeholders in your prompt match the keys in your dataset:

from opik_optimizer import ChatPrompt
import opik

# Prompt with {question} placeholder
prompt = ChatPrompt(
    user="Answer: {question}",
    model="gpt-4"
)

# Dataset must have a 'question' field
client = opik.Opik()
dataset = client.get_or_create_dataset(name='your-dataset-name')
dataset.insert([
    {"question": "What is AI?", "output": "..."}
])

Why do I get an import error when using the GepaOptimizer?

This error occurs when trying to use the GepaOptimizer without the required gepa package installed.

Solution: Install the gepa package:

pip install gepa

Why am I getting authentication errors from my LLM provider?

This error typically occurs when the LLM provider API key is not configured in your environment.

Solution: Set the appropriate environment variable for your LLM provider:

# For OpenAI
export OPENAI_API_KEY="your-api-key"

# For Anthropic
export ANTHROPIC_API_KEY="your-api-key"

# For other providers, check the LiteLLM documentation

Open challenges & advanced topics

Are optimized prompts always human-readable?

Research has shown that “evil twin” prompts and unusual delimiters can perform well despite being hard to interpret. Optimizers explore the search space indiscriminately, so high-performing instructions aren’t always human-readable. When interpretability matters, prefer algorithms like Hierarchical Reflective that include reasoning traces, or enforce structure via custom metrics.

How much does an optimization cost?

Cost varies widely. Reflection-heavy optimizers (e.g., Hierarchical Reflective, GEPA) may call LLMs multiple times per trial, while MetaPrompt and Few-Shot Bayesian are lighter weight. Start with small n_samples and max_trials, monitor API usage, and review the Benchmarks page for sample-efficiency notes.

Can optimized prompts overfit my dataset?

Optimizers tune prompts for the dataset you provide, so prompts may overfit if the dataset lacks coverage. Use diverse datasets, consider chaining optimizers (e.g., Evolutionary → Few-Shot Bayesian, sketched below) to encourage generalization, and re-evaluate on unseen samples before shipping.
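A hedged sketch of chaining, assuming optimizer constructors that accept a model argument and a result object exposing the optimized messages as result.prompt (both may differ by version); dataset and my_metric are the objects from the earlier preview:

from opik_optimizer import ChatPrompt, EvolutionaryOptimizer, FewShotBayesianOptimizer

prompt = ChatPrompt(user="Answer: {question}", model="gpt-4o")

# Stage 1: explore broadly with the evolutionary optimizer
stage1 = EvolutionaryOptimizer(model="gpt-4o")
result1 = stage1.optimize_prompt(prompt=prompt, dataset=dataset, metric=my_metric)

# Stage 2: refine the winner with few-shot Bayesian search
stage2 = FewShotBayesianOptimizer(model="gpt-4o")
result2 = stage2.optimize_prompt(
    prompt=ChatPrompt(messages=result1.prompt, model="gpt-4o"),
    dataset=dataset,
    metric=my_metric,
)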

Can I optimize for multiple objectives at once?

Yes. Compose metrics using MultiMetricObjective or custom heuristics so optimizers weigh multiple goals. For complex trade-offs, capture human preferences in the metric reasons, or explore Pareto-aware optimizers like GEPA that surface trade-offs between accuracy and cost. A hand-rolled composite metric is sketched below.
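A minimal sketch of a hand-rolled composite metric; my_accuracy and my_conciseness are hypothetical sub-scorers returning values in [0, 1], and MultiMetricObjective is the built-in alternative:

from opik.evaluation.metrics.score_result import ScoreResult

def composite_metric(dataset_item, llm_output):
    accuracy = my_accuracy(dataset_item, llm_output)        # hypothetical sub-scorer
    conciseness = my_conciseness(dataset_item, llm_output)  # hypothetical sub-scorer
    value = 0.8 * accuracy + 0.2 * conciseness              # weights encode the trade-off
    return ScoreResult(
        name="composite",
        value=value,
        reason=f"accuracy={accuracy:.2f}, conciseness={conciseness:.2f}",
    )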

Can I optimize multimodal prompts and full agents?

Opik supports optimizing prompts that handle images and videos, as well as agents built with LangGraph, Google ADK, or MCP toolchains. See the Optimize agents and Optimize multimodal guides for modality-specific advice.

What if my metric itself is noisy?

LLM-based metrics can be noisy. Include deterministic checks whenever possible, and make sure your ScoreResult.reason is informative so reflective optimizers can identify true failure modes. See Define metrics and Custom metrics for best practices.

Can I enforce safety or alignment constraints?

Yes. When optimizing prompts that handle sensitive content, bake alignment constraints into your dataset and metrics (e.g., moderation scores) and review outputs manually before deployment. Multi-objective setups help enforce safety alongside accuracy.

Next Steps