GEPA Optimizer

Single-turn system prompt optimization with reflection

GepaOptimizer wraps the external GEPA package to optimize a single system prompt for single-turn tasks. It maps Opik datasets and metrics into GEPA’s expected format, runs GEPA’s optimization using a task model and a reflection model, and returns a standard OptimizationResult compatible with the Opik SDK.

GepaOptimizer is ideal when you have a single-turn task (one user input → one model response) and you want to optimize the system prompt using a reflection-driven search.

How it works

The GEPA optimizer combines two key approaches to optimize agents:

  1. Reflection: The optimizer uses the outcomes from evaluations to improve the prompts.
  2. Evolution: The optimizer uses an evolutionary algorithm to explore the space of prompts.

You can learn more about the algorithm in the GEPA paper, but in short, the optimizer will:

  1. Evaluate candidate prompts against your dataset and metric.
  2. Use a reflection model to analyze the evaluation results and propose improved prompts.
  3. Keep the most promising candidates on a Pareto front and repeat until the trial budget is exhausted.

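At a glance, the loop can be sketched as follows. This is illustrative pseudocode, not the gepa package's actual internals; evaluate, pick_from_pareto_front, and propose_mutation are hypothetical helpers:

# Illustrative pseudocode of GEPA's reflect-and-evolve loop.
# `evaluate`, `pick_from_pareto_front`, and `propose_mutation` are
# hypothetical helpers, not part of the gepa package.
def gepa_loop(seed_prompt, dataset, metric, reflection_model, max_trials):
    candidates = {seed_prompt: evaluate(seed_prompt, dataset, metric)}
    for _ in range(max_trials):
        # Pick a parent from the Pareto front of per-example scores.
        parent = pick_from_pareto_front(candidates)
        # Reflection: an LLM reads the evaluation results and rewrites the prompt.
        child = propose_mutation(parent, candidates[parent], reflection_model)
        # Evolution: score the child and add it to the candidate pool.
        candidates[child] = evaluate(child, dataset, metric)
    return max(candidates, key=candidates.get)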

Quickstart

1"""
2Optimize a simple system prompt on the tiny_test dataset.
3Requires: pip install gepa, and a valid OPENAI_API_KEY for LiteLLM-backed models.
4"""
5from typing import Any, Dict
6
7from opik.evaluation.metrics import LevenshteinRatio
8from opik.evaluation.metrics.score_result import ScoreResult
9
10from opik_optimizer import ChatPrompt, datasets
11from opik_optimizer.gepa_optimizer import GepaOptimizer
12
13def levenshtein_ratio(dataset_item: Dict[str, Any], llm_output: str) -> ScoreResult:
14 return LevenshteinRatio().score(reference=dataset_item["label"], output=llm_output)
15
16dataset = datasets.tiny_test()
17
18prompt = ChatPrompt(
19 system="You are a helpful assistant. Answer concisely with the exact answer.",
20 user="{text}",
21)
22
23optimizer = GepaOptimizer(
24 model="openai/gpt-4o-mini",
25 n_threads=6,
26 temperature=0.2,
27 max_tokens=200,
28)
29
30result = optimizer.optimize_prompt(
31 prompt=prompt,
32 dataset=dataset,
33 metric=levenshtein_ratio,
34 max_trials=12,
35 reflection_minibatch_size=2,
36 n_samples=5,
37)
38
39result.display()

Determinism and tool usage

  • GEPA’s seed is forwarded directly to the underlying gepa.optimize call, but any non-determinism in your prompt (tool calls, non-zero temperature, external APIs) will still introduce variance. To test seeding in isolation, disable tools or substitute cached responses; see the sketch after this list.
  • GEPA emits its own baseline evaluation inside the optimization loop. You’ll see one baseline score from Opik’s wrapper and another from GEPA before the first trial; this is expected and does not double-charge the metric budget.
  • Reflection only triggers after GEPA accepts at least reflection_minibatch_size unique prompts. If the minibatch is larger than the trial budget, the optimizer logs a warning and skips reflection.
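For example, to check GEPA's seeding in isolation, pin the seed and remove sampling randomness. This is a minimal sketch, reusing the quickstart's prompt, dataset, and metric, and assuming temperature is accepted directly as in the quickstart; provider-side nondeterminism can still make two runs diverge:

# A minimal determinism check -- a sketch, not a guarantee.
optimizer = GepaOptimizer(
    model="openai/gpt-4o-mini",
    seed=123,          # forwarded to the underlying gepa.optimize call
    temperature=0.0,   # greedy decoding removes sampling variance
)

result_a = optimizer.optimize_prompt(
    prompt=prompt, dataset=dataset, metric=levenshtein_ratio, max_trials=4
)
result_b = optimizer.optimize_prompt(
    prompt=prompt, dataset=dataset, metric=levenshtein_ratio, max_trials=4
)
# With seeding intact and no tools or external APIs in the prompt,
# the two runs should produce matching candidates.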

GEPA scores vs. Opik scores

  • The GEPA Score column reflects the aggregate score GEPA computes on its train/validation split when deciding which candidates stay on the Pareto front. It is useful for understanding how GEPA’s evolutionary search ranks prompts.
  • The Opik Score column is a fresh evaluation performed through Opik’s metric pipeline on the same dataset (respecting n_samples). This is the score you should use when comparing against your baseline or other optimizers.
  • Because the GEPA score is based on GEPA’s internal aggregation, it can diverge from the Opik score for the same prompt. This is expected—treat the GEPA score as a hint about why GEPA kept or discarded a candidate, and rely on the Opik score for final comparisons.

skip_perfect_score

  • When skip_perfect_score=True, GEPA immediately ignores any candidate whose GEPA score meets or exceeds the perfect_score threshold (default 1.0). This keeps the search moving toward imperfect prompts instead of spending budget refining already perfect ones.
  • Set skip_perfect_score=False if your metric tops out below 1.0, or if you still want to see how GEPA mutates a perfect-scoring prompt—for example, when you care about ties being broken by Opik’s rescoring step rather than GEPA’s aggregate.
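For example, if your metric realistically tops out around 0.9, either lower the threshold or disable the skip entirely (both parameters are documented below; the values here are illustrative):

# Option 1: lower the threshold to match the metric's ceiling.
result = optimizer.optimize_prompt(
    prompt=prompt,
    dataset=dataset,
    metric=levenshtein_ratio,
    skip_perfect_score=True,
    perfect_score=0.9,
)

# Option 2: never skip, and let Opik's rescoring break ties.
result = optimizer.optimize_prompt(
    prompt=prompt,
    dataset=dataset,
    metric=levenshtein_ratio,
    skip_perfect_score=False,
)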

Configuration Options

Optimizer parameters

The optimizer has the following parameters:

  • model (str, defaults to gpt-4o): LiteLLM model name for the optimization algorithm.
  • model_parameters (dict[str, typing.Any] | None): Optional dict of LiteLLM parameters for the optimizer’s internal LLM calls. Common params: temperature, max_tokens, max_completion_tokens, top_p.
  • n_threads (int, defaults to 6): Number of parallel threads for evaluation.
  • verbose (int, defaults to 1): Controls internal logging/progress bars (0 = off, 1 = on).
  • seed (int, defaults to 42): Random seed for reproducibility.

optimize_prompt parameters

The optimize_prompt method has the following parameters:

  • prompt (ChatPrompt): The prompt to optimize.
  • dataset (Dataset): Opik dataset to optimize on.
  • metric (Callable): Metric function used to evaluate outputs.
  • experiment_config (dict | None): Optional configuration for the experiment.
  • n_samples (int | None): Optional number of dataset items to evaluate on.
  • auto_continue (bool, defaults to False): Whether to auto-continue optimization.
  • agent_class (type[opik_optimizer.optimizable_agent.OptimizableAgent] | None): Optional agent class to use.
  • project_name (str, defaults to Optimization).
  • max_trials (int, defaults to 10): Maximum number of different prompts to test.
  • reflection_minibatch_size (int, defaults to 3): Size of reflection minibatches.
  • candidate_selection_strategy (str, defaults to pareto): Strategy for candidate selection; one of “pareto”, “current_best”, or “epsilon_greedy”.
  • skip_perfect_score (bool, defaults to True): Skip candidates with perfect scores.
  • perfect_score (float, defaults to 1.0): Score considered perfect.
  • use_merge (bool, defaults to False): Enable merge operations.
  • max_merge_invocations (int, defaults to 5): Maximum number of merge invocations.
  • run_dir (str | None, defaults to None): Directory for run outputs.
  • track_best_outputs (bool, defaults to False): Track best outputs during optimization.
  • display_progress_bar (bool, defaults to False): Display a progress bar.
  • seed (int, defaults to 42): Random seed for reproducibility.
  • raise_on_exception (bool, defaults to True): Raise exceptions instead of continuing.
  • kwargs (Any): Additional keyword arguments.
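Putting several of these together in one call (the parameter names match the list above; the values are illustrative):

# All parameters below are documented above; the values are illustrative.
result = optimizer.optimize_prompt(
    prompt=prompt,
    dataset=dataset,
    metric=levenshtein_ratio,
    n_samples=20,
    max_trials=25,
    reflection_minibatch_size=3,
    candidate_selection_strategy="current_best",
    use_merge=True,
    max_merge_invocations=3,
    track_best_outputs=True,
    display_progress_bar=True,
    seed=7,
)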

Model Support

GEPA coordinates two model contexts:

  • GepaOptimizer.model: LiteLLM model string the optimizer uses for internal reasoning (reflection, mutation prompts, etc.).
  • ChatPrompt.model: The model evaluated against your dataset—this should match what you run in production.

Set model to any LiteLLM-supported provider (e.g., "gpt-4o", "azure/gpt-4", "anthropic/claude-3-opus", "gemini/gemini-1.5-pro") and pass extra parameters via model_parameters when you need to tune temperature, max tokens, or other limits:

optimizer = GepaOptimizer(
    model="anthropic/claude-3-opus-20240229",
    model_parameters={
        "temperature": 0.7,
        "max_tokens": 4096,
    },
)

Reflection is handled internally; there is no separate reflection_model argument to set.
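To make the two model contexts explicit, set them independently. This is a sketch; it assumes ChatPrompt accepts a model argument, per the ChatPrompt.model description above:

# Optimizer model: used internally for reflection and mutation prompts.
optimizer = GepaOptimizer(model="openai/gpt-4o")

# Task model: answers the dataset items -- match your production model.
prompt = ChatPrompt(
    system="You are a helpful assistant.",
    user="{text}",
    model="openai/gpt-4o-mini",  # assumed kwarg, per ChatPrompt.model above
)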

Limitations & tips

  • Instruction-focused: The current wrapper optimizes the instruction/system portion of your prompt. If you rely heavily on few-shot exemplars, consider pairing GEPA with the Few-Shot Bayesian optimizer or an Evolutionary run.
  • Reflection can misfire: GEPA’s reflective mutations are only as good as the metric reasons you supply. If ScoreResult.reason is vague, the optimizer may reinforce bad behaviors. Invest in descriptive metrics before running GEPA at scale; see the sketch after this list.
  • Cost-aware: Although GEPA is more sample-efficient than some RL-based methods, reflection and Pareto scoring still consume multiple LLM calls per trial. Start with small max_trials and monitor API usage.
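For example, a metric that attaches a concrete reason gives the reflection step something to act on. A minimal sketch, assuming ScoreResult exposes name, value, and reason fields:

from typing import Any, Dict

from opik.evaluation.metrics import LevenshteinRatio
from opik.evaluation.metrics.score_result import ScoreResult

def descriptive_levenshtein(dataset_item: Dict[str, Any], llm_output: str) -> ScoreResult:
    base = LevenshteinRatio().score(reference=dataset_item["label"], output=llm_output)
    # Spell out *why* the score is what it is, so GEPA's reflection
    # step has concrete failure information to mutate against.
    reason = (
        f"Expected '{dataset_item['label']}' but got '{llm_output}' "
        f"(Levenshtein ratio {base.value:.2f})."
    )
    return ScoreResult(name="descriptive_levenshtein", value=base.value, reason=reason)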

Next Steps

  1. Explore specific Optimizers for algorithm details.
  2. Refer to the FAQ for common questions and troubleshooting.
  3. Refer to the API Reference for detailed configuration options.