GEPA Optimizer

Single-turn system prompt optimization with reflection

GepaOptimizer wraps the external GEPA package to optimize a single system prompt for single-turn tasks. It maps Opik datasets and metrics into GEPA’s expected format, runs GEPA’s optimization using a task model and a reflection model, and returns a standard OptimizationResult compatible with the Opik SDK.

GepaOptimizer is ideal when you have a single-turn task (one user input → one model response) and you want to optimize the system prompt using a reflection-driven search.

How it works

The GEPA optimizer combines two key approaches to optimize agents:

  1. Reflection: The optimizer uses the outcomes from evaluations to improve the prompts.
  2. Evolution: The optimizer uses an evolutionary algorithm to explore the space of prompts.

You can learn more about the algorithm in the GEPA paper, but in short, the optimizer will:

  1. Evaluate the starting prompt on the dataset to establish a baseline.
  2. Use a reflection model to analyze the evaluation results and propose improved prompts.
  3. Keep the strongest candidates on a Pareto front and evolve them over successive trials.
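To see the shape of that loop, here is a deliberately tiny, self-contained toy. None of this is GEPA's actual code; the metric, reflection, and mutation functions are pure stand-ins:

# A toy illustration of reflect-and-evolve; not GEPA's real implementation.
def evaluate(prompt: str) -> float:
    # Stand-in metric: rewards prompts that ask for concise, exact answers.
    return sum(word in prompt for word in ("concise", "exact")) / 2

def reflect(prompt: str) -> str:
    # Stand-in reflection: a real reflection model would diagnose failures.
    return "exact" if "exact" not in prompt else "concise"

def mutate(prompt: str, feedback: str) -> str:
    return f"{prompt} Be {feedback}."

candidates = [("You are a helpful assistant.", 0.0)]
for trial in range(4):
    parent, _ = max(candidates, key=lambda c: c[1])  # greedy stand-in for Pareto selection
    child = mutate(parent, reflect(parent))          # reflection guides the mutation
    candidates.append((child, evaluate(child)))

print(max(candidates, key=lambda c: c[1]))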

Quickstart

1"""
2Optimize a simple system prompt on the tiny_test dataset.
3Requires: pip install gepa, and a valid OPENAI_API_KEY for LiteLLM-backed models.
4"""
5from typing import Any, Dict
6
7from opik.evaluation.metrics import LevenshteinRatio
8from opik.evaluation.metrics.score_result import ScoreResult
9
10from opik_optimizer import ChatPrompt, datasets
11from opik_optimizer.gepa_optimizer import GepaOptimizer
12
13def levenshtein_ratio(dataset_item: Dict[str, Any], llm_output: str) -> ScoreResult:
14 return LevenshteinRatio().score(reference=dataset_item["label"], output=llm_output)
15
16dataset = datasets.tiny_test()
17
18prompt = ChatPrompt(
19 system="You are a helpful assistant. Answer concisely with the exact answer.",
20 user="{text}",
21)
22
23optimizer = GepaOptimizer(
24 model="openai/gpt-4o-mini",
25 reflection_model="openai/gpt-4o", # stronger reflector is often helpful
26 n_threads=6,
27 temperature=0.2,
28 max_tokens=200,
29)
30
31result = optimizer.optimize_prompt(
32 prompt=prompt,
33 dataset=dataset,
34 metric=levenshtein_ratio,
35 max_trials=12,
36 reflection_minibatch_size=2,
37 n_samples=5,
38)
39
40result.display()

Determinism and tool usage

  • GEPA’s seed is forwarded directly to the underlying gepa.optimize call, but any non-determinism in your prompt (tool calls, non-zero temperature, external APIs) will still introduce variance. To test seeding in isolation, disable tools or substitute cached responses (see the sketch after this list).
  • GEPA emits its own baseline evaluation inside the optimization loop. You’ll see one baseline score from Opik’s wrapper and another from GEPA before the first trial; this is expected and does not double-charge the metric budget.
  • Reflection only triggers after GEPA accepts at least reflection_minibatch_size unique prompts. If the minibatch is larger than the trial budget, the optimizer logs a warning and skips reflection.
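To make the seed comparison from the first bullet concrete, here is a sketch that pins the controllable sources of variance, continuing from the Quickstart above (the exact values are illustrative):

optimizer = GepaOptimizer(
    model="openai/gpt-4o-mini",
    reflection_model="openai/gpt-4o",
    temperature=0.0,  # remove sampling variance from the task model
    seed=42,          # forwarded to gepa.optimize
)

result = optimizer.optimize_prompt(
    prompt=prompt,
    dataset=dataset,
    metric=levenshtein_ratio,
    seed=42,
)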

GEPA scores vs. Opik scores

  • The GEPA Score column reflects the aggregate score GEPA computes on its train/validation split when deciding which candidates stay on the Pareto front. It is useful for understanding how GEPA’s evolutionary search ranks prompts.
  • The Opik Score column is a fresh evaluation performed through Opik’s metric pipeline on the same dataset (respecting n_samples). This is the score you should use when comparing against your baseline or other optimizers.
  • Because the GEPA score is based on GEPA’s internal aggregation, it can diverge from the Opik score for the same prompt. This is expected—treat the GEPA score as a hint about why GEPA kept or discarded a candidate, and rely on the Opik score for final comparisons.

skip_perfect_score

  • When skip_perfect_score=True, GEPA immediately ignores any candidate whose GEPA score meets or exceeds the perfect_score threshold (default 1.0). This keeps the search moving toward imperfect prompts instead of spending budget refining already perfect ones.
  • Set skip_perfect_score=False if your metric tops out below 1.0, or if you still want to see how GEPA mutates a perfect-scoring prompt—for example, when you care about ties being broken by Opik’s rescoring step rather than GEPA’s aggregate.
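For example, if your metric realistically tops out around 0.95, you can lower the threshold instead of disabling skipping altogether (continuing from the Quickstart; the 0.95 value is illustrative):

result = optimizer.optimize_prompt(
    prompt=prompt,
    dataset=dataset,
    metric=levenshtein_ratio,
    perfect_score=0.95,       # treat 0.95+ as "perfect" for this metric
    skip_perfect_score=True,  # set to False to keep mutating top scorers
)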

Configuration Options

Optimizer parameters

The optimizer has the following parameters:

model (str, default: gpt-4o)
    LiteLLM model name for the optimization algorithm.
model_parameters (dict[str, typing.Any] | None)
    Optional dict of LiteLLM parameters for the optimizer's internal LLM calls. Common params: temperature, max_tokens, max_completion_tokens, top_p.
n_threads (int, default: 6)
    Number of parallel threads for evaluation.
verbose (int, default: 1)
    Controls internal logging/progress bars (0 = off, 1 = on).
seed (int, default: 42)
    Random seed for reproducibility.

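Put together, a constructor call that sets each of these explicitly might look like the following (the model_parameters values are illustrative, not recommendations):

optimizer = GepaOptimizer(
    model="gpt-4o",
    model_parameters={"temperature": 0.2, "max_tokens": 512},
    n_threads=6,  # parallel evaluation threads
    verbose=1,    # 0 = off, 1 = on
    seed=42,      # reproducibility
)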
optimize_prompt parameters

The optimize_prompt method has the following parameters:

prompt (ChatPrompt)
    The prompt to optimize.
dataset (Dataset)
    Opik Dataset to optimize on.
metric (Callable)
    Metric function to evaluate on.
experiment_config (dict | None)
    Optional configuration for the experiment.
n_samples (int | None)
    Optional number of items to test in the dataset.
auto_continue (bool, default: False)
    Whether to auto-continue optimization.
agent_class (type[opik_optimizer.optimizable_agent.OptimizableAgent] | None)
    Optional agent class to use.
project_name (str, default: Optimization)
max_trials (int, default: 10)
    Maximum number of different prompts to test.
reflection_minibatch_size (int, default: 3)
    Size of reflection minibatches.
candidate_selection_strategy (str, default: pareto)
    Strategy for candidate selection.
skip_perfect_score (bool, default: True)
    Skip candidates with perfect scores.
perfect_score (float, default: 1.0)
    Score considered perfect.
use_merge (bool, default: False)
    Enable merge operations.
max_merge_invocations (int, default: 5)
    Maximum number of merge invocations.
run_dir (str | None, default: None)
    Directory for run outputs.
track_best_outputs (bool, default: False)
    Track best outputs during optimization.
display_progress_bar (bool, default: False)
    Display a progress bar.
seed (int, default: 42)
    Random seed for reproducibility.
raise_on_exception (bool, default: True)
    Raise exceptions instead of continuing.
kwargs (Any)

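As a worked call that exercises several of these parameters (continuing from the Quickstart; the values shown are the documented defaults except n_samples, which is illustrative):

result = optimizer.optimize_prompt(
    prompt=prompt,
    dataset=dataset,
    metric=levenshtein_ratio,
    n_samples=50,
    max_trials=10,
    reflection_minibatch_size=3,
    candidate_selection_strategy="pareto",
    skip_perfect_score=True,
    raise_on_exception=True,
)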
Model Support

There are two models to consider when using the GepaOptimizer:

  • GepaOptimizer.model: The task model used to generate responses when candidate prompts are evaluated.
  • GepaOptimizer.reflection_model: The model GEPA uses to reflect on evaluation results and propose improved prompts.

The model parameter accepts any LiteLLM-supported model string (e.g., "gpt-4o", "azure/gpt-4", "anthropic/claude-3-opus", "gemini/gemini-1.5-pro"). You can also pass in extra model parameters using the model_parameters parameter:

optimizer = GepaOptimizer(
    model="anthropic/claude-3-opus-20240229",
    model_parameters={
        "temperature": 0.7,
        "max_tokens": 4096,
    },
)

Next Steps

  1. Explore specific Optimizers for algorithm details.
  2. Refer to the FAQ for common questions and troubleshooting.
  3. Refer to the API Reference for detailed configuration options.