GEPA Optimizer

Single-turn system prompt optimization with reflection

GepaOptimizer wraps the external GEPA package to optimize a single system prompt for single-turn tasks. It maps Opik datasets and metrics into GEPA’s expected format, runs GEPA’s optimization using a task model and a reflection model, and returns a standard OptimizationResult compatible with the Opik SDK.

GepaOptimizer is ideal when you have a single-turn task (one user input → one model response) and you want to optimize the system prompt using a reflection-driven search.

How it works

The GEPA optimizer combines two key approaches to optimize agents:

  1. Reflection: The optimizer uses the outcomes from evaluations to improve the prompts.
  2. Evolution: The optimizer uses an evolutionary algorithm to explore the space of prompts.

You can learn more about the algorithm in the GEPA paper, but in short, the optimizer will:

  1. Evaluate candidate prompts against your dataset and metric.
  2. Use a reflection model to analyze the evaluation results and propose improved prompts.
  3. Keep the most promising candidates on a Pareto front and repeat until the trial budget is exhausted.

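At a glance, the loop can be sketched as follows. This is illustrative pseudocode, not the gepa package's actual internals; evaluate, pick_from_pareto_front, and propose_mutation are hypothetical helpers:

# Illustrative pseudocode of GEPA's reflect-and-evolve loop.
# `evaluate`, `pick_from_pareto_front`, and `propose_mutation` are
# hypothetical helpers, not part of the gepa package.
def gepa_loop(seed_prompt, dataset, metric, reflection_model, max_trials):
    candidates = {seed_prompt: evaluate(seed_prompt, dataset, metric)}
    for _ in range(max_trials):
        # Pick a parent from the Pareto front of per-example scores.
        parent = pick_from_pareto_front(candidates)
        # Reflection: an LLM reads the evaluation results and rewrites the prompt.
        child = propose_mutation(parent, candidates[parent], reflection_model)
        # Evolution: score the child and add it to the candidate pool.
        candidates[child] = evaluate(child, dataset, metric)
    return max(candidates, key=candidates.get)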

Quickstart

1"""
2Optimize a simple system prompt on the tiny_test dataset.
3Requires: pip install gepa, and a valid OPENAI_API_KEY for LiteLLM-backed models.
4"""
5from typing import Any, Dict
6
7from opik.evaluation.metrics import LevenshteinRatio
8from opik.evaluation.metrics.score_result import ScoreResult
9
10from opik_optimizer import ChatPrompt, datasets
11from opik_optimizer.gepa_optimizer import GepaOptimizer
12
13def levenshtein_ratio(dataset_item: Dict[str, Any], llm_output: str) -> ScoreResult:
14 return LevenshteinRatio().score(reference=dataset_item["label"], output=llm_output)
15
16dataset = datasets.tiny_test()
17
18prompt = ChatPrompt(
19 system="You are a helpful assistant. Answer concisely with the exact answer.",
20 user="{text}",
21)
22
23optimizer = GepaOptimizer(
24 model="openai/gpt-4o-mini",
25 n_threads=6,
26 temperature=0.2,
27 max_tokens=200,
28)
29
30result = optimizer.optimize_prompt(
31 prompt=prompt,
32 dataset=dataset,
33 metric=levenshtein_ratio,
34 max_trials=12,
35 reflection_minibatch_size=2,
36 n_samples=5,
37)
38
39result.display()

Determinism and tool usage

  • GEPA’s seed is forwarded directly to the underlying gepa.optimize call, but any non-determinism in your prompt (tool calls, non-zero temperature, external APIs) will still introduce variance. To test seeding in isolation, disable tools or substitute cached responses; see the sketch after this list.
  • GEPA emits its own baseline evaluation inside the optimization loop. You’ll see one baseline score from Opik’s wrapper and another from GEPA before the first trial; this is expected and does not double-charge the metric budget.
  • Reflection only triggers after GEPA accepts at least reflection_minibatch_size unique prompts. If the minibatch is larger than the trial budget, the optimizer logs a warning and skips reflection.
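For example, to check GEPA's seeding in isolation, pin the seed and remove sampling randomness. This is a minimal sketch, reusing the quickstart's prompt, dataset, and metric, and assuming temperature is accepted directly as in the quickstart; provider-side nondeterminism can still make two runs diverge:

# A minimal determinism check -- a sketch, not a guarantee.
optimizer = GepaOptimizer(
    model="openai/gpt-4o-mini",
    seed=123,          # forwarded to the underlying gepa.optimize call
    temperature=0.0,   # greedy decoding removes sampling variance
)

result_a = optimizer.optimize_prompt(
    prompt=prompt, dataset=dataset, metric=levenshtein_ratio, max_trials=4
)
result_b = optimizer.optimize_prompt(
    prompt=prompt, dataset=dataset, metric=levenshtein_ratio, max_trials=4
)
# With seeding intact and no tools or external APIs in the prompt,
# the two runs should produce matching candidates.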

GEPA scores vs. Opik scores

  • The GEPA Score column reflects the aggregate score GEPA computes on its train/validation split when deciding which candidates stay on the Pareto front. It is useful for understanding how GEPA’s evolutionary search ranks prompts.
  • The Opik Score column is a fresh evaluation performed through Opik’s metric pipeline on the same dataset (respecting n_samples). This is the score you should use when comparing against your baseline or other optimizers.
  • Because the GEPA score is based on GEPA’s internal aggregation, it can diverge from the Opik score for the same prompt. This is expected—treat the GEPA score as a hint about why GEPA kept or discarded a candidate, and rely on the Opik score for final comparisons.

skip_perfect_score

  • When skip_perfect_score=True, GEPA immediately ignores any candidate whose GEPA score meets or exceeds the perfect_score threshold (default 1.0). This keeps the search moving toward imperfect prompts instead of spending budget refining already perfect ones.
  • Set skip_perfect_score=False if your metric tops out below 1.0, or if you still want to see how GEPA mutates a perfect-scoring prompt—for example, when you care about ties being broken by Opik’s rescoring step rather than GEPA’s aggregate.
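For example, if your metric realistically tops out around 0.9, either lower the threshold or disable the skip entirely (both parameters are documented below; the values here are illustrative):

# Option 1: lower the threshold to match the metric's ceiling.
result = optimizer.optimize_prompt(
    prompt=prompt,
    dataset=dataset,
    metric=levenshtein_ratio,
    skip_perfect_score=True,
    perfect_score=0.9,
)

# Option 2: never skip, and let Opik's rescoring break ties.
result = optimizer.optimize_prompt(
    prompt=prompt,
    dataset=dataset,
    metric=levenshtein_ratio,
    skip_perfect_score=False,
)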

Configuration Options

Optimizer parameters

The optimizer has the following parameters:

  • model (str, defaults to gpt-4o): LiteLLM model name for the optimization algorithm.
  • model_parameters (dict[str, typing.Any] | None): Optional dict of LiteLLM parameters for the optimizer’s internal LLM calls. Common params: temperature, max_tokens, max_completion_tokens, top_p.
  • n_threads (int, defaults to 6): Number of parallel threads for evaluation.
  • verbose (int, defaults to 1): Controls internal logging/progress bars (0 = off, 1 = on).
  • seed (int, defaults to 42): Random seed for reproducibility.

optimize_prompt parameters

The optimize_prompt method has the following parameters:

  • prompt (ChatPrompt): The prompt to optimize.
  • dataset (Dataset): Opik dataset to optimize on.
  • metric (Callable): Metric function used to evaluate outputs.
  • experiment_config (dict | None): Optional configuration for the experiment.
  • n_samples (int | None): Optional number of dataset items to evaluate on.
  • auto_continue (bool, defaults to False): Whether to auto-continue optimization.
  • agent_class (type[opik_optimizer.optimizable_agent.OptimizableAgent] | None): Optional agent class to use.
  • project_name (str, defaults to Optimization).
  • max_trials (int, defaults to 10): Maximum number of different prompts to test.
  • reflection_minibatch_size (int, defaults to 3): Size of reflection minibatches.
  • candidate_selection_strategy (str, defaults to pareto): Strategy for candidate selection; one of “pareto”, “current_best”, or “epsilon_greedy”.
  • skip_perfect_score (bool, defaults to True): Skip candidates with perfect scores.
  • perfect_score (float, defaults to 1.0): Score considered perfect.
  • use_merge (bool, defaults to False): Enable merge operations.
  • max_merge_invocations (int, defaults to 5): Maximum number of merge invocations.
  • run_dir (str | None, defaults to None): Directory for run outputs.
  • track_best_outputs (bool, defaults to False): Track best outputs during optimization.
  • display_progress_bar (bool, defaults to False): Display a progress bar.
  • seed (int, defaults to 42): Random seed for reproducibility.
  • raise_on_exception (bool, defaults to True): Raise exceptions instead of continuing.
  • kwargs (Any): Additional keyword arguments.
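Putting several of these together in one call (the parameter names match the list above; the values are illustrative):

# All parameters below are documented above; the values are illustrative.
result = optimizer.optimize_prompt(
    prompt=prompt,
    dataset=dataset,
    metric=levenshtein_ratio,
    n_samples=20,
    max_trials=25,
    reflection_minibatch_size=3,
    candidate_selection_strategy="current_best",
    use_merge=True,
    max_merge_invocations=3,
    track_best_outputs=True,
    display_progress_bar=True,
    seed=7,
)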

Model Support

GEPA coordinates two model contexts:

  • GepaOptimizer.model: LiteLLM model string the optimizer uses for internal reasoning (reflection, mutation prompts, etc.).
  • ChatPrompt.model: The model evaluated against your dataset—this should match what you run in production.

Set model to any LiteLLM-supported provider (e.g., "gpt-4o", "azure/gpt-4", "anthropic/claude-3-opus", "gemini/gemini-1.5-pro") and pass extra parameters via model_parameters when you need to tune temperature, max tokens, or other limits:

optimizer = GepaOptimizer(
    model="anthropic/claude-3-opus-20240229",
    model_parameters={
        "temperature": 0.7,
        "max_tokens": 4096,
    },
)

Reflection is handled internally; there is no separate reflection_model argument to set.
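To make the two model contexts explicit, set them independently. This is a sketch; it assumes ChatPrompt accepts a model argument, per the ChatPrompt.model description above:

# Optimizer model: used internally for reflection and mutation prompts.
optimizer = GepaOptimizer(model="openai/gpt-4o")

# Task model: answers the dataset items -- match your production model.
prompt = ChatPrompt(
    system="You are a helpful assistant.",
    user="{text}",
    model="openai/gpt-4o-mini",  # assumed kwarg, per ChatPrompt.model above
)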

Limitations & tips

  • Instruction-focused: The current wrapper optimizes the instruction/system portion of your prompt. If you rely heavily on few-shot exemplars, consider pairing GEPA with the Few-Shot Bayesian optimizer or an Evolutionary run.
  • Reflection can misfire: GEPA’s reflective mutations are only as good as the metric reasons you supply. If ScoreResult.reason is vague, the optimizer may reinforce bad behaviors. Invest in descriptive metrics before running GEPA at scale; see the sketch after this list.
  • Cost-aware: Although GEPA is more sample-efficient than some RL-based methods, reflection and Pareto scoring still consume multiple LLM calls per trial. Start with small max_trials and monitor API usage.
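For example, a metric that attaches a concrete reason gives the reflection step something to act on. A minimal sketch, assuming ScoreResult exposes name, value, and reason fields:

from typing import Any, Dict

from opik.evaluation.metrics import LevenshteinRatio
from opik.evaluation.metrics.score_result import ScoreResult

def descriptive_levenshtein(dataset_item: Dict[str, Any], llm_output: str) -> ScoreResult:
    base = LevenshteinRatio().score(reference=dataset_item["label"], output=llm_output)
    # Spell out *why* the score is what it is, so GEPA's reflection
    # step has concrete failure information to mutate against.
    reason = (
        f"Expected '{dataset_item['label']}' but got '{llm_output}' "
        f"(Levenshtein ratio {base.value:.2f})."
    )
    return ScoreResult(name="descriptive_levenshtein", value=base.value, reason=reason)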

Next Steps

  1. Explore specific Optimizers for algorithm details.
  2. Refer to the FAQ for common questions and troubleshooting.
  3. Refer to the API Reference for detailed configuration options.