GEPA | Opik Documentation

GepaOptimizer wraps the external GEPA package to optimize a single system prompt for single-turn tasks. It maps Opik datasets and metrics into GEPA’s expected format, runs GEPA’s optimization using a task model and a reflection model, and returns a standard OptimizationResult compatible with the Opik SDK.

GepaOptimizer is ideal when you have a single-turn task (one user input → one model response) and you want to optimize the system prompt using a reflection-driven search.

How it works

The GEPA optimizer combines two key approaches to optimize agents:

Reflection: The optimizer uses the outcomes from evaluations to improve the prompts.
Evolution: The optimizer uses an evolutionary algorithm to explore the space of prompts.

You can learn more about the algorithm in the GEPA paper.
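As a rough illustration of how reflection and evolutionary search can fit together, here is a toy loop. It is not GEPA's implementation: the `evaluate`, `reflect`, and `optimize` functions are hypothetical, the "reflection" step is a string rule rather than an LLM call, and parent selection is a simple greedy pick of the current best candidate rather than GEPA's default Pareto-based selection.

```python
from typing import List, Tuple

Candidate = Tuple[str, List[float]]  # (prompt text, per-example scores)


def evaluate(prompt: str, required_terms: List[str]) -> List[float]:
    """Toy metric: one score per example, 1.0 if the prompt covers the term."""
    return [1.0 if term in prompt else 0.0 for term in required_terms]


def reflect(prompt: str, required_terms: List[str], scores: List[float]) -> str:
    """Stand-in for the reflection model: patch the first failing example."""
    failures = [term for term, score in zip(required_terms, scores) if score == 0.0]
    if not failures:
        return prompt  # nothing to fix
    return f"{prompt} Mention {failures[0]}."


def optimize(seed_prompt: str, required_terms: List[str], max_trials: int) -> Candidate:
    """Keep a pool of candidates; mutate the best one and keep improvements."""
    pool: List[Candidate] = [(seed_prompt, evaluate(seed_prompt, required_terms))]
    for _ in range(max_trials):
        # Evolution: choose a parent from the pool (greedy here, by total score).
        parent, parent_scores = max(pool, key=lambda c: sum(c[1]))
        # Reflection: use evaluation outcomes to propose a new prompt.
        child = reflect(parent, required_terms, parent_scores)
        child_scores = evaluate(child, required_terms)
        if sum(child_scores) > sum(parent_scores):
            pool.append((child, child_scores))
    return max(pool, key=lambda c: sum(c[1]))


best_prompt, best_scores = optimize(
    "Answer concisely.", ["units", "sources", "dates"], max_trials=5
)
print(best_prompt)
print(best_scores)
```

In the real optimizer the mutation comes from an LLM reading evaluation feedback, so each trial costs model calls; the toy only shows the shape of the loop.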

Quickstart

"""
Optimize a simple system prompt on the tiny_test dataset.
Requires: pip install gepa, and a valid OPENAI_API_KEY for LiteLLM-backed models.
"""
from typing import Any, Dict

from opik.evaluation.metrics import LevenshteinRatio
from opik.evaluation.metrics.score_result import ScoreResult

from opik_optimizer import ChatPrompt, datasets
from opik_optimizer.gepa_optimizer import GepaOptimizer


def levenshtein_ratio(dataset_item: Dict[str, Any], llm_output: str) -> ScoreResult:
    # Compare the model output with the dataset item's reference label.
    return LevenshteinRatio().score(reference=dataset_item["label"], output=llm_output)


dataset = datasets.tiny_test()

prompt = ChatPrompt(
    system="You are a helpful assistant. Answer concisely with the exact answer.",
    user="{text}",
)

optimizer = GepaOptimizer(
    model="openai/gpt-4o-mini",
    n_threads=6,
    model_parameters={"temperature": 0.2, "max_tokens": 200},
)

result = optimizer.optimize_prompt(
    prompt=prompt,
    dataset=dataset,
    metric=levenshtein_ratio,
    max_trials=12,
    reflection_minibatch_size=2,
    n_samples=5,
)

result.display()

Determinism and tool usage

GEPA’s seed is forwarded directly to the underlying gepa.optimize call, but any non-determinism in your prompt (tool calls, non-zero temperature, external APIs) will still introduce variance. To test seeding in isolation, disable tools or substitute cached responses.
GEPA emits its own baseline evaluation inside the optimization loop. You’ll see one baseline score from Opik’s wrapper and another from GEPA before the first trial; this is expected and does not double-charge the metric budget.
Reflection only triggers after GEPA accepts at least reflection_minibatch_size unique prompts. If the minibatch is larger than the trial budget, the optimizer logs a warning and skips reflection.
GEPA supports tool use during evaluation (allow_tool_use=True) but does not support optimize_tools=True yet. Requests to optimize tool descriptions are currently blocked or degraded until the adapter supports them.
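One way to substitute cached responses when testing seeding in isolation is to memoize the call that produces model or tool output. The sketch below is generic Python and not part of the Opik or GEPA APIs; `make_cached` and `flaky_model` are hypothetical names used only for illustration.

```python
import itertools
from typing import Callable, Dict


def make_cached(call_model: Callable[[str], str]) -> Callable[[str], str]:
    """Wrap a model/tool call so repeated inputs return the first response."""
    cache: Dict[str, str] = {}

    def cached(prompt: str) -> str:
        if prompt not in cache:
            cache[prompt] = call_model(prompt)  # only the first call reaches the model
        return cache[prompt]

    return cached


_call_counter = itertools.count()


def flaky_model(prompt: str) -> str:
    """Stand-in for a non-deterministic model or external API."""
    return f"{prompt} -> response #{next(_call_counter)}"


cached_model = make_cached(flaky_model)
first = cached_model("What is 2 + 2?")
second = cached_model("What is 2 + 2?")
print(first == second)
```

With the source of variance pinned down this way, any remaining run-to-run difference is more likely to come from the optimizer's own seeding.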

GEPA scores vs. Opik scores

The GEPA Score column reflects the aggregate score GEPA computes on its train/validation split when deciding which candidates stay on the Pareto front. It is useful for understanding how GEPA’s evolutionary search ranks prompts.
The Opik Score column is a fresh evaluation performed through Opik’s metric pipeline on the same dataset (respecting n_samples). This is the score you should use when comparing against your baseline or other optimizers.
Because the GEPA score is based on GEPA’s internal aggregation, it can diverge from the Opik score for the same prompt. This is expected—treat the GEPA score as a hint about why GEPA kept or discarded a candidate, and rely on the Opik score for final comparisons.
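A small numeric sketch of why the two columns can disagree: the same per-item scores aggregated over different subsets give different numbers. The split and sample below are invented for illustration and do not reflect how GEPA or Opik actually partition a dataset.

```python
from statistics import mean

# Hypothetical per-item scores for one candidate prompt on a 10-item dataset.
per_item_scores = [1.0, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0]

# Assumption: the internal aggregate is computed on a validation split.
validation_split = per_item_scores[5:]
internal_score = mean(validation_split)

# Assumption: the rescoring pass evaluates a different 5-item sample.
rescoring_sample = per_item_scores[:5]
rescored = mean(rescoring_sample)

print(f"internal aggregate: {internal_score:.2f}")
print(f"rescored value:     {rescored:.2f}")
```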

`skip_perfect_score`

When skip_perfect_score=True, GEPA immediately ignores any candidate whose GEPA score meets or exceeds the perfect_score threshold (default 1.0). This keeps the search moving toward imperfect prompts instead of spending budget refining already perfect ones.
Set skip_perfect_score=False if your metric tops out below 1.0, or if you still want to see how GEPA mutates a perfect-scoring prompt—for example, when you care about ties being broken by Opik’s rescoring step rather than GEPA’s aggregate.
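The threshold behaviour described above can be sketched as a simple predicate. `should_skip` is a hypothetical helper, not a function exposed by GEPA or Opik; it only mirrors the "meets or exceeds perfect_score" rule stated in this section.

```python
def should_skip(
    gepa_score: float,
    skip_perfect_score: bool = True,
    perfect_score: float = 1.0,
) -> bool:
    """Return True when a candidate would be ignored as already perfect."""
    return skip_perfect_score and gepa_score >= perfect_score


# Hypothetical candidate scores.
candidate_scores = {"prompt_a": 0.72, "prompt_b": 1.0, "prompt_c": 0.95}

kept_by_default = [
    name for name, score in candidate_scores.items() if not should_skip(score)
]
kept_when_disabled = [
    name
    for name, score in candidate_scores.items()
    if not should_skip(score, skip_perfect_score=False)
]
kept_with_lower_threshold = [
    name
    for name, score in candidate_scores.items()
    if not should_skip(score, perfect_score=0.95)
]

print(kept_by_default)
print(kept_when_disabled)
print(kept_with_lower_threshold)
```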

Configuration Options

Optimizer parameters

The optimizer has the following parameters:

model

str, defaults to openai/gpt-5-nano

LiteLLM model name for the optimization algorithm

model_parameters

dict[str, typing.Any] | None

Optional dict of LiteLLM parameters for optimizer’s internal LLM calls. Common params: temperature, max_tokens, max_completion_tokens, top_p.

n_threads

int, defaults to 6

Number of parallel threads for evaluation

verbose

int, defaults to 1

Controls internal logging/progress bars (0=off, 1=on)

seed

int, defaults to 42

Random seed for reproducibility

`optimize_prompt` parameters

The optimize_prompt method has the following parameters:

prompt

ChatPrompt

The prompt to optimize

dataset

Dataset

Opik Dataset to optimize on

metric

Callable

Metric function to evaluate on

experiment_config

dict | None

Optional configuration for the experiment

n_samples

int | float | str | None

Number of dataset items to use per evaluation. Use counts (e.g., 50), fractions (e.g., 0.1), percentages (e.g., “10%”), or “all”/“full”/None for the full dataset.

n_samples_minibatch

int | None

Optional number of samples for inner-loop minibatches (defaults to n_samples).

n_samples_strategy

str | None

Sampling strategy for subsampling (default: “random_sorted”).

auto_continue

bool, defaults to False

Whether to auto-continue optimization

agent_class

type[opik_optimizer.optimizable_agent.OptimizableAgent] | None

Optional agent class to use

project_name

str, defaults to Optimization

max_trials

int, defaults to 10

Maximum number of different prompts to test

reflection_minibatch_size

int, defaults to 3

Size of reflection minibatches

candidate_selection_strategy

str, defaults to pareto

Strategy for candidate selection (choose from “pareto”, “current_best”, or “epsilon_greedy”)

skip_perfect_score

bool, defaults to True

Skip candidates with perfect scores

perfect_score

float, defaults to 1.0

Score considered perfect

use_merge

bool, defaults to False

Enable merge operations

max_merge_invocations

int, defaults to 5

Maximum merge invocations

run_dir

str | None

Directory for run outputs (default: None)

track_best_outputs

bool, defaults to False

Track best outputs during optimization

display_progress_bar

bool, defaults to False

Display progress bar

seed

int, defaults to 42

Random seed for reproducibility

raise_on_exception

bool, defaults to True

Raise exceptions instead of continuing

kwargs

Any
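The accepted n_samples forms listed above (counts, fractions, percentages, and “all”/“full”/None) could be resolved to an item count roughly as follows. This is a sketch under assumptions, not the SDK's code: `resolve_n_samples` is a hypothetical helper, and the rounding and clamping rules are guesses.

```python
from typing import Optional, Union


def resolve_n_samples(
    n_samples: Optional[Union[int, float, str]], dataset_size: int
) -> int:
    """Turn an n_samples spec into a number of dataset items."""
    if n_samples is None:
        return dataset_size
    if isinstance(n_samples, str):
        text = n_samples.strip().lower()
        if text in ("all", "full"):
            return dataset_size
        if text.endswith("%"):
            n_samples = float(text[:-1]) / 100  # "10%" -> 0.1
        else:
            raise ValueError(f"Unrecognized n_samples value: {n_samples!r}")
    if isinstance(n_samples, float):
        count = round(dataset_size * n_samples)  # assumption: round to nearest
    else:
        count = n_samples
    # Assumption: clamp to at least one item and at most the whole dataset.
    return max(1, min(count, dataset_size))


for spec in (50, 0.1, "10%", "all", None, 500):
    print(repr(spec), "->", resolve_n_samples(spec, dataset_size=200))
```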

Model Support

GEPA coordinates two model contexts:

GepaOptimizer.model: LiteLLM model string the optimizer uses for internal reasoning (reflection, mutation prompts, etc.).
ChatPrompt.model: The model evaluated against your dataset—this should match what you run in production.

Set model to any LiteLLM-supported provider (e.g., "gpt-4o", "azure/gpt-4", "anthropic/claude-3-opus", "gemini/gemini-1.5-pro") and pass extra parameters via model_parameters when you need to tune temperature, max tokens, or other limits:

from opik_optimizer.gepa_optimizer import GepaOptimizer

optimizer = GepaOptimizer(
    model="anthropic/claude-3-opus-20240229",
    model_parameters={
        "temperature": 0.7,
        "max_tokens": 4096,
    },
)

Reflection is handled internally; there is no separate reflection_model argument to set.

Limitations & tips

Instruction-focused: The current wrapper optimizes the instruction/system portion of your prompt. If you rely heavily on few-shot exemplars, consider pairing GEPA with the Few-Shot Bayesian optimizer or an Evolutionary run.
Reflection can misfire: GEPA’s reflective mutations are only as good as the metric reasons you supply. If ScoreResult.reason is vague, the optimizer may reinforce bad behaviors. Invest in descriptive metrics before running GEPA at scale.
Cost-aware: Although GEPA is more sample-efficient than some RL-based methods, reflection and Pareto scoring still consume multiple LLM calls per trial. Start with small max_trials and monitor API usage.
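To make the point about descriptive metric reasons concrete, here is a sketch of a metric that explains each score. So that it runs without Opik installed, it defines a local stand-in `ScoreResult` with `name`, `value`, and `reason` fields; in a real project you would import `ScoreResult` from `opik.evaluation.metrics.score_result` as in the Quickstart. The metric name and the partial-credit rule are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class ScoreResult:
    """Local stand-in for Opik's ScoreResult (assumption, for a runnable sketch)."""
    name: str
    value: float
    reason: Optional[str] = None


def exact_match_with_reason(dataset_item: Dict[str, Any], llm_output: str) -> ScoreResult:
    """Score an answer and say specifically what went right or wrong."""
    expected = str(dataset_item["label"]).strip()
    actual = llm_output.strip()
    if actual == expected:
        return ScoreResult("exact_match", 1.0, "Output matches the label exactly.")
    if expected.lower() in actual.lower():
        return ScoreResult(
            "exact_match",
            0.5,
            f"Label {expected!r} appears in the output but with extra text; "
            "answer with the label only.",
        )
    return ScoreResult("exact_match", 0.0, f"Expected {expected!r} but got {actual!r}.")


item = {"text": "Capital of France?", "label": "Paris"}
for output in ("Paris", "The capital is Paris.", "Lyon"):
    result = exact_match_with_reason(item, output)
    print(result.value, "-", result.reason)
```

A reason such as "answer with the label only" gives a reflection step something actionable, whereas a bare score does not.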

Next Steps

Explore specific Optimizers for algorithm details.
Refer to the FAQ for common questions and troubleshooting.
Refer to the API Reference for detailed configuration options.
