GEPA Optimizer

Single-turn system prompt optimization with reflection

GepaOptimizer wraps the external GEPA package to optimize a single system prompt for single-turn tasks. It maps Opik datasets and metrics into GEPA’s expected format, runs GEPA’s optimization using a task model and a reflection model, and returns a standard OptimizationResult compatible with the Opik SDK.

GEPA integration is currently in Beta. APIs and defaults may evolve, and performance characteristics can change as we iterate. Please report issues or feedback on GitHub.

When to use: Choose GepaOptimizer when you have a single-turn task (one user input → one model response) and you want to optimize the system prompt using a reflection-driven search.

Key trade-offs:

  • GEPA (DefaultAdapter) focuses on a single system prompt; multi-turn or tool-heavy flows are not the target.
  • Requires both a task model and a reflection model (can be the same; often the reflection model is stronger).
  • GEPA runs additional reflection passes; expect extra LLM calls compared to simpler optimizers.

Requirements

  • Python: pip install opik-optimizer gepa
  • Models: Uses LiteLLM-style model names for both the task model and the reflection model (e.g., openai/gpt-4o-mini, openai/gpt-4o). Set the appropriate environment variables (e.g., OPENAI_API_KEY), as sketched below. See LiteLLM Support for Optimizers.
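
For example, a minimal setup sketch (assuming an OpenAI-backed model; substitute whichever key your provider requires):

import os

# LiteLLM picks up provider credentials from the environment.
os.environ["OPENAI_API_KEY"] = "sk-..."  # placeholder; or export it in your shell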

Quick Example (Tiny Test)

"""
Optimize a simple system prompt on the tiny_test dataset.
Requires: pip install gepa, and a valid OPENAI_API_KEY for LiteLLM-backed models.
"""
from typing import Any, Dict

from opik.evaluation.metrics import LevenshteinRatio
from opik.evaluation.metrics.score_result import ScoreResult

from opik_optimizer import ChatPrompt, datasets
from opik_optimizer.gepa_optimizer import GepaOptimizer

def levenshtein_ratio(dataset_item: Dict[str, Any], llm_output: str) -> ScoreResult:
    return LevenshteinRatio().score(reference=dataset_item["label"], output=llm_output)

dataset = datasets.tiny_test()

prompt = ChatPrompt(
    system="You are a helpful assistant. Answer concisely with the exact answer.",
    user="{text}",
)

optimizer = GepaOptimizer(
    model="openai/gpt-4o-mini",
    reflection_model="openai/gpt-4o",  # stronger reflector is often helpful
    project_name="GEPA_TinyTest",
    temperature=0.2,
    max_tokens=200,
)

result = optimizer.optimize_prompt(
    prompt=prompt,
    dataset=dataset,
    metric=levenshtein_ratio,
    max_metric_calls=12,
    reflection_minibatch_size=2,
    n_samples=5,
)

result.display()

Reference implementation: sdks/opik_optimizer/scripts/litellm_gepa_tiny_test_example.py

Example (Hotpot-style with LiteLLM)

The following mirrors our example that uses a Hotpot-style dataset and LiteLLM models.

from typing import Any, Dict

import opik
from opik_optimizer import ChatPrompt
from opik_optimizer.gepa_optimizer import GepaOptimizer
from opik_optimizer.datasets import hotpot_300
from opik_optimizer.utils import search_wikipedia

from opik.evaluation.metrics import LevenshteinRatio
from opik.evaluation.metrics.score_result import ScoreResult

dataset = hotpot_300()

def levenshtein_ratio(dataset_item: Dict[str, Any], llm_output: str) -> ScoreResult:
    metric = LevenshteinRatio()
    return metric.score(reference=dataset_item["answer"], output=llm_output)

prompt = ChatPrompt(
    system="Answer the question",
    user="{question}",
    tools=[
        {
            "type": "function",
            "function": {
                "name": "search_wikipedia",
                "description": "This function is used to search wikipedia abstracts.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "query": {
                            "type": "string",
                            "description": "The query parameter is the term or phrase to search for.",
                        },
                    },
                    "required": ["query"],
                },
            },
        },
    ],
    function_map={"search_wikipedia": opik.track(type="tool")(search_wikipedia)},
)

optimizer = GepaOptimizer(
    model="openai/gpt-4o-mini",        # task model
    reflection_model="openai/gpt-4o",  # reflection model
    project_name="GEPA-Hotpot",
    temperature=0.7,
    max_tokens=400,
)

result = optimizer.optimize_prompt(
    prompt=prompt,
    dataset=dataset,
    metric=levenshtein_ratio,
    max_metric_calls=60,
    reflection_minibatch_size=5,
    candidate_selection_strategy="best",
    n_samples=12,
)

result.display()

Reference implementation: sdks/opik_optimizer/scripts/litellm_gepa_hotpot_example.py

Usage Notes

  • Single-turn focus: GEPA’s default adapter targets optimizing a single system message for single-turn tasks.
  • Dataset mapping: The optimizer heuristically infers input/output keys when they are not provided. Ensure each dataset item contains a clear input field (e.g., text, question) and a reference label/answer field; see the sketch after this list.
  • Metrics: Any metric that returns an Opik ScoreResult can be used (e.g., LevenshteinRatio).
  • Budget: max_metric_calls sets the optimization budget; a larger budget can find better prompts at the cost of more LLM calls.
  • Reflection batching: reflection_minibatch_size controls how many items are considered per reflection step.
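
To make the dataset mapping concrete, here is a minimal sketch (the item is hypothetical; its text/label fields mirror the tiny_test example above, and your own dataset's keys may differ):

from typing import Any, Dict

from opik.evaluation.metrics import LevenshteinRatio
from opik.evaluation.metrics.score_result import ScoreResult

# Hypothetical dataset item shaped like the tiny_test example above.
dataset_item: Dict[str, Any] = {
    "text": "What is the capital of France?",  # input: fills the "{text}" placeholder
    "label": "Paris",                          # reference: read by the metric
}

def levenshtein_ratio(dataset_item: Dict[str, Any], llm_output: str) -> ScoreResult:
    # Pair the reference field with the model's raw output.
    return LevenshteinRatio().score(reference=dataset_item["label"], output=llm_output)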

API Summary

Constructor:

GepaOptimizer(
    model: str,
    project_name: str | None = None,
    reflection_model: str | None = None,  # defaults to model if None
    verbose: int = 1,
    **model_kwargs,
)

Common model_kwargs include LiteLLM parameters such as temperature, max_tokens, and others.
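
For instance, a sketch of a constructor call that forwards LiteLLM parameters (the specific values and project name here are illustrative):

optimizer = GepaOptimizer(
    model="openai/gpt-4o-mini",        # task model
    reflection_model="openai/gpt-4o",  # defaults to model if omitted
    project_name="my-project",         # illustrative name
    temperature=0.3,                   # forwarded to LiteLLM
    max_tokens=300,                    # forwarded to LiteLLM
)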

Optimization:

opt_result = optimizer.optimize_prompt(
    prompt: ChatPrompt,
    dataset: opik.Dataset,
    metric: Callable[[dict, str], ScoreResult],
    max_metric_calls: int,
    reflection_minibatch_size: int = 3,
    candidate_selection_strategy: str = "pareto",  # e.g., "pareto" or "best"
    n_samples: int | None = None,
)

The returned OptimizationResult includes the best prompt, score, and useful details such as GEPA’s best candidate, validation scores, and the evolution history.
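
Beyond result.display(), you can inspect the result programmatically. A sketch (the attribute names below are assumptions for illustration; verify them against the OptimizationResult in your installed opik_optimizer version):

result.display()       # rich summary in the terminal

print(result.score)    # best metric score found
print(result.prompt)   # the optimized prompt
print(result.details)  # GEPA extras: best candidate, validation scores, history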

Comparisons & Benchmarks

  • Comparisons: GEPA targets optimizing a single system prompt for single-turn tasks, similar in scope to MetaPrompt on simple flows. Unlike Evolutionary or MIPRO, it does not aim to explore widely different prompt structures or multi-step agent graphs. For chat prompts heavily reliant on few-shot examples, the Few-shot Bayesian optimizer may be more appropriate.
  • Benchmarks: TBD. GEPA is newly integrated; we will add results as we complete our evaluation runs.

Troubleshooting

  • Install error: Ensure pip install gepa completes successfully in the same environment as opik-optimizer.
  • API keys: Set provider keys (e.g., OPENAI_API_KEY) so LiteLLM can call your chosen models.
  • Low improvements: Increase max_metric_calls, try a stronger reflection_model, or adjust the starting system prompt.
  • Mismatched fields: Confirm your dataset's input/output field names align with your prompt's placeholders and your metric's reference field.