GEPA Optimizer
Single-turn system prompt optimization with reflection
`GepaOptimizer` wraps the external GEPA package to optimize a single system prompt for single-turn tasks. It maps Opik datasets and metrics into GEPA's expected format, runs GEPA's optimization using a task model and a reflection model, and returns a standard `OptimizationResult` compatible with the Opik SDK.
GEPA integration is currently in Beta. APIs and defaults may evolve, and performance characteristics can change as we iterate. Please report issues or feedback on GitHub.
When to use: Choose `GepaOptimizer` when you have a single-turn task (one user input → one model response) and you want to optimize the system prompt using a reflection-driven search.
Key trade-offs:
- GEPA (DefaultAdapter) focuses on a single system prompt; multi-turn or tool-heavy flows are not the target.
- Requires both a task model and a reflection model (can be the same; often the reflection model is stronger).
- GEPA runs additional reflection passes; expect extra LLM calls compared to simple optimizers.
Requirements
- Python: `pip install opik-optimizer gepa`
- Models: Uses LiteLLM-style model names for both the task model and the reflection model (e.g., `openai/gpt-4o-mini`, `openai/gpt-4o`). Set the appropriate environment variables (e.g., `OPENAI_API_KEY`). See LiteLLM Support for Optimizers.
Quick Example (Tiny Test)
Reference implementation: `sdks/opik_optimizer/scripts/litellm_gepa_tiny_test_example.py`
Example (Hotpot-style with LiteLLM)
The following mirrors our example that uses a Hotpot-style dataset and LiteLLM models.
Reference implementation: `sdks/opik_optimizer/scripts/litellm_gepa_hotpot_example.py`
Usage Notes
- Single-turn focus: GEPA's default adapter targets optimizing a single system message for single-turn tasks.
- Dataset mapping: The optimizer heuristically infers input/output keys when they are not provided. Ensure your dataset items contain a clear input field (e.g., `text`, `question`) and a reference label/answer field.
- Metrics: Any metric that returns an Opik `ScoreResult` can be used (e.g., `LevenshteinRatio`).
- Budget: `max_metric_calls` determines the optimization budget; higher budgets can find better prompts at greater cost.
- Reflection batching: `reflection_minibatch_size` controls how many items are considered per reflection step.
API Summary
Constructor:
Common `model_kwargs` include LiteLLM parameters such as `temperature`, `max_tokens`, and others.
Optimization:
The returned `OptimizationResult` includes the best prompt, score, and useful details such as GEPA's best candidate, validation scores, and the evolution history.
Comparisons & Benchmarks
- Comparisons: GEPA targets optimizing a single system prompt for single-turn tasks, similar in scope to MetaPrompt on simple flows. Unlike Evolutionary or MIPRO, it does not aim to explore widely different prompt structures or multi-step agent graphs. For chat prompts heavily reliant on few-shot examples, the Few-shot Bayesian optimizer may be more appropriate.
- Benchmarks: TBD. GEPA is newly integrated; we will add results as we complete our evaluation runs.
Troubleshooting
- Install error: Ensure `pip install gepa` completes successfully in the same environment as `opik-optimizer`.
- API keys: Set provider keys (e.g., `OPENAI_API_KEY`) so LiteLLM can load your chosen models.
- Low improvements: Increase `max_metric_calls`, try a stronger `reflection_model`, or adjust the starting system prompt.
- Mismatched fields: Confirm your dataset's input/output field names align with your prompt's placeholders and your metric's reference.