Your multi-hop reasoning agent succeeds only 55 percent of the time. You spend three days tweaking prompts: adjusting phrasing, adding more examples, restructuring instructions. Performance inches up to 58 percent. You’re not sure what helped. You wonder if you’ve hit a ceiling.

This is the reality for many teams building production agents. Manual prompt engineering doesn’t scale. Alternatives such as reinforcement learning with Group Relative Policy Optimization (GRPO) require significantly more rollouts to converge. For agents calling expensive APIs, or for teams with limited evaluation budgets, neither option works.
Research from UC Berkeley and Stanford demonstrates that GEPA (Genetic-Pareto) optimization is a fundamentally different approach that achieves up to 20 percent performance gains while using up to 35 times fewer rollouts than GRPO. On the HotpotQA benchmark, GEPA took a 42 percent baseline to 62 percent accuracy with just 6,438 rollouts. GRPO needed 24,000 rollouts to reach 43 percent.
The key insight? Instead of compressing agent execution traces into sparse scalar rewards, GEPA treats natural language as a rich learning signal. With GEPA, an LLM analyzes failures, identifies patterns and proposes targeted improvements systematically, which is why reflection-based optimization looks like the future of agent development.
Manual Optimization and Reinforcement Learning Miss the Mark
Build an agent prototype. Watch it fail on edge cases. Spend hours hypothesizing what’s wrong. Tweak prompts. Re-evaluate. Repeat.
When your customer service agent gives a wrong answer, what failed? The intent classification prompt? The retrieval pulling irrelevant documents? Unclear tool descriptions? The response generation lacking necessary constraints? You’re stuck debugging a system with dozens of potential failure points and little insight into which changes will help.
The problem isn’t effort. It’s signal.
Reinforcement learning addresses this challenge through brute force by sampling thousands of trajectories, computing scalar rewards, estimating policy gradients, and updating weights. It takes approximately 24,000 rollouts for GRPO to reach convergence. For an agent calling expensive APIs multiple times per interaction, that’s prohibitively expensive. For teams with only hundreds of labeled examples, it’s impossible.
The fundamental issue? Both approaches discard diagnostic information. Manual debugging relies on human intuition scanning traces. Reinforcement learning collapses execution paths into single reward scores. Neither systematically extracts the insights that are already available in natural language traces.
GEPA Treats Language as Signal, Not Noise
Agent execution traces are inherently interpretable. Your agent produces prompts, reasoning chains, tool calls and natural language outputs. GEPA’s core idea is that modern LLMs can already understand and reason about these traces, so you can use an LLM to analyze failures and generate targeted improvements directly from them.
Traditional reinforcement learning asks, “Which direction in parameter space increases reward?”
GEPA asks, “What specifically went wrong in this execution trace, and how should we fix the prompt to address it?”
That question shifts the optimization paradigm entirely.
Consider a multi-hop question-answering system. In the GEPA paper, the researchers use a seed prompt for the second retrieval hop:
Given the fields question, summary_1, produce the fields query.
After GEPA optimization, the evolved prompt looked more like this:
You will be given two input fields: question and summary_1.
Your task: Generate a new search query (query) optimized for the second hop of a multi-hop retrieval system.
Purpose and Context:
- Your generated query aims to find the missing pieces of information needed to fully answer the question
- The query must retrieve relevant documents NOT found in first hop
Key Observations and Lessons:
- First-hop documents often cover one entity or aspect
- Remaining relevant documents often involve connected or higher-level concepts mentioned in summary_1
- Avoid merely paraphrasing the original question
- Infer what broader entities/concepts might provide the crucial missing information
…
These changes aren’t generic prompt engineering tips. They are targeted improvements derived from analyzing actual failure patterns: agents duplicating first-hop queries, failing to broaden the search scope and ignoring higher-level concepts in retrieved summaries.
This is how GEPA achieves large gains with far fewer rollouts: reflective improvement generates targeted changes, while evolutionary search prevents premature convergence.
Reflection Drives Fixes From Failures
When GEPA evaluates your agent, it captures complete execution traces, including inputs, intermediate steps, tool calls with responses, reasoning chains, and final outputs.
For failures, a reflective prompt optimizer analyzes these traces diagnostically:
- What specific prompt aspect led to this failure?
- What information or instruction is missing?
- What patterns appear across multiple failures?
- How should the prompt be modified?
The optimizer generates natural language feedback that’s diagnostic and prescriptive. Then each new candidate inherits lessons from its ancestor prompts. Iterations build on accumulated insights rather than starting from scratch with each modified prompt.
This is fundamentally different from the reinforcement learning approach. Instead of nudging model weights toward higher reward, the LLM identifies the failure and proposes a specific fix to the prompt.
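To make the reflection step concrete, here is a minimal sketch of what a reflective mutation could look like in Python. It is not the paper’s implementation: the FailedTrace structure and the reflection template are assumptions made for illustration, and the OpenAI chat client stands in for whatever reflection model you actually use.

```python
# Minimal sketch of a GEPA-style reflection step (illustrative, not the paper's code).
# FailedTrace and the template are assumptions; the OpenAI client is just a stand-in.
from dataclasses import dataclass
from openai import OpenAI


@dataclass
class FailedTrace:
    inputs: str
    intermediate_steps: str  # tool calls, retrieved documents, reasoning chain
    output: str
    feedback: str            # why the metric judged this execution a failure


REFLECTION_TEMPLATE = """You are improving the instruction below.

Current instruction:
{prompt}

Failed executions (inputs, intermediate steps, outputs, feedback):
{traces}

Diagnose which aspect of the instruction caused these failures,
then write an improved instruction that addresses them.
Return only the new instruction."""


def reflect_and_mutate(prompt: str, failures: list[FailedTrace], model: str = "gpt-4o") -> str:
    """Ask a reflection LLM to diagnose failures and propose a revised prompt."""
    client = OpenAI()
    traces = "\n---\n".join(
        f"INPUT: {t.inputs}\nSTEPS: {t.intermediate_steps}\nOUTPUT: {t.output}\nFEEDBACK: {t.feedback}"
        for t in failures
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": REFLECTION_TEMPLATE.format(prompt=prompt, traces=traces)}],
    )
    return response.choices[0].message.content.strip()
```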
Pareto-Based Selection Maintains Diversity
If your only goal is “fastest top speed,” you’d just pick the fastest car. You might end up with a car that’s incredibly fast but gets terrible mileage, is unsafe and costs a fortune. You’ve optimized for one thing and ignored the others. Now imagine you want a car that’s fast and fuel-efficient and affordable and safe. It’s unlikely one car will be the absolute best at all four things simultaneously.
A Pareto front is like a group of cars that are all “best compromises.” For any car on the Pareto front, you cannot make it better in one aspect (e.g., faster) without making it worse in at least one other aspect (e.g., less fuel-efficient, more expensive).
After reflection generates improvements, GEPA maintains a Pareto frontier of candidate prompts. A prompt earns frontier membership only if it achieves the best score on at least one training task, even if other prompts perform better on average.
This approach creates strategic diversity. Some candidates excel at straightforward questions but struggle with complex multi-hop reasoning. Others handle ambiguous queries well but fail on factual lookups. By maintaining multiple “winning strategies,” GEPA continues exploring throughout optimization.
When selecting candidates to improve, GEPA samples from the frontier proportionally to how many tasks each dominates. Frequent winners get selected more often, but edge-case specialists remain in play.
Many optimization algorithms exploit too aggressively: they find something that works, stick with it and get stuck. GEPA balances exploitation with exploration, preventing premature abandonment of promising alternatives.
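Here is a toy sketch of that selection rule, assuming per-task scores have already been computed. The dictionary layout and function names are illustrative, not GEPA’s actual code, but they capture the two ideas: frontier membership requires winning at least one task, and sampling is weighted by the number of tasks won.

```python
# Illustrative Pareto-based candidate selection (simplified).
# scores[candidate][task] holds the candidate prompt's score on that training task.
import random


def pareto_frontier(scores: dict[str, dict[str, float]]) -> dict[str, int]:
    """Return frontier candidates mapped to how many tasks each wins (ties count for all winners)."""
    tasks = next(iter(scores.values())).keys()
    wins: dict[str, int] = {}
    for task in tasks:
        best = max(s[task] for s in scores.values())
        for cand, cand_scores in scores.items():
            if cand_scores[task] == best:
                wins[cand] = wins.get(cand, 0) + 1
    return wins  # only candidates that are best on at least one task appear


def select_candidate(scores: dict[str, dict[str, float]]) -> str:
    """Sample a frontier candidate proportionally to the number of tasks it dominates."""
    wins = pareto_frontier(scores)
    candidates, weights = zip(*wins.items())
    return random.choices(candidates, weights=weights, k=1)[0]


# Example: prompt_c is never best on any task, so it never gets selected.
scores = {
    "prompt_a": {"t1": 0.9, "t2": 0.4, "t3": 0.7},
    "prompt_b": {"t1": 0.6, "t2": 0.8, "t3": 0.7},
    "prompt_c": {"t1": 0.5, "t2": 0.5, "t3": 0.6},
}
print(select_candidate(scores))
```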
How the GEPA Loop Works
The GEPA workflow includes several steps to improve results quickly:
- Sample: run the current candidate prompt on a minibatch of the dataset and capture full traces
- Reflect: the optimizer analyzes failures and proposes improvements
- Mutate: generate new candidate prompts incorporating the proposed changes
- Validate: if minibatch performance improves, add the candidate to the pool of winning prompts
- Update frontier: re-evaluate which candidate prompts dominate each task
- Select: choose the next candidate prompt via Pareto-weighted sampling
- Repeat: continue until the rollout budget is exhausted
Each iteration makes targeted progress informed by actual failure modes. The selection strategy ensures systematic prompt space exploration.
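Putting the steps together, the loop can be sketched as a runnable skeleton under heavy simplification. Here, evaluate() and mutate() are toy stand-ins for running your agent and for LLM reflection (see reflect_and_mutate above), select_candidate is the Pareto sampler sketched earlier, and the rollout accounting is approximate.

```python
# Toy skeleton of the GEPA loop described above; not the reference implementation.
import random


def evaluate(prompt: str, tasks: list[str]) -> dict[str, float]:
    """Toy per-task scorer; in practice this runs the agent on each task and applies your metric."""
    return {t: random.random() for t in tasks}


def mutate(prompt: str, failed_tasks: list[str]) -> str:
    """Toy stand-in for LLM reflection (see reflect_and_mutate above)."""
    return prompt + " [revised]"


def gepa_loop(seed_prompt: str, tasks: list[str], budget: int, minibatch: int = 4) -> str:
    scores = {seed_prompt: evaluate(seed_prompt, tasks)}          # per-task scores per candidate
    rollouts = len(tasks)
    while rollouts < budget:
        parent = select_candidate(scores)                         # Pareto-weighted sampling (sketched earlier)
        batch = random.sample(tasks, min(minibatch, len(tasks)))
        parent_scores = evaluate(parent, batch)                   # run the agent, capture traces and scores
        rollouts += len(batch)
        failed = [t for t, s in parent_scores.items() if s < 0.5]
        child = mutate(parent, failed)                            # reflect on failures, propose a new prompt
        child_scores = evaluate(child, batch)
        rollouts += len(batch)
        if sum(child_scores.values()) > sum(parent_scores.values()):
            scores[child] = evaluate(child, tasks)                # full evaluation updates the Pareto frontier
            rollouts += len(tasks)
    return max(scores, key=lambda p: sum(scores[p].values()))     # best candidate on average


print(gepa_loop("Given the fields question, summary_1, produce the fields query.",
                tasks=[f"task_{i}" for i in range(8)], budget=200))
```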
The Benchmarks Show Substantial Gains and Practical Budgets
As the researchers show in “GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning,” the performance gains are consistent and substantial across diverse task types. The researchers used four benchmarks to compare GEPA against MIPROv2 (an automated prompt optimizer) and GRPO (reinforcement learning).
On most tasks, GEPA matched GRPO’s best validation performance after just 300 to 400 rollouts, which amounts to up to 78 times greater sample efficiency at that milestone.
On IFBench, which tests instruction following on completely novel constraint types, GEPA was the only method that improved over baseline. This suggests the reflection-based approach captures genuinely transferable prompt engineering principles rather than overfitting to training examples.
The GEPA paper opens interesting questions for data scientists and other researchers:
- How does reflection quality vary across optimizer models?
- Can GEPA extend beyond single-module optimization to full agent architectures?
- What happens in domains where failure modes aren’t expressible in natural language?
- How does performance scale with extremely limited rollout budgets (under 100)?
Reflection-based optimization is still new, and we will continue to discover its boundaries, including applications to multi-objective optimization and architecture search.
Comparing GEPA Strengths and Weaknesses
No optimization algorithm works everywhere. Understanding GEPA’s limitations reveals where the approach fits in your stack.
GEPA Strengths
GEPA’s efficiency shines with rollout budgets of roughly 600 to 7,000, which makes it ideal when APIs are expensive, labeled data is limited or evaluation budgets are otherwise constrained.
GEPA targets single-turn or few-turn tasks with clear input-output patterns. It excels at refining the core instructions guiding agent behavior.
GEPA succeeded on benchmarks with diagnosable failure modes visible in execution traces: tasks requiring multi-hop reasoning, document retrieval and structured generation.
Unlike black-box optimization, GEPA produces substantive, interpretable refinements. That interpretability can be a requirement when you need reviewable changes, which is critical for regulated industries or brand-sensitive applications.
GEPA Weaknesses
The research notes that GEPA is instruction-focused. If you need to select optimal demonstrations rather than improve instruction wording, Few-Shot Bayesian optimization could work better.
If system-level coordination is the main issue for your agent, such as complex multi-agent architectures where prompt quality isn’t the bottleneck, use Hierarchical Reflective optimization or architectural changes instead.
If your diagnostic signal is poor, GEPA’s reflection mechanism might not be the right tool. Because GEPA requires interpretable failure modes, execution traces need to reveal why failures occur. If the traces don’t include that information, the algorithm can’t generate targeted improvements.
If you have an effectively unlimited evaluation budget, can afford more than 50,000 rollouts and want to optimize model weights directly, traditional reinforcement learning might achieve marginal additional gains. Before deciding, consider the interpretability tradeoff.
Optimizing Agents With an Optimizer-Agnostic, Modular SDK
Always match the algorithm to the challenge. GEPA excels at instruction optimization under sample constraints. But it’s one tool in a comprehensive toolkit for production agent development.
When considering how to incorporate GEPA into your agent workflow, remember that it is only one algorithm in Opik’s broader optimization suite. Production systems need multiple approaches:
- GEPA: single-prompt instruction optimization for single-turn tasks
- MetaPrompt: LLM-driven critique for general refinement
- HRPO (Hierarchical Reflective): root-cause analysis for multi-component systems
- Few-Shot Bayesian: demonstration selection via Bayesian search
- Evolutionary: genetic algorithms for novel structures
- Parameter: tuning of temperature, top_p and other model settings
You can also chain these tools together; the Opik Agent Optimization SDK provides a native connection point between all of its optimization algorithms. For example, you could run GEPA followed by Few-Shot Bayesian to optimize instructions and then demonstrations, or GEPA followed by Parameter to refine prompts and then model settings.
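As a rough illustration of that chaining idea, here is a minimal, hypothetical sketch. The lambdas stand in for real optimization stages; nothing here is the Opik SDK’s actual API, so treat it as the shape of the workflow rather than a reference implementation.

```python
# Hypothetical sketch of chaining optimization stages: instructions first, then demonstrations.
# The lambdas stand in for real optimizers; this is not the Opik SDK's actual API.
from typing import Callable

Optimizer = Callable[[str], str]  # takes a prompt artifact, returns an improved one


def chain(*stages: Optimizer) -> Optimizer:
    """Compose stages so each one refines the previous stage's output."""
    def run(prompt: str) -> str:
        for stage in stages:
            prompt = stage(prompt)
        return prompt
    return run


# Example: a GEPA-style instruction pass followed by a few-shot demonstration pass.
optimize = chain(
    lambda p: p + "\n[instructions refined by a GEPA-style optimizer]",
    lambda p: p + "\n[demonstrations appended by a few-shot optimizer]",
)
print(optimize("Given the fields question, summary_1, produce the fields query."))
```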
To successfully implement GEPA alongside these other optimizers, consider these tips:
- Rich feedback is essential. Make full use of the reason field, because it powers reflection: specific reasons enable targeted improvements, while simply noting that a failure occurred does not (see the sketch after this list).
- Ensure full observability by sending every trial to Opik’s dashboard with complete trace data: inputs, outputs, intermediate reasoning, metric scores and the evolution tree showing how prompts improved.
- Make sure the same tracing infrastructure that captures optimization runs works in production, creating a continuous loop where production feedback informs future optimization.
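Here is a small, illustrative sketch of a metric that returns a reason alongside its score. The MetricResult dataclass and the heuristics are assumptions made for this example, not the Opik SDK’s metric interface, but the principle carries over: a specific reason gives the reflection step something concrete to act on.

```python
# Illustrative metric returning a score plus a diagnostic reason.
# MetricResult and the heuristics below are assumptions, not the Opik SDK's own types.
from dataclasses import dataclass


@dataclass
class MetricResult:
    score: float
    reason: str


def answer_metric(expected: str, actual: str, retrieved_titles: list[str]) -> MetricResult:
    """Score an answer and explain failures specifically enough for reflection to act on."""
    if expected.lower() in actual.lower():
        return MetricResult(1.0, "Answer contains the expected entity.")
    if not retrieved_titles:
        return MetricResult(0.0, "Second-hop retrieval returned nothing; the query likely duplicated hop one.")
    return MetricResult(
        0.0,
        f"Answer missed '{expected}'; retrieved documents were {retrieved_titles[:3]}, "
        "which suggests the query did not broaden beyond the first-hop entity.",
    )
```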
GEPA is just one algorithm addressing one piece of the puzzle. Production agentic systems require optimizing prompts, tools, architectures, model parameters and retrieval strategies simultaneously. A reliable agent isn’t built on a single algorithm or optimizer; it’s built with a composable toolkit where you select the right approach for each component, chain algorithms together and run them at different lifecycle stages.
The Path Forward
Three years ago, prompt engineering was an art. Manual tweaking, folklore, trial and error. Today it’s becoming a science through systematic, data-driven and algorithmic refinement.
GEPA represents this shift. Rather than treating prompts as opaque parameters to tweak randomly, this approach treats prompts as natural language instructions that can be analyzed, critiqued and improved through explicit reasoning.
This interpretability transforms how you can build production systems. When GEPA proposes changes, review exactly what it’s modifying and why. You don’t have to deploy black-box improvements. GEPA empowers you to deploy understandable refinements you can validate against requirements, compliance constraints and brand voice.
In Opik, agent optimization integrates with tracing, evaluation and production monitoring to create continuous improvement loops. Production traces become optimization training data. Evaluation metrics guide algorithmic search. Optimized prompts get validated against held-out tests before deployment. The cycle repeats, continuously refining systems as user needs evolve. The best part is that GEPA is a first-party integration with the Opik optimization SDK, so you get a user interface and a testing pipeline out of the box.
This is how AI development matures: from reactive debugging to proactive optimization, from manual iteration to systematic improvement, from guesswork to principled engineering.
Ready to move beyond manual prompt tweaking? Use the Opik Agent Optimizer Quickstart to set up your first successful optimization run. The open-source Agent Optimizer includes GEPA alongside MetaPrompt, Hierarchical Reflective, Evolutionary and other algorithms. These tools give you the flexibility to compose the right optimization strategy for your specific challenges.
To see how reflection-based optimization compares to your current approach, start with 50 evaluation examples and a simple success metric. GEPA shows you what systematic, interpretable improvement looks like. Join hundreds of teams already using Opik for production agent optimization.
