Hierarchical Reflective Optimizer
The HierarchicalReflectiveOptimizer uses hierarchical root cause analysis to identify and address specific failure modes in your prompts. It analyzes evaluation results, identifies patterns in failures, and generates targeted improvements to address each failure mode systematically.
When to Use This Optimizer:
HierarchicalReflectiveOptimizer is ideal when you have a complex prompt that you want to refine based on understanding why it’s failing. Unlike optimizers that generate many random variations, this optimizer systematically analyzes failures, identifies root causes, and makes surgical improvements to address each specific issue.
Key Trade-offs:
- Requires metrics that return reasons for their scores (using ScoreResult with the reason field). Simple numeric metrics won’t provide enough feedback for root cause analysis.
- Best suited for refinement of existing prompts rather than discovering entirely new prompt structures.
- Uses hierarchical analysis which involves multiple LLM calls for analyzing batches of failures and synthesizing findings.
- Currently supports single-iteration optimization (one round of analysis and improvement), though the framework is designed for future multi-round support.
Have questions about HierarchicalReflectiveOptimizer? The Optimizer & SDK FAQ addresses common questions about choosing optimizers, understanding the role of the reasoning_model, and how parameters like max_parallel_batches affect performance and cost.
How It Works
The HierarchicalReflectiveOptimizer takes a systematic approach to prompt improvement through the following process:
1. Baseline Evaluation:
   - Your initial prompt is evaluated against the dataset using your specified metric.
   - A baseline score is established to measure improvements against.
2. Hierarchical Root Cause Analysis:
   - Evaluation results (especially failures or low-scoring cases) are split into batches.
   - Each batch is analyzed in parallel using the reasoning_model to identify patterns and failure modes.
   - The batch-level analyses are then synthesized into a unified set of failure modes that represent the core issues with the current prompt.
   - This hierarchical approach (batch → synthesis) is more scalable and robust than analyzing all failures at once.
3. Failure Mode Identification:
   - Each identified failure mode includes:
     - A descriptive name (e.g., “Vague Instructions”, “Missing Context”)
     - A description of the failure pattern
     - A root cause analysis explaining why the prompt fails in these cases
4. Targeted Improvement Generation:
   - For each failure mode, the optimizer generates an improved version of the prompt.
   - The improvement is guided by a meta-prompt that instructs the reasoning_model to:
     - Make surgical, targeted changes that address the specific root cause
     - Update existing instructions if they’re unclear or incomplete
     - Add new instructions only when necessary
     - Maintain the original prompt structure and intent
5. Iterative Evaluation with Retries:
   - Each improved prompt is evaluated against the dataset.
   - If an improvement doesn’t increase the score, the optimizer can retry with a different seed (controlled by max_retries).
   - Only improvements that increase the score are kept; otherwise, the previous best prompt is retained.
6. Result:
   - The highest-scoring prompt found across all improvements is returned as the optimized prompt.
   - Detailed metadata about the optimization process, including failure modes addressed and improvement attempts, is included in the result.
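To make the hierarchical analysis in Step 2 concrete, here is a minimal, purely illustrative sketch of the batch-then-synthesize flow. The functions analyze_batch and synthesize_failure_modes are hypothetical stand-ins for the LLM calls made with the reasoning_model; they are not part of the library's API.

```python
from concurrent.futures import ThreadPoolExecutor

def analyze_batch(batch):
    """Stand-in for an LLM call that summarizes failure patterns in one batch."""
    return {"patterns": [item["reason"] for item in batch]}

def synthesize_failure_modes(batch_findings):
    """Stand-in for an LLM call that merges batch-level findings into unified failure modes."""
    return sorted({pattern for finding in batch_findings for pattern in finding["patterns"]})

def hierarchical_root_cause_analysis(failed_results, batch_size=20, max_parallel_batches=5):
    # Step 2a: split low-scoring evaluation results into batches.
    batches = [failed_results[i:i + batch_size]
               for i in range(0, len(failed_results), batch_size)]
    # Step 2b: analyze batches in parallel, bounded by max_parallel_batches.
    with ThreadPoolExecutor(max_workers=max_parallel_batches) as pool:
        batch_findings = list(pool.map(analyze_batch, batches))
    # Step 2c: synthesize batch-level findings into a unified set of failure modes.
    return synthesize_failure_modes(batch_findings)
```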
Metric Requirements: The HierarchicalReflectiveOptimizer requires metrics that provide reasoning about their scores. When using Opik metrics, ensure they return ScoreResult objects with the reason field populated. This feedback is critical for identifying failure modes.
Example of a good metric for HierarchicalReflectiveOptimizer:
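A minimal sketch, assuming the opik-optimizer convention of a metric function that receives the dataset item and the model's output, and that ScoreResult is importable from opik.evaluation.metrics.score_result; the "answer" field name is a placeholder for your dataset's key.

```python
from opik.evaluation.metrics.score_result import ScoreResult

def answer_quality(dataset_item: dict, llm_output: str) -> ScoreResult:
    """Score the model output and, crucially, explain the score via the reason field."""
    expected = dataset_item["answer"]  # placeholder field name; use your dataset's key
    matched = expected.lower() in llm_output.lower()
    return ScoreResult(
        name="answer_quality",
        value=1.0 if matched else 0.0,
        reason=(
            f"The output contains the expected answer '{expected}'."
            if matched
            else f"The output does not mention the expected answer '{expected}'; "
                 "it may be incomplete, off-topic, or phrased too vaguely."
        ),
    )
```

The reason string is what the optimizer's root cause analysis actually reads, so make it specific to the failure rather than a generic pass/fail message.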
The hierarchical root cause analysis (Step 2) is what makes this optimizer unique. It processes evaluation results in batches, analyzes patterns in each batch, and then synthesizes findings across all batches. This approach scales better to large datasets and produces more coherent, actionable failure modes than analyzing all results at once. Understanding how Opik’s evaluation works will help you design better metrics:
- Evaluation Overview
- Metrics Overview
Configuration Options
Basic Configuration
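A minimal sketch, assuming the class is exported from the opik_optimizer package; only the reasoning model is specified and the remaining parameters keep their defaults (the model string is an example):

```python
from opik_optimizer import HierarchicalReflectiveOptimizer

# Basic setup: specify the reasoning model, keep the other defaults.
optimizer = HierarchicalReflectiveOptimizer(
    reasoning_model="openai/gpt-4.1",
)
```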
Advanced Configuration
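An illustrative advanced setup using the parameters documented below; the specific values are examples, not recommended defaults:

```python
from opik_optimizer import HierarchicalReflectiveOptimizer

optimizer = HierarchicalReflectiveOptimizer(
    reasoning_model="openai/gpt-4.1",  # LLM used for analysis, synthesis, and improvements
    num_threads=8,                     # parallel threads for dataset evaluation
    max_parallel_batches=5,            # concurrent batches during root cause analysis
    seed=42,                           # base seed for reproducibility
    verbose=1,                         # 1 = show progress bars and detailed logging
    temperature=0.7,                   # example of **model_kwargs forwarded to LLM calls
)
```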
Key parameters include:
- reasoning_model: The LLM used for root cause analysis, failure mode synthesis, and generating prompt improvements. This is typically a powerful model like GPT-4.
- num_threads: Number of parallel threads used for evaluating prompts against the dataset. Higher values speed up evaluation but increase concurrent API calls.
- max_parallel_batches: Maximum number of batches to analyze concurrently during hierarchical root cause analysis. Controls parallelism vs. memory/API usage.
- seed: Random seed for reproducibility. Each retry attempt uses a varied seed to avoid cache hits and ensure different improvement suggestions.
- verbose: Controls display of progress bars and detailed logging (0=off, 1=on).
- **model_kwargs: Additional keyword arguments (e.g., temperature, max_tokens) passed to the underlying LLM calls.
The optimize_prompt method also accepts:
- max_retries: Number of retry attempts if an improvement doesn’t increase the score (default: 2). Each retry uses a different seed.
- n_samples: Optional limit on the number of dataset items used for evaluation. Useful for faster iterations during development.
- auto_continue: Reserved for future multi-round optimization support.
Example Usage
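A hedged end-to-end sketch: it assumes an Opik dataset named "my-qa-dataset" already exists with "question" and "answer" fields per item, reuses the answer_quality metric sketched above, and uses ChatPrompt and optimize_prompt as exposed by opik_optimizer (exact signatures may vary across SDK versions):

```python
import opik
from opik_optimizer import ChatPrompt, HierarchicalReflectiveOptimizer

# Dataset: assumed to already exist in Opik, with "question"/"answer" fields per item.
client = opik.Opik()
dataset = client.get_dataset(name="my-qa-dataset")

# Starting prompt: the optimizer refines this structure rather than inventing a new one.
prompt = ChatPrompt(
    system="You are a helpful assistant. Answer the user's question concisely and accurately.",
    user="{question}",
)

optimizer = HierarchicalReflectiveOptimizer(reasoning_model="openai/gpt-4.1")

result = optimizer.optimize_prompt(
    prompt=prompt,
    dataset=dataset,
    metric=answer_quality,  # the ScoreResult-with-reason metric sketched earlier
    n_samples=100,          # cap dataset items for faster iteration
    max_retries=2,          # retry improvements that do not beat the current best score
)

result.display()
```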
Model Support
The HierarchicalReflectiveOptimizer uses LiteLLM for model interactions, providing broad compatibility with various LLM providers including OpenAI, Azure OpenAI, Anthropic, Google (Vertex AI / AI Studio), Mistral AI, Cohere, and locally hosted models (e.g., via Ollama).
The reasoning_model parameter accepts any LiteLLM-supported model string (e.g., "openai/gpt-4.1", "azure/gpt-4", "anthropic/claude-3-opus", "gemini/gemini-1.5-pro").
For detailed instructions on how to specify different models and configure providers, please refer to the main LiteLLM Support for Optimizers documentation page.
Configuration Example using LiteLLM model string
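For instance, a sketch pointing the optimizer at an Anthropic model via its LiteLLM string; the extra keyword arguments are illustrative **model_kwargs, not required settings:

```python
from opik_optimizer import HierarchicalReflectiveOptimizer

optimizer = HierarchicalReflectiveOptimizer(
    reasoning_model="anthropic/claude-3-opus",  # any LiteLLM-supported model string
    temperature=0.2,                            # optional kwargs forwarded to the LLM
    max_tokens=4096,
)
```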
Best Practices
- Metric Design
  - Always include detailed reasons in your metric’s ScoreResult. The quality of root cause analysis depends on this feedback.
  - Provide specific, actionable feedback about why a response succeeds or fails.
  - Consider multiple aspects: correctness, completeness, format, tone, etc.
- Starting Prompt
  - Begin with a reasonably structured prompt. The optimizer refines existing prompts rather than creating from scratch.
  - Include clear intent and structure; the optimizer will make it more precise.
- Batch Configuration
  - max_parallel_batches=5 is a good default for balancing speed and API rate limits.
  - Increase it if you have high rate limits and want faster analysis.
  - Decrease it if you encounter rate limiting issues.
- Retry Strategy
  - Use max_retries=2 or max_retries=3 to give the optimizer multiple chances to improve for each failure mode.
  - Each retry uses a different seed, producing different improvement suggestions.
  - Higher retries increase cost but may find better solutions.
- Sample Size
  - Start with n_samples=50-100 for faster iteration during development (see the sketch after this list).
  - Use larger samples or the full dataset for final optimization runs.
  - Ensure your sample is representative of the full dataset’s diversity.
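A sketch of that two-stage workflow, reusing the optimizer, prompt, dataset, and metric from the earlier example; the n_samples values are illustrative:

```python
# Development run: a small, representative subset for quick feedback.
dev_result = optimizer.optimize_prompt(
    prompt=prompt,
    dataset=dataset,
    metric=answer_quality,
    n_samples=50,
)

# Final run: omit n_samples to evaluate against the full dataset.
final_result = optimizer.optimize_prompt(
    prompt=prompt,
    dataset=dataset,
    metric=answer_quality,
)
```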
Comparison with Other Optimizers
Troubleshooting
Issue: Optimizer reports no improvements found
- Solution: Check that your metric returns detailed reason fields. Ensure the dataset has sufficient examples of failures to analyze.
Issue: Root cause analysis seems generic
- Solution: Use a stronger reasoning_model (e.g., GPT-4 instead of GPT-3.5). Ensure your metric’s reasons are specific and actionable.
Issue: Optimization is slow
- Solution: Reduce n_samples, increase num_threads, or decrease max_parallel_batches to balance speed vs. thoroughness.
Issue: Rate limiting errors
- Solution: Decrease max_parallel_batches and num_threads to reduce concurrent API calls.
Research and References
The Reflective Optimizer is inspired by techniques in:
- Hierarchical analysis for scalable root cause identification
- Reflective prompting for self-improvement
- Targeted refinement over broad exploration
Next Steps
- Explore other Optimization Algorithms
- Learn about Dataset Requirements
- Try the Example Projects & Cookbooks for runnable examples
- Read about creating effective metrics that provide good feedback for optimization