Hierarchical Reflective Optimizer

Hierarchical root cause analysis for targeted prompt improvement

The HierarchicalReflectiveOptimizer uses hierarchical root cause analysis to identify and address specific failure modes in your prompts. It analyzes evaluation results, identifies patterns in failures, and generates targeted improvements to address each failure mode systematically.

When to Use This Optimizer: HierarchicalReflectiveOptimizer is ideal when you have a complex prompt that you want to refine based on understanding why it’s failing. Unlike optimizers that generate many random variations, this optimizer systematically analyzes failures, identifies root causes, and makes surgical improvements to address each specific issue.

Key Trade-offs:

  • Requires metrics that return reasons for their scores (a ScoreResult with the reason field populated). Simple numeric metrics won’t provide enough feedback for root cause analysis.
  • Best suited for refining existing prompts rather than discovering entirely new prompt structures.
  • The hierarchical analysis makes multiple LLM calls: one per batch of failures, a synthesis step, and at least one improvement attempt per failure mode.
  • Currently supports single-iteration optimization (one round of analysis and improvement), though the framework is designed for future multi-round support.

Have questions about HierarchicalReflectiveOptimizer? The Optimizer & SDK FAQ addresses common questions about choosing optimizers, understanding the role of the reasoning_model, and how parameters like max_parallel_batches affect performance and cost.

How It Works

The HierarchicalReflectiveOptimizer takes a systematic approach to prompt improvement through the following process (a conceptual code sketch follows the list):

  1. Baseline Evaluation:

    • Your initial prompt is evaluated against the dataset using your specified metric.
    • A baseline score is established to measure improvements against.
  2. Hierarchical Root Cause Analysis:

    • Evaluation results (especially failures or low-scoring cases) are split into batches.
    • Each batch is analyzed in parallel using the reasoning_model to identify patterns and failure modes.
    • The batch-level analyses are then synthesized into a unified set of failure modes that represent the core issues with the current prompt.
    • This hierarchical approach (batch → synthesis) is more scalable and robust than analyzing all failures at once.
  3. Failure Mode Identification:

    • Each identified failure mode includes:
      • A descriptive name (e.g., “Vague Instructions”, “Missing Context”)
      • A description of the failure pattern
      • A root cause analysis explaining why the prompt fails in these cases
  4. Targeted Improvement Generation:

    • For each failure mode, the optimizer generates an improved version of the prompt.
    • The improvement is guided by a meta-prompt that instructs the reasoning_model to:
      • Make surgical, targeted changes that address the specific root cause
      • Update existing instructions if they’re unclear or incomplete
      • Add new instructions only when necessary
      • Maintain the original prompt structure and intent
  5. Iterative Evaluation with Retries:

    • Each improved prompt is evaluated against the dataset.
    • If an improvement doesn’t increase the score, the optimizer can retry with a different seed (controlled by max_retries).
    • Only improvements that increase the score are kept; otherwise, the previous best prompt is retained.
  6. Result:

    • The highest-scoring prompt found across all improvements is returned as the optimized prompt.
    • Detailed metadata about the optimization process, including failure modes addressed and improvement attempts, is included in the result.
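
Conceptually, the whole loop can be summarized by the sketch below. Everything in it (evaluate, analyze_failure_modes, improve_prompt) is an illustrative placeholder rather than a function exposed by the SDK; the real logic lives inside optimize_prompt.

```python
# Conceptual sketch of the loop above -- NOT the actual opik_optimizer API.

def evaluate(prompt, dataset, metric):
    """Placeholder: returns (mean score, per-item results with reasons)."""
    ...

def analyze_failure_modes(prompt, results):
    """Placeholder: hierarchical root cause analysis (steps 2-3)."""
    ...

def improve_prompt(prompt, failure_mode, seed):
    """Placeholder: generate a targeted fix for one failure mode (step 4)."""
    ...

def reflective_optimize(prompt, dataset, metric, max_retries=2, seed=42):
    best_score, results = evaluate(prompt, dataset, metric)        # 1. baseline evaluation
    best_prompt = prompt

    for failure_mode in analyze_failure_modes(prompt, results):    # 2-4. analysis -> targeted fixes
        for attempt in range(1 + max_retries):                     # 5. retry with a varied seed
            candidate = improve_prompt(best_prompt, failure_mode, seed=seed + attempt)
            score, _ = evaluate(candidate, dataset, metric)
            if score > best_score:                                 # keep only score-raising changes
                best_prompt, best_score = candidate, score
                break

    return best_prompt, best_score                                 # 6. highest-scoring prompt wins
```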

Metric Requirements: The HierarchicalReflectiveOptimizer requires metrics that provide reasoning about their scores. When using Opik metrics, ensure they return ScoreResult objects with the reason field populated. This feedback is critical for identifying failure modes.

Example of a good metric for HierarchicalReflectiveOptimizer:

```python
from opik.evaluation.metrics import ScoreResult

def my_metric(dataset_item, llm_output):
    # Your scoring logic (replace calculate_score with your own implementation)
    score = calculate_score(dataset_item, llm_output)

    # IMPORTANT: Include a specific reason for the score.
    # Append any additional context that explains why this score was given.
    reason = (
        f"Output {'matches' if score > 0.5 else 'does not match'} the expected format "
        f"(score: {score:.2f})."
    )

    return ScoreResult(
        name="my_metric",
        value=score,
        reason=reason,  # This is required!
    )
```

The hierarchical root cause analysis (Step 2) is what makes this optimizer unique. It processes evaluation results in batches, analyzes patterns in each batch, and then synthesizes findings across all batches. This approach scales better to large datasets and produces more coherent, actionable failure modes than analyzing all results at once.

Understanding how Opik’s evaluation works will help you design better metrics:

  • Evaluation Overview
  • Metrics Overview
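
To make the batch → synthesis step concrete, the sketch below shows the overall pattern. analyze_batch, synthesize, and batch_size are hypothetical stand-ins (in the optimizer these steps are LLM calls to the reasoning_model); only the bounded concurrency, controlled by max_parallel_batches, mirrors the real parameter.

```python
from concurrent.futures import ThreadPoolExecutor

def analyze_batch(batch):
    """Hypothetical helper: find failure patterns in one batch of low-scoring results."""
    ...

def synthesize(batch_analyses):
    """Hypothetical helper: merge per-batch findings into a unified set of failure modes."""
    ...

def hierarchical_root_cause_analysis(failed_results, batch_size=20, max_parallel_batches=5):
    # Split evaluation failures into fixed-size batches.
    batches = [
        failed_results[i:i + batch_size]
        for i in range(0, len(failed_results), batch_size)
    ]
    # Analyze batches concurrently, never more than max_parallel_batches at once.
    with ThreadPoolExecutor(max_workers=max_parallel_batches) as pool:
        batch_analyses = list(pool.map(analyze_batch, batches))
    # Synthesize batch-level findings into the final list of failure modes.
    return synthesize(batch_analyses)
```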

Configuration Options

Basic Configuration

```python
from opik_optimizer import HierarchicalReflectiveOptimizer

optimizer = HierarchicalReflectiveOptimizer(
    reasoning_model="openai/gpt-4.1",  # Model for analysis and improvement generation

    # Technical parameters
    num_threads=12,           # Parallel threads for evaluation
    max_parallel_batches=5,   # Max batches analyzed concurrently
    verbose=1,                # 0=quiet, 1=show progress
    seed=42,                  # Random seed for reproducibility

    # LLM parameters (passed via **model_kwargs)
    temperature=0.7,
    max_tokens=4096,
)
```

Advanced Configuration

Key parameters include:

  • reasoning_model: The LLM used for root cause analysis, failure mode synthesis, and generating prompt improvements. This is typically a powerful model like GPT-4.
  • num_threads: Number of parallel threads used for evaluating prompts against the dataset. Higher values speed up evaluation but increase concurrent API calls.
  • max_parallel_batches: Maximum number of batches to analyze concurrently during hierarchical root cause analysis. Controls parallelism vs. memory/API usage.
  • seed: Random seed for reproducibility. Each retry attempt uses a varied seed to avoid cache hits and ensure different improvement suggestions.
  • verbose: Controls display of progress bars and detailed logging (0=off, 1=on).
  • **model_kwargs: Additional keyword arguments (e.g., temperature, max_tokens) passed to the underlying LLM calls.

The optimize_prompt method also accepts:

  • max_retries: Number of retry attempts if an improvement doesn’t increase the score (default: 2). Each retry uses a different seed.
  • n_samples: Optional limit on the number of dataset items used for evaluation. Useful for faster iterations during development.
  • auto_continue: Reserved for future multi-round optimization support.

Example Usage

```python
from opik_optimizer import HierarchicalReflectiveOptimizer, ChatPrompt, datasets
from opik.evaluation.metrics.score_result import ScoreResult

# 1. Define your evaluation dataset
dataset = datasets.hotpot_300()  # or use your own dataset

# 2. Configure the evaluation metric (MUST return reasons!)
def answer_quality_metric(dataset_item, llm_output):
    reference = dataset_item.get("answer", "")

    # Your scoring logic
    is_correct = reference.lower() in llm_output.lower()
    score = 1.0 if is_correct else 0.0

    # IMPORTANT: Provide detailed reasoning
    if is_correct:
        reason = f"Output contains the correct answer: '{reference}'"
    else:
        reason = f"Output does not contain expected answer '{reference}'. Output was too vague or incorrect."

    return ScoreResult(
        name="answer_quality",
        value=score,
        reason=reason,  # Critical for root cause analysis!
    )

# 3. Define your initial prompt
initial_prompt = ChatPrompt(
    project_name="reflective_optimization",
    messages=[
        {"role": "system", "content": "You are a helpful assistant that answers questions accurately."},
        {"role": "user", "content": "Question: {question}\n\nProvide a concise answer."},
    ],
)

# 4. Initialize the HierarchicalReflectiveOptimizer
optimizer = HierarchicalReflectiveOptimizer(
    reasoning_model="openai/gpt-4.1",  # Strong model for analysis
    num_threads=8,
    max_parallel_batches=5,
    seed=42,
    temperature=0.7,
)

# 5. Run the optimization
optimization_result = optimizer.optimize_prompt(
    prompt=initial_prompt,
    dataset=dataset,
    metric=answer_quality_metric,
    n_samples=100,   # Evaluate on 100 samples
    max_retries=2,   # Retry up to 2 times if an improvement fails
)

# 6. View the results
optimization_result.display()

# Access the optimized prompt
print("\nOptimized Prompt:")
for msg in optimization_result.prompt:
    print(f"{msg['role']}: {msg['content']}")

# Check optimization details
print(f"\nInitial Score: {optimization_result.initial_score:.4f}")
print(f"Final Score: {optimization_result.score:.4f}")
print(f"Improvement: {(optimization_result.score - optimization_result.initial_score):.4f}")
print(f"LLM Calls Made: {optimization_result.llm_calls}")
```

Model Support

The HierarchicalReflectiveOptimizer uses LiteLLM for model interactions, providing broad compatibility with various LLM providers including OpenAI, Azure OpenAI, Anthropic, Google (Vertex AI / AI Studio), Mistral AI, Cohere, and locally hosted models (e.g., via Ollama).

The reasoning_model parameter accepts any LiteLLM-supported model string (e.g., "openai/gpt-4.1", "azure/gpt-4", "anthropic/claude-3-opus", "gemini/gemini-1.5-pro").

For detailed instructions on how to specify different models and configure providers, please refer to the main LiteLLM Support for Optimizers documentation page.

Configuration Example using LiteLLM model string

```python
optimizer = HierarchicalReflectiveOptimizer(
    reasoning_model="anthropic/claude-3-opus-20240229",
    temperature=0.7,
    max_tokens=4096,
)
```

Best Practices

  1. Metric Design

    • Always include detailed reasons in your metric’s ScoreResult. The quality of root cause analysis depends on this feedback.
    • Provide specific, actionable feedback about why a response succeeds or fails.
    • Consider multiple aspects: correctness, completeness, format, tone, etc. (a sketch combining several aspects follows this list).
  2. Starting Prompt

    • Begin with a reasonably structured prompt. The optimizer refines existing prompts rather than creating from scratch.
    • Include clear intent and structure; the optimizer will make it more precise.
  3. Batch Configuration

    • max_parallel_batches=5 is a good default for balancing speed and API rate limits.
    • Increase if you have high rate limits and want faster analysis.
    • Decrease if you encounter rate limiting issues.
  4. Retry Strategy

    • Use max_retries=2 or max_retries=3 to give the optimizer multiple chances to improve for each failure mode.
    • Each retry uses a different seed, producing different improvement suggestions.
    • Higher retries increase cost but may find better solutions.
  5. Sample Size

    • Start with n_samples=50-100 for faster iteration during development.
    • Use larger samples or full dataset for final optimization runs.
    • Ensure your sample is representative of the full dataset’s diversity.
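
For the metric design point above, here is one way to combine several aspects into a single score with a detailed reason. It is only a sketch: the heuristics, weights, and thresholds are arbitrary illustrations meant to show how a rich reason string feeds the root cause analysis.

```python
from opik.evaluation.metrics import ScoreResult

def multi_aspect_metric(dataset_item, llm_output):
    reference = dataset_item.get("answer", "")

    # Score several aspects separately (simple heuristics, for illustration only).
    correctness = 1.0 if reference.lower() in llm_output.lower() else 0.0
    completeness = min(len(llm_output.split()) / 30, 1.0)  # rough length-based proxy
    formatting = 1.0 if not llm_output.strip().endswith("...") else 0.5

    # Arbitrary illustrative weights.
    score = 0.6 * correctness + 0.3 * completeness + 0.1 * formatting

    # Spell out *why* each aspect scored as it did -- this is what the
    # hierarchical root cause analysis reads when grouping failures.
    reason = (
        f"Correctness: {'contains' if correctness else 'missing'} expected answer '{reference}'. "
        f"Completeness: {completeness:.2f} (length-based proxy). "
        f"Formatting: {'clean ending' if formatting == 1.0 else 'output appears truncated'}."
    )

    return ScoreResult(name="multi_aspect_quality", value=score, reason=reason)
```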

Comparison with Other Optimizers

| Aspect | HierarchicalReflectiveOptimizer | MetaPromptOptimizer | EvolutionaryOptimizer |
| --- | --- | --- | --- |
| Approach | Root cause analysis → targeted fixes | Generate variations → evaluate | Genetic algorithm with populations |
| Metric Requirements | Requires reasons (ScoreResult) | Scores only | Scores only |
| Best For | Refining complex prompts systematically | General prompt improvement | Exploring diverse prompt structures |
| Iterations | Single round (currently) | Multiple rounds | Multiple generations |
| LLM Calls | Moderate (analysis + improvements) | High (many candidate generations) | Very high (full populations) |
| Failure Understanding | Deep (identifies specific failure modes) | Limited | None (purely score-driven) |

Troubleshooting

Issue: Optimizer reports no improvements found

  • Solution: Check that your metric returns detailed reason fields. Ensure the dataset has sufficient examples of failures to analyze.

Issue: Root cause analysis seems generic

  • Solution: Use a stronger reasoning_model (e.g., GPT-4 instead of GPT-3.5). Ensure your metric’s reasons are specific and actionable.

Issue: Optimization is slow

  • Solution: Reduce n_samples to evaluate fewer items per prompt, or increase num_threads to parallelize evaluation. Raising max_parallel_batches can also speed up the analysis phase if your provider’s rate limits allow the extra concurrency.

Issue: Rate limiting errors

  • Solution: Decrease max_parallel_batches and num_threads to reduce concurrent API calls.
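
For example, a lower-concurrency configuration might look like the sketch below; the exact values are illustrative and should be tuned to your provider’s limits.

```python
# Lower-concurrency settings to stay under provider rate limits.
optimizer = HierarchicalReflectiveOptimizer(
    reasoning_model="openai/gpt-4.1",
    num_threads=4,            # fewer concurrent evaluation calls
    max_parallel_batches=2,   # fewer failure batches analyzed at once
    seed=42,
)
```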

Research and References

The HierarchicalReflectiveOptimizer is inspired by techniques in:

  • Hierarchical analysis for scalable root cause identification
  • Reflective prompting for self-improvement
  • Targeted refinement over broad exploration

Next Steps