
Optimizer benchmarks


We regularly evaluate every optimizer against shared datasets so you can make informed trade-offs. This page summarizes the latest results and explains how to reproduce them with the public benchmark scripts.

Datasets & metrics

Each run uses Opik datasets backed by open-source corpora commonly used in academia:

| Dataset | Description | Primary metrics |
| --- | --- | --- |
| Arc (ai2_arc) | Multiple-choice science questions. | LevenshteinRatio, accuracy. |
| GSM8K (gsm8k) | Grade-school math word problems. | Exact match, custom math verifier. |
| MedHallu (medhallu) | Medical Q&A with hallucination checks. | Hallucination, AnswerRelevance. |
| RagBench (ragbench) | Retrieval-oriented questions. | AnswerRelevance, contextual grounding. |

The results below use openai/gpt-5-nano for evaluation on runs that are not multi-hop. Scores will change if you select different models, metrics, agent configurations, or prompt seeds.

Latest results

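| Rank | Algorithm/Optimizer | Avg. Score | Arc | GSM8K | RagBench |
| --- | --- | --- | --- | --- | --- |
| 1 | HRPO | 67.83% | 92.70% | 28.00% | 82.80% |
| 2 | Few-Shot Bayesian | 59.17% | 28.09% | 59.26% | 90.15% |
| 3 | Evolutionary | 52.51% | 40.00% | 25.53% | 92.00% |
| 4 | MetaPrompt | 38.75% | 25.00% | 26.93% | 64.31% |
| 5 | GEPA | 32.27% | 6.55% | 26.08% | 64.17% |
| 6 | Baseline (no optimization) | 11.85% | 1.69% | 24.06% | 9.81% |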
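
The Avg. Score column appears to be the unweighted mean of the three per-dataset columns. That is an inference from the published numbers rather than something this page states, so treat the sketch below as a sanity check and not as the benchmark's official aggregation. The `scores` dictionary and the `average_scores` helper are illustrative names, with values copied from the table.

```python
from statistics import mean

# Per-dataset scores (%) copied from the results table.
scores = {
    "HRPO": {"Arc": 92.70, "GSM8K": 28.00, "RagBench": 82.80},
    "Few-Shot Bayesian": {"Arc": 28.09, "GSM8K": 59.26, "RagBench": 90.15},
    "Evolutionary": {"Arc": 40.00, "GSM8K": 25.53, "RagBench": 92.00},
    "MetaPrompt": {"Arc": 25.00, "GSM8K": 26.93, "RagBench": 64.31},
    "GEPA": {"Arc": 6.55, "GSM8K": 26.08, "RagBench": 64.17},
    "Baseline (no optimization)": {"Arc": 1.69, "GSM8K": 24.06, "RagBench": 9.81},
}


def average_scores(table):
    """Return {optimizer: unweighted mean of its dataset scores, rounded to 2 dp}."""
    return {
        name: round(mean(per_dataset.values()), 2)
        for name, per_dataset in table.items()
    }


averages = average_scores(scores)

# Print optimizers from highest to lowest average.
ranking = sorted(averages, key=averages.get, reverse=True)
for rank, name in enumerate(ranking, start=1):
    print(f"{rank}. {name:<28}{averages[name]:6.2f}%")
```

Swapping in your own per-dataset scores, or a weighted mean that reflects how much each workload matters to you, can change the ranking, which is one reason the averages are best read as directional.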

These are directional numbers. Some optimizers use more LLM or tool calls per trial than others (for example, HRPO, the Hierarchical Reflective Prompt Optimizer, batches multiple analyses), so cost and runtime are not directly comparable even when the trial budget matches. Re-run the suite with your own datasets, models, and cost constraints before committing to a single optimizer.

Run benchmarks locally

  1. Install dependencies (ideally in a virtualenv):
    $ pip install -r sdks/opik_optimizer/benchmarks/requirements.txt
  2. Configure provider keys (e.g., OPENAI_API_KEY).
  3. Execute the runner:
    $ python sdks/opik_optimizer/benchmarks/run_benchmark.py \
        --model openai/gpt-5-nano \
        --output results.json
  4. Inspect the JSON or load it into a notebook to compare against the published table.
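
For step 4, a minimal sketch of a first look at the output file. The schema of results.json is not documented on this page, so the code assumes nothing about key names and only prints the shape of whatever the runner wrote. `describe` and `load_results` are hypothetical helper names, not part of the benchmark scripts.

```python
import json
from pathlib import Path


def describe(node, depth=0, max_depth=2):
    """Return lines describing the shape of nested JSON, truncated at max_depth."""
    indent = "  " * depth
    if isinstance(node, dict):
        lines = []
        for key, value in node.items():
            lines.append(f"{indent}{key}: {type(value).__name__}")
            if depth < max_depth and isinstance(value, (dict, list)):
                lines.extend(describe(value, depth + 1, max_depth))
        return lines
    if isinstance(node, list):
        lines = [f"{indent}[list of {len(node)} items]"]
        if node and depth < max_depth:
            # Only the first element is sampled; the rest are assumed similar.
            lines.extend(describe(node[0], depth + 1, max_depth))
        return lines
    return [f"{indent}{type(node).__name__}"]


def load_results(path="results.json"):
    """Parse the file written by the runner's --output flag."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


if __name__ == "__main__":
    results_path = Path("results.json")
    if results_path.exists():
        print("\n".join(describe(load_results(results_path))))
    else:
        print("results.json not found; run the benchmark first")
```

Once the actual keys are known, the per-optimizer scores can be pulled out and compared against the published table.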

The script creates the datasets defined in sdks/opik_optimizer/benchmarks/config.py, runs each optimizer with consistent trial budgets, and logs runs to Opik so you can review traces. For production use, add a separate validation dataset to guard against overfitting; see Define datasets for guidance.
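
As an illustration of the validation point, here is a minimal sketch of a seeded hold-out split in plain Python. It does not use the Opik SDK or the benchmark config. `split_items`, the 20% default, and the example items are all assumptions made for the example, and the Define datasets guide remains the reference for how to set this up in Opik.

```python
import random


def split_items(items, validation_fraction=0.2, seed=42):
    """Shuffle deterministically and return (train_items, validation_items)."""
    if not 0 < validation_fraction < 1:
        raise ValueError("validation_fraction must be between 0 and 1")
    if len(items) < 2:
        raise ValueError("need at least two items to hold one out")
    shuffled = list(items)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    # Always hold out at least one item, even for very small datasets.
    n_validation = max(1, int(len(shuffled) * validation_fraction))
    return shuffled[n_validation:], shuffled[:n_validation]


# Hypothetical items; real ones would come from your own dataset.
example_items = [{"question": f"q{i}", "answer": f"a{i}"} for i in range(10)]
train_items, validation_items = split_items(example_items)
print(len(train_items), "training items,", len(validation_items), "validation items")
```

Fixing the seed keeps the split stable across runs, so a score change reflects the optimizer and not a reshuffled hold-out set.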

Looking for production-style examples beyond synthetic benchmarks? Check out the agent optimizations demos repo. It contains end-to-end scenarios (LangGraph, RAG, support bots) and shows how different optimizers behave in real workloads.

Next steps

  • Learn how each optimizer works in the Algorithms overview.
  • Customize the benchmark configs (datasets, metrics, budgets) to mirror your production workload.
  • Share results or contribute improvements via GitHub.