Optimizer benchmarks

We regularly evaluate every optimizer against shared datasets so you can make informed trade-offs. This page summarizes the latest results and explains how to reproduce them with the public benchmark scripts.

Datasets & metrics

Each run uses Opik datasets backed by open-source corpora commonly used in academia:

| Dataset | Description | Primary metrics |
| --- | --- | --- |
| Arc (ai2_arc) | Multiple-choice science questions. | LevenshteinRatio, accuracy. |
| GSM8K (gsm8k) | Grade-school math word problems. | Exact match, custom math verifier. |
| MedHallu (medhallu) | Medical Q&A with hallucination checks. | Hallucination, AnswerRelevance. |
| RagBench (ragbench) | Retrieval-oriented questions. | AnswerRelevance, contextual grounding. |
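
The metrics in the table map onto Opik's Python SDK. As a minimal sketch, assuming LevenshteinRatio is exposed under opik.evaluation.metrics in your installed Opik version, scoring a single output against a reference looks like this:

    # Minimal sketch: scoring one answer with a heuristic metric from the table above.
    # Assumes the Opik Python SDK (pip install opik) exposes LevenshteinRatio under
    # opik.evaluation.metrics; adjust the import if your version differs.
    from opik.evaluation.metrics import LevenshteinRatio

    metric = LevenshteinRatio()

    # Compare a model's answer against the dataset's reference answer.
    result = metric.score(
        output="The mitochondria is the powerhouse of the cell.",
        reference="The mitochondria is the powerhouse of the cell",
    )
    print(result.value)  # Score in [0, 1]; higher means closer to the reference.

LLM-judged metrics such as Hallucination and AnswerRelevance are used the same way but call a judge model, which adds to evaluation cost.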

Results shown below use gpt-4o-mini for evaluation on non-multi-hop runs. Scores will change if you select different models, metrics, agent configurations, or prompt seeds.

Latest results

| Rank | Algorithm/Optimizer | Avg. Score | Arc | GSM8K | RagBench |
| --- | --- | --- | --- | --- | --- |
| 1 | HRPO | 67.83% | 92.70% | 28.00% | 82.80% |
| 2 | Few-Shot Bayesian | 59.17% | 28.09% | 59.26% | 90.15% |
| 3 | Evolutionary | 52.51% | 40.00% | 25.53% | 92.00% |
| 4 | MetaPrompt | 38.75% | 25.00% | 26.93% | 64.31% |
| 5 | GEPA | 32.27% | 6.55% | 26.08% | 64.17% |
| 6 | Baseline (no optimization) | 11.85% | 1.69% | 24.06% | 9.81% |

These are directional numbers. Some optimizers use more LLM/tool calls per trial than others (e.g., HRPO, the Hierarchical Reflective Prompt Optimizer, batches multiple analyses), so cost and runtime are not apples-to-apples even when the trial budget matches. Re-run the suite with your own datasets, models, and cost constraints before committing to a single optimizer.

Run benchmarks locally

  1. Install dependencies (ideally in a virtualenv):
    $ pip install -r sdks/opik_optimizer/benchmarks/requirements.txt
  2. Configure provider keys (e.g., OPENAI_API_KEY).
  3. Execute the runner:
    $ python sdks/opik_optimizer/benchmarks/run_benchmark.py \
    > --model openai/gpt-4o-mini \
    > --output results.json
  4. Inspect the JSON or load it into a notebook to compare against the published table (see the sketch below).
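
The schema of results.json can vary between benchmark versions, so treat the following as a sketch: it assumes the file holds a list of records, each carrying "optimizer" and "score" fields, which you should verify against your own output before relying on the numbers.

    # Sketch: aggregate benchmark output per optimizer.
    # The field names ("optimizer", "score") are assumptions; inspect the
    # results.json produced by your run and adjust the keys accordingly.
    import json
    from collections import defaultdict

    with open("results.json") as f:
        records = json.load(f)

    scores = defaultdict(list)
    for record in records:
        scores[record["optimizer"]].append(record["score"])

    # Average score per optimizer, highest first, mirroring the published table.
    averages = {name: sum(vals) / len(vals) for name, vals in scores.items()}
    for name, avg in sorted(averages.items(), key=lambda item: item[1], reverse=True):
        print(f"{name}: {avg:.2%}")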

The script spins up datasets defined in sdks/opik_optimizer/benchmarks/config.py, runs each optimizer with consistent trial budgets, and logs runs to Opik so you can review traces. Note that production use should include separate validation datasets to prevent overfitting—see Define datasets for guidance.
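
One lightweight way to follow that advice is to hold out a validation split that the optimizer never sees during tuning and only use it for the final comparison. The sketch below does this with plain Python over a list of items; the item fields are illustrative, and in practice you would register the two splits as separate Opik datasets.

    # Sketch: split items into optimization and validation sets before tuning,
    # so the final prompt is judged on data the optimizer never saw.
    # The item fields below are illustrative; your dataset schema may differ.
    import random

    items = [
        {"question": "What gas do plants absorb?", "answer": "Carbon dioxide"},
        {"question": "What is 12 * 8?", "answer": "96"},
        # ... the rest of your dataset items
    ]

    random.seed(42)  # Fixed seed so the split is reproducible.
    random.shuffle(items)

    cutoff = int(len(items) * 0.8)
    optimization_items = items[:cutoff]  # Seen by the optimizer during tuning.
    validation_items = items[cutoff:]    # Held out for the final evaluation.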

Looking for production-style examples beyond synthetic benchmarks? Check out the agent optimizations demos repo. It contains end-to-end scenarios (LangGraph, RAG, support bots) and shows how different optimizers behave in real workloads.

Next steps

  • Learn how each optimizer works in the Algorithms overview.
  • Customize the benchmark configs (datasets, metrics, budgets) to mirror your production workload.
  • Share results or contribute improvements via GitHub.