Optimizer benchmarks

We regularly evaluate every optimizer against shared datasets so you can make informed trade-offs. This page summarizes the latest results and explains how to reproduce them with the public benchmark scripts.

Datasets & metrics

Each run uses Opik datasets backed by open-source corpora commonly used in academia (a short metric-scoring sketch follows the table):

Dataset              Description                              Primary metrics
Arc (ai2_arc)        Multiple-choice science questions.       LevenshteinRatio, accuracy
GSM8K (gsm8k)        Grade-school math word problems.         Exact match, custom math verifier
MedHallu (medhallu)  Medical Q&A with hallucination checks.   Hallucination, AnswerRelevance
RagBench (ragbench)  Retrieval-oriented questions.            AnswerRelevance, contextual grounding
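
To sanity-check how a metric scores an individual answer before launching a full run, you can call the Opik metric classes directly. This is a minimal sketch assuming the opik package's heuristic Equals and LevenshteinRatio metrics and their score(output=..., reference=...) signature; swap in the metrics listed for your dataset.

    # Minimal sketch: score one model answer with Opik's heuristic metrics.
    # Assumes `pip install opik`; Equals / LevenshteinRatio and the
    # score(output=..., reference=...) signature come from the Opik SDK.
    from opik.evaluation.metrics import Equals, LevenshteinRatio

    reference = "The mitochondria produce ATP."
    model_output = "Mitochondria produce ATP."

    # LevenshteinRatio: fuzzy string similarity, as used for Arc grading.
    lev = LevenshteinRatio().score(output=model_output, reference=reference)

    # Equals: strict exact match, comparable to the GSM8K final-answer check.
    exact = Equals().score(output=model_output, reference=reference)

    print(f"LevenshteinRatio: {lev.value:.3f}")   # high, but below 1.0
    print(f"Exact match:      {exact.value:.1f}") # 0.0 here (strings differ)

Hallucination and AnswerRelevance are LLM-judged metrics, so they need provider keys configured and are best exercised through the benchmark runner itself.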

Results shown below use gpt-4o-mini for evaluation. Scores will change if you select different models, metrics, or prompt seeds.

Latest results

Rank  Algorithm/Optimizer         Avg. Score  Arc     GSM8K   RagBench
1     Hierarchical Reflective     67.83%      92.70%  28.00%  82.80%
2     Few-Shot Bayesian           59.17%      28.09%  59.26%  90.15%
3     Evolutionary                52.51%      40.00%  25.53%  92.00%
4     MetaPrompt                  38.75%      25.00%  26.93%  64.31%
5     GEPA                        32.27%       6.55%  26.08%  64.17%
6     Baseline (no optimization)  11.85%       1.69%  24.06%   9.81%

These are directional numbers. Some optimizers use more LLM/tool calls per trial than others (e.g., Hierarchical Reflective batches multiple analyses), so cost and runtime are not apples-to-apples even when the trial budget matches. Re-run the suite with your own datasets, models, and cost constraints before committing to a single optimizer.

Run benchmarks locally

  1. Install dependencies (ideally in a virtualenv):
    $ pip install -r sdks/opik_optimizer/benchmarks/requirements.txt
  2. Configure provider keys (e.g., OPENAI_API_KEY).
  3. Execute the runner:
    $ python sdks/opik_optimizer/benchmarks/run_benchmark.py \
    >   --model openai/gpt-4o-mini \
    >   --output results.json
  4. Inspect the JSON or load it into a notebook to compare against the published table (a sketch follows this list).
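
Step 4 can be a few lines of Python. The snippet below is a sketch only: the layout of results.json depends on the runner version, so the "results", "optimizer", and "score" keys are assumptions to adapt to the file you actually get.

    # Sketch: summarize a local run for comparison with the published table.
    # The JSON layout (a top-level "results" list of per-trial records with
    # "optimizer" and "score" fields) is assumed -- inspect your results.json
    # and rename the keys to match.
    import json
    from collections import defaultdict

    with open("results.json") as f:
        records = json.load(f)["results"]

    scores = defaultdict(list)
    for record in records:
        scores[record["optimizer"]].append(record["score"])

    for optimizer, values in sorted(scores.items()):
        avg = sum(values) / len(values)
        print(f"{optimizer:<30} {avg:.2%} over {len(values)} trials")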

The script spins up datasets defined in sdks/opik_optimizer/benchmarks/config.py, runs each optimizer with consistent trial budgets, and logs runs to Opik so you can review traces.
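
If you only want to reproduce a single cell of the table, you can drive one optimizer directly with the opik_optimizer SDK instead of the full runner. Treat the sketch below as illustrative: the ChatPrompt and MetaPromptOptimizer names, the optimize_prompt(prompt=..., dataset=..., metric=..., n_samples=...) signature, the datasets.gsm8k() loader, and the "question"/"answer" dataset fields are assumptions based on the current SDK, so verify them against the SDK reference before running.

    # Illustrative sketch only -- class names, the optimize_prompt signature,
    # the datasets.gsm8k() loader, and the dataset item fields are assumptions;
    # check them against your installed opik_optimizer version.
    from opik.evaluation.metrics import LevenshteinRatio
    from opik_optimizer import ChatPrompt, MetaPromptOptimizer, datasets

    dataset = datasets.gsm8k()  # assumed loader for the GSM8K benchmark data

    def levenshtein(dataset_item, llm_output):
        # Compare the model output to the reference answer for this item.
        return LevenshteinRatio().score(
            output=llm_output, reference=dataset_item["answer"]
        )

    prompt = ChatPrompt(
        messages=[
            {"role": "system", "content": "Solve the math problem step by step."},
            {"role": "user", "content": "{question}"},
        ]
    )

    optimizer = MetaPromptOptimizer(model="openai/gpt-4o-mini")
    result = optimizer.optimize_prompt(
        prompt=prompt,
        dataset=dataset,
        metric=levenshtein,
        n_samples=50,  # small trial budget for a quick local check
    )
    result.display()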

Looking for production-style examples beyond synthetic benchmarks? Check out the agent optimization demos repo. It contains end-to-end scenarios (LangGraph, RAG, support bots) and shows how different optimizers behave in real workloads.

Next steps

  • Learn how each optimizer works in the Algorithms overview.
  • Customize the benchmark configs (datasets, metrics, budgets) to mirror your production workload.
  • Share results or contribute improvements via GitHub.