Optimizer benchmarks
We regularly evaluate every optimizer against shared datasets so you can make informed trade-offs. This page summarizes the latest results and explains how to reproduce them with the public benchmark scripts.
Datasets & metrics
Each run uses Opik datasets backed by open-source corpora commonly used in academia.
Results shown below use gpt-4o-mini for evaluation. Scores will change if you select different models, metrics, or prompt seeds.
Latest results
These are directional numbers. Some optimizers use more LLM/tool calls per trial than others (e.g., Hierarchical Reflective batches multiple analyses), so cost and runtime are not apples-to-apples even when the trial budget matches. Re-run the suite with your own datasets, models, and cost constraints before committing to a single optimizer.
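To make that concrete, a quick way to sanity-check a comparison is to look at score per unit of cost rather than raw score alone. The snippet below is a toy sketch: the optimizer names, scores, and costs are invented for illustration and are not benchmark output.

```python
# Toy numbers, invented for illustration only -- not benchmark output.
runs = [
    {"optimizer": "optimizer_a", "score": 0.71, "llm_cost_usd": 0.40},
    {"optimizer": "optimizer_b", "score": 0.76, "llm_cost_usd": 1.90},
]

# A higher raw score is not automatically the better trade-off once cost differs.
for run in runs:
    score_per_dollar = run["score"] / run["llm_cost_usd"]
    print(
        f"{run['optimizer']:<12} score={run['score']:.2f} "
        f"cost=${run['llm_cost_usd']:.2f} score/$={score_per_dollar:.2f}"
    )
```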
Run benchmarks locally
- Install the benchmark dependencies (ideally in a virtualenv).
- Configure provider keys (e.g., OPENAI_API_KEY).
- Execute the benchmark runner.
- Inspect the resulting JSON, or load it into a notebook to compare against the published table (see the sketch after this list).
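For the last step, the results file can be loaded like any other JSON. The sketch below assumes an output file named benchmark_results.json whose records carry optimizer, dataset, and score keys; those names are illustrative assumptions, so adjust them to match what your run actually produces.

```python
import json
from collections import defaultdict
from pathlib import Path

# Assumption: the runner wrote its results to benchmark_results.json and each
# record carries optimizer, dataset, and score keys. Rename these to match the
# file your run actually produces.
records = json.loads(Path("benchmark_results.json").read_text())

# Group scores by (optimizer, dataset) and print the mean for each pair.
scores = defaultdict(list)
for record in records:
    scores[(record["optimizer"], record["dataset"])].append(record["score"])

for (optimizer, dataset), values in sorted(scores.items()):
    mean_score = sum(values) / len(values)
    print(f"{optimizer:<35} {dataset:<25} mean score: {mean_score:.3f}")
```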
The script spins up datasets defined in sdks/opik_optimizer/benchmarks/config.py, runs each optimizer with consistent trial budgets, and logs runs to Opik so you can review traces.
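The exact structures live in config.py; the sketch below is purely illustrative of the kinds of knobs you would mirror for your own workload. The class and field names here are invented for the example and are not the module's real API.

```python
from dataclasses import dataclass, field

# Purely illustrative: BenchmarkConfig and its fields are invented for this
# sketch and are not the real structures in config.py. They only show the
# kinds of settings you would adapt when mirroring your production workload.
@dataclass
class BenchmarkConfig:
    datasets: list[str] = field(default_factory=lambda: ["my-support-tickets"])
    evaluation_model: str = "gpt-4o-mini"  # model used to score candidate prompts
    metrics: list[str] = field(default_factory=lambda: ["answer_relevance"])
    trials_per_optimizer: int = 10  # keep the trial budget equal across optimizers


# Example: point the suite at your own dataset and give each optimizer more trials.
config = BenchmarkConfig(datasets=["my-production-queries"], trials_per_optimizer=20)
print(config)
```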
Looking for production-style examples beyond synthetic benchmarks? Check out the agent optimizations demos repo. It contains end-to-end scenarios (LangGraph, RAG, support bots) and shows how different optimizers behave in real workloads.
Next steps
- Learn how each optimizer works in the Algorithms overview.
- Customize the benchmark configs (datasets, metrics, budgets) to mirror your production workload.
- Share results or contribute improvements via GitHub.