Define metrics
Metrics drive optimizer decisions. This guide highlights the fastest way to pick proven presets from Opik’s evaluation catalog, then shows how to extend them when your use case demands it. If you need the full theory, see Evaluation concepts and the metrics overview.
Metric anatomy
A metric is a callable with the signature `(dataset_item, llm_output) -> ScoreResult | float`. Use `ScoreResult` to attach names and reasons.
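A minimal sketch of a hand-rolled metric following that signature. It assumes `ScoreResult` is importable from `opik.evaluation.metrics.score_result` and that your dataset items expose an `expected_output` column; adapt both to your setup.

```python
from opik.evaluation.metrics.score_result import ScoreResult


def exact_match(dataset_item: dict, llm_output: str) -> ScoreResult:
    # "expected_output" is an assumed dataset column; rename to match your schema.
    expected = str(dataset_item.get("expected_output", "")).strip().lower()
    actual = llm_output.strip().lower()
    matched = expected == actual
    return ScoreResult(
        name="exact_match",
        value=1.0 if matched else 0.0,
        # A per-sample reason lets reflective optimizers group failure modes.
        reason="Output matches the expected answer."
        if matched
        else f"Expected '{expected}', got '{actual}'.",
    )
```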
Compose metrics
Use `MultiMetricObjective` to balance multiple goals (accuracy, style, safety).
Weights do not need to sum to 1; choose weights that emphasize the metrics most critical to your use case.
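A hedged sketch of combining metrics, assuming `MultiMetricObjective` is exposed by the `opik_optimizer` package and accepts parallel lists of metric callables and weights; check your installed version for the exact constructor.

```python
from opik_optimizer import MultiMetricObjective  # import path assumed

# accuracy_metric, style_metric, and safety_metric are placeholder callables
# with the (dataset_item, llm_output) signature described above.
objective = MultiMetricObjective(
    metrics=[accuracy_metric, style_metric, safety_metric],
    # Weights need not sum to 1; larger values pull the optimizer
    # harder toward that metric.
    weights=[1.0, 0.3, 0.5],
)
```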
Recommended presets
Reuse these preset heuristics before writing custom metrics; most are already available in `opik.evaluation.metrics`.
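For example, a catalog heuristic such as `LevenshteinRatio` can be wrapped in the optimizer's metric signature. The wrapper name and the `expected_output` column below are illustrative assumptions.

```python
from opik.evaluation.metrics import LevenshteinRatio

_levenshtein = LevenshteinRatio()


def levenshtein_metric(dataset_item: dict, llm_output: str):
    # score(output=..., reference=...) returns a ScoreResult with the ratio as its value.
    return _levenshtein.score(
        output=llm_output,
        reference=dataset_item["expected_output"],  # assumed column name
    )
```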
Checklist for great metrics
- Return explanations – populate `reason` so reflective optimizers can group failure modes.
- Avoid randomness – deterministic metrics keep optimizers from chasing noise.
- Bound runtime – use cached references or lightweight models where possible; heavy metrics slow down trials.
- Log metadata – include details in the `ScoreResult` if you want to visualize per-sample attributes later (see the sketch after this list).
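As an illustration of the last checklist item, the sketch below attaches per-sample attributes. It assumes `ScoreResult` accepts a `metadata` dictionary in your Opik version and that dataset items carry an `expected_output` column.

```python
from opik.evaluation.metrics.score_result import ScoreResult


def match_with_metadata(dataset_item: dict, llm_output: str) -> ScoreResult:
    expected = str(dataset_item.get("expected_output", ""))  # assumed column
    matched = expected.strip() == llm_output.strip()
    return ScoreResult(
        name="match_with_metadata",
        value=1.0 if matched else 0.0,
        reason="Exact match." if matched else "Output differs from the reference.",
        # Assumed: metadata surfaces later as per-sample attributes in the dashboard.
        metadata={"output_chars": len(llm_output), "expected_chars": len(expected)},
    )
```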
When you outgrow presets, move to Custom metrics for LLM-as-a-judge flows or domain-specific scoring.
Testing metrics
- Dry-run against a handful of dataset rows before launching an optimization (see the sketch after this list).
- Use `optimizer.task_evaluator.evaluate_prompt` to evaluate a single prompt with your metric.
- Inspect the per-sample reasons in the Opik dashboard to ensure they match expectations.
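A dry run can be as simple as calling the metric directly on a few rows. The sketch below reuses the hypothetical `exact_match` metric from the Metric anatomy section; the rows, outputs, and column names are illustrative assumptions.

```python
# Hand-written rows and candidate outputs; in practice pull a few items
# from your Opik dataset instead.
sample_rows = [
    {"question": "What is the capital of France?", "expected_output": "Paris"},
    {"question": "What is 2 + 2?", "expected_output": "4"},
]
candidate_outputs = ["Paris", "5"]

for row, output in zip(sample_rows, candidate_outputs):
    result = exact_match(row, output)  # metric sketched in "Metric anatomy"
    print(f"{result.name}: {result.value} ({result.reason})")
```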
Related resources
- Deep dive: Multi-metric optimization guide
- API reference: `ScoreResult`
- Advanced topic: Custom metrics