Metrics drive optimizer decisions. This guide highlights the fastest way to pick proven presets from Opik’s evaluation catalog, then shows how to extend them when your use case demands it. If you need the full theory, see Evaluation concepts and the metrics overview.
A metric is a callable with the signature (dataset_item, llm_output) -> ScoreResult | float. Use ScoreResult to attach names and reasons.
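As a minimal sketch of that signature, the metric below scores an exact match against a reference answer. The ScoreResult class here is a local stand-in so the snippet runs without Opik installed; in a real project you would use Opik's own ScoreResult. The "answer" field name and the exact_match function are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class ScoreResult:
    """Local stand-in with the fields this guide mentions (name, value, reason)."""

    name: str
    value: float
    reason: Optional[str] = None


def exact_match(dataset_item: Dict[str, Any], llm_output: str) -> ScoreResult:
    """Return 1.0 when the output matches the reference answer, else 0.0."""
    # "answer" is an assumed dataset field name; adapt it to your dataset schema.
    expected = str(dataset_item["answer"]).strip().lower()
    actual = llm_output.strip().lower()
    matched = expected == actual
    return ScoreResult(
        name="exact_match",
        value=1.0 if matched else 0.0,
        # The reason explains the score so failures can be inspected later.
        reason="output matches reference"
        if matched
        else f"expected {expected!r}, got {actual!r}",
    )


result = exact_match({"answer": "Paris"}, " paris ")
print(result.value, result.reason)
```

Returning a bare float would also satisfy the signature; the ScoreResult form is useful when you want the name and reason to travel with the score.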
Use MultiMetricObjective to balance multiple goals (accuracy, style, safety).
Weights do not need to sum to 1; choose numbers that highlight the most critical metric to your use case.
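To make the weighting concrete, here is a plain-Python sketch of a weighted sum over several metrics. It illustrates the idea only and is not Opik's implementation of MultiMetricObjective; weighted_objective and the two toy metrics (accuracy, brevity) are hypothetical names, and the "answer" field is an assumed dataset key.

```python
from typing import Any, Callable, Dict, Sequence

Metric = Callable[[Dict[str, Any], str], float]


def weighted_objective(metrics: Sequence[Metric], weights: Sequence[float]) -> Metric:
    """Combine metrics into one callable that returns their weighted sum."""
    if len(metrics) != len(weights):
        raise ValueError("metrics and weights must have the same length")

    def objective(dataset_item: Dict[str, Any], llm_output: str) -> float:
        return sum(w * m(dataset_item, llm_output) for m, w in zip(metrics, weights))

    return objective


def accuracy(dataset_item: Dict[str, Any], llm_output: str) -> float:
    # Toy check: does the reference answer appear anywhere in the output?
    return 1.0 if dataset_item["answer"].lower() in llm_output.lower() else 0.0


def brevity(dataset_item: Dict[str, Any], llm_output: str) -> float:
    # Toy style check: reward outputs of at most 20 words.
    return 1.0 if len(llm_output.split()) <= 20 else 0.0


# Weights need not sum to 1; accuracy simply counts three times as much as brevity.
objective = weighted_objective([accuracy, brevity], weights=[3.0, 1.0])
score = objective({"answer": "Paris"}, "The capital of France is Paris.")
print(score)  # 4.0
```

Because only the relative sizes of the weights matter for ranking candidates, 3.0 and 1.0 rank prompts the same way as 0.75 and 0.25.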
You can optimize for efficiency alongside quality by including span-based metrics like cost and duration in your composite objective. These metrics require access to the task_span parameter:
For a working end-to-end example in the repository, see: multi_metric_cost_duration_example.py
Span-based metrics like SpanCost and SpanDuration automatically receive the task_span parameter during evaluation, which contains execution information about the agent’s run. When using raw (non-normalized) cost or duration values, use negative weights in MultiMetricObjective to minimize them. When using target-normalized metrics (target=), use positive weights because those scores are mapped to a “higher is better” scale.
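The sign convention for raw values can be illustrated with made-up numbers. The sketch below assumes the composite is a plain weighted sum; the accuracy, cost, and duration figures are invented for illustration, and composite is a hypothetical helper rather than an Opik function.

```python
from typing import Dict


def composite(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of named scores; negative weights penalize larger values."""
    return sum(weights[name] * value for name, value in scores.items())


# Raw cost and duration are "lower is better", so they get negative weights.
weights = {"accuracy": 1.0, "cost_usd": -10.0, "duration_s": -0.01}

# Two runs with identical accuracy but different efficiency (invented values).
cheap_run = {"accuracy": 0.80, "cost_usd": 0.002, "duration_s": 1.5}
pricey_run = {"accuracy": 0.80, "cost_usd": 0.020, "duration_s": 6.0}

cheap = composite(cheap_run, weights)
pricey = composite(pricey_run, weights)
print(cheap > pricey)  # True
```

With positive weights on the same raw values, the more expensive run would score higher, which is the mistake the negative weights guard against.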
Direction control is explicit:
invert=True (default) for efficiency metrics where lower raw values should score higher.
invert=False if your objective should reward higher raw values.
When target is omitted, these metrics return raw values (not normalized scores).
Reuse these heuristics before writing custom metrics; most are already available in opik.evaluation.metrics.
Opik Optimizer also ships built-in metric helpers for common optimization setups:
Example with built-ins:
Include a reason so reflective optimizers can group failure modes.
Add details in the ScoreResult if you want to visualize per-sample attributes later.
When you outgrow presets, move to Custom metrics for LLM-as-a-judge flows or domain-specific scoring.
Use optimizer.task_evaluator.evaluate_prompt to evaluate a single prompt with your metric.
ScoreResult