Define metrics

Metrics drive optimizer decisions. This guide highlights the fastest way to pick proven presets from Opik’s evaluation catalog, then shows how to extend them when your use case demands it. If you need the full theory, see Evaluation concepts and the metrics overview.

Metric anatomy

A metric is a callable with the signature (dataset_item, llm_output) -> ScoreResult | float. Use ScoreResult to attach names and reasons.

from opik.evaluation.metrics.score_result import ScoreResult

def short_answer(item, output):
    is_short = len(output) <= 200
    return ScoreResult(
        name="short_answer",
        value=1.0 if is_short else 0.0,
        reason="Answer under 200 chars" if is_short else "Answer too long"
    )
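The signature also allows a plain float return, which can be enough when no name or reason is needed. The following is a minimal sketch of that variant; the function name `contains_answer` and the sample row are hypothetical, and the `answer` key is assumed to match your dataset.

```python
def contains_answer(dataset_item, llm_output):
    """Return 1.0 when the reference answer appears in the output, else 0.0."""
    # Assumes the dataset row stores the reference under the "answer" key.
    reference = dataset_item["answer"].strip().lower()
    return 1.0 if reference in llm_output.lower() else 0.0


# Hypothetical dataset row, for illustration only.
item = {"question": "What is the capital of France?", "answer": "Paris"}

print(contains_answer(item, "The capital is Paris."))
print(contains_answer(item, "I am not sure."))
```

A float return carries no explanation, so prefer ScoreResult when you want per-sample reasons to appear alongside the score.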

Compose metrics

Use MultiMetricObjective to balance multiple goals (accuracy, style, safety).

from opik_optimizer import MultiMetricObjective
from opik.evaluation.metrics import LevenshteinRatio, AnswerRelevance

objective = MultiMetricObjective(
    weights=[0.6, 0.4],
    metrics=[
        lambda item, output: LevenshteinRatio().score(reference=item["answer"], output=output),
        lambda item, output: AnswerRelevance().score(
            context=[item["answer"]], output=output, input=item["question"]
        ),
    ],
    name="accuracy_and_relevance",
)

Weights do not need to sum to 1; choose numbers that highlight the most critical metric to your use case.
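To see why the weights need not sum to 1, it can help to work through the arithmetic by hand. The sketch below assumes the composite score is a plain weighted sum of the individual metric values; that is an assumption about how the objective combines scores, not a confirmed description of the library internals. The helper `weighted_sum` and the candidate scores are hypothetical.

```python
def weighted_sum(scores, weights):
    """Combine per-metric scores as sum(weight * score)."""
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    return sum(w * s for w, s in zip(weights, scores))


# Hypothetical (accuracy, relevance) scores for two candidate prompts.
candidate_a = [0.9, 0.5]
candidate_b = [0.6, 0.9]

# Scaling every weight by the same factor changes the totals,
# but not which candidate scores higher.
for weights in ([0.6, 0.4], [6, 4]):
    a = weighted_sum(candidate_a, weights)
    b = weighted_sum(candidate_b, weights)
    print(weights, "prefers", "A" if a > b else "B")
```

Under this assumption only the ratio between weights matters for ranking candidates, so pick values that express how much more one metric matters than another.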

Include cost and duration metrics

You can optimize for efficiency alongside quality by including span-based metrics like cost and duration in your composite objective. These metrics require access to the task_span parameter:

from opik_optimizer import MultiMetricObjective
from opik.evaluation.metrics import AnswerRelevance
from opik_optimizer.metrics import SpanCost, SpanDuration

# Regular metric without task_span
def answer_relevance(dataset_item, llm_output):
    metric = AnswerRelevance()
    return metric.score(
        context=[dataset_item["answer"]],
        output=llm_output,
        input=dataset_item["question"]
    )

# Built-in span metrics can be normalized with target= for clean multi-metric weighting.
# invert=True (default) means lower raw value -> higher score.
cost = SpanCost(target=0.01, invert=True, name="cost_score")
duration = SpanDuration(target=6.0, invert=True, name="duration_score")

# Combine quality, cost, and speed metrics on a common [0, 1] scale
objective = MultiMetricObjective(
    metrics=[answer_relevance, cost, duration],
    weights=[0.33, 0.33, 0.33],  # equally optimize for accuracy, cost and duration/latency
    name="quality_cost_speed",
)

For a working end-to-end example in the repository, see: multi_metric_cost_duration_example.py

Span-based metrics like SpanCost and SpanDuration automatically receive the task_span parameter during evaluation, which contains execution information about the agent’s run. When using raw (non-normalized) cost or duration values, use negative weights in MultiMetricObjective to minimize them. When using target-normalized metrics (target=), use positive weights because those scores are mapped to a “higher is better” scale.
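The effect of negative weights on raw values can be checked with simple arithmetic. This sketch again assumes the composite is a plain weighted sum, which is an assumption rather than a confirmed detail of MultiMetricObjective. All numbers, the weights, and the helper `weighted_sum` are hypothetical.

```python
def weighted_sum(scores, weights):
    """Combine per-metric values as sum(weight * value)."""
    return sum(w * s for w, s in zip(weights, scores))


# Hypothetical runs: (quality score, raw cost in USD, raw duration in seconds).
cheap_fast = [0.80, 0.002, 2.0]
pricey_slow = [0.85, 0.020, 9.0]

# Positive weight rewards quality; negative weights penalize raw cost and duration.
# The magnitudes also compensate for the different units of each raw value.
weights = [1.0, -5.0, -0.02]

print("cheap_fast:", weighted_sum(cheap_fast, weights))
print("pricey_slow:", weighted_sum(pricey_slow, weights))
```

With these illustrative numbers the cheaper, faster run ends up with the higher composite even though its quality score is slightly lower. Raw values need weights tuned to their units, which is the main reason target-normalized scores are easier to combine.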

Direction control is explicit:

  • invert=True (default) for efficiency metrics where lower raw values should score higher.
  • invert=False if your objective should reward higher raw values.

When target is omitted, these metrics return raw values (not normalized scores).

| Scenario | Metric | Notes |
| --- | --- | --- |
| Factual QA | LevenshteinRatio or ExactMatch | Works with text-only datasets; deterministic and low cost. |
| Retrieval / grounding | AnswerRelevance | Pass reference context via context=[item["answer"]] or retrieved docs. |
| Safety | Moderation or custom LLM-as-a-judge | Combine with MultiMetricObjective to gate unsafe answers. |
| Multi-turn trajectories | Agent trajectory evaluator | Scores complete conversations, not just final outputs. |

Reuse these presets before writing custom metrics; most are available from opik.evaluation.metrics.

Optimizer built-in metrics

Opik Optimizer also ships built-in metric helpers for common optimization setups:

| Metric | Import | When to use |
| --- | --- | --- |
| LevenshteinAccuracyMetric | from opik_optimizer.metrics import LevenshteinAccuracyMetric | Quick string-similarity accuracy using dataset keys like answer or highlights. |
| SpanCost | from opik_optimizer.metrics import SpanCost | Cost metric with target= normalization and invert= direction control. |
| SpanDuration | from opik_optimizer.metrics import SpanDuration | Duration metric with target= normalization and invert= direction control. |

Example with built-ins:

from opik_optimizer import MultiMetricObjective
from opik_optimizer.metrics import LevenshteinAccuracyMetric, SpanCost, SpanDuration

accuracy = LevenshteinAccuracyMetric(reference_key="answer")
cost = SpanCost(target=0.01, invert=True, name="cost_score")
duration = SpanDuration(target=6.0, invert=True, name="duration_score")

objective = MultiMetricObjective(
    metrics=[accuracy, cost, duration],
    weights=[0.5, 0.25, 0.25],  # all metrics already normalized to [0, 1]
    name="accuracy_cost_duration",
)

Checklist for great metrics

  • Return explanations – populate reason so reflective optimizers can group failure modes.
  • Avoid randomness – deterministic metrics keep optimizers from chasing noise.
  • Bound runtime – use cached references or lightweight models where possible; heavy metrics slow down trials.
  • Log metadata – include details in the ScoreResult if you want to visualize per-sample attributes later.
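Two of the checklist items, determinism and bounded runtime, can be illustrated without any library calls. The sketch below is one possible approach, not a prescribed pattern: it caches the normalized reference so repeated trials over the same dataset rows do not redo the work. The names `_normalize`, `normalized_reference`, and `normalized_match` are hypothetical, and the `answer` key is assumed.

```python
from functools import lru_cache


def _normalize(text):
    """Lowercase and collapse runs of whitespace."""
    return " ".join(text.lower().split())


@lru_cache(maxsize=4096)
def normalized_reference(reference):
    """Cache normalized references, since the same rows recur across trials."""
    return _normalize(reference)


def normalized_match(dataset_item, llm_output):
    """Deterministic exact match after normalization; returns 1.0 or 0.0."""
    reference = normalized_reference(dataset_item["answer"])
    return 1.0 if _normalize(llm_output) == reference else 0.0


# Hypothetical row, for illustration only.
item = {"answer": "New  York City"}
print(normalized_match(item, "new york city"))
print(normalized_match(item, "Boston"))
```

The same inputs always produce the same score here, so any change between trials reflects the prompt rather than the metric.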

When you outgrow presets, move to Custom metrics for LLM-as-a-judge flows or domain-specific scoring.

Testing metrics

  1. Dry-run against a handful of dataset rows before launching an optimization.
  2. Use optimizer.task_evaluator.evaluate_prompt to evaluate a single prompt with your metric.
  3. Inspect the per-sample reasons in the Opik dashboard to ensure they match expectations.
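The first step, a dry run over a few rows, can be sketched as a small loop. This example is an illustration under stated assumptions: `dry_run`, `short_answer_float`, the sample rows, and the candidate outputs are all hypothetical, and the loop accepts either a float or any object with a `value` attribute such as a ScoreResult.

```python
def short_answer_float(item, output):
    """Hypothetical stand-in metric: 1.0 for outputs of at most 200 characters."""
    return 1.0 if len(output) <= 200 else 0.0


def dry_run(metric, rows, outputs):
    """Call the metric on each (row, output) pair and collect numeric scores."""
    values = []
    for row, output in zip(rows, outputs):
        score = metric(row, output)
        # Accept a plain float or an object exposing a .value attribute.
        values.append(float(getattr(score, "value", score)))
    return values


# Hypothetical rows and model outputs, for illustration only.
sample_rows = [
    {"question": "What is 2 + 2?", "answer": "4"},
    {"question": "Name a primary color.", "answer": "red"},
]
candidate_outputs = ["4", "x" * 250]

scores = dry_run(short_answer_float, sample_rows, candidate_outputs)
print(scores)

# Sanity check: scores should stay inside the range the metric promises.
assert all(0.0 <= s <= 1.0 for s in scores)
```

Checking a handful of rows this way surfaces missing dataset keys and out-of-range scores before any optimization budget is spent.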