Define metrics

Metrics drive optimizer decisions. This guide highlights the fastest way to pick proven presets from Opik’s evaluation catalog, then shows how to extend them when your use case demands it. If you need the full theory, see Evaluation concepts and the metrics overview.

Metric anatomy

A metric is a callable with the signature (dataset_item, llm_output) -> ScoreResult | float. Use ScoreResult to attach names and reasons.

from opik.evaluation.metrics.score_result import ScoreResult

def short_answer(item, output):
    is_short = len(output) <= 200
    return ScoreResult(
        name="short_answer",
        value=1.0 if is_short else 0.0,
        reason="Answer under 200 chars" if is_short else "Answer too long",
    )

Compose metrics

Use MultiMetricObjective to balance multiple goals (accuracy, style, safety).

from opik_optimizer import MultiMetricObjective
from opik.evaluation.metrics import LevenshteinRatio, AnswerRelevance

objective = MultiMetricObjective(
    weights=[0.6, 0.4],
    metrics=[
        lambda item, output: LevenshteinRatio().score(reference=item["answer"], output=output),
        lambda item, output: AnswerRelevance().score(
            context=[item["answer"]], output=output, input=item["question"]
        ),
    ],
    name="accuracy_and_relevance",
)

Weights do not need to sum to 1; choose values that emphasize the metrics most critical to your use case. Use negative weights to minimize a metric instead of maximizing it.
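
To build intuition for how the weights interact, here is a minimal sketch that assumes MultiMetricObjective combines per-metric scores as a weighted sum; the weights and per-sample scores below are hypothetical.

# Hypothetical example: three per-metric scores for a single dataset item.
weights = [0.6, 0.4, -0.2]   # the negative weight penalizes the third metric
scores = [0.9, 0.7, 0.5]

# Weighted sum: 0.6*0.9 + 0.4*0.7 - 0.2*0.5 = 0.72
composite = sum(w * s for w, s in zip(weights, scores))
print(composite)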

Include cost and duration metrics

You can optimize for efficiency alongside quality by including span-based metrics like cost and duration in your composite objective. These metrics require access to the task_span parameter:

from opik_optimizer import MultiMetricObjective
from opik.evaluation.metrics import AnswerRelevance
from opik_optimizer.metrics import TotalSpanCost, SpanDuration

# Metric that needs task_span
def cost_in_cents(dataset_item, llm_output, task_span):
    cost_metric = TotalSpanCost()
    result = cost_metric.score(task_span=task_span)
    return result.value * 100  # Convert to cents

# Metric that needs task_span
def duration_seconds(dataset_item, llm_output, task_span):
    duration_metric = SpanDuration()
    result = duration_metric.score(task_span=task_span)
    return result.value

# Regular metric without task_span
def answer_relevance(dataset_item, llm_output):
    metric = AnswerRelevance()
    return metric.score(
        context=[dataset_item["answer"]],
        output=llm_output,
        input=dataset_item["question"],
    )

# Combine quality, cost, and speed metrics.
# Use negative weights to minimize cost and duration.
objective = MultiMetricObjective(
    metrics=[answer_relevance, cost_in_cents, duration_seconds],
    # Maximize quality; minimize cost and duration. The cost weight is larger in
    # magnitude because per-call cost values (in cents) are small.
    weights=[1.0, -5.0, -0.3],
    name="quality_cost_speed",
)

Span-based metrics like TotalSpanCost and SpanDuration automatically receive the task_span parameter during evaluation, which contains execution information about the agent’s run. Use negative weights to minimize metrics (cost, duration) rather than maximize them.

LLM task cost and duration are not normalized values, so we recommend adjusting their weights based on your baseline measurements and on how much influence you want them to have.
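
One way to make those weights easier to reason about is to normalize cost and duration against baselines you measure yourself on a small dry run. The sketch below builds on cost_in_cents and duration_seconds from the example above; the baseline numbers are hypothetical.

# Hypothetical baselines measured on a small dry run of your task.
BASELINE_COST_CENTS = 0.8   # average cost per call, in cents
BASELINE_DURATION_S = 2.5   # average latency per call, in seconds

def relative_cost(dataset_item, llm_output, task_span):
    # Reuses cost_in_cents from the example above; ~1.0 means "at baseline".
    return cost_in_cents(dataset_item, llm_output, task_span) / BASELINE_COST_CENTS

def relative_duration(dataset_item, llm_output, task_span):
    # Reuses duration_seconds from the example above.
    return duration_seconds(dataset_item, llm_output, task_span) / BASELINE_DURATION_S

With scores close to 1.0 at baseline, weights such as [1.0, -0.2, -0.2] become directly comparable across metrics.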

| Scenario | Metric | Notes |
| --- | --- | --- |
| Factual QA | LevenshteinRatio or ExactMatch | Works with text-only datasets; deterministic and low cost. |
| Retrieval / grounding | AnswerRelevance | Pass reference context via context=[item["answer"]] or retrieved docs. |
| Safety | Moderation or custom LLM-as-a-judge | Combine with MultiMetricObjective to gate unsafe answers. |
| Multi-turn trajectories | Agent trajectory evaluator | Scores complete conversations, not just final outputs. |

Reuse these heuristics before writing custom metrics; most are available directly from opik.evaluation.metrics.
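
As an example, a thin wrapper around one of these presets is usually enough for factual QA; the "answer" key below is an assumption about your dataset schema.

from opik.evaluation.metrics import LevenshteinRatio

def factual_overlap(dataset_item, llm_output):
    # Deterministic, low-cost string similarity against the reference answer.
    return LevenshteinRatio().score(reference=dataset_item["answer"], output=llm_output)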

Checklist for great metrics

  • Return explanations – populate reason so reflective optimizers can group failure modes.
  • Avoid randomness – deterministic metrics keep optimizers from chasing noise.
  • Bound runtime – use cached references or lightweight models where possible; heavy metrics slow down trials.
  • Log metadata – include details in the ScoreResult if you want to visualize per-sample attributes later (see the sketch after this list).
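
Here is a minimal sketch that ties the checklist together: a deterministic metric with an explanatory reason and per-sample metadata. It assumes your dataset rows carry an "answer" field and that your version of ScoreResult accepts a metadata dict; drop that argument if it does not.

from opik.evaluation.metrics.score_result import ScoreResult

def grounded_answer(dataset_item, llm_output):
    answer_tokens = set(dataset_item["answer"].lower().split())
    output_tokens = set(llm_output.lower().split())
    overlap = len(answer_tokens & output_tokens)
    passed = overlap >= 3
    return ScoreResult(
        name="grounded_answer",
        value=1.0 if passed else 0.0,
        # A concrete reason helps reflective optimizers group failure modes.
        reason=f"{overlap} token(s) overlap with the reference answer",
        # Assumed to be supported; omit if your ScoreResult version lacks it.
        metadata={"overlapping_tokens": overlap},
    )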

When you outgrow presets, move to Custom metrics for LLM-as-a-judge flows or domain-specific scoring.

Testing metrics

  1. Dry-run against a handful of dataset rows before launching an optimization (see the sketch after this list).
  2. Use optimizer.task_evaluator.evaluate_prompt to evaluate a single prompt with your metric.
  3. Inspect the per-sample reasons in the Opik dashboard to ensure they match expectations.
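
For step 1, a dry run only takes a few lines. The rows and outputs below are hypothetical stand-ins for your dataset and task, and the sketch reuses short_answer from the Metric anatomy example.

# Hypothetical rows and outputs standing in for your dataset and task.
sample_rows = [
    {"question": "What is the capital of France?", "answer": "Paris"},
    {"question": "Who wrote Dune?", "answer": "Frank Herbert"},
]
sample_outputs = ["Paris", "Dune was written by Frank Herbert and published in 1965."]

for item, output in zip(sample_rows, sample_outputs):
    result = short_answer(item, output)  # metric from the Metric anatomy example
    print(result.name, result.value, result.reason)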