Custom metrics | Opik Documentation

Use custom metrics when built-in metrics are not enough (domain-specific scoring, precise safety checks, unique multimodal checks). Start with the core Opik evaluation docs so you know what already exists:

Evaluation concepts – terminology and lifecycle.
Metrics overview – built-in metrics, both heuristic (ROUGE, BLEU) and LLM-judged (Hallucination, etc.).
LLM-as-a-judge patterns – how Opik runs judge models against multi-turn traces.

Design principles

Deterministic – cache external model calls. Where supported by the model, set temperature to 0 and a seed value to increase the likelihood of repeated runs matching. Note that not all models guarantee deterministic outputs even with these settings.
Explainable – always set reason on ScoreResult so dashboards show why a score was given, not just its value.
Composable – wrap helpers into utility modules so multiple optimizers share them.
Layered – start with single metrics, then combine them via MultiMetricObjective when you need trade-offs.
Cost – estimate the compute and API spend per evaluation run when a metric relies on model calls or other paid services.
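
The Deterministic and Explainable principles can be sketched in a few lines. This is a minimal illustration, not Opik's API: Score is a stand-in for ScoreResult so the snippet runs without Opik installed, and call_judge is a hypothetical placeholder for a real model client. The in-memory cache only helps within one process; a persistent cache would be needed across runs.

```python
from dataclasses import dataclass
from functools import lru_cache


@dataclass(frozen=True)
class Score:
    """Stand-in for Opik's ScoreResult (name, value, reason)."""
    name: str
    value: float
    reason: str


# Counts how often the (hypothetical) judge is actually invoked.
CALLS = {"count": 0}


def call_judge(text: str) -> float:
    """Hypothetical judge call. Replace with a real model client."""
    CALLS["count"] += 1
    return 1.0 if "refund" in text.lower() else 0.0


@lru_cache(maxsize=4096)
def cached_judge(text: str) -> float:
    # Identical outputs hit the cache, so repeated runs return the
    # same score and do not pay for the same call twice.
    return call_judge(text)


def mentions_refund(item: dict, output: str) -> Score:
    value = cached_judge(output)
    # Always attach a reason so the score can be explained later.
    reason = f"judge returned {value:.1f} for an output of {len(output)} characters"
    return Score(name="mentions_refund", value=value, reason=reason)
```

Scoring the same output twice invokes call_judge only once, which also keeps the Cost principle in check.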

Example: safety + completeness metric

from opik.evaluation.metrics import AnswerRelevance
from opik.evaluation.metrics.score_result import ScoreResult
from some_safety_model import classify_risk  # placeholder for your own safety classifier

# Create the clients once so every metric call reuses them.
safety_model = classify_risk.Client()
relevance_metric = AnswerRelevance()

def safety_and_completeness(item, output):
    # Judge how relevant the output is to the question, using the reference answer as context.
    relevance = relevance_metric.score(
        input=item["question"], output=output, context=[item["answer"]]
    )
    safety = safety_model.score(text=output)

    # Pass only when the output is both relevant enough and labeled safe.
    value = 1.0 if relevance.value > 0.75 and safety["label"] == "safe" else 0.0
    reason = f"Relevant={relevance.value:.2f}, safety={safety['label']}"

    return ScoreResult(name="safety_completeness", value=value, reason=reason)

Metric building blocks

Single metrics – implement one callable per concern (accuracy, tone, cost). Keep them reusable across prompts.
Multi-metric objectives – combine single metrics with weights when you need to balance, e.g., accuracy (0.7) + style (0.3). See Multi-metric optimization for templates.
LLM-as-a-judge – call out to an evaluation model (OpenAI, Anthropic, etc.) inside the metric. Always include detailed prompts so results stay stable, and understand that reflective optimizers will inherit any noise from these judge calls.
Heuristics – leverage built-ins from /evaluation/metrics instead of reinventing classic scores. You can compose heuristics with custom logic as shown above.
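
To make the weighting idea concrete, here is a plain-Python sketch of a weighted objective. It does not use MultiMetricObjective and makes no claim about how that class is implemented; weighted_objective, accuracy, and brevity are hypothetical names, and brevity is only a crude stand-in for a style metric.

```python
from typing import Callable, Sequence

# A metric takes a dataset item and a model output and returns a score in [0, 1].
Metric = Callable[[dict, str], float]


def accuracy(item: dict, output: str) -> float:
    """1.0 when the reference answer appears in the output, else 0.0."""
    return 1.0 if item["answer"].strip().lower() in output.lower() else 0.0


def brevity(item: dict, output: str) -> float:
    """Crude style proxy: 1.0 up to 50 words, falling linearly to 0.0 at 100 words."""
    words = len(output.split())
    return max(0.0, min(1.0, (100 - words) / 50))


def weighted_objective(metrics: Sequence[Metric], weights: Sequence[float]) -> Metric:
    """Combine metrics into one callable that returns their weighted average."""
    if len(metrics) != len(weights):
        raise ValueError("metrics and weights must have the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")

    def objective(item: dict, output: str) -> float:
        return sum(w * m(item, output) for m, w in zip(metrics, weights)) / total

    return objective


# Accuracy weighted 0.7, style (brevity here) weighted 0.3.
objective = weighted_objective([accuracy, brevity], [0.7, 0.3])
```

Normalizing by the weight total keeps the combined score in the same 0 to 1 range as the individual metrics, so scores stay comparable when weights change.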

Testing

Write pytest cases that feed canned dataset items into the metric and assert expected scores.
Run metrics against a golden dataset on CI to catch regressions.
For multi-metric objectives, add tests that verify weight changes behave as expected (e.g., higher weight increases sensitivity).
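
The checks above might look like the following pytest-style sketch. The metric under test (exact_match), the canned items, and the blend helper are all hypothetical; in a real suite you would import your own metric and load the golden dataset from a file. The functions use plain assert statements, so pytest collects them without extra imports.

```python
def exact_match(item: dict, output: str) -> float:
    """Hypothetical metric under test: case-insensitive exact match."""
    return 1.0 if output.strip().lower() == item["answer"].strip().lower() else 0.0


# Canned dataset items paired with a model output and the expected score.
GOLDEN_CASES = [
    ({"question": "Capital of France?", "answer": "Paris"}, "Paris", 1.0),
    ({"question": "Capital of France?", "answer": "Paris"}, "  paris ", 1.0),
    ({"question": "Capital of France?", "answer": "Paris"}, "Lyon", 0.0),
]


def test_exact_match_on_golden_cases():
    for item, output, expected in GOLDEN_CASES:
        assert exact_match(item, output) == expected, (item, output)


def test_scores_stay_in_unit_interval():
    for item, output, _ in GOLDEN_CASES:
        assert 0.0 <= exact_match(item, output) <= 1.0


def blend(accuracy: float, style: float, accuracy_weight: float) -> float:
    """Two-metric weighted sum used to test weight behavior."""
    return accuracy_weight * accuracy + (1.0 - accuracy_weight) * style


def test_higher_accuracy_weight_increases_sensitivity():
    # Measure how far the combined score drops when accuracy goes from 1.0 to 0.0.
    drop_low = blend(1.0, 1.0, 0.5) - blend(0.0, 1.0, 0.5)
    drop_high = blend(1.0, 1.0, 0.9) - blend(0.0, 1.0, 0.9)
    assert drop_high > drop_low
```

Running the same file in CI against a versioned golden dataset turns any change in metric behavior into a failing test.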
