Custom metrics
Use custom metrics when built-in metrics are not enough (domain-specific scoring, precise safety checks, unique multimodal checks). Start with the core Opik evaluation docs so you know what already exists:
- Evaluation concepts – terminology and lifecycle.
- Metrics overview – default heuristic metrics (ROUGE, BLEU, Hallucination, etc.).
- LLM-as-a-judge patterns – how Opik runs judge models against multi-turn traces.
Design principles
- Deterministic – cache external model calls. Where supported by the model, set temperature to 0 and a seed value to increase the likelihood of repeated runs matching. Note that not all models guarantee deterministic outputs even with these settings.
- Explainable – always set
reasononScoreResultfor better dashboards. - Composable – wrap helpers into utility modules so multiple optimizers share them.
- Layered – start with single metrics, then combine them via
MultiMetricObjectivewhen you need trade-offs. - Cost - consider the cost implications if you rely on compute and API calls for evaluations.
Example: safety + completeness metric
Metric building blocks
- Single metrics – implement one callable per concern (accuracy, tone, cost). Keep them reusable across prompts.
- Multi-metric objectives – combine single metrics with weights when you need to balance, e.g., accuracy (0.7) + style (0.3). See Multi-metric optimization for templates.
- LLM-as-a-judge – call out to an evaluation model (OpenAI, Anthropic, etc.) inside the metric. Always include detailed prompts so results stay stable, and understand that reflective optimizers will inherit any noise from these judge calls.
- Heuristics – leverage built-ins from
/evaluation/metricsinstead of reinventing classic scores. You can compose heuristics with custom logic as shown above.
Testing
- Write pytest cases that feed canned dataset items into the metric and assert expected scores.
- Run metrics against a golden dataset on CI to catch regressions.
- For multi-metric objectives, add tests that verify weight changes behave as expected (e.g., higher weight increases sensitivity).