Define Evaluation Metrics

From Subjective Assessment to Quantifiable Metrics

This video explores Opik's metrics system, which turns subjective LLM assessment into quantifiable measurements. You'll see the different types of automated scoring methods available, walk through practical examples using the Answer Relevance and Levenshtein metrics, and learn how to create custom metrics when the built-in ones fall short. The video also covers cost considerations and best practices for combining multiple metrics to capture different dimensions of quality.
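To make the heuristic category concrete, here is a minimal standalone sketch of what a Levenshtein-ratio score computes: 1.0 for identical strings, falling toward 0.0 as edits accumulate. This is only an illustration of the idea; in practice you would use Opik's built-in metric from opik.evaluation.metrics rather than hand-rolling it.

```python
def levenshtein_distance(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance (insert/delete/substitute).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def levenshtein_ratio(output: str, reference: str) -> float:
    # Normalized similarity: 1.0 = identical, 0.0 = completely different.
    if not output and not reference:
        return 1.0
    dist = levenshtein_distance(output, reference)
    return 1.0 - dist / max(len(output), len(reference))
```

For example, `levenshtein_ratio("kitten", "sitting")` needs 3 edits over a 7-character reference, giving a score of about 0.571.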

Key Highlights

  • Comprehensive Metric Types: Choose from heuristic metrics (exact match, contains, regex, JSON validation), hallucination detection, and LLM-as-a-judge approaches like G-Eval
  • Easy Implementation: Import metrics directly from opik.evaluation.metrics and instantiate the classes; demonstrated with Answer Relevance and Levenshtein ratio
  • Custom Metric Development: Create your own metrics by extending the base metric class from the Opik repository when the built-in options don't meet your needs
  • UI Integration: View metrics in the trace overview by scrolling right or opening the feedback scores section, with the ability to manually add or remove scores
  • Manual Feedback Definition: Create custom feedback definitions in the Configuration section for human-applied metrics such as pass/fail classifications
  • Cost-Aware Evaluation: Consider trade-offs between evaluation speed, depth, and cost, especially when using expensive thinking models for LLM-as-a-judge approaches
  • Multi-Dimensional Assessment: Combine multiple metrics (e.g., factual accuracy + helpfulness) to get a complete picture of quality rather than relying on a single metric
  • Filtering Capabilities: Use feedback scores to filter traces and identify patterns in model performance across different quality dimensions
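The custom-metric pattern mentioned above follows a simple shape: a metric class with a score method that returns a named score result. Below is a hedged, standalone sketch of that shape; the ScoreResult stand-in and the ContainsAllKeywords metric are hypothetical names for illustration. With the real SDK, the metric class would subclass Opik's base metric class and return its score-result type instead.

```python
from dataclasses import dataclass


@dataclass
class ScoreResult:
    # Stand-in for the score-result object a real Opik metric returns.
    name: str
    value: float
    reason: str = ""


class ContainsAllKeywords:
    """Hypothetical custom heuristic metric: the fraction of required
    keywords that appear in the model output (case-insensitive).
    With the real SDK, this would extend Opik's base metric class."""

    def __init__(self, name: str = "contains_all_keywords"):
        self.name = name

    def score(self, output: str, keywords: list[str], **ignored) -> ScoreResult:
        text = output.lower()
        hits = [kw for kw in keywords if kw.lower() in text]
        return ScoreResult(
            name=self.name,
            value=len(hits) / len(keywords) if keywords else 1.0,
            reason=f"matched {len(hits)}/{len(keywords)} keywords",
        )


# Example: 2 of the 3 required keywords appear in the output.
result = ContainsAllKeywords().score(
    output="Opik logs traces and feedback scores.",
    keywords=["traces", "feedback", "datasets"],
)
```

Returning a named result like this is what lets scores from several metrics sit side by side on a trace, supporting the multi-dimensional assessment and filtering workflows described above.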