Define Evaluation Metrics

From Subjective Assessment to Quantifiable Metrics

This video explores Opik's metrics system, which turns subjective LLM assessment into quantifiable measurements. You'll see the different types of automated scoring methods available, walk through practical examples using the Answer Relevance and Levenshtein metrics, and learn how to create custom metrics when the built-in ones fall short. The video also covers cost considerations and best practices for combining multiple metrics to capture different dimensions of quality.
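To make the heuristic category concrete, here is a minimal standalone sketch of what a Levenshtein-ratio score computes: 1.0 for identical strings, falling toward 0.0 as edits accumulate. This is only an illustration of the idea; in practice you would use Opik's built-in metric from opik.evaluation.metrics rather than hand-rolling it.

```python
def levenshtein_distance(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance (insert/delete/substitute).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def levenshtein_ratio(output: str, reference: str) -> float:
    # Normalized similarity: 1.0 = identical, 0.0 = completely different.
    if not output and not reference:
        return 1.0
    dist = levenshtein_distance(output, reference)
    return 1.0 - dist / max(len(output), len(reference))
```

For example, `levenshtein_ratio("kitten", "sitting")` needs 3 edits over a 7-character reference, giving a score of about 0.571.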

Key Highlights

  • Comprehensive Metric Types: Choose from heuristic metrics (exact match, contains, regex, JSON validation), hallucination detection, and LLM-as-a-judge approaches like G-Eval
  • Easy Implementation: Import metrics directly from opik.evaluation.metrics and instantiate the classes; demonstrated with Answer Relevance and Levenshtein ratio
  • Custom Metric Development: Create your own metrics by extending the base metric class from the Opik repository when the built-in options don't meet your needs
  • UI Integration: View metrics in the trace overview by scrolling right or opening the feedback scores section, with the ability to manually add or remove scores
  • Manual Feedback Definition: Create custom feedback definitions in the Configuration section for human-applied metrics such as pass/fail classifications
  • Cost-Aware Evaluation: Consider trade-offs between evaluation speed, depth, and cost, especially when using expensive thinking models for LLM-as-a-judge approaches
  • Multi-Dimensional Assessment: Combine multiple metrics (e.g., factual accuracy + helpfulness) to get a complete picture of quality rather than relying on a single metric
  • Filtering Capabilities: Use feedback scores to filter traces and identify patterns in model performance across different quality dimensions
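The custom-metric pattern mentioned above follows a simple shape: a metric class with a score method that returns a named score result. Below is a hedged, standalone sketch of that shape; the ScoreResult stand-in and the ContainsAllKeywords metric are hypothetical names for illustration. With the real SDK, the metric class would subclass Opik's base metric class and return its score-result type instead.

```python
from dataclasses import dataclass


@dataclass
class ScoreResult:
    # Stand-in for the score-result object a real Opik metric returns.
    name: str
    value: float
    reason: str = ""


class ContainsAllKeywords:
    """Hypothetical custom heuristic metric: the fraction of required
    keywords that appear in the model output (case-insensitive).
    With the real SDK, this would extend Opik's base metric class."""

    def __init__(self, name: str = "contains_all_keywords"):
        self.name = name

    def score(self, output: str, keywords: list[str], **ignored) -> ScoreResult:
        text = output.lower()
        hits = [kw for kw in keywords if kw.lower() in text]
        return ScoreResult(
            name=self.name,
            value=len(hits) / len(keywords) if keywords else 1.0,
            reason=f"matched {len(hits)}/{len(keywords)} keywords",
        )


# Example: 2 of the 3 required keywords appear in the output.
result = ContainsAllKeywords().score(
    output="Opik logs traces and feedback scores.",
    keywords=["traces", "feedback", "datasets"],
)
```

Returning a named result like this is what lets scores from several metrics sit side by side on a trace, supporting the multi-dimensional assessment and filtering workflows described above.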