Evaluation Metrics
Metrics are a fundamental component of the Opik evaluation function. They provide quantitative assessments of your AI models’ outputs, enabling objective comparisons and performance tracking over time.
What Are Metrics?
In Opik, a metric is a function that calculates a score based on specific inputs, such as model outputs and reference answers. All metrics in Opik extend the BaseMetric abstract class, which provides the core functionality for validation and tracking.
How Metrics Calculate Scores
Each metric must implement the score method, which:
- Accepts an input object containing combined data from the task output, the dataset item, and scoringKeyMapping
- Processes the inputs to produce a score
- Returns an EvaluationScoreResult or an array of results, which includes:
  - name: The metric name
  - value: The numerical score (typically 0.0-1.0)
  - reason: A human-readable explanation for the score
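Concretely, a single result has the shape described above. The following sketch assumes the EvaluationScoreResult type is exported from the opik package; the values shown are purely illustrative:

```typescript
import type { EvaluationScoreResult } from "opik";

// Illustrative result object; field meanings follow the list above.
const result: EvaluationScoreResult = {
  name: "exact_match",
  value: 1.0,
  reason: "Output matched the expected answer exactly",
};
```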
Types of Metrics
Opik supports different types of metrics:
- Heuristic metrics: Simple rule-based evaluations (e.g., exact match, contains, regex match)
- Model-based metrics: Evaluations powered by AI models (coming soon)
Built-in Metrics
Opik provides several built-in metrics for common evaluation scenarios:
ExactMatch
Checks if the model output exactly matches the expected output:
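A minimal sketch of calling ExactMatch directly, assuming the metric is exported from the opik package and that score accepts an object with output and expected fields:

```typescript
import { ExactMatch } from "opik";

const exactMatch = new ExactMatch();

// Scores 1.0 only when the output is identical to the expected value.
const result = await exactMatch.score({
  output: "Paris",
  expected: "Paris",
});
console.log(result.value); // 1.0
```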
Contains
Checks if the model output contains specific text:
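A similar sketch for Contains, under the same assumptions about the input field names:

```typescript
import { Contains } from "opik";

const contains = new Contains();

// Scores 1.0 when the expected substring appears anywhere in the output.
const result = await contains.score({
  output: "The capital of France is Paris.",
  expected: "Paris",
});
console.log(result.value); // 1.0
```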
RegexMatch
Checks if the model output matches a regular expression pattern:
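A sketch for RegexMatch; here the regular expression is assumed to be supplied as an input field named pattern, though the exact parameter name may differ:

```typescript
import { RegexMatch } from "opik";

const regexMatch = new RegexMatch();

// Scores 1.0 when the output matches the regular expression.
const result = await regexMatch.score({
  output: "Order #12345 confirmed",
  pattern: "#\\d+",
});
console.log(result.value); // 1.0
```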
IsJson
Checks if the output is valid JSON:
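A sketch for IsJson, assuming it only needs the output field:

```typescript
import { IsJson } from "opik";

const isJson = new IsJson();

// Scores 1.0 when the output parses as valid JSON.
const result = await isJson.score({
  output: '{"status": "ok"}',
});
console.log(result.value); // 1.0
```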
Metric Configuration
Custom Naming and Tracking
Each metric can be configured with a custom name and tracking option:
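A sketch, assuming the metric constructors accept a custom name followed by a boolean that enables or disables tracking:

```typescript
import { ExactMatch } from "opik";

// Custom name, with tracking disabled (constructor arguments assumed).
const customExactMatch = new ExactMatch("my_exact_match", false);
```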
Combining Multiple Metrics
You can use multiple metrics in a single evaluation:
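A sketch of passing several metric instances to one evaluation run; the evaluate entry point and the scoringMetrics option name are assumed here, and the dataset and task are placeholders for your own setup:

```typescript
import { evaluate, ExactMatch, Contains } from "opik";

// Placeholders: supplied by your own evaluation setup.
declare const dataset: any;
declare const llmTask: (item: Record<string, unknown>) => Promise<{ output: string }>;

// Both metrics are scored against every dataset item in a single run.
await evaluate({
  dataset,
  task: llmTask,
  scoringMetrics: [new ExactMatch(), new Contains()],
});
```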
Input Requirements
Validation Schema
Each metric defines a Zod validation schema that specifies required inputs:
The validation system ensures all required parameters are present before executing the metric.
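For instance, a metric that compares an output against an expected value could declare its required inputs like this (a sketch using Zod, as described above):

```typescript
import { z } from "zod";

// Required inputs for a metric that compares output against an expected value.
const validationSchema = z.object({
  output: z.string(),
  expected: z.string(),
});

// Input type derived from the schema, usable in the score method signature.
type Input = z.infer<typeof validationSchema>;
```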
Mapping Inputs
You can map dataset fields and task outputs to metric inputs using scoringKeyMapping:
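A sketch, assuming scoringKeyMapping maps a metric's expected input key to the corresponding field name in the dataset item or task output, and that evaluate accepts the option names shown:

```typescript
import { evaluate, ExactMatch } from "opik";

// Placeholders: supplied by your own evaluation setup.
declare const dataset: any;
declare const llmTask: (item: Record<string, unknown>) => Promise<{ output: string }>;

await evaluate({
  dataset,
  task: llmTask,
  scoringMetrics: [new ExactMatch()],
  // The metric expects `expected`, but the dataset stores it as `answer`.
  scoringKeyMapping: { expected: "answer" },
});
```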
Score Interpretation
Score Ranges
Most metrics in Opik return scores between 0.0 and 1.0:
- 1.0: Perfect match or ideal performance
- 0.0: No match or complete failure
- Intermediate values: Partial matches or varying degrees of success
Creating Custom Metrics
Implementing Your Own Metric
To create a custom metric:
- Extend the BaseMetric class
- Define a validation schema using Zod
- Implement the score method
Here’s an example of a custom metric that checks if output length is within a specified range:
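The following is a sketch of such a metric. It assumes BaseMetric's constructor takes a name and a tracking flag, that the validation schema is exposed as a public validationSchema property, and that EvaluationScoreResult is exported from the opik package:

```typescript
import { BaseMetric, EvaluationScoreResult } from "opik";
import { z } from "zod";

// Inputs this metric requires before it can produce a score.
const validationSchema = z.object({
  output: z.string(),
  minLength: z.number(),
  maxLength: z.number(),
});

type Input = z.infer<typeof validationSchema>;

export class LengthRangeMetric extends BaseMetric {
  public validationSchema = validationSchema;

  constructor(name = "length_range", trackMetric = true) {
    super(name, trackMetric);
  }

  async score(input: Input): Promise<EvaluationScoreResult> {
    const { output, minLength, maxLength } = input;
    const length = output.length;
    const withinRange = length >= minLength && length <= maxLength;

    return {
      name: this.name,
      value: withinRange ? 1.0 : 0.0,
      reason: withinRange
        ? `Output length (${length}) is within the range ${minLength}-${maxLength}`
        : `Output length (${length}) is outside the range ${minLength}-${maxLength}`,
    };
  }
}
```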
Validation Best Practices
When creating custom metrics:
- Define clear validation schemas so that missing or malformed inputs are caught before the score method runs (see the sketch after this list)
- Return meaningful reasons that explain how the score was calculated
- Normalize scores to a consistent range (typically 0.0-1.0) for easier comparison with other metrics
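As one way to apply the first point, Zod schemas can carry field descriptions so that validation failures are self-explanatory (an illustrative sketch, not Opik-specific):

```typescript
import { z } from "zod";

// Descriptive schemas make it obvious which input is missing or malformed.
const validationSchema = z.object({
  output: z.string().describe("Model output to be scored"),
  expected: z.string().describe("Reference answer from the dataset item"),
});
```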