Evaluation Metrics

Metrics are a fundamental component of the Opik evaluation function. They provide quantitative assessments of your AI models’ outputs, enabling objective comparisons and performance tracking over time.

What Are Metrics?

In Opik, a metric is a function that calculates a score based on specific inputs, such as model outputs and reference answers. All metrics in Opik extend the BaseMetric abstract class, which provides the core functionality for validation and tracking.

abstract class BaseMetric<GenericZodObjectType> {
  public readonly name: string;
  public readonly trackMetric: boolean;
  public abstract readonly validationSchema: GenericZodObjectType;

  abstract score(
    input: Infer<GenericZodObjectType>
  ):
    | EvaluationScoreResult
    | EvaluationScoreResult[]
    | Promise<EvaluationScoreResult>
    | Promise<EvaluationScoreResult[]>;
}

How Metrics Calculate Scores

Each metric must implement the score method, which:

  1. Accepts an input object containing combined data from the task output, dataset item, and scoringKeyMapping (illustrated in the sketch after this list)
  2. Processes the inputs to produce a score
  3. Returns an EvaluationScoreResult or array of results, which includes:
    • name: The metric name
    • value: The numerical score (typically 0.0-1.0)
    • reason: A human-readable explanation for the score
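
For example, the combined input passed to score and the result it returns might look like the following sketch (the output/expected field names mirror the built-in metrics described below; the concrete values are illustrative only):

// Combined input: task output + dataset item fields + scoringKeyMapping renames
const input = {
  output: "The capital of France is Paris.", // produced by the evaluated task
  expected: "Paris", // taken from the dataset item
};

// Result returned by score(): an EvaluationScoreResult
const result = {
  name: "my_metric",
  value: 0.5,
  reason: "Output partially matches the expected answer",
};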

Types of Metrics

Opik supports two main types of metrics:

  1. Heuristic metrics: Simple rule-based evaluations (e.g., exact match, contains, regex match)
  2. LLM Judge metrics: AI-powered evaluations that use language models to assess output quality

Built-in Metrics

Opik provides several built-in metrics for common evaluation scenarios:

ExactMatch

Checks if the model output exactly matches the expected output:

const exactMatch = new ExactMatch();
// Usage requires both 'output' and 'expected' parameters

Contains

Checks if the model output contains specific text:

const contains = new Contains();
// Usage requires both 'output' and 'expected' parameters

RegexMatch

Checks if the model output matches a regular expression pattern:

const regexMatch = new RegexMatch();
// Usage requires 'output' and 'pattern' parameters

IsJson

Checks if the output is valid JSON:

const isJson = new IsJson();
// Usage requires 'output' parameter
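
For illustration, heuristic metrics can also be scored directly, outside of evaluate. The sketch below assumes these classes are exported from the opik package, like the LLM judge metrics shown later on this page:

import { Contains, ExactMatch } from "opik";

const exactMatch = new ExactMatch();
const contains = new Contains();

// Direct scoring with the parameters each metric requires
const exactResult = await exactMatch.score({
  output: "Paris",
  expected: "Paris",
});

const containsResult = await contains.score({
  output: "The capital of France is Paris.",
  expected: "Paris",
});

console.log(exactResult.value); // 1.0 for an exact match, 0.0 otherwise
console.log(containsResult.value); // 1.0 when the expected text is found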

Metric Configuration

Custom Naming and Tracking

Each metric can be configured with a custom name and tracking option:

// Create metric with custom name
const exactMatch = new ExactMatch("my_exact_match");

// Create metric with tracking disabled
const regexMatch = new RegexMatch("custom_regex", false);

Combining Multiple Metrics

You can use multiple metrics in a single evaluation:

const metrics = [new ExactMatch(), new Contains(), new RegexMatch()];

// In your evaluation configuration
await evaluate({
  dataset: myDataset,
  task: myTask,
  scoringMetrics: metrics,
});

Input Requirements

Validation Schema

Each metric defines a Zod validation schema that specifies required inputs:

// ExactMatch validation schema example
const validationSchema = z.object({
  output: z.string(), // The model output
  expected: z.string(), // The expected output
});

The validation system ensures all required parameters are present before executing the metric.
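
Conceptually, this validation step amounts to parsing the combined input with the metric's Zod schema before score runs. The following sketch is illustrative only and is not the SDK's actual implementation:

import z from "zod";

const validationSchema = z.object({
  output: z.string(),
  expected: z.string(),
});

// 'expected' is missing, so validation fails and the metric is not executed
const parsed = validationSchema.safeParse({ output: "Paris" });
if (!parsed.success) {
  console.error(parsed.error.issues.map((issue) => issue.path.join(".")));
}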

Mapping Inputs

You can map dataset fields and task outputs to metric inputs using scoringKeyMapping:

await evaluate({
  dataset: myDataset,
  task: myTask,
  scoringMetrics: [new ExactMatch()],
  scoringKeyMapping: {
    // Map dataset/task fields to metric parameter names
    output: "model.response",
    expected: "dataset.answer",
  },
});

Score Interpretation

Score Ranges

Most metrics in Opik return scores between 0.0 and 1.0:

  • 1.0: Perfect match or ideal performance
  • 0.0: No match or complete failure
  • Intermediate values: Partial matches or varying degrees of success

Creating Custom Metrics

Implementing Your Own Metric

To create a custom metric:

  1. Extend the BaseMetric class
  2. Define a validation schema using Zod
  3. Implement the score method

Here’s an example of a custom metric that checks if output length is within a specified range:

import z from "zod";
import { BaseMetric, EvaluationScoreResult } from "@opik/sdk";

// Define validation schema
const validationSchema = z.object({
  output: z.string(),
  minLength: z.number(),
  maxLength: z.number(),
});

// Infer TypeScript type from schema
type Input = z.infer<typeof validationSchema>;

export class LengthRangeMetric extends BaseMetric {
  public validationSchema = validationSchema;

  constructor(name = "length_range", trackMetric = true) {
    super(name, trackMetric);
  }

  async score(input: Input): Promise<EvaluationScoreResult> {
    const { output, minLength, maxLength } = input;
    const length = output.length;

    // Calculate score (1.0 if within range, 0.0 otherwise)
    const isWithinRange = length >= minLength && length <= maxLength;
    const score = isWithinRange ? 1.0 : 0.0;

    // Return result with explanation
    return {
      name: this.name,
      value: score,
      reason: isWithinRange
        ? `Output length (${length}) is within range ${minLength}-${maxLength}`
        : `Output length (${length}) is outside range ${minLength}-${maxLength}`,
    };
  }
}
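
Once defined, the metric can be scored directly or passed to evaluate like any built-in metric. The usage sketch below reuses myDataset and myTask from the earlier examples; when used with evaluate, the dataset items or scoringKeyMapping must supply minLength and maxLength:

const lengthMetric = new LengthRangeMetric();

// Direct scoring
const result = await lengthMetric.score({
  output: "Hello, world!",
  minLength: 5,
  maxLength: 50,
});

console.log(result.value); // 1.0
console.log(result.reason); // "Output length (13) is within range 5-50"

// As part of an evaluation
await evaluate({
  dataset: myDataset,
  task: myTask,
  scoringMetrics: [new LengthRangeMetric()],
});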

Validation Best Practices

When creating custom metrics:

  1. Define clear validation schemas:

    const validationSchema = z.object({
      output: z.string().min(1, "Output is required"),
      threshold: z.number().min(0).max(1),
    });
  2. Return meaningful reasons:

    return {
      name: this.name,
      value: score,
      reason: `Score ${score.toFixed(2)} because [detailed explanation]`,
    };
  3. Normalize scores to a consistent range (typically 0.0-1.0) for easier comparison with other metrics (see the sketch below)
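
For example, a raw value such as an edit distance or a similarity on a 0-100 scale can be clamped and rescaled before it is returned. This hypothetical helper is one way to do it:

// Hypothetical helper: map a raw value in [min, max] onto the 0.0-1.0 range
function normalizeScore(raw: number, min: number, max: number): number {
  if (max <= min) return 0;
  const clamped = Math.min(Math.max(raw, min), max);
  return (clamped - min) / (max - min);
}

const value = normalizeScore(42, 0, 100); // 0.42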

LLM Judge Metrics

LLM Judge metrics use language models to evaluate the quality of LLM outputs. These metrics provide more nuanced evaluation than simple heuristic checks.

AnswerRelevance

Evaluates how relevant the output is to the input question:

import { AnswerRelevance } from "opik";
import { openai } from "@ai-sdk/openai";

// Using default model (gpt-4o)
const metric = new AnswerRelevance();

// With custom model ID
const metricWithModel = new AnswerRelevance({
  model: "claude-3-5-sonnet-latest",
});

// With LanguageModel instance
const customModel = openai("gpt-4o");
const metricWithCustomModel = new AnswerRelevance({ model: customModel });

// Usage
const score = await metric.score({
  input: "What is the capital of France?",
  output: "The capital of France is Paris.",
  context: ["France is a country in Western Europe."], // Optional
});

console.log(score.value); // 0.0 to 1.0
console.log(score.reason); // Explanation of the score

Parameters

  • input (required): The question or prompt
  • output (required): The model’s response to evaluate
  • context (optional): Additional context for evaluation

Score Range

  • 1.0: Perfect relevance - output directly addresses the input
  • 0.5: Partial relevance - output is somewhat related but incomplete
  • 0.0: No relevance - output doesn’t address the input

Hallucination

Detects whether the output contains hallucinated or unfaithful information:

import { Hallucination } from "opik";

const metric = new Hallucination();

// Without context - checks against general knowledge
const score1 = await metric.score({
  input: "What is the capital of France?",
  output:
    "The capital of France is Paris. It is famous for its iconic Eiffel Tower.",
});

// With context - checks faithfulness to provided context
const score2 = await metric.score({
  input: "What is the capital of France?",
  output:
    "The capital of France is Paris. It is famous for its iconic Eiffel Tower.",
  context: [
    "France is a country in Western Europe. Its capital is Paris, which is known for landmarks like the Eiffel Tower.",
  ],
});

console.log(score2.value); // 1.0 = hallucination detected, 0.0 = no hallucination
console.log(score2.reason); // Array of reasons for the score

Parameters

  • input (required): The original question or prompt
  • output (required): The model’s response to evaluate
  • context (optional): Reference information to check against

Score Values

  • 0.0: No hallucination - output is faithful to context/facts
  • 1.0: Hallucination detected - output contains false or unsupported information

Moderation

Checks if the output contains harmful, inappropriate, or unsafe content:

import { Moderation } from "opik";

const metric = new Moderation();

const score = await metric.score({
  input: "Tell me about safety guidelines",
  output: "Here are some safety guidelines...",
});

console.log(score.value); // 1.0 = harmful content detected, 0.0 = safe
console.log(score.reason); // Explanation of moderation decision

Parameters

  • input (required): The original prompt
  • output (required): The model’s response to evaluate

Score Values

  • 0.0: Safe - no harmful content detected
  • 1.0: Harmful - inappropriate or unsafe content detected

Usefulness

Evaluates how useful the output is in addressing the input:

import { Usefulness } from "opik";

const metric = new Usefulness();

const score = await metric.score({
  input: "How do I reset my password?",
  output:
    "To reset your password, click 'Forgot Password' on the login page, enter your email, and follow the instructions sent to your inbox.",
});

console.log(score.value); // 0.0 to 1.0
console.log(score.reason); // Explanation of usefulness score

Parameters

  • input (required): The question or request
  • output (required): The model’s response to evaluate

Score Range

  • 1.0: Very useful - comprehensive and actionable
  • 0.5: Somewhat useful - partially helpful
  • 0.0: Not useful - doesn’t help address the input

Configuring LLM Judge Metrics

Model Configuration

All LLM Judge metrics accept a model parameter in their constructor:

import { openai } from "@ai-sdk/openai";
import { anthropic } from "@ai-sdk/anthropic";
import { google } from "@ai-sdk/google";
import { Hallucination } from "opik";

// Using model ID string
const metric1 = new Hallucination({ model: "gpt-4o" });
const metric2 = new Hallucination({ model: "claude-3-5-sonnet-latest" });
const metric3 = new Hallucination({ model: "gemini-2.0-flash" });

// Using LanguageModel instance
const customModel = openai("gpt-4o");
const metric4 = new Hallucination({ model: customModel });

Async Scoring

All LLM Judge metrics support asynchronous scoring:

import { AnswerRelevance } from "opik";

const metric = new AnswerRelevance();

// Async/await
const score = await metric.score({
  input: "What is TypeScript?",
  output: "TypeScript is a typed superset of JavaScript.",
});

// Promise chain
metric
  .score({
    input: "What is TypeScript?",
    output: "TypeScript is a typed superset of JavaScript.",
  })
  .then((score) => console.log(score.value));

Combining Multiple LLM Judge Metrics

Use multiple metrics together for comprehensive evaluation:

import { AnswerRelevance, Hallucination, Moderation, Usefulness } from "opik";
import { evaluate } from "opik";

await evaluate({
  dataset: myDataset,
  task: myTask,
  scoringMetrics: [
    new AnswerRelevance(),
    new Hallucination(),
    new Moderation(),
    new Usefulness(),
  ],
});

Custom Model for Each Metric

Different metrics can use different models:

import { openai } from "@ai-sdk/openai";
import { anthropic } from "@ai-sdk/anthropic";
import { AnswerRelevance, Hallucination } from "opik";

// Use GPT-4o for answer relevance
const relevanceMetric = new AnswerRelevance({
  model: openai("gpt-4o"),
});

// Use Claude for hallucination detection
const hallucinationMetric = new Hallucination({
  model: anthropic("claude-3-5-sonnet-latest"),
});

await evaluate({
  dataset: myDataset,
  task: myTask,
  scoringMetrics: [relevanceMetric, hallucinationMetric],
});

LLM Judge Metric Best Practices

1. Use Model ID Strings for Simplicity

For most use cases, use model ID strings directly:

import { Hallucination } from "opik";

const metric = new Hallucination({ model: "gpt-4o" });

The Opik SDK handles model configuration internally for optimal evaluation performance.

2. Provide Context When Available

Context improves evaluation accuracy:

// Better: With context
await metric.score({
  input: "What is the capital?",
  output: "The capital is Paris.",
  context: ["France is a country in Europe. Its capital is Paris."],
});

// OK: Without context (relies on general knowledge)
await metric.score({
  input: "What is the capital of France?",
  output: "The capital is Paris.",
});

3. Choose Appropriate Models

Match model capabilities to metric requirements:

// Complex reasoning: Use GPT-4o or Claude Sonnet
const complexMetric = new AnswerRelevance({ model: "gpt-4o" });

// Simple checks: Use faster, cheaper models
const simpleMetric = new Moderation({ model: "gpt-4o-mini" });

4. Handle Errors Gracefully

LLM calls can fail, so handle errors appropriately:

try {
  const score = await metric.score({
    input: "What is TypeScript?",
    output: "TypeScript is a typed superset of JavaScript.",
  });
  console.log(score);
} catch (error) {
  console.error("Metric evaluation failed:", error);
  // Implement fallback or retry logic
}

5. Batch Evaluations

Use the evaluate function for efficient batch processing:

// More efficient for multiple items
await evaluate({
  dataset: myDataset,
  task: myTask,
  scoringMetrics: [new Hallucination()],
  scoringWorkers: 5, // Parallel scoring
});

// Less efficient for batch processing
for (const item of datasetItems) {
  await metric.score(item); // Sequential scoring
}

Score Interpretation

Understanding LLM Judge Scores

LLM Judge metrics return a structured result with the following shape:

interface EvaluationScoreResult {
  name: string; // Metric name
  value: number; // Numerical score (0.0-1.0 typically)
  reason: string | string[]; // Explanation for the score
}

Example Score Results

// AnswerRelevance
{
  name: "answer_relevance",
  value: 0.95,
  reason: "The answer directly addresses the question with accurate information"
}

// Hallucination
{
  name: "hallucination",
  value: 0.0,
  reason: ["All information is supported by the context", "No contradictions found"]
}

// Moderation
{
  name: "moderation",
  value: 0.0,
  reason: "Content is safe and appropriate"
}

Generation Parameters

Configuring Temperature, Seed, and MaxTokens

All LLM Judge metrics support generation parameters in their constructor:

import { Hallucination, AnswerRelevance } from "opik";

// Configure generation parameters
const metric = new Hallucination({
  model: "gpt-4o",
  temperature: 0.3, // Lower = more deterministic
  seed: 42, // For reproducible outputs
  maxTokens: 1000, // Maximum response length
});

// Different settings for different metrics
const relevanceMetric = new AnswerRelevance({
  model: "claude-3-5-sonnet-latest",
  temperature: 0.7, // Higher = more creative
  seed: 12345,
});

// Use the metrics
const score = await metric.score({
  input: "What is the capital of France?",
  output: "The capital of France is Paris.",
  context: ["France is a country in Western Europe."],
});

Advanced Model Settings

For provider-specific advanced parameters, use modelSettings:

import { Hallucination } from "opik";

const metric = new Hallucination({
  model: "gpt-4o",
  temperature: 0.5,
  modelSettings: {
    topP: 0.9, // Nucleus sampling
    topK: 50, // Top-K sampling
    presencePenalty: 0.1, // Reduce repetition
    frequencyPenalty: 0.2, // Reduce phrase repetition
    stopSequences: ["END"], // Custom stop sequences
  },
});

For provider-specific options not exposed through modelSettings, use LanguageModel instances:

import { openai } from "@ai-sdk/openai";
import { anthropic } from "@ai-sdk/anthropic";
import { google } from "@ai-sdk/google";
import { Hallucination } from "opik";

// OpenAI with structured outputs
const openaiModel = openai("gpt-4o", {
  structuredOutputs: true,
});

// Anthropic with cache control
const anthropicModel = anthropic("claude-3-5-sonnet-latest", {
  cacheControl: true,
});

// Google Gemini with specific configuration
const googleModel = google("gemini-2.0-flash");

const metric1 = new Hallucination({ model: openaiModel });
const metric2 = new Hallucination({ model: anthropicModel });
const metric3 = new Hallucination({ model: googleModel });

See the Vercel AI SDK provider documentation for provider-specific options.
