Evaluation Metrics

Metrics are a fundamental component of Opik's evaluation framework. They provide quantitative assessments of your AI models' outputs, enabling objective comparisons and performance tracking over time.

What Are Metrics?

In Opik, a metric is a function that calculates a score based on specific inputs, such as model outputs and reference answers. All metrics in Opik extend the BaseMetric abstract class, which provides the core functionality for validation and tracking.

abstract class BaseMetric<GenericZodObjectType> {
  public readonly name: string;
  public readonly trackMetric: boolean;
  public abstract readonly validationSchema: GenericZodObjectType;

  abstract score(
    input: Infer<GenericZodObjectType>
  ):
    | EvaluationScoreResult
    | EvaluationScoreResult[]
    | Promise<EvaluationScoreResult>
    | Promise<EvaluationScoreResult[]>;
}

How Metrics Calculate Scores

Each metric must implement the score method, which:

  1. Accepts an input object containing combined data from the task output, dataset item, and scoringKeyMapping
  2. Processes the inputs to produce a score
  3. Returns an EvaluationScoreResult or array of results, which includes:
    • name: The metric name
    • value: The numerical score (typically 0.0-1.0)
    • reason: A human-readable explanation for the score
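
For example, a result returned by an exact-match check might look like this (the values are illustrative):

const result: EvaluationScoreResult = {
  name: "exact_match",
  value: 1.0,
  reason: "Output matches the expected answer exactly",
};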

Types of Metrics

Opik supports different types of metrics:

  1. Heuristic metrics: Simple rule-based evaluations (e.g., exact match, contains, regex match)
  2. Model-based metrics: Evaluations powered by AI models (coming soon)

Built-in Metrics

Opik provides several built-in metrics for common evaluation scenarios:

ExactMatch

Checks if the model output exactly matches the expected output:

const exactMatch = new ExactMatch();
// Usage requires both 'output' and 'expected' parameters
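
As a quick sketch, the metric can be called directly with those parameters (in a real run, the evaluation framework supplies them):

const result = await exactMatch.score({
  output: "Paris",
  expected: "Paris",
});
// result.value is 1.0 for an exact match, 0.0 otherwise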

Contains

Checks if the model output contains specific text:

const contains = new Contains();
// Usage requires both 'output' and 'expected' parameters
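
For instance, a substring check passes where an exact match would fail (hypothetical values):

const result = await contains.score({
  output: "The capital of France is Paris.",
  expected: "Paris",
});
// result.value is 1.0 because the expected text appears in the output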

RegexMatch

Checks if the model output matches a regular expression pattern:

const regexMatch = new RegexMatch();
// Usage requires 'output' and 'pattern' parameters
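
As a sketch, assuming the pattern is supplied as a regular-expression string:

const result = await regexMatch.score({
  output: "Order #12345 confirmed",
  pattern: "#\\d{5}",
});
// result.value is 1.0 because the output matches the pattern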

IsJson

Checks if the output is valid JSON:

const isJson = new IsJson();
// Usage requires 'output' parameter
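
For example, with an illustrative input:

const result = await isJson.score({
  output: '{"city": "Paris"}',
});
// result.value is 1.0 because the output parses as valid JSON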

Metric Configuration

Custom Naming and Tracking

Each metric can be configured with a custom name and tracking option:

// Create metric with custom name
const exactMatch = new ExactMatch("my_exact_match");

// Create metric with tracking disabled
const regexMatch = new RegexMatch("custom_regex", false);

Combining Multiple Metrics

You can use multiple metrics in a single evaluation:

const metrics = [new ExactMatch(), new Contains(), new RegexMatch()];

// In your evaluation configuration
await evaluate({
  dataset: myDataset,
  task: myTask,
  scoringMetrics: metrics,
});

Input Requirements

Validation Schema

Each metric defines a Zod validation schema that specifies required inputs:

// ExactMatch validation schema example
const validationSchema = z.object({
  output: z.string(), // The model output
  expected: z.string(), // The expected output
});

The validation system ensures all required parameters are present before executing the metric.
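
Conceptually, this is the same check you could run yourself with Zod; the sketch below uses plain Zod rather than the SDK's internal machinery:

const parsed = validationSchema.safeParse({ output: "Paris" });
if (!parsed.success) {
  // 'expected' is missing, so the metric would not be executed
  console.log(parsed.error.issues);
}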

Mapping Inputs

You can map dataset fields and task outputs to metric inputs using scoringKeyMapping:

await evaluate({
  dataset: myDataset,
  task: myTask,
  scoringMetrics: [new ExactMatch()],
  scoringKeyMapping: {
    // Map dataset/task fields to metric parameter names
    output: "model.response",
    expected: "dataset.answer",
  },
});

Score Interpretation

Score Ranges

Most metrics in Opik return scores between 0.0 and 1.0:

  • 1.0: Perfect match or ideal performance
  • 0.0: No match or complete failure
  • Intermediate values: Partial matches or varying degrees of success
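
Metrics that award partial credit produce such intermediate values; for example, a hypothetical token-overlap heuristic (not a built-in Opik metric) might compute its score like this:

// Fraction of expected tokens found in the output
const expectedTokens = ["paris", "france"];
const output = "paris is lovely";
const hits = expectedTokens.filter((token) => output.includes(token)).length;
const score = hits / expectedTokens.length; // 0.5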

Creating Custom Metrics

Implementing Your Own Metric

To create a custom metric:

  1. Extend the BaseMetric class
  2. Define a validation schema using Zod
  3. Implement the score method

Here’s an example of a custom metric that checks if output length is within a specified range:

import z from "zod";
import { BaseMetric, EvaluationScoreResult } from "@opik/sdk";

// Define validation schema
const validationSchema = z.object({
  output: z.string(),
  minLength: z.number(),
  maxLength: z.number(),
});

// Infer TypeScript type from schema
type Input = z.infer<typeof validationSchema>;

export class LengthRangeMetric extends BaseMetric<typeof validationSchema> {
  public validationSchema = validationSchema;

  constructor(name = "length_range", trackMetric = true) {
    super(name, trackMetric);
  }

  async score(input: Input): Promise<EvaluationScoreResult> {
    const { output, minLength, maxLength } = input;
    const length = output.length;

    // Calculate score (1.0 if within range, 0.0 otherwise)
    const isWithinRange = length >= minLength && length <= maxLength;
    const score = isWithinRange ? 1.0 : 0.0;

    // Return result with explanation
    return {
      name: this.name,
      value: score,
      reason: isWithinRange
        ? `Output length (${length}) is within range ${minLength}-${maxLength}`
        : `Output length (${length}) is outside range ${minLength}-${maxLength}`,
    };
  }
}
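
Once defined, the metric can be exercised directly; in a full evaluation, minLength and maxLength would typically come from dataset items or scoringKeyMapping:

const lengthMetric = new LengthRangeMetric();
const result = await lengthMetric.score({
  output: "Hello, world!",
  minLength: 5,
  maxLength: 50,
});
console.log(result.value); // 1.0
console.log(result.reason); // Output length (13) is within range 5-50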

Validation Best Practices

When creating custom metrics:

  1. Define clear validation schemas:

    const validationSchema = z.object({
      output: z.string().min(1, "Output is required"),
      threshold: z.number().min(0).max(1),
    });
  2. Return meaningful reasons:

    return {
      name: this.name,
      value: score,
      reason: `Score ${score.toFixed(2)} because [detailed explanation]`,
    };
  3. Normalize scores to a consistent range (typically 0.0-1.0) for easier comparison with other metrics
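
    For example, a raw quantity such as an edit distance can be mapped into the 0.0-1.0 range before being returned (a hypothetical normalization; rawDistance and maxLength stand in for values your metric computes):

    // Clamp a raw edit distance into the 0.0-1.0 range
    const rawDistance = 3; // e.g., Levenshtein distance between output and expected
    const maxLength = 20; // length of the longer string
    const score = Math.max(0, 1 - rawDistance / maxLength); // 0.85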