Evaluation Metrics

Metrics are a fundamental component of the Opik evaluation function. They provide quantitative assessments of your AI models’ outputs, enabling objective comparisons and performance tracking over time.

What Are Metrics?

In Opik, a metric is a function that calculates a score based on specific inputs, such as model outputs and reference answers. All metrics in Opik extend the BaseMetric abstract class, which provides the core functionality for validation and tracking.

abstract class BaseMetric<GenericZodObjectType> {
  public readonly name: string;
  public readonly trackMetric: boolean;
  public abstract readonly validationSchema: GenericZodObjectType;

  abstract score(
    input: Infer<GenericZodObjectType>
  ):
    | EvaluationScoreResult
    | EvaluationScoreResult[]
    | Promise<EvaluationScoreResult>
    | Promise<EvaluationScoreResult[]>;
}

How Metrics Calculate Scores

Each metric must implement the score method, which:

  1. Accepts an input object containing combined data from the task output, dataset item, and scoringKeyMapping (illustrated in the sketch after this list)
  2. Processes the inputs to produce a score
  3. Returns an EvaluationScoreResult or array of results, which includes:
    • name: The metric name
    • value: The numerical score (typically 0.0-1.0)
    • reason: A human-readable explanation for the score
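
For example, the combined input passed to score and the result it returns might look like the following sketch (the output/expected field names mirror the built-in metrics described below; the concrete values are illustrative only):

// Combined input: task output + dataset item fields + scoringKeyMapping renames
const input = {
  output: "The capital of France is Paris.", // produced by the evaluated task
  expected: "Paris", // taken from the dataset item
};

// Result returned by score(): an EvaluationScoreResult
const result = {
  name: "my_metric",
  value: 0.5,
  reason: "Output partially matches the expected answer",
};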

Types of Metrics

Opik supports two main types of metrics:

  1. Heuristic metrics: Simple rule-based evaluations (e.g., exact match, contains, regex match)
  2. LLM Judge metrics: AI-powered evaluations that use language models to assess output quality

Built-in Metrics

Opik provides several built-in metrics for common evaluation scenarios:

ExactMatch

Checks if the model output exactly matches the expected output:

const exactMatch = new ExactMatch();
// Usage requires both 'output' and 'expected' parameters

Contains

Checks if the model output contains specific text:

const contains = new Contains();
// Usage requires both 'output' and 'expected' parameters

RegexMatch

Checks if the model output matches a regular expression pattern:

const regexMatch = new RegexMatch();
// Usage requires 'output' and 'pattern' parameters

IsJson

Checks if the output is valid JSON:

const isJson = new IsJson();
// Usage requires 'output' parameter
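
For illustration, heuristic metrics can also be scored directly, outside of evaluate. The sketch below assumes these classes are exported from the opik package, like the LLM judge metrics shown later on this page:

import { Contains, ExactMatch } from "opik";

const exactMatch = new ExactMatch();
const contains = new Contains();

// Direct scoring with the parameters each metric requires
const exactResult = await exactMatch.score({
  output: "Paris",
  expected: "Paris",
});

const containsResult = await contains.score({
  output: "The capital of France is Paris.",
  expected: "Paris",
});

console.log(exactResult.value); // 1.0 for an exact match, 0.0 otherwise
console.log(containsResult.value); // 1.0 when the expected text is found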

Metric Configuration

Custom Naming and Tracking

Each metric can be configured with a custom name and tracking option:

// Create metric with custom name
const exactMatch = new ExactMatch("my_exact_match");

// Create metric with tracking disabled
const regexMatch = new RegexMatch("custom_regex", false);

Combining Multiple Metrics

You can use multiple metrics in a single evaluation:

const metrics = [new ExactMatch(), new Contains(), new RegexMatch()];

// In your evaluation configuration
await evaluate({
  dataset: myDataset,
  task: myTask,
  scoringMetrics: metrics,
});

Input Requirements

Validation Schema

Each metric defines a Zod validation schema that specifies required inputs:

// ExactMatch validation schema example
const validationSchema = z.object({
  output: z.string(), // The model output
  expected: z.string(), // The expected output
});

The validation system ensures all required parameters are present before executing the metric.
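
Conceptually, this validation step amounts to parsing the combined input with the metric's Zod schema before score runs. The following sketch is illustrative only and is not the SDK's actual implementation:

import z from "zod";

const validationSchema = z.object({
  output: z.string(),
  expected: z.string(),
});

// 'expected' is missing, so validation fails and the metric is not executed
const parsed = validationSchema.safeParse({ output: "Paris" });
if (!parsed.success) {
  console.error(parsed.error.issues.map((issue) => issue.path.join(".")));
}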

Mapping Inputs

You can map dataset fields and task outputs to metric inputs using scoringKeyMapping:

await evaluate({
  dataset: myDataset,
  task: myTask,
  scoringMetrics: [new ExactMatch()],
  scoringKeyMapping: {
    // Map dataset/task fields to metric parameter names
    output: "model.response",
    expected: "dataset.answer",
  },
});

Score Interpretation

Score Ranges

Most metrics in Opik return scores between 0.0 and 1.0:

  • 1.0: Perfect match or ideal performance
  • 0.0: No match or complete failure
  • Intermediate values: Partial matches or varying degrees of success

Creating Custom Metrics

Implementing Your Own Metric

To create a custom metric:

  1. Extend the BaseMetric class
  2. Define a validation schema using Zod
  3. Implement the score method

Here’s an example of a custom metric that checks if output length is within a specified range:

import z from "zod";
import { BaseMetric, EvaluationScoreResult } from "@opik/sdk";

// Define validation schema
const validationSchema = z.object({
  output: z.string(),
  minLength: z.number(),
  maxLength: z.number(),
});

// Infer TypeScript type from schema
type Input = z.infer<typeof validationSchema>;

export class LengthRangeMetric extends BaseMetric {
  public validationSchema = validationSchema;

  constructor(name = "length_range", trackMetric = true) {
    super(name, trackMetric);
  }

  async score(input: Input): Promise<EvaluationScoreResult> {
    const { output, minLength, maxLength } = input;
    const length = output.length;

    // Calculate score (1.0 if within range, 0.0 otherwise)
    const isWithinRange = length >= minLength && length <= maxLength;
    const score = isWithinRange ? 1.0 : 0.0;

    // Return result with explanation
    return {
      name: this.name,
      value: score,
      reason: isWithinRange
        ? `Output length (${length}) is within range ${minLength}-${maxLength}`
        : `Output length (${length}) is outside range ${minLength}-${maxLength}`,
    };
  }
}
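
Once defined, the metric can be scored directly or passed to evaluate like any built-in metric. The usage sketch below reuses myDataset and myTask from the earlier examples; when used with evaluate, the dataset items or scoringKeyMapping must supply minLength and maxLength:

const lengthMetric = new LengthRangeMetric();

// Direct scoring
const result = await lengthMetric.score({
  output: "Hello, world!",
  minLength: 5,
  maxLength: 50,
});

console.log(result.value); // 1.0
console.log(result.reason); // "Output length (13) is within range 5-50"

// As part of an evaluation
await evaluate({
  dataset: myDataset,
  task: myTask,
  scoringMetrics: [new LengthRangeMetric()],
});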

Validation Best Practices

When creating custom metrics:

  1. Define clear validation schemas:

    const validationSchema = z.object({
      output: z.string().min(1, "Output is required"),
      threshold: z.number().min(0).max(1),
    });
  2. Return meaningful reasons:

    return {
      name: this.name,
      value: score,
      reason: `Score ${score.toFixed(2)} because [detailed explanation]`,
    };
  3. Normalize scores to a consistent range (typically 0.0-1.0) for easier comparison with other metrics (see the sketch below)
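
For example, a raw value such as an edit distance or a similarity on a 0-100 scale can be clamped and rescaled before it is returned. This hypothetical helper is one way to do it:

// Hypothetical helper: map a raw value in [min, max] onto the 0.0-1.0 range
function normalizeScore(raw: number, min: number, max: number): number {
  if (max <= min) return 0;
  const clamped = Math.min(Math.max(raw, min), max);
  return (clamped - min) / (max - min);
}

const value = normalizeScore(42, 0, 100); // 0.42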

LLM Judge Metrics

LLM Judge metrics use language models to evaluate the quality of LLM outputs. These metrics provide more nuanced evaluation than simple heuristic checks.

AnswerRelevance

Evaluates how relevant the output is to the input question:

import { AnswerRelevance } from "opik";
import { openai } from "@ai-sdk/openai";

// Using default model (gpt-4o)
const metric = new AnswerRelevance();

// With custom model ID
const metricWithModel = new AnswerRelevance({
  model: "claude-3-5-sonnet-latest",
});

// With LanguageModel instance
const customModel = openai("gpt-4o");
const metricWithCustomModel = new AnswerRelevance({ model: customModel });

// Usage
const score = await metric.score({
  input: "What is the capital of France?",
  output: "The capital of France is Paris.",
  context: ["France is a country in Western Europe."], // Optional
});

console.log(score.value); // 0.0 to 1.0
console.log(score.reason); // Explanation of the score

Parameters

  • input (required): The question or prompt
  • output (required): The model’s response to evaluate
  • context (optional): Additional context for evaluation

Score Range

  • 1.0: Perfect relevance - output directly addresses the input
  • 0.5: Partial relevance - output is somewhat related but incomplete
  • 0.0: No relevance - output doesn’t address the input

Hallucination

Detects whether the output contains hallucinated or unfaithful information:

import { Hallucination } from "opik";

const metric = new Hallucination();

// Without context - checks against general knowledge
const score1 = await metric.score({
  input: "What is the capital of France?",
  output:
    "The capital of France is Paris. It is famous for its iconic Eiffel Tower.",
});

// With context - checks faithfulness to provided context
const score2 = await metric.score({
  input: "What is the capital of France?",
  output:
    "The capital of France is Paris. It is famous for its iconic Eiffel Tower.",
  context: [
    "France is a country in Western Europe. Its capital is Paris, which is known for landmarks like the Eiffel Tower.",
  ],
});

console.log(score2.value); // 1.0 = hallucination detected, 0.0 = no hallucination
console.log(score2.reason); // Array of reasons for the score

Parameters

  • input (required): The original question or prompt
  • output (required): The model’s response to evaluate
  • context (optional): Reference information to check against

Score Values

  • 0.0: No hallucination - output is faithful to context/facts
  • 1.0: Hallucination detected - output contains false or unsupported information

Moderation

Checks if the output contains harmful, inappropriate, or unsafe content:

import { Moderation } from "opik";

const metric = new Moderation();

const score = await metric.score({
  input: "Tell me about safety guidelines",
  output: "Here are some safety guidelines...",
});

console.log(score.value); // 1.0 = harmful content detected, 0.0 = safe
console.log(score.reason); // Explanation of moderation decision

Parameters

  • input (required): The original prompt
  • output (required): The model’s response to evaluate

Score Values

  • 0.0: Safe - no harmful content detected
  • 1.0: Harmful - inappropriate or unsafe content detected

Usefulness

Evaluates how useful the output is in addressing the input:

import { Usefulness } from "opik";

const metric = new Usefulness();

const score = await metric.score({
  input: "How do I reset my password?",
  output:
    "To reset your password, click 'Forgot Password' on the login page, enter your email, and follow the instructions sent to your inbox.",
});

console.log(score.value); // 0.0 to 1.0
console.log(score.reason); // Explanation of usefulness score

Parameters

  • input (required): The question or request
  • output (required): The model’s response to evaluate

Score Range

  • 1.0: Very useful - comprehensive and actionable
  • 0.5: Somewhat useful - partially helpful
  • 0.0: Not useful - doesn’t help address the input

Configuring LLM Judge Metrics

Model Configuration

All LLM Judge metrics accept a model parameter in their constructor:

import { openai } from "@ai-sdk/openai";
import { anthropic } from "@ai-sdk/anthropic";
import { google } from "@ai-sdk/google";
import { Hallucination } from "opik";

// Using model ID string
const metric1 = new Hallucination({ model: "gpt-4o" });
const metric2 = new Hallucination({ model: "claude-3-5-sonnet-latest" });
const metric3 = new Hallucination({ model: "gemini-2.0-flash" });

// Using LanguageModel instance
const customModel = openai("gpt-4o");
const metric4 = new Hallucination({ model: customModel });

Async Scoring

All LLM Judge metrics support asynchronous scoring:

import { AnswerRelevance } from "opik";

const metric = new AnswerRelevance();

// Async/await
const score = await metric.score({
  input: "What is TypeScript?",
  output: "TypeScript is a typed superset of JavaScript.",
});

// Promise chain
metric
  .score({
    input: "What is TypeScript?",
    output: "TypeScript is a typed superset of JavaScript.",
  })
  .then((score) => console.log(score.value));

Combining Multiple LLM Judge Metrics

Use multiple metrics together for comprehensive evaluation:

import { AnswerRelevance, Hallucination, Moderation, Usefulness } from "opik";
import { evaluate } from "opik";

await evaluate({
  dataset: myDataset,
  task: myTask,
  scoringMetrics: [
    new AnswerRelevance(),
    new Hallucination(),
    new Moderation(),
    new Usefulness(),
  ],
});

Custom Model for Each Metric

Different metrics can use different models:

import { openai } from "@ai-sdk/openai";
import { anthropic } from "@ai-sdk/anthropic";
import { AnswerRelevance, Hallucination } from "opik";

// Use GPT-4o for answer relevance
const relevanceMetric = new AnswerRelevance({
  model: openai("gpt-4o"),
});

// Use Claude for hallucination detection
const hallucinationMetric = new Hallucination({
  model: anthropic("claude-3-5-sonnet-latest"),
});

await evaluate({
  dataset: myDataset,
  task: myTask,
  scoringMetrics: [relevanceMetric, hallucinationMetric],
});

LLM Judge Metric Best Practices

1. Use Model ID Strings for Simplicity

For most use cases, use model ID strings directly:

import { Hallucination } from "opik";

const metric = new Hallucination({ model: "gpt-4o" });

The Opik SDK handles model configuration internally for optimal evaluation performance.

2. Provide Context When Available

Context improves evaluation accuracy:

// Better: With context
await metric.score({
  input: "What is the capital?",
  output: "The capital is Paris.",
  context: ["France is a country in Europe. Its capital is Paris."],
});

// OK: Without context (relies on general knowledge)
await metric.score({
  input: "What is the capital of France?",
  output: "The capital is Paris.",
});

3. Choose Appropriate Models

Match model capabilities to metric requirements:

// Complex reasoning: Use GPT-4o or Claude Sonnet
const complexMetric = new AnswerRelevance({ model: "gpt-4o" });

// Simple checks: Use faster, cheaper models
const simpleMetric = new Moderation({ model: "gpt-4o-mini" });

4. Handle Errors Gracefully

LLM calls can fail, so handle errors appropriately:

try {
  const score = await metric.score({
    input: "What is TypeScript?",
    output: "TypeScript is a typed superset of JavaScript.",
  });
  console.log(score);
} catch (error) {
  console.error("Metric evaluation failed:", error);
  // Implement fallback or retry logic
}

5. Batch Evaluations

Use the evaluate function for efficient batch processing:

// More efficient for multiple items
await evaluate({
  dataset: myDataset,
  task: myTask,
  scoringMetrics: [new Hallucination()],
  scoringWorkers: 5, // Parallel scoring
});

// Less efficient for batch processing
for (const item of datasetItems) {
  await metric.score(item); // Sequential scoring
}

Score Interpretation

Understanding LLM Judge Scores

LLM Judge metrics return a structured result with the following shape:

interface EvaluationScoreResult {
  name: string; // Metric name
  value: number; // Numerical score (0.0-1.0 typically)
  reason: string | string[]; // Explanation for the score
}

Example Score Results

// AnswerRelevance
{
  name: "answer_relevance",
  value: 0.95,
  reason: "The answer directly addresses the question with accurate information"
}

// Hallucination
{
  name: "hallucination",
  value: 0.0,
  reason: ["All information is supported by the context", "No contradictions found"]
}

// Moderation
{
  name: "moderation",
  value: 0.0,
  reason: "Content is safe and appropriate"
}

Generation Parameters

Configuring Temperature, Seed, and MaxTokens

All LLM Judge metrics support generation parameters in their constructor:

import { Hallucination, AnswerRelevance } from "opik";

// Configure generation parameters
const metric = new Hallucination({
  model: "gpt-4o",
  temperature: 0.3, // Lower = more deterministic
  seed: 42, // For reproducible outputs
  maxTokens: 1000, // Maximum response length
});

// Different settings for different metrics
const relevanceMetric = new AnswerRelevance({
  model: "claude-3-5-sonnet-latest",
  temperature: 0.7, // Higher = more creative
  seed: 12345,
});

// Use the metrics
const score = await metric.score({
  input: "What is the capital of France?",
  output: "The capital of France is Paris.",
  context: ["France is a country in Western Europe."],
});

Advanced Model Settings

For provider-specific advanced parameters, use modelSettings:

import { Hallucination } from "opik";

const metric = new Hallucination({
  model: "gpt-4o",
  temperature: 0.5,
  modelSettings: {
    topP: 0.9, // Nucleus sampling
    topK: 50, // Top-K sampling
    presencePenalty: 0.1, // Reduce repetition
    frequencyPenalty: 0.2, // Reduce phrase repetition
    stopSequences: ["END"], // Custom stop sequences
  },
});

For provider-specific options not exposed through modelSettings, use LanguageModel instances:

import { openai } from "@ai-sdk/openai";
import { anthropic } from "@ai-sdk/anthropic";
import { google } from "@ai-sdk/google";
import { Hallucination } from "opik";

// OpenAI with structured outputs
const openaiModel = openai("gpt-4o", {
  structuredOutputs: true,
});

// Anthropic with cache control
const anthropicModel = anthropic("claude-3-5-sonnet-latest", {
  cacheControl: true,
});

// Google Gemini with specific configuration
const googleModel = google("gemini-2.0-flash");

const metric1 = new Hallucination({ model: openaiModel });
const metric2 = new Hallucination({ model: anthropicModel });
const metric3 = new Hallucination({ model: googleModel });

See the Vercel AI SDK provider documentation for provider-specific options.
