Understanding Metrics in Opik

In Opik 2.0, experiments are project-scoped. When using metrics in evaluations, specify a projectName in the evaluate() call so results are associated with the correct project.

Metrics are a fundamental component of the Opik evaluation function. They provide quantitative assessments of your AI models’ outputs, enabling objective comparisons and performance tracking over time.

What Are Metrics?

In Opik, a metric is a function that calculates a score based on specific inputs, such as model outputs and reference answers. All metrics in Opik extend the BaseMetric abstract class, which provides the core functionality for validation and tracking.

1 abstract class BaseMetric<GenericZodObjectType> {
2   public readonly name: string;
3   public readonly trackMetric: boolean;
4   public abstract readonly validationSchema: GenericZodObjectType;
5 
6   abstract score(
7     input: Infer<GenericZodObjectType>
8   ):
9     | EvaluationScoreResult
10     | EvaluationScoreResult[]
11     | Promise<EvaluationScoreResult>
12     | Promise<EvaluationScoreResult[]>;
13 }

How Metrics Calculate Scores

Each metric must implement the score method, which:

Accepts an input object containing combined data from the task output, dataset item, and scoringKeyMapping
Processes the inputs to produce a score
Returns an EvaluationScoreResult or array of results, which includes:
- name: The metric name
- value: The numerical score (typically 0.0-1.0)
- reason: A human-readable explanation for the score

Types of Metrics

Opik supports different types of metrics:

Heuristic metrics: Simple rule-based evaluations (e.g., exact match, contains, regex match)
LLM Judge metrics: AI-powered evaluations that use language models to assess output quality

Built-in Metrics

Opik provides several built-in metrics for common evaluation scenarios:

ExactMatch

Checks if the model output exactly matches the expected output:

1 const exactMatch = new ExactMatch();
2 // Usage requires both 'output' and 'expected' parameters

Contains

Checks if the model output contains specific text:

1 const contains = new Contains();
2 // Usage requires both 'output' and 'expected' parameters

RegexMatch

Checks if the model output matches a regular expression pattern:

1 const regexMatch = new RegexMatch();
2 // Usage requires 'output' and 'pattern' parameters

IsJson

Checks if the output is valid JSON:

1 const isJson = new IsJson();
2 // Usage requires 'output' parameter

Metric Configuration

Custom Naming and Tracking

Each metric can be configured with a custom name and tracking option:

1 // Create metric with custom name
2 const exactMatch = new ExactMatch("my_exact_match");
3 
4 // Create metric with tracking disabled
5 const regexMatch = new RegexMatch("custom_regex", false);

Combining Multiple Metrics

You can use multiple metrics in a single evaluation:

1 const metrics = [new ExactMatch(), new Contains(), new RegexMatch()];
2 
3 // In your evaluation configuration
4 await evaluate({
5   dataset: myDataset,
6   task: myTask,
7   scoringMetrics: metrics,
8 });

Input Requirements

Validation Schema

Each metric defines a Zod validation schema that specifies required inputs:

1 // ExactMatch validation schema example
2 const validationSchema = z.object({
3   output: z.string(), // The model output
4   expected: z.string(), // The expected output
5 });

The validation system ensures all required parameters are present before executing the metric.

Mapping Inputs

You can map dataset fields and task outputs to metric inputs using scoringKeyMapping:

1 await evaluate({
2   dataset: myDataset,
3   task: myTask,
4   scoringMetrics: [new ExactMatch()],
5   scoringKeyMapping: {
6     // Map dataset/task fields to metric parameter names
7     output: "model.response",
8     expected: "dataset.answer",
9   },
10 });

Score Interpretation

Score Ranges

Most metrics in Opik return scores between 0.0 and 1.0:

1.0: Perfect match or ideal performance
0.0: No match or complete failure
Intermediate values: Partial matches or varying degrees of success

Creating Custom Metrics

Implementing Your Own Metric

To create a custom metric:

Extend the BaseMetric class
Define a validation schema using Zod
Implement the score method

Here’s an example of a custom metric that checks if output length is within a specified range:

1 import z from "zod";
2 import { BaseMetric } from "@opik/sdk";
3 import { EvaluationScoreResult } from "@opik/sdk";
4 
5 // Define validation schema
6 const validationSchema = z.object({
7   output: z.string(),
8   minLength: z.number(),
9   maxLength: z.number(),
10 });
11 
12 // Infer TypeScript type from schema
13 type Input = z.infer<typeof validationSchema>;
14 
15 export class LengthRangeMetric extends BaseMetric {
16   public validationSchema = validationSchema;
17 
18   constructor(name = "length_range", trackMetric = true) {
19     super(name, trackMetric);
20   }
21 
22   async score(input: Input): Promise<EvaluationScoreResult> {
23     const { output, minLength, maxLength } = input;
24     const length = output.length;
25 
26     // Calculate score (1.0 if within range, 0.0 otherwise)
27     const isWithinRange = length >= minLength && length <= maxLength;
28     const score = isWithinRange ? 1.0 : 0.0;
29 
30     // Return result with explanation
31     return {
32       name: this.name,
33       value: score,
34       reason: isWithinRange
35         ? `Output length (${length}) is within range ${minLength}-${maxLength}`
36         : `Output length (${length}) is outside range ${minLength}-${maxLength}`,
37     };
38   }
39 }

Validation Best Practices

When creating custom metrics:

Define clear validation schemas:

1 const validationSchema = z.object({
2   output: z.string().min(1, "Output is required"),
3   threshold: z.number().min(0).max(1),
4 });

Return meaningful reasons:

1 return {
2   name: this.name,
3   value: score,
4   reason: `Score ${score.toFixed(2)} because [detailed explanation]`,
5 };

Normalize scores to a consistent range (typically 0.0-1.0) for easier comparison with other metrics

LLM Judge Metrics

LLM Judge metrics use language models to evaluate the quality of LLM outputs. These metrics provide more nuanced evaluation than simple heuristic checks.

AnswerRelevance

Evaluates how relevant the output is to the input question:

1 import { AnswerRelevance } from "opik";
2 
3 // Using default model (gpt-5-nano)
4 const metric = new AnswerRelevance();
5 
6 // With custom model ID
7 const metricWithModel = new AnswerRelevance({
8   model: "claude-3-5-sonnet-latest",
9 });
10 
11 // With LanguageModel instance
12 import { openai } from "@ai-sdk/openai";
13 const customModel = openai("gpt-5-nano");
14 const metricWithCustomModel = new AnswerRelevance({ model: customModel });
15 
16 // Usage
17 const score = await metric.score({
18   input: "What is the capital of France?",
19   output: "The capital of France is Paris.",
20   context: ["France is a country in Western Europe."], // Optional
21 });
22 
23 console.log(score.value); // 0.0 to 1.0
24 console.log(score.reason); // Explanation of the score

Parameters

input (required): The question or prompt
output (required): The model’s response to evaluate
context (optional): Additional context for evaluation

Score Range

1.0: Perfect relevance - output directly addresses the input
0.5: Partial relevance - output is somewhat related but incomplete
0.0: No relevance - output doesn’t address the input

Hallucination

Detects whether the output contains hallucinated or unfaithful information:

1 import { Hallucination } from "opik";
2 
3 const metric = new Hallucination();
4 
5 // Without context - checks against general knowledge
6 const score1 = await metric.score({
7   input: "What is the capital of France?",
8   output:
9     "The capital of France is Paris. It is famous for its iconic Eiffel Tower.",
10 });
11 
12 // With context - checks faithfulness to provided context
13 const score2 = await metric.score({
14   input: "What is the capital of France?",
15   output:
16     "The capital of France is Paris. It is famous for its iconic Eiffel Tower.",
17   context: [
18     "France is a country in Western Europe. Its capital is Paris, which is known for landmarks like the Eiffel Tower.",
19   ],
20 });
21 
22 console.log(score2.value); // 1.0 = hallucination detected, 0.0 = no hallucination
23 console.log(score2.reason); // Array of reasons for the score

Parameters

input (required): The original question or prompt
output (required): The model’s response to evaluate
context (optional): Reference information to check against

Score Values

0.0: No hallucination - output is faithful to context/facts
1.0: Hallucination detected - output contains false or unsupported information

Moderation

Checks if the output contains harmful, inappropriate, or unsafe content:

1 import { Moderation } from "opik";
2 
3 const metric = new Moderation();
4 
5 const score = await metric.score({
6   input: "Tell me about safety guidelines",
7   output: "Here are some safety guidelines...",
8 });
9 
10 console.log(score.value); // 1.0 = harmful content detected, 0.0 = safe
11 console.log(score.reason); // Explanation of moderation decision

Parameters

input (required): The original prompt
output (required): The model’s response to evaluate

Score Values

0.0: Safe - no harmful content detected
1.0: Harmful - inappropriate or unsafe content detected

Usefulness

Evaluates how useful the output is in addressing the input:

1 import { Usefulness } from "opik";
2 
3 const metric = new Usefulness();
4 
5 const score = await metric.score({
6   input: "How do I reset my password?",
7   output:
8     "To reset your password, click 'Forgot Password' on the login page, enter your email, and follow the instructions sent to your inbox.",
9 });
10 
11 console.log(score.value); // 0.0 to 1.0
12 console.log(score.reason); // Explanation of usefulness score

Parameters

input (required): The question or request
output (required): The model’s response to evaluate

Score Range

1.0: Very useful - comprehensive and actionable
0.5: Somewhat useful - partially helpful
0.0: Not useful - doesn’t help address the input

GEval

GEval is a task-agnostic LLM-as-a-judge metric that allows you to define custom evaluation criteria. The metric first generates a chain-of-thought (CoT) evaluation plan, then scores the output on a 0-10 scale (normalized to 0.0-1.0).

1 import { GEval } from "opik";
2 
3 const metric = new GEval({
4   taskIntroduction: "You evaluate the politeness of customer service responses.",
5   evaluationCriteria: "Score from 0 (rude) to 10 (very polite). Consider tone, word choice, and empathy.",
6   model: "gpt-4o",
7 });
8 
9 const score = await metric.score({
10   output: "Thanks so much for your patience! I'm happy to help resolve this for you.",
11 });
12 
13 console.log(score.value); // 0.0 to 1.0
14 console.log(score.reason); // Explanation of the score

Parameters

taskIntroduction (required): Description of what should be evaluated
evaluationCriteria (required): Detailed criteria defining what “good” looks like
model (optional): Model to use for evaluation (defaults to “gpt-4o”)
name (optional): Custom metric name (defaults to “g_eval_metric”)
temperature (optional): Sampling temperature for generation
seed (optional): Seed for reproducible outputs
maxTokens (optional): Maximum response length
modelSettings (optional): Advanced model configuration

Score Range

1.0: Perfect score (10/10 from the judge)
0.5: Average score (5/10 from the judge)
0.0: Lowest score (0/10 from the judge)

How It Works

GEval uses a two-stage process:

Chain of Thought Generation: Creates step-by-step evaluation instructions based on your task and criteria (cached for reuse)
Scoring: Evaluates the output using the CoT, returning a score with reasoning

When using OpenAI models, GEval leverages logprobs to compute a weighted average of score probabilities for more robust scoring.

Built-in GEval Judges

Opik provides pre-configured GEval judges for common evaluation scenarios. Each extends GEval with domain-specific prompts:

QARelevanceJudge

Evaluates whether an answer directly addresses the question:

1 import { QARelevanceJudge } from "opik";
2 
3 const judge = new QARelevanceJudge({ model: "gpt-4o" });
4 
5 const score = await judge.score({
6   output: `QUESTION: What causes rainbows?
7 ANSWER: Rainbows are caused by refraction and reflection of light in water droplets.`,
8 });
9 
10 console.log(score.value); // High score for relevant answer

SummarizationConsistencyJudge

Checks if a summary is faithful to the source material:

1 import { SummarizationConsistencyJudge } from "opik";
2 
3 const judge = new SummarizationConsistencyJudge();
4 
5 const score = await judge.score({
6   output: `SOURCE: The company announced Q4 revenue of $2.5M.
7 SUMMARY: The company had strong Q4 performance with $2.5M revenue.`,
8 });

SummarizationCoherenceJudge

Evaluates the structure and clarity of summaries:

1 import { SummarizationCoherenceJudge } from "opik";
2 
3 const judge = new SummarizationCoherenceJudge();
4 
5 const score = await judge.score({
6   output: "SUMMARY: First, the project started. Then it ended. Finally, it began.",
7 });
8 
9 console.log(score.value); // Low score for incoherent summary

DialogueHelpfulnessJudge

Assesses how helpful an assistant reply is in dialogue context:

1 import { DialogueHelpfulnessJudge } from "opik";
2 
3 const judge = new DialogueHelpfulnessJudge();
4 
5 const transcript = `USER: How do I reset my password?
6 ASSISTANT: Visit settings and click reset.
7 USER: I cannot see that option.
8 ASSISTANT: Please contact support.`;
9 
10 const score = await judge.score({ output: transcript });

Bias Detection Judges

Detect various forms of bias in responses:

1 import {
2   DemographicBiasJudge,
3   GenderBiasJudge,
4   PoliticalBiasJudge,
5   ReligiousBiasJudge,
6   RegionalBiasJudge,
7 } from "opik";
8 
9 // Demographic bias
10 const demographicJudge = new DemographicBiasJudge();
11 const score1 = await demographicJudge.score({
12   output: "People from X group are always late.",
13 });
14 
15 // Gender bias
16 const genderJudge = new GenderBiasJudge();
17 const score2 = await genderJudge.score({
18   output: "Women are naturally worse at math.",
19 });
20 
21 // Political bias
22 const politicalJudge = new PoliticalBiasJudge();
23 const score3 = await politicalJudge.score({
24   output: "Vote for candidate X because Y is corrupt.",
25 });

Agent Evaluation Judges

Evaluate agent task completion and tool usage:

1 import { AgentTaskCompletionJudge, AgentToolCorrectnessJudge } from "opik";
2 
3 // Task completion
4 const taskJudge = new AgentTaskCompletionJudge();
5 const score1 = await taskJudge.score({
6   output: "Agent gathered quotes, compared options, and booked travel.",
7 });
8 
9 // Tool correctness
10 const toolJudge = new AgentToolCorrectnessJudge();
11 const score2 = await toolJudge.score({
12   output: "Tool weather_api called with city='Paris' but response ignored.",
13 });

PromptUncertaintyJudge

Estimates how ambiguous a prompt is:

1 import { PromptUncertaintyJudge } from "opik";
2 
3 const judge = new PromptUncertaintyJudge();
4 
5 const score = await judge.score({
6   output: "Summarise the attached 400 page contract in one sentence and guarantee there are no mistakes.",
7 });
8 
9 console.log(score.value); // High score indicates high uncertainty

ComplianceRiskJudge

Flags non-compliant or risky claims in regulated sectors:

1 import { ComplianceRiskJudge } from "opik";
2 
3 const judge = new ComplianceRiskJudge({ model: "gpt-4o" });
4 
5 const score = await judge.score({
6   output: "This pill cures diabetes in a week.",
7 });
8 
9 console.log(score.value); // High score indicates high risk

Available Judges

All built-in GEval judges:

QARelevanceJudge - Answer relevance to questions
SummarizationConsistencyJudge - Summary faithfulness
SummarizationCoherenceJudge - Summary structure and clarity
DialogueHelpfulnessJudge - Assistant helpfulness in dialogue
DemographicBiasJudge - Demographic stereotyping
GenderBiasJudge - Gender stereotyping
PoliticalBiasJudge - Political bias
ReligiousBiasJudge - Religious bias
RegionalBiasJudge - Geographic/cultural bias
AgentTaskCompletionJudge - Agent task fulfillment
AgentToolCorrectnessJudge - Agent tool usage correctness
PromptUncertaintyJudge - Prompt ambiguity
ComplianceRiskJudge - Regulatory compliance risk

Configuring LLM Judge Metrics

Model Configuration

All LLM Judge metrics accept a model parameter in their constructor:

1 import { openai } from "@ai-sdk/openai";
2 import { anthropic } from "@ai-sdk/anthropic";
3 import { google } from "@ai-sdk/google";
4 import { Hallucination } from "opik";
5 
6 // Using model ID string
7 const metric1 = new Hallucination({ model: "gpt-5-nano" });
8 const metric2 = new Hallucination({ model: "claude-3-5-sonnet-latest" });
9 const metric3 = new Hallucination({ model: "gemini-2.0-flash" });
10 
11 // Using LanguageModel instance
12 const customModel = openai("gpt-5-nano");
13 const metric4 = new Hallucination({ model: customModel });

Async Scoring

All LLM Judge metrics support asynchronous scoring:

1 import { AnswerRelevance } from "opik";
2 
3 const metric = new AnswerRelevance();
4 
5 // Async/await
6 const score = await metric.score({
7   input: "What is TypeScript?",
8   output: "TypeScript is a typed superset of JavaScript.",
9 });
10 
11 // Promise chain
12 metric
13   .score({
14     input: "What is TypeScript?",
15     output: "TypeScript is a typed superset of JavaScript.",
16   })
17   .then((score) => console.log(score.value));

Combining Multiple LLM Judge Metrics

Use multiple metrics together for comprehensive evaluation:

1 import { AnswerRelevance, Hallucination, Moderation, Usefulness } from "opik";
2 import { evaluate } from "opik";
3 
4 await evaluate({
5   dataset: myDataset,
6   task: myTask,
7   scoringMetrics: [
8     new AnswerRelevance(),
9     new Hallucination(),
10     new Moderation(),
11     new Usefulness(),
12   ],
13 });

Custom Model for Each Metric

Different metrics can use different models:

1 import { openai } from "@ai-sdk/openai";
2 import { anthropic } from "@ai-sdk/anthropic";
3 import { AnswerRelevance, Hallucination } from "opik";
4 
5 // Use GPT-4o for answer relevance
6 const relevanceMetric = new AnswerRelevance({
7   model: openai("gpt-5-nano"),
8 });
9 
10 // Use Claude for hallucination detection
11 const hallucinationMetric = new Hallucination({
12   model: anthropic("claude-3-5-sonnet-latest"),
13 });
14 
15 await evaluate({
16   dataset: myDataset,
17   task: myTask,
18   scoringMetrics: [relevanceMetric, hallucinationMetric],
19 });

LLM Judge Metric Best Practices

1. Use Model ID Strings for Simplicity

For most use cases, use model ID strings directly:

1 import { Hallucination } from "opik";
2 
3 const metric = new Hallucination({ model: "gpt-5-nano" });

The Opik SDK handles model configuration internally for optimal evaluation performance.

2. Provide Context When Available

Context improves evaluation accuracy:

1 // Better: With context
2 await metric.score({
3   input: "What is the capital?",
4   output: "The capital is Paris.",
5   context: ["France is a country in Europe. Its capital is Paris."],
6 });
7 
8 // OK: Without context (relies on general knowledge)
9 await metric.score({
10   input: "What is the capital of France?",
11   output: "The capital is Paris.",
12 });

3. Choose Appropriate Models

Match model capabilities to metric requirements:

1 // Complex reasoning: Use GPT-5 or Claude Sonnet
2 const complexMetric = new AnswerRelevance({ model: "gpt-5" });
3 
4 // Simple checks: Use faster, cheaper models
5 const simpleMetric = new Moderation({ model: "gpt-5-nano" });

4. Handle Errors Gracefully

LLM calls can fail - handle errors appropriately:

1 try {
2   const score = await metric.score({
3     input: "What is TypeScript?",
4     output: "TypeScript is a typed superset of JavaScript.",
5   });
6   console.log(score);
7 } catch (error) {
8   console.error("Metric evaluation failed:", error);
9   // Implement fallback or retry logic
10 }

5. Batch Evaluations

Use the evaluate function for efficient batch processing:

1 // More efficient for multiple items
2 await evaluate({
3   dataset: myDataset,
4   task: myTask,
5   scoringMetrics: [new Hallucination()],
6   scoringWorkers: 5, // Parallel scoring
7 });
8 
9 // Less efficient for batch processing
10 for (const item of datasetItems) {
11   await metric.score(item); // Sequential scoring
12 }

Score Interpretation

Understanding LLM Judge Scores

LLM Judge metrics return structured scores with:

1 interface EvaluationScoreResult {
2   name: string; // Metric name
3   value: number; // Numerical score (0.0-1.0 typically)
4   reason: string | string[]; // Explanation for the score
5 }

Example Score Results

1 // AnswerRelevance
2 {
3   name: "answer_relevance",
4   value: 0.95,
5   reason: "The answer directly addresses the question with accurate information"
6 }
7 
8 // Hallucination
9 {
10   name: "hallucination",
11   value: 0.0,
12   reason: ["All information is supported by the context", "No contradictions found"]
13 }
14 
15 // Moderation
16 {
17   name: "moderation",
18   value: 0.0,
19   reason: "Content is safe and appropriate"
20 }

Generation Parameters

Configuring Temperature, Seed, and MaxTokens

All LLM Judge metrics support generation parameters in their constructor:

1 import { Hallucination, AnswerRelevance } from "opik";
2 
3 // Configure generation parameters
4 const metric = new Hallucination({
5   model: "gpt-5-nano",
6   temperature: 0.3, // Lower = more deterministic
7   seed: 42, // For reproducible outputs
8   maxTokens: 1000, // Maximum response length
9 });
10 
11 // Different settings for different metrics
12 const relevanceMetric = new AnswerRelevance({
13   model: "claude-3-5-sonnet-latest",
14   temperature: 0.7, // Higher = more creative
15   seed: 12345,
16 });
17 
18 // Use the metrics
19 const score = await metric.score({
20   input: "What is the capital of France?",
21   output: "The capital of France is Paris.",
22   context: ["France is a country in Western Europe."],
23 });

Advanced Model Settings

For provider-specific advanced parameters, use modelSettings:

1 import { Hallucination } from "opik";
2 
3 const metric = new Hallucination({
4   model: "gpt-5-nano",
5   temperature: 0.5,
6   modelSettings: {
7     topP: 0.9, // Nucleus sampling
8     topK: 50, // Top-K sampling
9     presencePenalty: 0.1, // Reduce repetition
10     frequencyPenalty: 0.2, // Reduce phrase repetition
11     stopSequences: ["END"], // Custom stop sequences
12   },
13 });

For provider-specific options not exposed through modelSettings, use LanguageModel instances:

1 import { openai } from "@ai-sdk/openai";
2 import { anthropic } from "@ai-sdk/anthropic";
3 import { google } from "@ai-sdk/google";
4 
5 // OpenAI with structured outputs
6 const openaiModel = openai("gpt-5-nano", {
7   structuredOutputs: true,
8 });
9 
10 // Anthropic with cache control
11 const anthropicModel = anthropic("claude-3-5-sonnet-latest", {
12   cacheControl: true,
13 });
14 
15 // Google Gemini with specific configuration
16 const googleModel = google("gemini-2.0-flash");
17 
18 const metric1 = new Hallucination({ model: openaiModel });
19 const metric2 = new Hallucination({ model: anthropicModel });
20 const metric3 = new Hallucination({ model: googleModel });

See Vercel AI SDK Provider Documentation for provider-specific options:

What Are Metrics?

How Metrics Calculate Scores

Types of Metrics

Built-in Metrics

ExactMatch

Contains

RegexMatch

IsJson

Metric Configuration

Custom Naming and Tracking

Combining Multiple Metrics

Input Requirements

Validation Schema

Mapping Inputs

Score Interpretation

Score Ranges

Creating Custom Metrics

Implementing Your Own Metric

Validation Best Practices

LLM Judge Metrics

AnswerRelevance

Parameters

Score Range

Hallucination

Parameters

Score Values

Moderation

Parameters

Score Values

Usefulness

Parameters

Score Range

GEval

Parameters

Score Range

How It Works

Built-in GEval Judges

QARelevanceJudge

SummarizationConsistencyJudge

SummarizationCoherenceJudge

DialogueHelpfulnessJudge

Bias Detection Judges

Agent Evaluation Judges

PromptUncertaintyJudge

ComplianceRiskJudge

Available Judges

Configuring LLM Judge Metrics

Model Configuration

Async Scoring

Combining Multiple LLM Judge Metrics

Custom Model for Each Metric

LLM Judge Metric Best Practices

1. Use Model ID Strings for Simplicity

2. Provide Context When Available

3. Choose Appropriate Models

4. Handle Errors Gracefully

5. Batch Evaluations

Score Interpretation

Understanding LLM Judge Scores

Example Score Results

Generation Parameters

Configuring Temperature, Seed, and MaxTokens

Advanced Model Settings

See Also