Evaluation Metrics
Metrics are a fundamental component of the Opik evaluation function. They provide quantitative assessments of your AI models’ outputs, enabling objective comparisons and performance tracking over time.
What Are Metrics?
In Opik, a metric is a function that calculates a score based on specific inputs, such as model outputs and reference answers. All metrics in Opik extend the BaseMetric abstract class, which provides the core functionality for validation and tracking.
How Metrics Calculate Scores
Each metric must implement the score method, which:
- Accepts an `input` object containing combined data from the task output, dataset item, and `scoringKeyMapping`
- Processes the inputs to produce a score
- Returns an `EvaluationScoreResult` or array of results, which includes:
  - `name`: The metric name
  - `value`: The numerical score (typically 0.0-1.0)
  - `reason`: A human-readable explanation for the score
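For illustration, a single score result has roughly this shape (assuming `EvaluationScoreResult` is exported as a type from the `opik` package):

```typescript
import type { EvaluationScoreResult } from "opik";

// Illustrative result object; real values come from a metric's score method
const result: EvaluationScoreResult = {
  name: "exact_match",
  value: 1.0,
  reason: "Output matches the expected answer exactly",
};
```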
Types of Metrics
Opik supports different types of metrics:
- Heuristic metrics: Simple rule-based evaluations (e.g., exact match, contains, regex match)
- LLM Judge metrics: AI-powered evaluations that use language models to assess output quality
Built-in Metrics
Opik provides several built-in metrics for common evaluation scenarios:
ExactMatch
Checks if the model output exactly matches the expected output:
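A minimal usage sketch; the import path and the input key names (`output`, `expected`) are assumptions, so check the metric's validation schema:

```typescript
import { ExactMatch } from "opik";

const metric = new ExactMatch();

const result = await metric.score({
  output: "Paris",
  expected: "Paris",
});
console.log(result.value); // 1.0 when the strings are identical, 0.0 otherwise
```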
Contains
Checks if the model output contains specific text:
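A sketch under the same assumptions about import path and input key names:

```typescript
import { Contains } from "opik";

const metric = new Contains();

const result = await metric.score({
  output: "The capital of France is Paris.",
  expected: "Paris", // substring to look for
});
```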
RegexMatch
Checks if the model output matches a regular expression pattern:
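A sketch; the `pattern` key name is an assumption:

```typescript
import { RegexMatch } from "opik";

const metric = new RegexMatch();

const result = await metric.score({
  output: "Order #12345 has been confirmed.",
  pattern: "#\\d{5}", // regular expression the output must match
});
```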
IsJson
Checks if the output is valid JSON:
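A sketch, assuming the metric only needs an `output` field:

```typescript
import { IsJson } from "opik";

const metric = new IsJson();

const result = await metric.score({
  output: '{"status": "ok", "items": 3}',
});
console.log(result.value); // 1.0 for parseable JSON, 0.0 otherwise
```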
Metric Configuration
Custom Naming and Tracking
Each metric can be configured with a custom name and tracking option:
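A sketch of the idea, assuming the constructor accepts a name followed by a tracking flag (the exact signature may differ):

```typescript
import { ExactMatch } from "opik";

// Custom name appears in experiment results; tracking disabled for this instance
const metric = new ExactMatch("strict_answer_match", false);
```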
Combining Multiple Metrics
You can use multiple metrics in a single evaluation:
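A sketch of a combined evaluation; the `Opik` client, `getOrCreateDataset`, and the `evaluate` option names are assumed from the rest of the SDK and may differ in detail:

```typescript
import { evaluate, ExactMatch, Contains, Opik } from "opik";

const client = new Opik();
const dataset = await client.getOrCreateDataset("qa-dataset");

await evaluate({
  dataset,
  task: async () => ({ output: "model answer goes here" }), // placeholder task
  scoringMetrics: [new ExactMatch(), new Contains()], // both metrics score every item
  experimentName: "multiple-metrics-example",
});
```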
Input Requirements
Validation Schema
Each metric defines a Zod validation schema that specifies required inputs:
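For example, a metric that compares an output against an expected answer might declare a schema like this (illustrative only):

```typescript
import { z } from "zod";

// Inputs this metric requires before it will run
const validationSchema = z.object({
  output: z.string(),   // produced by the evaluation task
  expected: z.string(), // taken from the dataset item
});

type Input = z.infer<typeof validationSchema>;
```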
The validation system ensures all required parameters are present before executing the metric.
Mapping Inputs
You can map dataset fields and task outputs to metric inputs using scoringKeyMapping:
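A sketch; the mapping direction shown here (metric input name on the left, source field on the right) is an assumption, so confirm it against the evaluate function docs:

```typescript
import { evaluate, ExactMatch, Opik } from "opik";

const client = new Opik();
const dataset = await client.getOrCreateDataset("qa-dataset");

await evaluate({
  dataset,
  task: async () => ({ modelResponse: "Paris" }), // placeholder task
  scoringMetrics: [new ExactMatch()],
  scoringKeyMapping: {
    output: "modelResponse", // metric's `output` comes from the task's `modelResponse`
    expected: "answer",      // metric's `expected` comes from the dataset item's `answer`
  },
});
```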
Score Interpretation
Score Ranges
Most metrics in Opik return scores between 0.0 and 1.0:
- 1.0: Perfect match or ideal performance
- 0.0: No match or complete failure
- Intermediate values: Partial matches or varying degrees of success
Creating Custom Metrics
Implementing Your Own Metric
To create a custom metric:
- Extend the `BaseMetric` class
- Define a validation schema using Zod
- Implement the `score` method
Here’s an example of a custom metric that checks if output length is within a specified range:
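A sketch under a few assumptions: that `BaseMetric` takes `(name, trackMetric)` in its constructor, exposes `this.name`, and reads the schema from a `validationSchema` property:

```typescript
import { BaseMetric, type EvaluationScoreResult } from "opik";
import { z } from "zod";

// Inputs this metric needs (see "Validation Schema" above)
const lengthInputSchema = z.object({
  output: z.string(),
});

type LengthInput = z.infer<typeof lengthInputSchema>;

export class LengthRangeMetric extends BaseMetric {
  public validationSchema = lengthInputSchema;

  constructor(
    private minLength = 10,
    private maxLength = 100,
    name = "length_range",
    trackMetric = true
  ) {
    super(name, trackMetric);
  }

  score(input: LengthInput): EvaluationScoreResult {
    const length = input.output.length;
    const withinRange = length >= this.minLength && length <= this.maxLength;

    return {
      name: this.name,
      value: withinRange ? 1.0 : 0.0,
      reason: withinRange
        ? `Output length ${length} is within [${this.minLength}, ${this.maxLength}]`
        : `Output length ${length} is outside [${this.minLength}, ${this.maxLength}]`,
    };
  }
}
```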
Validation Best Practices
When creating custom metrics:
- Define clear validation schemas for the inputs your metric needs (see the sketch after this list)
- Return meaningful reasons that explain how each score was derived
- Normalize scores to a consistent range (typically 0.0-1.0) for easier comparison with other metrics
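For instance, a partial-credit scorer might normalize its value and spell out exactly what was matched (plain TypeScript, illustrative only):

```typescript
import type { EvaluationScoreResult } from "opik";

// Returns a score in [0.0, 1.0] plus a reason that explains how it was computed
function scoreKeywordCoverage(output: string, keywords: string[]): EvaluationScoreResult {
  const hits = keywords.filter((k) => output.toLowerCase().includes(k.toLowerCase()));
  const value = keywords.length === 0 ? 1.0 : hits.length / keywords.length;

  return {
    name: "keyword_coverage",
    value,
    reason: `Matched ${hits.length} of ${keywords.length} keywords: ${hits.join(", ") || "none"}`,
  };
}
```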
LLM Judge Metrics
LLM Judge metrics use language models to evaluate the quality of LLM outputs. These metrics provide more nuanced evaluation than simple heuristic checks.
AnswerRelevance
Evaluates how relevant the output is to the input question:
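A usage sketch; the import path, the zero-argument constructor, and passing `context` as an array are assumptions:

```typescript
import { AnswerRelevance } from "opik";

const metric = new AnswerRelevance();

const result = await metric.score({
  input: "What is the capital of France?",
  output: "The capital of France is Paris.",
  context: ["France is a country in Western Europe. Its capital city is Paris."],
});
console.log(result.value, result.reason);
```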
Parameters
- `input` (required): The question or prompt
- `output` (required): The model's response to evaluate
- `context` (optional): Additional context for evaluation
Score Range
- 1.0: Perfect relevance - output directly addresses the input
- 0.5: Partial relevance - output is somewhat related but incomplete
- 0.0: No relevance - output doesn’t address the input
Hallucination
Detects whether the output contains hallucinated or unfaithful information:
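A sketch under the same API assumptions as above:

```typescript
import { Hallucination } from "opik";

const metric = new Hallucination();

const result = await metric.score({
  input: "When was the Eiffel Tower completed?",
  output: "The Eiffel Tower was completed in 1999.",
  context: ["The Eiffel Tower was completed in 1889."],
});
// A value of 1.0 flags the unsupported claim; result.reason explains which statement conflicts
```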
Parameters
- `input` (required): The original question or prompt
- `output` (required): The model's response to evaluate
- `context` (optional): Reference information to check against
Score Values
- 0.0: No hallucination - output is faithful to context/facts
- 1.0: Hallucination detected - output contains false or unsupported information
Moderation
Checks if the output contains harmful, inappropriate, or unsafe content:
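A sketch under the same API assumptions:

```typescript
import { Moderation } from "opik";

const metric = new Moderation();

const result = await metric.score({
  input: "Summarize our community guidelines.",
  output: "Our guidelines ask members to be respectful and avoid personal attacks.",
});
console.log(result.value); // 0.0 means no harmful content was detected
```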
Parameters
- `input` (required): The original prompt
- `output` (required): The model's response to evaluate
Score Values
- 0.0: Safe - no harmful content detected
- 1.0: Harmful - inappropriate or unsafe content detected
Usefulness
Evaluates how useful the output is in addressing the input:
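A sketch under the same API assumptions:

```typescript
import { Usefulness } from "opik";

const metric = new Usefulness();

const result = await metric.score({
  input: "How do I reset my password?",
  output: "Open Settings, choose Security, click Reset Password, then follow the emailed link.",
});
console.log(result.value, result.reason);
```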
Parameters
- `input` (required): The question or request
- `output` (required): The model's response to evaluate
Score Range
- 1.0: Very useful - comprehensive and actionable
- 0.5: Somewhat useful - partially helpful
- 0.0: Not useful - doesn’t help address the input
Configuring LLM Judge Metrics
Model Configuration
All LLM Judge metrics accept a model parameter in their constructor:
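A sketch assuming an options-object constructor; the `model` parameter itself is documented in this section, but the exact constructor shape may differ:

```typescript
import { Hallucination } from "opik";

// The model ID string selects which LLM acts as the judge
const metric = new Hallucination({ model: "gpt-4o" });
```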
Async Scoring
All LLM Judge metrics support asynchronous scoring:
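Because `score` returns a Promise for LLM Judge metrics, calls can be awaited individually or run concurrently (sketch):

```typescript
import { AnswerRelevance } from "opik";

const metric = new AnswerRelevance();

const samples = [
  { input: "What is TypeScript?", output: "TypeScript is a typed superset of JavaScript." },
  { input: "What is Zod?", output: "Zod is a TypeScript-first schema validation library." },
];

// Score several samples concurrently
const results = await Promise.all(samples.map((sample) => metric.score(sample)));
results.forEach((r) => console.log(r.value, r.reason));
```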
Combining Multiple LLM Judge Metrics
Use multiple metrics together for comprehensive evaluation:
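A sketch reusing the same evaluate shape as the earlier examples; the option names are assumptions:

```typescript
import { evaluate, AnswerRelevance, Hallucination, Moderation, Usefulness, Opik } from "opik";

const client = new Opik();
const dataset = await client.getOrCreateDataset("support-answers");

await evaluate({
  dataset,
  task: async () => ({ output: "model answer goes here" }), // placeholder task
  scoringMetrics: [
    new AnswerRelevance(),
    new Hallucination(),
    new Moderation(),
    new Usefulness(),
  ],
  experimentName: "llm-judge-suite",
});
```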
Custom Model for Each Metric
Different metrics can use different models:
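A brief sketch; the constructor shape is assumed and the model IDs are examples only:

```typescript
import { AnswerRelevance, Hallucination } from "opik";

const relevance = new AnswerRelevance({ model: "gpt-4o-mini" }); // lighter model for a simpler judgment
const hallucination = new Hallucination({ model: "gpt-4o" });    // stronger model for a harder judgment
```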
LLM Judge Metric Best Practices
1. Use Model ID Strings for Simplicity
For most use cases, pass a model ID string directly, as in the model configuration example above.
The Opik SDK handles model configuration internally for optimal evaluation performance.
2. Provide Context When Available
Context improves evaluation accuracy:
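A sketch; the company name and facts are invented for illustration:

```typescript
import { Hallucination } from "opik";

const metric = new Hallucination();

const result = await metric.score({
  input: "What year was the company founded?",
  output: "The company was founded in 2015.",
  // Reference material gives the judge something concrete to check the claim against
  context: ["Acme Corp was founded in 2015 in Berlin."],
});
```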
3. Choose Appropriate Models
Match model capabilities to metric requirements. Nuanced judgments such as hallucination detection generally benefit from a more capable model, while simpler checks can run on a smaller, cheaper one.
4. Handle Errors Gracefully
LLM calls can fail - handle errors appropriately:
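A sketch of basic error handling around a scoring call:

```typescript
import { AnswerRelevance } from "opik";

const metric = new AnswerRelevance();

try {
  const result = await metric.score({
    input: "What is the capital of France?",
    output: "Paris.",
  });
  console.log(result.value, result.reason);
} catch (error) {
  // Provider outages, rate limits, or timeouts surface here
  console.error("Metric scoring failed:", error);
  // Decide whether to retry, skip the item, or record a null score
}
```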
5. Batch Evaluations
Use the evaluate function for efficient batch processing, as shown in the combined-metrics examples above.
Score Interpretation
Understanding LLM Judge Scores
LLM Judge metrics return structured scores that include the metric name, a numerical value, and a reason explaining the judgment.
Example Score Results
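An illustrative result only; actual values and wording come from the judge model:

```typescript
const exampleResult = {
  name: "answer_relevance",
  value: 0.85,
  reason:
    "The response directly addresses the question and covers the main points, but omits one requested detail.",
};
```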
Generation Parameters
Configuring Temperature, Seed, and MaxTokens
All LLM Judge metrics support generation parameters in their constructor:
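A sketch assuming these are passed as constructor options; the parameter names follow this section's title:

```typescript
import { AnswerRelevance } from "opik";

const metric = new AnswerRelevance({
  model: "gpt-4o",
  temperature: 0.0, // deterministic judgments
  seed: 42,         // reproducibility where the provider supports it
  maxTokens: 512,   // cap the length of the judge's response
});
```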
Advanced Model Settings
For provider-specific advanced parameters, use modelSettings:
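A sketch only; the keys accepted under `modelSettings` depend on the provider, and the ones shown here are hypothetical:

```typescript
import { AnswerRelevance } from "opik";

const metric = new AnswerRelevance({
  model: "gpt-4o",
  modelSettings: {
    topP: 0.9,             // hypothetical provider setting
    frequencyPenalty: 0.1, // hypothetical provider setting
  },
});
```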
For provider-specific options not exposed through modelSettings, use LanguageModel instances:
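A sketch using the Vercel AI SDK's OpenAI provider; passing the resulting `LanguageModel` to the metric constructor is assumed to mirror the `model` option above:

```typescript
import { AnswerRelevance } from "opik";
import { openai } from "@ai-sdk/openai";

// Construct a LanguageModel instance with provider-specific defaults, then hand it to the metric
const metric = new AnswerRelevance({
  model: openai("gpt-4o"),
});
```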
See the Vercel AI SDK Provider Documentation for provider-specific options.
See Also
- Models - Configuring language models for metrics
- evaluate Function - Using metrics in evaluations
- evaluatePrompt Function - Using metrics with prompt evaluation