In Opik 2.0, experiments are project-scoped. When using metrics in evaluations, specify a projectName in the evaluate() call so results are associated with the correct project.
Metrics are a fundamental component of the Opik evaluation function. They provide quantitative assessments of your AI models’ outputs, enabling objective comparisons and performance tracking over time.
In Opik, a metric is a function that calculates a score based on specific inputs, such as model outputs and reference answers. All metrics in Opik extend the BaseMetric abstract class, which provides the core functionality for validation and tracking.
Each metric must implement the score method, which:
Accepts an input object containing combined data from the task output, dataset item, and scoringKeyMapping
Returns an EvaluationScoreResult or array of results, which includes:
name: The metric name
value: The numerical score (typically 0.0-1.0)
reason: A human-readable explanation for the score
Opik supports different types of metrics:
Opik provides several built-in metrics for common evaluation scenarios:
Checks if the model output exactly matches the expected output:
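As a rough illustration of the logic behind an exact-match check, here is a minimal standalone sketch. It does not call the Opik SDK: the `ScoreResult` interface and the `exactMatchScore` function are hypothetical names introduced only for this example, modeled on the name/value/reason fields described earlier.

```typescript
// Hypothetical result shape mirroring the name/value/reason fields (assumption).
interface ScoreResult {
  name: string;
  value: number;
  reason: string;
}

// Standalone illustration of exact-match scoring; not the SDK's implementation.
function exactMatchScore(output: string, expected: string): ScoreResult {
  const matched = output === expected; // strict, case-sensitive comparison
  return {
    name: "exact_match",
    value: matched ? 1.0 : 0.0,
    reason: matched
      ? "Output matches the expected value exactly"
      : "Output differs from the expected value",
  };
}
```

Note that this sketch is case-sensitive and does not trim whitespace, so "paris" and "Paris" count as a mismatch.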
Checks if the model output contains specific text:
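The idea can be sketched without the SDK as a plain substring test. The function name `containsScore` and its `caseSensitive` flag are assumptions made for this illustration, not the built-in metric's signature.

```typescript
// Standalone illustration of a substring check; names and options are hypothetical.
function containsScore(
  output: string,
  substring: string,
  caseSensitive: boolean = false
): number {
  // Lower-case both sides when the comparison should ignore case.
  const haystack = caseSensitive ? output : output.toLowerCase();
  const needle = caseSensitive ? substring : substring.toLowerCase();
  return haystack.indexOf(needle) !== -1 ? 1.0 : 0.0;
}
```

With the default settings, `containsScore("The capital is Paris", "paris")` yields a passing score, while passing `true` as the third argument makes the same call fail.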
Checks if the model output matches a regular expression pattern:
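A minimal standalone sketch of pattern-based scoring follows. It uses only the built-in `RegExp` type; `regexMatchScore` is a hypothetical helper and does not reflect how the SDK's metric is constructed or called.

```typescript
// Standalone illustration of regex-based scoring; not the SDK's implementation.
function regexMatchScore(output: string, pattern: RegExp): number {
  // Reset lastIndex so patterns carrying the global or sticky flag
  // behave the same on every call.
  pattern.lastIndex = 0;
  return pattern.test(output) ? 1.0 : 0.0;
}
```

Resetting `lastIndex` matters because a `RegExp` with the `g` flag remembers where its previous match ended, which would otherwise make repeated scoring calls inconsistent.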
Checks if the output is valid JSON:
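One simple way to express this check is to attempt a parse and treat any exception as a failure. The sketch below is standalone, and `isJsonScore` is a hypothetical name.

```typescript
// Standalone illustration of a JSON validity check; not the SDK's implementation.
function isJsonScore(output: string): number {
  try {
    JSON.parse(output); // throws a SyntaxError on malformed input
    return 1.0;
  } catch {
    return 0.0;
  }
}
```

This sketch accepts any parseable JSON value, including bare numbers and quoted strings. Whether the built-in metric is equally permissive is not stated here, so treat that behavior as an assumption.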
Each metric can be configured with a custom name and tracking option:
You can use multiple metrics in a single evaluation:
Each metric defines a Zod validation schema that specifies required inputs:
The validation system ensures all required parameters are present before executing the metric.
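To make the "required inputs are checked before scoring" idea concrete, here is a hand-rolled sketch that avoids any dependency. Opik is described as using Zod schemas for this; the helpers `missingKeys` and `validateOrThrow` below are hypothetical stand-ins that only approximate a presence check.

```typescript
// Hypothetical helper: lists required keys that are absent (undefined or null).
function missingKeys(
  input: Record<string, unknown>,
  required: string[]
): string[] {
  return required.filter(
    (key) => input[key] === undefined || input[key] === null
  );
}

// Hypothetical guard that would run before a metric's score method.
function validateOrThrow(
  input: Record<string, unknown>,
  required: string[]
): void {
  const missing = missingKeys(input, required);
  if (missing.length > 0) {
    throw new Error(`Missing required metric inputs: ${missing.join(", ")}`);
  }
}
```

A real schema would typically also check types (for example, that `output` is a string), which this presence-only sketch leaves out.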
You can map dataset fields and task outputs to metric inputs using scoringKeyMapping:
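The effect of such a mapping can be sketched as a small key-renaming step. This is a standalone illustration, not SDK code: `applyKeyMapping` is a hypothetical helper, and the mapping direction (metric input name as the key, source field name as the value) is an assumption.

```typescript
// Assumed direction: { metricInputName: "sourceFieldName" }.
type KeyMapping = Record<string, string>;

// Hypothetical helper: copies the combined data and adds the mapped keys.
function applyKeyMapping(
  combined: Record<string, unknown>,
  mapping: KeyMapping
): Record<string, unknown> {
  const result: Record<string, unknown> = {};
  for (const key of Object.keys(combined)) {
    result[key] = combined[key];
  }
  for (const metricKey of Object.keys(mapping)) {
    // Expose the source field under the name the metric expects.
    result[metricKey] = combined[mapping[metricKey]];
  }
  return result;
}
```

For instance, a dataset item that stores its reference answer under `expected_output` could be exposed to a metric that reads `expected` by passing `{ expected: "expected_output" }`.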
Most metrics in Opik return scores between 0.0 and 1.0:
To create a custom metric:
Extend the BaseMetric class
Implement the score method
Here’s an example of a custom metric that checks if output length is within a specified range:
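A minimal sketch of a length-range metric, written so it runs without dependencies. `MetricSketch` is a local stand-in for the SDK's BaseMetric, and `ScoreResult`, `LengthRangeMetric`, and the constructor parameters are hypothetical; the real base class also handles validation and tracking, which this sketch omits.

```typescript
// Hypothetical result shape mirroring the name/value/reason fields.
interface ScoreResult {
  name: string;
  value: number;
  reason: string;
}

// Local stand-in for the SDK's BaseMetric (assumption, for illustration only).
abstract class MetricSketch {
  constructor(public readonly name: string) {}
  abstract score(input: { output: string }): ScoreResult;
}

class LengthRangeMetric extends MetricSketch {
  constructor(
    private readonly minLength: number,
    private readonly maxLength: number,
    name: string = "length_range"
  ) {
    super(name);
  }

  // Scores 1.0 when the output length falls inside [minLength, maxLength].
  score(input: { output: string }): ScoreResult {
    const length = input.output.length;
    const inRange = length >= this.minLength && length <= this.maxLength;
    return {
      name: this.name,
      value: inRange ? 1.0 : 0.0,
      reason: inRange
        ? `Output length ${length} is within [${this.minLength}, ${this.maxLength}]`
        : `Output length ${length} is outside [${this.minLength}, ${this.maxLength}]`,
    };
  }
}
```

Both bounds are inclusive in this sketch, so `new LengthRangeMetric(3, 10)` accepts outputs of exactly 3 or exactly 10 characters.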
When creating custom metrics:
Define clear validation schemas:
Return meaningful reasons:
Normalize scores to a consistent range (typically 0.0-1.0) for easier comparison with other metrics
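One common way to normalize is min-max scaling followed by clamping. The sketch below is a generic illustration; `normalizeScore` is a hypothetical helper, not part of the SDK.

```typescript
// Rescales a raw score from [min, max] to [0, 1], clamping out-of-range values.
function normalizeScore(raw: number, min: number, max: number): number {
  if (max <= min) {
    throw new Error("max must be greater than min");
  }
  const scaled = (raw - min) / (max - min);
  return Math.min(1, Math.max(0, scaled));
}
```

For example, a raw score of 3 on a 1-to-5 rating scale maps to 0.5, and a raw value above the maximum is clamped to 1.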
LLM Judge metrics use language models to evaluate the quality of LLM outputs. These metrics provide more nuanced evaluation than simple heuristic checks.
Evaluates how relevant the output is to the input question:
input (required): The question or prompt
output (required): The model’s response to evaluate
context (optional): Additional context for evaluation
Detects whether the output contains hallucinated or unfaithful information:
input (required): The original question or prompt
output (required): The model’s response to evaluate
context (optional): Reference information to check against
Checks if the output contains harmful, inappropriate, or unsafe content:
input (required): The original prompt
output (required): The model’s response to evaluate
Evaluates how useful the output is in addressing the input:
input (required): The question or request
output (required): The model’s response to evaluate
GEval is a task-agnostic LLM-as-a-judge metric that allows you to define custom evaluation criteria. The metric first generates a chain-of-thought (CoT) evaluation plan, then scores the output on a 0-10 scale (normalized to 0.0-1.0).
taskIntroduction (required): Description of what should be evaluated
evaluationCriteria (required): Detailed criteria defining what “good” looks like
model (optional): Model to use for evaluation (defaults to “gpt-4o”)
name (optional): Custom metric name (defaults to “g_eval_metric”)
temperature (optional): Sampling temperature for generation
seed (optional): Seed for reproducible outputs
maxTokens (optional): Maximum response length
modelSettings (optional): Advanced model configuration
GEval uses a two-stage process:
When using OpenAI models, GEval leverages logprobs to compute a weighted average of score probabilities for more robust scoring.
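The weighted-average idea can be sketched as follows. This is one plausible reading of the description, not the SDK's actual computation: the `ScoreCandidate` shape and `weightedScore` function are hypothetical, and the sketch assumes each candidate pairs an integer score on the 0-10 scale with the log probability of the token that produced it.

```typescript
// Hypothetical pairing of a candidate score (0-10) with its token log probability.
interface ScoreCandidate {
  score: number;
  logprob: number;
}

// Probability-weighted mean of candidate scores, normalized to the 0.0-1.0 range.
function weightedScore(candidates: ScoreCandidate[]): number {
  let probabilityMass = 0;
  let weightedSum = 0;
  for (const candidate of candidates) {
    const probability = Math.exp(candidate.logprob); // logprob -> probability
    probabilityMass += probability;
    weightedSum += probability * candidate.score;
  }
  if (probabilityMass === 0) {
    throw new Error("No probability mass to average over");
  }
  // Renormalize by the observed mass, then map 0-10 onto 0.0-1.0.
  return weightedSum / probabilityMass / 10;
}
```

Dividing by the observed probability mass keeps the result well-defined when only the top few candidate tokens are available and their probabilities do not sum to 1.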
Opik provides pre-configured GEval judges for common evaluation scenarios. Each extends GEval with domain-specific prompts:
Evaluates whether an answer directly addresses the question:
Checks if a summary is faithful to the source material:
Evaluates the structure and clarity of summaries:
Assesses how helpful an assistant reply is in dialogue context:
Detect various forms of bias in responses:
Evaluate agent task completion and tool usage:
Estimates how ambiguous a prompt is:
Flags non-compliant or risky claims in regulated sectors:
All built-in GEval judges:
QARelevanceJudge - Answer relevance to questions
SummarizationConsistencyJudge - Summary faithfulness
SummarizationCoherenceJudge - Summary structure and clarity
DialogueHelpfulnessJudge - Assistant helpfulness in dialogue
DemographicBiasJudge - Demographic stereotyping
GenderBiasJudge - Gender stereotyping
PoliticalBiasJudge - Political bias
ReligiousBiasJudge - Religious bias
RegionalBiasJudge - Geographic/cultural bias
AgentTaskCompletionJudge - Agent task fulfillment
AgentToolCorrectnessJudge - Agent tool usage correctness
PromptUncertaintyJudge - Prompt ambiguity
ComplianceRiskJudge - Regulatory compliance risk
All LLM Judge metrics accept a model parameter in their constructor:
All LLM Judge metrics support asynchronous scoring:
Use multiple metrics together for comprehensive evaluation:
Different metrics can use different models:
For most use cases, use model ID strings directly:
The Opik SDK handles model configuration internally for optimal evaluation performance.
Context improves evaluation accuracy:
Match model capabilities to metric requirements:
LLM calls can fail - handle errors appropriately:
Use the evaluate function for efficient batch processing:
LLM Judge metrics return structured scores with:
All LLM Judge metrics support generation parameters in their constructor:
For provider-specific advanced parameters, use modelSettings:
For provider-specific options not exposed through modelSettings, use LanguageModel instances:
See Vercel AI SDK Provider Documentation for provider-specific options: