Overview

Opik provides a set of built-in evaluation metrics that can be used to evaluate the output of your LLM calls. These metrics are broken down into two main categories:

  1. Heuristic metrics
  2. LLM as a Judge metrics

Heuristic metrics are deterministic and often statistical in nature. LLM as a Judge metrics are non-deterministic and use an LLM to evaluate the output of another LLM.
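
For example, a heuristic metric such as `Equals` can be scored directly in Python without any LLM call. The sketch below assumes the `reference` parameter name used by the Opik Python SDK for the expected string; check the metric's documentation for the exact signature:

```python
from opik.evaluation.metrics import Equals

# Heuristic metrics run deterministically, with no LLM call involved.
metric = Equals()

result = metric.score(
    output="The capital of France is Paris.",
    reference="The capital of France is Paris.",  # expected string (assumed parameter name)
)
print(result.value)  # 1.0 when the strings match exactly, 0.0 otherwise
```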

Opik provides the following built-in evaluation metrics:

| Metric | Type | Description | Documentation |
| --- | --- | --- | --- |
| Equals | Heuristic | Checks if the output exactly matches an expected string | Equals |
| Contains | Heuristic | Checks if the output contains a specific substring; can be case sensitive or case insensitive | Contains |
| RegexMatch | Heuristic | Checks if the output matches a specified regular expression pattern | RegexMatch |
| IsJson | Heuristic | Checks if the output is a valid JSON object | IsJson |
| Levenshtein | Heuristic | Calculates the Levenshtein distance between the output and an expected string | Levenshtein |
| Hallucination | LLM as a Judge | Checks if the output contains any hallucinations | Hallucination |
| G-Eval | LLM as a Judge | Task-agnostic LLM as a Judge metric | G-Eval |
| Moderation | LLM as a Judge | Checks if the output contains any harmful content | Moderation |
| AnswerRelevance | LLM as a Judge | Checks if the output is relevant to the question | AnswerRelevance |
| Usefulness | LLM as a Judge | Checks if the output is useful for answering the question | Usefulness |
| ContextRecall | LLM as a Judge | Measures how well the output recalls information from the provided context | ContextRecall |
| ContextPrecision | LLM as a Judge | Measures how precisely the output uses the provided context | ContextPrecision |
| Conversational Coherence | LLM as a Judge | Calculates the conversational coherence score for a given conversation thread | ConversationalCoherence |
| Session Completeness Quality | LLM as a Judge | Evaluates the completeness of a session within a conversational thread | SessionCompleteness |
| User Frustration | LLM as a Judge | Calculates the user frustration score for a given conversational thread | UserFrustration |

You can also create your own custom metrics; learn more in the Custom Metric section.
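
As a rough sketch, a custom metric subclasses `base_metric.BaseMetric` and returns a `score_result.ScoreResult`. The metric name and scoring logic below are purely illustrative; see the Custom Metric section for the full interface:

```python
from opik.evaluation.metrics import base_metric, score_result

class ContainsGreeting(base_metric.BaseMetric):
    """Illustrative custom heuristic metric: scores 1.0 if the output greets the user."""

    def __init__(self, name: str = "contains_greeting"):
        self.name = name

    def score(self, output: str, **ignored_kwargs) -> score_result.ScoreResult:
        value = 1.0 if "hello" in output.lower() else 0.0
        return score_result.ScoreResult(
            value=value,
            name=self.name,
            reason="Output contains a greeting" if value else "No greeting found",
        )
```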

Customizing LLM as a Judge metrics

By default, Opik uses GPT-4o from OpenAI as the LLM to evaluate the output of other LLMs. However, you can easily switch to another LLM provider by passing a different value for the `model` parameter.

```python
from opik.evaluation.metrics import Hallucination

metric = Hallucination(model="bedrock/anthropic.claude-3-sonnet-20240229-v1:0")

metric.score(
    input="What is the capital of France?",
    output="The capital of France is Paris. It is famous for its iconic Eiffel Tower and rich cultural heritage.",
)
```

For Python, this functionality is based on the LiteLLM framework. You can find a full list of supported LLM providers and how to configure them in the LiteLLM Providers guide.
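
As an illustration, switching to an Anthropic model via LiteLLM typically only requires setting the provider's API key and passing a LiteLLM model string. The environment variable and model name below are assumptions based on LiteLLM's conventions; check the LiteLLM Providers guide for your provider:

```python
import os

from opik.evaluation.metrics import Hallucination

# LiteLLM reads provider credentials from environment variables;
# ANTHROPIC_API_KEY is the conventional variable for Anthropic models (assumed setup).
os.environ["ANTHROPIC_API_KEY"] = "your-api-key"  # placeholder

# The model string follows LiteLLM's "<provider>/<model>" convention.
metric = Hallucination(model="anthropic/claude-3-5-sonnet-20240620")
```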

For TypeScript, the SDK integrates with the Vercel AI SDK. You can use model ID strings for simplicity or LanguageModel instances for advanced configuration. See the Models documentation for more details.