Overview

Opik provides a set of built-in evaluation metrics that can be used to evaluate the output of your LLM calls. These metrics are broken down into two main categories:

  1. Heuristic metrics
  2. LLM as a Judge metrics

Heuristic metrics are deterministic and often statistical in nature. LLM as a Judge metrics are non-deterministic and use an LLM to evaluate the output of another LLM.
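
For example, a heuristic metric such as `Equals` can be scored directly in Python without any LLM call. The sketch below assumes the `reference` parameter name used by the Opik Python SDK for the expected string; check the metric's documentation for the exact signature:

```python
from opik.evaluation.metrics import Equals

# Heuristic metrics run deterministically, with no LLM call involved.
metric = Equals()

result = metric.score(
    output="The capital of France is Paris.",
    reference="The capital of France is Paris.",  # expected string (assumed parameter name)
)
print(result.value)  # 1.0 when the strings match exactly, 0.0 otherwise
```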

Opik provides the following built-in evaluation metrics:

| Metric | Type | Description | Documentation |
| --- | --- | --- | --- |
| Equals | Heuristic | Checks if the output exactly matches an expected string | Equals |
| Contains | Heuristic | Checks if the output contains a specific substring; can be case sensitive or case insensitive | Contains |
| RegexMatch | Heuristic | Checks if the output matches a specified regular expression pattern | RegexMatch |
| IsJson | Heuristic | Checks if the output is a valid JSON object | IsJson |
| Levenshtein | Heuristic | Calculates the Levenshtein distance between the output and an expected string | Levenshtein |
| Hallucination | LLM as a Judge | Checks if the output contains any hallucinations | Hallucination |
| G-Eval | LLM as a Judge | Task-agnostic LLM as a Judge metric | G-Eval |
| Moderation | LLM as a Judge | Checks if the output contains any harmful content | Moderation |
| AnswerRelevance | LLM as a Judge | Checks if the output is relevant to the question | AnswerRelevance |
| Usefulness | LLM as a Judge | Checks if the output is useful for answering the question | Usefulness |
| ContextRecall | LLM as a Judge | Measures how well the output recalls information from the provided context | ContextRecall |
| ContextPrecision | LLM as a Judge | Measures how precisely the output uses the provided context | ContextPrecision |
| Conversational Coherence | LLM as a Judge | Calculates the conversational coherence score for a given conversation thread | ConversationalCoherence |
| Session Completeness Quality | LLM as a Judge | Evaluates the completeness of a session within a conversational thread | SessionCompleteness |
| User Frustration | LLM as a Judge | Calculates the user frustration score for a given conversational thread | UserFrustration |

You can also create your own custom metrics; learn more in the Custom Metric section.
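
As a rough sketch, a custom metric subclasses `base_metric.BaseMetric` and returns a `score_result.ScoreResult`. The metric name and scoring logic below are purely illustrative; see the Custom Metric section for the full interface:

```python
from opik.evaluation.metrics import base_metric, score_result

class ContainsGreeting(base_metric.BaseMetric):
    """Illustrative custom heuristic metric: scores 1.0 if the output greets the user."""

    def __init__(self, name: str = "contains_greeting"):
        self.name = name

    def score(self, output: str, **ignored_kwargs) -> score_result.ScoreResult:
        value = 1.0 if "hello" in output.lower() else 0.0
        return score_result.ScoreResult(
            value=value,
            name=self.name,
            reason="Output contains a greeting" if value else "No greeting found",
        )
```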

Customizing LLM as a Judge metrics

By default, Opik uses GPT-4o from OpenAI as the LLM to evaluate the output of other LLMs. However, you can easily switch to another LLM provider by passing a different value for the `model` parameter.

```python
from opik.evaluation.metrics import Hallucination

metric = Hallucination(model="bedrock/anthropic.claude-3-sonnet-20240229-v1:0")

metric.score(
    input="What is the capital of France?",
    output="The capital of France is Paris. It is famous for its iconic Eiffel Tower and rich cultural heritage.",
)
```

For Python, this functionality is based on the LiteLLM framework. You can find a full list of supported LLM providers and how to configure them in the LiteLLM Providers guide.
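
As an illustration, switching to an Anthropic model via LiteLLM typically only requires setting the provider's API key and passing a LiteLLM model string. The environment variable and model name below are assumptions based on LiteLLM's conventions; check the LiteLLM Providers guide for your provider:

```python
import os

from opik.evaluation.metrics import Hallucination

# LiteLLM reads provider credentials from environment variables;
# ANTHROPIC_API_KEY is the conventional variable for Anthropic models (assumed setup).
os.environ["ANTHROPIC_API_KEY"] = "your-api-key"  # placeholder

# The model string follows LiteLLM's "<provider>/<model>" convention.
metric = Hallucination(model="anthropic/claude-3-5-sonnet-20240620")
```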

For TypeScript, the SDK integrates with the Vercel AI SDK. You can use model ID strings for simplicity or LanguageModel instances for advanced configuration. See the Models documentation for more details.