Overview

Opik provides a set of built-in evaluation metrics that you can mix and match to evaluate LLM behaviour. These metrics are broken down into two main categories:

Heuristic metrics – deterministic checks that rely on rules, statistics, or classical NLP algorithms.
LLM as a Judge metrics – delegate scoring to an LLM so you can capture semantic, task-specific, or conversation-level quality signals.

Heuristic metrics are ideal when you need reproducible checks such as exact matching, regex validation, or similarity scores against a reference. LLM as a Judge metrics are useful when you want richer qualitative feedback (hallucination detection, helpfulness, summarisation quality, regulatory risk, etc.).
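To make "deterministic" concrete, here is a minimal standard-library sketch of three heuristic-style checks (exact match, regex validation, JSON validity). It is not Opik's implementation: the function names and the 0.0/1.0 scoring convention are assumptions made for illustration. The point is only that the same input always yields the same score.

```python
import json
import re


def exact_match(output: str, reference: str) -> float:
    """Return 1.0 when the output equals the reference exactly, else 0.0."""
    return 1.0 if output == reference else 0.0


def regex_match(output: str, pattern: str) -> float:
    """Return 1.0 when the pattern is found anywhere in the output, else 0.0."""
    return 1.0 if re.search(pattern, output) else 0.0


def is_json(output: str) -> float:
    """Return 1.0 when the output parses as JSON, else 0.0."""
    try:
        json.loads(output)
    except (json.JSONDecodeError, TypeError):
        return 0.0
    return 1.0


if __name__ == "__main__":
    print(exact_match("Paris", "Paris"))
    print(regex_match("Order #123 confirmed", r"#\d+"))
    print(is_json('{"city": "Paris"}'))
```

Checks like these need no model call, so they are cheap to run on every output and easy to reproduce in CI.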

Built-in metrics

Heuristic metrics

Metric	Description	Documentation
BERTScore	Contextual embedding similarity score	BERTScore
ChrF	Character n-gram F-score (chrF / chrF++)	ChrF
Contains	Checks whether the output contains a specific substring	Contains
Corpus BLEU	Computes corpus-level BLEU across multiple outputs	CorpusBLEU
Equals	Checks if the output exactly matches an expected string	Equals
GLEU	Estimates grammatical fluency for candidate sentences	GLEU
IsJson	Validates that the output can be parsed as JSON	IsJson
JSDivergence	Jensen–Shannon similarity between token distributions	JSDivergence
JSDistance	Raw Jensen–Shannon divergence	JSDistance
KLDivergence	Kullback–Leibler divergence with smoothing	KLDivergence
Language Adherence	Verifies output language code	Language Adherence
Levenshtein	Calculates the normalized Levenshtein distance between output and reference	Levenshtein
Readability	Reports Flesch Reading Ease and FK grade	Readability
RegexMatch	Checks if the output matches a specified regular expression pattern	RegexMatch
ROUGE	Calculates ROUGE variants (rouge1/2/L/Lsum/W)	ROUGE
Sentence BLEU	Computes a BLEU score for a single output against one or more references	SentenceBLEU
Sentiment	Scores sentiment using VADER	Sentiment
Spearman Ranking	Spearman’s rank correlation	Spearman Ranking
Tone	Flags tone issues such as shouting or negativity	Tone
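The table above lists both a Jensen–Shannon similarity and a raw Jensen–Shannon divergence. The sketch below shows one way the two could relate. It assumes whitespace tokenization, base-2 logarithms (so the divergence lies in [0, 1]), and similarity defined as one minus the divergence. These are illustrative assumptions, not a description of Opik's internals, and the function names are hypothetical.

```python
import math
from collections import Counter


def token_distribution(text: str) -> dict:
    """Turn text into a relative-frequency distribution over lowercase tokens."""
    tokens = text.lower().split()
    if not tokens:
        raise ValueError("text must contain at least one token")
    counts = Counter(tokens)
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}


def js_divergence(p: dict, q: dict) -> float:
    """Base-2 Jensen-Shannon divergence between two token distributions."""
    vocab = set(p) | set(q)
    # Mixture distribution m = (p + q) / 2 over the joint vocabulary.
    m = {t: 0.5 * (p.get(t, 0.0) + q.get(t, 0.0)) for t in vocab}

    def kl(a: dict) -> float:
        # KL(a || m); terms with a[t] == 0 contribute nothing.
        return sum(a[t] * math.log2(a[t] / m[t]) for t in a if a[t] > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)


def js_similarity(p: dict, q: dict) -> float:
    """Similarity on [0, 1], assumed here to be 1 minus the divergence."""
    return 1.0 - js_divergence(p, q)


if __name__ == "__main__":
    ref = token_distribution("the cat sat on the mat")
    out = token_distribution("the cat lay on the mat")
    print(js_divergence(ref, out), js_similarity(ref, out))
```

Under these assumptions, identical texts give a divergence of 0 and a similarity of 1, while texts with no shared tokens give a divergence of 1 and a similarity of 0.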

Conversation heuristic metrics

Metric	Description	Documentation
DegenerationC	Detects repetition and degeneration patterns over a conversation	DegenerationC
Knowledge Retention	Checks whether the last assistant reply preserves user facts from earlier turns	Knowledge Retention

LLM as a Judge metrics

Metric	Description	Documentation
Agent Task Completion Judge	Checks whether an agent fulfilled its assigned task	Agent Task Completion
Agent Tool Correctness Judge	Evaluates whether an agent used tools correctly	Agent Tool Correctness
Answer Relevance	Checks whether the answer stays on-topic with the question	Answer Relevance
Compliance Risk Judge	Identifies non-compliant or high-risk statements	Compliance Risk
Context Precision	Ensures the answer only uses relevant context	Context Precision
Context Recall	Measures how well the answer recalls supporting context	Context Recall
Dialogue Helpfulness Judge	Evaluates how helpful an assistant reply is in a dialogue	Dialogue Helpfulness
G-Eval	Task-agnostic judge configurable with custom instructions	G-Eval
Hallucination	Detects unsupported or hallucinated claims using an LLM judge	Hallucination
LLM Juries Judge	Averages scores from multiple judge metrics for ensemble scoring	LLM Juries
Meaning Match	Evaluates semantic equivalence between output and ground truth	Meaning Match
Moderation	Flags safety or policy violations in assistant responses	Moderation
Prompt Uncertainty Judge	Detects ambiguity in prompts that may confuse LLMs	Prompt Diagnostics
QA Relevance Judge	Determines whether an answer directly addresses the user question	QA Relevance
Structured Output Compliance	Checks JSON or schema adherence for structured responses	Structured Output
Summarization Coherence Judge	Rates the structure and coherence of a summary	Summarization Coherence
Summarization Consistency Judge	Checks if a summary stays faithful to the source	Summarization Consistency
Trajectory Accuracy	Scores how closely agent trajectories follow expected steps	Trajectory Accuracy
Usefulness	Rates how useful the answer is to the user	Usefulness
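The ensemble idea behind a jury of judges (averaging scores from several judge metrics) can be sketched in a few lines. This is an illustration only: `jury_score` and the judge names are hypothetical, and each judge is assumed to have already produced a score between 0 and 1. It does not call Opik or any LLM.

```python
from statistics import mean


def jury_score(judge_scores: dict) -> float:
    """Average the per-judge scores into one ensemble score."""
    if not judge_scores:
        raise ValueError("at least one judge score is required")
    return mean(judge_scores.values())


if __name__ == "__main__":
    # Hypothetical scores from three independent judge metrics.
    scores = {"judge_a": 0.8, "judge_b": 0.6, "judge_c": 1.0}
    print(jury_score(scores))
```

Averaging dampens the effect of any single judge's outlier score, at the cost of one LLM call per judge.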

Conversation LLM as a Judge metrics

Metric	Description	Documentation
Conversational Coherence	Evaluates coherence across sliding windows of a dialogue	Conversational Coherence
Session Completeness Quality	Checks whether user goals were satisfied during the session	Session Completeness
User Frustration	Estimates the likelihood a user was frustrated	User Frustration

Customizing LLM as a Judge metrics

By default, Opik uses GPT-5-nano from OpenAI as the judge model that evaluates the output of other LLMs. You can switch to another model or provider by passing a different value for the model parameter when you create the metric.

from opik.evaluation.metrics import Hallucination

metric = Hallucination(model="bedrock/anthropic.claude-3-sonnet-20240229-v1:0")

metric.score(
    input="What is the capital of France?",
    output="The capital of France is Paris. It is famous for its iconic Eiffel Tower and rich cultural heritage.",
)

For Python, this functionality is based on the LiteLLM framework. You can find a full list of supported LLM providers and how to configure them in the LiteLLM Providers guide.

For TypeScript, the SDK integrates with the Vercel AI SDK. You can use model ID strings for simplicity or LanguageModel instances for advanced configuration. See the Models documentation for more details.