Heuristic metrics
Heuristic metrics are rule-based evaluation methods that allow you to check specific aspects of language model outputs. These metrics use predefined criteria or patterns to assess the quality, consistency, or characteristics of generated text. They come in two flavours:
- Token or string heuristics – operate on a single turn and compare the candidate output to a reference or handcrafted rule.
- Conversation heuristics – analyse whole transcripts to spot issues like degeneration or forgotten facts across assistant turns.
String and token heuristics
Conversation heuristics
[!TIP] These metrics operate on a single transcript without requiring a gold reference. If you need BLEU/ROUGE/METEOR-style comparisons, compose a custom ConversationThreadMetric that wraps the single-turn heuristics (SentenceBLEU, ROUGE, METEOR).
Score an LLM response
You can score an LLM response by first initializing the metrics and then calling the score method:
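To illustrate the initialize-then-score pattern, here is a minimal stand-in (not Opik's actual classes): a metric object is constructed once, its score method is called per output, and the result carries a numeric value:

```python
# Illustrative stand-in for the initialize-then-score pattern.
# Opik's real metric classes are imported from its metrics module;
# this sketch only mirrors the calling convention.
from dataclasses import dataclass


@dataclass
class ScoreResult:
    value: float  # score, typically in [0, 1]
    name: str


class ExactMatch:
    """Toy heuristic metric: 1.0 if the output equals the reference."""

    def __init__(self, name: str = "exact_match"):
        self.name = name

    def score(self, output: str, reference: str) -> ScoreResult:
        return ScoreResult(value=float(output == reference), name=self.name)


metric = ExactMatch()
result = metric.score(output="Paris", reference="Paris")
print(result.value)  # 1.0
```

The same metric instance can be reused across many outputs, which is why initialization and scoring are separate steps.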
Metrics
Equals
The Equals metric can be used to check if the output of an LLM exactly matches a specific string. It can be used in the following way:
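Conceptually, the check reduces to a string comparison. A minimal sketch of the underlying logic (the case_sensitive flag is an illustrative assumption, not necessarily Opik's parameter name):

```python
def equals_score(output: str, reference: str, case_sensitive: bool = True) -> float:
    """Return 1.0 on an exact match with the reference, 0.0 otherwise."""
    if not case_sensitive:
        output, reference = output.lower(), reference.lower()
    return float(output == reference)
```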
Contains
The Contains metric can be used to check if the output of an LLM contains a specific substring. It can be used in the following way:
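The underlying check is a substring test. A minimal sketch (again, the case_sensitive flag is illustrative):

```python
def contains_score(output: str, reference: str, case_sensitive: bool = False) -> float:
    """Return 1.0 if the reference substring appears in the output."""
    if not case_sensitive:
        output, reference = output.lower(), reference.lower()
    return float(reference in output)
```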
RegexMatch
The RegexMatch metric can be used to check if the output of an LLM matches a specified regular expression pattern. It can be used in the following way:
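Under the hood this is a standard regular-expression search. A minimal sketch using Python's re module:

```python
import re


def regex_match_score(output: str, pattern: str) -> float:
    """Return 1.0 if the pattern matches anywhere in the output."""
    return float(re.search(pattern, output) is not None)
```

For example, `regex_match_score("Order id: ABC-123", r"[A-Z]{3}-\d{3}")` returns 1.0 because the pattern matches the id.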
IsJson
The IsJson metric can be used to check if the output of an LLM is valid JSON. It can be used in the following way:
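The validity check amounts to attempting to parse the output. A minimal sketch using the standard json module:

```python
import json


def is_json_score(output: str) -> float:
    """Return 1.0 if the output parses as valid JSON, 0.0 otherwise."""
    try:
        json.loads(output)
        return 1.0
    except json.JSONDecodeError:
        return 0.0
```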
LevenshteinRatio
The LevenshteinRatio metric measures how similar the output is to a reference string on a 0–1 scale (1.0 means identical). It is useful when exact matches are too strict but you still want to penalise large deviations.
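A minimal sketch of the computation: the classic dynamic-programming edit distance, normalised by the longer string's length. Note that libraries differ in how they normalise (some weight substitutions differently), so treat this as one common convention rather than Opik's exact formula:

```python
def levenshtein_distance(a: str, b: str) -> int:
    # Classic DP edit distance: insertions, deletions, substitutions cost 1.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def levenshtein_ratio(output: str, reference: str) -> float:
    """Similarity in [0, 1]; 1.0 means the strings are identical."""
    if not output and not reference:
        return 1.0
    d = levenshtein_distance(output, reference)
    return 1.0 - d / max(len(output), len(reference))
```

For instance, "kitten" and "sitting" are 3 edits apart, giving a ratio of 1 - 3/7 ≈ 0.571.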
BLEU
The BLEU (Bilingual Evaluation Understudy) metrics estimate how close the LLM outputs are to one or more reference translations. Opik provides two separate classes:
- SentenceBLEU – Single-sentence BLEU
- CorpusBLEU – Corpus-level BLEU

Both rely on the underlying NLTK BLEU implementation with optional smoothing methods, weights, and variable n-gram orders.
You will need the nltk library:
Use SentenceBLEU to compute single-sentence BLEU between a single candidate and one (or more) references:
Use CorpusBLEU to compute corpus-level BLEU for multiple candidates vs. multiple references. Each candidate and its references align by index in the list:
You can also customize n-grams, smoothing methods, or weights:
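To show what the n-gram order and weights control, here is a minimal, self-contained BLEU sketch. This is not Opik's implementation (which delegates to NLTK with optional smoothing); it omits smoothing and returns 0.0 when any n-gram precision is zero:

```python
import math
from collections import Counter


def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(candidate: str, reference: str,
                  weights=(0.25, 0.25, 0.25, 0.25)) -> float:
    """Weighted geometric mean of clipped n-gram precisions, times a brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n, w in enumerate(weights, start=1):
        cand_ngrams, ref_ngrams = ngrams(cand, n), ngrams(ref, n)
        total = sum(cand_ngrams.values())
        clipped = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
        if total == 0 or clipped == 0:
            return 0.0  # a real implementation applies smoothing here
        log_precisions.append(w * math.log(clipped / total))
    # Brevity penalty discourages candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(sum(log_precisions))
```

Changing `weights` to, say, `(0.5, 0.5)` restricts scoring to unigrams and bigrams, which is what the variable n-gram order options expose.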
Note: If any candidate or reference is empty, SentenceBLEU or CorpusBLEU will raise a MetricComputationError. Handle or validate inputs accordingly.
ROUGE
ROUGE supports multiple variants out of the box: rouge1, rouge2, rougeL, and rougeLsum. You can switch variants via the rouge_type argument and optionally enable stemming or sentence splitting.
Install rouge-score when using this metric:
GLEU
GLEU estimates grammatical fluency using n-gram overlap. It is useful when you care about fluency rather than exact lexical matches.
Requires nltk:
BERTScore
BERTScore compares texts using contextual embeddings, offering a robust alternative to token-level similarity metrics. It produces precision, recall, and F1 scores (Opik reports the F1 by default).
Install the optional dependency before use:
ChrF
ChrF computes the character n-gram F-score (chrF / chrF++). Adjust beta, char_order, and word_order to switch between the two variants.
This metric relies on NLTK:
Distribution metrics
Histogram-based metrics compare token distributions between candidate and reference texts. They are helpful when you want to match style, vocabulary, or topical coverage.
JSDivergence
Returns 1 - Jensen–Shannon divergence, giving a similarity score between 0.0 and 1.0.
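As a sketch of the computation (not Opik's code), build token-frequency distributions over the shared vocabulary and compute the Jensen–Shannon divergence with base-2 logs, so the divergence and the resulting similarity both live in [0, 1]. The sketch assumes non-empty inputs:

```python
import math
from collections import Counter


def _kl(p, q):
    # KL divergence in bits, skipping zero-probability terms in p.
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_similarity(candidate: str, reference: str) -> float:
    """1 - Jensen-Shannon divergence over token distributions; assumes non-empty texts."""
    c = Counter(candidate.lower().split())
    r = Counter(reference.lower().split())
    vocab = sorted(set(c) | set(r))
    c_total, r_total = sum(c.values()), sum(r.values())
    p = [c[t] / c_total for t in vocab]
    q = [r[t] / r_total for t in vocab]
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    js = 0.5 * _kl(p, m) + 0.5 * _kl(q, m)
    return 1.0 - js
```

Identical distributions give 1.0; fully disjoint vocabularies give 0.0.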
JSDistance
Wraps the same computation but returns the raw divergence (0.0 means identical distributions).
KLDivergence
Computes the KL divergence with optional smoothing and direction control.
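A minimal sketch of KL divergence between token distributions, with additive smoothing so tokens unseen in one text do not produce infinities (the smoothing scheme here is illustrative, not necessarily the one Opik uses):

```python
import math
from collections import Counter


def kl_divergence(candidate: str, reference: str, smoothing: float = 1e-9) -> float:
    """KL(P_candidate || Q_reference) over token distributions, with additive smoothing."""
    c = Counter(candidate.lower().split())
    r = Counter(reference.lower().split())
    vocab = set(c) | set(r)
    c_total = sum(c.values()) + smoothing * len(vocab)
    r_total = sum(r.values()) + smoothing * len(vocab)
    kl = 0.0
    for t in vocab:
        p = (c[t] + smoothing) / c_total
        q = (r[t] + smoothing) / r_total
        kl += p * math.log(p / q)
    return kl
```

Swapping the argument order reverses the direction of the divergence, which is what the direction-control option refers to.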
Language Adherence
LanguageAdherenceMetric checks whether text matches an expected ISO language code. It can use a fastText language identification model or a custom detector callable.
Install fasttext and download a language ID model when using the default detector:
Readability
Readability computes Flesch Reading Ease (0–100) and the Flesch–Kincaid grade using the textstat package. The metric returns the reading-ease score normalised to [0, 1].
Install the optional dependency when using this metric:
Pass enforce_bounds=True alongside min_grade and/or max_grade to turn the metric into a strict guardrail that only reports 1.0 when the text meets your grade limits.
Spearman Ranking
SpearmanRanking measures how well two rankings agree. It returns a normalised correlation score in [0, 1].
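As a sketch of the computation (assuming rank lists without ties, which admit the closed-form Spearman formula): compute the rank correlation in [-1, 1], then map it linearly onto [0, 1]:

```python
def spearman_score(ranking_a, ranking_b) -> float:
    """Spearman correlation between two tie-free rankings, mapped from [-1, 1] to [0, 1]."""
    n = len(ranking_a)
    d_sq = sum((a - b) ** 2 for a, b in zip(ranking_a, ranking_b))
    rho = 1 - (6 * d_sq) / (n * (n ** 2 - 1))  # classic Spearman formula, no ties
    return (rho + 1) / 2
```

Identical rankings score 1.0; perfectly reversed rankings score 0.0.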
Tone
Tone flags outputs that sound aggressive, negative, or violate a list of forbidden phrases. You can tweak sentiment thresholds, uppercase ratios, and exclamation limits.
Sentiment
The Sentiment metric analyzes the sentiment of text using NLTK’s VADER (Valence Aware Dictionary and sEntiment Reasoner) sentiment analyzer. It returns scores for positive, neutral, negative, and compound sentiment.
You will need the nltk library and the vader_lexicon:
Use Sentiment to analyze the sentiment of text:
The metric returns:
- value: The compound sentiment score (-1.0 to 1.0)
- metadata: Dictionary containing all sentiment scores:
  - pos: Positive sentiment (0.0-1.0)
  - neu: Neutral sentiment (0.0-1.0)
  - neg: Negative sentiment (0.0-1.0)
  - compound: Normalized compound score (-1.0 to 1.0)
The compound score is a normalized score between -1.0 (extremely negative) and 1.0 (extremely positive), with scores:
- ≥ 0.05: Positive sentiment
- > -0.05 and < 0.05: Neutral sentiment
- ≤ -0.05: Negative sentiment
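These thresholds are the conventional VADER cut-offs, and mapping a compound score to a label is a simple comparison:

```python
def classify_compound(compound: float) -> str:
    """Map a VADER compound score onto the conventional three-way label."""
    if compound >= 0.05:
        return "positive"
    if compound <= -0.05:
        return "negative"
    return "neutral"
```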
ROUGE
The ROUGE (Recall-Oriented Understudy for Gisting Evaluation) metrics estimate how close the LLM outputs are to one or more reference summaries; they are commonly used for evaluating summarization and text generation tasks. ROUGE measures the overlap between an output string and a reference string, with support for multiple ROUGE types. This metric is a wrapper around the Google Research reimplementation of ROUGE, provided by the rouge-score library. You will need the rouge-score library:
It can be used in the following way:
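To make the scoring concrete, here is a minimal, self-contained sketch of the n-gram-overlap F1 that rouge1 (n=1) and rouge2 (n=2) compute; the real rouge-score library also handles tokenization, stemming, and the LCS-based variants:

```python
from collections import Counter


def rouge_n(output: str, reference: str, n: int = 1) -> float:
    """N-gram overlap F1 between output and reference (rouge1 when n=1, rouge2 when n=2)."""
    def grams(text):
        toks = text.lower().split()
        return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))

    out_g, ref_g = grams(output), grams(reference)
    overlap = sum(min(c, ref_g[g]) for g, c in out_g.items())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(out_g.values())
    recall = overlap / sum(ref_g.values())
    return 2 * precision * recall / (precision + recall)
```

Precision asks how much of the output is in the reference, recall the reverse; ROUGE's F1 balances the two.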
You can customize the ROUGE metric using the following parameters:
- rouge_type (str): Type of ROUGE score to compute. Must be one of:
  - rouge1: Unigram-based scoring
  - rouge2: Bigram-based scoring
  - rougeL: Longest common subsequence-based scoring
  - rougeLsum: ROUGE-L score based on sentence splitting

  Default: rouge1
- use_stemmer (bool): Whether to use stemming in ROUGE computation. Default: False
- split_summaries (bool): Whether to split summaries into sentences. Default: False
- tokenizer (Any | None): Custom tokenizer for sentence splitting. Default: None
AggregatedMetric
You can use the AggregatedMetric function to compute averages across multiple metrics for each item in your experiment.
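The aggregation itself is a plain average over the per-metric scores for each item. A minimal sketch (the function name and dict shape are illustrative, not Opik's API):

```python
def aggregate_scores(scores: dict) -> float:
    """Average the per-metric scores for one experiment item."""
    return sum(scores.values()) / len(scores)
```

For example, an item scoring 1.0 on equality, 1.0 on containment, and 0.5 on rouge1 aggregates to (1.0 + 1.0 + 0.5) / 3 ≈ 0.833.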
You can define the metric as:
Notes
- The metric is case-insensitive.
- ROUGE scores are useful for comparing text summarization models or evaluating text similarity.
- Consider using stemming for improved evaluation in certain cases.