Conversational metrics

The conversational metrics can be used to score the quality of conversational threads that Opik collects across multiple traces, as well as conversations collected by other means.

You can use the following metrics:

| Metric | Description |
| --- | --- |
| ConversationalCoherenceMetric | Calculates the conversational coherence score for a given conversation thread. |
| SessionCompletenessQuality | Evaluates the completeness of a session within a conversational thread. |
| UserFrustrationMetric | Calculates the user frustration score for a given conversational thread. |

These metrics use an LLM to judge the turns of the conversation between the user and the LLM, with a prompt template that generates the judge prompt. By default, the gpt-4o model is used to evaluate responses, but you can change this to any model supported by LiteLLM by setting the model parameter. You can learn more about customizing models in the Customize models for LLM as a Judge metrics section.

Each score produced by these metrics comes with a detailed explanation (result.reason) that helps you understand why that particular score was assigned.
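For example, here is a minimal sketch of passing a different judge model and reading both the score and its explanation. The model name and the two-turn conversation are purely illustrative, and the relevant LiteLLM provider credentials are assumed to be configured:

from opik.evaluation.metrics import ConversationalCoherenceMetric

# Illustrative two-turn thread; real threads will typically be longer.
conversation = [
    {"role": "user", "content": "Can you help me plan a weekend trip?"},
    {"role": "assistant", "content": "Of course! Where would you like to go?"},
]

# Any model supported by LiteLLM can be passed via the `model` parameter.
metric = ConversationalCoherenceMetric(model="gpt-4o-mini")
result = metric.score(conversation)

if result.scoring_failed:
    print(f"Scoring failed: {result.reason}")
else:
    print(result.value)   # the numerical score
    print(result.reason)  # the judge's explanation for the score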

ConversationalCoherenceMetric

This metric assesses coherence and relevance across a series of conversation turns by evaluating the consistency of responses, logical flow, and overall context maintenance. It evaluates whether the conversation session felt like a natural, adaptive, and helpful interaction.

The ConversationalCoherenceMetric builds a sliding window of dialogue turns for each turn in the conversation. It then uses a language model to evaluate whether the final assistant message within each window is relevant and coherent in relation to the preceding conversational context.

It supports both synchronous and asynchronous operation to accommodate how the underlying model is invoked. It returns a score between 0.0 and 1.0, where 0.0 indicates low coherence and 1.0 indicates high coherence.
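Conceptually, the evaluation loop looks something like the sketch below. This is an illustrative simplification rather than Opik's actual implementation: the window size, the hypothetical llm_judge_is_coherent helper, and the averaging step are all assumptions.

# Illustrative sketch of sliding-window coherence scoring (not Opik's internal code).
def score_coherence(conversation, window_size=10):
    verdicts = []
    for i, turn in enumerate(conversation):
        if turn["role"] != "assistant":
            continue
        # Window of turns leading up to (and including) this assistant reply.
        window = conversation[max(0, i + 1 - window_size): i + 1]
        # Hypothetical helper: an LLM judge decides whether the final assistant
        # message in the window is relevant and coherent given the preceding context.
        verdicts.append(llm_judge_is_coherent(window))
    # One plausible aggregation: the fraction of assistant turns judged coherent.
    return sum(verdicts) / len(verdicts) if verdicts else 0.0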

It can be used in the following way:

from opik.evaluation.metrics import ConversationalCoherenceMetric

# A conversation thread is a list of turns, each a dict with "role" and "content" keys.
conversation = [
    {
        "role": "user",
        "content": "I need to book a flight to New York and find a hotel.",
    },
    {
        "role": "assistant",
        "content": "I can help you with that. For flights to New York, what dates are you looking to travel?",
    },
    {
        "role": "user",
        "content": "Next weekend, from Friday to Sunday.",
    },
    {
        "role": "assistant",
        "content": "Great! I recommend checking airlines like Delta, United, or JetBlue for flights to New York next weekend. For hotels, what's your budget range and preferred location in New York?",
    },
    {
        "role": "user",
        "content": "Around $200 per night, preferably in Manhattan.",
    },
    {
        "role": "assistant",
        "content": "For Manhattan hotels around $200/night, you might want to look at options like Hotel Beacon, Pod 51, or CitizenM Times Square. These are well-rated options in that price range. Would you like more specific recommendations for any of these?",
    },
]

metric = ConversationalCoherenceMetric()
result = metric.score(conversation)

if result.scoring_failed:
    print(f"Scoring failed: {result.reason}")
else:
    print(result.value)

Asynchronous scoring is also supported via the ascore method.
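For example, a minimal async sketch reusing the conversation list defined above:

import asyncio

from opik.evaluation.metrics import ConversationalCoherenceMetric


async def evaluate_thread(conversation):
    metric = ConversationalCoherenceMetric()
    # ascore mirrors score but awaits the underlying LLM judge calls.
    result = await metric.ascore(conversation)
    if result.scoring_failed:
        print(f"Scoring failed: {result.reason}")
    else:
        print(result.value)


# `conversation` is the list of turns from the example above.
asyncio.run(evaluate_thread(conversation))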

SessionCompletenessQuality

This metric evaluates the completeness of a session within a conversational thread. It assesses whether the session addresses the intended context or purpose of the conversation.

The evaluation process begins by using an LLM to extract a list of high-level user intentions from the conversation turns. The same LLM is then used to assess whether each intention was addressed and/or fulfilled over the course of the conversation. It returns a score between 0.0 and 1.0, where higher values indicate better session completeness.
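Conceptually, the resulting score can be read as the share of extracted intentions that were fulfilled. The sketch below only illustrates that idea; it is not Opik's implementation, and both helper functions are hypothetical.

# Illustrative sketch of intention-based completeness scoring (not Opik's internal code).
def score_session_completeness(conversation):
    # Hypothetical helper: an LLM lists the user's high-level intentions.
    intentions = llm_extract_intentions(conversation)
    if not intentions:
        return 0.0
    # Hypothetical helper: the same LLM judges whether each intention was addressed or fulfilled.
    fulfilled = [llm_intention_fulfilled(conversation, intention) for intention in intentions]
    # The completeness score is the fraction of fulfilled intentions.
    return sum(fulfilled) / len(intentions)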

You can use it in the following way:

from opik.evaluation.metrics import SessionCompletenessQuality

conversation = [
    {
        "role": "user",
        "content": "I need to book a flight to New York and find a hotel.",
    },
    {
        "role": "assistant",
        "content": "I can help you with that. For flights to New York, what dates are you looking to travel?",
    },
    {
        "role": "user",
        "content": "Next weekend, from Friday to Sunday.",
    },
    {
        "role": "assistant",
        "content": "Great! I recommend checking airlines like Delta, United, or JetBlue for flights to New York next weekend. For hotels, what's your budget range and preferred location in New York?",
    },
    {
        "role": "user",
        "content": "Around $200 per night, preferably in Manhattan.",
    },
    {
        "role": "assistant",
        "content": "For Manhattan hotels around $200/night, you might want to look at options like Hotel Beacon, Pod 51, or CitizenM Times Square. These are well-rated options in that price range. Would you like more specific recommendations for any of these?",
    },
]

metric = SessionCompletenessQuality()
result = metric.score(conversation)

if result.scoring_failed:
    print(f"Scoring failed: {result.reason}")
else:
    print(result.value)

Asynchronous scoring is also supported via the ascore method.

UserFrustrationMetric

This metric evaluates the user frustration level within a conversation thread. It produces a heuristic score estimating the likelihood that the user experienced confusion, annoyance, or disengagement during the session, whether due to repetition, lack of adaptation, ignored intent signals, or a failure to conclude smoothly.

The UserFrustrationMetric class uses an LLM to analyze conversation data in sliding windows and produces a numerical score along with an optional reason for the calculated score. It provides both synchronous and asynchronous methods and supports customization through attributes such as the window size and whether a reason is included.

This metric can be used to monitor and track user frustration levels during conversations, providing insight into the user experience. It uses an LLM to score conversational windows and summarize the results, and returns a score between 0.0 and 1.0; the higher the score, the more frustrated the user is likely to be.
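For example, the construction below shows how such customization might look. The window_size and include_reason parameter names are assumptions based on the description above; check the SDK reference for the exact signature.

from opik.evaluation.metrics import UserFrustrationMetric

# Parameter names below are assumptions based on the options described above;
# consult the SDK reference for the authoritative signature.
metric = UserFrustrationMetric(
    model="gpt-4o",       # any LiteLLM-supported model
    window_size=10,       # number of turns evaluated in each sliding window
    include_reason=True,  # attach an explanation to the returned score
)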

It can be used to evaluate the user experience during a conversation, like this:

from opik.evaluation.metrics import UserFrustrationMetric

conversation = [
    {
        "role": "user",
        "content": "How do I center a div using CSS?",
    },
    {
        "role": "assistant",
        "content": "There are many ways to center elements in CSS.",
    },
    {
        "role": "user",
        "content": "Okay... can you show me one?",
    },
    {
        "role": "assistant",
        "content": "Sure. It depends on the context — are you centering horizontally, vertically, or both?",
    },
    {
        "role": "user",
        "content": "Both. Just give me a basic example.",
    },
    {
        "role": "assistant",
        "content": "Alright. You can use flexbox, grid, or margin auto. All of them work well.",
    },
    {
        "role": "user",
        "content": "Could you please just write the code?",
    },
    {
        "role": "assistant",
        "content": "Here’s one way:\n\n```css\ndiv {\n display: flex;\n}\n```\nThat sets it up for centering.",
    },
    {
        "role": "user",
        "content": "But this doesn’t even center anything! This is incomplete.",
    },
    {
        "role": "assistant",
        "content": "You're right. You also need `justify-content` and `align-items`.",
    },
    {
        "role": "user",
        "content": "Why didn’t you include those in the first place? This is wasting my time.",
    },
]

metric = UserFrustrationMetric()
result = metric.score(conversation)

if result.scoring_failed:
    print(f"Scoring failed: {result.reason}")
else:
    print(result.value)

Asynchronous scoring is also supported via the ascore method.

Next steps

Read more about conversational thread evaluation on the conversational threads evaluation page.