Conversational metrics
Conversational metrics score the quality of conversational threads that Opik collects across multiple traces. They also apply to conversations sourced outside of Opik when you want to analyse an assistant's performance across turns.
Opik provides two families of conversation metrics:
- Conversation-level heuristic metrics – lightweight analytics that inspect the transcript itself (for example, knowledge retention or degeneration). Use these when you only have the production conversation and no gold reference.
- LLM-as-a-judge conversation metrics – call an LLM to reason about conversation quality, user goal completion, or risk in the latest assistant responses.
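Both families consume the same input: the conversation transcript as a list of turns. The sketch below shows the shape these metrics operate on, assuming the role/content dictionary format used by chat-style APIs; check the metric signatures in the SDK reference for the exact field names.

```python
# A conversation thread expressed as a list of role/content turns.
# The exact field names are an assumption; verify them against the SDK reference.
conversation = [
    {"role": "user", "content": "I need a vegetarian dinner idea for four people."},
    {"role": "assistant", "content": "How about a mushroom risotto? It serves four easily."},
    {"role": "user", "content": "Sounds great, but my partner is allergic to mushrooms."},
    {"role": "assistant", "content": "Then try a lemon and asparagus risotto instead."},
]
```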
Conversation-level heuristic metrics
Knowledge Retention Metric
KnowledgeRetentionMetric operates on a conversation and assesses how well the last assistant message preserves facts the user introduced earlier. This is useful as a guardrail for agents that must respect instructions or keep important constraints.
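A minimal usage sketch, assuming the class is importable from opik.evaluation.metrics and exposes a score(conversation=...) method that returns a ScoreResult; align the details with your SDK version.

```python
from opik.evaluation.metrics import KnowledgeRetentionMetric  # assumed import path

metric = KnowledgeRetentionMetric()

# The user states a constraint early on; the final assistant message should still honour it.
conversation = [
    {"role": "user", "content": "My budget is 500 USD, please keep that in mind."},
    {"role": "assistant", "content": "Understood, I will only suggest options under 500 USD."},
    {"role": "user", "content": "Which laptop would you recommend?"},
    {"role": "assistant", "content": "A refurbished ThinkPad T14 fits comfortably under 500 USD."},
]

result = metric.score(conversation=conversation)
print(result.value, result.reason)
```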
Conversation Degeneration Metric
ConversationDegenerationMetric detects repetitive phrases, lack of variance, or low-entropy responses across a conversation. It is a lightweight guard against models that fall into loops or short-circuit the dialogue.
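A usage sketch under the same assumptions as above (import path and score signature):

```python
from opik.evaluation.metrics import ConversationDegenerationMetric  # assumed import path

metric = ConversationDegenerationMetric()

# Repetitive, low-variation assistant turns are the failure mode this metric is meant to flag.
conversation = [
    {"role": "user", "content": "Can you summarise the quarterly report?"},
    {"role": "assistant", "content": "Sure, I can help with that. I can help with that."},
    {"role": "user", "content": "Please go ahead."},
    {"role": "assistant", "content": "Sure, I can help with that. I can help with that."},
]

result = metric.score(conversation=conversation)
print(result.value)
```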
LLM-as-a-judge conversation metrics
These metrics use an LLM to evaluate the turns of the conversation between the user and the assistant. Opik ships a prompt template that wraps the transcript, criteria, and evaluation steps for you. By default, the gpt-5-nano model is used to evaluate responses, but you can switch to any LiteLLM-supported backend by setting the model parameter. You can learn more in the Customize models for LLM as a Judge metrics guide.
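For example, a sketch of overriding the judge model on one of the metrics described below; the model string is illustrative, and any LiteLLM-supported identifier works.

```python
from opik.evaluation.metrics import ConversationalCoherenceMetric

# Default judge model.
metric = ConversationalCoherenceMetric()

# Any LiteLLM-supported backend can be selected via the `model` parameter
# (the model name below is only an example).
metric_claude = ConversationalCoherenceMetric(model="anthropic/claude-3-5-sonnet-20241022")
```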
The GEval-based conversation adapters live in the opik.evaluation.metrics.conversation.llm_judges.g_eval_wrappers module. They accept the same keyword arguments as their underlying judges (e.g. model, temperature). See Conversation-level GEval metrics for a deeper walkthrough.
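If you are unsure which adapter classes your installed SDK version ships, a quick sketch to enumerate them (the module path comes from above; the listing itself is plain standard library):

```python
import inspect

from opik.evaluation.metrics.conversation.llm_judges import g_eval_wrappers

# Print the adapter classes exported by this Opik version rather than hard-coding names,
# since the exact set may change between releases.
adapters = [name for name, obj in inspect.getmembers(g_eval_wrappers, inspect.isclass)]
print(adapters)
```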
Need reference-based scores such as BLEU, ROUGE, or METEOR across conversations? Compose your own ConversationThreadMetric and reuse the single-turn heuristics (SentenceBLEU, ROUGE, METEOR) directly.
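A minimal sketch of such a composition, assuming ConversationThreadMetric is the conversation base class and that its score method receives the transcript and returns a ScoreResult; the import paths are assumptions, so align them with the custom-metric guide linked under Next steps.

```python
from opik.evaluation.metrics import SentenceBLEU
from opik.evaluation.metrics.conversation import ConversationThreadMetric  # assumed import path
from opik.evaluation.metrics.score_result import ScoreResult  # assumed import path


class LastTurnBLEU(ConversationThreadMetric):
    """Scores the final assistant message against a reference answer using sentence BLEU."""

    def __init__(self, reference: str, name: str = "last_turn_bleu"):
        super().__init__(name=name)
        self._reference = reference
        self._bleu = SentenceBLEU()

    def score(self, conversation, **kwargs) -> ScoreResult:
        # Pull the most recent assistant turn out of the transcript.
        last_assistant = next(
            (turn["content"] for turn in reversed(conversation) if turn["role"] == "assistant"),
            "",
        )
        bleu = self._bleu.score(output=last_assistant, reference=self._reference)
        return ScoreResult(name=self.name, value=bleu.value)
```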
ConversationalCoherenceMetric
ConversationalCoherenceMetric evaluates the logical flow of a dialogue. It builds a sliding window of turns and asks an LLM to rate whether the final assistant message is coherent and relevant. It returns a score between 0.0 and 1.0 and can optionally return detailed reasons.
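A usage sketch; the include_reason flag is an assumed parameter name for the optional detailed explanation, so check it against the SDK reference.

```python
from opik.evaluation.metrics import ConversationalCoherenceMetric

metric = ConversationalCoherenceMetric(include_reason=True)  # parameter name assumed

conversation = [
    {"role": "user", "content": "What's the weather like in Paris today?"},
    {"role": "assistant", "content": "It is sunny and around 22 degrees in Paris right now."},
    {"role": "user", "content": "Should I bring an umbrella this evening?"},
    {"role": "assistant", "content": "Light rain is possible after 8pm, so a small umbrella would not hurt."},
]

result = metric.score(conversation=conversation)
print(result.value)   # between 0.0 and 1.0
print(result.reason)  # populated when detailed reasons are enabled
```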
SessionCompletenessQuality
SessionCompletenessQuality captures whether a conversation fulfilled the user’s top-level goals. The metric asks an LLM to extract intentions from the thread, judge completion, and aggregate the results.
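A usage sketch, under the same import and signature assumptions as the other LLM-as-a-judge metrics:

```python
from opik.evaluation.metrics import SessionCompletenessQuality

metric = SessionCompletenessQuality()

# The user's single goal (booking a table) is fully resolved by the end of the thread.
conversation = [
    {"role": "user", "content": "Help me book a table for two tomorrow at 7pm."},
    {"role": "assistant", "content": "Sure, which restaurant and which city?"},
    {"role": "user", "content": "Luigi's in Boston."},
    {"role": "assistant", "content": "Done: a table for two at Luigi's in Boston, tomorrow at 7pm."},
]

result = metric.score(conversation=conversation)
print(result.value, result.reason)
```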
UserFrustrationMetric
UserFrustrationMetric estimates how likely it is that the user became frustrated (e.g. because of repetition or ignored requests). It scans windows of the conversation with an LLM and reports a value between 0.0 (not frustrated) and 1.0 (very frustrated).
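A usage sketch with a transcript that repeats an ignored request, which is the pattern the metric is designed to surface:

```python
from opik.evaluation.metrics import UserFrustrationMetric

metric = UserFrustrationMetric()

# The assistant keeps ignoring the cancellation request, which should push the score towards 1.0.
conversation = [
    {"role": "user", "content": "Cancel my subscription, please."},
    {"role": "assistant", "content": "Have you considered upgrading to our premium plan instead?"},
    {"role": "user", "content": "No. I asked you to cancel my subscription."},
    {"role": "assistant", "content": "Our premium plan includes many great features!"},
]

result = metric.score(conversation=conversation)
print(result.value)  # 0.0 = not frustrated, 1.0 = very frustrated
```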
Next steps
- Read more about conversational threads evaluation
- Learn how to create custom conversation metrics