Conversational metrics
The conversational metrics score the quality of conversational threads collected by Opik across multiple traces. They can also be used to score conversations collected by other means.
You can use the following metrics:
- ConversationalCoherenceMetric
- SessionCompletenessQuality
- UserFrustrationMetric
These metrics are based on the idea of using an LLM to evaluate the turns of the conversation between the user and the LLM. For this, a prompt template is used to generate the prompt sent to the judge LLM. By default, the `gpt-4o` model is used to evaluate responses, but you can change this to any model supported by LiteLLM by setting the `model` parameter. You can learn more about customizing models in the Customize models for LLM as a Judge metrics section.

Each score produced by these metrics comes with a detailed explanation (`result.reason`) that helps you understand why that particular score was assigned.
ConversationalCoherenceMetric
This metric assesses the coherence and relevance across a series of conversation turns by evaluating the consistency in responses, logical flow, and overall context maintenance. It evaluates whether the conversation session felt like a natural, adaptive, helpful interaction.
The `ConversationalCoherenceMetric` builds a sliding window of dialogue turns for each turn in the conversation. It then uses a language model to evaluate whether the final assistant message within each window is relevant and coherent in relation to the preceding conversational context.

It supports both synchronous and asynchronous operations to accommodate the model's operation type. It returns a score between `0.0` and `1.0`, where `0.0` indicates low coherence and `1.0` indicates high coherence.
It can be used in the following way:
Asynchronous scoring is also supported with the `ascore` scoring method.
SessionCompletenessQuality
This metric evaluates the completeness of a session within a conversational thread. It assesses whether the session addresses the intended context or purpose of the conversation.
The evaluation process begins by using an LLM to extract a list of high-level user
intentions from the conversation turns. The same LLM is then used to assess
whether each intention was addressed and/or fulfilled over the course of
the conversation. It returns a score between 0.0
and 1.0
, where higher values
indicate better session completeness.
You can use it in the following way:
Asynchronous scoring is also supported with the `ascore` scoring method.
UserFrustrationMetric
This metric evaluates the user frustration level within a conversation thread. It produces a heuristic score estimating the likelihood that the user experienced confusion, annoyance, or disengagement during the session, whether due to repetition, lack of adaptation, ignored intent signals, or a failure to conclude smoothly.
The `UserFrustrationMetric` class uses an LLM to analyze conversation data in sliding windows and produce a numerical score along with an optional reason for that score. It provides both synchronous and asynchronous scoring methods and supports customization through attributes such as the window size and whether a reason is included. This makes it useful for monitoring and tracking user frustration levels during conversations, giving insight into the user experience. It returns a score between `0.0` and `1.0`: the higher the score, the more frustrated the user is likely to be.
It can be used to evaluate the user experience during a conversation, like this:
Asynchronous scoring is also supported with the `ascore` scoring method.
Next steps
Read more about the conversational threads evaluation on the conversational threads evaluation page.