Conversational metrics

The conversational metrics can be used to score the quality of conversational threads that Opik collects across multiple traces, as well as conversations collected by other means.

You can use the following metrics:

| Metric | Description |
| --- | --- |
| ConversationalCoherenceMetric | Calculates the conversational coherence score for a given conversation thread. |
| SessionCompletenessQuality | Evaluates the completeness of a session within a conversational thread. |
| UserFrustrationMetric | Calculates the user frustration score for a given conversational thread. |

These metrics use an LLM to judge the turns of the conversation between the user and the LLM, with a prompt template that generates the judge prompt. By default, the gpt-4o model is used to evaluate responses, but you can change this to any model supported by LiteLLM by setting the model parameter. You can learn more about customizing models in the Customize models for LLM as a Judge metrics section.

Each score produced by these metrics comes with a detailed explanation (result.reason) that helps you understand why that particular score was assigned.
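For example, here is a minimal sketch of passing a different judge model and reading both the score and its explanation. The model name and the two-turn conversation are purely illustrative, and the relevant LiteLLM provider credentials are assumed to be configured:

from opik.evaluation.metrics import ConversationalCoherenceMetric

# Illustrative two-turn thread; real threads will typically be longer.
conversation = [
    {"role": "user", "content": "Can you help me plan a weekend trip?"},
    {"role": "assistant", "content": "Of course! Where would you like to go?"},
]

# Any model supported by LiteLLM can be passed via the `model` parameter.
metric = ConversationalCoherenceMetric(model="gpt-4o-mini")
result = metric.score(conversation)

if result.scoring_failed:
    print(f"Scoring failed: {result.reason}")
else:
    print(result.value)   # the numerical score
    print(result.reason)  # the judge's explanation for the score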

ConversationalCoherenceMetric

This metric assesses coherence and relevance across a series of conversation turns by evaluating the consistency of responses, logical flow, and overall context maintenance. It evaluates whether the conversation session felt like a natural, adaptive, and helpful interaction.

The ConversationalCoherenceMetric builds a sliding window of dialogue turns for each turn in the conversation. It then uses a language model to evaluate whether the final assistant message within each window is relevant and coherent in relation to the preceding conversational context.

It supports both synchronous and asynchronous operation to accommodate how the underlying model is invoked. It returns a score between 0.0 and 1.0, where 0.0 indicates low coherence and 1.0 indicates high coherence.
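Conceptually, the evaluation loop looks something like the sketch below. This is an illustrative simplification rather than Opik's actual implementation: the window size, the hypothetical llm_judge_is_coherent helper, and the averaging step are all assumptions.

# Illustrative sketch of sliding-window coherence scoring (not Opik's internal code).
def score_coherence(conversation, window_size=10):
    verdicts = []
    for i, turn in enumerate(conversation):
        if turn["role"] != "assistant":
            continue
        # Window of turns leading up to (and including) this assistant reply.
        window = conversation[max(0, i + 1 - window_size): i + 1]
        # Hypothetical helper: an LLM judge decides whether the final assistant
        # message in the window is relevant and coherent given the preceding context.
        verdicts.append(llm_judge_is_coherent(window))
    # One plausible aggregation: the fraction of assistant turns judged coherent.
    return sum(verdicts) / len(verdicts) if verdicts else 0.0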

It can be used in the following way:

from opik.evaluation.metrics import ConversationalCoherenceMetric

# A conversation thread is a list of turns, each a dict with "role" and "content" keys.
conversation = [
    {
        "role": "user",
        "content": "I need to book a flight to New York and find a hotel.",
    },
    {
        "role": "assistant",
        "content": "I can help you with that. For flights to New York, what dates are you looking to travel?",
    },
    {
        "role": "user",
        "content": "Next weekend, from Friday to Sunday.",
    },
    {
        "role": "assistant",
        "content": "Great! I recommend checking airlines like Delta, United, or JetBlue for flights to New York next weekend. For hotels, what's your budget range and preferred location in New York?",
    },
    {
        "role": "user",
        "content": "Around $200 per night, preferably in Manhattan.",
    },
    {
        "role": "assistant",
        "content": "For Manhattan hotels around $200/night, you might want to look at options like Hotel Beacon, Pod 51, or CitizenM Times Square. These are well-rated options in that price range. Would you like more specific recommendations for any of these?",
    },
]

metric = ConversationalCoherenceMetric()
result = metric.score(conversation)

if result.scoring_failed:
    print(f"Scoring failed: {result.reason}")
else:
    print(result.value)

Asynchronous scoring is also supported via the ascore method.
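For example, a minimal async sketch reusing the conversation list defined above:

import asyncio

from opik.evaluation.metrics import ConversationalCoherenceMetric


async def evaluate_thread(conversation):
    metric = ConversationalCoherenceMetric()
    # ascore mirrors score but awaits the underlying LLM judge calls.
    result = await metric.ascore(conversation)
    if result.scoring_failed:
        print(f"Scoring failed: {result.reason}")
    else:
        print(result.value)


# `conversation` is the list of turns from the example above.
asyncio.run(evaluate_thread(conversation))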

SessionCompletenessQuality

This metric evaluates the completeness of a session within a conversational thread. It assesses whether the session addresses the intended context or purpose of the conversation.

The evaluation process begins by using an LLM to extract a list of high-level user intentions from the conversation turns. The same LLM is then used to assess whether each intention was addressed and/or fulfilled over the course of the conversation. It returns a score between 0.0 and 1.0, where higher values indicate better session completeness.
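Conceptually, the resulting score can be read as the share of extracted intentions that were fulfilled. The sketch below only illustrates that idea; it is not Opik's implementation, and both helper functions are hypothetical.

# Illustrative sketch of intention-based completeness scoring (not Opik's internal code).
def score_session_completeness(conversation):
    # Hypothetical helper: an LLM lists the user's high-level intentions.
    intentions = llm_extract_intentions(conversation)
    if not intentions:
        return 0.0
    # Hypothetical helper: the same LLM judges whether each intention was addressed or fulfilled.
    fulfilled = [llm_intention_fulfilled(conversation, intention) for intention in intentions]
    # The completeness score is the fraction of fulfilled intentions.
    return sum(fulfilled) / len(intentions)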

You can use it in the following way:

from opik.evaluation.metrics import SessionCompletenessQuality

conversation = [
    {
        "role": "user",
        "content": "I need to book a flight to New York and find a hotel.",
    },
    {
        "role": "assistant",
        "content": "I can help you with that. For flights to New York, what dates are you looking to travel?",
    },
    {
        "role": "user",
        "content": "Next weekend, from Friday to Sunday.",
    },
    {
        "role": "assistant",
        "content": "Great! I recommend checking airlines like Delta, United, or JetBlue for flights to New York next weekend. For hotels, what's your budget range and preferred location in New York?",
    },
    {
        "role": "user",
        "content": "Around $200 per night, preferably in Manhattan.",
    },
    {
        "role": "assistant",
        "content": "For Manhattan hotels around $200/night, you might want to look at options like Hotel Beacon, Pod 51, or CitizenM Times Square. These are well-rated options in that price range. Would you like more specific recommendations for any of these?",
    },
]

metric = SessionCompletenessQuality()
result = metric.score(conversation)

if result.scoring_failed:
    print(f"Scoring failed: {result.reason}")
else:
    print(result.value)

Asynchronous scoring is also supported via the ascore method.

UserFrustrationMetric

This metric evaluates the user frustration level within a conversation thread. It produces a heuristic score estimating the likelihood that the user experienced confusion, annoyance, or disengagement during the session, whether due to repetition, lack of adaptation, ignored intent signals, or a failure to conclude smoothly.

The UserFrustrationMetric class uses an LLM to analyze conversation data in sliding windows and produces a numerical score along with an optional reason for the calculated score. It provides both synchronous and asynchronous methods and supports customization through attributes such as the window size and whether a reason is included.

This metric can be used to monitor and track user frustration levels during conversations, providing insight into the user experience. It uses an LLM to score conversational windows and summarize the results, and returns a score between 0.0 and 1.0; the higher the score, the more frustrated the user is likely to be.
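For example, the construction below shows how such customization might look. The window_size and include_reason parameter names are assumptions based on the description above; check the SDK reference for the exact signature.

from opik.evaluation.metrics import UserFrustrationMetric

# Parameter names below are assumptions based on the options described above;
# consult the SDK reference for the authoritative signature.
metric = UserFrustrationMetric(
    model="gpt-4o",       # any LiteLLM-supported model
    window_size=10,       # number of turns evaluated in each sliding window
    include_reason=True,  # attach an explanation to the returned score
)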

It can be used to evaluate the user experience during a conversation, like this:

from opik.evaluation.metrics import UserFrustrationMetric

conversation = [
    {
        "role": "user",
        "content": "How do I center a div using CSS?",
    },
    {
        "role": "assistant",
        "content": "There are many ways to center elements in CSS.",
    },
    {
        "role": "user",
        "content": "Okay... can you show me one?",
    },
    {
        "role": "assistant",
        "content": "Sure. It depends on the context — are you centering horizontally, vertically, or both?",
    },
    {
        "role": "user",
        "content": "Both. Just give me a basic example.",
    },
    {
        "role": "assistant",
        "content": "Alright. You can use flexbox, grid, or margin auto. All of them work well.",
    },
    {
        "role": "user",
        "content": "Could you please just write the code?",
    },
    {
        "role": "assistant",
        "content": "Here’s one way:\n\n```css\ndiv {\n display: flex;\n}\n```\nThat sets it up for centering.",
    },
    {
        "role": "user",
        "content": "But this doesn’t even center anything! This is incomplete.",
    },
    {
        "role": "assistant",
        "content": "You're right. You also need `justify-content` and `align-items`.",
    },
    {
        "role": "user",
        "content": "Why didn’t you include those in the first place? This is wasting my time.",
    },
]

metric = UserFrustrationMetric()
result = metric.score(conversation)

if result.scoring_failed:
    print(f"Scoring failed: {result.reason}")
else:
    print(result.value)

Asynchronous scoring is also supported via the ascore method.

Next steps

Read more about conversational thread evaluation on the conversational threads evaluation page.