Custom Conversation (Multi-turn) Metrics

Conversation metrics evaluate multi-turn conversations rather than single input-output pairs. These metrics are particularly useful for evaluating chatbots, conversational agents, and other multi-turn dialogue systems.

Understanding the Conversation Format

Conversation thread metrics work with a standardized conversation format:

```python
from typing import List, Dict, Literal

# Type definition
ConversationDict = Dict[Literal["role", "content"], str]
Conversation = List[ConversationDict]

# Example conversation
conversation = [
    {"role": "user", "content": "Hello! Can you help me?"},
    {"role": "assistant", "content": "Hi there! I'd be happy to help. What do you need?"},
    {"role": "user", "content": "I need information about Python"},
    {"role": "assistant", "content": "Python is a versatile programming language..."},
]
```
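Because each message is a plain dict, small helpers for slicing a thread are easy to write. As an illustration (the `messages_by_role` helper below is hypothetical, not part of opik), you might collect all messages from one side of the conversation like this:

```python
from typing import Dict, List, Literal

ConversationDict = Dict[Literal["role", "content"], str]
Conversation = List[ConversationDict]


def messages_by_role(conversation: Conversation, role: str) -> List[str]:
    """Collect message contents for a single role, in thread order."""
    return [msg["content"] for msg in conversation if msg["role"] == role]


conversation: Conversation = [
    {"role": "user", "content": "Hello! Can you help me?"},
    {"role": "assistant", "content": "Hi there! I'd be happy to help. What do you need?"},
    {"role": "user", "content": "I need information about Python"},
]
print(messages_by_role(conversation, "user"))
# ['Hello! Can you help me?', 'I need information about Python']
```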

Creating a Custom Conversation Metric

To create a custom conversation metric, subclass ConversationThreadMetric and implement the score method:

```python
from typing import Any
from opik.evaluation.metrics.conversation import conversation_thread_metric, types
from opik.evaluation.metrics import score_result


class ConversationLengthMetric(conversation_thread_metric.ConversationThreadMetric):
    """
    A simple metric that counts the number of conversation turns.
    """

    def __init__(self, name: str = "conversation_length_score"):
        super().__init__(name)

    def score(
        self, conversation: types.Conversation, **kwargs: Any
    ) -> score_result.ScoreResult:
        """
        Score based on conversation length.

        Args:
            conversation: List of conversation messages with 'role' and 'content'.
            **kwargs: Additional arguments (ignored).
        """
        # Count assistant responses (each represents one conversation turn)
        num_turns = sum(1 for msg in conversation if msg["role"] == "assistant")

        return score_result.ScoreResult(
            name=self.name,
            value=num_turns,
            reason=f"Conversation has {num_turns} turns",
        )
```
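The scoring logic itself is ordinary Python, so you can sanity-check it without the opik runtime. A minimal standalone sketch of the same turn-counting rule (the `count_turns` function is hypothetical, written here only to mirror the metric above):

```python
from typing import Dict, List


def count_turns(conversation: List[Dict[str, str]]) -> int:
    """One turn per assistant reply, mirroring ConversationLengthMetric."""
    return sum(1 for msg in conversation if msg["role"] == "assistant")


conversation = [
    {"role": "user", "content": "Hello! Can you help me?"},
    {"role": "assistant", "content": "Hi there! I'd be happy to help."},
    {"role": "user", "content": "I need information about Python"},
    {"role": "assistant", "content": "Python is a versatile programming language..."},
]
print(count_turns(conversation))  # 2
```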

Using Custom Conversation Metrics

You can use this metric with evaluate_threads:

```python
from opik.evaluation import evaluate_threads

# Initialize the metric
conversation_length_metric = ConversationLengthMetric()

# Evaluate threads in your project
results = evaluate_threads(
    project_name="my_chatbot_project",
    filter_string='status = "inactive"',
    eval_project_name="chatbot_evaluation",
    metrics=[conversation_length_metric],
    trace_input_transform=lambda x: x["input"],
    trace_output_transform=lambda x: x["output"],
)
```

For more details on evaluating conversation threads, see the Evaluate Threads guide.

Next Steps