
Custom Conversation (Multi-turn) Metrics

Conversation metrics evaluate multi-turn conversations rather than single input-output pairs. They are particularly useful for evaluating chatbots, conversational agents, and other multi-turn dialogue systems.

Understanding the Conversation Format

Conversation thread metrics work with a standardized conversation format:

```python
from typing import List, Dict, Literal

# Type definition
ConversationDict = Dict[Literal["role", "content"], str]
Conversation = List[ConversationDict]

# Example conversation
conversation = [
    {"role": "user", "content": "Hello! Can you help me?"},
    {"role": "assistant", "content": "Hi there! I'd be happy to help. What do you need?"},
    {"role": "user", "content": "I need information about Python"},
    {"role": "assistant", "content": "Python is a versatile programming language..."},
]
```

Creating a Custom Conversation Metric

To create a custom conversation metric, subclass ConversationThreadMetric and implement the score method:

```python
from typing import Any
from opik.evaluation.metrics import score_result
from opik.evaluation.metrics.conversation import (
    ConversationThreadMetric,
    types as conversation_types,
)


class ConversationLengthMetric(ConversationThreadMetric):
    """
    A simple metric that counts the number of conversation turns.
    """

    def __init__(self, name: str = "conversation_length_score"):
        super().__init__(name)

    def score(
        self, conversation: conversation_types.Conversation, **kwargs: Any
    ) -> score_result.ScoreResult:
        """
        Score based on conversation length.

        Args:
            conversation: List of conversation messages with 'role' and 'content'.
            **kwargs: Additional arguments (ignored).
        """
        # Count assistant responses (each represents one conversation turn)
        num_turns = sum(1 for msg in conversation if msg["role"] == "assistant")

        return score_result.ScoreResult(
            name=self.name,
            value=num_turns,
            reason=f"Conversation has {num_turns} turns",
        )
```

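As a quick sanity check, you can call the metric directly on the example conversation defined earlier:

```python
# Score the example conversation from the "Understanding the Conversation Format" section
metric = ConversationLengthMetric()
result = metric.score(conversation)

print(result.value)   # 2 (the example has two assistant messages)
print(result.reason)  # "Conversation has 2 turns"
```
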
Advanced Example: LLM-as-a-Judge Conversation Metric

For more sophisticated evaluation, you can use an LLM to judge conversation quality. This pattern is particularly useful when you need nuanced assessment of conversation attributes like helpfulness, coherence, or tone.

Here’s an example that evaluates the quality of assistant responses:

Step 1: Define the Output Schema

```python
import pydantic


class ConversationQualityScore(pydantic.BaseModel):
    """Schema for LLM judge output."""
    score_value: float  # Score between 0.0 and 1.0
    reason: str  # Explanation for the score

    __hash__ = object.__hash__
```

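Since this is a standard Pydantic model, you can check the schema against a hand-written payload before wiring it into the metric; the dictionary below is just an illustrative stand-in for parsed judge output:

```python
# Illustrative payload standing in for the judge's parsed JSON output
raw_judge_output = {"score_value": 0.85, "reason": "Responses are clear and on-topic."}

parsed = ConversationQualityScore.model_validate(raw_judge_output)
print(parsed.score_value, parsed.reason)  # 0.85 Responses are clear and on-topic.
```
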
Step 2: Create the Evaluation Prompt

```python
def create_evaluation_prompt(conversation: list) -> str:
    """
    Create a prompt that asks the LLM to evaluate conversation quality.
    """
    return f"""Evaluate the quality of the assistant's responses in this conversation.
Consider the following criteria:
1. Helpfulness: Does the assistant provide useful, relevant information?
2. Clarity: Are the responses clear and easy to understand?
3. Consistency: Does the assistant maintain context across turns?
4. Professionalism: Is the tone appropriate and respectful?

Return a JSON object with:
- score_value: A number between 0.0 (poor) and 1.0 (excellent)
- reason: A brief explanation of your assessment

Conversation:
{conversation}

Your evaluation (JSON only):
"""
```

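Interpolating the raw Python list works, but the transcript the judge sees can be noisy. If you prefer a cleaner rendering, one option (a hypothetical helper introduced here for illustration, not part of Opik) is to format the messages as role-prefixed lines and interpolate that string instead:

```python
def format_conversation(conversation: list) -> str:
    """Optional helper: render messages as readable 'role: content' lines."""
    return "\n".join(f"{msg['role']}: {msg['content']}" for msg in conversation)


print(format_conversation(conversation))
# user: Hello! Can you help me?
# assistant: Hi there! I'd be happy to help. What do you need?
# ...
```
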
Step 3: Implement the Metric

```python
import logging
from typing import Optional, Union, Any
import pydantic

from opik import exceptions
from opik.evaluation.metrics import score_result
from opik.evaluation.metrics.conversation import (
    ConversationThreadMetric,
    types as conversation_types,
)
from opik.evaluation.metrics.llm_judges import parsing_helpers
from opik.evaluation.models import base_model, models_factory

LOGGER = logging.getLogger(__name__)


class ConversationQualityMetric(ConversationThreadMetric):
    """
    An LLM-as-judge metric that evaluates conversation quality.

    Args:
        model: The LLM to use as a judge (e.g., "gpt-4", "claude-3-5-sonnet-20241022").
            If None, uses the default model.
        name: The name of this metric.
        track: Whether to track the metric in Opik.
        project_name: Optional project name for tracking.
    """

    def __init__(
        self,
        model: Optional[Union[str, base_model.OpikBaseModel]] = None,
        name: str = "conversation_quality_score",
        track: bool = True,
        project_name: Optional[str] = None,
    ):
        super().__init__(name=name, track=track, project_name=project_name)
        self._init_model(model)

    def _init_model(
        self, model: Optional[Union[str, base_model.OpikBaseModel]]
    ) -> None:
        """Initialize the LLM model for judging."""
        if isinstance(model, base_model.OpikBaseModel):
            self._model = model
        else:
            # Get model from factory (supports various providers via LiteLLM)
            self._model = models_factory.get(model_name=model)

    def score(
        self,
        conversation: conversation_types.Conversation,
        **kwargs: Any,
    ) -> score_result.ScoreResult:
        """
        Evaluate the conversation quality using an LLM judge.

        Args:
            conversation: List of conversation messages.
            **kwargs: Additional arguments (ignored).

        Returns:
            ScoreResult with value between 0.0 and 1.0.
        """
        try:
            # Create the evaluation prompt
            llm_query = create_evaluation_prompt(conversation)

            # Call the LLM with structured output
            model_output = self._model.generate_string(
                input=llm_query,
                response_format=ConversationQualityScore,
            )

            # Parse the LLM response
            score_data = self._parse_llm_output(model_output)

            # Ensure score is within valid range [0.0, 1.0]
            validated_score = max(0.0, min(1.0, score_data.score_value))

            return score_result.ScoreResult(
                name=self.name,
                value=validated_score,
                reason=score_data.reason,
            )

        except Exception as e:
            LOGGER.error(f"Failed to calculate conversation quality: {e}")
            raise exceptions.MetricComputationError(
                f"Failed to calculate conversation quality: {e}"
            ) from e

    def _parse_llm_output(self, model_output: str) -> ConversationQualityScore:
        """Parse and validate the LLM's output."""
        try:
            # Extract JSON from the model output
            dict_content = parsing_helpers.extract_json_content_or_raise(
                model_output
            )

            # Validate against schema
            return ConversationQualityScore.model_validate(dict_content)

        except pydantic.ValidationError as e:
            LOGGER.warning(
                f"Failed to parse LLM output: {model_output}, error: {e}",
                exc_info=True,
            )
            raise
```

Step 4: Use the Metric

```python
from opik.evaluation import evaluate_threads

# Initialize the metric with your preferred judge model
quality_metric = ConversationQualityMetric(
    model="gpt-4o",  # or "claude-3-5-sonnet-20241022", etc.
    name="conversation_quality",
)

# Evaluate threads in your project
results = evaluate_threads(
    project_name="my_chatbot_project",
    eval_project_name="quality_evaluation",
    metrics=[quality_metric],
)
```

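Before evaluating whole projects, you can also run the judge on a single conversation to confirm the prompt and parsing work end to end. This makes a real LLM call, so it assumes credentials for the chosen judge model (for example, an OpenAI API key for gpt-4o) are configured:

```python
# One-off check against the example conversation (performs a real LLM call)
result = quality_metric.score(conversation)
print(f"{result.value:.2f} - {result.reason}")
```
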
Key Patterns in LLM-as-Judge Metrics

When building LLM-as-judge metrics, follow these best practices:

  1. Structured Output: Use Pydantic models to ensure consistent LLM responses
  2. Clear Prompts: Provide specific evaluation criteria to the judge
  3. Error Handling: Wrap LLM calls in try-except blocks with proper logging (a fallback-score variant is sketched after this list)
  4. Model Flexibility: Allow users to specify their preferred judge model
  5. Reason Field: Always include an explanation for transparency

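For example, the error-handling pattern can be tuned to how you want failures to surface: the class above raises MetricComputationError, while the sketch below (a wrapper introduced here purely for illustration, not part of the Opik API) logs the failure and returns a neutral score instead:

```python
import logging

from opik.evaluation.metrics import score_result

LOGGER = logging.getLogger(__name__)


def score_with_fallback(
    metric: ConversationQualityMetric, conversation: list
) -> score_result.ScoreResult:
    """Run the judge, but return a neutral score if the call or parsing fails."""
    try:
        return metric.score(conversation)
    except Exception as e:  # the metric raises MetricComputationError on failure
        LOGGER.warning(f"Judge failed, returning fallback score: {e}")
        return score_result.ScoreResult(
            name=metric.name,
            value=0.0,
            reason=f"Evaluation failed: {e}",
        )
```
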
Using Custom Conversation Metrics

You can use custom metrics with evaluate_threads:

```python
from opik.evaluation import evaluate_threads

# Initialize your metrics
conversation_length_metric = ConversationLengthMetric()
quality_metric = ConversationQualityMetric(model="gpt-4o")

# Evaluate threads in your project
results = evaluate_threads(
    project_name="my_chatbot_project",
    filter_string='status = "inactive"',
    eval_project_name="chatbot_evaluation",
    metrics=[conversation_length_metric, quality_metric],
    trace_input_transform=lambda x: x["input"],
    trace_output_transform=lambda x: x["output"],
)
```

For more details on evaluating conversation threads, see the Evaluate Threads guide.

Next Steps