Custom Conversation (Multi-turn) Metrics

Conversation metrics evaluate multi-turn conversations rather than single input-output pairs. They are particularly useful for evaluating chatbots, conversational agents, and other multi-turn dialogue systems.

Understanding the Conversation Format

Conversation thread metrics work with a standardized conversation format, a list of messages in which each message is a dictionary with "role" and "content" keys:

from typing import List, Dict, Literal

# Type definition
ConversationDict = Dict[Literal["role", "content"], str]
Conversation = List[ConversationDict]

# Example conversation
conversation = [
    {"role": "user", "content": "Hello! Can you help me?"},
    {"role": "assistant", "content": "Hi there! I'd be happy to help. What do you need?"},
    {"role": "user", "content": "I need information about Python"},
    {"role": "assistant", "content": "Python is a versatile programming language..."},
]
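
Because the format is a plain list of dictionaries, malformed messages are easy to pass in by accident. The following is a minimal sketch of a shape check that could run before scoring. The helper name `validate_conversation` and the `ALLOWED_ROLES` set are hypothetical and not part of Opik; the sketch assumes only the "user" and "assistant" roles shown above.

```python
from typing import Dict, List

Conversation = List[Dict[str, str]]

# Assumption for this sketch: only the two roles shown in the example appear.
ALLOWED_ROLES = {"user", "assistant"}


def validate_conversation(conversation: Conversation) -> None:
    """Raise ValueError if any message does not match the expected shape."""
    for index, message in enumerate(conversation):
        # Each message must carry exactly the two documented keys.
        if set(message) != {"role", "content"}:
            raise ValueError(
                f"Message {index} must have exactly 'role' and 'content' keys"
            )
        if message["role"] not in ALLOWED_ROLES:
            raise ValueError(
                f"Message {index} has unexpected role {message['role']!r}"
            )
        if not isinstance(message["content"], str):
            raise ValueError(f"Message {index} content must be a string")


# A well-formed conversation passes silently.
validate_conversation([
    {"role": "user", "content": "Hello! Can you help me?"},
    {"role": "assistant", "content": "Hi there! What do you need?"},
])
```

If your application also records other roles, extend `ALLOWED_ROLES` accordingly.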

Creating a Custom Conversation Metric

To create a custom conversation metric, subclass ConversationThreadMetric and implement the score method:

from typing import Any
from opik.evaluation.metrics import score_result
from opik.evaluation.metrics.conversation import (
    ConversationThreadMetric,
    types as conversation_types,
)


class ConversationLengthMetric(ConversationThreadMetric):
    """
    A simple metric that counts the number of conversation turns.
    """

    def __init__(self, name: str = "conversation_length_score"):
        super().__init__(name)

    def score(
        self, conversation: conversation_types.Conversation, **kwargs: Any
    ) -> score_result.ScoreResult:
        """
        Score based on conversation length.

        Args:
            conversation: List of conversation messages with 'role' and 'content'.
            **kwargs: Additional arguments (ignored).
        """
        # Count assistant responses (each represents one conversation turn)
        num_turns = sum(1 for msg in conversation if msg["role"] == "assistant")

        return score_result.ScoreResult(
            name=self.name,
            value=num_turns,
            reason=f"Conversation has {num_turns} turns"
        )
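
The counting logic inside score can be checked on its own before it is wrapped in a metric class. The sketch below uses only the standard library; the function name `count_assistant_turns` is hypothetical and introduced here for illustration.

```python
from typing import Dict, List


def count_assistant_turns(conversation: List[Dict[str, str]]) -> int:
    """Count assistant messages, treating each one as a conversation turn."""
    return sum(1 for msg in conversation if msg["role"] == "assistant")


conversation = [
    {"role": "user", "content": "Hello! Can you help me?"},
    {"role": "assistant", "content": "Hi there! I'd be happy to help. What do you need?"},
    {"role": "user", "content": "I need information about Python"},
    {"role": "assistant", "content": "Python is a versatile programming language..."},
]

num_turns = count_assistant_turns(conversation)
print(f"Conversation has {num_turns} turns")  # prints: Conversation has 2 turns
```

Keeping the scoring logic in a plain function like this makes it straightforward to unit test without any Opik configuration.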

Advanced Example: LLM-as-a-Judge Conversation Metric

For more sophisticated evaluation, you can use an LLM to judge conversation quality. This pattern is particularly useful when you need nuanced assessment of conversation attributes like helpfulness, coherence, or tone.

Here’s an example that evaluates the quality of assistant responses:

Step 1: Define the Output Schema

import pydantic

class ConversationQualityScore(pydantic.BaseModel):
    """Schema for LLM judge output."""
    score_value: float  # Score between 0.0 and 1.0
    reason: str  # Explanation for the score

    __hash__ = object.__hash__

Step 2: Create the Evaluation Prompt

def create_evaluation_prompt(conversation: list) -> str:
    """
    Create a prompt that asks the LLM to evaluate conversation quality.
    """
    return f"""Evaluate the quality of the assistant's responses in this conversation.
Consider the following criteria:
1. Helpfulness: Does the assistant provide useful, relevant information?
2. Clarity: Are the responses clear and easy to understand?
3. Consistency: Does the assistant maintain context across turns?
4. Professionalism: Is the tone appropriate and respectful?

Return a JSON object with:
- score_value: A number between 0.0 (poor) and 1.0 (excellent)
- reason: A brief explanation of your assessment

Conversation:
{conversation}

Your evaluation (JSON only):
"""
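
The prompt above interpolates the conversation list directly, so the judge sees Python's printed form of a list of dictionaries. One possible alternative is to render a plain transcript first and interpolate that instead. The helper below is a sketch under that assumption; `format_conversation` is a hypothetical name, not an Opik function.

```python
from typing import Dict, List


def format_conversation(conversation: List[Dict[str, str]]) -> str:
    """Render each message as 'role: content' on its own line."""
    return "\n".join(f"{msg['role']}: {msg['content']}" for msg in conversation)


transcript = format_conversation([
    {"role": "user", "content": "Hello! Can you help me?"},
    {"role": "assistant", "content": "Hi there! What do you need?"},
])
print(transcript)
```

Whether a transcript or the raw list works better may depend on the judge model, so it is worth comparing both on a few sample threads.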

Step 3: Implement the Metric

import logging
from typing import Optional, Union, Any
import pydantic

from opik import exceptions
from opik.evaluation.metrics import score_result
from opik.evaluation.metrics.conversation import (
    ConversationThreadMetric,
    types as conversation_types,
)
from opik.evaluation.metrics.llm_judges import parsing_helpers
from opik.evaluation.models import base_model, models_factory

LOGGER = logging.getLogger(__name__)


class ConversationQualityMetric(ConversationThreadMetric):
    """
    An LLM-as-judge metric that evaluates conversation quality.

    Args:
        model: The LLM to use as a judge (e.g., "gpt-4", "claude-3-5-sonnet-20241022").
               If None, uses the default model.
        name: The name of this metric.
        track: Whether to track the metric in Opik.
        project_name: Optional project name for tracking.
    """

    def __init__(
        self,
        model: Optional[Union[str, base_model.OpikBaseModel]] = None,
        name: str = "conversation_quality_score",
        track: bool = True,
        project_name: Optional[str] = None,
    ):
        super().__init__(name=name, track=track, project_name=project_name)
        self._init_model(model)

    def _init_model(
        self, model: Optional[Union[str, base_model.OpikBaseModel]]
    ) -> None:
        """Initialize the LLM model for judging."""
        if isinstance(model, base_model.OpikBaseModel):
            self._model = model
        else:
            # Get model from factory (supports various providers via LiteLLM)
            self._model = models_factory.get(model_name=model)

    def score(
        self,
        conversation: conversation_types.Conversation,
        **kwargs: Any,
    ) -> score_result.ScoreResult:
        """
        Evaluate the conversation quality using an LLM judge.

        Args:
            conversation: List of conversation messages.
            **kwargs: Additional arguments (ignored).

        Returns:
            ScoreResult with value between 0.0 and 1.0.
        """
        try:
            # Create the evaluation prompt
            llm_query = create_evaluation_prompt(conversation)

            # Call the LLM with structured output
            model_output = self._model.generate_string(
                input=llm_query,
                response_format=ConversationQualityScore,
            )

            # Parse the LLM response
            score_data = self._parse_llm_output(model_output)

            # Ensure score is within valid range [0.0, 1.0]
            validated_score = max(0.0, min(1.0, score_data.score_value))

            return score_result.ScoreResult(
                name=self.name,
                value=validated_score,
                reason=score_data.reason,
            )

        except Exception as e:
            LOGGER.error(f"Failed to calculate conversation quality: {e}")
            raise exceptions.MetricComputationError(
                f"Failed to calculate conversation quality: {e}"
            ) from e

    def _parse_llm_output(self, model_output: str) -> ConversationQualityScore:
        """Parse and validate the LLM's output."""
        try:
            # Extract JSON from the model output
            dict_content = parsing_helpers.extract_json_content_or_raise(
                model_output
            )

            # Validate against schema
            return ConversationQualityScore.model_validate(dict_content)

        except pydantic.ValidationError as e:
            LOGGER.warning(
                f"Failed to parse LLM output: {model_output}, error: {e}",
                exc_info=True,
            )
            raise

Step 4: Use the Metric

from opik.evaluation import evaluate_threads

# Initialize the metric with your preferred judge model
quality_metric = ConversationQualityMetric(
    model="gpt-4o",  # or "claude-3-5-sonnet-20241022", etc.
    name="conversation_quality"
)

# Evaluate threads in your project
results = evaluate_threads(
    project_name="my_chatbot_project",
    eval_project_name="quality_evaluation",
    metrics=[quality_metric],
)

Key Patterns in LLM-as-Judge Metrics

When building LLM-as-judge metrics, follow these best practices:

Structured Output: Use Pydantic models to ensure consistent LLM responses
Clear Prompts: Provide specific evaluation criteria to the judge
Error Handling: Wrap LLM calls in try-except blocks with proper logging
Model Flexibility: Allow users to specify their preferred judge model
Reason Field: Always include an explanation for transparency
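
The structured-output and error-handling patterns can be illustrated without any LLM call. The sketch below is a standard-library approximation, not Opik's parsing helper: it takes the span from the first "{" to the last "}" in the judge's raw text, parses it as JSON, clamps the score to [0.0, 1.0], and raises ValueError on anything unusable. The name `parse_judge_output` is hypothetical.

```python
import json
from typing import Tuple


def parse_judge_output(raw: str) -> Tuple[float, str]:
    """Parse raw judge text into a (clamped score, reason) pair."""
    # Judges sometimes wrap the JSON in extra prose, so slice out the braces.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        raise ValueError(f"No JSON object found in judge output: {raw!r}")

    # json.JSONDecodeError is a subclass of ValueError, so it propagates as one.
    data = json.loads(raw[start : end + 1])

    try:
        score = float(data["score_value"])
        reason = str(data["reason"])
    except (KeyError, TypeError) as error:
        raise ValueError(f"Judge output is missing a required field: {error}") from error

    # Clamp out-of-range scores instead of rejecting them.
    return max(0.0, min(1.0, score)), reason


result = parse_judge_output(
    'Here is my evaluation: {"score_value": 1.4, "reason": "Clear and helpful."}'
)
print(result)  # prints: (1.0, 'Clear and helpful.')
```

Clamping is a design choice: it keeps a slightly out-of-range score usable, at the cost of hiding a judge that is not following the requested scale.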

Using Custom Conversation Metrics

You can use custom metrics with evaluate_threads:

from opik.evaluation import evaluate_threads

# Initialize your metrics
conversation_length_metric = ConversationLengthMetric()
quality_metric = ConversationQualityMetric(model="gpt-4o")

# Evaluate threads in your project.
# `evaluate_threads` runs against every thread matched by `filter_string`;
# use the filter to scope to the threads you actually want to score.
results = evaluate_threads(
    project_name="my_chatbot_project",
    filter_string='thread_id contains "user-session"',
    eval_project_name="chatbot_evaluation",
    metrics=[conversation_length_metric, quality_metric],
    trace_input_transform=lambda x: x["input"],
    trace_output_transform=lambda x: x["output"],
)
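
The two transform arguments pull the message text out of each trace's input and output payloads. The sketch below illustrates that idea only and is not Opik's internal code. It assumes each trace is a dictionary with "input" and "output" payloads shaped as shown, and that each trace contributes one user message and one assistant message; `traces_to_conversation` is a hypothetical helper.

```python
from typing import Any, Callable, Dict, List


def traces_to_conversation(
    traces: List[Dict[str, Any]],
    input_transform: Callable[[Any], str],
    output_transform: Callable[[Any], str],
) -> List[Dict[str, str]]:
    """Build a role/content conversation from a thread's traces, in order."""
    conversation: List[Dict[str, str]] = []
    for trace in traces:
        # The trace input becomes the user message for this turn.
        conversation.append(
            {"role": "user", "content": input_transform(trace["input"])}
        )
        # The trace output becomes the assistant message for this turn.
        conversation.append(
            {"role": "assistant", "content": output_transform(trace["output"])}
        )
    return conversation


# Hypothetical trace payloads for one thread.
traces = [
    {"input": {"input": "Hello!"}, "output": {"output": "Hi, how can I help?"}},
    {"input": {"input": "Tell me about Python"}, "output": {"output": "Python is..."}},
]

conversation = traces_to_conversation(
    traces,
    input_transform=lambda x: x["input"],
    output_transform=lambda x: x["output"],
)
```

If your traces store the message text under different keys, adjust the two lambdas to match your payload shape.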

For more details on evaluating conversation threads, see the Evaluate Threads guide.

Next Steps

Learn about built-in conversation metrics
Read the Evaluate Threads guide