
Custom Conversation (Multi-turn) Metrics

Conversation metrics evaluate multi-turn conversations rather than single input-output pairs. They are particularly useful for evaluating chatbots, conversational agents, and other multi-turn dialogue systems.

Understanding the Conversation Format

Conversation thread metrics work with a standardized conversation format:

```python
from typing import List, Dict, Literal

# Type definition
ConversationDict = Dict[Literal["role", "content"], str]
Conversation = List[ConversationDict]

# Example conversation
conversation = [
    {"role": "user", "content": "Hello! Can you help me?"},
    {"role": "assistant", "content": "Hi there! I'd be happy to help. What do you need?"},
    {"role": "user", "content": "I need information about Python"},
    {"role": "assistant", "content": "Python is a versatile programming language..."},
]
```

Creating a Custom Conversation Metric

To create a custom conversation metric, subclass ConversationThreadMetric and implement the score method:

```python
from typing import Any
from opik.evaluation.metrics import score_result
from opik.evaluation.metrics.conversation import (
    ConversationThreadMetric,
    types as conversation_types,
)


class ConversationLengthMetric(ConversationThreadMetric):
    """
    A simple metric that counts the number of conversation turns.
    """

    def __init__(self, name: str = "conversation_length_score"):
        super().__init__(name)

    def score(
        self, conversation: conversation_types.Conversation, **kwargs: Any
    ) -> score_result.ScoreResult:
        """
        Score based on conversation length.

        Args:
            conversation: List of conversation messages with 'role' and 'content'.
            **kwargs: Additional arguments (ignored).
        """
        # Count assistant responses (each represents one conversation turn)
        num_turns = sum(1 for msg in conversation if msg["role"] == "assistant")

        return score_result.ScoreResult(
            name=self.name,
            value=num_turns,
            reason=f"Conversation has {num_turns} turns",
        )
```

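As a quick sanity check, you can call the metric directly on the example conversation defined earlier:

```python
# Score the example conversation from the "Understanding the Conversation Format" section
metric = ConversationLengthMetric()
result = metric.score(conversation)

print(result.value)   # 2 (the example has two assistant messages)
print(result.reason)  # "Conversation has 2 turns"
```
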
Advanced Example: LLM-as-a-Judge Conversation Metric

For more sophisticated evaluation, you can use an LLM to judge conversation quality. This pattern is particularly useful when you need nuanced assessment of conversation attributes like helpfulness, coherence, or tone.

Here’s an example that evaluates the quality of assistant responses:

Step 1: Define the Output Schema

```python
import pydantic


class ConversationQualityScore(pydantic.BaseModel):
    """Schema for LLM judge output."""
    score_value: float  # Score between 0.0 and 1.0
    reason: str  # Explanation for the score

    __hash__ = object.__hash__
```

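Since this is a standard Pydantic model, you can check the schema against a hand-written payload before wiring it into the metric; the dictionary below is just an illustrative stand-in for parsed judge output:

```python
# Illustrative payload standing in for the judge's parsed JSON output
raw_judge_output = {"score_value": 0.85, "reason": "Responses are clear and on-topic."}

parsed = ConversationQualityScore.model_validate(raw_judge_output)
print(parsed.score_value, parsed.reason)  # 0.85 Responses are clear and on-topic.
```
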
Step 2: Create the Evaluation Prompt

```python
def create_evaluation_prompt(conversation: list) -> str:
    """
    Create a prompt that asks the LLM to evaluate conversation quality.
    """
    return f"""Evaluate the quality of the assistant's responses in this conversation.
Consider the following criteria:
1. Helpfulness: Does the assistant provide useful, relevant information?
2. Clarity: Are the responses clear and easy to understand?
3. Consistency: Does the assistant maintain context across turns?
4. Professionalism: Is the tone appropriate and respectful?

Return a JSON object with:
- score_value: A number between 0.0 (poor) and 1.0 (excellent)
- reason: A brief explanation of your assessment

Conversation:
{conversation}

Your evaluation (JSON only):
"""
```

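Interpolating the raw Python list works, but the transcript the judge sees can be noisy. If you prefer a cleaner rendering, one option (a hypothetical helper introduced here for illustration, not part of Opik) is to format the messages as role-prefixed lines and interpolate that string instead:

```python
def format_conversation(conversation: list) -> str:
    """Optional helper: render messages as readable 'role: content' lines."""
    return "\n".join(f"{msg['role']}: {msg['content']}" for msg in conversation)


print(format_conversation(conversation))
# user: Hello! Can you help me?
# assistant: Hi there! I'd be happy to help. What do you need?
# ...
```
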
Step 3: Implement the Metric

```python
import logging
from typing import Optional, Union, Any
import pydantic

from opik import exceptions
from opik.evaluation.metrics import score_result
from opik.evaluation.metrics.conversation import (
    ConversationThreadMetric,
    types as conversation_types,
)
from opik.evaluation.metrics.llm_judges import parsing_helpers
from opik.evaluation.models import base_model, models_factory

LOGGER = logging.getLogger(__name__)


class ConversationQualityMetric(ConversationThreadMetric):
    """
    An LLM-as-judge metric that evaluates conversation quality.

    Args:
        model: The LLM to use as a judge (e.g., "gpt-4", "claude-3-5-sonnet-20241022").
            If None, uses the default model.
        name: The name of this metric.
        track: Whether to track the metric in Opik.
        project_name: Optional project name for tracking.
    """

    def __init__(
        self,
        model: Optional[Union[str, base_model.OpikBaseModel]] = None,
        name: str = "conversation_quality_score",
        track: bool = True,
        project_name: Optional[str] = None,
    ):
        super().__init__(name=name, track=track, project_name=project_name)
        self._init_model(model)

    def _init_model(
        self, model: Optional[Union[str, base_model.OpikBaseModel]]
    ) -> None:
        """Initialize the LLM model for judging."""
        if isinstance(model, base_model.OpikBaseModel):
            self._model = model
        else:
            # Get model from factory (supports various providers via LiteLLM)
            self._model = models_factory.get(model_name=model)

    def score(
        self,
        conversation: conversation_types.Conversation,
        **kwargs: Any,
    ) -> score_result.ScoreResult:
        """
        Evaluate the conversation quality using an LLM judge.

        Args:
            conversation: List of conversation messages.
            **kwargs: Additional arguments (ignored).

        Returns:
            ScoreResult with value between 0.0 and 1.0.
        """
        try:
            # Create the evaluation prompt
            llm_query = create_evaluation_prompt(conversation)

            # Call the LLM with structured output
            model_output = self._model.generate_string(
                input=llm_query,
                response_format=ConversationQualityScore,
            )

            # Parse the LLM response
            score_data = self._parse_llm_output(model_output)

            # Ensure score is within valid range [0.0, 1.0]
            validated_score = max(0.0, min(1.0, score_data.score_value))

            return score_result.ScoreResult(
                name=self.name,
                value=validated_score,
                reason=score_data.reason,
            )

        except Exception as e:
            LOGGER.error(f"Failed to calculate conversation quality: {e}")
            raise exceptions.MetricComputationError(
                f"Failed to calculate conversation quality: {e}"
            ) from e

    def _parse_llm_output(self, model_output: str) -> ConversationQualityScore:
        """Parse and validate the LLM's output."""
        try:
            # Extract JSON from the model output
            dict_content = parsing_helpers.extract_json_content_or_raise(
                model_output
            )

            # Validate against schema
            return ConversationQualityScore.model_validate(dict_content)

        except pydantic.ValidationError as e:
            LOGGER.warning(
                f"Failed to parse LLM output: {model_output}, error: {e}",
                exc_info=True,
            )
            raise
```

Step 4: Use the Metric

```python
from opik.evaluation import evaluate_threads

# Initialize the metric with your preferred judge model
quality_metric = ConversationQualityMetric(
    model="gpt-4o",  # or "claude-3-5-sonnet-20241022", etc.
    name="conversation_quality",
)

# Evaluate threads in your project
results = evaluate_threads(
    project_name="my_chatbot_project",
    eval_project_name="quality_evaluation",
    metrics=[quality_metric],
)
```

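Before evaluating whole projects, you can also run the judge on a single conversation to confirm the prompt and parsing work end to end. This makes a real LLM call, so it assumes credentials for the chosen judge model (for example, an OpenAI API key for gpt-4o) are configured:

```python
# One-off check against the example conversation (performs a real LLM call)
result = quality_metric.score(conversation)
print(f"{result.value:.2f} - {result.reason}")
```
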
Key Patterns in LLM-as-Judge Metrics

When building LLM-as-judge metrics, follow these best practices:

  1. Structured Output: Use Pydantic models to ensure consistent LLM responses
  2. Clear Prompts: Provide specific evaluation criteria to the judge
  3. Error Handling: Wrap LLM calls in try-except blocks with proper logging (a fallback-score variant is sketched after this list)
  4. Model Flexibility: Allow users to specify their preferred judge model
  5. Reason Field: Always include an explanation for transparency

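For example, the error-handling pattern can be tuned to how you want failures to surface: the class above raises MetricComputationError, while the sketch below (a wrapper introduced here purely for illustration, not part of the Opik API) logs the failure and returns a neutral score instead:

```python
import logging

from opik.evaluation.metrics import score_result

LOGGER = logging.getLogger(__name__)


def score_with_fallback(
    metric: ConversationQualityMetric, conversation: list
) -> score_result.ScoreResult:
    """Run the judge, but return a neutral score if the call or parsing fails."""
    try:
        return metric.score(conversation)
    except Exception as e:  # the metric raises MetricComputationError on failure
        LOGGER.warning(f"Judge failed, returning fallback score: {e}")
        return score_result.ScoreResult(
            name=metric.name,
            value=0.0,
            reason=f"Evaluation failed: {e}",
        )
```
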
Using Custom Conversation Metrics

You can use custom metrics with evaluate_threads:

```python
from opik.evaluation import evaluate_threads

# Initialize your metrics
conversation_length_metric = ConversationLengthMetric()
quality_metric = ConversationQualityMetric(model="gpt-4o")

# Evaluate threads in your project
results = evaluate_threads(
    project_name="my_chatbot_project",
    filter_string='status = "inactive"',
    eval_project_name="chatbot_evaluation",
    metrics=[conversation_length_metric, quality_metric],
    trace_input_transform=lambda x: x["input"],
    trace_output_transform=lambda x: x["output"],
)
```

For more details on evaluating conversation threads, see the Evaluate Threads guide.

Next Steps