ConversationThreadMetric¶

class opik.evaluation.metrics.conversation.conversation_thread_metric.ConversationThreadMetric(name: str | None = None, track: bool = True, project_name: str | None = None)¶

Bases: BaseMetric

Abstract base class for all conversation thread metrics. When creating a custom conversation metric, you should inherit from this class and implement the abstract methods.

Conversation metrics are designed to evaluate multi-turn conversations rather than single input-output pairs. They accept a conversation as a list of message dictionaries, where each message has a ‘role’ (either ‘user’ or ‘assistant’) and ‘content’.

Parameters:

name – The name of the metric. If not provided, uses the class name as default.
track – Whether to track the metric. Defaults to True.
project_name – Optional project name to track the metric in for the cases when there is no parent span/trace to inherit project name from.

Example

>>> from opik.evaluation.metrics.conversation import conversation_thread_metric, types
>>> from opik.evaluation.metrics import score_result
>>> from typing import Any
>>>
>>> class ConversationLengthMetric(conversation_thread_metric.ConversationThreadMetric):
>>>     def __init__(self, name: str = "conversation_length_score"):
>>>         super().__init__(name)
>>>
>>>     def score(self, conversation: types.Conversation, **kwargs: Any):
>>>         num_turns = sum(1 for msg in conversation if msg["role"] == "assistant")
>>>         return score_result.ScoreResult(
>>>             name=self.name,
>>>             value=num_turns,
>>>             reason=f"Conversation has {num_turns} turns"
>>>         )

score(conversation: List[Dict[Literal['role', 'content'], str]], **kwargs: Any) → ScoreResult | List[ScoreResult]¶

Evaluate a conversation and return a score.

Parameters:

conversation – A list of conversation messages. Each message is a dictionary with ‘role’ (either ‘user’ or ‘assistant’) and ‘content’ (the message text).
**kwargs – Additional keyword arguments that may be used by specific metric implementations.

Returns:

A ScoreResult object or list of ScoreResult objects containing the evaluation score, metric name, and optional reasoning.

async ascore(conversation: List[Dict[Literal['role', 'content'], str]], **kwargs: Any) → ScoreResult | List[ScoreResult]¶

Asynchronously evaluate a conversation and return a score.

This is the async version of the score method. By default, it calls the synchronous score method, but can be overridden for true async implementations.

Parameters:

conversation – A list of conversation messages. Each message is a dictionary with ‘role’ (either ‘user’ or ‘assistant’) and ‘content’ (the message text).
**kwargs – Additional keyword arguments that may be used by specific metric implementations.

Returns:

A ScoreResult object or list of ScoreResult objects containing the evaluation score, metric name, and optional reasoning.