Conversation metrics evaluate multi-turn conversations rather than single input-output pairs. These metrics are particularly useful for evaluating chatbots, conversational agents, and any multi-turn dialogue systems.
Conversation thread metrics work with a standardized conversation format:
To create a custom conversation metric, subclass ConversationThreadMetric and implement the score method:
For more sophisticated evaluation, you can use an LLM to judge conversation quality. This pattern is particularly useful when you need nuanced assessment of conversation attributes like helpfulness, coherence, or tone.
Here’s an example that evaluates the quality of assistant responses:
When building LLM-as-judge metrics, follow these best practices:
You can use custom metrics with evaluate_threads:
For more details on evaluating conversation threads, see the Evaluate Threads guide.