Custom Conversation (Multi-turn) Metrics
Conversation metrics evaluate multi-turn conversations rather than single input-output pairs. These metrics are particularly useful for evaluating chatbots, conversational agents, and other multi-turn dialogue systems.
Understanding the Conversation Format
Conversation thread metrics work with a standardized conversation format:
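Each thread is passed to the metric as a list of turns in the OpenAI-style chat format, where every turn is a dictionary with a role and a content key. The example thread below is a minimal sketch of that structure (the wording of the turns is illustrative):

```python
# A conversation thread: a list of turns in OpenAI-style chat format,
# where each turn is a dict with "role" and "content" keys.
conversation = [
    {"role": "user", "content": "How do I reset my password?"},
    {"role": "assistant", "content": "Click 'Forgot password' on the login page and follow the emailed link."},
    {"role": "user", "content": "I never received the email."},
    {"role": "assistant", "content": "Please check your spam folder, or I can resend it to a different address."},
]
```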
Creating a Custom Conversation Metric
To create a custom conversation metric, subclass ConversationThreadMetric and implement the score method:
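The sketch below shows the general shape of such a subclass. It assumes that ConversationThreadMetric and ScoreResult can be imported from opik.evaluation.metrics and that score receives the list-of-turns format shown above; check the SDK reference for the exact import paths in your version.

```python
from typing import Any, Dict, List

from opik.evaluation.metrics import ConversationThreadMetric, score_result


class UserMessageCountMetric(ConversationThreadMetric):
    """Toy heuristic metric: counts how many user turns the thread contains."""

    def __init__(self, name: str = "user_message_count"):
        super().__init__(name=name)

    def score(self, conversation: List[Dict[str, Any]], **kwargs) -> score_result.ScoreResult:
        # Count the user turns and report the count as the metric value.
        user_turns = [turn for turn in conversation if turn.get("role") == "user"]
        return score_result.ScoreResult(
            name=self.name,
            value=float(len(user_turns)),
            reason=f"The thread contains {len(user_turns)} user message(s).",
        )
```

The score method returns a ScoreResult carrying a numeric value and an optional reason string explaining it.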
Advanced Example: LLM-as-a-Judge Conversation Metric
For more sophisticated evaluation, you can use an LLM to judge conversation quality. This pattern is particularly useful when you need nuanced assessment of conversation attributes like helpfulness, coherence, or tone.
Here’s an example that evaluates the quality of assistant responses:
Step 1: Define the Output Schema
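The judge needs to return its verdict in a predictable shape. A Pydantic model along the lines of the one below (the class and field names are illustrative) captures both the numeric score and the explanation:

```python
from pydantic import BaseModel


class ConversationQualityResult(BaseModel):
    """Structured verdict returned by the judge LLM."""

    score: float  # 0.0 (very poor) to 1.0 (excellent)
    reason: str   # short explanation supporting the score
```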
Step 2: Create the Evaluation Prompt
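The prompt should spell out the criteria the judge applies and the exact JSON fields it must return, matching the schema from Step 1. The wording below is one possible phrasing:

```python
EVALUATION_PROMPT = """You are an expert evaluator of conversational AI assistants.

Assess the quality of the assistant's responses in the conversation below.
Consider helpfulness, relevance to the user's questions, coherence across turns,
and appropriateness of tone.

Conversation:
{conversation}

Respond with a JSON object containing exactly these fields:
- "score": a float between 0.0 (very poor) and 1.0 (excellent)
- "reason": a short explanation for the score
"""
```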
Step 3: Implement the Metric
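The metric formats the thread into the prompt, calls the judge model, and parses the structured reply. The sketch below uses litellm as the judge client and gpt-4o as the default model purely for illustration; any chat-completion client and model can be substituted, and the opik import path is an assumption to verify against the SDK reference.

```python
import json
import logging
from typing import Any, Dict, List

import litellm  # judge backend used here for illustration; any chat-completion client works

from opik.evaluation.metrics import ConversationThreadMetric, score_result

# ConversationQualityResult and EVALUATION_PROMPT are defined in Steps 1 and 2.

logger = logging.getLogger(__name__)


class ConversationQualityMetric(ConversationThreadMetric):
    """LLM-as-a-judge metric that rates overall assistant response quality."""

    def __init__(self, name: str = "conversation_quality", model: str = "gpt-4o"):
        super().__init__(name=name)
        self.model = model  # let callers pick their preferred judge model

    def score(self, conversation: List[Dict[str, Any]], **kwargs) -> score_result.ScoreResult:
        # Render the whole thread into the evaluation prompt.
        prompt = EVALUATION_PROMPT.format(
            conversation=json.dumps(conversation, indent=2)
        )
        try:
            response = litellm.completion(
                model=self.model,
                messages=[{"role": "user", "content": prompt}],
                response_format={"type": "json_object"},
            )
            # Validate the judge's reply against the Pydantic schema.
            verdict = ConversationQualityResult.model_validate_json(
                response.choices[0].message.content
            )
            return score_result.ScoreResult(
                name=self.name,
                value=verdict.score,
                reason=verdict.reason,
            )
        except Exception:
            # Log the failure and return a neutral score rather than crashing the run.
            logger.exception("Conversation quality judge call failed")
            return score_result.ScoreResult(
                name=self.name,
                value=0.0,
                reason="Failed to obtain a judgment from the LLM.",
            )
```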
Step 4: Use the Metric
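Once defined, the metric can be scored directly against a thread. The snippet below reuses the example conversation list from earlier:

```python
metric = ConversationQualityMetric(model="gpt-4o")

result = metric.score(conversation=conversation)
print(result.value)   # e.g. 0.85
print(result.reason)  # the judge's explanation
```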
Key Patterns in LLM-as-Judge Metrics
When building LLM-as-judge metrics, follow these best practices:
- Structured Output: Use Pydantic models to ensure consistent LLM responses
- Clear Prompts: Provide specific evaluation criteria to the judge
- Error Handling: Wrap LLM calls in try-except blocks with proper logging
- Model Flexibility: Allow users to specify their preferred judge model
- Reason Field: Always include an explanation for transparency
Using Custom Conversation Metrics
You can use custom metrics with evaluate_threads:
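A sketch of a typical call is shown below; the project names and filter string are placeholders, and the transform arguments map raw trace input/output to the user and assistant messages of each turn. Verify the parameter names against the SDK reference for your version.

```python
from opik.evaluation import evaluate_threads

results = evaluate_threads(
    project_name="my-chatbot-project",           # project whose threads are evaluated
    filter_string='status = "inactive"',         # only score threads matching this filter
    eval_project_name="my-chatbot-evaluation",   # project where evaluation results are logged
    metrics=[ConversationQualityMetric(model="gpt-4o")],
    trace_input_transform=lambda x: x["input"],    # extract the user message from a trace
    trace_output_transform=lambda x: x["output"],  # extract the assistant message from a trace
)
```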
For more details on evaluating conversation threads, see the Evaluate Threads guide.
Next Steps
- Learn about built-in conversation metrics
- Read the Evaluate Threads guide