Conversation-level GEval Metrics
Opik ships adapters that wrap GEval-based judges so they can score entire conversation threads. Each adapter implements ConversationThreadMetric, which means you can plug them into evaluate_threads or any pipeline that operates on chat transcripts.
[!NOTE]
The adapters keep the underlying judge’s reasoning. For example, ConversationComplianceRiskMetric returns the same detailed rationale as ComplianceRiskJudge, but scopes the analysis to the conversation context.
Usage
Each adapter accepts the same keyword arguments as the underlying GEval judge (model, track, temperature, project_name, etc.).
ConversationDialogueHelpfulnessMetricAnswers the question “did the assistant ultimately help the user?” after considering the exchange so far. The judge weighs context handed over by the user, detects if the assistant ignored clarifications, and rewards concrete, actionable replies. Use it to track assistant quality in customer-support or onboarding flows where the last response is the hand-off back to the user.
ConversationSummarizationConsistencyMetricValidates that an auto-generated summary sticks to the facts shared in the transcript. It is particularly useful when you summarise long support chats or sales calls and need confidence that the synopsis won’t fabricate commitments or omit key blockers. Feed it alongside human-written spot checks to prioritise which summaries require review.
ConversationSummarizationCoherenceMetricLooks at the same summary through a writing-quality lens: is it organised, easy to skim, and logically grouped? Combine it with the consistency judge to ensure summaries are both faithful and readable before they populate CRM notes or ticket backlogs.
ConversationPromptUncertaintyMetricPinpoints last-turn prompts that lack critical context (“Can you finish it?”) or contain conflicting instructions. Surfacing these cases lets you proactively ask the user for clarification or enrich the prompt with missing metadata before rerunning expensive evaluations.