Conversation-level GEval Metrics

Opik ships adapters that wrap GEval-based judges so they can score entire conversation threads. Each adapter implements ConversationThreadMetric, so you can plug it into evaluate_threads or any pipeline that operates on chat transcripts.

[!NOTE] The adapters keep the underlying judge’s reasoning. For example, ConversationComplianceRiskMetric returns the same detailed rationale as ComplianceRiskJudge, but scopes the analysis to the conversation context.

| Metric | Description | Underlying Judge |
| --- | --- | --- |
| ConversationComplianceRiskMetric | Flags non-factual / non-compliant statements in regulated contexts. | ComplianceRiskJudge |
| ConversationDialogueHelpfulnessMetric | Scores how helpful the final assistant reply is. | DialogueHelpfulnessJudge |
| ConversationQARelevanceMetric | Checks whether the answer directly addresses the user question. | QARelevanceJudge |
| ConversationSummarizationConsistencyMetric | Gauges how faithful a summary is to the source discussion. | SummarizationConsistencyJudge |
| ConversationSummarizationCoherenceMetric | Evaluates the coherence of a conversation summary. | SummarizationCoherenceJudge |
| ConversationPromptUncertaintyMetric | Estimates how ambiguous the prompt is for downstream models. | PromptUncertaintyJudge |

Usage

```python
from opik.evaluation.metrics import ConversationComplianceRiskMetric
from opik.evaluation import evaluate_threads

metrics = [ConversationComplianceRiskMetric(model="gpt-4o-mini")]

results = evaluate_threads(
    dataset="my_threads_dataset",
    metrics=metrics,
)
```

Each adapter accepts the same keyword arguments as the underlying GEval judge (model, track, temperature, project_name, etc.).
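As a minimal sketch of that shared-keyword pattern, the snippet below builds several adapters from one settings dictionary. The keyword names come from the sentence above; the values (temperature `0.0`, the project name `"thread-evals"`) and the helper name `build_thread_metrics` are illustrative assumptions, not documented defaults. It also assumes the other adapters are exported from the same module as `ConversationComplianceRiskMetric` in the Usage example.

```python
# Shared settings passed to every adapter. The keys are the keyword arguments
# named above; the values are illustrative assumptions, not documented defaults.
JUDGE_KWARGS = {
    "model": "gpt-4o-mini",
    "temperature": 0.0,
    "track": True,
    "project_name": "thread-evals",  # hypothetical project name
}


def build_thread_metrics():
    """Hypothetical helper: instantiate several adapters with identical settings."""
    # Imported lazily so the settings can be inspected without the Opik SDK installed.
    # Assumption: all three adapters live in opik.evaluation.metrics.
    from opik.evaluation.metrics import (
        ConversationComplianceRiskMetric,
        ConversationDialogueHelpfulnessMetric,
        ConversationQARelevanceMetric,
    )

    adapter_classes = [
        ConversationComplianceRiskMetric,
        ConversationDialogueHelpfulnessMetric,
        ConversationQARelevanceMetric,
    ]
    return [cls(**JUDGE_KWARGS) for cls in adapter_classes]
```

Keeping the settings in one place makes it harder for two judges in the same run to drift onto different models or temperatures.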

ConversationComplianceRiskMetric

Flags the latest assistant reply when it strays into non-compliant or risky territory (financial advice, medical guidance, KYC breaches, etc.). The underlying ComplianceRiskJudge reviews the full transcript but concentrates its verdict on the most recent assistant turn, making it ideal for inbox-style workflows where an agent hands off to a human reviewer. Pair it with automated routing so high-risk threads escalate immediately.
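The routing idea above can be sketched as a simple threshold check. This is a sketch under stated assumptions: it assumes the metric's score lies in [0, 1] with higher values meaning higher risk, which this page does not specify. `ThreadScore`, `route`, the queue names, the sample data, and the `0.7` threshold are all hypothetical.

```python
from dataclasses import dataclass

# Assumption: scores are normalised to [0, 1] and higher means riskier.
ESCALATION_THRESHOLD = 0.7  # hypothetical cut-off; tune it on your own data


@dataclass
class ThreadScore:
    thread_id: str
    score: float
    reason: str  # judge rationale, which the adapter preserves


def route(thread: ThreadScore, threshold: float = ESCALATION_THRESHOLD) -> str:
    """Return the (hypothetical) queue name for a scored thread."""
    return "human_review" if thread.score >= threshold else "auto_close"


# Made-up scores for illustration only.
scored_threads = [
    ThreadScore("t-1", 0.91, "Gives specific investment advice."),
    ThreadScore("t-2", 0.12, "General product information only."),
]
routing = {t.thread_id: route(t) for t in scored_threads}
```

Carrying the rationale along with the score means the human reviewer sees why a thread was escalated, not just that it was.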

ConversationDialogueHelpfulnessMetric

Answers the question “did the assistant ultimately help the user?” after considering the exchange so far. The judge weighs context handed over by the user, detects if the assistant ignored clarifications, and rewards concrete, actionable replies. Use it to track assistant quality in customer-support or onboarding flows where the last response is the hand-off back to the user.

ConversationQARelevanceMetric

Scores how well the final answer resolves the user’s question, even if the conversation meandered. It picks up on subtle forms of deflection (“see our docs”) or hallucinated follow-ups. Teams often combine it with retrieval-based guardrails to ensure the agent grounds every final answer in the right snippet.

ConversationSummarizationConsistencyMetric

Validates that an auto-generated summary sticks to the facts shared in the transcript. It is particularly useful when you summarise long support chats or sales calls and need confidence that the synopsis won’t fabricate commitments or omit key blockers. Use its scores alongside human spot checks to prioritise which summaries require review.
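One way to prioritise review is to send the lowest-scoring summaries to humans first. The sketch below assumes consistency scores in [0, 1] where higher means more faithful, which this page does not state. The function name `review_queue` and the sample IDs and scores are hypothetical.

```python
import heapq


def review_queue(consistency_scores: dict[str, float], k: int) -> list[str]:
    """Return the k summary IDs with the lowest consistency scores, lowest first."""
    # Iterating a dict yields its keys; the key function ranks them by score.
    return heapq.nsmallest(k, consistency_scores, key=consistency_scores.get)


# Made-up scores for illustration only.
scores = {"call-17": 0.95, "chat-04": 0.40, "chat-09": 0.72, "call-02": 0.55}
to_review = review_queue(scores, k=2)
```

Here `to_review` holds the two least consistent summaries, so a fixed reviewing budget is spent where fabrication is most likely.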

ConversationSummarizationCoherenceMetric

Looks at the same summary through a writing-quality lens: is it organised, easy to skim, and logically grouped? Combine it with the consistency judge to ensure summaries are both faithful and readable before they populate CRM notes or ticket backlogs.
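Requiring both properties can be sketched as a gate that passes a summary only when each score clears its own bar. The score scale ([0, 1], higher is better), both thresholds, and the names `SummaryScores` and `ready_to_publish` are assumptions made for illustration.

```python
from typing import NamedTuple

# Hypothetical thresholds; faithfulness is held to a stricter bar than style.
MIN_CONSISTENCY = 0.8
MIN_COHERENCE = 0.6


class SummaryScores(NamedTuple):
    consistency: float  # assumed to come from the consistency metric
    coherence: float    # assumed to come from the coherence metric


def ready_to_publish(scores: SummaryScores) -> bool:
    """A summary passes only if it clears both thresholds."""
    return (
        scores.consistency >= MIN_CONSISTENCY
        and scores.coherence >= MIN_COHERENCE
    )
```

Using separate thresholds rather than an average prevents a well-written but unfaithful summary from passing on style alone.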

ConversationPromptUncertaintyMetric

Pinpoints last-turn prompts that lack critical context (“Can you finish it?”) or contain conflicting instructions. Surfacing these cases lets you proactively ask the user for clarification or enrich the prompt with missing metadata before rerunning expensive evaluations.