
LLM Juries


LLM Juries Judge

LLMJuriesJudge averages the scores of multiple judge metrics into a single ensemble score. It is useful when no single metric captures every quality dimension you care about, for example when you want to combine hallucination, compliance, and helpfulness checks into one signal.

Ensembling judges

```python
from opik.evaluation.metrics import (
    LLMJuriesJudge,
    Hallucination,
    ComplianceRiskJudge,
    DialogueHelpfulnessJudge,
)

# Each judge is constructed (and configured) on its own, then handed to the jury.
jury = LLMJuriesJudge(
    judges=[
        Hallucination(model="gpt-4o-mini"),
        ComplianceRiskJudge(),
        DialogueHelpfulnessJudge(),
    ]
)

score = jury.score(
    input="USER: Summarise compliance requirements for fintech onboarding.",
    output="No need for KYC; just accept the payment.",
)

print(score.value)  # averaged ensemble score
print(score.metadata["judge_scores"])  # individual judge results
```

How it works

  • Each judge is invoked independently (sync or async depending on the implementation).
  • Their ScoreResult.value fields are averaged to produce the final score.
  • Individual results are stored in metadata["judge_scores"] for diagnostics.
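
The three steps above can be sketched in plain Python. This is a minimal illustration of the averaging logic, not Opik's actual implementation: `SimpleScore`, `FixedJudge`, and `average_judges` are hypothetical stand-ins that make no LLM calls, and the layout of `judge_scores` (a mapping from judge name to value) is an assumption.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import Any, Dict, List


@dataclass
class SimpleScore:
    """Hypothetical stand-in for a score result: name, value, metadata."""
    name: str
    value: float
    metadata: Dict[str, Any] = field(default_factory=dict)


class FixedJudge:
    """Hypothetical judge that always returns the same value (no LLM call)."""

    def __init__(self, name: str, value: float) -> None:
        self.name = name
        self._value = value

    def score(self, **kwargs: Any) -> SimpleScore:
        return SimpleScore(name=self.name, value=self._value)


def average_judges(judges: List[FixedJudge], **kwargs: Any) -> SimpleScore:
    """Call every judge with the same inputs and average their values."""
    results = [judge.score(**kwargs) for judge in judges]
    return SimpleScore(
        name="llm_juries_judge",
        value=mean(r.value for r in results),
        # Keep per-judge values so a surprising average can be traced back.
        metadata={"judge_scores": {r.name: r.value for r in results}},
    )


jury_result = average_judges(
    [
        FixedJudge("hallucination", 1.0),
        FixedJudge("compliance", 0.5),
        FixedJudge("helpfulness", 0.0),
    ],
    input="question",
    output="answer",
)
print(jury_result.value)
print(jury_result.metadata["judge_scores"])
```

Because the final score is an unweighted mean, one judge that disagrees strongly with the others shifts the result in proportion to its share of the jury, so the per-judge values in the metadata are worth inspecting whenever the ensemble score looks surprising.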

Configuration

| Parameter | Description |
| --- | --- |
| judges | Sequence of BaseMetric instances. All must support the same input signature. |
| name | Optional custom metric name. Defaults to llm_juries_judge. |
| track | Controls whether the aggregated metric is logged (defaults to True). |
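
The requirement that all judges support the same input signature can be checked before a jury is built. The sketch below is one possible pre-flight check using only the standard library; `score_parameters`, `find_signature_mismatches`, and the two example judge classes are hypothetical and not part of Opik. It compares named parameters only, so if a real metric accepts catch-all keyword arguments this check is a heuristic rather than a guarantee.

```python
import inspect
from typing import Any, List, Sequence, Set, Tuple


def score_parameters(judge: Any) -> Set[str]:
    """Return the named parameters of judge.score, ignoring *args/**kwargs."""
    variadic = (inspect.Parameter.VAR_POSITIONAL, inspect.Parameter.VAR_KEYWORD)
    params = inspect.signature(judge.score).parameters.values()
    return {p.name for p in params if p.kind not in variadic}


def find_signature_mismatches(judges: Sequence[Any]) -> List[Tuple[str, List[str]]]:
    """Compare every judge against the first one and report differing names."""
    if not judges:
        return []
    reference = score_parameters(judges[0])
    mismatches = []
    for judge in judges[1:]:
        # Symmetric difference: parameters present on one side but not the other.
        diff = reference ^ score_parameters(judge)
        if diff:
            mismatches.append((type(judge).__name__, sorted(diff)))
    return mismatches


class InputOutputJudge:
    """Hypothetical judge that needs only input and output."""

    def score(self, input: str, output: str, **ignored: Any) -> float:
        return 1.0


class ContextJudge:
    """Hypothetical judge that additionally requires a context argument."""

    def score(self, input: str, output: str, context: list, **ignored: Any) -> float:
        return 1.0


mismatches = find_signature_mismatches([InputOutputJudge(), ContextJudge()])
print(mismatches)
```

Here the check flags `ContextJudge` because it expects a `context` argument the first judge does not take, which is the kind of mismatch that would otherwise surface only when the jury is scored.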

Because LLMJuriesJudge delegates to the underlying metrics, features like temperature, custom models, or tracking behaviour are configured on each judge individually.