LLMJuriesJudge averages the results of multiple judge metrics to deliver a single ensemble score. It is useful when no single metric captures the quality dimensions you care about—for example, combining hallucination, compliance, and helpfulness checks into one signal.
The ScoreResult.value fields returned by the individual judges are averaged to produce the final score. The per-judge results are kept in metadata["judge_scores"] for diagnostics.
Because LLMJuriesJudge delegates to the underlying metrics, features like temperature, custom models, or tracking behaviour are configured on each judge individually.
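
The averaging described above can be illustrated with a minimal, library-independent sketch. It is not the actual LLMJuriesJudge implementation: the `ScoreResult` dataclass, the `jury_score` function, the `make_fixed_judge` helper, and the dictionary layout of `metadata["judge_scores"]` are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from statistics import fmean
from typing import Any, Callable, Dict, List


@dataclass
class ScoreResult:
    # Hypothetical stand-in for the library's ScoreResult; only the fields
    # mentioned in the text (value, metadata) plus a name are modelled.
    name: str
    value: float
    metadata: Dict[str, Any] = field(default_factory=dict)


# In this sketch a "judge" is any callable mapping an output to a ScoreResult.
Judge = Callable[[str], ScoreResult]


def jury_score(judges: List[Judge], output: str) -> ScoreResult:
    """Run every judge on the output and average their values."""
    if not judges:
        raise ValueError("at least one judge is required")
    results = [judge(output) for judge in judges]
    return ScoreResult(
        name="jury",
        value=fmean([r.value for r in results]),
        # Assumed layout: judge name -> individual score, kept for diagnostics.
        metadata={"judge_scores": {r.name: r.value for r in results}},
    )


def make_fixed_judge(name: str, value: float) -> Judge:
    """Hypothetical judge that always returns the same score (no LLM call)."""
    return lambda output: ScoreResult(name=name, value=value)


judges = [
    make_fixed_judge("hallucination", 1.0),
    make_fixed_judge("compliance", 0.5),
    make_fixed_judge("helpfulness", 0.0),
]
result = jury_score(judges, "The capital of France is Paris.")
print(result.value, result.metadata["judge_scores"])
```

In the real library each judge would be a configured metric instance, so settings such as the model or temperature would be applied when constructing that judge rather than on the jury itself.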