LLM Juries Judge
LLMJuriesJudge averages the results of multiple judge metrics to deliver a single ensemble score. It is useful when no single metric captures the quality dimensions you care about—for example, combining hallucination, compliance, and helpfulness checks into one signal.
Ensembling judges
How it works
- Each judge is invoked independently (sync or async depending on the implementation).
- Their `ScoreResult.value` fields are averaged to produce the final score.
- Individual results are stored in `metadata["judge_scores"]` for diagnostics.
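The steps above can be sketched as follows. This is a minimal illustration, not the library's actual implementation: the `ScoreResult` dataclass, the judge callables, and their fixed scores are all hypothetical stand-ins.

```python
from dataclasses import dataclass, field
from statistics import mean

@dataclass
class ScoreResult:
    # Hypothetical result type: each judge returns a value in [0, 1].
    name: str
    value: float
    metadata: dict = field(default_factory=dict)

class LLMJuriesJudge:
    """Illustrative ensemble: averages the scores of its member judges."""

    def __init__(self, judges):
        self.judges = judges  # callables: (output) -> ScoreResult

    def score(self, output: str) -> ScoreResult:
        # 1. Invoke each judge independently.
        results = [judge(output) for judge in self.judges]
        # 2. Average the ScoreResult.value fields for the final score.
        final = mean(r.value for r in results)
        # 3. Keep the individual results for diagnostics.
        return ScoreResult(
            name="llm_juries",
            value=final,
            metadata={"judge_scores": {r.name: r.value for r in results}},
        )

# Stub judges standing in for real LLM-based metrics.
hallucination = lambda out: ScoreResult("hallucination", 0.75)
compliance = lambda out: ScoreResult("compliance", 0.5)
helpfulness = lambda out: ScoreResult("helpfulness", 1.0)

jury = LLMJuriesJudge([hallucination, compliance, helpfulness])
result = jury.score("some model output")
print(result.value)                     # 0.75
print(result.metadata["judge_scores"])  # per-judge breakdown
```

Because the individual results survive in the metadata, a low ensemble score can be traced back to the judge that produced it.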
Configuration
Because LLMJuriesJudge delegates to the underlying metrics, features like temperature, custom models, or tracking behaviour are configured on each judge individually.
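A sketch of what per-judge configuration looks like in practice. `MockJudge` and its `model`/`temperature` parameters are illustrative assumptions, not the library's real API; the point is that the ensemble never carries these settings itself.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class MockJudge:
    # Hypothetical judge: `model` and `temperature` stand in for the
    # per-judge settings described above.
    name: str
    model: str = "gpt-4o-mini"
    temperature: float = 0.0

    def score(self, output: str) -> float:
        # A real judge would call its own LLM here, using self.model and
        # self.temperature; the sketch returns a fixed value instead.
        return 0.5

# Each judge is constructed with its own configuration; the ensemble
# only averages their outputs and has no model or temperature knobs.
judges = [
    MockJudge("hallucination", temperature=0.0),
    MockJudge("helpfulness", model="gpt-4o", temperature=0.7),
]
ensemble_score = mean(j.score("some model output") for j in judges)
print(ensemble_score)  # 0.5
```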