{"id":12957,"date":"2025-02-24T07:21:32","date_gmt":"2025-02-24T15:21:32","guid":{"rendered":"https:\/\/live-cometml.pantheonsite.io\/?p=12957"},"modified":"2025-11-13T19:22:47","modified_gmt":"2025-11-13T19:22:47","slug":"llm-juries-for-evaluation","status":"publish","type":"post","link":"https:\/\/www.comet.com\/site\/blog\/llm-juries-for-evaluation\/","title":{"rendered":"LLM Juries for Evaluation"},"content":{"rendered":"\n<div class=\"wp-block-buttons is-content-justification-left is-layout-flex wp-container-core-buttons-is-layout-fc4fd283 wp-block-buttons-is-layout-flex\">\n<div class=\"wp-block-button\"><a class=\"wp-block-button__link wp-element-button\" href=\"https:\/\/colab.research.google.com\/drive\/1Lt-4rvNIYPhgCMpaTd2N6GxJu9LkfcE5\" target=\"_blank\" rel=\"noreferrer noopener\">Follow along with the Colab!&nbsp;<\/a><\/div>\n<\/div>\n\n\n\n<p class=\"p1\"><span class=\"s1\">Evaluating the correctness of generated responses is an inherently challenging task. <a href=\"https:\/\/www.comet.com\/site\/blog\/llm-as-a-judge\/\">LLM-as-a-Judge<\/a> evaluators have gained popularity for their ability to provide nuanced, reference-free, and scalable assessments across diverse tasks. 
However, individual models still suffer from biases, inconsistencies, and blind spots.&nbsp;<\/span><\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"576\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/02\/LLM-Juries-1024x576.jpg\" alt=\"LLM Juries for Evaluation featured image\" class=\"wp-image-18410\" srcset=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/02\/LLM-Juries-1024x576.jpg 1024w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/02\/LLM-Juries-300x169.jpg 300w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/02\/LLM-Juries-768x432.jpg 768w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/02\/LLM-Juries-1536x864.jpg 1536w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/02\/LLM-Juries.jpg 1920w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p class=\"p1\"><span class=\"s1\">Ensemble learning, a long-standing technique in traditional machine learning, enhances accuracy and robustness by aggregating multiple models. This principle applies to <a href=\"https:\/\/www.comet.com\/site\/blog\/llm-evaluation-guide\/\">LLM evaluation<\/a> workflows as well. So-called LLM Juries, <a href=\"https:\/\/arxiv.org\/abs\/2404.18796\">panels of diverse LLM judges<\/a>, leverage ensembling to improve accuracy, fairness, and interpretability, offering a more robust and reliable <a href=\"https:\/\/www.comet.com\/site\/blog\/llm-evaluation-frameworks\/\">LLM evaluation framework<\/a>.<\/span><\/p>\n\n\n\n<p class=\"p1\"><span class=\"s1\">In this article, we\u2019ll explore the advantages and limitations of LLM Juries and how to implement one from scratch in <a href=\"https:\/\/www.comet.com\/docs\/opik\">Opik<\/a>. 
<\/span><\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-what-is-an-llm-jury\">What is an LLM Jury?<\/h2>\n\n\n\n<p class=\"p1\"><span class=\"s1\">An LLM Jury consists of multiple LLM judges that independently score a given output, then aggregate their scores through a voting function. Unlike a single LLM-as-a-Judge evaluator, which often relies on a large model like GPT-4o, an LLM Jury typically consists of smaller models from distinct families (e.g., GPT, Claude, Command R, Mistral).<\/span><\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full wp-image-12958\"><img loading=\"lazy\" decoding=\"async\" width=\"1600\" height=\"900\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/02\/Untitled-design-45.png\" alt=\"Diagram of an LLM Jury consisting of Cohere's Command R, OpenAI's GPT 3.5, and Anthropic's Claude Haiku which uses an average pooling method to aggregate the LLM judge scores\" class=\"wp-image-12958\" srcset=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/02\/Untitled-design-45.png 1600w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/02\/Untitled-design-45-300x169.png 300w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/02\/Untitled-design-45-1024x576.png 1024w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/02\/Untitled-design-45-768x432.png 768w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/02\/Untitled-design-45-1536x864.png 1536w\" sizes=\"auto, (max-width: 1600px) 100vw, 1600px\" \/><figcaption class=\"wp-element-caption\">An LLM Jury consists of multiple LLM judges that independently score a given output, then aggregate their scores through a voting function.<\/figcaption><\/figure>\n\n\n\n<p>&nbsp;<\/p>\n\n\n\n<p class=\"p1\"><span class=\"s1\"><a href=\"https:\/\/arxiv.org\/html\/2403.02839v1\">While larger models handle diverse evaluation tasks well, some smaller or fine-tuned models may struggle with generalization.<\/a> For this reason, each (smaller) model 
in the jury should support the required scoring type, for example reference-based or pair-wise scoring.<\/span><\/p>\n\n\n\n<p>&nbsp;<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full wp-image-12966\"><img loading=\"lazy\" decoding=\"async\" width=\"3912\" height=\"2196\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/02\/blog-graphic-1.png\" alt=\"Single-point scoring: The model evaluates an output independently of any point of comparison on a predefined scale, which is usually provided in the prompt. Pros: - Simple, efficient, and flexible - Does not require references\/ground truths - Scales well - Can be adapted to various scoring rubrics and tasks. Cons: - Scores can be inconsistent across different evaluators (human or model) and prompts - Difficult to assess relative quality; two different responses could receive similar scores even if one is clearly better. - Calibration is often required to mitigate bias, but this is often insufficient to align with human judgment. Reference-based scoring: The model evaluates an output by comparing it to a predefined reference response (e.g., a ground-truth answer or human annotation). Pros: - Provides an objective benchmark - Useful for tasks with clear &quot;correctness&quot; criteria (e.g. machine translation, summarization) - Very consistent, scalable and highly automatable. Cons: - Fails for tasks with multiple valid responses (open-ended tasks) - May limit model creativity by penalizing valid but different outputs - Very sensitive to the quality of the reference data - Reference data is costly and time-consuming to acquire. Pair-wise scoring: The model compares two competing outputs (e.g., from different models or different prompts) and determines which one is better based on specific criteria. 
Pros: - Most robust and better alignment with human judgment, especially for open-ended tasks - Reduces subjective variance in individual scores since only relative quality matters -Useful for ranking models and fine-tuning LLMs. Cons: - Requires additional comparisons to establish rankings, a popular application of pair-wise scoring - Computationally expensive for large datasets - Transitivity issues - Difficult to benchmark - Requires careful sampling\" class=\"wp-image-12966\" srcset=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/02\/blog-graphic-1.png 3912w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/02\/blog-graphic-1-300x168.png 300w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/02\/blog-graphic-1-1024x575.png 1024w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/02\/blog-graphic-1-768x431.png 768w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/02\/blog-graphic-1-1536x862.png 1536w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/02\/blog-graphic-1-2048x1150.png 2048w\" sizes=\"auto, (max-width: 3912px) 100vw, 3912px\" \/><figcaption class=\"wp-element-caption\">Each model in the jury should support the relevant scoring type, as <a href=\"https:\/\/arxiv.org\/html\/2403.02839v1\">smaller models don&#8217;t generalize as well to new tasks as larger models do<\/a>.<\/figcaption><\/figure>\n\n\n\n<p>&nbsp;<\/p>\n\n\n\n<p class=\"p1\"><span class=\"s1\">LLM Juries use traditional ensemble learning techniques for score aggregation, depending on the task. 
For example, max pooling may be appropriate for binary classification, average or median pooling for rating scales, soft or hard voting for binary or multi-class classification, or stacking for open-ended evaluations.<\/span><\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-why-use-a-jury-instead-of-a-judge\">Why Use a Jury Instead of a Judge?<\/h2>\n\n\n\n<p>A single large model like GPT-4o is often used as an evaluator due to its generalizability, depth of knowledge, and reasoning capabilities. However, this approach has significant drawbacks.<\/p>\n\n\n\n<p><a href=\"https:\/\/huggingface.co\/papers\/2404.13076\">If the same model generates and evaluates responses, <strong>self-recognition can lead to self-preference<\/strong>, introducing bias and reducing evaluation fairness.<\/a> This <strong>intra-model bias<\/strong> is a serious issue in safety-critical and ethical applications, and can also affect baseline performance.<\/p>\n\n\n\n<p>Ultra-large models are also <strong>expensive<\/strong> and <strong>slow<\/strong> due to their high computational requirements. This makes them unsuitable for many real-time, edge, or large-scale applications.<\/p>\n\n\n\n<p><a href=\"https:\/\/arxiv.org\/abs\/2404.18796\">Research from Cohere<\/a> suggests that a diverse panel of smaller models outperforms a single large judge, reduces bias, and does so at over 7x lower cost. 
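The pooling strategies mentioned above (max, average, and median pooling, and hard voting) are simple to implement. A minimal sketch with hypothetical helper functions, assuming the judges return plain numeric scores or discrete labels; these helpers are not part of the original tutorial:

```python
import statistics
from collections import Counter

def max_pool(scores):
    """Binary flags: positive if ANY judge flags the output (e.g. hallucination detection)."""
    return max(scores)

def average_pool(scores):
    """Rating scales: the mean of the judges' scores."""
    return sum(scores) / len(scores)

def median_pool(scores):
    """Rating scales, robust to a single outlier judge."""
    return statistics.median(scores)

def hard_vote(labels):
    """Multi-class labels: majority vote over the judges' discrete verdicts."""
    return Counter(labels).most_common(1)[0][0]
```

Stacking, by contrast, trains a small meta-model on the individual judges' outputs, so it requires labeled data rather than a fixed aggregation function.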
Additionally, multiple smaller models can run in parallel, further improving speed and efficiency.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full wp-image-12962\"><img loading=\"lazy\" decoding=\"async\" width=\"1600\" height=\"900\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/02\/Untitled-design-47.png\" alt=\"\" class=\"wp-image-12962\" srcset=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/02\/Untitled-design-47.png 1600w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/02\/Untitled-design-47-300x169.png 300w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/02\/Untitled-design-47-1024x576.png 1024w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/02\/Untitled-design-47-768x432.png 768w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/02\/Untitled-design-47-1536x864.png 1536w\" sizes=\"auto, (max-width: 1600px) 100vw, 1600px\" \/><figcaption class=\"wp-element-caption\">Research from Cohere suggests that a diverse panel of smaller models outperforms a single large judge, reduces bias, and does so at over 7x lower cost. Image from <a href=\"https:\/\/arxiv.org\/abs\/2404.18796\">Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models<\/a><\/figcaption><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-implementing-llm-juries-in-opik\">Implementing LLM Juries in Opik<\/h2>\n\n\n\n<p>Using what we\u2019ve learned so far about LLM Juries, let\u2019s implement one as a <a href=\"https:\/\/www.comet.com\/docs\/opik\/evaluation\/metrics\/custom_metric\">custom metric in Opik<\/a>. 
Follow along in <a href=\"https:\/\/colab.research.google.com\/drive\/1Lt-4rvNIYPhgCMpaTd2N6GxJu9LkfcE5\">the full-code Colab<\/a> if you aren\u2019t already!<\/p>\n\n\n\n<p>For this tutorial, we\u2019ll be using a toy subset of the <a href=\"https:\/\/huggingface.co\/datasets\/rongzhangibm\/NaturalQuestionsV2\">Natural Questions (NQ) dataset<\/a> created by <a href=\"https:\/\/github.com\/google-research-datasets\/natural-questions\">Google Research<\/a> and available in the <a href=\"https:\/\/huggingface.co\/docs\/datasets\/en\/index\">Hugging Face datasets module<\/a>. It consists of two fields:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong><code>question<\/code><\/strong> : Input open domain question (str)<\/li>\n\n\n\n<li><strong><code>answer<\/code><\/strong> :&nbsp; List of possible answers to the question (list)<\/li>\n<\/ul>\n\n\n\n<p>&nbsp;<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full wp-image-12963\"><img loading=\"lazy\" decoding=\"async\" width=\"1896\" height=\"624\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/02\/Screenshot-2025-02-19-at-3.23.55\u202fPM.png\" alt=\"A screenshot of the Natural Questions (NQ) dataset created by Google Research and available in the Hugging Face datasets module\" class=\"wp-image-12963\" srcset=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/02\/Screenshot-2025-02-19-at-3.23.55\u202fPM.png 1896w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/02\/Screenshot-2025-02-19-at-3.23.55\u202fPM-300x99.png 300w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/02\/Screenshot-2025-02-19-at-3.23.55\u202fPM-1024x337.png 1024w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/02\/Screenshot-2025-02-19-at-3.23.55\u202fPM-768x253.png 768w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/02\/Screenshot-2025-02-19-at-3.23.55\u202fPM-1536x506.png 1536w\" sizes=\"auto, (max-width: 1896px) 100vw, 1896px\" \/><figcaption 
class=\"wp-element-caption\">We\u2019ll be using a toy subset of the <a href=\"https:\/\/huggingface.co\/datasets\/rongzhangibm\/NaturalQuestionsV2\">Natural Questions (NQ) dataset<\/a> created by <a href=\"https:\/\/github.com\/google-research-datasets\/natural-questions\">Google Research<\/a> and available in the Hugging Face <a href=\"https:\/\/huggingface.co\/docs\/datasets\/en\/index\">datasets module<\/a><\/figcaption><\/figure>\n\n\n\n<p>&nbsp;<\/p>\n\n\n\n<p>Next, we\u2019ll select the model to be evaluated and define the evaluation task. Here, we use <a href=\"https:\/\/huggingface.co\/Qwen\/Qwen2.5-3B-Instruct\">Qwen2.5-3B-Instruct<\/a>, since it easily fits within a Colab notebook and is readily available in Hugging Face\u2019s <a href=\"https:\/\/huggingface.co\/docs\/transformers\/en\/index\">transformers module<\/a>.<\/p>\n\n\n\n<p>&nbsp;<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>from transformers import AutoModelForCausalLM, AutoTokenizer\n\n# Load the model and tokenizer\nMODEL_NAME = \"Qwen\/Qwen2.5-3B-Instruct\"\n\nmodel = AutoModelForCausalLM.from_pretrained(\n    MODEL_NAME,\n    torch_dtype=\"auto\",\n    device_map=\"auto\"\n)\ntokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)\n<\/code><\/pre>\n\n\n\n<p>&nbsp;<\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">We write a very simple function that accepts as input the question from our NQ dataset and returns the model\u2019s response. By adding the track decorator, we ensure that everything is automatically tracked to Opik.<\/span><\/p>\n\n\n\n<p>&nbsp;<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>from opik import track\n\n@track\ndef generate_answer(input_question: str) -&gt; str:\n  \"\"\"Generates an answer based on the input question using the loaded LLM.\"\"\"\n  messages = &#91;\n    {\"role\": \"system\", \"content\": \"You are Qwen, created by Alibaba Cloud. 
You are a helpful assistant.\"},\n    {\"role\": \"user\", \"content\": input_question}\n  ]\n  text = tokenizer.apply_chat_template(\n    messages,\n    tokenize=False,\n    add_generation_prompt=True\n  )\n  model_inputs = tokenizer(&#91;text], return_tensors=\"pt\").to(model.device)\n\n  generated_ids = model.generate(\n    **model_inputs,\n    max_new_tokens=512\n  )\n  generated_ids = &#91;\n    output_ids&#91;len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)\n  ]\n\n  response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)&#91;0]\n  return response\n<\/code><\/pre>\n\n\n\n<p>&nbsp;<\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">We\u2019ll also add the track decorator to our evaluation task, which simply calls the function we defined above and returns it in the appropriate format for the evaluate function we\u2019ll call later on.<\/span><\/p>\n\n\n\n<p>&nbsp;<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>@track\ndef evaluation_task(data):\n    \"\"\"Evaluates the LLM output given a dataset sample.\"\"\"\n    llm_output = generate_answer(data&#91;'question'])\n    return {\"output\": llm_output}\n<\/code><\/pre>\n\n\n\n<p>&nbsp;<\/p>\n\n\n\n<p>For our particular use case, we want each of the models in our jury to return structured outputs in the form of valid JSON objects. This way, we can ensure that each model returns its output in the same format and we can easily aggregate the scores. 
For this, we\u2019ll use <a href=\"https:\/\/platform.openai.com\/docs\/guides\/structured-outputs\">OpenAI\u2019s structured formats<\/a> and define our <code>response_format<\/code>:<\/p>\n\n\n\n<p>&nbsp;<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code># JSON schema for hallucination scoring response_format\nRESPONSE_FORMAT = {\n      \"type\": \"json_schema\",\n      \"json_schema\": {\n        \"name\": \"hallucination_score\",\n        \"strict\": True,\n        \"schema\": {\n          \"type\": \"object\",\n          \"properties\": {\n            \"score\": {\n              \"type\": \"number\",\n              \"description\": \"A hallucination score between 0 and 1\"\n            },\n            \"reason\": {\n              \"type\": \"string\",\n              \"description\": \"The reasoning for the assessed hallucination score\"\n            }\n          },\n          \"required\": &#91;\"score\", \"reason\"],\n          \"additionalProperties\": False\n        }\n      }\n    }\n<\/code><\/pre>\n\n\n\n<p>&nbsp;<\/p>\n\n\n\n<p class=\"p1\"><span class=\"s1\">Next, we\u2019ll define our LLM Jury metric as a <a href=\"https:\/\/www.comet.com\/docs\/opik\/evaluation\/metrics\/custom_metric\"><span class=\"s2\">custom metric<\/span><\/a> using <a href=\"https:\/\/github.com\/comet-ml\/opik\/blob\/main\/sdks\/python\/src\/opik\/evaluation\/metrics\/base_metric.py\"><span class=\"s2\">Opik\u2019s BaseMetric class<\/span><\/a>.&nbsp;<\/span><\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-creating-our-llm-jury-custom-metric\">Creating Our LLM Jury Custom Metric<\/h2>\n\n\n\n<p class=\"p1\"><span class=\"s1\">First, we define a prompt which will be passed to each of the models in our jury. We make sure to specify the single-point scoring task and evaluation criteria and use a form-filling paradigm.<\/span><\/p>\n\n\n\n<p>&nbsp;<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>\"\"\"\nYou are an impartial judge evaluating the following claim for factual accuracy. 
Analyze it carefully\nand respond with a number between 0 and 1: 1 if completely accurate, 0.5 if mixed accuracy, or 0 if inaccurate.\nThen provide one brief sentence explaining your ruling.\n\nThe format of your response should be a JSON object with no additional text or backticks that follows the format:\n{{\n      \"score\": &lt;score between 0 and 1&gt;,\n      \"reason\": \"&lt;reason for the score&gt;\"\n}}\n\nClaim to evaluate: {output}\n\nResponse:\n\"\"\"\n<\/code><\/pre>\n\n\n\n<p>&nbsp;<\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">For simplicity\u2019s sake, we will use <\/span><a href=\"https:\/\/openrouter.ai\/\"><span style=\"font-weight: 400;\">OpenRouter<\/span><\/a><span style=\"font-weight: 400;\"> to call each of the models in our jury. OpenRouter is a unified API service that provides access to multiple AI models through a single standardized interface. This way we will have uniform input and output formatting from each model and can simply loop through a list of our model names. For a full list of models supported by OpenRouter, <\/span><a href=\"https:\/\/openrouter.ai\/models\"><span style=\"font-weight: 400;\">see here<\/span><\/a><span style=\"font-weight: 400;\">. Note that we have selected models from three distinct model groups, namely OpenAI, Mistral AI, and Cohere. We opted not to use Anthropic\u2019s Claude Haiku, which the <a href=\"https:\/\/arxiv.org\/abs\/2404.18796\">original paper<\/a> used, because it does not support the JSON schema response format.<\/span><\/p>\n\n\n\n<p>&nbsp;<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>model_names = &#91;\"openai\/gpt-4o-mini\", \"mistralai\/mistral-small-24b-instruct-2501\", \"cohere\/command-r-08-2024\"]\n<\/code><\/pre>\n\n\n\n<p>&nbsp;<\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">Next, we define the score method of our <code>LLMJuryMetric<\/code> class. 
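Because each juror is queried independently, the per-model calls can also run in parallel rather than in a sequential loop. A minimal sketch using `concurrent.futures`; `query_model` and `query_jury` are hypothetical helpers (not part of this tutorial), assuming an OpenAI-compatible client such as the OpenRouter one described above:

```python
from concurrent.futures import ThreadPoolExecutor

def query_model(client, model, prompt):
    """Send one prompt to one jury model and return the raw response text."""
    completion = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return completion.choices[0].message.content

def query_jury(client, model_names, prompt):
    """Query all jury models concurrently; results keep the order of model_names."""
    with ThreadPoolExecutor(max_workers=len(model_names)) as pool:
        futures = [pool.submit(query_model, client, m, prompt) for m in model_names]
        return [f.result() for f in futures]
```

For the handful of small models used here, threads are sufficient since the work is I/O-bound API calls; the sequential loop below keeps the tutorial simpler.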
The score method should return a <code>ScoreResult<\/code> object with a value and a name.&nbsp;<\/span><\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">In our case, our score method loops through our list of models and uses our prompt to evaluate Qwen2.5-3B-Instruct\u2019s answer to the question from our dataset. The method checks to make sure a valid JSON object is returned and collects the responses in a list called <code>completions<\/code>. The response scores are then pooled using an average function, along with a dictionary of the reasons for those scores. Later, we can choose to set a rule for the result of the average function, like:<\/span><\/p>\n\n\n\n<p>&nbsp;<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>is_hallucination = avg_score &gt;= 0.5\n<\/code><\/pre>\n\n\n\n<p>&nbsp;<\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">For now, though, let\u2019s take a look at our scores and we can set a threshold later on.<\/span><\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">Let\u2019s put it all together in our custom Opik metric <code>LLMJuryMetric<\/code>:<\/span><\/p>\n\n\n\n<p>&nbsp;<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>import json\nimport os\nfrom typing import Any\n\nimport numpy as np\nfrom openai import OpenAI\nfrom opik.evaluation.metrics import base_metric, score_result\nfrom opik.integrations.openai import track_openai\n\n\nclass LLMJuryMetric(base_metric.BaseMetric):\n    \"\"\"Metric to evaluate LLM outputs for factual accuracy using multiple models and an average voting function.\"\"\"\n\n    def __init__(self, name: str = \"LLM Jury\"):\n        self.name = name\n        self.llm_client = track_openai(\n            OpenAI(\n                base_url=\"https:\/\/openrouter.ai\/api\/v1\",\n                api_key=os.getenv(\"OPENROUTER_API_KEY\"),\n            )\n        )\n        self.prompt_template = \"\"\"\n        You are an impartial judge evaluating the following claim for factual accuracy. 
Analyze it carefully\n        and respond with a number between 0 and 1: 1 if completely accurate, 0.5 if mixed accuracy, or 0 if inaccurate.\n        Then provide one brief sentence explaining your ruling.\n\n        The format of your response should be a JSON object with no additional text or backticks that follows the format:\n        {{\n            \"score\": &lt;score between 0 and 1&gt;,\n            \"reason\": \"&lt;reason for the score&gt;\"\n        }}\n\n        Claim to evaluate: {output}\n\n        Response:\n        \"\"\"\n        self.model_names = &#91;\"openai\/gpt-4o-mini\", \"mistralai\/mistral-small-24b-instruct-2501\", \"cohere\/command-r-08-2024\"]\n\n    def score(self, output: str, **ignored_kwargs: Any):\n        \"\"\"\n        Score the output of an LLM.\n\n        Args:\n            output: The output of an LLM to score.\n            **ignored_kwargs: Any additional keyword arguments. This is important so that the metric can be used in the `evaluate` function.\n        \"\"\"\n        # Construct the prompt based on the output of the LLM\n        prompt = self.prompt_template.format(output=output)\n\n        completions = &#91;]\n        for model in self.model_names:\n            try:\n                completion = self.llm_client.chat.completions.create(\n                    model=model,\n                    messages=&#91;{\"role\": \"user\", \"content\": prompt}],\n                    response_format=RESPONSE_FORMAT\n                )\n                response_data = json.loads(completion.choices&#91;0].message.content)\n                completions.append(response_data)\n            except (json.JSONDecodeError, AttributeError, IndexError):\n                print(f\"Error parsing response from model {model}\")\n                continue  # Skip this model if an error occurs\n\n        if completions:\n            avg_score = np.mean(&#91;float(response&#91;\"score\"]) for response in completions])\n            reasons = {self.model_names&#91;i]: response&#91;\"reason\"] for i, response in enumerate(completions)}\n        else:\n            avg_score = 0.0\n            reasons = \"No valid responses received.\"\n\n        return score_result.ScoreResult(\n            name=self.name,\n            value=avg_score,\n            reason=str(reasons)\n        )\n<\/code><\/pre>\n\n\n\n<p>&nbsp;<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-evaluating-with-opik\">Evaluating With Opik<\/h2>\n\n\n\n<p><span style=\"font-weight: 400;\">To use our custom <code>LLMJuryMetric<\/code>, we simply instantiate it and pass it to the <code>scoring_metrics<\/code> parameter of <code>opik.evaluation.evaluate<\/code>:<\/span><\/p>\n\n\n\n<p>&nbsp;<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>from opik.evaluation import evaluate\n\n# Instantiate our custom LLM Jury metric\nllm_jury_metric = LLMJuryMetric()\n\n# Perform the evaluation\nevaluation = evaluate(\n    experiment_name=\"My LLM Jury Experiment\",\n    dataset=dataset,\n    task=evaluation_task,\n    scoring_metrics=&#91;llm_jury_metric],\n    task_threads=1\n)\n<\/code><\/pre>\n\n\n\n<p>&nbsp;<\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">And here is what the output of your evaluation should look like from within the Opik UI:<\/span><\/p>\n\n\n\n<p>&nbsp;<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter wp-image-12968 size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1911\" height=\"928\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/02\/Screenshot-2025-02-19-at-3.55.20\u202fPM.png\" alt=\"A screenshot of the result of logging our custom LLM Jury metric using Opik\" class=\"wp-image-12968\" srcset=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/02\/Screenshot-2025-02-19-at-3.55.20\u202fPM.png 1911w, 
https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/02\/Screenshot-2025-02-19-at-3.55.20\u202fPM-300x146.png 300w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/02\/Screenshot-2025-02-19-at-3.55.20\u202fPM-1024x497.png 1024w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/02\/Screenshot-2025-02-19-at-3.55.20\u202fPM-768x373.png 768w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/02\/Screenshot-2025-02-19-at-3.55.20\u202fPM-1536x746.png 1536w\" sizes=\"auto, (max-width: 1911px) 100vw, 1911px\" \/><figcaption class=\"wp-element-caption\">The result of logging our custom LLM Jury metric using Opik<\/figcaption><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-\"><\/h2>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-limitations-of-llm-juries\">Limitations of LLM Juries<\/h2>\n\n\n\n<p><span style=\"font-weight: 400;\">Despite their advantages, LLM Juries come with several trade-offs. Managing multiple models is inherently more complex than using a single evaluator. Models from different families often have incompatible input\/output formats, requiring additional preprocessing and infrastructure. If each model specializes in evaluating a different criterion, the system must be carefully designed to handle diverse evaluation strategies, further increasing complexity.<\/span><\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">Smaller models also underperform in reasoning tasks compared to larger counterparts. If these models lack diverse training data, bias mitigation may be minimal, potentially undermining one of the key motivations for using a jury. Since many modern LLMs share similar datasets, finding sufficiently diverse models can be challenging. The additional engineering overhead required to integrate multiple models must be weighed against the expected improvements in evaluation quality.<\/span><\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">Cost considerations have also evolved. 
Token prices have dropped significantly since <a href=\"https:\/\/arxiv.org\/abs\/2404.18796\">the original PoLL paper<\/a>, making the claim that LLM Juries are \u201c7\u20138x cheaper\u201d than a single large model potentially outdated. While a jury of smaller models is still likely more cost-efficient, the savings may not always justify the added complexity of implementation.<\/span><\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">Furthermore, some researchers have questioned PoLL\u2019s findings, particularly its conclusion that GPT-3.5 outperforms GPT-4. The study focused on relatively simple tasks (Single-Hop QA, Multi-Hop QA, and Chatbot Arena Hard), leaving open the possibility that a single large model could still outperform an LLM Jury on more complex, nuanced evaluations.&nbsp;<\/span><\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-the-case-for-llm-juries\">The Case for LLM Juries<\/h2>\n\n\n\n<p><span style=\"font-weight: 400;\">Still, the idea that an ensemble of different models can outperform a single larger model is well established in both research and industry. Mixture-of-Experts approaches, for example, have been at the forefront of LLM leaderboards for some time now. At Comet, we can anecdotally say that, after working with hundreds of teams, we have seen many instances where LLM jury techniques have been the optimal approach for evaluations, outperforming single-model judges. 
With the introduction of platforms like Opik and OpenRouter, there\u2019s no reason not to experiment with an LLM jury approach inside your <a href=\"https:\/\/www.comet.com\/site\/products\/opik\/\">evaluation<\/a> pipelines.<\/span><\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<p class=\"p1\"><span class=\"s1\"><b>If you found this article useful, follow me on <\/b><a href=\"https:\/\/www.linkedin.com\/in\/anmorgan24\/\"><span class=\"s2\"><b>LinkedIn<\/b><\/span><\/a><b> and <\/b><a href=\"https:\/\/x.com\/anmorgan2414\"><span class=\"s2\"><b>Twitter<\/b><\/span><\/a><b> for more content!<\/b><\/span><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Evaluating the correctness of generated responses is an inherently challenging task. LLM-as-a-Judge evaluators have gained popularity for their ability to provide nuanced, reference-free, and scalable assessments across diverse tasks. However, individual models still suffer from biases, inconsistencies, and blind spots.&nbsp; Ensemble learning, a long-standing technique in traditional machine learning, enhances accuracy and robustness by aggregating [&hellip;]<\/p>\n","protected":false},"author":22,"featured_media":18410,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"customer_name":"","customer_description":"","customer_industry":"","customer_technologies":"","customer_logo":"","footnotes":""},"categories":[8,65,7],"tags":[14,15,71,52,31,96,33],"coauthors":[133],"class_list":["post-12957","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-comet-community-hub","category-llmops","category-tutorials","tag-comet-ml","tag-deep-learning-experiment-management","tag-language-models","tag-llm","tag-llmops","tag-llms","tag-openai"],"yoast_head":"<!-- This site is optimized with the Yoast SEO Premium plugin v25.9 (Yoast SEO v25.9) - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ 