{"id":12079,"date":"2024-11-27T12:40:16","date_gmt":"2024-11-27T20:40:16","guid":{"rendered":"https:\/\/live-cometml.pantheonsite.io\/?p=12079"},"modified":"2025-11-13T21:25:36","modified_gmt":"2025-11-13T21:25:36","slug":"structured-generation-llm-as-a-judge","status":"publish","type":"post","link":"https:\/\/www.comet.com\/site\/blog\/structured-generation-llm-as-a-judge\/","title":{"rendered":"Structured Generation for LLM-as-a-Judge Evaluations"},"content":{"rendered":"\n<div class=\"wp-block-buttons is-layout-flex wp-block-buttons-is-layout-flex\">\n<div class=\"wp-block-button\"><a class=\"wp-block-button__link wp-element-button\" href=\"https:\/\/colab.research.google.com\/drive\/1-lQn0qvJMN1BBuDjRuCzySA7gLhpcdBo\" target=\"_blank\" rel=\"noreferrer noopener\"><span class=\"s1\">Follow along with the Colab!<\/span><\/a><\/div>\n<\/div>\n\n\n\n<p class=\"p1\">For the past few months, I\u2019ve been working on LLM-based evaluations (\u201c<a href=\"https:\/\/www.comet.com\/site\/blog\/llm-as-a-judge\/\">LLM-as-a-Judge<\/a>\u201d metrics) for language models. The results have so far been extremely encouraging, particularly for evaluations like <a href=\"https:\/\/www.comet.com\/site\/blog\/llm-hallucination\/\">hallucination detection<\/a> or content moderation, which are hard to quantify with heuristic methods.<\/p>\n\n\n\n<p>Engineering <a href=\"https:\/\/www.comet.com\/site\/blog\/llm-evaluation-metrics-every-developer-should-know\/\">LLM evaluation metrics<\/a>, however, has been surprisingly challenging. Evaluations and unit tests, especially those with more complex logic, require you to know the structure of your data. And with LLMs and their probabilistic outputs, it\u2019s difficult to reliably produce specific formats and structures. 
Some hosted model providers now offer <code>structured outputs<\/code> modes, but these still come with limitations, and if you\u2019re using open source or local models, those modes won\u2019t do you much good.<\/p>\n\n\n\n<p>The solution to this problem is to use <strong>structured generation<\/strong>. Beyond its ability to make LLM-based evaluations more reliable, it also unlocks an entirely new category of complex, powerful multi-stage evaluations.<\/p>\n\n\n\n<p>In this piece, I want to introduce structured generation and some of the big ideas behind it, before diving into specific examples of hallucination detection with an LLM judge. All of the code samples below can be run from within this <a href=\"https:\/\/colab.research.google.com\/drive\/1-lQn0qvJMN1BBuDjRuCzySA7gLhpcdBo#scrollTo=8QOySg8J5AcT\">Colab notebook<\/a>, so feel free to run the samples as you follow along.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-a-brief-introduction-to-structured-generation-with-context-free-grammars\">A Brief Introduction to Structured Generation with Context-Free Grammars<\/h2>\n\n\n\n<p>Structured generation is a subfield of machine learning focused on guiding the outputs of generative models by constraining the outputs to fit some particular schema. As an example, instead of fine-tuning a model to output valid JSON, you might constrain a more generalized model\u2019s output to only match valid JSON schemas.<\/p>\n\n\n\n<p>You can constrain the outputs of a model through different strategies, but the most common is to interfere directly in the sampling phase, using some external schema to prevent \u201cincorrect\u201d tokens from being sampled.<\/p>\n\n\n\n<p>At this point, structured generation has become a fairly common feature in LLM servers. vLLM, NVIDIA NIM, llama.cpp, and Ollama all support it. 
If you\u2019re not working with a model server, libraries like <a href=\"https:\/\/github.com\/dottxt-ai\/outlines\">Outlines<\/a> make it trivial to implement for any model. OpenAI also provides a \u201cStructured Output\u201d mode, which similarly allows you to specify a response schema from their API.<\/p>\n\n\n\n<p>But, I find it helps me develop my intuition for a concept to try a simple implementation from scratch, and so that\u2019s what we\u2019re going to do here.<\/p>\n\n\n\n<p>There are two main components to structured generation:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Defining a schema<\/li>\n\n\n\n<li>Parsing the output<\/li>\n<\/ul>\n\n\n\n<p>For the schema, I\u2019m going to use a context-free grammar (CFG). If you\u2019re unfamiliar, a grammar is a schema for parsing a language. Loosely, it defines what is and isn\u2019t considered \u201cvalid\u201d in a language. If you\u2019re in the mood for an <em>excellent<\/em> rabbit hole, context-free languages are a part of Chomsky\u2019s hierarchy of languages. The amazing Kay Lack has <a href=\"https:\/\/www.youtube.com\/watch?v=ENKT0Z3gldE\">a fantastic introductory video to grammars and parsing here<\/a>, if you\u2019re interested in learning more.<\/p>\n\n\n\n<p>The most popular library for parsing and constructing CFGs is Lark. 
In the code below, I\u2019ve written out a simple JSON grammar using the library:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>from lark import Lark\n\ngrammar = r\"\"\"\n?start: value\n\n?value: object\n       | array\n       | ESCAPED_STRING\n       | SIGNED_NUMBER      -&gt; number\n       | \"true\"             -&gt; true\n       | \"false\"            -&gt; false\n       | \"null\"             -&gt; null\n\narray  : \"&#91;\" &#91;value (\",\" value)*] &#91;\"]\"]\nobject : \"{\" &#91;pair (\",\" pair)*] &#91;\"}\"]\npair   : ESCAPED_STRING \":\" value\n\n%import common.ESCAPED_STRING\n%import common.SIGNED_NUMBER\n%import common.WS_INLINE\n%ignore WS_INLINE\n\"\"\"\n\nparser = Lark(grammar, start=\"start\", parser=\"lalr\", debug=True)\n<\/code><\/pre>\n\n\n\n<p>If you\u2019re not familiar with CFGs or Lark, the above might seem a little intimidating, but it\u2019s actually pretty straightforward. The <code>?start<\/code> line indicates that we begin with a <code>value<\/code>. We then define a <code>value<\/code> to be either an object, an array, an escaped string, a signed number, a boolean, or a null value. The <code>-&gt;<\/code> arrows assign an alias to each alternative, so the matching branch of the parse tree is labeled <code>number<\/code>, <code>true<\/code>, <code>false<\/code>, or <code>null<\/code>. We then further specify what we mean by <code>array<\/code>, <code>object<\/code>, and <code>pair<\/code>, before finally instructing our parser to ignore inline whitespace. Note that the closing bracket and brace are marked as optional, so a partially generated array or object still parses. Try to think of it as if we are constantly \u201cexpanding\u201d each high level concept, like a <code>start<\/code> or a <code>value<\/code>, into composite parts, until we reach such a low level of abstraction that we can no longer expand. In the parlance of grammars, these \u201ctoo low level to be expanded\u201d symbols are called \u201cterminals.\u201d<\/p>\n\n\n\n<p>One immediate issue you\u2019ll run into with the code above is that it only determines if a string is valid or invalid JSON. 
Since we\u2019re using a language model and generating one token at a time, we\u2019re going to have a lot of intermediary strings that are technically invalid. There are more elegant ways of handling this, but for the sake of speed, I\u2019m just going to define a simple function to check if we\u2019re in the middle of generating a string or not:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>def is_incomplete_string(input_string):\n    quote_count = input_string.count('\"')\n    if quote_count % 2 != 0:\n        return True\n    return False\n<\/code><\/pre>\n\n\n\n<p>With all of this defined, let\u2019s run a little test to see if our parser can accurately differentiate between valid, invalid, and incomplete JSON strings:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>from lark import UnexpectedCharacters, UnexpectedToken\n\n# We will use this method later in constraining our model output\ndef try_and_recover(json_string):\n    try:\n        parser.parse(json_string)\n        return {\"status\": \"valid\", \"message\": \"The JSON is valid.\"}\n    except UnexpectedToken as e:\n        return {\"status\": \"incomplete\", \"message\": f\"Incomplete JSON. Error: {str(e)}\"}\n    except UnexpectedCharacters as e:\n        if is_incomplete_string(json_string):\n            return {\"status\": \"incomplete\", \"message\": \"Incomplete string detected.\"}\n        return {\"status\": \"invalid\", \"message\": f\"Invalid JSON. Error: {str(e)}\"}\n    except Exception as e:\n        return {\"status\": \"invalid\", \"message\": f\"Unknown error. JSON is invalid. 
Error: {str(e)}\"}\n\n# Test cases\ntest_cases = &#91;\n    '{\"key\": \"value\", \"key2\": ',  # Incomplete JSON\n    '&#91;1, 2, 3',                   # Incomplete JSON\n    '{\"key\": \"value\"}',           # Complete JSON\n    'true',                       # Valid JSON\n    '{\"key\": true, \"nested\": {',  # Incomplete JSON\n    '{\"answer\": \"Paris',          # Incomplete JSON\n    'invalid syntax'              # Invalid JSON\n]\n\n# Test and display results\nresults = &#91;]\nfor test in test_cases:\n    result = try_and_recover(test)\n    results.append({\"input\": test, \"result\": result})\n\nfor test in results:\n  print(test)\n<\/code><\/pre>\n\n\n\n<pre class=\"wp-block-code\"><code>{'input': '{\"key\": \"value\", \"key2\": ', 'result': {'status': 'incomplete', 'message': \"...\"}}\n{'input': '&#91;1, 2, 3', 'result': {'status': 'valid', 'message': '...'}}\n{'input': '{\"key\": \"value\"}', 'result': {'status': 'valid', 'message': '...'}}\n{'input': 'true', 'result': {'status': 'valid', 'message': '...'}}\n{'input': '{\"key\": true, \"nested\": {', 'result': {'status': 'valid', 'message': '...'}}\n{'input': '{\"answer\": \"Paris', 'result': {'status': 'incomplete', 'message': '...'}}\n{'input': 'invalid syntax', 'result': {'status': 'invalid', 'message': \"...\"}}\n<\/code><\/pre>\n\n\n\n<p>And it works!<\/p>\n\n\n\n<p>As a final test, let\u2019s use this <code>try_and_recover()<\/code> function to guide our decoding process with a relatively smaller model. In the below code, we\u2019ll use an instruction-tuned Qwen 2.5 model with 3 billion parameters, and we\u2019ll ask it a simple question. 
First, let\u2019s initialize the model and tokenizer:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>from transformers import AutoModelForCausalLM, AutoTokenizer\nmodel_name = \"Qwen\/Qwen2.5-3B-Instruct\"\n\ntokenizer = AutoTokenizer.from_pretrained(model_name)\nmodel = AutoModelForCausalLM.from_pretrained(model_name, device_map=\"auto\")\n\n<\/code><\/pre>\n\n\n\n<p>Now, we want to define a function to iteratively sample from the model, using our <code>try_and_recover()<\/code> function to constrain the outputs. Below, I\u2019ve defined the function, which works by repeatedly taking the top 20 most likely next tokens and selecting the first one that yields a valid or incomplete JSON string:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>import torch\n\ndef sample_with_guidance(initial_text):\n    \"\"\"\n    Generates a structured response from the model, guided by a validation function.\n\n    Args:\n        initial_text (str): The initial input text to the model.\n\n    Returns:\n        str: The structured response generated by the model.\n    \"\"\"\n    response = \"\"  # Accumulate the response string here\n    next_token = None  # Placeholder for the next token\n\n    while next_token != tokenizer.eos_token:  # Continue until the end-of-sequence token is generated\n        # Encode the current input (initial_text + response) and move it to the model's device\n        input_ids = tokenizer.encode(initial_text + response, return_tensors=\"pt\").to(model.device)\n\n        with torch.no_grad():  # Disable gradients for inference\n            outputs = model(input_ids)\n\n            # Get the top 20 most likely next tokens\n            top_tokens = torch.topk(outputs.logits&#91;0, -1, :], 20, dim=-1).indices\n            candidate_tokens = tokenizer.batch_decode(top_tokens)\n\n        for token in candidate_tokens:\n            # Check if the token is the end-of-sequence token\n            if token == tokenizer.eos_token:\n                # Validate the current response to 
decide if we should finish\n                validation_result = try_and_recover(response)\n                if validation_result&#91;'status'] == 'valid':  # Finish if the response is valid\n                    next_token = token\n                    break\n                else:\n                    continue  # Skip to the next token if invalid\n\n            # Simulate appending the token to the response\n            extended_response = response + token\n\n            # Validate the extended response\n            validation_result = try_and_recover(extended_response)\n            if validation_result&#91;'status'] in {'valid', 'incomplete'}:\n                # Update the response and set the token as the next token\n                response = extended_response\n                next_token = token\n                print(response)  # Just to see our intermediate outputs\n                break\n\n    return response\n\n<\/code><\/pre>\n\n\n\n<p>This isn\u2019t the most performant or robust approach, but it works well enough for our purposes. If you want a better look at more optimal approaches, you can see how <a href=\"https:\/\/github.com\/ggerganov\/llama.cpp\/blob\/master\/grammars\/README.md\">llama.cpp implements structured generation<\/a>, or how a library like <a href=\"https:\/\/github.com\/dottxt-ai\/outlines\">Outlines handles things<\/a>.<\/p>\n\n\n\n<p>With the following code, we can test the performance of this structured generation function:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>import json\n\nmessages = &#91;\n    {\n\t    \"role\": \"user\",\n\t    \"content\": \"What is the capital of France? 
Please only answer using the following JSON schema: { \\\"answer\\\": str }.\"\n\t    }\n]\n\n# Format the text for our particular model\ninput_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)\n\noutput = sample_with_guidance(input_text)\n\nprint(\"Parsed JSON Object:\")\nprint(json.loads(output))\n<\/code><\/pre>\n\n\n\n<pre class=\"wp-block-code\"><code>{\n{ \"\n{ \"answer\n{ \"answer\":\n{ \"answer\": \"\n{ \"answer\": \"Paris\n{ \"answer\": \"Paris\"\n{ \"answer\": \"Paris\" }\n\nParsed JSON Object:\n{'answer': 'Paris'}\n<\/code><\/pre>\n\n\n\n<p>This particular approach will obviously add some computational overhead to your code, but some of the more optimized implementations are actually capable of structuring the output of a model with minimal latency impact. Below is a side-by-side comparison of unstructured generation versus structured generation using llama.cpp\u2019s grammar-structured generation feature:<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter\"><img loading=\"lazy\" decoding=\"async\" width=\"1000\" height=\"750\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/11\/structured_and_unstructured_generation-final-2.gif\" alt=\"\" class=\"wp-image-12082\"\/><\/figure>\n\n\n\n<p><\/p>\n\n\n\n<p>This comparison was recorded by Brandon Willard from .txt (the company behind Outlines), as part of <a href=\"https:\/\/blog.dottxt.co\/how-fast-cfg.html\">his fantastic article on latency in structured generation<\/a>. 
I\u2019d highly recommend giving it a read if you\u2019re interested in diving deeper into the field.<\/p>\n\n\n\n<p>Alright, with that bit of introduction out of the way, let\u2019s look at applying structured generation to an LLM-as-a-judge metric, like hallucination.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-how-to-detect-hallucinations-with-structured-generation\">How to detect hallucinations with structured generation<\/h2>\n\n\n\n<p>Hallucination detection is one of the \u201cclassic\u201d applications of LLM-based evaluation. Traditional heuristic methods struggle with the subtlety of hallucination, in no small part because there is no universally agreed upon definition of \u201challucination.\u201d For the purposes of this article, we\u2019re going to use a definition from a <a href=\"https:\/\/arxiv.org\/html\/2403.16527v1\">recent paper out of the University of Illinois Urbana-Champaign<\/a>, which I find to be descriptive and usable:<\/p>\n\n\n\n<p><em>A hallucination is a generated output from a model that conflicts with constraints or deviates from desired behavior in actual deployment, or is completely irrelevant to the task at hand, but could be deemed syntactically plausible under the circumstances.<\/em><\/p>\n\n\n\n<p>In other words, a hallucination is an output that seems plausible. It is grammatically correct, it makes reference to its surrounding context, and it seems to fit the \u201cflow\u201d of the task. It also, however, contradicts some basic instruction of the task. This could mean drawing incorrect conclusions, citing nonexistent data, or completely ignoring the actual instructions of the task.<\/p>\n\n\n\n<p>Obviously, encoding a discrete system of rules to parse outputs for something as ambiguous as hallucinations is a challenge. LLMs, however, are very well suited to this kind of complex task.<\/p>\n\n\n\n<p>Using an LLM to perform hallucination analysis isn\u2019t too difficult to set up. 
All we need to do is prompt the model to analyze the output text for hallucinations. In <a href=\"https:\/\/github.com\/comet-ml\/opik\">Opik\u2019s built-in Hallucination() metric<\/a>, we use the following prompt:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>\ncontext_hallucination_template = \"\"\"You are an expert judge tasked with evaluating the faithfulness of an AI-generated answer to the given context. Analyze the provided INPUT, CONTEXT, and OUTPUT to determine if the OUTPUT contains any hallucinations or unfaithful information.\n\nGuidelines:\n1. The OUTPUT must not introduce new information beyond what's provided in the CONTEXT.\n2. The OUTPUT must not contradict any information given in the CONTEXT.\n2. The OUTPUT should not contradict well-established facts or general knowledge.\n3. Ignore the INPUT when evaluating faithfulness; it's provided for context only.\n4. Consider partial hallucinations where some information is correct but other parts are not.\n5. Pay close attention to the subject of statements. Ensure that attributes, actions, or dates are correctly associated with the right entities (e.g., a person vs. a TV show they star in).\n6. Be vigilant for subtle misattributions or conflations of information, even if the date or other details are correct.\n7. 
Check that the OUTPUT doesn't oversimplify or generalize information in a way that changes its meaning or accuracy.\n\nAnalyze the text thoroughly and assign a hallucination score between 0 and 1, where:\n- 0.0: The OUTPUT is entirely faithful to the CONTEXT\n- 1.0: The OUTPUT is entirely unfaithful to the CONTEXT\n\nINPUT (for context only, not to be used for faithfulness evaluation):\n{input}\n\nCONTEXT:\n{context}\n\nOUTPUT:\n{output}\n\nProvide your verdict in JSON format:\n{{\n    \"score\": &lt;your score between 0.0 and 1.0&gt;,\n    \"reason\": &#91;\n        &lt;list your reasoning as bullet points&gt;\n    ]\n}}\"\"\"\n\n<\/code><\/pre>\n\n\n\n<p>The difficult part, however, is performing this analysis programmatically. In a real-world setting, we\u2019ll want to automatically parse the output of our model and collect the hallucination scores, either as part of our model evaluation or as part of our inference pipeline. Doing this will require us to write code that acts on the model outputs, and if the LLM responds with incorrectly formatted output, the evaluation will break.<\/p>\n\n\n\n<p>This is a problem even for state-of-the-art foundation models, but it is greatly exacerbated when working with smaller language models. Their outputs are probabilistic, and no matter how thorough you are in your prompt, there is no guarantee that they will always respond with the correct structure.<\/p>\n\n\n\n<p><em>Unless<\/em>, of course, you use structured generation.<\/p>\n\n\n\n<p>Let\u2019s run through a simple example using Outlines and Opik. First, we want to initialize our model using Outlines. In this example, we\u2019ll be using the 0.5 billion parameter version of Qwen2.5. 
While this model is impressive for its size, and small enough for us to run quickly in a Colab notebook, you will likely want to use a larger model for more accurate results.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>import outlines\n\nmodel_kwargs = {\n    \"device_map\": \"auto\"\n}\n\nmodel = outlines.models.transformers(\"Qwen\/Qwen2.5-0.5B-Instruct\", model_kwargs=model_kwargs)\n<\/code><\/pre>\n\n\n\n<p>When your model finishes downloading, you can then create a <code>generator<\/code>. In Outlines, a <code>generator<\/code> is an inference pipeline that combines an output schema with a model. In the code below, we\u2019ll define a schema in Pydantic and initialize our generator:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>import pydantic\nfrom typing import List\n\nclass HallucinationResponse(pydantic.BaseModel):\n    score: int\n    reason: List&#91;str]\n\ngenerator = outlines.generate.json(model, HallucinationResponse)\n<\/code><\/pre>\n\n\n\n<p>Now, if we pass a string into the generator, it will return a properly formatted <code>HallucinationResponse<\/code> object. Note that the schema declares <code>score<\/code> as an integer, so the judge returns a whole number rather than a fractional score, even though the prompt describes a range from 0.0 to 1.0.<\/p>\n\n\n\n<p>Next, let\u2019s set up our Hallucination metric in Opik. 
It\u2019s pretty straightforward to create a metric using Opik\u2019s <code>BaseMetric<\/code> class:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>from typing import Optional, List, Any\nfrom opik.evaluation.metrics import base_metric\n\nclass HallucinationWithOutlines(base_metric.BaseMetric):\n    \"\"\"\n    A metric that evaluates whether an LLM's output contains hallucinations based on given input and context.\n    \"\"\"\n\n    def __init__(\n        self,\n        name: str = \"hallucination_metric\",\n    ):\n        super().__init__(name=name)\n\n    def score(\n        self,\n        input: str,\n        output: str,\n        context: Optional&#91;List&#91;str]] = None,\n        **ignored_kwargs: Any,\n    ) -&gt; HallucinationResponse:\n        \"\"\"\n        Calculate the hallucination score for the given input, output, and optional context field.\n\n        Args:\n            input: The original input\/question.\n            output: The LLM's output to evaluate.\n            context: A list of context strings. If not provided, the presence of hallucinations will be evaluated based on the output only.\n            **ignored_kwargs: Additional keyword arguments that are ignored.\n\n        Returns:\n            HallucinationResponse: A HallucinationResponse object with a score of 1.0 if hallucination\n                is detected, 0.0 otherwise, along with the reason for the verdict.\n        \"\"\"\n        llm_query = context_hallucination_template.format(input=input, output=output, context=context)\n\n        with torch.no_grad():\n            return generator(llm_query)\n\n<\/code><\/pre>\n\n\n\n<p>All we really do in the above is generate our prompt using the previously defined template string, and then pass it into our generator.<\/p>\n\n\n\n<p>Now, let\u2019s try out our metric on an actual hallucination dataset, to get a sense of how it works. 
We\u2019ll use a split from Patronus\u2019s HaluBench dataset, which is freely available via HuggingFace, and we\u2019ll upload it as an Opik Dataset for our experiments. We\u2019ll use a little extra logic to make sure the dataset is balanced between hallucinated and non-hallucinated samples:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>import opik\nimport pandas as pd\n\nclient = opik.Opik()\n\n# Create dataset\ndataset = client.get_or_create_dataset(\n    name=\"HaluBench balanced\", description=\"HaluBench dataset balanced\"\n)\n\n# Insert items into dataset\ndf = pd.read_parquet(\n    \"hf:\/\/datasets\/PatronusAI\/HaluBench\/data\/test-00000-of-00001.parquet\"\n)\n# Sample equal number of PASS and FAIL records\nn_per_class = 25  # 25 each to get 50 total\ndf_balanced = pd.concat(&#91;\n    df&#91;df&#91;'label'] == 'PASS'].sample(n=n_per_class, random_state=42),\n    df&#91;df&#91;'label'] == 'FAIL'].sample(n=n_per_class, random_state=42)\n])\ndf = df_balanced\n\ndataset_records = &#91;\n    {\n        \"input\": x&#91;\"question\"],\n        \"context\": &#91;x&#91;\"passage\"]],\n        \"output\": x&#91;\"answer\"],\n        \"expected_output\": x&#91;\"label\"],\n    }\n    for x in df.to_dict(orient=\"records\")\n]\n\ndataset.insert(dataset_records)\n<\/code><\/pre>\n\n\n\n<p>And now, we simply define an evaluation task using our HallucinationWithOutlines() metric, and run it against our dataset:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>from opik.evaluation import evaluate\nfrom opik.evaluation.metrics import Equals\nfrom typing import Dict\n\n# Define the evaluation task\ndef evaluation_task(x: Dict):\n    metric = HallucinationWithOutlines()\n    try:\n        metric_score = metric.score(\n            input=x&#91;\"input\"], context=x&#91;\"context\"], output=x&#91;\"output\"]\n        )\n        hallucination_score = metric_score.score\n        hallucination_reason = metric_score.reason\n    except Exception as e:\n        print(e)\n        
hallucination_score = None\n        hallucination_reason = str(e)\n\n    return {\n        \"output\": \"FAIL\" if hallucination_score == 1 else \"PASS\",\n        \"hallucination_reason\": hallucination_reason,\n        \"reference\": x&#91;\"expected_output\"],\n    }\n\n# Define the scoring metric\ncheck_hallucinated_metric = Equals(name=\"Correct hallucination score\")\n\nres = evaluate(\n    dataset=dataset,\n    task=evaluation_task,\n    scoring_metrics=&#91;check_hallucinated_metric],\n)\n<\/code><\/pre>\n\n\n\n<pre class=\"wp-block-code\"><code>Evaluation: 100%|\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588| 50\/50 &#91;02:38&lt;00:00,  3.18s\/it]\n\u256d\u2500 HaluBench22 (50 samples) \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e\n\u2502                                           \u2502\n\u2502 Total time:        00:02:39               \u2502\n\u2502 Number of samples: 50                     \u2502\n\u2502                                           \u2502\n\u2502 Correct hallucination score: 0.5600 (avg) \u2502\n\u2502                                           \u2502\n\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\nUploading results to Opik ...\nView the results in your Opik dashboard.\n<\/code><\/pre>\n\n\n\n<p>And that\u2019s all it takes! Notice that none of our samples failed because of improperly structured outputs. Let\u2019s try running this same evaluation, but without structured generation. 
To achieve this, we can switch our generator type:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>generator = outlines.generate.text(model)\n<\/code><\/pre>\n\n\n\n<p>And modify our metric to parse JSON from the model output:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>from typing import Optional, List, Any\nfrom opik.evaluation.metrics import base_metric\nimport json\n\nclass HallucinationUnstructured(base_metric.BaseMetric):\n    \"\"\"\n    A metric that evaluates whether an LLM's output contains hallucinations based on given input and context.\n    \"\"\"\n\n    def __init__(\n        self,\n        name: str = \"hallucination_metric\",\n    ):\n        super().__init__(name=name)\n\n    def score(\n        self,\n        input: str,\n        output: str,\n        context: Optional&#91;List&#91;str]] = None,\n        **ignored_kwargs: Any,\n    ) -&gt; HallucinationResponse:\n        \"\"\"\n        Calculate the hallucination score for the given input, output, and optional context field.\n\n        Args:\n            input: The original input\/question.\n            output: The LLM's output to evaluate.\n            context: A list of context strings. 
If not provided, the presence of hallucinations will be evaluated based on the output only.\n            **ignored_kwargs: Additional keyword arguments that are ignored.\n\n        Returns:\n            HallucinationResponse: A HallucinationResponse object with a score of 1.0 if hallucination\n                is detected, 0.0 otherwise, along with the reason for the verdict.\n        \"\"\"\n        llm_query = context_hallucination_template.format(input=input, output=output, context=context)\n\n        with torch.no_grad():\n            # Parse the JSON string from the response and validate it against the schema\n            return HallucinationResponse(**json.loads(generator(llm_query)))\n\n<\/code><\/pre>\n\n\n\n<p>Swapping this metric into the evaluation task, keeping the rest of the code the same, and running it now results in:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>Evaluation:   0%|          | 0\/50 &#91;00:00&lt;?, ?it\/s]Unterminated string starting at: line 5 column 9 (char 47)\nEvaluation:   2%|\u258f         | 1\/50 &#91;00:56&lt;46:15, 56.63s\/it]Expecting value: line 1 column 2 (char 1)\nExpecting value: line 1 column 2 (char 1)\nEvaluation:   6%|\u258c         | 3\/50 &#91;00:57&lt;10:09, 12.96s\/it]Unterminated string starting at: line 4 column 9 (char 45)\nExpecting value: line 1 column 2 (char 1)\nEvaluation:  12%|\u2588\u258f        | 6\/50 &#91;00:57&lt;03:01,  4.12s\/it]Unterminated string starting at: line 4 column 9 (char 45)\n...\n<\/code><\/pre>\n\n\n\n<p>Nearly every string fails to parse correctly. Inference time also increases dramatically because response lengths vary, whereas structured output keeps the responses terse.<\/p>\n\n\n\n<p>Without structured generation, it just isn\u2019t feasible to run this kind of evaluation, especially with a model this small. 
As an experiment, try running this same code with a bigger model and see how the average accuracy score improves.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-can-we-build-more-complex-llm-judges-with-structured-generation\">Can we build more complex LLM judges with structured generation?<\/h2>\n\n\n\n<p>The above example of hallucination detection is pretty straightforward. The real value that structured generation brings to LLM judges, however, is that it enables us to build more complex, multi-turn evaluations.<\/p>\n\n\n\n<p>To give an extreme example of what a multi-step evaluation might look like, one recent paper found success in LLM evals by constructing multiple \u201cpersonas\u201d for different <a href=\"https:\/\/www.comet.com\/site\/blog\/ai-agents\/\">AI agents<\/a>, and having the <a href=\"https:\/\/arxiv.org\/html\/2405.20267v4\">agents debate in an actual courtroom structure<\/a>:<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter\"><img loading=\"lazy\" decoding=\"async\" width=\"300\" height=\"159\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/11\/image-3-300x159.png\" alt=\"\" class=\"wp-image-12085\" srcset=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/11\/image-3-300x159.png 300w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/11\/image-3-1024x542.png 1024w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/11\/image-3-768x406.png 768w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/11\/image-3.png 1130w\" sizes=\"auto, (max-width: 300px) 100vw, 300px\" \/><\/figure>\n\n\n\n<p><\/p>\n\n\n\n<p>Forcing different agents to advocate for different positions and examine each other\u2019s arguments, all while having yet another agent act as a \u201cjudge\u201d to emit a final decision, significantly increased the accuracy of evaluations.<\/p>\n\n\n\n<p>In order for such a system to work, the handoffs between different agents must go smoothly. 
If an agent needs to pick between 5 possible actions, we need to be 100% sure that the model will only output one of those 5 valid actions. With structured generation, we can achieve that level of reliability.<\/p>\n\n\n\n<p>Let\u2019s try a worked example, extending our hallucination metric from earlier. We\u2019ll try the following improvement:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>On first pass, the model will generate 3 candidate hallucinations, with reasoning for each.<\/li>\n\n\n\n<li>For each candidate, the model will evaluate them individually and assess if they are a hallucination, with expanded reasoning.<\/li>\n\n\n\n<li>If the model finds any candidate to be a hallucination, it will return 1.0 for the entire sample.<\/li>\n<\/ul>\n\n\n\n<p>By giving the model the ability to generate longer chains of context, we give it space for more \u201cintermediary computation,\u201d and hopefully, a more accurate final output.<\/p>\n\n\n\n<p>First, let\u2019s define a series of prompts for this task:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>generate_candidates_prompt = \"\"\"\nYou are an expert judge tasked with evaluating the faithfulness of an AI-generated answer to a given context. Your goal is to determine if the provided output contains any hallucinations or unfaithful information when compared to the given context.\n\nHere are the key elements you'll be working with:\n\n1. &lt;context&gt;{context}&lt;\/context&gt;\n   This is the factual information against which you must evaluate the output. All judgments of faithfulness must be based solely on this context.\n\n2. &lt;output&gt;{output}&lt;\/output&gt;\n   This is the AI-generated answer that you need to evaluate for faithfulness.\n\n3. &lt;input&gt;{input}&lt;\/input&gt;\n   This is the original question or prompt. It's provided for context only and should not be used in your faithfulness evaluation.\n\nEvaluation Process:\n1. Carefully read the CONTEXT and OUTPUT.\n2. 
Analyze the OUTPUT for any discrepancies or additions when compared to the CONTEXT.\n3. Consider the following aspects:\n   - Does the OUTPUT introduce any new information not present in the CONTEXT?\n   - Does the OUTPUT contradict any information given in the CONTEXT?\n   - Does the OUTPUT contradict well-established facts or general knowledge?\n   - Are there any partial hallucinations where some information is correct but other parts are not?\n   - Is the subject of statements correct? Ensure that attributes, actions, or dates are correctly associated with the right entities.\n   - Are there any subtle misattributions or conflations of information, even if dates or other details are correct?\n   - Does the OUTPUT oversimplify or generalize information in a way that changes its meaning or accuracy?\n\n4. Based on your analysis, create a list of 3 statements in the OUTPUT which are potentially hallucinations or unfaithful. For each potentially hallucinated or unfaithful statement from the OUTPUT, explain why you think it violates any of the aspects from step 3.\n\n5. Return your list of statements and associated reasons in the following structured format:\n\n{{\n  \"potential_hallucinations\": &#91;\n    {{\n      \"output_statement\": string,\n      \"reasoning\": string,\n    }},\n  ]\n}}\n\nHere is an example output structure (do not use these specific values, this is just to illustrate the format):\n\n{{\n  \"potential_hallucinations\": &#91;\n    {{\n      \"output_statement\": \"The company was founded in 1995\",\n      \"reasoning\": \"There is no mention of a founding date in the CONTEXT. The OUTPUT introduces new information not present in the CONTEXT.\"\n    }},\n    {{\n      \"output_statement\": \"The product costs $49.99.\",\n      \"reasoning\": \"The CONTEXT lists the flagship product price at $39.99. 
The OUTPUT directly contradicts the price given in the CONTEXT.\"\n    }},\n    {{\n      \"output_statement\": \"The flagship product was their most expensive item.\",\n      \"reasoning\": \"The CONTEXT mentions another product which is more expensive than the flagship product. The OUTPUT directly contradicts information given in the CONTEXT.\"\n    }}\n  ]\n}}\n\nNow, please proceed with your analysis and evaluation of the provided INPUT, CONTEXT, and OUTPUT.\n\"\"\"\n\nevaluate_candidate_prompt = \"\"\"\nPlease examine the following potential hallucination you detected in the OUTPUT:\n\n{candidate}\n\nYou explained your reasons for flagging the statement like so:\n\n{reason}\n\nAs a reminder, the CONTEXT you are evaluating the statement against is:\n\n{context}\n\nBased on the above, could you answer \"yes\" to any of the following questions?\n  - Does the OUTPUT introduce any new information not present in the CONTEXT?\n  - Does the OUTPUT contradict any information given in the CONTEXT?\n  - Does the OUTPUT contradict well-established facts or general knowledge?\n  - Are there any partial hallucinations where some information is correct but other parts are not?\n  - Is the subject of statements correct? 
Ensure that attributes, actions, or dates are correctly associated with the right entities.\n  - Are there any subtle misattributions or conflations of information, even if dates or other details are correct?\n  - Does the OUTPUT oversimplify or generalize information in a way that changes its meaning or accuracy?\n\nPlease score the potentially hallucinated statement using the following scale:\n\n  - 1.0 if you answered \"yes\" to any of the previous questions, and you believe the statement is hallucinated or unfaithful to the CONTEXT.\n  - 0.0 if you answered \"no\" to all of the previous questions, and after further reflection, you believe the statement is not hallucinated or unfaithful to the CONTEXT.\n\nPlease structure your response using the following format:\n\n{{\n  \"score\": float,\n  \"reason\": string\n}}\n\nHere is an example output structure (do not use these specific values, this is just to illustrate the format):\n\n{{\n  \"score\": 1.0,\n  \"reason\": \"The CONTEXT and OUTPUT list different prices for the same product. 
This leads me to answer 'yes' to the question, 'Does the OUTPUT contradict any information given in the CONTEXT?'\"\n}}\n\nNow, please proceed with your analysis and evaluation.\n\n\"\"\"\n\n<\/code><\/pre>\n\n\n\n<p>And now, we can define some Pydantic models for our different model outputs:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code># Generated by generate_candidates_prompt\nclass PotentialHallucination(pydantic.BaseModel):\n    output_statement: str\n    reasoning: str\n\nclass HallucinationCandidates(pydantic.BaseModel):\n    potential_hallucinations: List&#91;PotentialHallucination]\n\n# Generated by evaluate_candidate_prompt\nclass HallucinationScore(pydantic.BaseModel):\n    score: float\n    reason: str\n<\/code><\/pre>\n\n\n\n<p>With all of this, we can put together two generators, one for generating candidate hallucinations, and one for scoring individual candidates:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>import outlines\n\nmodel_kwargs = {\n    \"device_map\": \"auto\"\n}\n\nmodel = outlines.models.transformers(\"Qwen\/Qwen2.5-0.5B-Instruct\", model_kwargs=model_kwargs)\n\ncandidate_generator = outlines.generate.json(model, HallucinationCandidates)\ngenerator = outlines.generate.json(model, HallucinationScore)\n<\/code><\/pre>\n\n\n\n<p>Finally, we can construct an Opik metric. 
We\u2019ll keep the code for this simple:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>\nclass HallucinationMultistep(base_metric.BaseMetric):\n    \"\"\"\n    A metric that evaluates whether an LLM's output contains hallucinations using a multi-step approach.\n    \"\"\"\n\n    def __init__(\n        self,\n        name: str = \"hallucination_metric\",\n    ):\n        super().__init__(name=name)\n\n    def score(\n        self,\n        input: str,\n        output: str,\n        context: Optional&#91;List&#91;str]] = None,\n        **ignored_kwargs: Any,\n    ) -&gt; HallucinationScore:\n        # Generate candidates\n        candidates_query = generate_candidates_prompt.format(input=input, output=output, context=context)\n        candidates = candidate_generator(candidates_query)\n\n        # Initialize to zero, in case the model simply finds no candidates for hallucination\n        score = HallucinationScore(score=0.0, reason=\"Found no candidates for hallucination\")\n\n        for candidate in candidates.potential_hallucinations:\n            followup_query = evaluate_candidate_prompt.format(candidate=candidate.output_statement, reason=candidate.reasoning, context=context)\n            new_score = generator(followup_query)\n            score = new_score\n            if new_score.score &gt; 0.0:\n                # Early return if we find a hallucination\n                return new_score\n\n        return score\n\n<\/code><\/pre>\n\n\n\n<p>All we do here is format the first prompt, which should produce several hallucination candidates when fed to the candidate generator. 
Then, we pass each candidate (formatted with the candidate evaluation prompt) into the candidate evaluation generator.<\/p>\n\n\n\n<p>If we run it using the same code as before, with slight modifications to use the new metric:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>\n# Define the evaluation task\ndef evaluation_task(x: Dict):\n\t\t# Use new metric\n    metric = HallucinationMultistep()\n    try:\n        metric_score = metric.score(\n            input=x&#91;\"input\"], context=x&#91;\"context\"], output=x&#91;\"output\"]\n        )\n        hallucination_score = metric_score.score\n        hallucination_reason = metric_score.reason\n    except Exception as e:\n        print(e)\n        hallucination_score = None\n        hallucination_reason = str(e)\n\n    return {\n        \"output\": \"FAIL\" if hallucination_score == 1 else \"PASS\",\n        \"hallucination_reason\": hallucination_reason,\n        \"reference\": x&#91;\"expected_output\"],\n    }\n\n# Define the scoring metric\ncheck_hallucinated_metric = Equals(name=\"Correct hallucination score\")\n\nres = evaluate(\n    dataset=dataset,\n    task=evaluation_task,\n    scoring_metrics=&#91;check_hallucinated_metric],\n)\n\n<\/code><\/pre>\n\n\n\n<pre class=\"wp-block-code\"><code>Evaluation: 100%|\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588| 50\/50 &#91;02:42&lt;00:00,  3.26s\/it]\n\u256d\u2500 HaluBench22 (50 samples) \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e\n\u2502                                           \u2502\n\u2502 Total time:        00:02:43               \u2502\n\u2502 Number of samples: 50                     \u2502\n\u2502                                           \u2502\n\u2502 Correct hallucination score: 0.6800 (avg) \u2502\n\u2502                                           
\u2502\n\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\n<\/code><\/pre>\n\n\n\n<p>We see a <em>substantial<\/em> improvement. Remember that running this same model, with a very similar initial prompt, on this same dataset, resulted in a score of 0.56. By simply adding this additional candidate evaluation step, we increased the score to 0.68, a gain of 12 percentage points on this 50-sample dataset. For such a small model, this is great!<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-structured-generation-s-role-in-the-future-of-llm-evaluations\">Structured generation\u2019s role in the future of LLM evaluations<\/h2>\n\n\n\n<p>Most foundation model providers, like OpenAI and Anthropic, offer some kind of <code>structured output<\/code> mode that returns responses conforming to a predefined schema. However, the world of <a href=\"https:\/\/www.comet.com\/site\/blog\/llm-evaluation-guide\/\">LLM evaluation<\/a> extends well beyond the closed ecosystems of these providers\u2019 APIs.<\/p>\n\n\n\n<p>For example:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>So-called \u201cwhite box\u201d evaluations, which incorporate models\u2019 internal states into the evaluation, are impossible with hosted models like GPT-4o.<\/li>\n\n\n\n<li>Fine-tuning a model for your specific evaluation use case generally means working with open source models.<\/li>\n\n\n\n<li>If you need to run your evaluation pipeline locally, you obviously cannot use a hosted API.<\/li>\n<\/ul>\n\n\n\n<p>And that\u2019s without getting into comparisons of particular open source models against popular foundation models.<\/p>\n\n\n\n<p>The future of LLM evaluations involves more complex evaluation suites, combining white box metrics, classic heuristic methods, LLM judges, and <a 
href=\"https:\/\/www.comet.com\/site\/blog\/llm-juries-for-evaluation\/\">LLM juries<\/a> into robust, multi-turn systems. Open source, or at the very least, locally-available LLMs are a major part of that future. Structured generation is a fundamental part of the infrastructure that is enabling that future, along with open source <a href=\"https:\/\/www.comet.com\/site\/blog\/llm-evaluation-frameworks\/\">LLM evaluation frameworks<\/a> like <a href=\"https:\/\/github.com\/comet-ml\/opik\">Opik<\/a>.<\/p>\n\n\n\n<p><\/p>\n","protected":false},"excerpt":{"rendered":"<p>For the past few months, I\u2019ve been working on LLM-based evaluations (\u201dLLM-as-a-Judge\u201d metrics) for language models. The results have so far been extremely encouraging, particularly for evaluations like hallucination detection or content moderation, which are hard to quantify with heuristic methods. Engineering LLM evaluation metrics, however, has been surprisingly challenging. Evaluations and unit tests, especially [&hellip;]<\/p>\n","protected":false},"author":25,"featured_media":11803,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"customer_name":"","customer_description":"","customer_industry":"","customer_technologies":"","customer_logo":"","footnotes":""},"categories":[65,1],"tags":[],"coauthors":[142],"class_list":["post-12079","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-llmops","category-uncategorized"],"yoast_head":"<!-- This site is optimized with the Yoast SEO Premium plugin v25.9 (Yoast SEO v25.9) - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Structured Generation for LLM-as-a-Judge Evaluations - Comet<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.comet.com\/site\/blog\/structured-generation-llm-as-a-judge\/\" \/>\n<meta 
property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Structured Generation for LLM-as-a-Judge Evaluations\" \/>\n<meta property=\"og:description\" content=\"For the past few months, I\u2019ve been working on LLM-based evaluations (\u201dLLM-as-a-Judge\u201d metrics) for language models. The results have so far been extremely encouraging, particularly for evaluations like hallucination detection or content moderation, which are hard to quantify with heuristic methods. Engineering LLM evaluation metrics, however, has been surprisingly challenging. Evaluations and unit tests, especially [&hellip;]\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.comet.com\/site\/blog\/structured-generation-llm-as-a-judge\/\" \/>\n<meta property=\"og:site_name\" content=\"Comet\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/cometdotml\" \/>\n<meta property=\"article:published_time\" content=\"2024-11-27T20:40:16+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2025-11-13T21:25:36+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/10\/Logo-Final.png\" \/>\n\t<meta property=\"og:image:width\" content=\"1551\" \/>\n\t<meta property=\"og:image:height\" content=\"527\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"Caleb Kaiser\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@KaiserFrose\" \/>\n<meta name=\"twitter:site\" content=\"@Cometml\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Caleb Kaiser\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"11 minutes\" \/>\n<!-- \/ Yoast SEO Premium plugin. 
-->","yoast_head_json":{"title":"Structured Generation for LLM-as-a-Judge Evaluations - Comet","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.comet.com\/site\/blog\/structured-generation-llm-as-a-judge\/","og_locale":"en_US","og_type":"article","og_title":"Structured Generation for LLM-as-a-Judge Evaluations","og_description":"For the past few months, I\u2019ve been working on LLM-based evaluations (\u201dLLM-as-a-Judge\u201d metrics) for language models. The results have so far been extremely encouraging, particularly for evaluations like hallucination detection or content moderation, which are hard to quantify with heuristic methods. Engineering LLM evaluation metrics, however, has been surprisingly challenging. Evaluations and unit tests, especially [&hellip;]","og_url":"https:\/\/www.comet.com\/site\/blog\/structured-generation-llm-as-a-judge\/","og_site_name":"Comet","article_publisher":"https:\/\/www.facebook.com\/cometdotml","article_published_time":"2024-11-27T20:40:16+00:00","article_modified_time":"2025-11-13T21:25:36+00:00","og_image":[{"width":1551,"height":527,"url":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/10\/Logo-Final.png","type":"image\/png"}],"author":"Caleb Kaiser","twitter_card":"summary_large_image","twitter_creator":"@KaiserFrose","twitter_site":"@Cometml","twitter_misc":{"Written by":"Caleb Kaiser","Est. 
reading time":"11 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/www.comet.com\/site\/blog\/structured-generation-llm-as-a-judge\/#article","isPartOf":{"@id":"https:\/\/www.comet.com\/site\/blog\/structured-generation-llm-as-a-judge\/"},"author":{"name":"Caleb Kaiser","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/baa7ccdd5a25dfa5618749d6c504d203"},"headline":"Structured Generation for LLM-as-a-Judge Evaluations","datePublished":"2024-11-27T20:40:16+00:00","dateModified":"2025-11-13T21:25:36+00:00","mainEntityOfPage":{"@id":"https:\/\/www.comet.com\/site\/blog\/structured-generation-llm-as-a-judge\/"},"wordCount":2416,"publisher":{"@id":"https:\/\/www.comet.com\/site\/#organization"},"image":{"@id":"https:\/\/www.comet.com\/site\/blog\/structured-generation-llm-as-a-judge\/#primaryimage"},"thumbnailUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/10\/Logo-Final.png","articleSection":["LLMOps"],"inLanguage":"en-US"},{"@type":"WebPage","@id":"https:\/\/www.comet.com\/site\/blog\/structured-generation-llm-as-a-judge\/","url":"https:\/\/www.comet.com\/site\/blog\/structured-generation-llm-as-a-judge\/","name":"Structured Generation for LLM-as-a-Judge Evaluations - 
Comet","isPartOf":{"@id":"https:\/\/www.comet.com\/site\/#website"},"primaryImageOfPage":{"@id":"https:\/\/www.comet.com\/site\/blog\/structured-generation-llm-as-a-judge\/#primaryimage"},"image":{"@id":"https:\/\/www.comet.com\/site\/blog\/structured-generation-llm-as-a-judge\/#primaryimage"},"thumbnailUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/10\/Logo-Final.png","datePublished":"2024-11-27T20:40:16+00:00","dateModified":"2025-11-13T21:25:36+00:00","breadcrumb":{"@id":"https:\/\/www.comet.com\/site\/blog\/structured-generation-llm-as-a-judge\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.comet.com\/site\/blog\/structured-generation-llm-as-a-judge\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/blog\/structured-generation-llm-as-a-judge\/#primaryimage","url":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/10\/Logo-Final.png","contentUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/10\/Logo-Final.png","width":1551,"height":527,"caption":"opik logo"},{"@type":"BreadcrumbList","@id":"https:\/\/www.comet.com\/site\/blog\/structured-generation-llm-as-a-judge\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.comet.com\/site\/"},{"@type":"ListItem","position":2,"name":"Structured Generation for LLM-as-a-Judge Evaluations"}]},{"@type":"WebSite","@id":"https:\/\/www.comet.com\/site\/#website","url":"https:\/\/www.comet.com\/site\/","name":"Comet","description":"Build Better Models 
Faster","publisher":{"@id":"https:\/\/www.comet.com\/site\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.comet.com\/site\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/www.comet.com\/site\/#organization","name":"Comet ML, Inc.","alternateName":"Comet","url":"https:\/\/www.comet.com\/site\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/#\/schema\/logo\/image\/","url":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/01\/logo_comet_square.png","contentUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/01\/logo_comet_square.png","width":310,"height":310,"caption":"Comet ML, Inc."},"image":{"@id":"https:\/\/www.comet.com\/site\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/cometdotml","https:\/\/x.com\/Cometml","https:\/\/www.youtube.com\/channel\/UCmN63HKvfXSCS-UwVwmK8Hw"]},{"@type":"Person","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/baa7ccdd5a25dfa5618749d6c504d203","name":"Caleb Kaiser","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/image\/3a75e34ba4e2ba18dd960aae0d6d022a","url":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/01\/cropped-Caleb-Kaiser-96x96.jpeg","contentUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/01\/cropped-Caleb-Kaiser-96x96.jpeg","caption":"Caleb 
Kaiser"},"sameAs":["https:\/\/x.com\/KaiserFrose"],"url":"https:\/\/www.comet.com\/site\/blog\/author\/calebcomet-com\/"}]}},"_links":{"self":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/12079","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/users\/25"}],"replies":[{"embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/comments?post=12079"}],"version-history":[{"count":2,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/12079\/revisions"}],"predecessor-version":[{"id":18439,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/12079\/revisions\/18439"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/media\/11803"}],"wp:attachment":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/media?parent=12079"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/categories?post=12079"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/tags?post=12079"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/coauthors?post=12079"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}