{"id":12617,"date":"2025-01-28T12:45:22","date_gmt":"2025-01-28T20:45:22","guid":{"rendered":"https:\/\/live-cometml.pantheonsite.io\/?p=12617"},"modified":"2026-02-02T19:14:34","modified_gmt":"2026-02-02T19:14:34","slug":"g-eval-for-llm-evaluation","status":"publish","type":"post","link":"https:\/\/www.comet.com\/site\/blog\/g-eval-for-llm-evaluation\/","title":{"rendered":"G-Eval for LLM Evaluation"},"content":{"rendered":"\n<div class=\"wp-block-buttons is-layout-flex wp-block-buttons-is-layout-flex\">\n<div class=\"wp-block-button\"><a class=\"wp-block-button__link wp-element-button\" href=\"https:\/\/colab.research.google.com\/drive\/1Y4LfLimBxx4KscjF0kRXiWmUR7tgXDfj\" target=\"_blank\" rel=\"noreferrer noopener\"><span class=\"s1\">Follow along with the Colab!<\/span><\/a><\/div>\n<\/div>\n\n\n\n<p class=\"p1\"><span class=\"s1\"><a href=\"https:\/\/www.comet.com\/site\/blog\/llm-as-a-judge\/\">LLM-as-a-judge<\/a> evaluators have gained widespread adoption due to their flexibility, scalability, and close alignment with human judgment. They excel at tasks that are difficult to quantify and evaluate with traditional heuristic metrics like <a href=\"https:\/\/www.comet.com\/site\/blog\/llm-hallucination\/\">hallucination detection<\/a>, creative generation, content moderation, and logical reasoning.<\/span><\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"576\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/01\/g-eval-1024x576.jpg\" alt=\"featured image for a guide to using g-eval for evaluation\" class=\"wp-image-18425\" srcset=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/01\/g-eval-1024x576.jpg 1024w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/01\/g-eval-300x169.jpg 300w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/01\/g-eval-768x432.jpg 768w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/01\/g-eval-1536x864.jpg 1536w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/01\/g-eval-2048x1152.jpg 2048w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p class=\"p1\"><span class=\"s1\">Yet, while they may seem simple on the surface, <a href=\"https:\/\/arxiv.org\/abs\/2411.16594\">implementing them can prove very challenging<\/a>. Evaluating a model across multiple metrics often requires creating separate LLM-as-a-Judge pipelines for each metric and combining their outputs. G-Eval simplifies this process by consolidating <a href=\"https:\/\/www.comet.com\/site\/blog\/llm-evaluation-guide\/\">LLM evaluation<\/a> into a single metric, effectively providing the model with a unified scorecard.<\/span><\/p>\n\n\n\n<p class=\"p1\"><span class=\"s1\">Compared with other LLM-as-a-judge evaluators, G-Eval stands out for its ease-of-use and adaptability.&nbsp;<\/span><\/p>\n\n\n\n<p class=\"p1\"><span class=\"s1\">So, how does G-Eval work?<\/span><\/p>\n\n\n\n<p class=\"p1\"><span class=\"s1\">G-Eval is composed of three main components: the prompt, automatic CoT reasoning, and the scoring function. The user only defines the input prompt, which consists of the \u201cTask Introduction\u201d and \u201cEvaluation Criteria.\u201d These are then fed into an LLM which generates detailed \u201cEvaluation Steps\u201d using <a href=\"https:\/\/www.comet.com\/site\/blog\/chain-of-thought-prompting\/\">Chain-of-Thought prompting<\/a>. 
The LLM uses these automatically generated steps, along with the original user-defined prompt, to evaluate the NLG outputs in a form-filling pattern. Finally, a scoring function is applied.

*G-Eval architecture, as depicted in the original paper, [G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment](https://arxiv.org/abs/2303.16634)*

In this article, you'll learn more about each of these components as we walk through a step-by-step example. Let's dive in!

## Evaluating GPT-4o Summarizations With G-Eval

A major advantage of G-Eval is its versatility. The initial user-defined input prompt can be customized for almost any NLG use case, including text summarization, dialogue generation, and machine translation.

For this tutorial, we'll be using G-Eval to score how well OpenAI's GPT-4o summarizes news articles from [the SummEval dataset, hosted by Hugging Face](https://huggingface.co/datasets/mteb/summeval), the same dataset used in [the original G-Eval paper](https://arxiv.org/pdf/2303.16634).

We'll first take a subset of this dataset to walk through the core components of G-Eval and how to implement them with Opik. Once we've gotten a grasp of the basics, we'll use the full SummEval dataset to benchmark our customized G-Eval evaluation tasks and criteria against [human-in-the-loop](https://www.comet.com/site/blog/human-in-the-loop/) annotated scores and calculate the Spearman coefficient.

## 1. Prompt for NLG Evaluation

G-Eval supports a wide range of NLG tasks, making it crucial to clearly define the evaluation objective for the specific task at hand. The initial prompt is the only user-defined input to G-Eval in most out-of-the-box implementations of the metric, including Opik's.
This prompt is a natural language instruction that defines:

- the **evaluation task** (`task_introduction`)
- the **evaluation criteria** (`evaluation_criteria`)

For general-purpose scenarios, we might provide broad instructions, such as:

```python
TASK_INTRODUCTION = "You are an expert judge tasked with evaluating the faithfulness of an AI-generated answer to the given context."
```

Alternatively, for specialized use cases, we can craft more tailored instructions, ensuring the model understands the task clearly and doesn't stray off topic. For example:

```python
TASK_INTRODUCTION = """
You will be given one summary written for a news article.

Your task is to rate the summary on one metric.

Please make sure you read and understand these instructions carefully. Please keep this document open while reviewing, and refer to it as needed.
"""
```

The initial prompt should also include evaluation criteria for the model to use. This may be as simple as ensuring that the model does not hallucinate or introduce new information:

```python
EVALUATION_CRITERIA = "In provided text the OUTPUT must not introduce new information beyond what's provided in the CONTEXT."
```

Or, we may want the model to focus on more specific aspects of the output, such as coherence, conciseness, relevance, fluency, or grammar. We may also define the scoring system we wish the model to use. For example:

```python
COHERENCE_EVALUATION_CRITERIA = """
Coherence (1-5) - the collective quality of all sentences.

We align this dimension with the DUC quality question of structure and coherence whereby "the summary should be well-structured and well-organized. The summary should not just be a heap of related information, but should build from sentence to sentence to a coherent body of information about a topic."
"""
```

As the only user-defined inputs to G-Eval, these two variables are where you'll spend most of your optimization time. This will most likely be an iterative process, as we'll explore later in this article.

## 2. Auto Chain-of-Thought for NLG Evaluation

[Chain-of-Thought (CoT) is a prompting technique](https://arxiv.org/pdf/2201.11903) in which reasoning steps are explicitly broken down in sequence to improve an AI's problem-solving and decision-making. It can significantly improve LLM outputs, but writing those steps by hand is costly and tedious.

Fortunately, [most modern large language models are capable of automating this process by generating the reasoning steps on their own](https://arxiv.org/abs/2210.03493) when prompted to do so.
Auto CoT not only gives G-Eval a more robust reasoning process (leading to higher-quality outputs), but also makes G-Eval more scalable than manual approaches.

This step is especially useful for complex evaluation tasks that require more than one step, or for evaluations with multiple dependencies. A set of auto-generated CoT reasoning steps might look something like this:

```
Step 1: Identify the key information in the question.
Step 2: Verify if the AI-generated response directly answers the question.
Step 3: Confirm the correctness of the answer; Paris is indeed the capital of France.
Step 4: Verify that no new information has been introduced in the AI-generated response.
```

Note that these steps are generated automatically by the model and are not input by the user.

![full g-eval prompt](https://www.comet.com/site/wp-content/uploads/2025/01/full-g-eval-prompt.png)
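Under the hood, this step amounts to a single LLM call that expands the user-defined prompt into evaluation steps. Here's a minimal sketch of what that might look like, assuming the OpenAI Python client; the prompt wording and function name are illustrative, not what Opik uses internally:

```python
import openai

client = openai.OpenAI()

def generate_evaluation_steps(task_introduction: str, evaluation_criteria: str) -> str:
    """Ask the evaluating LLM to expand the user-defined prompt into detailed evaluation steps."""
    prompt = (
        f"{task_introduction}\n\n"
        f"Evaluation Criteria:\n{evaluation_criteria}\n\n"
        "Based on the task introduction and evaluation criteria above, "
        "write a numbered list of detailed evaluation steps."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```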
## 3. Scoring Function

The user-defined task introduction and evaluation criteria, along with the auto-generated CoT reasoning steps, are concatenated with the original context and target text, and the result is passed to the scoring function. The scoring function calls the evaluating LLM, which is prompted to output a score using a form-filling paradigm. The resulting input might look like [the following example from the original G-Eval paper](https://github.com/nlpyang/geval/blob/main/prompts/summeval/rel_detailed.txt):

```
You will be given one summary written for a news article.

Your task is to rate the summary on one metric.

Please make sure you read and understand these instructions carefully. Please keep this document open while reviewing, and refer to it as needed.

Evaluation Criteria:

Relevance (1-5) - selection of important content from the source. The summary should include only important information from the source document. Annotators were instructed to penalize summaries which contained redundancies and excess information.

Evaluation Steps:

1. Read the summary and the source document carefully.
2. Compare the summary to the source document and identify the main points of the article.
3. Assess how well the summary covers the main points of the article, and how much irrelevant or redundant information it contains.
4. Assign a relevance score from 1 to 5.


Example:


Source Text:

{{Document}}

Summary:

{{Summary}}


Evaluation Form (scores ONLY):

- Relevance:
```

The scoring function then normalizes the scores using the probabilities of the output tokens and takes their weighted sum. This is done to:

1. Mitigate the low variance in LLM output scores that occurs for some evaluation tasks.
2. Obtain more fine-grained, continuous scores, as LLMs will often output only integer scores, even when prompted to do otherwise.

*Using the probabilities of the output tokens from the LLM to normalize the scores and take their weighted sum gives us more fine-grained, continuous scores.*

This is done automatically in Opik, but you can check out [the source code here](https://github.com/comet-ml/opik/blob/77139a0c541504d2eaf2ab06b43c9f93f297c8dc/sdks/python/src/opik/evaluation/metrics/llm_judges/g_eval/metric.py#L148) if you'd like to see how it's implemented.
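To make the weighted-sum idea concrete, here's a rough sketch of how such a scoring function could be written against the OpenAI API's `logprobs` option. This is illustrative only; Opik's GEval handles all of this for you, and its implementation differs in the details:

```python
# Illustrative sketch of probability-weighted scoring, not Opik's implementation.
import math

import openai

client = openai.OpenAI()

def weighted_g_eval_score(evaluator_prompt: str, min_score: int = 1, max_score: int = 5) -> float:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": evaluator_prompt}],
        max_tokens=1,      # we only want the score token itself
        logprobs=True,
        top_logprobs=20,   # log-probabilities of the top 20 candidate tokens
    )
    top = response.choices[0].logprobs.content[0].top_logprobs
    # Collect the probability mass assigned to each valid integer score.
    probs: dict[int, float] = {}
    for candidate in top:
        token = candidate.token.strip()
        if token.isdigit() and min_score <= int(token) <= max_score:
            probs[int(token)] = probs.get(int(token), 0.0) + math.exp(candidate.logprob)
    if not probs:
        raise ValueError("No valid score tokens among the model's top candidates")
    # Weighted sum of candidate scores, normalized by the captured probability mass.
    total = sum(probs.values())
    return sum(score * p for score, p in probs.items()) / total
```

Intuitively, a model that puts 60% of its probability mass on "4" and 40% on "5" yields a score of 4.4 rather than a flat 4.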
## G-Eval With Opik

Now that you're familiar with the core components of G-Eval, let's implement it in Opik! If you aren't already, feel free to follow along with [the full-code Colab tutorial here](https://colab.research.google.com/drive/1Y4LfLimBxx4KscjF0kRXiWmUR7tgXDfj). For more information, check out [the docs here](https://www.comet.com/docs/opik/).

In our first example, we'll take a subset of the SummEval dataset and use OpenAI's GPT-4o to generate summaries of the articles. We'll then use G-Eval to evaluate the quality of those summaries, starting off with a simple, generic set of evaluation criteria, and then venturing into some more specific criteria and comparing the results.

We start off by defining our model, system prompt, and the application function we'll call to generate the summaries:

```python
import openai

openai_client = openai.OpenAI()

MODEL = "gpt-4o"

SYSTEM_PROMPT = "Generate a concise summarization of the article you are provided with by the user."

# Define the LLM application
def generate_summary(input: str) -> dict:
    response = openai_client.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": input},
        ],
    )
    return {"summary": response.choices[0].message.content}
```

We also define [our evaluation task](https://www.comet.com/docs/opik/evaluation/evaluate_your_llm), in which we return a dictionary containing the exact variables needed to calculate the evaluation score.

```python
def evaluation_task(data):
    llm_output = generate_summary(data['text'])
    return {"context": data['text'], "output": llm_output}
```
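We'll also need a `dataset` object to evaluate against. The Colab covers exactly how the subset is built; as a minimal sketch, assuming the Hugging Face `datasets` library and that the `mteb/summeval` split and field names match those used in this post, it might look like:

```python
# A sketch of dataset creation; the exact subsetting lives in the Colab.
from datasets import load_dataset
import opik

# Assumes the "test" split and the "text" field of mteb/summeval.
summeval = load_dataset("mteb/summeval", split="test")

client = opik.Opik()
dataset = client.get_or_create_dataset(name="summeval-subset")
dataset.insert([{"text": row["text"]} for row in summeval.select(range(10))])
```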
We'll then instantiate [Opik's built-in G-Eval metric](https://www.comet.com/docs/opik/evaluation/metrics/g_eval) and specify the `task_introduction` and `evaluation_criteria`. First, we'll create a general G-Eval metric, followed by some that specifically look at aspects like coherence, relevance, fluency, and consistency.

```python
from opik.evaluation.metrics import GEval

TASK_INTRODUCTION = "You are an expert judge tasked with evaluating the faithfulness of an AI-generated answer to the given context."
EVALUATION_CRITERIA = "In provided text the OUTPUT must not introduce new information beyond what's provided in the CONTEXT."

g_eval_general = GEval(
    task_introduction=TASK_INTRODUCTION,
    evaluation_criteria=EVALUATION_CRITERIA,
    name="g_eval_general"
)

SUMMEVAL_TASK_INTRODUCTION = """
You will be given one summary written for a news article.

Your task is to rate the summary on one metric.

Please make sure you read and understand these instructions carefully. Please keep this document open while reviewing, and refer to it as needed.
"""

COHERENCE_EVALUATION_CRITERIA = """
Coherence (1-5) - the collective quality of all sentences.

We align this dimension with the DUC quality question of structure and coherence whereby "the summary should be well-structured and well-organized. The summary should not just be a heap of related information, but should build from sentence to sentence to a coherent body of information about a topic."
"""

g_eval_coherence = GEval(
    task_introduction=SUMMEVAL_TASK_INTRODUCTION,
    evaluation_criteria=COHERENCE_EVALUATION_CRITERIA,
    name="g_eval_coherence"
)

CONSISTENCY_EVALUATION_CRITERIA = """
Consistency (1-5) - the factual alignment between the summary and the summarized source.

A factually consistent summary contains only statements that are entailed by the source document.

Annotators were also asked to penalize summaries that contained hallucinated facts.
"""

g_eval_consistency = GEval(
    task_introduction=SUMMEVAL_TASK_INTRODUCTION,
    evaluation_criteria=CONSISTENCY_EVALUATION_CRITERIA,
    name="g_eval_consistency"
)

FLUENCY_EVALUATION_CRITERIA = """
Fluency (1-3): the quality of the summary in terms of grammar, spelling, punctuation, word choice, and sentence structure.

- 1: Poor. The summary has many errors that make it hard to understand or sound unnatural.
- 2: Fair. The summary has some errors that affect the clarity or smoothness of the text, but the main points are still comprehensible.
- 3: Good. The summary has few or no errors and is easy to read and follow.
"""

g_eval_fluency = GEval(
    task_introduction=SUMMEVAL_TASK_INTRODUCTION,
    evaluation_criteria=FLUENCY_EVALUATION_CRITERIA,
    name="g_eval_fluency"
)

RELEVANCE_EVALUATION_CRITERIA = """
Relevance (1-5) - selection of important content from the source.

The summary should include only important information from the source document.

Annotators were instructed to penalize summaries which contained redundancies and excess information.
"""

g_eval_relevance = GEval(
    task_introduction=SUMMEVAL_TASK_INTRODUCTION,
    evaluation_criteria=RELEVANCE_EVALUATION_CRITERIA,
    name="g_eval_relevance"
)
```
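Before launching a full experiment, it can help to sanity-check a metric on a single hand-written sample. Following the single-string pattern in Opik's G-Eval docs, we pack the OUTPUT and CONTEXT referenced by our criteria into the `output` argument (the sample text below is made up):

```python
# Quick sanity check on one illustrative sample before a full run.
result = g_eval_general.score(
    output="""
    OUTPUT: Paris is the capital of France.
    CONTEXT: France is a country in Western Europe. Its capital is Paris.
    """
)
print(result.value, result.reason)
```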
src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/01\/Screenshot-2025-01-27-at-4.25.24\u202fPM.png\" alt=\"\" class=\"wp-image-12630\" srcset=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/01\/Screenshot-2025-01-27-at-4.25.24\u202fPM.png 3004w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/01\/Screenshot-2025-01-27-at-4.25.24\u202fPM-300x159.png 300w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/01\/Screenshot-2025-01-27-at-4.25.24\u202fPM-1024x541.png 1024w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/01\/Screenshot-2025-01-27-at-4.25.24\u202fPM-768x406.png 768w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/01\/Screenshot-2025-01-27-at-4.25.24\u202fPM-1536x812.png 1536w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/01\/Screenshot-2025-01-27-at-4.25.24\u202fPM-2048x1083.png 2048w\" sizes=\"auto, (max-width: 3004px) 100vw, 3004px\" \/><figcaption class=\"wp-element-caption\">By passing our G-Eval metrics as a list to the <code>scoring_metrics<\/code> parameter of the <code>evaluate<\/code> function, Opik automatically calculates each evaluation metric for each sample and aggregates them across the full dataset.<\/figcaption><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-\"><\/h2>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-evaluating-our-evaluation\">Evaluating Our Evaluation<\/h2>\n\n\n\n<p class=\"p1\"><span class=\"s1\">In this next section we&#8217;ll try to answer the question, &#8220;how good are our evaluations, anyways?&#8221;<\/span><\/p>\n\n\n\n<p class=\"p1\"><span class=\"s1\">Luckily, the SummEval dataset also provides human annotations or \u201cground truth labels\u201d, which we can use to evaluate how closely our G-Eval scores align with human judgment (the gold standard for most LLM tasks).&nbsp;<\/span><\/p>\n\n\n\n<p class=\"p1\"><span class=\"s1\">We\u2019ll be using a slightly different subset of the dataset than in the first example, so we aren\u2019t exactly comparing apples to apples here. Importantly, however, we\u2019ll be using the same user inputs (task introductions and evaluation criterion). This means we\u2019ll be able to tweak our prompts and measure a corresponding quantifiable difference in output quality.<\/span><\/p>\n\n\n\n<p class=\"p1\"><span class=\"s1\">For full details on how we created this second subset of the dataset, check out <a href=\"https:\/\/colab.research.google.com\/drive\/1Y4LfLimBxx4KscjF0kRXiWmUR7tgXDfj#scrollTo=5lgixsFO_IBt\"><span class=\"s2\">the Colab here<\/span><\/a>.&nbsp;<\/span><\/p>\n\n\n\n<p class=\"p1\"><span class=\"s1\">Once we have our dataset, the only thing we need to change from the previous workflow is the evaluation task. Here we return a dictionary including the context and LLM outputs. 
We won't need to define an LLM application step, as the dataset already has LLM-generated summaries.

```python
from opik.evaluation import evaluate

def evaluation_task(data):
    return {"context": data['text'], "output": data["machine_summaries"]}

# Perform the evaluation
evaluation = evaluate(
    experiment_name="My G-Eval Experiment",
    dataset=dataset,
    task=evaluation_task,
    scoring_metrics=[g_eval_coherence,
                     g_eval_consistency,
                     g_eval_fluency,
                     g_eval_relevance],
    project_name="g-eval-demo-correlations"
)
```

The evaluate function calculates and logs each of the G-Eval metrics listed in the `scoring_metrics` parameter. To fetch these scores so that we can calculate the [Spearman correlation coefficient](https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient) on them, we'll use the [`search_traces`](https://www.comet.com/docs/opik/production/production_monitoring#fetching-traces-using-the-search-api) method of the Opik `client`.

```python
import opik

client = opik.Opik()

traces = client.search_traces(project_name="g-eval-demo-correlations", max_results=1000000)  # some number greater than total results
```

Once we have our trace and experiment information locally, we'll isolate the feedback scores:

```python
# Gather relevance scores
g_eval_relevance_list = [
    score.value
    for trace in traces
    for score in trace.feedback_scores
    if score.name == "g_eval_relevance"
]
```

And finally, we can calculate the correlation between the human-annotated scores and those generated from our G-Eval metrics.
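The Colab walks through assembling the matching human annotations; as a minimal sketch, assuming an aligned list of human-annotated relevance scores called `human_relevance_list` (a hypothetical name), the calculation itself is one call to `scipy`:

```python
# Spearman rank correlation between human and G-Eval relevance scores.
from scipy.stats import spearmanr

correlation, p_value = spearmanr(human_relevance_list, g_eval_relevance_list)
print(f"Spearman correlation: {correlation:.3f} (p={p_value:.3g})")
```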
How close were you able to get to the results reported in [the original G-Eval paper](https://arxiv.org/pdf/2303.16634)?

*From [G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment](https://arxiv.org/pdf/2303.16634)*

## G-Eval Performance and Concerns

But why use an LLM as an evaluator at all? After all, LLMs are near-complete black boxes and require far more computational overhead than heuristic or even learned [LLM evaluation metrics](https://www.comet.com/site/blog/llm-evaluation-metrics-every-developer-should-know/).

Firstly, using an LLM as an evaluator bypasses the need for a reference or "ground truth" text. References are particularly problematic for natural language tasks, which often have multiple valid outputs due to paraphrasing, synonyms, contextual nuance, and semantic ambiguity; forcing outputs to match a single reference can lead to biased evaluations or overfitting. Ground truth labels are not only less flexible, they're also extremely expensive to acquire, which can limit the scalability of an application.

Prior to G-Eval, other reference-free LLM-based evaluators like GPTScore were used especially for tasks that require creativity, diversity, or complex contextual understanding. These evaluators showed marked improvement over heuristic metrics, but they generally had much lower human correspondence than even medium-sized neural network evaluators.

G-Eval improved upon the performance of previous LLM-based evaluators like GPTScore and more closely aligned with human judgment than most existing metrics. It is scalable, consistent, and highly adaptable to various NLG tasks.

As with all LLM-based metrics, however, it is important to remember that G-Eval inherits biases and limitations from the underlying LLM, which may impact fairness or accuracy. It is also more computationally expensive than most heuristic metrics and far less interpretable.

---

**If you found this article useful, follow me on [LinkedIn](https://www.linkedin.com/in/anmorgan24/) and [Twitter](https://x.com/anmorgan2414) for more content!**
Yet, while they may seem simple on the surface, implementing them can prove [&hellip;]<\/p>\n","protected":false},"author":22,"featured_media":18425,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"customer_name":"","customer_description":"","customer_industry":"","customer_technologies":"","customer_logo":"","footnotes":""},"categories":[8,65,6,9,7],"tags":[14,30,15,52,95,31,94],"coauthors":[133],"class_list":["post-12617","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-comet-community-hub","category-llmops","category-machine-learning","category-product","category-tutorials","tag-comet-ml","tag-deep-learning","tag-deep-learning-experiment-management","tag-llm","tag-llm-evaluation","tag-llmops","tag-opik"],"yoast_head":"<!-- This site is optimized with the Yoast SEO Premium plugin v25.9 (Yoast SEO v25.9) - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>G-Eval for LLM Evaluation<\/title>\n<meta name=\"description\" content=\"G-Eval is composed of three main components: the prompt, automatic CoT reasoning, and the scoring function.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.comet.com\/site\/blog\/g-eval-for-llm-evaluation\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"G-Eval for LLM Evaluation\" \/>\n<meta property=\"og:description\" content=\"G-Eval is composed of three main components: the prompt, automatic CoT reasoning, and the scoring function.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.comet.com\/site\/blog\/g-eval-for-llm-evaluation\/\" \/>\n<meta property=\"og:site_name\" content=\"Comet\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/cometdotml\" \/>\n<meta property=\"article:published_time\" content=\"2025-01-28T20:45:22+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2026-02-02T19:14:34+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/01\/g-eval-scaled.jpg\" \/>\n\t<meta property=\"og:image:width\" content=\"2560\" \/>\n\t<meta property=\"og:image:height\" content=\"1440\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/jpeg\" \/>\n<meta name=\"author\" content=\"Abby Morgan\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@anmorgan2414\" \/>\n<meta name=\"twitter:site\" content=\"@Cometml\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Abby Morgan\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"9 minutes\" \/>\n<!-- \/ Yoast SEO Premium plugin. 
-->","yoast_head_json":{"title":"G-Eval for LLM Evaluation","description":"G-Eval is composed of three main components: the prompt, automatic CoT reasoning, and the scoring function.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.comet.com\/site\/blog\/g-eval-for-llm-evaluation\/","og_locale":"en_US","og_type":"article","og_title":"G-Eval for LLM Evaluation","og_description":"G-Eval is composed of three main components: the prompt, automatic CoT reasoning, and the scoring function.","og_url":"https:\/\/www.comet.com\/site\/blog\/g-eval-for-llm-evaluation\/","og_site_name":"Comet","article_publisher":"https:\/\/www.facebook.com\/cometdotml","article_published_time":"2025-01-28T20:45:22+00:00","article_modified_time":"2026-02-02T19:14:34+00:00","og_image":[{"width":2560,"height":1440,"url":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/01\/g-eval-scaled.jpg","type":"image\/jpeg"}],"author":"Abby Morgan","twitter_card":"summary_large_image","twitter_creator":"@anmorgan2414","twitter_site":"@Cometml","twitter_misc":{"Written by":"Abby Morgan","Est. reading time":"9 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/www.comet.com\/site\/blog\/g-eval-for-llm-evaluation\/#article","isPartOf":{"@id":"https:\/\/www.comet.com\/site\/blog\/g-eval-for-llm-evaluation\/"},"author":{"name":"Abby Morgan","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/826ee39a2e30cf9d8d73155de09bb7b2"},"headline":"G-Eval for LLM Evaluation","datePublished":"2025-01-28T20:45:22+00:00","dateModified":"2026-02-02T19:14:34+00:00","mainEntityOfPage":{"@id":"https:\/\/www.comet.com\/site\/blog\/g-eval-for-llm-evaluation\/"},"wordCount":1691,"publisher":{"@id":"https:\/\/www.comet.com\/site\/#organization"},"image":{"@id":"https:\/\/www.comet.com\/site\/blog\/g-eval-for-llm-evaluation\/#primaryimage"},"thumbnailUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/01\/g-eval-scaled.jpg","keywords":["Comet ML","Deep Learning","Deep Learning Experiment Management","LLM","LLM Evaluation","LLMOps","Opik"],"articleSection":["Comet Community Hub","LLMOps","Machine Learning","Product","Tutorials"],"inLanguage":"en-US"},{"@type":"WebPage","@id":"https:\/\/www.comet.com\/site\/blog\/g-eval-for-llm-evaluation\/","url":"https:\/\/www.comet.com\/site\/blog\/g-eval-for-llm-evaluation\/","name":"G-Eval for LLM Evaluation","isPartOf":{"@id":"https:\/\/www.comet.com\/site\/#website"},"primaryImageOfPage":{"@id":"https:\/\/www.comet.com\/site\/blog\/g-eval-for-llm-evaluation\/#primaryimage"},"image":{"@id":"https:\/\/www.comet.com\/site\/blog\/g-eval-for-llm-evaluation\/#primaryimage"},"thumbnailUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/01\/g-eval-scaled.jpg","datePublished":"2025-01-28T20:45:22+00:00","dateModified":"2026-02-02T19:14:34+00:00","description":"G-Eval is composed of three main components: the prompt, automatic CoT reasoning, and the scoring 
function.","breadcrumb":{"@id":"https:\/\/www.comet.com\/site\/blog\/g-eval-for-llm-evaluation\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.comet.com\/site\/blog\/g-eval-for-llm-evaluation\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/blog\/g-eval-for-llm-evaluation\/#primaryimage","url":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/01\/g-eval-scaled.jpg","contentUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/01\/g-eval-scaled.jpg","width":2560,"height":1440,"caption":"featured image for a guide to using g-eval for evaluation"},{"@type":"BreadcrumbList","@id":"https:\/\/www.comet.com\/site\/blog\/g-eval-for-llm-evaluation\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.comet.com\/site\/"},{"@type":"ListItem","position":2,"name":"G-Eval for LLM Evaluation"}]},{"@type":"WebSite","@id":"https:\/\/www.comet.com\/site\/#website","url":"https:\/\/www.comet.com\/site\/","name":"Comet","description":"Build Better Models Faster","publisher":{"@id":"https:\/\/www.comet.com\/site\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.comet.com\/site\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/www.comet.com\/site\/#organization","name":"Comet ML, Inc.","alternateName":"Comet","url":"https:\/\/www.comet.com\/site\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/#\/schema\/logo\/image\/","url":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/01\/logo_comet_square.png","contentUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/01\/logo_comet_square.png","width":310,"height":310,"caption":"Comet ML, Inc."},"image":{"@id":"https:\/\/www.comet.com\/site\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/cometdotml","https:\/\/x.com\/Cometml","https:\/\/www.youtube.com\/channel\/UCmN63HKvfXSCS-UwVwmK8Hw"]},{"@type":"Person","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/826ee39a2e30cf9d8d73155de09bb7b2","name":"Abby Morgan","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/image\/dbbf1ae921ee179c768f508340415946","url":"https:\/\/secure.gravatar.com\/avatar\/28d4934d14261b4afe12e226f0eaa57c4fb0c2761ad4586eb9a5bec3b8160bc9?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/28d4934d14261b4afe12e226f0eaa57c4fb0c2761ad4586eb9a5bec3b8160bc9?s=96&d=mm&r=g","caption":"Abby Morgan"},"description":"AI\/ML Growth Engineer @ 
Comet","sameAs":["https:\/\/www.comet.com\/","https:\/\/www.linkedin.com\/in\/anmorgan24\/","https:\/\/x.com\/anmorgan2414"],"url":"https:\/\/www.comet.com\/site\/blog\/author\/abigailmcomet-com\/"}]}},"_links":{"self":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/12617","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/users\/22"}],"replies":[{"embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/comments?post=12617"}],"version-history":[{"count":3,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/12617\/revisions"}],"predecessor-version":[{"id":19074,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/12617\/revisions\/19074"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/media\/18425"}],"wp:attachment":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/media?parent=12617"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/categories?post=12617"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/tags?post=12617"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/coauthors?post=12617"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}