{"id":13134,"date":"2025-03-26T17:48:33","date_gmt":"2025-03-27T01:48:33","guid":{"rendered":"https:\/\/live-cometml.pantheonsite.io\/?p=13134"},"modified":"2025-11-14T17:01:47","modified_gmt":"2025-11-14T17:01:47","slug":"selfcheckgpt-for-llm-evaluation","status":"publish","type":"post","link":"https:\/\/www.comet.com\/site\/blog\/selfcheckgpt-for-llm-evaluation\/","title":{"rendered":"SelfCheckGPT for LLM Evaluation"},"content":{"rendered":"\n<div class=\"wp-block-buttons is-style-flat is-layout-flex wp-block-buttons-is-layout-flex\">\n<div class=\"wp-block-button\"><a class=\"wp-block-button__link wp-element-button\" href=\"https:\/\/colab.research.google.com\/drive\/1E5yEq-d2pF9BQVkl0sE3XKBs1jksYNIa\" target=\"_blank\" rel=\"noreferrer noopener\">Follow along with the Colab!<\/a><\/div>\n<\/div>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"576\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/03\/SelfCheckGPT-1024x576.jpg\" alt=\"futuristic visualization for selfcheck gpt in use for llm evaluation\" class=\"wp-image-18420\" srcset=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/03\/SelfCheckGPT-1024x576.jpg 1024w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/03\/SelfCheckGPT-300x169.jpg 300w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/03\/SelfCheckGPT-768x432.jpg 768w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/03\/SelfCheckGPT-1536x864.jpg 1536w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/03\/SelfCheckGPT-2048x1152.jpg 2048w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Detecting hallucinations in language models is challenging. 
There are three general approaches:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Measuring token-level probability distributions<\/strong>&nbsp;for indications that a model is \u201cconfused.\u201d Though sometimes effective, these methods rely on model internals being accessible, which is often not the case when working with hosted LLMs.<\/li>\n\n\n\n<li><strong>Referencing external fact-verification systems<\/strong>, like a database or document store. These methods are great for RAG-style use cases, but they are only effective if you have a useful dataset and the infrastructure to use it.<\/li>\n\n\n\n<li><strong>Using LLM-as-a-judge techniques<\/strong>&nbsp;to assess whether or not a model hallucinated. These techniques are becoming standard in the LLM ecosystem, but as I\u2019ll explain throughout this piece, using them effectively requires a deceptively large amount of work.<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">The problem with many <a href=\"https:\/\/www.comet.com\/site\/blog\/llm-as-a-judge\/\">LLM-as-a-Judge<\/a> techniques is that they tend toward one of two extremes: they are either too simple, using a basic zero-shot approach, or they are wildly complex, involving multiple LLMs interacting via multi-turn reasoning.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">SelfCheckGPT offers a&nbsp;<strong>reference-free, zero-resource<\/strong>&nbsp;alternative: a&nbsp;<strong>sampling-based approach<\/strong>&nbsp;that fact-checks responses without external resources or intrinsic uncertainty metrics. The key idea is&nbsp;<strong>consistency<\/strong>: if an LLM truly \u201cknows\u201d a fact, multiple randomly sampled responses should align. 
However, if a claim is hallucinated, responses will vary and contradict each other.<\/p>\n\n\n\n<figure class=\"wp-block-image\" id=\"attachment_13154\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"834\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/03\/Screenshot-2025-03-07-at-10.51.17\u202fAM-1024x834.png\" alt=\"\" class=\"wp-image-13154\" srcset=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/03\/Screenshot-2025-03-07-at-10.51.17\u202fAM-1024x834.png 1024w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/03\/Screenshot-2025-03-07-at-10.51.17\u202fAM-300x244.png 300w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/03\/Screenshot-2025-03-07-at-10.51.17\u202fAM-768x625.png 768w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/03\/Screenshot-2025-03-07-at-10.51.17\u202fAM.png 1282w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption class=\"wp-element-caption\">SelfCheckGPT with LLM Prompt; each LLM-generated sentence is compared against stochastically generated responses with no external database. A comparison method can be, for example, through LLM prompting as shown above. 
From&nbsp;<a href=\"https:\/\/arxiv.org\/abs\/2303.08896\">SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models<\/a><\/figcaption><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-\"><\/h2>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-detecting-hallucinations-via-consistency\">Detecting Hallucinations Via Consistency<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">There are several varieties of SelfCheckGPT, but all share a general basic structure:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>First, a user asks a question, and the AI gives an answer.<\/li>\n\n\n\n<li>SelfCheckGPT then asks the same AI the same question multiple times and collects several new responses.<\/li>\n\n\n\n<li>It compares the original answer to these new responses.<\/li>\n\n\n\n<li>If the answers are&nbsp;<strong>consistent<\/strong>, the original response is likely&nbsp;<strong>accurate<\/strong>.<\/li>\n\n\n\n<li>If the answers&nbsp;<strong>contradict each other<\/strong>, the original response is likely a&nbsp;<strong>hallucination<\/strong>.<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">To quantify this, SelfCheckGPT assigns each sentence a hallucination score between 0 and 1:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>0.0<\/strong>&nbsp;means the sentence is based on reliable information.<\/li>\n\n\n\n<li><strong>1.0<\/strong>&nbsp;means the sentence is likely a hallucination.<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">The major advantage to this approach is that it provides a practical way to assess factual reliability without external dependencies or internal model access. 
It uses a consensus mechanism that is reminiscent of ensemble techniques, like&nbsp;<a href=\"https:\/\/www.comet.com\/site\/blog\/llm-juries-for-evaluation\/\">LLM juries<\/a>, but it only requires the use of a single model.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">SelfCheckGPT also benefits from being an extremely flexible framework. We\u2019ll explore this in the next section, but you can easily imagine many different approaches to assessing \u201cagreement\u201d between answers.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-an-overview-of-selfcheckgpt-approaches\">An Overview of SelfCheckGPT Approaches<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">The different types of SelfCheckGPT are variations of the same general framework for <a href=\"https:\/\/www.comet.com\/site\/blog\/llm-hallucination\/\">hallucination detection<\/a> in LLMs, but each uses a unique method to measure response consistency. The original SelfCheckGPT paper described the following five methods:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li><strong>SelfCheckGPT with BERTScore:<\/strong>&nbsp;This variant evaluates the factuality of a sentence by comparing it to similar sentences from multiple sampled responses using the&nbsp;<a href=\"https:\/\/www.comet.com\/site\/blog\/bertscore-for-llm-evaluation\/\">BERTScore<\/a> metric. For each sentence, the maximum BERTScore between it and the most similar sentence in each sample is averaged. If the sentence is consistent across many samples, it\u2019s considered factual; if not, it may be a hallucination. This approach uses RoBERTa-Large to compute the BERTScore.<\/li>\n\n\n\n<li><strong>SelfCheckGPT with Question Answering (QA):<\/strong>&nbsp;This approach generates multiple-choice questions using MQAG based on the main response and then compares the answers from different samples. 
If the answers are consistent, the sentence is considered valid; if they diverge, it indicates a potential hallucination.<\/li>\n\n\n\n<li><strong>SelfCheckGPT with an N-gram Model:<\/strong>&nbsp;Here, a simple n-gram model is trained using multiple samples to approximate the LLM\u2019s token probabilities. The likelihood of each sentence is then computed to detect inconsistencies. A higher log-probability suggests consistency, while a lower one may indicate hallucination.<\/li>\n\n\n\n<li><strong>SelfCheckGPT with Natural Language Inference (NLI):<\/strong>&nbsp;This method uses an NLI model (DeBERTa-v3-large) to assess whether a sampled response contradicts the sentence being evaluated. The NLI model classifies the relationship as entailment, neutral, or contradiction, and only the contradiction score is used for the evaluation. The higher the contradiction score, the more likely the sentence is inconsistent or a hallucination.<\/li>\n\n\n\n<li><strong>SelfCheckGPT with Prompting:<\/strong>&nbsp;In this variant, an LLM is prompted to assess whether a sentence is supported by a sample response. Based on the answer (Yes or No), an inconsistency score is computed. This approach is effective with models like GPT-3.<\/li>\n<\/ol>\n\n\n\n<p class=\"wp-block-paragraph\">In this tutorial, we\u2019ll be focusing on the three most popular variants: SelfCheckGPT with BERTScore, SelfCheckGPT with MQAG, and SelfCheckGPT with LLM Prompting.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-selfcheckgpt-with-bertscore\">SelfCheckGPT with BERTScore<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">SelfCheckGPT with&nbsp;<a href=\"https:\/\/www.comet.com\/site\/blog\/bertscore-for-llm-evaluation\/\">BERTScore<\/a>&nbsp;is the most popular variant due to its speed, low memory usage, and computational efficiency. To better understand how SelfCheckGPT works under the hood, we\u2019ll first implement this variant from scratch. 
Once we\u2019ve gained that insight, we\u2019ll use the&nbsp;<a href=\"https:\/\/github.com\/potsawee\/selfcheckgpt\"><code>selfcheckgpt<\/code><\/a>&nbsp;module\u2019s built-in implementation of three variants to automatically evaluate a dataset using Opik, Comet&#8217;s open-source <a href=\"https:\/\/www.comet.com\/site\/blog\/llm-evaluation-frameworks\/\">LLM evaluation framework<\/a>.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><a href=\"https:\/\/colab.research.google.com\/drive\/1E5yEq-d2pF9BQVkl0sE3XKBs1jksYNIa\">Follow along with the full code in the Colab<\/a>&nbsp;if you aren\u2019t already!<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">In this section, we\u2019ll assume that we already have the original generated output (which we are evaluating) as well as the stochastically generated additional outputs (to which we are comparing our original output). We\u2019ll cover how to incorporate this step into your workflow a little later on. For now, let\u2019s just focus on what the metric is doing under the hood.<\/p>\n\n\n\n<figure class=\"wp-block-image is-resized\" id=\"attachment_13199\"><img loading=\"lazy\" decoding=\"async\" width=\"2292\" height=\"1288\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/03\/selfcheckgpt-bertscore-1.png\" alt=\"A step-by-step diagram of SelfCheckGPT with BERTScore\" class=\"wp-image-13199\" style=\"aspect-ratio:1;width:865px;height:auto\" srcset=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/03\/selfcheckgpt-bertscore-1.png 2292w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/03\/selfcheckgpt-bertscore-1-300x169.png 300w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/03\/selfcheckgpt-bertscore-1-1024x575.png 1024w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/03\/selfcheckgpt-bertscore-1-768x432.png 768w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/03\/selfcheckgpt-bertscore-1-1536x863.png 1536w, 
https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/03\/selfcheckgpt-bertscore-1-2048x1151.png 2048w\" sizes=\"auto, (max-width: 2292px) 100vw, 2292px\" \/><figcaption class=\"wp-element-caption\">SelfCheckGPT with BERTScore, step-by-step<\/figcaption><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Given our original generated output,&nbsp;<code>passage<\/code>, and a list of 3 additional generated passages,&nbsp;<code>samples<\/code>, we\u2019ll need to:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Break down the original passage into individual sentences with&nbsp;<a href=\"https:\/\/spacy.io\/\">SpaCy<\/a><\/li>\n\n\n\n<li>Prepare an empty array to serve as our comparison matrix between the sentences in the original passage and each additional sample. As such, the dimensions of this array should be&nbsp;<code>num_sentences<\/code>&nbsp;x&nbsp;<code>num_samples<\/code>.&nbsp;<\/li>\n\n\n\n<li>Define two functions:\n<ul class=\"wp-block-list\">\n<li>Creates a new list where each individual sentence from the original passage is repeated the number of times equal to the number of sample sentences. For example: [\u201cCat\u201d, \u201cDog\u201d] would become [\u201cCat\u201d, \u201dCat\u201d, \u201dCat\u201d, \u201cDog\u201d, \u201dDog\u201d, \u201dDog\u201d]<\/li>\n\n\n\n<li>Creates a list where the entire list of sample sentences is repeated the number of times equal to the number of sentences in the original passage. 
For example: [\u201cCat\u201d, \u201cDog\u201d] would become [\u201cCat\u201d, \u201dDog\u201d, \u201dCat\u201d, \u201dDog\u201d, \u201dCat\u201d, \u201dDog\u201d]<\/li>\n<\/ul>\n<\/li>\n\n\n\n<li>For each additional sample:\n<ul class=\"wp-block-list\">\n<li>Break it down into sentences<\/li>\n\n\n\n<li>Compare it to every sentence in the original passage with the two functions defined above<\/li>\n\n\n\n<li>Calculate the&nbsp;<a href=\"https:\/\/www.comet.com\/site\/blog\/bertscore-for-llm-evaluation\/\">BERTScore<\/a>&nbsp;between these paired sentences (BERT embeddings via RoBERTa-Large \u2192 cosine similarity)<\/li>\n\n\n\n<li>Reshape the scores into a matrix where each row represents an original sentence and each column represents a sample sentence.<\/li>\n\n\n\n<li>For each original sentence, find the highest similarity score across all sample sentences \u2013 essentially finding the \u201cbest match\u201d for each original sentence within this sample passage.<\/li>\n\n\n\n<li>Store the best-match scores in a column of the results matrix, with each row representing an original sentence<\/li>\n<\/ul>\n<\/li>\n\n\n\n<li>After processing all sample passages, the code averages the scores across all samples to get a final similarity score for each original sentence, which is then inverted (1 \u2013 score) to represent dissimilarity.<\/li>\n\n\n\n<li>This process returns a list of scores for each sentence in the original passage, as per the original paper. 
When we wrap this in an Opik metric later on, we\u2019ll report only the max of these scores, which reflects the sentence in the passage that is most likely to be a hallucination.<\/li>\n<\/ol>\n\n\n\n<pre class=\"wp-block-code\"><code>from typing import List\n\nimport bert_score\nimport numpy as np\nimport spacy\n\ndef evaluate_sentences_with_bertscore(\n    original_passage: str,\n    sampled_passages: List&#91;str],\n    nlp=None,\n    language: str = \"en\",\n    rescale_with_baseline: bool = True\n) -&gt; np.ndarray:\n    \"\"\"\n    Evaluate sentences against sampled passages using BERTScore.\n\n    This function computes the semantic similarity between each input sentence and\n    sentences from multiple sampled passages using BERTScore. For each input sentence,\n    it finds the best matching sentence within each sample and averages these scores.\n\n    Args:\n        original_passage: Original passage to be evaluated\n        sampled_passages: List of sampled passages to compare against\n        nlp: spaCy NLP model for sentence tokenization (if None, loads en_core_web_sm)\n        language: Language code passed to BERTScore\n        rescale_with_baseline: Whether to rescale BERTScore with baseline\n\n    Returns:\n        np.ndarray: Array of dissimilarity scores (1 - BERTScore) for each input sentence\n    \"\"\"\n    # Initialize spaCy if not provided\n    if nlp is None:\n        nlp = spacy.load(\"en_core_web_sm\")\n\n    # Split the passage into sentences and keep only those longer than 3 tokens\n    sentences = &#91;sent for sent in nlp(original_passage).sents] # List&#91;spacy.tokens.span.Span]\n    sentences = &#91;sent.text.strip() for sent in sentences if len(sent) &gt; 3]\n\n    # Prepare dimensions\n    num_sentences = len(sentences)\n    num_samples = len(sampled_passages)\n    bertscore_matrix = np.zeros((num_sentences, num_samples))\n\n    # Helper functions for list expansion\n    def expand_list_per_element(source_list: List&#91;str], repeat_count: int) -&gt; List&#91;str]:\n        \"\"\"Repeat each element in source_list repeat_count times.\"\"\"\n        return &#91;item for item in 
source_list for _ in range(repeat_count)]\n\n    def expand_list_whole(source_list: List&#91;str], repeat_count: int) -&gt; List&#91;str]:\n        \"\"\"Repeat the entire source_list repeat_count times.\"\"\"\n        return &#91;item for _ in range(repeat_count) for item in source_list]\n\n    # Process each sample passage\n    for sample_idx, sample_passage in enumerate(sampled_passages):\n        # Split passage into sentences using spaCy\n        sample_sentences = &#91;sent.text.strip() for sent in nlp(sample_passage).sents]\n        num_sample_sentences = len(sample_sentences)\n\n        # Prepare comparison pairs\n        references = expand_list_per_element(sentences, num_sample_sentences)\n        candidates = expand_list_whole(sample_sentences, num_sentences)\n\n        # Calculate BERTScore (precision, recall, F1)\n        _, _, f1_scores = bert_score.score(\n            candidates,\n            references,\n            lang=language,\n            verbose=False,\n            rescale_with_baseline=rescale_with_baseline\n        )\n\n        # Reshape and extract maximum scores\n        f1_matrix = f1_scores.reshape(num_sentences, num_sample_sentences)\n        max_f1_scores = f1_matrix.max(axis=1).values.cpu().numpy()\n\n        # Store scores for this sample\n        bertscore_matrix&#91;:, sample_idx] = max_f1_scores\n\n    # Calculate mean BERTScore across all samples\n    mean_bertscore_per_sentence = bertscore_matrix.mean(axis=1)\n\n    # Return dissimilarity score (1 - similarity)\n    return 1.0 - mean_bertscore_per_sentence\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">We can then test our function on a few sampled outputs. 
We can call this function on a factual passage and a hallucinated passage and compare their scores on the same samples to see the difference.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>selfcheckgpt_bertscore_factual = evaluate_sentences_with_bertscore(original_passage=passage,\n                                  sampled_passages=&#91;sample_1, sample_2, sample_3])\n\nselfcheckgpt_bertscore_hallucinated = evaluate_sentences_with_bertscore(original_passage=hallucination_passage,\n                                  sampled_passages=&#91;sample_1, sample_2, sample_3])\n<\/code><\/pre>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-selfcheckgpt-bertscore-with-opik\">SelfCheckGPT-BERTScore with Opik<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Now that we\u2019ve gained some intuition for what\u2019s going on under the hood, we can use the&nbsp;<a href=\"https:\/\/github.com\/potsawee\/selfcheckgpt\"><code>selfcheckgpt<\/code><\/a><a href=\"https:\/\/github.com\/potsawee\/selfcheckgpt\">&nbsp;module<\/a>&nbsp;to create a custom Opik evaluation metric and automatically evaluate a dataset. If you aren\u2019t already, you can&nbsp;<a href=\"https:\/\/colab.research.google.com\/drive\/1E5yEq-d2pF9BQVkl0sE3XKBs1jksYNIa#scrollTo=adllX1AHnxxr\">follow along with the Colab here<\/a>.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">In this section, we\u2019ll use&nbsp;<a href=\"https:\/\/platform.openai.com\/docs\/models\/gpt-4o\">OpenAI\u2019s GPT-4o<\/a>&nbsp;via&nbsp;<a href=\"https:\/\/github.com\/BerriAI\/litellm\">LiteLLM<\/a>&nbsp;to answer a list of questions without any ground truth references, external context, or criteria. We\u2019ll then randomly generate three additional responses from GPT-4o and compare them to the original response.&nbsp;<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">After setting up our environment, we\u2019ll start by defining our model and logger. 
We\u2019ll also load the&nbsp;<a href=\"https:\/\/spacy.io\/models\/en\">small English language model<\/a>&nbsp;from spaCy, which will be used for sentence tokenization in our scoring metric.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>import litellm\nfrom litellm.integrations.opik.opik import OpikLogger\nfrom opik import track\nimport spacy\n\nopik_logger = OpikLogger()\nlitellm.callbacks = &#91;opik_logger]\n\nMODEL = \"gpt-4o\"\nSYSTEM_PROMPT = \"Answer the question as truthfully as possible in no more than six sentences.\"\n\nsentence_model = spacy.load('en_core_web_sm')\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">We\u2019ll define a simple function to generate our original response from GPT-4o, as well as an evaluation function which calls our application function and returns the output in the format expected by our pipeline later on. We decorate them both with the <code>track<\/code> decorator so that all of the details of the calls are automatically logged to&nbsp;<a href=\"https:\/\/github.com\/comet-ml\/opik\">Opik<\/a>.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code># Define the LLM application with tracking\n@track\ndef generate_answer(question: str) -&gt; str:\n    response = litellm.completion(\n        model=MODEL,\n        messages=&#91;\n            {\"role\": \"system\", \"content\": SYSTEM_PROMPT},\n            {\"role\": \"user\", \"content\": question},\n        ],\n    )\n    return response.choices&#91;0].message.content\n\n# Wrap the application so its output matches the arguments of the metrics' score methods\n@track\ndef evaluation_task(data):\n    llm_output = generate_answer(data&#91;'question'])\n    return {\"question\": data&#91;'question'],\n            \"answer\": llm_output,\n            \"model_name\": MODEL,\n            \"system_prompt\": SYSTEM_PROMPT}\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">We\u2019ll use the&nbsp;<a 
href=\"https:\/\/github.com\/potsawee\/selfcheckgpt\"><code>selfcheckgpt<\/code>&nbsp;module<\/a>\u2019s implementation of SelfCheckGPT-BERTScore to create a custom Opik metric by subclassing the&nbsp;<a href=\"https:\/\/www.comet.com\/docs\/opik\/python-sdk-reference\/\/evaluation\/metrics\/BaseMetric.html\"><code>base_metric.BaseMetric<\/code><\/a>&nbsp;class and implementing a&nbsp;<a href=\"https:\/\/www.comet.com\/docs\/opik\/evaluation\/metrics\/custom_metric#writing-your-own-custom-metric\">score method<\/a>.&nbsp;<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Note that we\u2019ll need to use the same model specified in the&nbsp;<code>MODEL<\/code>&nbsp;variable above to create three additional output samples. We then loop through these samples to determine their consistency with the original output generated by our&nbsp;<code>generate_answer<\/code>&nbsp;function.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>from typing import Any\nfrom opik.evaluation.metrics import base_metric, score_result\nimport selfcheckgpt\nfrom selfcheckgpt.modeling_selfcheck import SelfCheckBERTScore\nimport spacy\n\nclass SelfCheckGPTBERTScore(base_metric.BaseMetric):\n    def __init__(self, name: str = \"SelfCheckGPT\", model_name: str = \"gpt-4o\", language: str = \"en\", nlp = sentence_model):\n        self.name = name\n        self.model_name = model_name\n        self.language = language\n        self.opik_logger = OpikLogger()\n        self.selfcheck_bertscore = SelfCheckBERTScore()\n        self.nlp = nlp\n\n    def score(self, question: str, answer: str, model_name: str, system_prompt: str, num_samples: int = 3, **ignored_kwargs: Any):\n        \"\"\"\n        Score the output of an LLM.\n        Args:\n            question: The question asked to the LLM.\n            answer: The answer from the LLM to score.\n            model_name: Name of the model to use for generating samples.\n            system_prompt: System prompt to use for generating samples.\n            
num_samples: Number of additional samples to generate for comparison.\n            **ignored_kwargs: Any additional keyword arguments.\n        \"\"\"\n        # Generate num_samples # of additional responses\n        litellm.callbacks = &#91;self.opik_logger]\n        samples = &#91;]\n        for samp in range(num_samples):\n            response = litellm.completion(\n                model=model_name,\n                messages=&#91;\n                    {\"role\": \"system\", \"content\": system_prompt},\n                    {\"role\": \"user\", \"content\": question}\n                ]\n            )\n            samples.append(response.choices&#91;0].message.content.replace(\"\\n\", \" \").strip())\n\n        sentences = &#91;sent for sent in self.nlp(answer).sents]\n        sentences = &#91;sent.text.strip() for sent in sentences if len(sent) &gt; 3]\n\n        sent_scores_bertscore = self.selfcheck_bertscore.predict(\n            sentences,\n            samples)\n\n        return score_result.ScoreResult(\n            name=self.name,\n            value=max(sent_scores_bertscore)\n        )\n<\/code><\/pre>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-\"><\/h2>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-selfcheckgpt-with-mqag\">SelfCheckGPT with MQAG<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">BERTScore isn\u2019t the only way to use SelfCheckGPT, however. Another popular variety is SelfCheckGPT-MQAG (<a href=\"https:\/\/arxiv.org\/abs\/2301.12307\">Multiple-choice Question-Answer Generation<\/a>).&nbsp;<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">This approach generates multiple-choice questions using the MQAG model based on the main response and then compares the answers from different samples. 
If the answers are consistent, the sentence is considered valid; if they diverge, it indicates a potential hallucination.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\" id=\"attachment_13167\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"664\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/03\/selfcheckgpt-mqag-1024x664.png\" alt=\"Step-by-step diagram of selfcheckgpt with MQAG\" class=\"wp-image-14843\" srcset=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/03\/selfcheckgpt-mqag-1024x664.png 1024w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/03\/selfcheckgpt-mqag-300x195.png 300w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/03\/selfcheckgpt-mqag-768x498.png 768w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/03\/selfcheckgpt-mqag-1536x996.png 1536w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/03\/selfcheckgpt-mqag.png 2044w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption class=\"wp-element-caption\">Step-by-step diagram of SelfCheckGPT with MQAG<\/figcaption><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">A dictionary is returned containing the&nbsp;<a href=\"https:\/\/en.wikipedia.org\/wiki\/Kullback%E2%80%93Leibler_divergence\">Kullback-Leibler divergence<\/a>, mismatch count,&nbsp;<a href=\"https:\/\/en.wikipedia.org\/wiki\/Hellinger_distance\">Hellinger distance<\/a>, and total variation. 
For our use case, we\u2019ll simply take the max total variation across the samples, which reflects the strongest evidence that the passage contains a hallucination somewhere.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>from typing import Any\nfrom opik.evaluation.metrics import base_metric, score_result\nimport selfcheckgpt\nfrom selfcheckgpt.modeling_mqag import MQAG\nimport spacy\nimport torch\n\nclass SelfCheckGPTMQAG(base_metric.BaseMetric):\n    def __init__(self, name: str = \"SelfCheckGPTMQAG\", model_name: str = \"gpt-4o\", language: str = \"en\", nlp = sentence_model):\n        self.name = name\n        self.model_name = model_name\n        self.language = language\n        self.opik_logger = OpikLogger()\n        self.nlp = nlp\n        # Run MQAG on the GPU when one is available, otherwise fall back to the CPU\n        self.device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n        self.selfcheck_mqag = MQAG(device=self.device)\n\n    def score(self, question: str, answer: str, model_name: str, system_prompt: str, num_samples: int = 3,\n              num_questions: int = 3, **ignored_kwargs: Any):\n        \"\"\"\n        Score the output of an LLM using MQAG (runs on CUDA when available).\n\n        Args:\n            question: The question asked to the LLM.\n            answer: The answer from the LLM to score.\n            model_name: Name of the model to use for generating samples.\n            system_prompt: System prompt to use for generating samples.\n            num_samples: Number of additional samples to generate for comparison.\n            num_questions: Number of multiple-choice questions MQAG generates per comparison.\n            **ignored_kwargs: Any additional keyword arguments.\n        \"\"\"\n        # Generate num_samples # of additional responses\n        litellm.callbacks = &#91;self.opik_logger]\n        samples = &#91;]\n        for _ in range(num_samples):\n            response = litellm.completion(\n                model=model_name,\n                messages=&#91;\n                    {\"role\": \"system\", \"content\": system_prompt},\n                    {\"role\": \"user\", \"content\": question}\n                ]\n            )\n            samples.append(response.choices&#91;0].message.content.replace(\"\\n\", \" \").strip())\n\n        # Compare the original answer against each sample and collect the total variation\n        var_scores = &#91;]\n        for sample in samples:\n            score = self.selfcheck_mqag.score(candidate=sample, reference=answer, num_questions=num_questions, verbose=True)\n            var_scores.append(score&#91;\"total_variation\"])\n\n        return score_result.ScoreResult(\n            name=self.name,\n            value=max(var_scores)\n        )\n<\/code><\/pre>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-selfcheckgpt-with-llm-prompting\">SelfCheckGPT with LLM Prompting<\/h2>\n\n\n\n<figure class=\"wp-block-image size-large is-resized\" id=\"attachment_13169\"><img loading=\"lazy\" decoding=\"async\" width=\"948\" height=\"1024\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/03\/selfcheckgpt-llm-prompt-948x1024.png\" alt=\"Step-by-step diagram of SelfCheckGPT with LLM Prompting\" class=\"wp-image-14844\" style=\"aspect-ratio:1;width:865px;height:auto\" srcset=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/03\/selfcheckgpt-llm-prompt-948x1024.png 948w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/03\/selfcheckgpt-llm-prompt-278x300.png 278w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/03\/selfcheckgpt-llm-prompt-768x830.png 768w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/03\/selfcheckgpt-llm-prompt.png 1314w\" sizes=\"auto, (max-width: 948px) 100vw, 948px\" \/><figcaption class=\"wp-element-caption\">Step-by-step diagram of SelfCheckGPT with LLM Prompting<\/figcaption><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">In this variant, an LLM is prompted to assess whether a sentence is supported by a sample response. Based on the answer (<code>Yes<\/code>&nbsp;or&nbsp;<code>No<\/code>), an inconsistency score is computed. 
This approach is effective with models like GPT-3.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>from typing import Any\nfrom opik.evaluation.metrics import base_metric, score_result\nimport selfcheckgpt\nfrom selfcheckgpt.modeling_selfcheck_apiprompt import SelfCheckAPIPrompt\nimport spacy\n\nclass SelfCheckGPTAPIPrompt(base_metric.BaseMetric):\n    def __init__(self, name: str = \"SelfCheckGPTAPIPrompt\", model_name: str = \"gpt-4o\", language: str = \"en\", nlp = sentence_model):\n        self.name = name\n        self.model_name = model_name\n        self.language = language\n        self.opik_logger = OpikLogger()\n\n        # API access currently only supports client_type=\"openai\"\n        self.selfcheck_prompt = SelfCheckAPIPrompt(client_type=\"openai\", model=self.model_name)\n\n        # Reuse the spaCy model loaded earlier instead of loading it again\n        self.nlp = nlp\n\n    def score(self, question: str, answer: str, model_name: str, system_prompt: str, num_samples: int = 3,\n              num_questions: int = 3, **ignored_kwargs: Any):\n        \"\"\"\n        Score the output of an LLM using SelfCheckGPT with LLM prompting.\n\n        Args:\n            question: The question asked to the LLM.\n            answer: The answer from the LLM to score.\n            model_name: Name of the model to use for generating samples.\n            system_prompt: System prompt to use for generating samples.\n            num_samples: Number of additional samples to generate for comparison.\n            num_questions: Unused by this variant; accepted for parity with the MQAG metric.\n            **ignored_kwargs: Any additional keyword arguments.\n        \"\"\"\n        # Generate num_samples # of additional responses\n        litellm.callbacks = &#91;self.opik_logger]\n        samples = &#91;]\n        for _ in range(num_samples):\n            response = litellm.completion(\n                model=model_name,\n                messages=&#91;\n                    {\"role\": \"system\", 
\"content\": system_prompt},\n                    {\"role\": \"user\", \"content\": question}\n                ]\n            )\n            samples.append(response.choices&#91;0].message.content.replace(\"\\n\", \" \").strip())\n\n        sentences = &#91;sent.text.strip() for sent in self.nlp(answer).sents]\n\n        sent_scores_prompt = self.selfcheck_prompt.predict(sentences=sentences,  # list of sentences\n                                                      sampled_passages=samples,  # list of sampled passages\n                                                      verbose=False,  # whether to show a progress bar\n        )\n        return score_result.ScoreResult(\n            name=self.name,\n            value=max(sent_scores_prompt)\n        )\n<\/code><\/pre>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-automating-evaluation-with-opik\">Automating Evaluation with Opik<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Now, let\u2019s put it all together and run our three varieties of SelfCheckGPT on a toy dataset. 
We can create a simple list of questions, which we upload to&nbsp;<a href=\"https:\/\/www.comet.com\/docs\/opik\/\">Opik<\/a>:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>questions = &#91;\n    \"Who was Ernst Opik?\",\n    \"How big was the Spanish Armada?\",\n    \"What is the Great Wall of China?\",\n    \"Do honeybees sleep?\",\n    \"What are asteroids made of?\",\n    \"How does a starfish move?\",\n    \"What did Marie Curie do?\",\n    \"What did the sixth president of the United States accomplish?\",\n    \"When is the best time to launch a rocket?\",\n    \"Where is the largest man-made structure located?\"\n]\n\nfrom opik import Opik\n\n# Log dataset to Opik\nclient = Opik()\ndataset = client.get_or_create_dataset(name=\"SelfCheckGPT-dataset\")\ndataset.insert(&#91;{\"question\": q} for q in questions])\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Then, to use our three SelfCheckGPT metrics, we simply instantiate them and pass them as a list to the&nbsp;<code>scoring_metrics<\/code>&nbsp;parameter of the&nbsp;<code>opik.evaluation.evaluate<\/code>&nbsp;function:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>from opik.evaluation import evaluate\n\n# Use distinct instance names so the metric classes are not shadowed\n# (re-running the cell would otherwise try to call an instance)\nbertscore_metric = SelfCheckGPTBERTScore()\nmqag_metric = SelfCheckGPTMQAG()\napi_prompt_metric = SelfCheckGPTAPIPrompt()\n\n# Perform the evaluation\nevaluation = evaluate(\n    experiment_name=\"My SelfCheckGPT Experiment\",\n    dataset=dataset,\n    task=evaluation_task,\n    scoring_metrics=&#91;bertscore_metric,\n                     mqag_metric,\n                     api_prompt_metric],\n    task_threads=1\n)\n\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Once this has finished running, your results should look something like this:<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"555\" 
src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/03\/selfcheckgpt_UI-1024x555.png\" alt=\"Screenshot of selfcheckgpt project dashboard in Comet\/Opik\" class=\"wp-image-14845\" srcset=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/03\/selfcheckgpt_UI-1024x555.png 1024w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/03\/selfcheckgpt_UI-300x162.png 300w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/03\/selfcheckgpt_UI-768x416.png 768w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/03\/selfcheckgpt_UI-1536x832.png 1536w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/03\/selfcheckgpt_UI-2048x1109.png 2048w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption class=\"wp-element-caption\">Comparing three flavors of SelfCheckGPT, as automatically tracked by Opik<\/figcaption><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-advantages-and-limitations-of-selfcheckgpt\">Advantages and Limitations of SelfCheckGPT<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">SelfCheckGPT offers a reference-free, zero-resource alternative for evaluating the factuality of LLM outputs, but it comes with some notable limitations.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">A major drawback is its computational overhead. SelfCheckGPT requires multiple sampled generations per input, so inference cost grows with the number of samples. While some components (e.g., multiple generations) can be parallelized across GPUs, others, like BERTScore similarity computations, have sequential dependencies that limit speedups. For instance, with MQAG, processing a small dataset of ten questions takes over ten minutes on a T4 GPU, which severely limits scalability.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Another limitation stems from its core assumptions. SelfCheckGPT relies on the idea that true facts will appear more frequently across sampled generations. 
But this doesn\u2019t always hold in practice: if a model consistently generates the same hallucinated information, SelfCheckGPT could incorrectly classify it as factual. This is especially problematic if the model has learned false knowledge from biased or incorrect training data. Likewise, if a model is biased towards certain types of answers, SelfCheckGPT may inadvertently reinforce that bias.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Despite these limitations, SelfCheckGPT remains a valuable tool for hallucination detection. It offers a flexible, domain-agnostic approach that, while computationally expensive, can be more cost-effective than <a href=\"https:\/\/www.comet.com\/site\/blog\/human-in-the-loop\/\">human-in-the-loop<\/a> annotation, and it works without references, external resources, or databases. Its strengths make it particularly useful in high-stakes domains that demand both factual accuracy and scalability, representing a step toward more reliable and trustworthy AI.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>If you found this article useful, follow me on&nbsp;<\/strong><a href=\"https:\/\/www.linkedin.com\/in\/anmorgan24\/\"><strong>LinkedIn<\/strong><\/a><strong>&nbsp;and&nbsp;<a href=\"https:\/\/x.com\/anmorgan2414\">X\/<\/a><\/strong><a href=\"https:\/\/x.com\/anmorgan2414\"><strong>Twitter<\/strong><\/a><strong>&nbsp;for more content!<\/strong><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Detecting hallucinations in language models is challenging. There are three general approaches: The problem with many LLM-as-a-Judge techniques is that they tend towards two polarities: they are either too simple, using a basic zero-shot approach, or they are wildly complex, involving multiple LLMs interacting via multi-turn reasoning. 
SelfCheckGPT offers a&nbsp;reference-free zero-resource&nbsp;alternative: a&nbsp;sampling-based approach&nbsp;that fact-checks responses [&hellip;]<\/p>\n","protected":false},"author":22,"featured_media":18420,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"customer_name":"","customer_description":"","customer_industry":"","customer_technologies":"","customer_logo":"","_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[8,65,7],"tags":[40,93,95,94,34],"coauthors":[133],"class_list":["post-13134","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-comet-community-hub","category-llmops","category-tutorials","tag-comet","tag-evaluation-metrics","tag-llm-evaluation","tag-opik","tag-prompt-engineering"],"yoast_head":"<!-- This site is optimized with the Yoast SEO Premium plugin v25.9 (Yoast SEO v25.9) - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>SelfCheckGPT for LLM Evaluation - Comet<\/title>\n<meta name=\"description\" content=\"SelfCheckGPT analyzes divergences in output across multiple stochastic LLM runs, leveraging response variability to detect hallucinations\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.comet.com\/site\/blog\/selfcheckgpt-for-llm-evaluation\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"SelfCheckGPT for LLM Evaluation\" \/>\n<meta property=\"og:description\" content=\"SelfCheckGPT analyzes divergences in output across multiple stochastic LLM runs, leveraging response variability to detect hallucinations\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.comet.com\/site\/blog\/selfcheckgpt-for-llm-evaluation\/\" \/>\n<meta property=\"og:site_name\" content=\"Comet\" \/>\n<meta 
property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/cometdotml\" \/>\n<meta property=\"article:published_time\" content=\"2025-03-27T01:48:33+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2025-11-14T17:01:47+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/03\/SelfCheckGPT-scaled.jpg\" \/>\n\t<meta property=\"og:image:width\" content=\"2560\" \/>\n\t<meta property=\"og:image:height\" content=\"1440\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/jpeg\" \/>\n<meta name=\"author\" content=\"Abby Morgan\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@anmorgan2414\" \/>\n<meta name=\"twitter:site\" content=\"@Cometml\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Abby Morgan\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"10 minutes\" \/>\n<!-- \/ Yoast SEO Premium plugin. 
-->","yoast_head_json":{"title":"SelfCheckGPT for LLM Evaluation - Comet","description":"SelfCheckGPT analyzes divergences in output across multiple stochastic LLM runs, leveraging response variability to detect hallucinations","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.comet.com\/site\/blog\/selfcheckgpt-for-llm-evaluation\/","og_locale":"en_US","og_type":"article","og_title":"SelfCheckGPT for LLM Evaluation","og_description":"SelfCheckGPT analyzes divergences in output across multiple stochastic LLM runs, leveraging response variability to detect hallucinations","og_url":"https:\/\/www.comet.com\/site\/blog\/selfcheckgpt-for-llm-evaluation\/","og_site_name":"Comet","article_publisher":"https:\/\/www.facebook.com\/cometdotml","article_published_time":"2025-03-27T01:48:33+00:00","article_modified_time":"2025-11-14T17:01:47+00:00","og_image":[{"width":2560,"height":1440,"url":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/03\/SelfCheckGPT-scaled.jpg","type":"image\/jpeg"}],"author":"Abby Morgan","twitter_card":"summary_large_image","twitter_creator":"@anmorgan2414","twitter_site":"@Cometml","twitter_misc":{"Written by":"Abby Morgan","Est. 
reading time":"10 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/www.comet.com\/site\/blog\/selfcheckgpt-for-llm-evaluation\/#article","isPartOf":{"@id":"https:\/\/www.comet.com\/site\/blog\/selfcheckgpt-for-llm-evaluation\/"},"author":{"name":"Abby Morgan","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/826ee39a2e30cf9d8d73155de09bb7b2"},"headline":"SelfCheckGPT for LLM Evaluation","datePublished":"2025-03-27T01:48:33+00:00","dateModified":"2025-11-14T17:01:47+00:00","mainEntityOfPage":{"@id":"https:\/\/www.comet.com\/site\/blog\/selfcheckgpt-for-llm-evaluation\/"},"wordCount":2064,"publisher":{"@id":"https:\/\/www.comet.com\/site\/#organization"},"image":{"@id":"https:\/\/www.comet.com\/site\/blog\/selfcheckgpt-for-llm-evaluation\/#primaryimage"},"thumbnailUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/03\/SelfCheckGPT-scaled.jpg","keywords":["Comet","Evaluation metrics","LLM Evaluation","Opik","Prompt Engineering"],"articleSection":["Comet Community Hub","LLMOps","Tutorials"],"inLanguage":"en-US"},{"@type":"WebPage","@id":"https:\/\/www.comet.com\/site\/blog\/selfcheckgpt-for-llm-evaluation\/","url":"https:\/\/www.comet.com\/site\/blog\/selfcheckgpt-for-llm-evaluation\/","name":"SelfCheckGPT for LLM Evaluation - Comet","isPartOf":{"@id":"https:\/\/www.comet.com\/site\/#website"},"primaryImageOfPage":{"@id":"https:\/\/www.comet.com\/site\/blog\/selfcheckgpt-for-llm-evaluation\/#primaryimage"},"image":{"@id":"https:\/\/www.comet.com\/site\/blog\/selfcheckgpt-for-llm-evaluation\/#primaryimage"},"thumbnailUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/03\/SelfCheckGPT-scaled.jpg","datePublished":"2025-03-27T01:48:33+00:00","dateModified":"2025-11-14T17:01:47+00:00","description":"SelfCheckGPT analyzes divergences in output across multiple stochastic LLM runs, leveraging response variability to detect 
hallucinations","breadcrumb":{"@id":"https:\/\/www.comet.com\/site\/blog\/selfcheckgpt-for-llm-evaluation\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.comet.com\/site\/blog\/selfcheckgpt-for-llm-evaluation\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/blog\/selfcheckgpt-for-llm-evaluation\/#primaryimage","url":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/03\/SelfCheckGPT-scaled.jpg","contentUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/03\/SelfCheckGPT-scaled.jpg","width":2560,"height":1440,"caption":"futuristic visualization for selfcheck gpt in use for llm evaluation"},{"@type":"BreadcrumbList","@id":"https:\/\/www.comet.com\/site\/blog\/selfcheckgpt-for-llm-evaluation\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.comet.com\/site\/"},{"@type":"ListItem","position":2,"name":"SelfCheckGPT for LLM Evaluation"}]},{"@type":"WebSite","@id":"https:\/\/www.comet.com\/site\/#website","url":"https:\/\/www.comet.com\/site\/","name":"Comet","description":"Build Better Models Faster","publisher":{"@id":"https:\/\/www.comet.com\/site\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.comet.com\/site\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/www.comet.com\/site\/#organization","name":"Comet ML, 
Inc.","alternateName":"Comet","url":"https:\/\/www.comet.com\/site\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/#\/schema\/logo\/image\/","url":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/01\/logo_comet_square.png","contentUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/01\/logo_comet_square.png","width":310,"height":310,"caption":"Comet ML, Inc."},"image":{"@id":"https:\/\/www.comet.com\/site\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/cometdotml","https:\/\/x.com\/Cometml","https:\/\/www.youtube.com\/channel\/UCmN63HKvfXSCS-UwVwmK8Hw"]},{"@type":"Person","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/826ee39a2e30cf9d8d73155de09bb7b2","name":"Abby Morgan","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/image\/dbbf1ae921ee179c768f508340415946","url":"https:\/\/secure.gravatar.com\/avatar\/28d4934d14261b4afe12e226f0eaa57c4fb0c2761ad4586eb9a5bec3b8160bc9?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/28d4934d14261b4afe12e226f0eaa57c4fb0c2761ad4586eb9a5bec3b8160bc9?s=96&d=mm&r=g","caption":"Abby Morgan"},"description":"AI\/ML Growth Engineer @ 
Comet","sameAs":["https:\/\/www.comet.com\/","https:\/\/www.linkedin.com\/in\/anmorgan24\/","https:\/\/x.com\/anmorgan2414"],"url":"https:\/\/www.comet.com\/site\/blog\/author\/abigailmcomet-com\/"}]}},"jetpack_featured_media_url":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/03\/SelfCheckGPT-scaled.jpg","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/13134","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/users\/22"}],"replies":[{"embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/comments?post=13134"}],"version-history":[{"count":2,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/13134\/revisions"}],"predecessor-version":[{"id":18447,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/13134\/revisions\/18447"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/media\/18420"}],"wp:attachment":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/media?parent=13134"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/categories?post=13134"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/tags?post=13134"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/coauthors?post=13134"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}