SelfCheckGPT for LLM Evaluation

Detecting hallucinations in language models is challenging. There are three general approaches:

  • Measuring token-level probability distributions for indications that a model is “confused.” Though sometimes effective, these methods rely on model internals being accessible, which is often not the case when working with hosted LLMs (see the sketch after this list).
  • Referencing external fact-verification systems, like a database or document store. These methods are great for RAG-style use-cases, but they are only effective if you have a useful dataset and the infrastructure to use it.
  • Using LLM-as-a-judge techniques to assess whether a model hallucinated. These techniques are becoming standard in the LLM ecosystem, but as I’ll explain throughout this piece, using them effectively requires a deceptively large amount of work.
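
To make the first approach concrete, here is a minimal sketch of a token-probability check using a small local model through Hugging Face transformers. The specific model (GPT-2) and the use of average negative log-likelihood as the “confusion” signal are illustrative assumptions, not a prescribed recipe; the point is that this style of check needs access to the logits, which hosted APIs rarely expose.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# A small local model; token-level scores require access to the logits,
# which hosted LLM APIs usually do not expose.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

text = "The Eiffel Tower is located in Berlin."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Per-token negative log-likelihood: each position predicts the next token.
shifted_logits = logits[:, :-1, :].transpose(1, 2)   # (batch, vocab, seq-1)
targets = inputs["input_ids"][:, 1:]                 # (batch, seq-1)
token_nll = torch.nn.functional.cross_entropy(shifted_logits, targets, reduction="none")

# A high average NLL is one crude signal that the model is "confused."
print(f"Mean token NLL: {token_nll.mean().item():.3f}")
```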

The problem with many LLM-as-a-judge techniques is that they tend toward one of two extremes: they are either too simple, relying on a basic zero-shot prompt, or wildly complex, involving multiple LLMs interacting through multi-turn reasoning.

SelfCheckGPT offers a reference-free, zero-resource alternative: a sampling-based approach that fact-checks responses without external resources or access to the model’s internal uncertainty metrics. The key idea is consistency: if an LLM truly “knows” a fact, multiple randomly sampled responses should align; if a claim is hallucinated, the responses will vary and contradict each other.

[Figure: SelfCheckGPT with LLM prompting. Each LLM-generated sentence is compared against stochastically generated responses, with no external database; the comparison can be performed, for example, via LLM prompting, as shown above. From “SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models.”]

Detecting Hallucinations Via Consistency

There are several varieties of SelfCheckGPT, but all share the same basic structure (sketched in code after this list):

  • First, a user asks a question, and the AI gives an answer.
  • SelfCheckGPT then asks the same AI the same question multiple times and collects several new responses.
  • It compares the original answer to these new responses.
  • If the answers are consistent, the original response is likely accurate.
  • If the answers contradict each other, the original response is likely a hallucination.
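
Here is a minimal sketch of that loop in Python. The original answer and the sampled answers are hard-coded example strings, and the comparison step is a toy word-overlap check that merely stands in for the real scoring methods (BERTScore, QA, NLI, or LLM prompting) discussed below.

```python
import re

def consistency_score(sentence: str, sampled_passages: list[str]) -> float:
    """Toy stand-in for a real comparison method (BERTScore, NLI, QA, or
    LLM prompting): the fraction of sampled passages that share almost no
    content words with the sentence. Higher = less consistent = more suspect."""
    words = {w for w in re.findall(r"[a-z0-9']+", sentence.lower()) if len(w) > 3}
    misses = 0
    for passage in sampled_passages:
        passage_words = set(re.findall(r"[a-z0-9']+", passage.lower()))
        overlap = len(words & passage_words) / max(len(words), 1)
        if overlap < 0.3:
            misses += 1
    return misses / len(sampled_passages)

# The original answer plus re-sampled answers to the same question
# (in practice, generated by the same model with temperature > 0).
original = "The Eiffel Tower was completed in 1889. It is located in Berlin."
samples = [
    "The Eiffel Tower opened in 1889 and stands in Paris, France.",
    "Completed in 1889, the Eiffel Tower is a landmark in Paris.",
    "The Eiffel Tower, finished in 1889, is found in Paris.",
]

for sentence in original.split(". "):
    print(f"{consistency_score(sentence, samples):.2f}  {sentence}")
```

In this toy run, the first sentence is supported by every sample and scores 0.0, while the fabricated Berlin claim shares no content words with any sample and scores 1.0.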

To quantify this, SelfCheckGPT assigns each sentence a hallucination score between 0 and 1:

  • 0.0 means the sentence is consistent with the sampled responses and likely factual.
  • 1.0 means the sentence is inconsistent with them and likely a hallucination.

The major advantage of this approach is that it provides a practical way to assess factual reliability without external dependencies or internal model access. Its consensus mechanism is reminiscent of ensemble techniques like LLM juries, but it requires only a single model.

SelfCheckGPT also benefits from being an extremely flexible framework. We’ll explore this in the next section, but you can easily imagine many different approaches to assessing “agreement” between answers.

An Overview of SelfCheckGPT Approaches

All types of SelfCheckGPT follow the same general framework for detecting hallucinations in LLMs, but each uses a different method to measure response consistency. The original SelfCheckGPT paper described the following five methods:

  • SelfCheckGPT with BERTScore: This variant evaluates the factuality of a sentence by comparing it to similar sentences from multiple sampled responses using the BERTScore metric.
  • SelfCheckGPT with Question Answering (QA): This approach uses MQAG (Multiple-choice Question Answering and Generation) to generate multiple-choice questions from the main response, then checks whether the sampled responses yield the same answers.
  • SelfCheckGPT with an N-gram Model: Here, a simple n-gram model is trained using multiple samples to approximate the LLM’s token probabilities and detect inconsistencies.
  • SelfCheckGPT with Natural Language Inference (NLI): This method uses an NLI model (DeBERTa-v3-large) to assess contradictions between responses.
  • SelfCheckGPT with Prompting: An LLM is prompted to assess whether a sentence is supported by a sampled response, using a Yes/No format (a minimal version is sketched after this list).
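
To show how lightweight the prompting variant can be, here is a from-scratch sketch. It uses the OpenAI chat completions client purely as an example (the model name is a placeholder, and any chat-capable LLM would do), and the Yes/No prompt is an approximation of the template described in the paper: Yes maps to 0.0, No maps to 1.0, and the scores are averaged over the sampled passages.

```python
# pip install openai -- assumes OPENAI_API_KEY is set; any chat LLM client would work.
from openai import OpenAI

client = OpenAI()

PROMPT = (
    "Context: {context}\n\n"
    "Sentence: {sentence}\n\n"
    "Is the sentence supported by the context above? Answer Yes or No.\n\n"
    "Answer: "
)

def prompt_score(sentence: str, sampled_passages: list[str], model: str = "gpt-4o-mini") -> float:
    """Hallucination score in [0, 1]: the fraction of sampled passages
    that the judge says do NOT support the sentence."""
    scores = []
    for passage in sampled_passages:
        reply = client.chat.completions.create(
            model=model,
            temperature=0.0,
            messages=[{"role": "user", "content": PROMPT.format(context=passage, sentence=sentence)}],
        )
        answer = reply.choices[0].message.content.strip().lower()
        scores.append(0.0 if answer.startswith("yes") else 1.0)
    return sum(scores) / len(scores)

samples = [
    "The Eiffel Tower opened in 1889 and stands in Paris, France.",
    "Completed in 1889, the Eiffel Tower is a landmark in Paris.",
]
print(prompt_score("The Eiffel Tower is located in Berlin.", samples))  # expect a score near 1.0
```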

In this tutorial, we’ll be focusing on the three most popular variants: SelfCheckGPT-BERTScore, SelfCheckGPT-MQAG, and SelfCheckGPT-LLMPrompt.
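
If you’d rather not implement the scoring yourself, the paper’s authors publish an open-source selfcheckgpt package. The snippet below shows roughly how the BERTScore variant is invoked; the class name, constructor flag, and predict() keywords follow my reading of the project’s README and may differ between versions, and the MQAG and LLM-prompt variants expose a similar predict() interface but take additional arguments (the full passage, question-generation settings, or the judge model), so treat this as a sketch and check the repository for the current API.

```python
# pip install selfcheckgpt spacy && python -m spacy download en_core_web_sm
# Sketch only: the exact selfcheckgpt API may differ between versions.
import spacy
from selfcheckgpt.modeling_selfcheck import SelfCheckBERTScore

nlp = spacy.load("en_core_web_sm")  # sentence splitting, as in the project's examples

passage = "The Eiffel Tower was completed in 1889. It is located in Berlin."
samples = [
    "The Eiffel Tower opened in 1889 and stands in Paris, France.",
    "Completed in 1889, the Eiffel Tower is a landmark in Paris.",
    "The Eiffel Tower, finished in 1889, is found in Paris.",
]

sentences = [sent.text.strip() for sent in nlp(passage).sents]

# BERTScore variant: each sentence is scored against the sampled passages;
# higher scores indicate lower consistency (more likely hallucinated).
selfcheck_bertscore = SelfCheckBERTScore(rescale_with_baseline=True)
sent_scores = selfcheck_bertscore.predict(
    sentences=sentences,
    sampled_passages=samples,
)
print(sent_scores)  # one score in [0, 1] per sentence
```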

Abby Morgan

AI/ML Growth Engineer @ Comet