{"id":11966,"date":"2024-11-21T14:33:47","date_gmt":"2024-11-21T22:33:47","guid":{"rendered":"https:\/\/live-cometml.pantheonsite.io\/?p=11966"},"modified":"2025-11-17T21:20:19","modified_gmt":"2025-11-17T21:20:19","slug":"perplexity-for-llm-evaluation","status":"publish","type":"post","link":"https:\/\/www.comet.com\/site\/blog\/perplexity-for-llm-evaluation\/","title":{"rendered":"Perplexity for LLM Evaluation"},"content":{"rendered":"\n<div class=\"wp-block-buttons is-layout-flex wp-block-buttons-is-layout-flex\">\n<div class=\"wp-block-button\"><a class=\"wp-block-button__link wp-element-button\" href=\"https:\/\/colab.research.google.com\/drive\/1EtrGnKij2OdXA23ty-4tSMuLDSSfGeen\" target=\"_blank\" rel=\"noreferrer noopener\">Follow along with the Colab!<\/a><\/div>\n<\/div>\n\n\n\n<p class=\"wp-block-paragraph\"><span style=\"font-weight: 400;\">Perplexity is, historically speaking, one of the &#8220;standard&#8221; evaluation metrics for language models. And while recent years have seen a surge in more complex and robust metrics, including LLM-based evaluations, perplexity still has a lot of value as a component in your evaluation suite. 
If you want to build effective evaluation pipelines, or just understand what researchers mean when they report perplexity scores, you need to have a grasp of what perplexity is and how it can be implemented.<\/span><\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"576\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/11\/Perplexity-1024x576.jpg\" alt=\"Perplexity for LLM Evaluation Blog title image\" class=\"wp-image-18416\" srcset=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/11\/Perplexity-1024x576.jpg 1024w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/11\/Perplexity-300x169.jpg 300w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/11\/Perplexity-768x432.jpg 768w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/11\/Perplexity-1536x864.jpg 1536w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/11\/Perplexity.jpg 1920w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\"><span style=\"font-weight: 400;\">Perplexity seeks to quantify the &#8220;uncertainty&#8221; a model experiences when predicting the next token in a sequence. High uncertainty occurs when the model is unsure about the next word or token in a sequence. This can happen when the input is ambiguous or the model hasn&#8217;t seen similar examples during training. 
<\/span><b>Quantifying uncertainty in language models helps us judge when a model might need <a href=\"https:\/\/www.comet.com\/site\/blog\/human-in-the-loop\/\">human-in-the-loop<\/a> oversight or further training, allowing us to handle those cases differently.<\/b><span style=\"font-weight: 400;\"> This is especially critical in high-stakes situations, like with medical or legal advice, where an overconfident wrong answer could have serious consequences.<\/span><\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><span style=\"font-weight: 400;\">But this is just scratching the surface of perplexity. In this article, I want to go in depth, covering perplexity&#8217;s mathematical basis, underlying intuitions, and limitations. I&#8217;ll show you how to implement perplexity from scratch in Python, and how to add perplexity to your evaluation suite using <a href=\"https:\/\/github.com\/comet-ml\/opik\">Opik<\/a>, our open-source <a href=\"https:\/\/www.comet.com\/site\/blog\/llm-evaluation-frameworks\/\">LLM evaluation framework<\/a>.<\/span><\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><span style=\"font-weight: 400;\">Let\u2019s dive in!<\/span><\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-a-little-background-on-perplexity\">A Little Background on Perplexity<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\"><span style=\"font-weight: 400;\">Perplexity was first introduced in 1977 by a team of IBM researchers working on speech recognition<\/span><span style=\"font-weight: 400;\">. The team, led by Frederick Jelinek, was looking for a metric that could measure the difficulty a statistical model experienced while making a prediction. 
As an interesting aside, Jelinek is the original author of the famous quote \u201cEvery time I fire a linguist, the performance of the speech recognizer goes up.\u201d<\/span><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">&nbsp;<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter wp-image-11968 size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1331\" height=\"349\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/11\/Screenshot-2024-11-07-at-2.43.26\u202fPM.png\" alt=\"Table from the original paper by Frederick Jelinek that first introduced the term perplexity\" class=\"wp-image-11968\" srcset=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/11\/Screenshot-2024-11-07-at-2.43.26\u202fPM.png 1331w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/11\/Screenshot-2024-11-07-at-2.43.26\u202fPM-300x79.png 300w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/11\/Screenshot-2024-11-07-at-2.43.26\u202fPM-1024x269.png 1024w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/11\/Screenshot-2024-11-07-at-2.43.26\u202fPM-768x201.png 768w\" sizes=\"auto, (max-width: 1331px) 100vw, 1331px\" \/><figcaption class=\"wp-element-caption\">Table from the original paper: <a href=\"https:\/\/pubs.aip.org\/asa\/jasa\/article\/62\/S1\/S63\/642598\/Perplexity-a-measure-of-the-difficulty-of-speech\">Perplexity\u2014a measure of the difficulty of speech recognition tasks<\/a><\/figcaption><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">&nbsp;<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><span style=\"font-weight: 400;\">The key insight of the initial perplexity paper is that by applying concepts from information theory to a model\u2019s internal state, we can begin to quantify more subtle qualities of a model. 
While the original authors were looking for a metric to approximate the \u201cdifficulty\u201d of speech recognition tasks, researchers working on NLP quickly recognized perplexity as relevant to their work as well.<\/span><\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><span style=\"font-weight: 400;\">Throughout the 1980s and 1990s, perplexity emerged as the key metric for evaluating the performance of n-gram models. Perplexity was used to measure how well these models captured linguistic patterns by quantifying the average uncertainty of predictions. \u201cUncertainty\u201d was calculated using entropy and its close mathematical relative, cross-entropy, both of which we\u2019ll explore in more detail shortly.<\/span><\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><span style=\"font-weight: 400;\">Perplexity remains a primary benchmark to this day and is a popular metric for evaluating sequential neural networks (including the GPT family of models). Its many advantages, and its historical role in benchmarking, make it common even in contemporary research. At the same time, its many limitations make it insufficient as a standalone evaluation metric, especially for modern LLMs.<\/span><\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><span style=\"font-weight: 400;\">In order to gain a more intuitive understanding of perplexity and its pros and cons, we need to first explore the underlying mathematics. Namely, we need to understand entropy and cross-entropy. If you already feel comfortable with these topics, feel free to skip the following section.<\/span><\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"h-entropy-cross-entropy-and-information-theory\">Entropy, Cross-entropy and Information Theory<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\"><span style=\"font-weight: 400;\">Perplexity, as an evaluation metric, has its roots in information theory and probabilistic modeling, building on Claude Shannon\u2019s work on entropy in the 1940s. 
Shannon used language entropy to describe the amount of information in a message, specifically when text in a natural language is encoded into binary digits and decoded back into the original language:<\/span><\/p>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p class=\"wp-block-paragraph\"><span style=\"font-weight: 400;\">\u201cThe entropy is a statistical parameter which measures, in a certain sense, how much information is produced on the average for each letter of a text in the language. If the language is translated into binary digits (0 or 1) in the most efficient way, the entropy H is the average number of binary digits required per letter of the original language.\u201d<\/span><\/p>\n<cite><i><span style=\"font-weight: 400;\">Claude Shannon in <\/span><\/i><a href=\"https:\/\/www.princeton.edu\/~wbialek\/rome\/refs\/shannon_51.pdf\"><span style=\"font-weight: 400;\">Prediction and Entropy of Printed English<\/span><\/a><span style=\"font-weight: 400;\">, 1951<\/span><\/cite><\/blockquote>\n\n\n\n<p class=\"wp-block-paragraph\"><span style=\"font-weight: 400;\">As described by Shannon in the 1940s and 1950s, language entropy quantifies the average amount of information contained in a word or sequence of words, reflecting how unpredictable the next word is based on previous context. 
In other words, language entropy refers to the degree of uncertainty or unpredictability in a language\u2019s word distribution.<\/span><\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><span style=\"font-weight: 400;\">In the experiment below, Shannon counted how many guesses it took a human being to correctly predict each letter (including spaces) in the sentence, given only the preceding letters in the sequence.<\/span><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">&nbsp;<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter wp-image-11976 size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"835\" height=\"183\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/11\/Screenshot-2024-10-21-at-11.21.55\u202fAM.png\" alt=\"Graphic from Prediction and Entropy of Printed English by Claude Shannon, 1951, showing the number of guesses it took humans to guess the correct next character in a sentence, including spaces. This research led to the development of perplexity as a metric.\" class=\"wp-image-11976\" srcset=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/11\/Screenshot-2024-10-21-at-11.21.55\u202fAM.png 835w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/11\/Screenshot-2024-10-21-at-11.21.55\u202fAM-300x66.png 300w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/11\/Screenshot-2024-10-21-at-11.21.55\u202fAM-768x168.png 768w\" sizes=\"auto, (max-width: 835px) 100vw, 835px\" \/><figcaption class=\"wp-element-caption\">Graphic from <a href=\"https:\/\/www.princeton.edu\/~wbialek\/rome\/refs\/shannon_51.pdf\">Prediction and Entropy of Printed English<\/a> by Claude Shannon, 1951<\/figcaption><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">&nbsp;<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><span style=\"font-weight: 400;\">Entropy is calculated as <strong>H(P)<\/strong> where <\/span><strong><i>p<\/i>(<i>w<\/i><sub><i>i<\/i><\/sub><\/strong><span style=\"font-weight: 400;\"><strong>)<\/strong> is the probability of the 
<\/span><strong><i>i<\/i><\/strong><span style=\"font-weight: 400;\">th word occurring in a given context, and the summation runs over all possible words in the vocabulary. The negative sign ensures that entropy is a non-negative value, as <strong>log <em>p<\/em>(<em>w<\/em>)<\/strong> is never positive for probabilities between 0 and 1.<\/span><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">&nbsp;<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter wp-image-11980\"><img loading=\"lazy\" decoding=\"async\" width=\"1328\" height=\"156\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/11\/Screenshot-2024-10-21-at-11.40.56\u202fAM.png\" alt=\"Mathematical equation for entropy\" class=\"wp-image-11980\" srcset=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/11\/Screenshot-2024-10-21-at-11.40.56\u202fAM.png 1328w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/11\/Screenshot-2024-10-21-at-11.40.56\u202fAM-300x35.png 300w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/11\/Screenshot-2024-10-21-at-11.40.56\u202fAM-1024x120.png 1024w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/11\/Screenshot-2024-10-21-at-11.40.56\u202fAM-768x90.png 768w\" sizes=\"auto, (max-width: 1328px) 100vw, 1328px\" \/><figcaption class=\"wp-element-caption\">Mathematical equation for entropy<\/figcaption><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">&nbsp;<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><span style=\"font-weight: 400;\">Higher entropy values indicate lower predictability and greater diversity in word choice, while lower values suggest more predictable language patterns, reflecting the underlying complexity of the language being modeled.<\/span><\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><span style=\"font-weight: 400;\">Because the output of a large language model is typically a probability distribution calculated across all possible output tokens, entropy is very straightforward to calculate. 
A natural next question is how we might use entropy to train a language model, and that is where entropy\u2019s close relative <\/span><b>cross-entropy<\/b><span style=\"font-weight: 400;\"> comes in.<\/span><\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><span style=\"font-weight: 400;\">While entropy measures the average uncertainty in a <\/span><i><span style=\"font-weight: 400;\">single<\/span><\/i><span style=\"font-weight: 400;\"> probability distribution, cross-entropy quantifies the difference between <\/span><i><span style=\"font-weight: 400;\">two<\/span><\/i><span style=\"font-weight: 400;\"> probability distributions. In the case of language modeling, these would be the true distribution, <\/span><i><span style=\"font-weight: 400;\">P,<\/span><\/i><span style=\"font-weight: 400;\"> and the model\u2019s predicted distribution, <\/span><i><span style=\"font-weight: 400;\">Q. <\/span><\/i><span style=\"font-weight: 400;\">In this way cross-entropy provides a way to assess how well the model\u2019s predictions approximate the actual distribution of the data.<\/span><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">&nbsp;<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter wp-image-11983\"><img loading=\"lazy\" decoding=\"async\" width=\"1564\" height=\"180\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/11\/Screenshot-2024-10-21-at-11.45.19\u202fAM.png\" alt=\"Mathematical equation for cross-entropy\" class=\"wp-image-11983\" srcset=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/11\/Screenshot-2024-10-21-at-11.45.19\u202fAM.png 1564w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/11\/Screenshot-2024-10-21-at-11.45.19\u202fAM-300x35.png 300w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/11\/Screenshot-2024-10-21-at-11.45.19\u202fAM-1024x118.png 1024w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/11\/Screenshot-2024-10-21-at-11.45.19\u202fAM-768x88.png 768w, 
https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/11\/Screenshot-2024-10-21-at-11.45.19\u202fAM-1536x177.png 1536w\" sizes=\"auto, (max-width: 1564px) 100vw, 1564px\" \/><figcaption class=\"wp-element-caption\">Mathematical equation for cross-entropy<\/figcaption><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\"><span style=\"font-weight: 400;\">Here, <\/span><strong><i>p<\/i>(<i>x<\/i><sub><i>i<\/i><\/sub><\/strong><span style=\"font-weight: 400;\"><strong>)<\/strong> represents the true probability of outcome <\/span><strong><i>x<\/i><sub><i>i<\/i><\/sub><\/strong><span style=\"font-weight: 400;\"> and <\/span><strong><i>q<\/i>(<i>x<\/i><sub><i>i<\/i><\/sub><\/strong><span style=\"font-weight: 400;\"><strong>)<\/strong> denotes the predicted probability. Lower cross-entropy values indicate a better fit, as they imply that the model&#8217;s probabilities are more closely aligned with the actual distribution.<\/span><\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><span style=\"font-weight: 400;\">In the context of model training, cross-entropy is used as the backbone of many loss functions. It measures the difference between how likely the model is to select a given token, versus the true likelihood that a given token is correct.<\/span><\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><span style=\"font-weight: 400;\">Once you have a grasp of entropy and cross-entropy, perplexity follows intuitively.<\/span><\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-so-what-is-perplexity-really\">So, what is perplexity, really?<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\"><span style=\"font-weight: 400;\">Like entropy and cross-entropy, perplexity also quantifies a model\u2019s uncertainty in predicting the next token in a sequence. So, why not just use entropy or cross-entropy? 
It turns out perplexity is far more intuitive in explaining model behavior.<\/span><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">&nbsp;<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter wp-image-11988\"><img loading=\"lazy\" decoding=\"async\" width=\"2046\" height=\"258\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/11\/Screenshot-2024-10-21-at-3.21.24\u202fPM.png\" alt=\"Mathematical equation for perplexity\" class=\"wp-image-11988\" srcset=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/11\/Screenshot-2024-10-21-at-3.21.24\u202fPM.png 2046w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/11\/Screenshot-2024-10-21-at-3.21.24\u202fPM-300x38.png 300w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/11\/Screenshot-2024-10-21-at-3.21.24\u202fPM-1024x129.png 1024w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/11\/Screenshot-2024-10-21-at-3.21.24\u202fPM-768x97.png 768w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/11\/Screenshot-2024-10-21-at-3.21.24\u202fPM-1536x194.png 1536w\" sizes=\"auto, (max-width: 2046px) 100vw, 2046px\" \/><figcaption class=\"wp-element-caption\">Mathematical equation for perplexity<\/figcaption><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">&nbsp;<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><span style=\"font-weight: 400;\">Mathematically speaking, perplexity is defined as the exponentiated average negative log-likelihood of the predicted words in a sequence. Or, less verbosely, perplexity is cross-entropy with the exponential function applied. This transformation might seem somewhat arbitrary at first, but it actually makes a big difference, especially in terms of interpretability.<\/span><\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><span style=\"font-weight: 400;\">Because cross-entropy is a negative log measure, when we take its exponential, we &#8220;undo&#8221; the log, converting this measure from log space back onto a linear scale. 
This value represents the tangible count of likely choices the model considers at each step, or the \u201ceffective <\/span><a href=\"https:\/\/en.wikipedia.org\/wiki\/Branching_factor\"><span style=\"font-weight: 400;\">branching factor<\/span><\/a><span style=\"font-weight: 400;\">.\u201d&nbsp;<\/span><\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><b>Perplexity, then, is essentially a measure of how many options the model finds plausible on average, with lower values indicating fewer options (more confident predictions) and higher values indicating more options (greater uncertainty).<\/b><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">&nbsp;<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><tbody><tr><td><strong>Entropy<\/strong><\/td><td><span style=\"font-weight: 400;\">Measures the average uncertainty in a single probability distribution <em>P<\/em><\/span><\/td><\/tr><tr><td><strong>Cross-entropy<\/strong><\/td><td><span style=\"font-weight: 400;\">Measures how well a predicted distribution <em>Q<\/em> approximates the true distribution <em>P<\/em><\/span><\/td><\/tr><tr><td><strong>Perplexity<\/strong><\/td><td><span style=\"font-weight: 400;\">Exponentiation of cross-entropy; Measures how many likely candidate tokens the model is choosing between<\/span><\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">&nbsp;<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">In summary, <i>entropy<\/i> measures the inherent uncertainty in a true probability distribution, reflecting the average unpredictability of outcomes, such as words in a language. <i>Cross-entropy<\/i> extends this concept by measuring the difference between the true distribution of the data and the predicted distribution from a model, penalizing inaccurate predictions. 
<i>Perplexity<\/i> builds on cross-entropy by transforming it into a more interpretable form, using the exponential function to express how many equally likely word choices the model is effectively considering.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><span style=\"font-weight: 400;\">Note that the perplexity score of a language model on a sequence of tokens is the geometric mean of the perplexity scores for each predicted token (equivalently, the exponential of the average per-token negative log-likelihood). This means that if a language model has a perplexity of 10, the model is, on average, as uncertain as if it were selecting between 10 equally likely options for the next word.<\/span><\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full wp-image-11993\"><img loading=\"lazy\" decoding=\"async\" width=\"1920\" height=\"1080\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/11\/Copy-of-The.gif\" alt=\"An animated GIF depicting how the perplexity score of a language model on a sequence of tokens is the average of the perplexity scores for each predicted token.\" class=\"wp-image-11993\"\/><figcaption class=\"wp-element-caption\">The perplexity score of a language model on a sequence of tokens is the geometric mean of the perplexity scores for each predicted token.<\/figcaption><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\"><span style=\"font-weight: 400;\">Using this intuition, a lower perplexity score is better because it indicates that a model is effectively \u201cchoosing\u201d between fewer viable options for the next word and is \u201cless surprised.\u201d A higher perplexity score, on the other hand, indicates more \u201cuncertainty.\u201d<\/span><\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><span style=\"font-weight: 400;\">Of course, it is entirely possible for a language model to be \u201cconfident\u201d <\/span><i><span style=\"font-weight: 400;\">and<\/span><\/i><span style=\"font-weight: 400;\"> \u201cincorrect,\u201d so perplexity should not be confused with an accuracy metric. 
But we\u2019ll dive into more of perplexity\u2019s limitations later on. First, let\u2019s explore some of its advantages.<\/span><\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-advantages-of-perplexity\">Advantages of Perplexity<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\"><span style=\"font-weight: 400;\">As mentioned, one of the biggest advantages of perplexity is that it is highly intuitive and explainable in a field that is notoriously opaque. This is a notable advantage over learned metrics like <a href=\"https:\/\/www.comet.com\/site\/blog\/bertscore-for-llm-evaluation\/\">BERTScore<\/a> and <a href=\"https:\/\/www.comet.com\/site\/blog\/llm-as-a-judge\/\">LLM-as-a-Judge<\/a> metrics like <a href=\"https:\/\/www.comet.com\/site\/blog\/g-eval-for-llm-evaluation\/\">G-Eval<\/a>.<\/span><\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><span style=\"font-weight: 400;\">Having an estimate of a model\u2019s certainty is also especially useful when using an LLM to plan or guide actions. While high certainty suggests the model has strong backing for a given prediction, low certainty can prompt further human oversight or additional checks before execution.&nbsp;<\/span><\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><span style=\"font-weight: 400;\">Perplexity is also computationally straightforward to calculate, making it fast and efficient. This also allows practitioners to evaluate model performance in real-time during training, helping to identify issues and improvements promptly and leading to faster development cycles.<\/span><\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><span style=\"font-weight: 400;\">As we\u2019ll see in the next section on perplexity\u2019s limitations, it is not an end-all evaluation metric for LLMs. 
However, given its explainability and low overhead, perplexity is a quick and useful first-pass metric that works well when used in conjunction with other <a href=\"https:\/\/www.comet.com\/site\/blog\/llm-evaluation-metrics-every-developer-should-know\/\">LLM evaluation metrics<\/a>.<\/span><\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-limitations-of-perplexity\">Limitations of Perplexity<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\"><span style=\"font-weight: 400;\">The most important limitation of perplexity is that it does not convey a model\u2019s \u201cunderstanding.\u201d Perplexity is strictly a measure of uncertainty, and a model being uncertain doesn\u2019t mean it is right or wrong. A model may be correct but unconfident or wrong but confident. So, a perplexity score isn\u2019t a measure of accuracy, just of confidence.<\/span><\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><span style=\"font-weight: 400;\">It is also difficult to use perplexity as a benchmark between models. Perplexity scores are influenced by various model-specific factors, such as tokenization method, dataset, pre-processing steps, vocabulary size, and context length. For example, a character-level model may have a lower perplexity than a word-level model, but that doesn\u2019t necessarily mean the character-level model is better.<\/span><\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><span style=\"font-weight: 400;\">A model can also achieve a low perplexity score by assigning high probabilities to common words, like articles and conjunctions, leading to a misleadingly low score. Overfit models can show low perplexity on text that resembles their training data but lack true understanding. <\/span><a href=\"https:\/\/arxiv.org\/abs\/2405.06105\"><span style=\"font-weight: 400;\">Research also suggests that perplexity doesn&#8217;t correlate well with an LLM&#8217;s long-term understanding<\/span><\/a><span style=\"font-weight: 400;\">, likely because it fails to capture long-term dependencies. 
Additionally, <\/span><a href=\"https:\/\/arxiv.org\/pdf\/2210.05892\"><span style=\"font-weight: 400;\">perplexity can be skewed by punctuation and repeated text spans<\/span><\/a><span style=\"font-weight: 400;\">, which lower scores but don\u2019t necessarily improve text quality.<\/span><\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><span style=\"font-weight: 400;\">While perplexity has limitations, it remains a valuable first-pass metric when combined with other task-specific <a href=\"https:\/\/www.comet.com\/site\/blog\/llm-evaluation-guide\/\">LLM evaluation<\/a> metrics, offering both interpretability and efficiency.<\/span><\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-implementing-perplexity-from-scratch-in-python\">Implementing Perplexity From Scratch in Python<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\"><span style=\"font-weight: 400;\">Using what we\u2019ve learned so far about perplexity, let\u2019s implement it from scratch in Python so we can apply it directly to our LLM outputs. Note that because perplexity is such a common evaluation metric, there are several pre-built modules to implement it in Python, including <\/span><a href=\"https:\/\/huggingface.co\/spaces\/evaluate-metric\/perplexity\"><span style=\"font-weight: 400;\">Hugging Face\u2019s evaluate.metrics.perplexity<\/span><\/a><span style=\"font-weight: 400;\"> and <\/span><a href=\"https:\/\/arxiv.org\/html\/2404.06634v1\"><span style=\"font-weight: 400;\">perplexed from Stability AI<\/span><\/a><span style=\"font-weight: 400;\">. But coding the metric from scratch will help build intuition for what perplexity is actually doing under the hood. 
Later, we\u2019ll test our function out on GPT-2 and learn how to automatically track the perplexity scores of our LLM using a custom metric in Opik.<\/span><\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><span style=\"font-weight: 400;\">Throughout our implementation, we\u2019ll be using PyTorch and HuggingFace\u2019s Transformers library.<\/span><\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><span style=\"font-weight: 400;\">Our basic perplexity function will take <\/span><b>logits<\/b><span style=\"font-weight: 400;\"> and <\/span><b>target<\/b><span style=\"font-weight: 400;\"> labels as inputs.&nbsp;<\/span><\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><b>Logits<\/b><span style=\"font-weight: 400;\"> are the raw scores output by the model for each token in the <\/span><b>vocabulary<\/b><span style=\"font-weight: 400;\"> for a given position in the <\/span><b>input sequence<\/b><span style=\"font-weight: 400;\">. For each position in the sequence, the model outputs a vector of logits, where each entry in that vector corresponds to a token in the vocabulary.<\/span><\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter wp-image-11996 size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1616\" height=\"1178\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/11\/Screenshot-2024-11-05-at-6.36.04\u202fPM.png\" alt=\"Shape of logits vector in LLM output\" class=\"wp-image-11996\" srcset=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/11\/Screenshot-2024-11-05-at-6.36.04\u202fPM.png 1616w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/11\/Screenshot-2024-11-05-at-6.36.04\u202fPM-300x219.png 300w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/11\/Screenshot-2024-11-05-at-6.36.04\u202fPM-1024x746.png 1024w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/11\/Screenshot-2024-11-05-at-6.36.04\u202fPM-768x560.png 768w, 
https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/11\/Screenshot-2024-11-05-at-6.36.04\u202fPM-1536x1120.png 1536w\" sizes=\"auto, (max-width: 1616px) 100vw, 1616px\" \/><figcaption class=\"wp-element-caption\">Shape of logits vector in LLM output<\/figcaption><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\"><span style=\"font-weight: 400;\">The <\/span><b>targets<\/b><span style=\"font-weight: 400;\"> will be the ground truth label tensors.<\/span><\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><span style=\"font-weight: 400;\">To calculate perplexity, we\u2019ll need to:<\/span><\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li><span style=\"font-weight: 400;\"> Convert the logits to log probabilities using the log_softmax function, which normalizes the scores and takes their logarithm.<\/span><\/li>\n\n\n\n<li><span style=\"font-weight: 400;\"> Gather the log probabilities of the correct target tokens.<\/span><\/li>\n\n\n\n<li><span style=\"font-weight: 400;\"> Because log probabilities are never positive, we\u2019ll multiply them by -1 to get non-negative values. Each value represents the <\/span><b>negative log-likelihood<\/b><span style=\"font-weight: 400;\"> of a target token.<\/span><\/li>\n\n\n\n<li><span style=\"font-weight: 400;\"> Take the mean negative log-likelihood of all tokens in the sequence. This value represents the <\/span><b>cross-entropy<\/b><span style=\"font-weight: 400;\">.<\/span><\/li>\n\n\n\n<li><span style=\"font-weight: 400;\">Take the exponential of the <\/span><b>cross-entropy<\/b><span style=\"font-weight: 400;\">. 
This value represents the <\/span><b>perplexity<\/b><span style=\"font-weight: 400;\">, or effective branching factor of each token in the sequence.<\/span><\/li>\n<\/ol>\n\n\n\n<pre class=\"wp-block-code\"><code>\nimport torch\n\ndef calculate_perplexity(logits, target):\n    \"\"\"\n    Calculate perplexity from logits and target labels.\n\n    Args:\n    - logits (torch.Tensor): Logits output from the model (batch_size, seq_length, vocab_size).\n    - target (torch.Tensor): Ground truth labels (batch_size, seq_length).\n\n    Returns:\n    - perplexity (float): The perplexity score.\n    \"\"\"\n\n    # Convert logits to log probabilities\n    log_probs = torch.nn.functional.log_softmax(logits, dim=-1)\n\n    # Gather the log probabilities for the correct target tokens\n    # log_probs has shape (batch_size, seq_length, vocab_size)\n    # target has shape (batch_size, seq_length)\n    # The gather method will pick the log probabilities of the true target tokens\n    target_log_probs = log_probs.gather(dim=-1, index=target.unsqueeze(-1)).squeeze(-1)\n\n    # Calculate the negative log likelihood\n    negative_log_likelihood = -target_log_probs\n\n    # Calculate the mean negative log likelihood over all tokens\n    mean_nll = negative_log_likelihood.mean()\n\n    # Calculate perplexity as exp(mean negative log likelihood)\n    perplexity = torch.exp(mean_nll)\n\n    return perplexity.item()\n\n# Example usage\n# Simulate a batch of logits (batch_size=2, seq_length=4, vocab_size=10)\nlogits = torch.randn(2, 4, 10)\n# Simulate ground truth target tokens\ntarget = torch.tensor(&#91;&#91;1, 2, 3, 4], &#91;4, 3, 2, 1]])\n\n# Calculate perplexity\nperplexity = calculate_perplexity(logits, target)\nprint(f'Perplexity: {perplexity}')\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\"><span style=\"font-weight: 400;\">The function above calculates perplexity from a mathematical perspective, but it requires some adjustments to handle raw text, as you would encounter 
in real-world scenarios.&nbsp;<\/span><\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><span style=\"font-weight: 400;\">Now that we&#8217;ve covered the math behind perplexity, let&#8217;s modify the function to work with the inputs and outputs of a large language model. For this version of our function we\u2019ll want to:<\/span><\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li><b>Shift the logits and target tensors<\/b><span style=\"font-weight: 400;\"> so that each model prediction (logit) matches the actual token in the sequence (target\/input_ids). Since each token is predicted based on the previous tokens, the prediction made at position \ud835\udc61 should be compared to the actual token at \ud835\udc61 + 1.<\/span><\/li>\n\n\n\n<li><b>Add batching<\/b><span style=\"font-weight: 400;\"> to process several texts in a single forward pass. Texts longer than the model\u2019s context length are truncated.<\/span><\/li>\n\n\n\n<li><b>Use padding tokens<\/b><span style=\"font-weight: 400;\"> to standardize input lengths across sentences of varying lengths.<\/span><\/li>\n\n\n\n<li><b>Apply an attention mask<\/b><span style=\"font-weight: 400;\"> to exclude padding tokens from perplexity calculations.<\/span><\/li>\n\n\n\n<li><span style=\"font-weight: 400;\">Average the negative log-likelihoods across the tokens of each sequence and exponentiate the result for a <\/span><b>sequence-level score<\/b><span style=\"font-weight: 400;\">.<\/span><\/li>\n\n\n\n<li><span style=\"font-weight: 400;\">Average the sequence-level scores for an overall <\/span><b>batch-level perplexity score<\/b><span style=\"font-weight: 400;\">.<\/span><\/li>\n<\/ol>\n\n\n\n<figure class=\"wp-block-image aligncenter wp-image-11992 size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1920\" height=\"1080\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/11\/Copy-of-The-1.gif\" alt=\"An animated gif showing how to shift the 
logits and input id vectors to calculate perplexity\" class=\"wp-image-11992\"\/><figcaption class=\"wp-element-caption\">Since each token is predicted based on the previous tokens, the prediction for token \ud835\udc61 should be compared to the actual token at \ud835\udc61 + 1.<\/figcaption><\/figure>\n\n\n\n<pre class=\"wp-block-code\"><code>\nimport torch\nfrom transformers import AutoTokenizer, AutoModelForCausalLM\n\n# Load the model and tokenizer (e.g., GPT-2)\nmodel_name = \"gpt2\"\ntokenizer = AutoTokenizer.from_pretrained(model_name)\nmodel = AutoModelForCausalLM.from_pretrained(model_name)\n\n# Assign the EOS token as the padding token\ntokenizer.pad_token = tokenizer.eos_token\n\ndef calculate_batch_perplexity(input_texts):\n    \"\"\"\n    Calculate perplexity for a batch of input texts using a pretrained language model.\n\n    Args:\n    - input_texts (List&#91;str]): A list of input texts to evaluate.\n\n    Returns:\n    - dict: \"perplexities\" (List&#91;float], one score for each input text) and\n      \"mean_perplexity\" (the arithmetic mean of those scores).\n    \"\"\"\n    # Tokenize the batch of texts with padding for uniform length\n    inputs = tokenizer(\n        input_texts, return_tensors=\"pt\", padding=True, truncation=True\n    )\n\n    input_ids = inputs&#91;\"input_ids\"]\n    attention_mask = inputs&#91;\"attention_mask\"]\n\n    # Pass the input batch through the model to get logits\n    with torch.no_grad():\n        outputs = model(input_ids, attention_mask=attention_mask)\n        logits = outputs.logits\n\n    # Shift the logits and input_ids to align targets correctly\n    # Logits dimensions are: (batch_size, seq_length, vocab_size)\n    shift_logits = logits&#91;:, :-1, :]  # Ignore the last token's logits\n    shift_labels = input_ids&#91;:, 1:]   # Skip the first token in the labels\n\n    # Compute log probabilities\n    log_probs = torch.nn.functional.log_softmax(shift_logits, dim=-1)\n\n    # Gather the log probabilities for the correct tokens\n    target_log_probs = 
log_probs.gather(dim=-1, index=shift_labels.unsqueeze(-1)).squeeze(-1)\n\n    # Mask out positions corresponding to padding tokens\n    target_log_probs = target_log_probs * attention_mask&#91;:, 1:].to(log_probs.dtype)\n\n    # Compute the mean negative log-likelihood for each sequence\n    negative_log_likelihood = -target_log_probs.sum(dim=-1) \/ attention_mask&#91;:, 1:].sum(dim=-1)\n\n    # Compute perplexity for each sequence\n    perplexities = torch.exp(negative_log_likelihood)\n\n    # Take the mean of the per-sequence perplexities in the batch\n    # (computed on the tensor, before converting it to a Python list)\n    mean_perplexity_score = perplexities.mean().item()\n\n    return {\"perplexities\": perplexities.tolist(), \"mean_perplexity\": mean_perplexity_score}\n\n# Example usage\ntexts = &#91;\n    \"The quick brown fox jumps over the lazy dog.\",\n    \"A journey of a thousand miles begins with a single step.\"\n]\nprint(f\"Perplexity scores: {calculate_batch_perplexity(texts)}\")\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\"><span style=\"font-weight: 400;\">This function takes a list of texts as input and outputs a dictionary containing the perplexity score of each text, as well as the average perplexity score across the input list.&nbsp;<\/span><\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><span style=\"font-weight: 400;\">Note that taking the average of perplexity scores across texts of different lengths can lead to a skewed overall perplexity score for a couple of reasons.&nbsp;<\/span><\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><span style=\"font-weight: 400;\">First, perplexity scores tend to be more stable for longer sequences, while shorter sequences may have higher variance, leading to outliers. Second, taking a simple arithmetic mean across scores for texts of varying lengths can give disproportionate weight to tokens in shorter sequences. 
Nevertheless, using the arithmetic mean is currently the most common approach to calculating overall perplexity, so we use it here for the sake of consistency.&nbsp;<\/span><\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-implementing-perplexity-in-opik\">Implementing Perplexity in Opik<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\"><span style=\"font-weight: 400;\">In the real world, you\u2019ll likely want to use an evaluation framework to implement LLM metrics. In this section, we\u2019ll implement perplexity in <\/span><a href=\"https:\/\/github.com\/comet-ml\/opik\"><span style=\"font-weight: 400;\">Opik<\/span><\/a><span style=\"font-weight: 400;\">, Comet\u2019s open-source LLM evaluation framework.<\/span><\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><span style=\"font-weight: 400;\">Here, we take our perplexity function and modify it slightly to implement it as a custom Opik metric with a `score` method that returns a `ScoreResult` object:<\/span><\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>\nimport torch\nfrom opik.evaluation.metrics import base_metric, score_result\n\nclass Perplexity(base_metric.BaseMetric):\n    \"\"\"\n    Perplexity (PPL) is a common LLM evaluation metric defined as the exponentiated average\n    negative log-likelihood of a sequence.\n\n    For more information on perplexity, see:\n    https:&#47;&#47;en.wikipedia.org\/wiki\/Perplexity\n\n    Args:\n        name: The name of the metric, perplexity.\n    \"\"\"\n\n    def __init__(\n        self,\n        name: str = \"Perplexity\",\n    ):\n        super().__init__(name=name)\n\n    def score(\n        self, input_ids: torch.Tensor, logits: torch.Tensor, attention_mask: torch.Tensor\n    ) -&gt; score_result.ScoreResult:\n        \"\"\"\n        Calculate the mean perplexity of the input sequences, scoring each token given the previous tokens in the sequence.\n\n        Args:\n            input_ids: input ids of the text sequence input to the model (torch.Tensor)\n            logits: output logits of the model 
(torch.Tensor)\n            attention_mask: attention mask\n\n        Returns:\n            score_result.ScoreResult: A ScoreResult object\n        \"\"\"\n\n        # Shift the logits and input_ids to align targets correctly\n        shift_logits = logits&#91;:, :-1, :]  # Ignore the last token's logits\n        shift_labels = input_ids&#91;:, 1:]   # Skip the first token in the labels\n\n        # Compute log probabilities\n        log_probs = torch.nn.functional.log_softmax(shift_logits, dim=-1)\n\n        # Gather the log probabilities for the correct tokens\n        target_log_probs = log_probs.gather(dim=-1, index=shift_labels.unsqueeze(-1)).squeeze(-1)\n\n        # Mask out positions corresponding to padding tokens\n        target_log_probs = target_log_probs * attention_mask&#91;:, 1:].to(log_probs.dtype)\n\n        # Compute the mean negative log-likelihood for each sequence\n        negative_log_likelihood = -target_log_probs.sum(dim=-1) \/ attention_mask&#91;:, 1:].sum(dim=-1)\n\n        # Take the exp(negative_log_likelihood)\n        perplexities = torch.exp(negative_log_likelihood)\n\n        # Take the mean of perplexity scores and convert it to a plain Python float\n        mean_perplexity_score = torch.mean(perplexities).item()\n\n        return score_result.ScoreResult(value=mean_perplexity_score, name=self.name)\n\nperplexity = Perplexity()\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\"><span style=\"font-weight: 400;\">After defining perplexity as a custom metric, we can use it by:<\/span><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><b>Defining the model\u2019s forward pass<\/b><span style=\"font-weight: 400;\"> in your_llm_application.&nbsp;<\/span><\/li>\n\n\n\n<li><b>Calling our application<\/b><span style=\"font-weight: 400;\"> in evaluation_task and returning a dictionary with keys that match the parameters expected by our custom Perplexity metric above.<\/span><\/li>\n\n\n\n<li><b>Adding tracking <\/b><span style=\"font-weight: 400;\">by decorating our functions with 
Opik\u2019s @track decorator to automatically log relevant data to the platform.<\/span><\/li>\n\n\n\n<li><span style=\"font-weight: 400;\">Passing the evaluation_task function to <\/span><b>Opik\u2019s evaluate function<\/b><span style=\"font-weight: 400;\">, which runs and logs the full evaluation process, including calculating perplexity scores for each call.<\/span><\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\"><span style=\"font-weight: 400;\">You can find the <\/span><a href=\"https:\/\/colab.research.google.com\/drive\/1EtrGnKij2OdXA23ty-4tSMuLDSSfGeen\"><span style=\"font-weight: 400;\">full code in the Colab<\/span><\/a><span style=\"font-weight: 400;\">.<\/span><\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>\nfrom opik import track\nfrom opik.evaluation import evaluate\n\n@track\ndef your_llm_application(input: str) -&gt; dict:\n\n    # Tokenize the input text (padding and truncation as in the batch function)\n    inputs = tokenizer(\n        input, return_tensors=\"pt\", padding=True, truncation=True\n    )\n\n    input_ids = inputs&#91;\"input_ids\"]\n    attention_mask = inputs&#91;\"attention_mask\"]\n\n    # Pass the input batch through the model to get logits\n    with torch.no_grad():\n        outputs = model(input_ids, attention_mask=attention_mask)\n\n    return {\"input_ids\": input_ids,\n            \"logits\": outputs.logits,\n            \"attention_mask\": attention_mask}\n\n@track\ndef evaluation_task(x):\n    llm_outputs = your_llm_application(x&#91;'input'])\n    return {\n        \"input_ids\": llm_outputs&#91;'input_ids'],\n        \"logits\": llm_outputs&#91;'logits'],\n        \"attention_mask\": llm_outputs&#91;'attention_mask']\n    }\n\n# `dataset` is assumed to be the Opik dataset created earlier (see the Colab)\nevaluation = evaluate(\n    experiment_name=\"My ppl experiment\",\n    dataset=dataset,\n    task=evaluation_task,\n    scoring_metrics=&#91;perplexity],\n    experiment_config={\n        \"model\": model_name\n    }\n)\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\"><span style=\"font-weight: 
400;\">And here is what the output of your evaluation should look like from within the Opik UI:<\/span><\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter wp-image-11999 size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1898\" height=\"891\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/11\/Screenshot-2024-10-22-at-11.11.01\u202fAM.png\" alt=\"A screenshot of a dashboard in Comet's Opik displaying the calculation of perplexity across various dataset items, as well as the dataset as a whole.\" class=\"wp-image-11999\" srcset=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/11\/Screenshot-2024-10-22-at-11.11.01\u202fAM.png 1898w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/11\/Screenshot-2024-10-22-at-11.11.01\u202fAM-300x141.png 300w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/11\/Screenshot-2024-10-22-at-11.11.01\u202fAM-1024x481.png 1024w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/11\/Screenshot-2024-10-22-at-11.11.01\u202fAM-768x361.png 768w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/11\/Screenshot-2024-10-22-at-11.11.01\u202fAM-1536x721.png 1536w\" sizes=\"auto, (max-width: 1898px) 100vw, 1898px\" \/><figcaption class=\"wp-element-caption\">Our perplexity metric calculations on our dataset, as stored in Opik<\/figcaption><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-adding-perplexity-to-your-llm-evaluation-suite\">Adding Perplexity to Your LLM Evaluation Suite<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\"><span style=\"font-weight: 400;\">Perplexity is extremely popular for its intuitiveness and efficiency, but it only provides a partial picture of a language model\u2019s performance. 
It captures a model\u2019s certainty about its predictions but, notably, it does not convey a model\u2019s \u201cunderstanding.\u201d&nbsp;<\/span><\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><span style=\"font-weight: 400;\">For a more complete understanding of a model\u2019s behavior, perplexity should be used alongside other evaluation metrics, such as accuracy and fluency, as well as task-specific metrics like relevance, coherence, factuality, and <a href=\"https:\/\/www.comet.com\/site\/blog\/llm-hallucination\/\">hallucination detection<\/a>. Because of its computational efficiency, perplexity is particularly useful as a first-pass metric, but has significant limitations that require additional evaluation methods to address.&nbsp;<\/span><\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><span style=\"font-weight: 400;\">More nuanced evaluation methods include using an LLM-as-a-judge, but these methods are also often less interpretable. Especially when relying on the same language model being evaluated, they can lead to potential biases, circular reasoning, and high variability in results. 
These limitations make it essential to pair LLM-as-a-judge metrics with other evaluation methods, like <\/span><a href=\"https:\/\/arxiv.org\/html\/2408.08781v1\"><span style=\"font-weight: 400;\">perplexity, which has been shown to outperform LLM-as-a-judge prompting with basic instructions at estimating text quality.<\/span><\/a><\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter wp-image-12000 size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"945\" height=\"451\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/11\/Screenshot-2024-11-15-at-4.28.04\u202fPM.png\" alt=\"Table showing that perplexity outperforms LLM-as-a-judge prompting with basic instructions at estimating text quality.\" class=\"wp-image-12000\" srcset=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/11\/Screenshot-2024-11-15-at-4.28.04\u202fPM.png 945w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/11\/Screenshot-2024-11-15-at-4.28.04\u202fPM-300x143.png 300w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/11\/Screenshot-2024-11-15-at-4.28.04\u202fPM-768x367.png 768w\" sizes=\"auto, (max-width: 945px) 100vw, 945px\" \/><figcaption class=\"wp-element-caption\">Image from Murugadoss, B., Poelitz, C., Drosos, I., Le, V., McKenna, N., Negreanu, C.S., Parnin, C., &amp; Sarkar, A. (2024). Evaluating the Evaluator: Measuring LLMs\u2019 Adherence to Task Evaluation Instructions. Retrieved from https:\/\/arxiv.org\/html\/2408.08781v1<\/figcaption><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\"><span style=\"font-weight: 400;\">However, using perplexity as part of a \u201csuite\u201d of metrics offers more than the \u201cextra cover\u201d of additional metrics. 
Seeing where these metrics diverge can help you identify problematic data and points of failure in your evaluation suite.&nbsp;<\/span><\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><span style=\"font-weight: 400;\">For example, high perplexity combined with high accuracy may indicate that while the model is correct in specific answers, it is uncertain overall and needs additional training. Likewise, a model with low perplexity and low coherence may produce text it is confident in, but that doesn\u2019t flow logically, which may not be acceptable for your application and which could point to issues with sentence structure in the training data. Conversely, high perplexity paired with high coherence suggests the model is uncertain about its predictions even when producing coherent text. As a final example, if both hallucination detection scores and perplexity scores are high, the model is both uncertain and likely producing fabricated content, suggesting potential weaknesses in grounding or fact-based reasoning within the training pipeline. Monitoring these divergences helps identify specific areas for model and data improvement to better align with your model\u2019s intended performance.<\/span><\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><span style=\"font-weight: 400;\">In summary, perplexity is a valuable metric for evaluating language models because it measures their confidence in predicting text sequences. While it offers useful insights, perplexity should be used alongside other metrics to get a fuller picture of model performance. 
This approach helps highlight specific strengths and weaknesses, allowing for more targeted improvements and reliable assessments of model quality.<\/span><\/p>\n\n\n\n<div class=\"wp-block-buttons is-layout-flex wp-block-buttons-is-layout-flex\">\n<div class=\"wp-block-button\"><a class=\"wp-block-button__link wp-element-button\" href=\"https:\/\/colab.research.google.com\/drive\/1EtrGnKij2OdXA23ty-4tSMuLDSSfGeen\" target=\"_blank\" rel=\"noreferrer noopener\">Follow along with the Colab!<\/a><\/div>\n<\/div>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>If you found this article useful, follow me on <a href=\"https:\/\/www.linkedin.com\/in\/anmorgan24\/\">LinkedIn<\/a> and <a href=\"https:\/\/x.com\/anmorgan2414\">Twitter<\/a> for more content!<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Perplexity is, historically speaking, one of the &#8220;standard&#8221; evaluation metrics for language models. And while recent years have seen a surge in more complex and robust metrics, including LLM-based evaluations, perplexity still has a lot of value as a component in your evaluation suite. 
If you want to build effective evaluation pipelines\u2014or just understand what [&hellip;]<\/p>\n","protected":false},"author":22,"featured_media":18416,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"customer_name":"","customer_description":"","customer_industry":"","customer_technologies":"","customer_logo":"","_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[8,65,7],"tags":[40,93,71,52,31,94],"coauthors":[133],"class_list":["post-11966","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-comet-community-hub","category-llmops","category-tutorials","tag-comet","tag-evaluation-metrics","tag-language-models","tag-llm","tag-llmops","tag-opik"],"yoast_head":"<!-- This site is optimized with the Yoast SEO Premium plugin v25.9 (Yoast SEO v25.9) - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Perplexity for LLM Evaluation<\/title>\n<meta name=\"description\" content=\"Perplexity measures a language model&#039;s certainty in predicting text. Lower scores mean higher confidence but don&#039;t reflect true understanding\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.comet.com\/site\/blog\/perplexity-for-llm-evaluation\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Perplexity for LLM Evaluation\" \/>\n<meta property=\"og:description\" content=\"Perplexity measures a language model&#039;s certainty in predicting text. 
Lower scores mean higher confidence but don&#039;t reflect true understanding\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.comet.com\/site\/blog\/perplexity-for-llm-evaluation\/\" \/>\n<meta property=\"og:site_name\" content=\"Comet\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/cometdotml\" \/>\n<meta property=\"article:published_time\" content=\"2024-11-21T22:33:47+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2025-11-17T21:20:19+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/11\/Perplexity.jpg\" \/>\n\t<meta property=\"og:image:width\" content=\"1920\" \/>\n\t<meta property=\"og:image:height\" content=\"1080\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/jpeg\" \/>\n<meta name=\"author\" content=\"Abby Morgan\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@anmorgan2414\" \/>\n<meta name=\"twitter:site\" content=\"@Cometml\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Abby Morgan\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"16 minutes\" \/>\n<!-- \/ Yoast SEO Premium plugin. -->","yoast_head_json":{"title":"Perplexity for LLM Evaluation","description":"Perplexity measures a language model's certainty in predicting text. Lower scores mean higher confidence but don't reflect true understanding","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.comet.com\/site\/blog\/perplexity-for-llm-evaluation\/","og_locale":"en_US","og_type":"article","og_title":"Perplexity for LLM Evaluation","og_description":"Perplexity measures a language model's certainty in predicting text. 
Lower scores mean higher confidence but don't reflect true understanding","og_url":"https:\/\/www.comet.com\/site\/blog\/perplexity-for-llm-evaluation\/","og_site_name":"Comet","article_publisher":"https:\/\/www.facebook.com\/cometdotml","article_published_time":"2024-11-21T22:33:47+00:00","article_modified_time":"2025-11-17T21:20:19+00:00","og_image":[{"width":1920,"height":1080,"url":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/11\/Perplexity.jpg","type":"image\/jpeg"}],"author":"Abby Morgan","twitter_card":"summary_large_image","twitter_creator":"@anmorgan2414","twitter_site":"@Cometml","twitter_misc":{"Written by":"Abby Morgan","Est. reading time":"16 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/www.comet.com\/site\/blog\/perplexity-for-llm-evaluation\/#article","isPartOf":{"@id":"https:\/\/www.comet.com\/site\/blog\/perplexity-for-llm-evaluation\/"},"author":{"name":"Abby Morgan","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/826ee39a2e30cf9d8d73155de09bb7b2"},"headline":"Perplexity for LLM Evaluation","datePublished":"2024-11-21T22:33:47+00:00","dateModified":"2025-11-17T21:20:19+00:00","mainEntityOfPage":{"@id":"https:\/\/www.comet.com\/site\/blog\/perplexity-for-llm-evaluation\/"},"wordCount":3242,"publisher":{"@id":"https:\/\/www.comet.com\/site\/#organization"},"image":{"@id":"https:\/\/www.comet.com\/site\/blog\/perplexity-for-llm-evaluation\/#primaryimage"},"thumbnailUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/11\/Perplexity.jpg","keywords":["Comet","Evaluation metrics","Language Models","LLM","LLMOps","Opik"],"articleSection":["Comet Community Hub","LLMOps","Tutorials"],"inLanguage":"en-US"},{"@type":"WebPage","@id":"https:\/\/www.comet.com\/site\/blog\/perplexity-for-llm-evaluation\/","url":"https:\/\/www.comet.com\/site\/blog\/perplexity-for-llm-evaluation\/","name":"Perplexity for LLM 
Evaluation","isPartOf":{"@id":"https:\/\/www.comet.com\/site\/#website"},"primaryImageOfPage":{"@id":"https:\/\/www.comet.com\/site\/blog\/perplexity-for-llm-evaluation\/#primaryimage"},"image":{"@id":"https:\/\/www.comet.com\/site\/blog\/perplexity-for-llm-evaluation\/#primaryimage"},"thumbnailUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/11\/Perplexity.jpg","datePublished":"2024-11-21T22:33:47+00:00","dateModified":"2025-11-17T21:20:19+00:00","description":"Perplexity measures a language model's certainty in predicting text. Lower scores mean higher confidence but don't reflect true understanding","breadcrumb":{"@id":"https:\/\/www.comet.com\/site\/blog\/perplexity-for-llm-evaluation\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.comet.com\/site\/blog\/perplexity-for-llm-evaluation\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/blog\/perplexity-for-llm-evaluation\/#primaryimage","url":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/11\/Perplexity.jpg","contentUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/11\/Perplexity.jpg","width":1920,"height":1080,"caption":"Perplexity for LLM Evaluation Blog title image"},{"@type":"BreadcrumbList","@id":"https:\/\/www.comet.com\/site\/blog\/perplexity-for-llm-evaluation\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.comet.com\/site\/"},{"@type":"ListItem","position":2,"name":"Perplexity for LLM Evaluation"}]},{"@type":"WebSite","@id":"https:\/\/www.comet.com\/site\/#website","url":"https:\/\/www.comet.com\/site\/","name":"Comet","description":"Build Better Models 
Faster","publisher":{"@id":"https:\/\/www.comet.com\/site\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.comet.com\/site\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/www.comet.com\/site\/#organization","name":"Comet ML, Inc.","alternateName":"Comet","url":"https:\/\/www.comet.com\/site\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/#\/schema\/logo\/image\/","url":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/01\/logo_comet_square.png","contentUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/01\/logo_comet_square.png","width":310,"height":310,"caption":"Comet ML, Inc."},"image":{"@id":"https:\/\/www.comet.com\/site\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/cometdotml","https:\/\/x.com\/Cometml","https:\/\/www.youtube.com\/channel\/UCmN63HKvfXSCS-UwVwmK8Hw"]},{"@type":"Person","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/826ee39a2e30cf9d8d73155de09bb7b2","name":"Abby Morgan","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/image\/dbbf1ae921ee179c768f508340415946","url":"https:\/\/secure.gravatar.com\/avatar\/28d4934d14261b4afe12e226f0eaa57c4fb0c2761ad4586eb9a5bec3b8160bc9?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/28d4934d14261b4afe12e226f0eaa57c4fb0c2761ad4586eb9a5bec3b8160bc9?s=96&d=mm&r=g","caption":"Abby Morgan"},"description":"AI\/ML Growth Engineer @ 
Comet","sameAs":["https:\/\/www.comet.com\/","https:\/\/www.linkedin.com\/in\/anmorgan24\/","https:\/\/x.com\/anmorgan2414"],"url":"https:\/\/www.comet.com\/site\/blog\/author\/abigailmcomet-com\/"}]}},"jetpack_featured_media_url":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/11\/Perplexity.jpg","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/11966","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/users\/22"}],"replies":[{"embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/comments?post=11966"}],"version-history":[{"count":3,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/11966\/revisions"}],"predecessor-version":[{"id":18488,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/11966\/revisions\/18488"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/media\/18416"}],"wp:attachment":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/media?parent=11966"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/categories?post=11966"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/tags?post=11966"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/coauthors?post=11966"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}