{"id":12303,"date":"2024-12-19T09:10:38","date_gmt":"2024-12-19T17:10:38","guid":{"rendered":"https:\/\/live-cometml.pantheonsite.io\/?p=12303"},"modified":"2025-11-13T19:58:08","modified_gmt":"2025-11-13T19:58:08","slug":"bertscore-for-llm-evaluation","status":"publish","type":"post","link":"https:\/\/www.comet.com\/site\/blog\/bertscore-for-llm-evaluation\/","title":{"rendered":"BERTScore For LLM Evaluation"},"content":{"rendered":"\n<div class=\"wp-block-buttons is-layout-flex wp-block-buttons-is-layout-flex\">\n<div class=\"wp-block-button\"><a class=\"wp-block-button__link wp-element-button\" href=\"https:\/\/colab.research.google.com\/drive\/1ti9yl-0ynl9eFPKJnIzkpUxbyXNtsyFA\" target=\"_blank\" rel=\"noreferrer noopener\">Follow along with the Colab!<\/a><\/div>\n<\/div>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-introduction\">Introduction<\/h2>\n\n\n\n<p><span style=\"font-weight: 400;\">BERTScore represents a pivotal shift in <a href=\"https:\/\/www.comet.com\/site\/blog\/llm-evaluation-guide\/\">LLM evaluation<\/a>, moving beyond traditional heuristic-based metrics like BLEU and ROUGE to a learned approach that captures complex linguistic nuances. Unlike older n-gram-based methods, BERTScore excels at evaluating paraphrasing, coherence, relevance, and polysemy\u2014essential features for modern AI applications.<\/span><\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"576\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/12\/BertScore-1024x576.jpg\" alt=\"BERTScore For LLM Evaluation Featured Image\" class=\"wp-image-18423\" srcset=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/12\/BertScore-1024x576.jpg 1024w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/12\/BertScore-300x169.jpg 300w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/12\/BertScore-768x432.jpg 768w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/12\/BertScore-1536x864.jpg 1536w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/12\/BertScore-2048x1152.jpg 2048w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p><span style=\"font-weight: 400;\">BERTScore leverages transformer-based contextual embeddings and compares them using cosine similarity to assess the quality of model outputs. Its popularity endures due to its relatively low computational cost and greater interpretability compared to black-box methods like <a href=\"https:\/\/www.comet.com\/site\/blog\/llm-as-a-judge\/\">LLM-as-a-judge<\/a> metrics.<\/span><\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">In this article, I\u2019ll explore how BERTScore improves upon traditional evaluation methods, explain its key components, and discuss its role in the broader hierarchy of language model evaluation metrics. 
<h2 class="wp-block-heading" id="h-the-basics-of-bertscore">The Basics of BERTScore</h2>

<p>On the surface, BERTScore is pretty easy to explain: it measures the similarity between tokens in two text sequences by representing them as BERT embeddings and calculating their cosine similarity. For example, given the target sentence, “The red shoes cost $20.00,” BERTScore would rate the candidate sentence “The rouge slippers cost $20” as more similar than “The blue socks cost $20,” even though they have roughly the same number of incorrect tokens.</p>

<figure class="wp-block-image"><img src="https://www.comet.com/site/wp-content/uploads/2024/12/Screenshot-2024-12-18-at-2.07.11%E2%80%AFPM.png" alt="Illustration of the computation of the BERTRecall metric from BERTScore" /><figcaption class="wp-element-caption">Source: <a href="https://arxiv.org/pdf/1904.09675">BERTScore: Evaluating Text Generation with BERT</a></figcaption></figure>
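<p>Before digging into the mechanics, here is a quick sanity check of that claim using the Hugging Face <code>evaluate</code> implementation of BERTScore, which we’ll return to later in this post. Exact scores will vary with the library version and underlying model, but the paraphrase should consistently outscore the sentence that changes the meaning:</p>

<pre class="wp-block-code"><code>from evaluate import load  # pip install evaluate bert-score

bertscore = load("bertscore")

references = ["The red shoes cost $20.00", "The red shoes cost $20.00"]
candidates = ["The rouge slippers cost $20", "The blue socks cost $20"]

results = bertscore.compute(predictions=candidates, references=references, lang="en")

# The paraphrase ("rouge slippers") should receive a higher F1 than "blue socks"
for cand, f1 in zip(candidates, results["f1"]):
    print(f"{cand!r}: F1 = {f1:.4f}")
</code></pre>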
<p>What makes BERTScore particularly compelling is how it combines different approaches to evaluation. Broadly, <a href="https://www.comet.com/site/blog/llm-evaluation-metrics-every-developer-should-know/">LLM evaluation metrics</a> can be broken down into three hierarchical categories: heuristic metrics, learned metrics, and LLM-as-a-judge metrics, with BERTScore occupying a unique position within this framework.</p>

<figure class="wp-block-image"><img src="https://www.comet.com/site/wp-content/uploads/2024/12/opik-BertScore-diagram-scaled.jpg" alt="Diagram of the hierarchy of LLM evaluation metrics including heuristic metrics, learned metrics, and llm-as-a-judge metrics" /><figcaption class="wp-element-caption">The hierarchy of LLM evaluation metrics.</figcaption></figure>

<h3 class="wp-block-heading" id="h-heuristic-metrics">Heuristic Metrics</h3>

<p>Heuristic metrics are evaluation <a href="https://en.wikipedia.org/wiki/Measure_(mathematics)">measures</a> based on <strong>predefined, rules-based formulas</strong> that quantify specific aspects of model outputs. They are deterministic, interpretable, and computationally efficient. But because they rely on measurable surface-level features like token overlap or exact matches, they often fail to account for the more complex aspects of language, like context, complex semantics, or creativity.</p>

<p>Heuristic metrics include distance metrics, statistical metrics, and overlap or n-gram-based metrics. Popular examples include accuracy, <a href="https://www.comet.com/site/blog/perplexity-for-llm-evaluation/">perplexity</a>, <a href="https://en.wikipedia.org/wiki/BLEU">BLEU</a>, <a href="https://en.wikipedia.org/wiki/ROUGE_(metric)">ROUGE</a>, <a href="https://en.wikipedia.org/wiki/Levenshtein_distance">Levenshtein distance</a>, and <strong>cosine similarity</strong>.</p>
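<p>To make “surface-level” concrete, here is a minimal overlap metric of our own (not from the BERTScore paper or the Colab): a unigram precision that counts exact token matches. It is deterministic and cheap, but completely blind to synonyms:</p>

<pre class="wp-block-code"><code>def unigram_precision(candidate: str, reference: str) -> float:
    """Fraction of candidate tokens that appear verbatim in the reference."""
    candidate_tokens = candidate.lower().split()
    reference_tokens = set(reference.lower().split())
    return sum(token in reference_tokens for token in candidate_tokens) / len(candidate_tokens)

print(unigram_precision("the cat sat on the mat", "the cat sat on the mat"))
# 1.0 -- exact match
print(unigram_precision("a feline rested on a rug", "the cat sat on the mat"))
# ~0.167 -- only "on" matches, despite the nearly identical meaning
</code></pre>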
<h3 class="wp-block-heading" id="h-learned-metrics">Learned Metrics</h3>

<p>While heuristic metrics rely on fixed, rules-based formulas, <strong>learned metrics use machine learning models to score text quality</strong>. Typically, these models represent the evaluated text as some kind of learned embedding.</p>

<p>Because embeddings capture semantic and contextual information, learned metrics provide more depth and nuance than heuristic metrics alone and are effectively able to capture aspects like paraphrasing, coherence, and relevance.</p>

<p>Learned metrics tend to be more aligned with human judgment, but are also more computationally expensive and less interpretable. Examples of learned metrics include BERTScore, <a href="https://arxiv.org/abs/2004.04696">BLEURT</a>, <a href="https://arxiv.org/abs/2310.10482">COMET</a>, and <a href="https://arxiv.org/abs/2210.07197">UniEval</a>.</p>

<h3 class="wp-block-heading" id="h-llm-as-a-judge-metrics">LLM-as-a-judge Metrics</h3>

<p>LLM-as-a-judge metrics are probably the most popular metrics for evaluating generative language models, and are able to capture the deepest levels of nuance in language. However, they are also the most computationally expensive and present unique interpretability challenges.</p>

<p>LLM-as-a-judge metrics use large language models themselves to act as a “judge” and provide feedback or a quality score based on evaluation criteria. They are especially useful for open-ended and complex tasks, such as creative writing or reasoning, where predefined metrics may fall short.</p>

<p>BERTScore has the robustness of a learned metric, as it uses BERT’s learned embeddings, but because it is “only” measuring the cosine similarity of token embeddings, it also benefits from the computational efficiency and repeatability of heuristic metrics. If you’re not sure what any of this means, don’t worry, we’ll cover it in the next section!</p>

<h2 class="wp-block-heading" id="h-theory-behind-bertscore">Theory Behind BERTScore</h2>

<p>As we established earlier, BERTScore evaluates the similarity between a reference (ground truth) sentence and a candidate (prediction) sentence by representing their tokens with contextual embeddings and comparing them using cosine similarity. Let’s break that down, starting with a little background.</p>

<p>Prior to BERTScore, the most popular evaluation metrics for text generation were heuristic metrics like n-gram or overlap-based metrics.</p>

<h3 class="wp-block-heading" id="h-the-problem-with-n-grams">The Problem With N-grams</h3>

<p>N-gram metrics count the number of continuous sequences of <em>n</em> tokens that occur in both the reference and candidate sentences. This approach is highly intuitive, but poses some major challenges. Smaller <em>n</em> values often fail to capture context, such as word order, while larger <em>n</em> values quickly become overly restrictive.</p>

<p>More critically, n-grams cannot account for linguistic nuances like paraphrasing, dependencies, and polysemy. This means they score words with multiple meanings identically and fail to recognize synonyms or paraphrases with similar meaning. These limitations make n-grams inadequate for evaluating the depth and complexity of modern language models.</p>

<figure class="wp-block-image"><img src="https://www.comet.com/site/wp-content/uploads/2024/12/n-gram-gif-with-logo.gif" alt="n-gram metrics or overlap metrics for bertscore llm evaluation gif with opik logo" /><figcaption class="wp-element-caption">N-grams cannot account for linguistic nuances like paraphrasing, dependencies, and <a href="https://en.wikipedia.org/wiki/Polysemy">polysemy</a>.</figcaption></figure>
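<p>As a small illustration of our own (the <code>ngram_overlap</code> helper below is hypothetical, written just for this example), notice how quickly n-gram overlap collapses on a harmless paraphrase as <em>n</em> grows:</p>

<pre class="wp-block-code"><code>def ngram_overlap(candidate: str, reference: str, n: int = 2) -> float:
    """Fraction of candidate n-grams that also appear in the reference."""
    cand_tokens = candidate.lower().split()
    ref_tokens = reference.lower().split()
    cand_ngrams = [tuple(cand_tokens[i:i + n]) for i in range(len(cand_tokens) - n + 1)]
    ref_ngrams = {tuple(ref_tokens[i:i + n]) for i in range(len(ref_tokens) - n + 1)}
    return sum(ng in ref_ngrams for ng in cand_ngrams) / len(cand_ngrams)

candidate = "the red shoes cost $20"
reference = "the rouge slippers cost $20"
for n in (1, 2, 3):
    print(n, ngram_overlap(candidate, reference, n))
# 1 0.6  -> "the", "cost", "$20" match
# 2 0.25 -> only "cost $20" survives
# 3 0.0  -> no trigram survives the paraphrase
</code></pre>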
<h3 class="wp-block-heading" id="h-contextual-embeddings">Contextual Embeddings</h3>

<p>To address these issues, BERTScore leverages contextual embeddings. Unlike static embeddings, such as those from Word2Vec or GloVe, contextual embeddings are generated by transformer models, which use attention mechanisms to capture the relationships between all words in a sentence. This approach provides greater flexibility and nuance, making it more suitable for complex tasks like language understanding.</p>

<p>While a deeper exploration of embeddings is beyond the scope of this article, in simple terms, embeddings are vectors of floating-point numbers that capture the semantic context of individual tokens.</p>
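<p>One way to see “contextual” in action (a sketch of ours, not from the Colab): the same surface word gets a different embedding in each sentence it appears in. Here we embed “bank” in three contexts with <code>bert-base-uncased</code>; the two river senses should typically land closer together than the river and money senses:</p>

<pre class="wp-block-code"><code>import torch
from transformers import AutoTokenizer, AutoModel

demo_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
demo_model = AutoModel.from_pretrained("bert-base-uncased")

def embed_word(sentence: str, word: str) -> torch.Tensor:
    """Return the contextual embedding of the first occurrence of `word`."""
    inputs = demo_tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = demo_model(**inputs).last_hidden_state[0]  # (seq_len, 768)
    tokens = demo_tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return hidden[tokens.index(word)]

river = embed_word("she sat on the bank of the river", "bank")
money = embed_word("she deposited cash at the bank", "bank")
stream = embed_word("he fished from the bank of the stream", "bank")

cos = torch.nn.functional.cosine_similarity
print(cos(river, money, dim=0))   # typically lower: different senses of "bank"
print(cos(river, stream, dim=0))  # typically higher: same sense
</code></pre>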
<h3 class="wp-block-heading" id="h-cosine-similarity">Cosine Similarity</h3>

<p>After BERTScore has used a transformer-based model (originally a BERT model) to generate contextual embeddings, how does it use them to quantify sentence similarity?</p>

<p>As mentioned, contextual embeddings are high-dimensional vectors of floating-point numbers. To measure their similarity, we apply basic linear algebra by calculating the cosine of the angle between two vectors. Vectors that are closer in the embedding space have a higher semantic similarity. This measure is <a href="https://en.wikipedia.org/wiki/Cosine_similarity">the cosine similarity</a>.</p>

<figure class="wp-block-image"><img src="https://www.comet.com/site/wp-content/uploads/2024/12/Screenshot-2024-12-17-at-8.10.34%E2%80%AFPM.png" alt="A graphic of a 3D axis with the vector embeddings of various words plotted and the cosine similarity calculated between each." /><figcaption class="wp-element-caption">The <a href="https://docs.oracle.com/en/database/oracle/oracle-database/23/vecse/cosine-similarity.html">cosine of any two normalized vectors is equal to their dot product</a>.</figcaption></figure>

<p>Once the cosine similarity has been calculated between each token in the candidate sentence and each token in the reference sentence, greedy matching is used to select the highest cosine similarity score for each token.</p>

<p>The core benefit of BERTScore is that it gives you the richness of a learned metric via contextual embeddings, with the computational efficiency of a heuristic metric like cosine similarity.</p>

<p>The final steps include using the maximum similarity scores of each token to compute BERTRecall, BERTPrecision, and BERTF1, plus optional importance weighting and baseline rescaling, which we’ll cover in the next section.</p>

<h2 class="wp-block-heading" id="h-implement-bertscore-from-scratch-in-python">Implement BERTScore From Scratch in Python</h2>

<p>Using what we’ve learned so far about BERTScore, let’s implement it from scratch in Python to help build intuition for what it’s actually doing under the hood. Later, we’ll implement BERTScore as a custom metric in Opik and test it out on an image-captioning dataset.</p>

<p>First, we’ll need to choose our BERT-based model. Here we choose a medium-sized BERT model for English texts and load its accompanying tokenizer:</p>

<pre class="wp-block-code"><code>import torch
from transformers import BertTokenizer, BertModel

# Load BERT model and tokenizer
MODEL_NAME = "bert-base-uncased"

tokenizer = BertTokenizer.from_pretrained(MODEL_NAME)
model = BertModel.from_pretrained(MODEL_NAME, device_map="auto")
</code></pre>

<p>You can find <a href="https://docs.google.com/spreadsheets/d/1RKOVpselB98Nnh_EOC4A2BYn8_201tmPODpNWu4w7xI/edit?gid=0#gid=0">a full list of supported models, along with their performance scores and best representation layers here</a>.</p>
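<p>Keep in mind that everything downstream operates on the tokenizer’s WordPiece tokens, not whole words. A quick peek (our own addition) shows how a sentence is actually split before any embeddings are computed:</p>

<pre class="wp-block-code"><code># Inspect how BERT's WordPiece tokenizer splits a sentence
print(tokenizer.tokenize("The rouge slippers cost $20"))
# e.g. ['the', 'rouge', 'slippers', 'cost', '$', '20'] -- punctuation is split off,
# and rarer words may be broken into '##'-prefixed subword pieces
</code></pre>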
<h3 class="wp-block-heading" id="h-bert-embeddings-and-cosine-similarity">BERT Embeddings and Cosine Similarity</h3>

<p>Next, to calculate BERTScore, we’ll define several functions to help out with each step of the process. Leaving out the optional steps mentioned above, this includes:</p>

<ol class="wp-block-list">
<li>Getting embeddings of each sentence</li>
<li>Calculating cosine similarity between embeddings</li>
<li>Using greedy matching to select the highest score</li>
<li>Calculating BERTPrecision, BERTRecall, and BERTF1, and returning them in a dictionary</li>
</ol>

<p>Let’s start with a function to compute the embeddings of each reference and candidate sentence. We’ll need to tokenize the text, run the tokens through the model, and return the model’s last hidden state, which holds the token-level contextual embeddings.</p>

<pre class="wp-block-code"><code>def get_embeddings(text):
    """
    Generate token embeddings for the input text using BERT.

    Args:
        text (str): Input text or batch of sentences.

    Returns:
        torch.Tensor: Token embeddings with shape (batch_size, seq_len, hidden_dim).
    """
    # Tokenize input text
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
    # Move inputs to the same device as the model
    inputs = inputs.to(model.device)

    # Compute embeddings without gradient calculation
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)
    # Return last hidden states (token-level embeddings)
    return outputs.last_hidden_state
</code></pre>

<p>Next, we’ll create a function to calculate the cosine similarity between the generated embeddings. These embeddings first need to be normalized. Once normalized, the cosine similarity between two vectors equals their dot product, so we can use batched matrix multiplication to create a matrix of cosine similarity scores.</p>

<pre class="wp-block-code"><code>def cosine_similarity(generated_embeddings, reference_embeddings):
    """
    Compute cosine similarity between two sets of embeddings.

    Args:
        generated_embeddings (torch.Tensor): Embeddings of candidate tokens with shape (batch_size, seq_len, hidden_dim).
        reference_embeddings (torch.Tensor): Embeddings of reference tokens with shape (batch_size, seq_len, hidden_dim).

    Returns:
        torch.Tensor: Cosine similarity matrix with shape (batch_size, seq_len_generated, seq_len_reference).
    """
    # Normalize embeddings along the hidden dimension
    generated_embeddings = torch.nn.functional.normalize(generated_embeddings, dim=-1)
    reference_embeddings = torch.nn.functional.normalize(reference_embeddings, dim=-1)

    # Compute similarity using batched matrix multiplication
    return torch.bmm(generated_embeddings, reference_embeddings.transpose(1, 2))
</code></pre>
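<p>As a quick sanity check (ours, not in the original Colab), you can confirm the shapes line up: one similarity row per candidate token and one column per reference token, including the <code>[CLS]</code> and <code>[SEP]</code> special tokens:</p>

<pre class="wp-block-code"><code>cand_emb = get_embeddings("The rouge slippers cost $20")  # (1, seq_len_c, 768)
ref_emb = get_embeddings("The red shoes cost $20.00")     # (1, seq_len_r, 768)

sim = cosine_similarity(cand_emb, ref_emb)
print(sim.shape)  # torch.Size([1, seq_len_c, seq_len_r])
</code></pre>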
<p>We now have matrices containing the cosine similarity scores of each token in the candidate sentence with each token in the reference sentence. But we need a way to aggregate these values into a sentence-level representation of similarity. The original authors of the BERTScore paper proposed three measures to do just this: BERTPrecision, BERTRecall, and BERTF1.</p>

<figure class="wp-block-image"><img src="https://www.comet.com/site/wp-content/uploads/2024/12/similarity_matrix.png" alt="Similarity matrix of candidate and reference embeddings using bertscore" /><figcaption class="wp-element-caption">BERTScore provides <a href="https://github.com/Tiiiger/bert_score/blob/master/example/Demo.ipynb">a convenient function to visualize the cosine similarity matrix of any two sentences</a>.</figcaption></figure>

<h3 class="wp-block-heading" id="h-bertprecision-bertrecall-bertf1">BERTPrecision, BERTRecall, BERTF1</h3>

<p>Traditionally, precision, recall, and F1 scores evaluate a classifier’s ability to distinguish between positive and negative samples. While BERTScore isn’t designed for classification, its authors adapted these metrics for evaluating LLM-generated text, preserving their original intent in this new context.</p>

<figure class="wp-block-image"><img src="https://www.comet.com/site/wp-content/uploads/2024/12/Screenshot-2024-12-17-at-7.19.57%E2%80%AFPM.png" alt="A comparison of the greedy matching done for each candidate sentence with each reference sentence vs each reference sentence with each candidate sentence for bertscore" /><figcaption class="wp-element-caption">To calculate BERTPrecision, BERTRecall, and BERTF1, we first use greedy matching to gather the maximum similarity scores for each candidate token against the reference, and for each reference token against the candidate.</figcaption></figure>

<p>Let’s start with precision. Precision measures a model’s accuracy in identifying true positives. In BERTScore, “true positives” are candidate tokens that align with reference tokens. <strong>BERTPrecision quantifies how much of the candidate’s content is semantically meaningful relative to the reference</strong>. It is calculated as the average of the maximum cosine similarities between each candidate token’s embedding and the embeddings of all reference tokens. A high BERTPrecision indicates that the candidate is concise and relevant.</p>
<figure class="wp-block-image"><img src="https://www.comet.com/site/wp-content/uploads/2024/12/Screenshot-2024-12-14-at-8.05.06%E2%80%AFPM.png" alt="Mathematical equation for BERTPrecision" /><figcaption class="wp-element-caption">Mathematical equation for BERTPrecision</figcaption></figure>

<pre class="wp-block-code"><code>def get_precision(similarity_matrix):
    """
    Calculate BERT precision as the mean of the maximum similarity scores from the candidate to the reference.

    Args:
        similarity_matrix (torch.Tensor): Cosine similarity matrix.

    Returns:
        torch.Tensor: Precision score.
    """
    # For each candidate token (rows), take its best match among reference tokens (dim=2)
    return similarity_matrix.max(dim=2)[0].mean()
</code></pre>

<p>Next, let’s define BERTRecall. Recall measures the proportion of actual positive instances that a model identifies correctly. For BERTScore, <strong>BERTRecall reflects how much of the reference’s meaning is captured by the candidate</strong>. It is calculated as the average of the maximum cosine similarity scores between each reference token’s embedding and the embeddings of all candidate tokens. A high BERTRecall suggests that the candidate does not miss key information present in the reference.</p>
<figure class="wp-block-image"><img src="https://www.comet.com/site/wp-content/uploads/2024/12/Screenshot-2024-12-14-at-8.04.30%E2%80%AFPM.png" alt="Mathematical formula for BERTRecall" /><figcaption class="wp-element-caption">Mathematical formula for BERTRecall</figcaption></figure>

<pre class="wp-block-code"><code>def get_recall(similarity_matrix):
    """
    Calculate BERT recall as the mean of the maximum similarity scores from the reference to the candidate.

    Args:
        similarity_matrix (torch.Tensor): Cosine similarity matrix.

    Returns:
        torch.Tensor: Recall score.
    """
    # For each reference token (columns), take its best match among candidate tokens (dim=1)
    return similarity_matrix.max(dim=1)[0].mean()
</code></pre>

<p>The BERTF1 score is the harmonic mean of BERTPrecision and BERTRecall, balancing these two metrics when there is a trade-off. It provides a single summary value of overall semantic alignment between the candidate and the reference.</p>
<figure class="wp-block-image"><img src="https://www.comet.com/site/wp-content/uploads/2024/12/Screenshot-2024-12-14-at-8.05.30%E2%80%AFPM.png" alt="Mathematical formula for BERTF1" /><figcaption class="wp-element-caption">Mathematical formula for BERTF1</figcaption></figure>

<pre class="wp-block-code"><code>def get_f1_score(precision, recall):
    """
    Compute the F1 score given precision and recall.

    Args:
        precision (torch.Tensor): Precision score.
        recall (torch.Tensor): Recall score.

    Returns:
        torch.Tensor: F1 score.
    """
    # Harmonic mean; assumes precision + recall > 0, which holds in practice
    # since BERT cosine similarities are virtually never all zero
    return 2 * (precision * recall) / (precision + recall)
</code></pre>

<p>Finally, BERTScore outputs BERTPrecision, BERTRecall, and BERTF1 as a dictionary, which we’ll cover in the next coding section.</p>

<h3 class="wp-block-heading" id="h-importance-weighting-and-baseline-rescaling-with-bertscore">Importance Weighting and Baseline Rescaling with BERTScore</h3>

<p>Additionally, BERTScore includes some optional processes, including importance weighting with IDF and baseline rescaling.</p>

<p>Since rare words are often more indicative of sentence meaning than common words or <a href="https://www.coursera.org/articles/what-are-stop-words">stop words</a>, BERTScore allows for frequency penalization using the Inverse Document Frequency (IDF) of the test corpus (the body of reference sentences).</p>

<p>IDF is based on the principle that words appearing in many documents are less informative than words that appear in fewer documents. It’s calculated by taking the logarithm of the total number of documents, <code>N</code>, divided by the number of documents containing a given term, <code>t</code>.</p>
<figure class="wp-block-image"><img src="https://www.comet.com/site/wp-content/uploads/2024/12/Screenshot-2024-12-14-at-8.52.28%E2%80%AFPM.png" alt="Mathematical formula for Inverse Document Frequency (IDF)" /><figcaption class="wp-element-caption">Mathematical formula for Inverse Document Frequency (IDF)</figcaption></figure>

<p>Additionally, to normalize the scores to a range of -1 to 1 and make them more readable, the original BERTScore paper suggests rescaling BERTScore with respect to its empirical lower bound <em>b</em> as a baseline. This rescaling does not affect score ranking, however; it is solely meant to improve readability. A full list of baseline scores for BERT models in 12 languages can be found in <a href="https://github.com/Tiiiger/bert_score/tree/master/bert_score/rescale_baseline">this directory</a>.</p>

<figure class="wp-block-image"><img src="https://www.comet.com/site/wp-content/uploads/2024/12/Screenshot-2024-12-14-at-8.55.12%E2%80%AFPM.png" alt="Mathematical formula for BERTScore baseline rescaling" /><figcaption class="wp-element-caption">Mathematical formula for BERTScore baseline rescaling</figcaption></figure>

<figure class="wp-block-image"><img src="https://www.comet.com/site/wp-content/uploads/2024/12/Screenshot-2024-12-17-at-7.26.00%E2%80%AFPM.png" alt="A comparison of the similarity matrices before and after baseline rescaling for bertscore" /><figcaption class="wp-element-caption">After baseline rescaling, the cosine similarity scores range from -1 to 1.</figcaption></figure>
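<p>In code, both ideas are simple. Here is a rough sketch of ours following the formulas above (the real implementations add smoothing and special-token handling, so treat this as illustrative only): IDF weights computed over the reference corpus, and the paper’s rescaling applied to a raw score with a precomputed baseline <em>b</em>:</p>

<pre class="wp-block-code"><code>import math

def idf_weights(tokenized_corpus):
    """IDF per token over a corpus of tokenized reference sentences: log(N / df(t))."""
    N = len(tokenized_corpus)
    document_frequency = {}
    for sentence in tokenized_corpus:
        for token in set(sentence):
            document_frequency[token] = document_frequency.get(token, 0) + 1
    return {token: math.log(N / df) for token, df in document_frequency.items()}

def rescale_with_baseline(score, baseline):
    """Baseline rescaling from the BERTScore paper: (score - b) / (1 - b)."""
    return (score - baseline) / (1 - baseline)

corpus = [s.split() for s in ["the red shoes cost $20", "the cat sat on the mat"]]
print(idf_weights(corpus)["the"])  # common words get low weight (here, log(2/2) = 0)
</code></pre>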
<p>Both of these processes are set to <code>False</code> by default in Hugging Face’s implementation of BERTScore, so we won’t include them when we code BERTScore from scratch. Each can be set to <code>True</code> with the <code>idf</code> and <code>rescale_with_baseline</code> parameters of <code>evaluate</code>’s <code>bertscore</code> metric, respectively.</p>

<h3 class="wp-block-heading" id="h-putting-it-all-together">Putting It All Together</h3>

<p>Now that we have all of our helper functions, let’s put them together to create our BERTScore function. In this function we will:</p>

<ol class="wp-block-list">
<li>Generate embeddings of the candidate and reference tokens using the model we instantiated above.</li>
<li>Calculate the cosine similarity matrix (or, after normalization, the dot product) between each candidate embedding and each reference embedding.</li>
<li>Using greedy matching, calculate the precision, recall, and F1 scores of each sentence.</li>
<li>Return the dictionary of values.</li>
</ol>

<pre class="wp-block-code"><code>def bert_score(candidate, reference):
    """
    Compute BERTScore (Precision, Recall, F1) between a candidate and a reference sentence.

    Args:
        candidate (str): Candidate sentence.
        reference (str): Reference sentence.

    Returns:
        dict: Dictionary containing precision, recall, and F1 scores.
    """
    # Get token embeddings for candidate and reference
    candidate_embeddings = get_embeddings(candidate)
    reference_embeddings = get_embeddings(reference)

    # Compute cosine similarity matrix
    similarity_matrix = cosine_similarity(candidate_embeddings, reference_embeddings)

    # Calculate precision, recall, and F1 scores
    precision = get_precision(similarity_matrix)
    recall = get_recall(similarity_matrix)
    f1_score = get_f1_score(precision, recall)

    # Return scores as a dictionary
    return {
        "precision": precision.item(),
        "recall": recall.item(),
        "f1_score": f1_score.item(),
    }

# Example usage
if __name__ == "__main__":
    candidate_sentence = "The cat sat on the mat."
    reference_sentence = "A cat rested on a mat."

    scores = bert_score(candidate_sentence, reference_sentence)
    print("BERTScore:", scores)
</code></pre>

<p>Feel free to test this function out for yourself! Note that the intention of this exercise is to help build intuition around what BERTScore does under the hood, and it is significantly simplified from the Hugging Face <code>evaluate</code> version, which incorporates IDF, baseline rescaling, batching, padding, attention mask shifting, and more.</p>
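<p>If you want to see how close this simplified version gets, a quick cross-check against the Hugging Face implementation looks like the following sketch. Expect the numbers to differ somewhat: among other things, the library strips special tokens like <code>[CLS]</code> and <code>[SEP]</code> and uses a tuned representation layer rather than the final hidden state:</p>

<pre class="wp-block-code"><code>from evaluate import load

hf_bertscore = load("bertscore")
hf_scores = hf_bertscore.compute(
    predictions=["The cat sat on the mat."],
    references=["A cat rested on a mat."],
    model_type="bert-base-uncased",  # match our from-scratch model
)
print("Hugging Face:", hf_scores["precision"][0], hf_scores["recall"][0], hf_scores["f1"][0])
print("From scratch:", bert_score("The cat sat on the mat.", "A cat rested on a mat."))
</code></pre>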
<p>For these reasons, we will be using the Hugging Face implementation of BERTScore to build a custom metric in Opik below.</p>

<h2 class="wp-block-heading" id="h-implement-bertscore-in-opik">Implement BERTScore in Opik</h2>

<p>Now let’s try a real-life end-to-end example. If you aren’t already, you can <a href="https://colab.research.google.com/drive/1ti9yl-0ynl9eFPKJnIzkpUxbyXNtsyFA">follow along with the Colab here</a>.</p>

<p>In this section, we’ll use BLIP, an image-captioning model, along with a small subset of the <a href="https://huggingface.co/datasets/google-research-datasets/conceptual_captions/viewer">Conceptual Captions</a> dataset from Google Research, which pairs images with captions sourced from the internet. Notably, image captioning and machine translation were the original use cases proposed by BERTScore’s authors.</p>

<p>To do this, we’ll implement BERTScore in Opik, Comet’s open-source <a href="https://www.comet.com/site/products/opik/">LLM evaluation framework</a>. We’ll leverage Hugging Face’s <code>evaluate</code> implementation of BERTScore, modifying it slightly to create a custom Opik metric with a <code>score</code> method that returns a <code>ScoreResult</code> object:</p>

<pre class="wp-block-code"><code>from evaluate import load

bertscore = load("bertscore")

from opik.evaluation.metrics import base_metric, score_result
from typing import List, Union

class BERTScore(base_metric.BaseMetric):
    """
    BERTScore is a semantic similarity evaluation metric for text generation tasks.
    It measures the similarity between predicted (candidate) and reference texts
    by comparing their contextual embeddings using a pre-trained language model.

    This implementation leverages the Hugging Face Evaluate library for computing BERTScore.

    For more details:
    - Original BERTScore paper: https://arxiv.org/abs/1904.09675
    - Hugging Face implementation: https://github.com/huggingface/evaluate/blob/main/metrics/bertscore/README.md

    Args:
        name (str): The name of the metric, defaults to "BERTScore".
        language (str): The language of the model, defaults to "en" (English).
    """

    def __init__(
        self,
        name: str = "BERTScore",
        language: str = "en"
    ):
        self.name = name
        self.language = language

    def score(
        self, candidate: str, reference: str, **kwargs
    ) -> List[score_result.ScoreResult]:
        """
        Computes the BERTScore between a candidate (predicted) text and a reference (ground truth) text.

        This method calculates recall, precision, and F1 score based on token-level
        contextual embeddings, using a pre-trained transformer model.

        Args:
            candidate (str or List[str]): The candidate text or list of texts to evaluate.
                Must not be empty or contain only whitespace.
            reference (str or List[str]): The reference text or list of texts to compare against.
                Must not be empty or contain only whitespace.
            **kwargs: Additional keyword arguments for the Hugging Face BERTScore computation.

        Returns:
            List[score_result.ScoreResult]: A list of `ScoreResult` objects containing:
                - BERTRecall: The BERTScore recall score.
                - BERTPrecision: The BERTScore precision score.
                - BERTF1: The BERTScore F1 score.

        Raises:
            ValueError: If candidate or reference inputs are empty strings or lists.
            TypeError: If candidate or reference inputs are not strings or lists of strings.
        """
        # Validate and normalize inputs
        def validate_and_normalize(text: Union[str, List[str]]) -> List[str]:
            if isinstance(text, str):
                if not text.strip():
                    raise ValueError("Input text cannot be empty or whitespace.")
                return [text]
            if isinstance(text, list):
                if not all(isinstance(t, str) and t.strip() for t in text):
                    raise ValueError("All elements in the input list must be non-empty strings.")
                return text
            raise TypeError("Input must be a string or a list of strings.")

        candidate = validate_and_normalize(candidate)
        reference = validate_and_normalize(reference)

        results_dict = bertscore.compute(predictions=candidate, references=reference, lang=self.language)

        # Create score results
        return [
            score_result.ScoreResult(value=results_dict["recall"][0], name="BERTRecall"),
            score_result.ScoreResult(value=results_dict["precision"][0], name="BERTPrecision"),
            score_result.ScoreResult(value=results_dict["f1"][0], name="BERTF1"),
        ]

bscore = BERTScore()
</code></pre>
<p>After defining BERTScore as a custom metric, we can use it by:</p>

<ul class="wp-block-list">
<li>Defining the model’s forward pass in <code>generate_caption</code> (after some minor image pre-processing).</li>
<li>Calling our application in <code>evaluation_task</code> and returning a dictionary with keys that match the parameters expected by our custom BERTScore metric above.</li>
<li>Adding tracking by decorating our functions with Opik’s <code>@track</code> decorator to automatically log relevant data to the platform.</li>
<li>Passing the <code>evaluation_task</code> function to Opik’s <code>evaluate</code> function, which runs and logs the full evaluation process, including calculating BERTScore for each call.</li>
<li>Passing our model’s configuration details to the <code>evaluate</code> function as well, so they are logged to Opik.</li>
</ul>

<p>You can find <a href="https://colab.research.google.com/drive/1ti9yl-0ynl9eFPKJnIzkpUxbyXNtsyFA">the full code in the Colab</a>.</p>

<pre class="wp-block-code"><code>from opik import track
import requests
from PIL import Image
from opik.evaluation import evaluate

# Configuration constants for text generation
MAX_LENGTH = 50
MIN_LENGTH = 10
LENGTH_PENALTY = 1.0
REPETITION_PENALTY = 1.2
NUM_BEAMS = 5
EARLY_STOPPING = True

# Model name
MODEL_NAME = "your_model_name_here"  # Replace with your actual model name

# NOTE: `processor`, `model`, and `dataset` are loaded in the Colab
# (the BLIP processor/model and the Conceptual Captions subset).

@track
def generate_caption(image_url: str) -> dict:
    """
    Generates a caption for an image using a pre-trained LLM.

    Args:
        image_url (str): The URL of the image to caption.

    Returns:
        dict: A dictionary containing the generated caption as 'candidate'.
    """
    # Load image from the provided URL
    try:
        response = requests.get(image_url, stream=True)
        response.raise_for_status()
        image = Image.open(response.raw)
    except requests.exceptions.RequestException as e:
        raise ValueError(f"Error fetching image from URL: {e}")

    # Preprocess the image
    inputs = processor(images=image, return_tensors="pt").to("cuda")

    # Generate text using the model
    outputs = model.generate(
        **inputs,
        max_length=MAX_LENGTH,         # Maximum length of generated text
        min_length=MIN_LENGTH,         # Minimum length of generated text
        length_penalty=LENGTH_PENALTY, # Length penalty to control verbosity
        repetition_penalty=REPETITION_PENALTY, # Penalty to avoid repetition
        num_beams=NUM_BEAMS,           # Number of beams for beam search
        early_stopping=EARLY_STOPPING  # Stop generation early when appropriate
    )

    # Decode and return the caption
    caption = processor.decode(outputs[0], skip_special_tokens=True)
    return {"candidate": caption}


@track
def evaluation_task(data: dict) -> dict:
    """
    Evaluation task to compare generated captions with reference captions.

    Args:
        data (dict): A dictionary containing 'image_url' and 'reference' keys.

    Returns:
        dict: A dictionary with 'reference' and 'candidate' captions.
    """
    # Generate LLM output (caption)
    llm_output = generate_caption(data['image_url'])

    # Return the reference and candidate captions
    return {
        "reference": data['reference'],
        "candidate": llm_output['candidate']
    }


# Run evaluation
evaluation = evaluate(
    experiment_name="My BERTScore Experiment", # Name of the experiment
    dataset=dataset,                           # Dataset for evaluation
    task=evaluation_task,                      # Evaluation task
    scoring_metrics=[bscore],                  # Scoring metrics to use
    experiment_config={                        # Configuration for the experiment
        "model": MODEL_NAME,
        "max_length": MAX_LENGTH,
        "min_length": MIN_LENGTH,
        "length_penalty": LENGTH_PENALTY,
        "repetition_penalty": REPETITION_PENALTY,
        "num_beams": NUM_BEAMS,
        "early_stopping": EARLY_STOPPING
    },
    task_threads=1,                            # Number of threads for the task
)
</code></pre>
<p>And here is what the output of your evaluation should look like from within the Opik UI:</p>

<figure class="wp-block-image"><img src="https://www.comet.com/site/wp-content/uploads/2024/12/Screenshot-2024-12-14-at-9.20.15%E2%80%AFPM.png" alt="bertscore demo experiment in Comet Opik UI screenshot" /><figcaption class="wp-element-caption">Compare individual samples across experiments</figcaption></figure>
src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/12\/Screenshot-2024-12-14-at-9.20.15\u202fPM.png\" alt=\"bertscore demo experiment in Comet Opik UI screenshot\" class=\"wp-image-12359\" srcset=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/12\/Screenshot-2024-12-14-at-9.20.15\u202fPM.png 2982w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/12\/Screenshot-2024-12-14-at-9.20.15\u202fPM-300x165.png 300w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/12\/Screenshot-2024-12-14-at-9.20.15\u202fPM-1024x565.png 1024w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/12\/Screenshot-2024-12-14-at-9.20.15\u202fPM-768x423.png 768w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/12\/Screenshot-2024-12-14-at-9.20.15\u202fPM-1536x847.png 1536w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/12\/Screenshot-2024-12-14-at-9.20.15\u202fPM-2048x1129.png 2048w\" sizes=\"auto, (max-width: 2982px) 100vw, 2982px\" \/><figcaption class=\"wp-element-caption\">Compare individual samples across experiments<\/figcaption><\/figure>\n\n\n\n<figure class=\"wp-block-image aligncenter wp-image-12361 size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"2986\" height=\"1630\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/12\/Screenshot-2024-12-14-at-9.18.08\u202fPM.png\" alt=\"Screenshot of bertscore demo experiment in Comet Opik UI\" class=\"wp-image-12361\" srcset=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/12\/Screenshot-2024-12-14-at-9.18.08\u202fPM.png 2986w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/12\/Screenshot-2024-12-14-at-9.18.08\u202fPM-300x164.png 300w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/12\/Screenshot-2024-12-14-at-9.18.08\u202fPM-1024x559.png 1024w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/12\/Screenshot-2024-12-14-at-9.18.08\u202fPM-768x419.png 768w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/12\/Screenshot-2024-12-14-at-9.18.08\u202fPM-1536x838.png 1536w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/12\/Screenshot-2024-12-14-at-9.18.08\u202fPM-2048x1118.png 2048w\" sizes=\"auto, (max-width: 2986px) 100vw, 2986px\" \/><figcaption class=\"wp-element-caption\">View details of individuals trace spans of your LLM application<\/figcaption><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-bertscore-towards-llm-as-a-judge-metrics\">BERTScore: Towards LLM-as-a-judge Metrics<\/h2>\n\n\n\n<p><span style=\"font-weight: 400;\">BERTScore was among the first widely adopted evaluation metrics to incorporate large language models for assessing output quality. It operates by using a pre-trained transformer-based model, such as BERT, to generate contextual embeddings, or, dense, learned representations of tokens that encode semantic and syntactic information.&nbsp;<\/span><\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">While innovative for its time, BERTScore represents an earlier stage in the progression of modern LLM evaluation methods. Unlike modern &#8220;LLM-as-a-judge&#8221; approaches, which rely on language models to generate comprehensive, nuanced feedback for another model&#8217;s outputs, BERTScore focuses solely on token-level comparisons without producing holistic judgments. 
This distinction underscores a shift toward evaluation techniques that prioritize coherence, reasoning, and context on a broader scale.</p>

<p>However, LLM-as-a-judge methods, while powerful, remain opaque, non-deterministic, and computationally expensive, making them less accessible and harder to interpret. In contrast, metrics like BERTScore remain essential for their efficiency, transparency, and utility in providing actionable insights into model behavior.</p>

<hr class="wp-block-separator has-alpha-channel-opacity"/>

<p><strong>If you found this article useful, follow me on <a href="https://www.linkedin.com/in/anmorgan24/">LinkedIn</a> and <a href="https://x.com/anmorgan2414">Twitter</a> for more content!</strong></p>