{"id":6713,"date":"2023-07-16T14:37:03","date_gmt":"2023-07-16T22:37:03","guid":{"rendered":"https:\/\/live-cometml.pantheonsite.io\/?p=6713"},"modified":"2025-06-18T10:28:01","modified_gmt":"2025-06-18T10:28:01","slug":"explainable-ai-for-transformers","status":"publish","type":"post","link":"https:\/\/www.comet.com\/site\/blog\/explainable-ai-for-transformers\/","title":{"rendered":"Explainable AI: Visualizing Attention in Transformers"},"content":{"rendered":"\n<figure class=\"wp-block-image aligncenter size-large size-full wp-image-6695\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"607\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/explainable-ai-visualizing-attention-in-transformers-Large-1024x607.jpeg\" alt=\"pink and blue robot on an orange background\" class=\"wp-image-17100\" srcset=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/explainable-ai-visualizing-attention-in-transformers-Large-1024x607.jpeg 1024w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/explainable-ai-visualizing-attention-in-transformers-Large-300x178.jpeg 300w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/explainable-ai-visualizing-attention-in-transformers-Large-768x455.jpeg 768w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/explainable-ai-visualizing-attention-in-transformers-Large.jpeg 1280w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption class=\"wp-element-caption\">Photo by <a href=\"https:\/\/unsplash.com\/@jefferyho?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText\">Jeffery Ho<\/a> on <a href=\"https:\/\/unsplash.com\/photos\/x22UAIdif_k?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText\">Unsplash<\/a>, edited by author.<\/figcaption><\/figure>\n\n\n\n<div class=\"wp-block-buttons is-layout-flex wp-block-buttons-is-layout-flex\">\n<div class=\"wp-block-button\"><a class=\"wp-block-button__link 
wp-element-button\" href=\"https:\/\/colab.research.google.com\/drive\/1WvIHAaXjWK-kRzmB_lLjNx8wJYuUnhCn#scrollTo=k6FQL8UuKXd_\" target=\"_blank\" rel=\"noreferrer noopener\">Follow along with the Colab<\/a><\/div>\n\n\n\n<div class=\"wp-block-button\"><a class=\"wp-block-button__link wp-element-button\" href=\"\/signup\" target=\"_blank\" rel=\"noreferrer noopener\">Create a free Comet account<\/a><\/div>\n<\/div>\n\n\n\n<p><span style=\"font-weight: 400;\">In this article, we explore one of the most popular tools for visualizing the core distinguishing feature of transformer architectures: the attention mechanism. Keep reading to learn more about BertViz and how you can incorporate this attention visualization tool into your NLP and MLOps workflow with Comet.<\/span><\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">Feel free to follow along with the <\/span><a href=\"https:\/\/colab.research.google.com\/drive\/1WvIHAaXjWK-kRzmB_lLjNx8wJYuUnhCn#scrollTo=k6FQL8UuKXd_\"><span style=\"font-weight: 400;\">full-code tutorial here<\/span><\/a><span style=\"font-weight: 400;\">, or, if you can\u2019t wait, check out <\/span><a href=\"https:\/\/www.comet.com\/examples\/demo-visualizing-attention-bertviz\/view\/vyr6Nk6Y1cQIklggSZ4A3zSrf\/panels\"><span style=\"font-weight: 400;\">the final project here<\/span><\/a><span style=\"font-weight: 400;\">.<\/span><\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-introduction\"><span style=\"font-weight: 400;\">Introduction<\/span><\/h2>\n\n\n\n<p><span style=\"font-weight: 400;\">Transformers have been described as the single most important technological development in NLP in recent years, but their processes remain largely opaque. This is a problem because, as we continue to make major machine learning advancements, we can\u2019t always explain how or why our models work, which can lead to issues like undetected model bias, model collapse, and other ethical and reproducibility issues. 
Especially as models are more frequently deployed to sensitive areas like healthcare, law, finance, and security, model explainability is critical. <\/span><\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter wp-image-6743\"><img loading=\"lazy\" decoding=\"async\" width=\"2550\" height=\"1040\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/Screen-Shot-2023-07-15-at-7.54.01-PM.png\" alt=\"Horizontal bar chart showing gender and race projections for different professions, as predicted by a Word2Vec model (pre-transformer)\" class=\"wp-image-6743\" srcset=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/Screen-Shot-2023-07-15-at-7.54.01-PM.png 2550w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/Screen-Shot-2023-07-15-at-7.54.01-PM-300x122.png 300w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/Screen-Shot-2023-07-15-at-7.54.01-PM-1024x418.png 1024w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/Screen-Shot-2023-07-15-at-7.54.01-PM-768x313.png 768w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/Screen-Shot-2023-07-15-at-7.54.01-PM-1536x626.png 1536w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/Screen-Shot-2023-07-15-at-7.54.01-PM-2048x835.png 2048w\" sizes=\"auto, (max-width: 2550px) 100vw, 2550px\" \/><figcaption class=\"wp-element-caption\">Gender and race projections across professions, as calculated by Word2Vec. These learned biases could have a variety of negative consequences depending on the application of such a model. 
Image from <a href=\"https:\/\/medium.com\/institute-for-applied-computational-science\/bias-in-nlp-embeddings-b1dabb8bbe20\">Bias in NLP Embeddings<\/a> by Simon Warchal.<\/figcaption><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-what-is-bertviz\"><span style=\"font-weight: 400;\">What is BertViz?<\/span><\/h2>\n\n\n\n<p><span style=\"font-weight: 400;\">BertViz is an open source tool that visualizes the attention mechanism of transformer models at multiple scales, including model-level, attention head-level, and neuron-level. But BertViz isn\u2019t new. In fact, early versions of BertViz have been around since as early as 2017.&nbsp;<\/span><\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">So, why are we still talking about BertViz?&nbsp;<\/span><\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">BertViz is an explainability tool in a field (NLP) that is otherwise notoriously opaque. And, despite its name, BertViz doesn\u2019t only work on BERT. The BertViz API supports many transformer language models, including GPT2, T5, and <\/span><a href=\"https:\/\/huggingface.co\/models\"><span style=\"font-weight: 400;\">most HuggingFace models<\/span><\/a><span style=\"font-weight: 400;\">.<\/span><\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter wp-image-6731\"><img loading=\"lazy\" decoding=\"async\" width=\"1640\" height=\"924\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/1.gif\" alt=\"BertViz visualization in the Comet UI for two different types of transformer models: an encoder-only distilbert transformer for question-answering and a decoder-only gpt-2 transformer for text generation\" class=\"wp-image-6731\"\/><figcaption class=\"wp-element-caption\">Despite its name, BertViz supports a wide variety of models. On the left, we visualize a question-answering task using an encoder-only model, and on the right, a text generation task using a decoder-only model. 
GIF by author.<\/figcaption><\/figure>\n\n\n\n<p><span style=\"font-weight: 400;\">As transformer architectures have increasingly dominated the machine learning landscape in recent years, they\u2019ve also revived an old but important debate regarding interpretability and transparency in AI. So, while BertViz may not be new, its application as an explainability tool in the AI space is more relevant now than ever. <\/span><\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-but-first-transformers\"><span style=\"font-weight: 400;\">But first, transformers<\/span><\/h2>\n\n\n\n<p><span style=\"font-weight: 400;\">To explain BertViz, it helps to have a basic understanding of transformers and self-attention. If you\u2019re already familiar with these concepts, feel free to skip ahead to the section where we start coding.<\/span><\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">We won\u2019t go into the nitty-gritty details of transformers here, as that\u2019s a little beyond the scope of this article, but we will cover some of the basics. I also encourage you to check out the additional resources at the end of the article. <\/span><\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-in-the-beginning-the-prehistoric-era-of-nlp\"><span style=\"font-weight: 400;\">In the beginning (the prehistoric era of NLP)<\/span><\/h2>\n\n\n\n<p><span style=\"font-weight: 400;\">So, how, exactly, does a computer \u201clearn\u201d natural language? In short, it can\u2019t, at least not directly. Computers can only understand and process numerical data, so the first step of NLP is to break down sentences into \u201ctokens,\u201d which are assigned numerical values. 
The question driving NLP then becomes \u201chow can we accurately reduce language and communication processes to computations?\u201d&nbsp;<\/span><\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">Some of the first NLP models included feed-forward neural networks like the Multi-Layer Perceptron (MLP) and even CNNs, which are more commonly used today for computer vision. These models worked for some simple classification tasks (like sentiment analysis) but had a major drawback: their feed-forward nature meant that at each point in time, the network only saw one word as its input. Imagine trying to predict the word that follows \u201cthe\u201d in a sentence. How many possibilities are there?<\/span><\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full wp-image-6745\"><img loading=\"lazy\" decoding=\"async\" width=\"2030\" height=\"396\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/Screen-Shot-2023-07-13-at-4.37.40-PM.png\" alt=\"A visualization of the difficulty of next-word prediction for sequence models that don't &quot;remember&quot; any context\" class=\"wp-image-6745\" srcset=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/Screen-Shot-2023-07-13-at-4.37.40-PM.png 2030w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/Screen-Shot-2023-07-13-at-4.37.40-PM-300x59.png 300w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/Screen-Shot-2023-07-13-at-4.37.40-PM-1024x200.png 1024w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/Screen-Shot-2023-07-13-at-4.37.40-PM-768x150.png 768w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/Screen-Shot-2023-07-13-at-4.37.40-PM-1536x300.png 1536w\" sizes=\"auto, (max-width: 2030px) 100vw, 2030px\" \/><figcaption class=\"wp-element-caption\">Without much context, next-word prediction can become extremely difficult. 
Graphic by author.<\/figcaption><\/figure>\n\n\n\n<p><span style=\"font-weight: 400;\">To solve this problem, Recurrent Neural Networks (RNNs) and Long Short-Term Memory networks (LSTMs, the basis of many Seq2Seq models) allowed for feedback, or cycles. This meant that each computation was informed by the previous computation, allowing for more context.&nbsp;<\/span><\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">This context was still limited, however. If the input sequence was very long, the model would tend to forget the beginning of the sequence by the time it got to the end of the sequence. Their sequential nature also prevented parallelization, making them slow to train. RNNs also suffered notoriously from vanishing and exploding gradients.<\/span><\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-introducing-transformers\"><span style=\"font-weight: 400;\">Introducing transformers<\/span><\/h2>\n\n\n\n<p><span style=\"font-weight: 400;\">Transformers are sequence models that abandon the sequential structure of RNNs and LSTMs and adopt a fully attention-based approach. Transformers were initially developed for text processing, and are central to nearly all state-of-the-art NLP neural networks today, but they can also be used with image, video, audio, or virtually any other sequential data.&nbsp;<\/span><\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">The key feature distinguishing transformers from previous NLP models was the attention mechanism, as popularized in the <\/span><a href=\"https:\/\/arxiv.org\/abs\/1706.03762\"><span style=\"font-weight: 400;\">Attention Is All You Need<\/span><\/a><span style=\"font-weight: 400;\"> paper. Replacing recurrence with attention allowed for parallelization, which meant much faster training. 
Attention also allowed for much larger contexts than recurrence, meaning transformers could craft more coherent, relevant, and complex outputs.<\/span><\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full wp-image-6751\"><img loading=\"lazy\" decoding=\"async\" width=\"2420\" height=\"1006\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/Screen-Shot-2023-07-16-at-4.37.05-PM.png\" alt=\"The original transformer architecture, as visualized in the 2017 paper that made them famous, Attention Is All You Need.\" class=\"wp-image-6751\" srcset=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/Screen-Shot-2023-07-16-at-4.37.05-PM.png 2420w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/Screen-Shot-2023-07-16-at-4.37.05-PM-300x125.png 300w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/Screen-Shot-2023-07-16-at-4.37.05-PM-1024x426.png 1024w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/Screen-Shot-2023-07-16-at-4.37.05-PM-768x319.png 768w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/Screen-Shot-2023-07-16-at-4.37.05-PM-1536x639.png 1536w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/Screen-Shot-2023-07-16-at-4.37.05-PM-2048x851.png 2048w\" sizes=\"auto, (max-width: 2420px) 100vw, 2420px\" \/><figcaption class=\"wp-element-caption\">The original transformer architecture, as visualized in the 2017 paper that made them famous, <a href=\"https:\/\/arxiv.org\/pdf\/1706.03762.pdf\">Attention Is All You Need<\/a>.<\/figcaption><\/figure>\n\n\n\n<p><span style=\"font-weight: 400;\">Transformers are made up of encoders and decoders, and the tasks we can perform with them depend on whether we use either or both of these components. 
Some common transformer tasks for NLP include text classification, named entity recognition, question-answering, text summarization, fill-in-the-blanks, next word prediction, translation, and text generation.<\/span><\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter wp-image-6749\"><img loading=\"lazy\" decoding=\"async\" width=\"1676\" height=\"1258\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/Screen-Shot-2023-07-11-at-9.48.50-PM.png\" alt=\"Chart showing the three different types of transformers: encoder-only, decoder-only, and encoder-decoder models. Chart also lists tasks specific to each type of transformer, as well as examples and alternative names.\" class=\"wp-image-6749\" srcset=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/Screen-Shot-2023-07-11-at-9.48.50-PM.png 1676w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/Screen-Shot-2023-07-11-at-9.48.50-PM-300x225.png 300w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/Screen-Shot-2023-07-11-at-9.48.50-PM-1024x769.png 1024w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/Screen-Shot-2023-07-11-at-9.48.50-PM-768x576.png 768w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/Screen-Shot-2023-07-11-at-9.48.50-PM-1536x1153.png 1536w\" sizes=\"auto, (max-width: 1676px) 100vw, 1676px\" \/><figcaption class=\"wp-element-caption\">Transformers are made up of encoders and decoders, and the tasks we can perform with them depend on whether we use either or both of these components. Graphic by author.<\/figcaption><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-how-do-transformers-fit-into-the-larger-ecosystem-of-nlp-models\"><span style=\"font-weight: 400;\">How do transformers fit into the larger ecosystem of NLP models?<\/span><\/h2>\n\n\n\n<p><span style=\"font-weight: 400;\">You\u2019ve probably heard of Large Language Models (LLMs) like ChatGPT or LLaMA. 
The transformer architecture is a fundamental building block of LLMs, which use self-supervised learning on vast amounts of unlabelled data. These models are also sometimes referred to as \u201cfoundation models\u201d because they tend to generalize well to a wide range of tasks, and in some cases are also available for more specific fine-tuning. BERT is an example of this category of model.<\/span><\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full wp-image-6754\"><img loading=\"lazy\" decoding=\"async\" width=\"2778\" height=\"968\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/Screen-Shot-2023-07-13-at-6.37.03-PM.png\" alt=\"A graphic showing the relationship between transformer architectures, foundation models, and large language models. Graphic includes (as examples): ViT, BLOOM, BERT, Falcon, LLaMA, ChatGPT, and SAM (Segment Anything Model).\" class=\"wp-image-6754\" srcset=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/Screen-Shot-2023-07-13-at-6.37.03-PM.png 2778w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/Screen-Shot-2023-07-13-at-6.37.03-PM-300x105.png 300w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/Screen-Shot-2023-07-13-at-6.37.03-PM-1024x357.png 1024w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/Screen-Shot-2023-07-13-at-6.37.03-PM-768x268.png 768w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/Screen-Shot-2023-07-13-at-6.37.03-PM-1536x535.png 1536w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/Screen-Shot-2023-07-13-at-6.37.03-PM-2048x714.png 2048w\" sizes=\"auto, (max-width: 2778px) 100vw, 2778px\" \/><figcaption class=\"wp-element-caption\">Not all LLMs or foundation models use transformers, but they usually do. Likewise, not all foundation models are LLMs, but they usually are. Finally, not all transformers are LLMs or FMs. The important takeaway is that all transformer models use attention. 
Graphic by author.<\/figcaption><\/figure>\n\n\n\n<p><span style=\"font-weight: 400;\">That\u2019s a lot of information, but the important takeaway here is that the key differentiating feature of the transformer model (and by extension all transformer-based foundation models and LLMs) is the concept of <\/span><b>self-attention<\/b><span style=\"font-weight: 400;\">, which we\u2019ll go over next. <\/span><\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-attention\"><span style=\"font-weight: 400;\">Attention<\/span><\/h2>\n\n\n\n<p><span style=\"font-weight: 400;\">Generally speaking, attention describes the ability of a model to focus on the important parts of a sentence (or image, or any other sequential input). It does this by assigning weights to input features based on their importance and their position in the sequence.&nbsp;<\/span><\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">Remember that attention improved on previous NLP models (like RNNs and LSTMs) by lending itself to parallelization. But attention isn\u2019t just about optimization. It also plays a pivotal role in broadening the context a language model is able to consider while processing and generating language. This enables a model to produce contextually appropriate and coherent text over much longer sequences.<\/span><\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter wp-image-6757\"><img loading=\"lazy\" decoding=\"async\" width=\"1756\" height=\"1186\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/theanimaldidntcrossthestreetbecauseitwastooscared.png\" alt=\"A graphic showing the BertViz representation of the sentence &quot;the animal didn't cross the street because it was too scared.&quot; The last word, scared, was predicted by the GPT-2 model. 
The graphic shows that GPT-2 associates &quot;it&quot; with the animal.\" class=\"wp-image-6757\" srcset=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/theanimaldidntcrossthestreetbecauseitwastooscared.png 1756w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/theanimaldidntcrossthestreetbecauseitwastooscared-300x203.png 300w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/theanimaldidntcrossthestreetbecauseitwastooscared-1024x692.png 1024w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/theanimaldidntcrossthestreetbecauseitwastooscared-768x519.png 768w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/theanimaldidntcrossthestreetbecauseitwastooscared-1536x1037.png 1536w\" sizes=\"auto, (max-width: 1756px) 100vw, 1756px\" \/><figcaption class=\"wp-element-caption\">In this example, GPT-2 finished the input sequence with the word \u201cscared.\u201d How did the model know what \u201cit\u201d was? By examining the attention heads, we learn the model associated \u201cit\u201d with \u201cthe animal\u201d (instead of, for example, \u201cthe street\u201d). Image by author.<\/figcaption><\/figure>\n\n\n\n<p><span style=\"font-weight: 400;\">If we break transformers down into a \u201ccommunication\u201d phase and a \u201ccomputation\u201d phase, attention would represent the \u201ccommunication\u201d phase. 
In another analogy, attention is a lot like a search-retrieval problem, where given a <\/span><b>query<\/b><span style=\"font-weight: 400;\">, <\/span><i><span style=\"font-weight: 400;\">q<\/span><\/i><span style=\"font-weight: 400;\">, we want to find the set of <\/span><b>keys<\/b><span style=\"font-weight: 400;\">, <\/span><i><span style=\"font-weight: 400;\">k<\/span><\/i><span style=\"font-weight: 400;\">, most similar to <\/span><i><span style=\"font-weight: 400;\">q<\/span><\/i><span style=\"font-weight: 400;\"> and return the corresponding <\/span><b>values<\/b><span style=\"font-weight: 400;\">, <\/span><i><span style=\"font-weight: 400;\">v<\/span><\/i><span style=\"font-weight: 400;\">.<\/span><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><b>Query:<\/b><span style=\"font-weight: 400;\"> What are the things I am looking for?<\/span><\/li>\n\n\n\n<li><b>Key:<\/b><span style=\"font-weight: 400;\"> What are the things that I have?<\/span><\/li>\n\n\n\n<li><b>Value:<\/b><span style=\"font-weight: 400;\"> What are the things that I will communicate?&nbsp;<\/span><\/li>\n<\/ul>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full wp-image-6759\"><img loading=\"lazy\" decoding=\"async\" width=\"1924\" height=\"602\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/attentioncalc-ericstorrs.png\" alt=\"A visualization of how to calculate attention for transformers\" class=\"wp-image-6759\" srcset=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/attentioncalc-ericstorrs.png 1924w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/attentioncalc-ericstorrs-300x94.png 300w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/attentioncalc-ericstorrs-1024x320.png 1024w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/attentioncalc-ericstorrs-768x240.png 768w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/attentioncalc-ericstorrs-1536x481.png 1536w\" sizes=\"auto, (max-width: 
1924px) 100vw, 1924px\" \/><figcaption class=\"wp-element-caption\">Visualization of attention calculation. Image from <a href=\"https:\/\/storrs.io\/attention\/\">Erik Storrs<\/a>.<\/figcaption><\/figure>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"h-types-of-attention\">Types of attention<\/h3>\n\n\n\n<p><b>Self-attention<\/b><span style=\"font-weight: 400;\"> means that every node (token) in a sequence produces its own query, key, and value, and attends to the other nodes in that same sequence. <\/span><b>Multi-headed attention<\/b><span style=\"font-weight: 400;\"> is just self-attention applied multiple times in parallel, each time with differently initialized (and separately learned) weights. <\/span><b>Cross-attention<\/b><span style=\"font-weight: 400;\"> means that the queries are still produced from a given decoder node, but the keys and the values are produced as a function of the nodes in the encoder.<\/span><\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">This is an oversimplified summary of transformer architectures, and we\u2019ve glossed over quite a few details (like <\/span><a href=\"https:\/\/machinelearningmastery.com\/a-gentle-introduction-to-positional-encoding-in-transformer-models-part-1\/\"><span style=\"font-weight: 400;\">positional encodings<\/span><\/a><span style=\"font-weight: 400;\"> and <\/span><a href=\"https:\/\/lukesalamone.github.io\/posts\/what-are-attention-masks\/\"><span style=\"font-weight: 400;\">attention masks<\/span><\/a><span style=\"font-weight: 400;\">). 
For more information, check out the additional resources below.<\/span><\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-visualizing-attention-before-bertviz\"><span style=\"font-weight: 400;\">Visualizing attention before BertViz<\/span><\/h2>\n\n\n\n<p><span style=\"font-weight: 400;\">Transformers are not <\/span><a href=\"https:\/\/arxiv.org\/pdf\/2203.17081.pdf\"><span style=\"font-weight: 400;\">inherently interpretable<\/span><\/a><span style=\"font-weight: 400;\">, but there have been many attempts to build <\/span><a href=\"https:\/\/arxiv.org\/pdf\/2203.17081.pdf\"><span style=\"font-weight: 400;\">post-hoc explainability<\/span><\/a><span style=\"font-weight: 400;\"> tools for attention-based models.&nbsp;<\/span><\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">Previous attempts to visualize attention were often overly complicated and didn\u2019t translate well to non-technical audiences. They could also vary greatly from project to project and use case to use case.<\/span><\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter wp-image-6760 size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"2142\" height=\"1290\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/ugly-old-viz.png\" alt=\"A compilation of some very confusing and complicated previous attempts to visualize attention\" class=\"wp-image-6760\" srcset=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/ugly-old-viz.png 2142w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/ugly-old-viz-300x181.png 300w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/ugly-old-viz-1024x617.png 1024w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/ugly-old-viz-768x463.png 768w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/ugly-old-viz-1536x925.png 1536w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/ugly-old-viz-2048x1233.png 2048w\" sizes=\"auto, (max-width: 2142px) 100vw, 2142px\" 
\/><figcaption class=\"wp-element-caption\">Previous attempts to visualize attention weren\u2019t standardized and were often overly confusing. Graphic compiled by author from <a href=\"https:\/\/aclanthology.org\/D17-2021\/\">Interactive visualization and manipulation of attention-based neural machine translation<\/a> (2017) and <a href=\"https:\/\/arxiv.org\/abs\/1804.09299\">A Visual Debugging Tool for Sequence-to-Sequence Models<\/a> (2018).<\/figcaption><\/figure>\n\n\n\n<p><span style=\"font-weight: 400;\">Some successful attempts to explain attention behavior included attention-matrix heat maps and bi-partite graph representations, both of which are still used today. But these methods also have some major limitations. <\/span><\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter wp-image-6763\"><img loading=\"lazy\" decoding=\"async\" width=\"1908\" height=\"1134\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/Screen-Shot-2023-07-11-at-5.43.47-PM.png\" alt=\"A graphic showing some methods of visualizing transformer attention other than BertViz\" class=\"wp-image-6763\" srcset=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/Screen-Shot-2023-07-11-at-5.43.47-PM.png 1908w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/Screen-Shot-2023-07-11-at-5.43.47-PM-300x178.png 300w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/Screen-Shot-2023-07-11-at-5.43.47-PM-1024x609.png 1024w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/Screen-Shot-2023-07-11-at-5.43.47-PM-768x456.png 768w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/Screen-Shot-2023-07-11-at-5.43.47-PM-1536x913.png 1536w\" sizes=\"auto, (max-width: 1908px) 100vw, 1908px\" \/><figcaption class=\"wp-element-caption\">The attention-matrix heatmap (left) shows us that the model is not translating word-for-word, but considering a larger context for word order. 
But it\u2019s missing a lot of the finer details of the attention mechanism.<\/figcaption><\/figure>\n\n\n\n<p><span style=\"font-weight: 400;\">BertViz ultimately gained popularity for its ability to illustrate low-level, granular details of self-attention, while still remaining remarkably simple and intuitive to use.&nbsp;<\/span><\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter wp-image-6737\"><img loading=\"lazy\" decoding=\"async\" width=\"678\" height=\"599\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/simple-pretty-gif.gif\" alt=\"GIF of BertViz Attention Head View, selecting transformer layer and attention format type, and selecting specific attention heads, as visualized in Comet ML\" class=\"wp-image-6737\"\/><figcaption class=\"wp-element-caption\">The BertViz attention head view, with controls for selecting the layer, the attention format, and individual attention heads. GIF by author.<\/figcaption><\/figure>\n\n\n\n<p><span style=\"font-weight: 400;\">That\u2019s a nice, clean visualization. But what are we actually looking at?<\/span><\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-how-bertviz-breaks-it-all-down\"><span style=\"font-weight: 400;\">How BertViz Breaks It All Down<\/span><\/h2>\n\n\n\n<p><span style=\"font-weight: 400;\">BertViz visualizes the attention mechanism at multiple <\/span><a href=\"https:\/\/arxiv.org\/pdf\/2203.17081.pdf\"><span style=\"font-weight: 400;\">local scales<\/span><\/a><span style=\"font-weight: 400;\">: the neuron level, attention-head level, and model level. 
Below we break down what that means, starting from the lowest, most granular level, and making our way up.<\/span><\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter wp-image-6765\"><img loading=\"lazy\" decoding=\"async\" width=\"2498\" height=\"832\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/Screen-Shot-2023-07-13-at-7.50.18-PM.png\" alt=\"A graphic showing the model view, attention head view, and neuron view of a transformer model using BertViz\" class=\"wp-image-6765\" srcset=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/Screen-Shot-2023-07-13-at-7.50.18-PM.png 2498w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/Screen-Shot-2023-07-13-at-7.50.18-PM-300x100.png 300w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/Screen-Shot-2023-07-13-at-7.50.18-PM-1024x341.png 1024w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/Screen-Shot-2023-07-13-at-7.50.18-PM-768x256.png 768w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/Screen-Shot-2023-07-13-at-7.50.18-PM-1536x512.png 1536w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/Screen-Shot-2023-07-13-at-7.50.18-PM-2048x682.png 2048w\" sizes=\"auto, (max-width: 2498px) 100vw, 2498px\" \/><figcaption class=\"wp-element-caption\">BertViz visualizes attention at multiple scales, including the model level, attention head level, and neuron level. Graphic by author.<\/figcaption><\/figure>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"h-visualizing-bertviz-with-comet\"><span style=\"font-weight: 400;\">Visualizing BertViz With Comet<\/span><\/h3>\n\n\n\n<p><span style=\"font-weight: 400;\">We\u2019ll log our BertViz plots to Comet, an experiment tracking tool, so we can compare our results later on. 
To get started with Comet, <\/span><a href=\"\/signup\/?utm_source=Comet_blog&amp;utm_medium=referral&amp;utm_content=VisualizingAttention_blog\"><span style=\"font-weight: 400;\">create a free account here<\/span><\/a><span style=\"font-weight: 400;\">, grab your API key, and run the following code:<\/span><\/p>\n\n\n\n<p><script src=\"https:\/\/gist.github.com\/anmorgan24\/2d8d1614c9c8d8298d42e0d9f302e3b3.js\"><\/script><\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">Visualizing attention in Comet will help us interpret our models\u2019 decisions by showing how they attend to different parts of the input. In this tutorial, we\u2019ll use these visualizations to compare and dissect the performance of several pre-trained LLMs. But these visualizations can also be used during fine-tuning for debugging purposes.<\/span><\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">To add BertViz to your dashboard, navigate to Comet\u2019s public panels and select either \u2018Transformers Model Viewer\u2019 or \u2018Transformers Attention Head Viewer.\u2019<\/span><\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter wp-image-6741\"><img loading=\"lazy\" decoding=\"async\" width=\"1493\" height=\"711\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/add-bertviz-to-dashboard2.gif\" alt=\"GIF showing how to add transformer model view of BertViz visualization to Comet UI dashboard.\" class=\"wp-image-6741\"\/><figcaption class=\"wp-element-caption\">To add BertViz to your Comet dashboard, select it from the public panels and adjust your view to your liking.<\/figcaption><\/figure>\n\n\n\n<p><span style=\"font-weight: 400;\">We\u2019ll define some functions to parse our models\u2019 results and log the attention information to Comet. 
See the <\/span><a href=\"https:\/\/colab.research.google.com\/drive\/1WvIHAaXjWK-kRzmB_lLjNx8wJYuUnhCn#scrollTo=k6FQL8UuKXd_\"><span style=\"font-weight: 400;\">Colab tutorial<\/span><\/a><span style=\"font-weight: 400;\"> to get the full code used. Then, we\u2019ll run the following commands to start logging our data to Comet:<\/span><\/p>\n\n\n\n<h4 class=\"wp-block-heading\" id=\"h-text-generation-example\"><span style=\"font-weight: 400;\">Text generation example<\/span><\/h4>\n\n\n\n<p><script src=\"https:\/\/gist.github.com\/anmorgan24\/056d55a66e9f06f2e47e716450cf3d15.js\"><\/script><\/p>\n\n\n\n<h4 class=\"wp-block-heading\" id=\"h-question-answering-example\"><span style=\"font-weight: 400;\">Question-answering example<\/span><\/h4>\n\n\n\n<p><script src=\"https:\/\/gist.github.com\/anmorgan24\/f028d21b670fea029f16a6e6dfde1d3b.js\"><\/script><\/p>\n\n\n\n<h4 class=\"wp-block-heading\" id=\"h-sentiment-analysis-example\"><span style=\"font-weight: 400;\">Sentiment analysis example<\/span><\/h4>\n\n\n\n<p><script src=\"https:\/\/gist.github.com\/anmorgan24\/d9e53bf886c5e5eb591be5acde992207.js\"><\/script><\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"h-neuron-view\"><span style=\"font-weight: 400;\">Neuron View<\/span><\/h3>\n\n\n\n<p><span style=\"font-weight: 400;\">At the lowest level, BertViz visualizes the query, key, and value embeddings used to compute attention in a neuron. Given a selected token, this view traces the computation of attention from that token to the other tokens in the sequence.&nbsp;<\/span><\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">In the GIF below, positive values are colored blue and negative values are colored orange, with color intensity reflecting the magnitude of the value. 
Connecting lines are weighted based on the attention score between respective words.<\/span><\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full wp-image-6732\"><img loading=\"lazy\" decoding=\"async\" width=\"1040\" height=\"553\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/4neuron-view2.gif\" alt=\"A short GIF demonstrating how to use BertViz to visualize the computations at the neuron level of the attention layer for our transformer experiment in Comet ML.\" class=\"wp-image-6732\"\/><figcaption class=\"wp-element-caption\">The neuron view breaks down the calculations used to predict each token, including the <a href=\"https:\/\/jalammar.github.io\/illustrated-transformer\/\">key and query weights<\/a>.<\/figcaption><\/figure>\n\n\n\n<p><span style=\"font-weight: 400;\">Whereas the views in the following two sections will show <\/span><i><span style=\"font-weight: 400;\">what<\/span><\/i><span style=\"font-weight: 400;\"> attention patterns the model learns, this neuron view shows <\/span><i><span style=\"font-weight: 400;\">how<\/span><\/i><span style=\"font-weight: 400;\"> those patterns are computed. The neuron view is a bit more granular than we need for this particular tutorial, but for a deeper dive, we could use this view to link neurons to specific attention patterns and, more generally, to model behavior.&nbsp;<\/span><\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">It\u2019s important to note that it isn\u2019t entirely clear what relationships exist between attention weights and model outputs. Some, like Jain and Wallace in <\/span><a href=\"https:\/\/paperswithcode.com\/paper\/attention-is-not-explanation\"><span style=\"font-weight: 400;\">Attention Is Not Explanation<\/span><\/a><span style=\"font-weight: 400;\">, claim that standard attention modules should not be treated as though they provide meaningful explanations for predictions. 
They propose no alternatives, however, and BertViz remains one of the most popular attention visualization tools today.<\/span><\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"h-head-view\"><span style=\"font-weight: 400;\">Head View<\/span><\/h3>\n\n\n\n<p><span style=\"font-weight: 400;\">The attention-head view shows how attention flows between tokens within the same transformer layer by uncovering patterns between attention heads. In this view, the tokens on the left are attending to the tokens on the right and attention is represented as a line connecting each token pair. Colors correspond to attention heads and line thickness represents the attention weight.&nbsp;<\/span><\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">In the drop-down menu, we can select the experiment we\u2019d like to visualize, and if we logged more than one asset to our experiment, we can also select our asset. We can then choose which attention layer we\u2019d like to visualize and, optionally, we can choose any combination of attention heads we\u2019d like to see. Note that color intensity of the lines connecting tokens corresponds to the attention weights between tokens.<\/span><\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full wp-image-6734\"><img loading=\"lazy\" decoding=\"async\" width=\"1053\" height=\"553\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/select-exp-asset-layer2.gif\" alt=\"BertViz interactive visualization, as plotted within the Comet UI. Select experiment, asset, transformer model layer, and attention format.\" class=\"wp-image-6734\"\/><figcaption class=\"wp-element-caption\">Users have the option to specify the experiment, asset, layer, and attention format within the Comet UI.<\/figcaption><\/figure>\n\n\n\n<p><span style=\"font-weight: 400;\">We can also specify how we\u2019d like our tokens to be formatted. 
For the question-answering example below, we\u2019ll select \u201cSentence A \u2192 Sentence B\u201d so we can examine the attention between question and answer:<\/span><\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter wp-image-6767\"><img loading=\"lazy\" decoding=\"async\" width=\"1508\" height=\"1386\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/Screen-Shot-2023-07-11-at-7.45.59-PM.png\" alt=\"A BertViz visualization of attention with different sentence structure comparisons\" class=\"wp-image-6767\" srcset=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/Screen-Shot-2023-07-11-at-7.45.59-PM.png 1508w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/Screen-Shot-2023-07-11-at-7.45.59-PM-300x276.png 300w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/Screen-Shot-2023-07-11-at-7.45.59-PM-1024x941.png 1024w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/Screen-Shot-2023-07-11-at-7.45.59-PM-768x706.png 768w\" sizes=\"auto, (max-width: 1508px) 100vw, 1508px\" \/><figcaption class=\"wp-element-caption\">Three different ways to visualize the attention output of BertViz. Graphic by author.<\/figcaption><\/figure>\n\n\n\n<h4 class=\"wp-block-heading\" id=\"h-attention-head-patterns\">Attention head patterns<\/h4>\n\n\n\n<p><span style=\"font-weight: 400;\">Attention heads do not share parameters, so each head learns a unique attention mechanism. In the graphic below, attention heads are examined across layers of the same model given one input. We can see that different attention heads seem to focus on quite different patterns.&nbsp;<\/span><\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">On the top left, attention is strongest between identical words (note the crossover where the two instances of \u201cthe\u201d intersect). In the top center, there\u2019s a focus on the next word in the sentence. 
On the top right and bottom left, the attention heads are focusing on each of the delimiters ([SEP] and [CLS], respectively). The bottom center places emphasis on the comma. And the bottom right is almost a bag-of-words pattern.<\/span><\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full wp-image-6771\"><img loading=\"lazy\" decoding=\"async\" width=\"1990\" height=\"1326\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/Screen-Shot-2023-07-14-at-7.37.21-PM.png\" alt=\"BertViz shows that transformer attention captures various patterns in language, including positional patterns, delimiter patterns, and bag-of-words. \" class=\"wp-image-6771\" srcset=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/Screen-Shot-2023-07-14-at-7.37.21-PM.png 1990w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/Screen-Shot-2023-07-14-at-7.37.21-PM-300x200.png 300w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/Screen-Shot-2023-07-14-at-7.37.21-PM-1024x682.png 1024w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/Screen-Shot-2023-07-14-at-7.37.21-PM-768x512.png 768w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/Screen-Shot-2023-07-14-at-7.37.21-PM-1536x1023.png 1536w\" sizes=\"auto, (max-width: 1990px) 100vw, 1990px\" \/><figcaption class=\"wp-element-caption\">BertViz shows that attention captures various patterns in language, including positional patterns, delimiter patterns, and bag-of-words. Image by author.<\/figcaption><\/figure>\n\n\n\n<p><span style=\"font-weight: 400;\">Attention heads also capture lexical patterns. 
In the following graphic, we can see examples of attention heads that focus on list items (left), verbs (center), and acronyms (on the right).&nbsp;<\/span><\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full wp-image-6772\"><img loading=\"lazy\" decoding=\"async\" width=\"2214\" height=\"792\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/Screen-Shot-2023-07-13-at-9.51.33-PM.png\" alt=\"BertViz shows transformer attention heads capture lexical patterns like list items, verbs, and acronyms. \" class=\"wp-image-6772\" srcset=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/Screen-Shot-2023-07-13-at-9.51.33-PM.png 2214w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/Screen-Shot-2023-07-13-at-9.51.33-PM-300x107.png 300w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/Screen-Shot-2023-07-13-at-9.51.33-PM-1024x366.png 1024w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/Screen-Shot-2023-07-13-at-9.51.33-PM-768x275.png 768w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/Screen-Shot-2023-07-13-at-9.51.33-PM-1536x549.png 1536w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/Screen-Shot-2023-07-13-at-9.51.33-PM-2048x733.png 2048w\" sizes=\"auto, (max-width: 2214px) 100vw, 2214px\" \/><figcaption class=\"wp-element-caption\">BertViz shows attention heads capture lexical patterns like list items, verbs, and acronyms. Image by author.<\/figcaption><\/figure>\n\n\n\n<h4 class=\"wp-block-heading\" id=\"h-attention-head-biases\">Attention head biases<\/h4>\n\n\n\n<p><span style=\"font-weight: 400;\">One application of the head view is detecting model bias. 
If we provide our model (in this case GPT-2) with two inputs that are identical except for the final pronouns, we get very different generated outputs:<\/span><\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full wp-image-6773\"><img loading=\"lazy\" decoding=\"async\" width=\"2430\" height=\"984\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/Screen-Shot-2023-07-13-at-11.38.08-AM.png\" alt=\"BertViz can help capture model bias in transformer attention mechanisms\" class=\"wp-image-6773\" srcset=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/Screen-Shot-2023-07-13-at-11.38.08-AM.png 2430w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/Screen-Shot-2023-07-13-at-11.38.08-AM-300x121.png 300w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/Screen-Shot-2023-07-13-at-11.38.08-AM-1024x415.png 1024w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/Screen-Shot-2023-07-13-at-11.38.08-AM-768x311.png 768w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/Screen-Shot-2023-07-13-at-11.38.08-AM-1536x622.png 1536w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/Screen-Shot-2023-07-13-at-11.38.08-AM-2048x829.png 2048w\" sizes=\"auto, (max-width: 2430px) 100vw, 2430px\" \/><figcaption class=\"wp-element-caption\">On the left, the model assumes \u201cshe\u201d is the nurse. On the right, it assumes \u201che\u201d is the doctor asking the question. Once we\u2019ve detected model bias, how might we augment our training data to counteract it? Image by author.<\/figcaption><\/figure>\n\n\n\n<p><span style=\"font-weight: 400;\">The model is assuming that \u201che\u201d refers to the doctor, and \u201cshe\u201d to the nurse, which might suggest that the co-reference mechanism is encoding gender bias. 
We would hope that by identifying a source of bias, we can potentially work to counteract it (perhaps with additional training data).<\/span><\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"h-model-view\"><span style=\"font-weight: 400;\">Model View<\/span><\/h3>\n\n\n\n<p><span style=\"font-weight: 400;\">The model view is a bird\u2019s-eye perspective of attention across all layers and heads. Here we may notice attention patterns across layers, illustrating the evolution of attention patterns from input to output. Each row of figures represents an attention layer and each column represents individual attention heads. To enlarge the figure for any particular head, we can simply click on it. Note that you can find the same line pattern in the model view as in the head view.<\/span><\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full wp-image-6779\"><img loading=\"lazy\" decoding=\"async\" width=\"1004\" height=\"717\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/enlarging-attention-head.gif\" alt=\"A GIF showing how to enlarge the attention head view in the Comet UI using the model view. \" class=\"wp-image-6779\"\/><figcaption class=\"wp-element-caption\">To enlarge an attention head in the model view, simply click on it. Notice how the attention pattern evolves across layers. Image by author.<\/figcaption><\/figure>\n\n\n\n<h4 class=\"wp-block-heading\" id=\"h-model-view-applications\">Model view applications<\/h4>\n\n\n\n<p><span style=\"font-weight: 400;\">So, how might we use the model view? Firstly, because each layer is initialized with separate, independent weights, the layers that focus on specific patterns for one sentence may focus on different patterns for another sentence. So we can\u2019t necessarily look at the same attention heads for the same patterns across experiment runs. With the model view we can more generally identify which layers may be focusing on areas of interest for a given sentence. 
Note that this is an inexact science and, as many have mentioned, \u201cif you look for it, you will find it.\u201d Nonetheless, this view does give us some interesting insight as to what the model <\/span><i><span style=\"font-weight: 400;\">may<\/span><\/i><span style=\"font-weight: 400;\"> be focusing on.&nbsp;<\/span><\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">In the image below, we use the same example from earlier in the tutorial (left). On the right is a slightly different version of the sentence. In both cases, GPT-2 generated the last word in the sentence. At first, it may seem silly to think the dog had too many plans to go to the park. But examining the attention heads shows us the model was probably referring to the \u201cpark\u201d as \u201ctoo busy.\u201d<\/span><\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full wp-image-6774\"><img loading=\"lazy\" decoding=\"async\" width=\"2392\" height=\"1220\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/Screen-Shot-2023-07-13-at-10.33.56-AM.png\" alt=\"BertViz helps unravel how a transformer understands language.\" class=\"wp-image-6774\" srcset=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/Screen-Shot-2023-07-13-at-10.33.56-AM.png 2392w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/Screen-Shot-2023-07-13-at-10.33.56-AM-300x153.png 300w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/Screen-Shot-2023-07-13-at-10.33.56-AM-1024x522.png 1024w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/Screen-Shot-2023-07-13-at-10.33.56-AM-768x392.png 768w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/Screen-Shot-2023-07-13-at-10.33.56-AM-1536x783.png 1536w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/Screen-Shot-2023-07-13-at-10.33.56-AM-2048x1045.png 2048w\" sizes=\"auto, (max-width: 2392px) 100vw, 2392px\" \/><figcaption class=\"wp-element-caption\">On the left, GPT-2 likely refers 
to \u201cthe animal\u201d when finishing the sentence with \u201cscared.\u201d On the right, it likely refers to \u201cthe park\u201d when it finishes the sentence with \u201cbusy.\u201d Image by author.<\/figcaption><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-explainability-in-ai\"><span style=\"font-weight: 400;\">Explainability in AI<\/span><\/h2>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full wp-image-6775\"><img loading=\"lazy\" decoding=\"async\" width=\"2300\" height=\"1002\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/Screen-Shot-2023-07-16-at-5.12.34-PM.png\" alt=\"A horizontal bar chart showing gender discrepancies in Amazon's hiring practices\" class=\"wp-image-6775\" srcset=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/Screen-Shot-2023-07-16-at-5.12.34-PM.png 2300w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/Screen-Shot-2023-07-16-at-5.12.34-PM-300x131.png 300w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/Screen-Shot-2023-07-16-at-5.12.34-PM-1024x446.png 1024w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/Screen-Shot-2023-07-16-at-5.12.34-PM-768x335.png 768w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/Screen-Shot-2023-07-16-at-5.12.34-PM-1536x669.png 1536w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/Screen-Shot-2023-07-16-at-5.12.34-PM-2048x892.png 2048w\" sizes=\"auto, (max-width: 2300px) 100vw, 2300px\" \/><figcaption class=\"wp-element-caption\">In 2018, Amazon scrapped a job applicant recommender system they had spent four years building, after realizing the model exhibited significant gender bias. The model had learned existing gender discrepancies in hiring practices and learned to perpetuate them. 
Image from <a href=\"https:\/\/www.reuters.com\/article\/us-amazon-com-jobs-automation-insight-idUSKCN1MK08G\">Reuters.<\/a><\/figcaption><\/figure>\n\n\n\n<p><span style=\"font-weight: 400;\">As AI becomes more advanced, model calculations can become nearly impossible to interpret, even for the engineers and researchers who create them. This can lead to a whole host of unintended consequences, including perpetuation of bias and stereotypes, distrust in organizational decision-making, and even legal ramifications. Explainable Artificial Intelligence (XAI) is a set of processes used to describe a model\u2019s expected impact and potential biases. A commitment to XAI helps:<\/span><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><span style=\"font-weight: 400;\">Organizations adopt a responsible approach to AI development<\/span><\/li>\n\n\n\n<li><span style=\"font-weight: 400;\">Developers ensure a model is working as expected and meets regulatory requirements&nbsp;<\/span><\/li>\n\n\n\n<li><span style=\"font-weight: 400;\">Researchers characterize accuracy, fairness, and transparency for decision-making<\/span><\/li>\n\n\n\n<li><span style=\"font-weight: 400;\">Organizations build trust and confidence<\/span><\/li>\n<\/ul>\n\n\n\n<p><span style=\"font-weight: 400;\">So how can practitioners incorporate XAI practices into their workflows when the most popular ML architectures today, transformers, are notoriously opaque? The answer to this question isn\u2019t simple, and explainability must be approached from many different angles. But we hope this tutorial gives you one more tool in your XAI toolbox by helping you visualize attention in transformers.<\/span><\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-conclusion\"><span style=\"font-weight: 400;\">Conclusion<\/span><\/h2>\n\n\n\n<p><span style=\"font-weight: 400;\">Thanks for making it all the way to the end, and we hope you enjoyed this article. 
Feel free to connect with us on our <\/span><a href=\"https:\/\/cometml.slack.com\/join\/shared_invite\/enQtMzM0OTMwNTQ0Mjc5LWE4NzcxMzdiMmFjYzEzM2E5OTczOTk1MDZmZDg2MGJmODUwYWI0YWQ0YWMyMjlmMjQ5YmVmNzEyYjNlNzFhNjQ#\/shared-invite\/email\"><span style=\"font-weight: 400;\">Community Slack channel<\/span><\/a><span style=\"font-weight: 400;\"> with any questions, comments, or suggestions!<\/span><\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"h-additional-resources\"><span style=\"font-weight: 400;\">Additional Resources<\/span><\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><span style=\"font-weight: 400;\"><a href=\"http:\/\/jalammar.github.io\/illustrated-transformer\/\">The Illustrated Transformer<\/a>&nbsp;<\/span><\/li>\n\n\n\n<li><span style=\"font-weight: 400;\"><a href=\"http:\/\/jalammar.github.io\/illustrated-bert\/\">The Illustrated BERT<\/a><\/span><\/li>\n\n\n\n<li><span style=\"font-weight: 400;\"><a href=\"https:\/\/www.topbots.com\/deconstructing-bert-part-1\/\">Deconstructing BERT part 1<\/a><\/span><\/li>\n\n\n\n<li><span style=\"font-weight: 400;\"><a href=\"https:\/\/www.topbots.com\/deconstructing-bert-part-2\/\">Deconstructing BERT part 2<\/a><\/span><\/li>\n\n\n\n<li><span style=\"font-weight: 400;\"><a href=\"https:\/\/aclanthology.org\/P19-3007.pdf\">A Multi-scale Visualization of Attention in the Transformer Model<\/a>&nbsp;<\/span><\/li>\n\n\n\n<li><span style=\"font-weight: 400;\"><a href=\"https:\/\/debug-ml-iclr2019.github.io\/cameraready\/DebugML-19_paper_2.pdf\">BertViz: A Tool For Visualizing Multi-Head Self-Attention in the BERT Model<\/a>&nbsp;<\/span><\/li>\n\n\n\n<li><span style=\"font-weight: 400;\"><a href=\"https:\/\/www.youtube.com\/watch?v=XfpMkf4rD6E\">Stanford\u2019s CS25: Introduction to Transformers with Andrej Karpathy<\/a>&nbsp;<\/span><\/li>\n\n\n\n<li><span style=\"font-weight: 400;\"><a href=\"https:\/\/www.youtube.com\/watch?v=qGkzHFllWDY\">Stanford\u2019s CS25: Transformers in Language with Mark 
Chen<\/a>&nbsp;<\/span><\/li>\n\n\n\n<li><span style=\"font-weight: 400;\"><a href=\"https:\/\/www.coursera.org\/specializations\/natural-language-processing\">DeepLearning AI\u2019s Natural Language Processing Specialization<\/a> <\/span><\/li>\n\n\n\n<li><a href=\"https:\/\/www.oreilly.com\/library\/view\/natural-language-processing\/9781491978221\/?_gl=1*14hv7ni*_ga*MTY1MzAzODY1MS4xNjg4NTAxMDYz*_ga_092EL089CH*MTY4ODUwMTA2My4xLjEuMTY4ODUwMTE3OC41OS4wLjA\">Natural Language Processing with PyTorch<\/a> by Delip Rao, Brian McMahan<span style=\"font-weight: 400;\">&nbsp;<\/span><\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>In this article we explore one of the most popular tools for visualizing the core distinguishing feature of transformer architectures: the attention mechanism. Keep reading to learn more about BertViz and how you can incorporate this attention visualization tool into your NLP and MLOps workflow with Comet.&nbsp;&nbsp; Feel free to follow along with the full-code [&hellip;]<\/p>\n","protected":false},"author":22,"featured_media":17100,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"customer_name":"","customer_description":"","customer_industry":"","customer_technologies":"","customer_logo":"","footnotes":""},"categories":[8,6,7],"tags":[49,40,14,30,15,50,51,52,31,16,53,32,54,55],"coauthors":[133],"class_list":["post-6713","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-comet-community-hub","category-machine-learning","category-tutorials","tag-attention-mechanism","tag-comet","tag-comet-ml","tag-deep-learning","tag-deep-learning-experiment-management","tag-explainable-ai","tag-huggingface","tag-llm","tag-llmops","tag-ml-experiment-management","tag-mlops","tag-nlp","tag-self-attention","tag-transformers"],"yoast_head":"<!-- This site is optimized with the Yoast SEO Premium plugin v25.9 (Yoast SEO v25.9) - 
https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Explainable AI: Visualizing Attention in Transformers<\/title>\n<meta name=\"description\" content=\"Learn how to visualize the attention of transformers and log your results to Comet, as we work towards explainability in AI.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.comet.com\/site\/blog\/explainable-ai-for-transformers\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Explainable AI: Visualizing Attention in Transformers\" \/>\n<meta property=\"og:description\" content=\"Learn how to visualize the attention of transformers and log your results to Comet, as we work towards explainability in AI.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.comet.com\/site\/blog\/explainable-ai-for-transformers\/\" \/>\n<meta property=\"og:site_name\" content=\"Comet\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/cometdotml\" \/>\n<meta property=\"article:published_time\" content=\"2023-07-16T22:37:03+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2025-06-18T10:28:01+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/explainable-ai-visualizing-attention-in-transformers-Large.jpeg\" \/>\n\t<meta property=\"og:image:width\" content=\"1280\" \/>\n\t<meta property=\"og:image:height\" content=\"759\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/jpeg\" \/>\n<meta name=\"author\" content=\"Abby Morgan\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@anmorgan2414\" \/>\n<meta name=\"twitter:site\" content=\"@Cometml\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" 
content=\"Abby Morgan\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"20 minutes\" \/>\n<!-- \/ Yoast SEO Premium plugin. -->","yoast_head_json":{"title":"Explainable AI: Visualizing Attention in Transformers","description":"Learn how to visualize the attention of transformers and log your results to Comet, as we work towards explainability in AI.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.comet.com\/site\/blog\/explainable-ai-for-transformers\/","og_locale":"en_US","og_type":"article","og_title":"Explainable AI: Visualizing Attention in Transformers","og_description":"Learn how to visualize the attention of transformers and log your results to Comet, as we work towards explainability in AI.","og_url":"https:\/\/www.comet.com\/site\/blog\/explainable-ai-for-transformers\/","og_site_name":"Comet","article_publisher":"https:\/\/www.facebook.com\/cometdotml","article_published_time":"2023-07-16T22:37:03+00:00","article_modified_time":"2025-06-18T10:28:01+00:00","og_image":[{"width":1280,"height":759,"url":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/explainable-ai-visualizing-attention-in-transformers-Large.jpeg","type":"image\/jpeg"}],"author":"Abby Morgan","twitter_card":"summary_large_image","twitter_creator":"@anmorgan2414","twitter_site":"@Cometml","twitter_misc":{"Written by":"Abby Morgan","Est. 
reading time":"20 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/www.comet.com\/site\/blog\/explainable-ai-for-transformers\/#article","isPartOf":{"@id":"https:\/\/www.comet.com\/site\/blog\/explainable-ai-for-transformers\/"},"author":{"name":"Abby Morgan","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/826ee39a2e30cf9d8d73155de09bb7b2"},"headline":"Explainable AI: Visualizing Attention in Transformers","datePublished":"2023-07-16T22:37:03+00:00","dateModified":"2025-06-18T10:28:01+00:00","mainEntityOfPage":{"@id":"https:\/\/www.comet.com\/site\/blog\/explainable-ai-for-transformers\/"},"wordCount":3323,"publisher":{"@id":"https:\/\/www.comet.com\/site\/#organization"},"image":{"@id":"https:\/\/www.comet.com\/site\/blog\/explainable-ai-for-transformers\/#primaryimage"},"thumbnailUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/explainable-ai-visualizing-attention-in-transformers-Large.jpeg","keywords":["Attention Mechanism","Comet","Comet ML","Deep Learning","Deep Learning Experiment Management","Explainable AI","HuggingFace","LLM","LLMOps","ML Experiment Management","MLOps","NLP","Self-attention","Transformers"],"articleSection":["Comet Community Hub","Machine Learning","Tutorials"],"inLanguage":"en-US"},{"@type":"WebPage","@id":"https:\/\/www.comet.com\/site\/blog\/explainable-ai-for-transformers\/","url":"https:\/\/www.comet.com\/site\/blog\/explainable-ai-for-transformers\/","name":"Explainable AI: Visualizing Attention in 
Transformers","isPartOf":{"@id":"https:\/\/www.comet.com\/site\/#website"},"primaryImageOfPage":{"@id":"https:\/\/www.comet.com\/site\/blog\/explainable-ai-for-transformers\/#primaryimage"},"image":{"@id":"https:\/\/www.comet.com\/site\/blog\/explainable-ai-for-transformers\/#primaryimage"},"thumbnailUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/explainable-ai-visualizing-attention-in-transformers-Large.jpeg","datePublished":"2023-07-16T22:37:03+00:00","dateModified":"2025-06-18T10:28:01+00:00","description":"Learn how to visualize the attention of transformers and log your results to Comet, as we work towards explainability in AI.","breadcrumb":{"@id":"https:\/\/www.comet.com\/site\/blog\/explainable-ai-for-transformers\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.comet.com\/site\/blog\/explainable-ai-for-transformers\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/blog\/explainable-ai-for-transformers\/#primaryimage","url":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/explainable-ai-visualizing-attention-in-transformers-Large.jpeg","contentUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/explainable-ai-visualizing-attention-in-transformers-Large.jpeg","width":1280,"height":759,"caption":"pink and blue robot on an orange background"},{"@type":"BreadcrumbList","@id":"https:\/\/www.comet.com\/site\/blog\/explainable-ai-for-transformers\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.comet.com\/site\/"},{"@type":"ListItem","position":2,"name":"Explainable AI: Visualizing Attention in Transformers"}]},{"@type":"WebSite","@id":"https:\/\/www.comet.com\/site\/#website","url":"https:\/\/www.comet.com\/site\/","name":"Comet","description":"Build Better Models 
Faster","publisher":{"@id":"https:\/\/www.comet.com\/site\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.comet.com\/site\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/www.comet.com\/site\/#organization","name":"Comet ML, Inc.","alternateName":"Comet","url":"https:\/\/www.comet.com\/site\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/#\/schema\/logo\/image\/","url":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/01\/logo_comet_square.png","contentUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/01\/logo_comet_square.png","width":310,"height":310,"caption":"Comet ML, Inc."},"image":{"@id":"https:\/\/www.comet.com\/site\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/cometdotml","https:\/\/x.com\/Cometml","https:\/\/www.youtube.com\/channel\/UCmN63HKvfXSCS-UwVwmK8Hw"]},{"@type":"Person","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/826ee39a2e30cf9d8d73155de09bb7b2","name":"Abby Morgan","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/image\/dbbf1ae921ee179c768f508340415946","url":"https:\/\/secure.gravatar.com\/avatar\/28d4934d14261b4afe12e226f0eaa57c4fb0c2761ad4586eb9a5bec3b8160bc9?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/28d4934d14261b4afe12e226f0eaa57c4fb0c2761ad4586eb9a5bec3b8160bc9?s=96&d=mm&r=g","caption":"Abby Morgan"},"description":"AI\/ML Growth Engineer @ 
Comet","sameAs":["https:\/\/www.comet.com\/","https:\/\/www.linkedin.com\/in\/anmorgan24\/","https:\/\/x.com\/anmorgan2414"],"url":"https:\/\/www.comet.com\/site\/blog\/author\/abigailmcomet-com\/"}]}},"_links":{"self":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/6713","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/users\/22"}],"replies":[{"embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/comments?post=6713"}],"version-history":[{"count":3,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/6713\/revisions"}],"predecessor-version":[{"id":17102,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/6713\/revisions\/17102"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/media\/17100"}],"wp:attachment":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/media?parent=6713"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/categories?post=6713"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/tags?post=6713"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/coauthors?post=6713"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}