{"id":6366,"date":"2023-06-25T13:30:08","date_gmt":"2023-06-25T21:30:08","guid":{"rendered":"https:\/\/live-cometml.pantheonsite.io\/?p=6366"},"modified":"2025-04-29T14:03:44","modified_gmt":"2025-04-29T14:03:44","slug":"sam-stable-diffusion-for-text-to-image-inpainting","status":"publish","type":"post","link":"https:\/\/www.comet.com\/site\/blog\/sam-stable-diffusion-for-text-to-image-inpainting\/","title":{"rendered":"SAM + Stable Diffusion for Text-to-Image Inpainting"},"content":{"rendered":"\n<p>In this article, we\u2019ll leverage the power of SAM, the first foundational model for computer vision, along with Stable Diffusion, a popular generative AI tool, to create a text-to-image inpainting pipeline that we&#8217;ll track in Comet. Feel free to follow along with the full code tutorial in <strong><a href=\"https:\/\/colab.research.google.com\/drive\/1B7L4cork9UFTtIB02EntjiZRLYuqJS2b\">this Colab<\/a><\/strong> and get the <strong><a href=\"https:\/\/www.kaggle.com\/datasets\/abbymorgan\/animals-toy-dataset\">Kaggle dataset here<\/a><\/strong>.<\/p>\n\n\n\n<p>Or, if you can\u2019t wait, check out the&nbsp;<a class=\"af nz\" href=\"https:\/\/www.comet.com\/examples\/demo-text-to-inpainting-sam-stablediffusion\/view\/bRnI022tXQUdKGsVCFmjFRRtT\/panels?utm_source=Medium&amp;utm_medium=referral&amp;utm_content=SAM+SD_blog\" target=\"_blank\" rel=\"noopener ugc nofollow\"><strong class=\"nf fx\">full public project here<\/strong><\/a>!<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter wp-image-6477 size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1080\" height=\"1080\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/06\/Untitled-design.png\" alt=\"Two side-by-side images. The lefthand image is the original image of a frog clinging to a tropical-colored flower with a bright, blurry background. 
The image on the right is the same image, only the frog has been replaced by an AI-generated koala bear with SAM + Stable Diffusion.\" class=\"wp-image-6477\" srcset=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/06\/Untitled-design.png 1080w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/06\/Untitled-design-300x300.png 300w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/06\/Untitled-design-1024x1024.png 1024w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/06\/Untitled-design-150x150.png 150w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/06\/Untitled-design-768x768.png 768w\" sizes=\"auto, (max-width: 1080px) 100vw, 1080px\" \/><figcaption class=\"wp-element-caption\">On the left, our original image of a frog. On the right, our output image where the frog has been replaced by a koala bear; image by author.<\/figcaption><\/figure>\n\n\n\n<div class=\"wp-block-buttons is-layout-flex wp-block-buttons-is-layout-flex\">\n<div class=\"wp-block-button\"><a class=\"wp-block-button__link wp-element-button\" href=\"https:\/\/colab.research.google.com\/drive\/1B7L4cork9UFTtIB02EntjiZRLYuqJS2b#scrollTo=LtZghyHoJabf\" target=\"_blank\" rel=\"noreferrer noopener\">Follow along with the Colab!<\/a><\/div>\n\n\n\n<div class=\"wp-block-button\"><a class=\"wp-block-button__link wp-element-button\" href=\"\/signup\/?utm_source=Comet_blog&amp;utm_medium=referral&amp;utm_content=SAM+SD_blog\" target=\"_blank\" rel=\"noreferrer noopener\">Create a free Comet account!<\/a><\/div>\n<\/div>\n\n\n\n<h2 class=\"wp-block-heading\">What is SAM?<\/h2>\n\n\n\n<p><span style=\"font-weight: 400;\">Earlier this year, Meta AI caused another huge stir in the computer vision community with the release of their new open-source project: the Segment Anything Model (SAM). But what makes SAM so special?&nbsp;<\/span><\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">SAM is a prompt-able segmentation system with results that are simply stunning. 
It excels in zero-shot generalization to unfamiliar objects and images without the need for additional training. It\u2019s also considered the first foundational model for computer vision, which is big news! We\u2019ll talk a little more about foundational models next.&nbsp;<\/span><\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">SAM was trained on a massive dataset of 11 million images with 1.1 billion segmentation masks, which Meta has also released publicly. But perhaps the best way to showcase SAM\u2019s groundbreaking capabilities is with a short demo:<\/span><\/p>\n\n\n\n<p>&nbsp;<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full wp-image-6380\"><img loading=\"lazy\" decoding=\"async\" width=\"700\" height=\"394\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/06\/1680787945839.gif\" alt=\"Meta AI Segment Anything Model (SAM) demo GIF\" class=\"wp-image-6380\"\/><figcaption class=\"wp-element-caption\"><span class=\"wpex-text-sm\">The Segment Anything Model (SAM) is famous for correctly identifying up to hundreds of individual masks per image. It can also generate multiple valid masks for ambiguous prompts; GIF from Meta AI.<\/span><\/figcaption><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\">What are foundational models?<\/h2>\n\n\n\n<p><span style=\"font-weight: 400;\">Foundation models are neural networks trained on massive unlabeled datasets to handle a wide variety of tasks. These powerful machine learning algorithms power many of the most popular Generative AI tools used today, including ChatGPT and BERT.&nbsp;<\/span><\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">Foundation models have made major strides in natural language processing, but until recently, haven\u2019t gained much traction in computer vision applications. That\u2019s because computer vision has struggled to find a task with semantically rich unsupervised pre-training, akin to predicting masked tokens for NLP. 
With SAM, Meta set out to change this.<\/span><\/p>\n\n\n\n<h2 class=\"wp-block-heading\">How to use SAM<\/h2>\n\n\n\n<p><span style=\"font-weight: 400;\">The Segment Anything Model requires no additional training, so all we need to do is provide a prompt that tells the model what to segment in a given input image. SAM accepts a variety of input prompt types, but some of the most common ones include:<\/span><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><span style=\"font-weight: 400;\">Prompting <\/span><a href=\"https:\/\/huggingface.co\/spaces\/yizhangliu\/Grounded-Segment-Anything\"><span style=\"font-weight: 400;\">interactively within a UI<\/span><\/a><\/li>\n\n\n\n<li><span style=\"font-weight: 400;\">Prompting <\/span><a href=\"https:\/\/github.com\/facebookresearch\/segment-anything\/blob\/main\/notebooks\/predictor_example.ipynb\"><span style=\"font-weight: 400;\">programmatically with points or boxes<\/span><\/a><\/li>\n\n\n\n<li><span style=\"font-weight: 400;\">Prompting with the <\/span><a href=\"https:\/\/github.com\/IDEA-Research\/Grounded-Segment-Anything\"><span style=\"font-weight: 400;\">bounding box coordinates generated from an object detection model<\/span><\/a><\/li>\n\n\n\n<li>Automatically <a style=\"font-family: var(--wpex-body-font-family, var(--wpex-font-sans)); font-size: var(--wpex-body-font-size, 13px);\" href=\"https:\/\/github.com\/facebookresearch\/segment-anything\/blob\/main\/notebooks\/automatic_mask_generator_example.ipynb\">segmenting everything in an image<\/a><\/li>\n<\/ul>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full wp-image-6389\"><img loading=\"lazy\" decoding=\"async\" width=\"1694\" height=\"1232\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/06\/Screen-Shot-2023-06-16-at-8.56.53-PM.png\" alt=\"A diagram showing the various types of input prompts SAM accepts, including points, bounding boxes, and text.\" class=\"wp-image-6389\" srcset=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/06\/Screen-Shot-2023-06-16-at-8.56.53-PM.png 1694w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/06\/Screen-Shot-2023-06-16-at-8.56.53-PM-300x218.png 300w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/06\/Screen-Shot-2023-06-16-at-8.56.53-PM-1024x745.png 1024w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/06\/Screen-Shot-2023-06-16-at-8.56.53-PM-768x559.png 768w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/06\/Screen-Shot-2023-06-16-at-8.56.53-PM-1536x1117.png 1536w\" sizes=\"auto, (max-width: 1694px) 100vw, 1694px\" \/><figcaption class=\"wp-element-caption\">SAM accepts various types of input including points, bounding boxes, and text; image by author<\/figcaption><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\">Project Overview: GroundingDINO + SAM + Stable Diffusion<\/h2>\n\n\n\n<p><span style=\"font-weight: 400;\">SAM doesn\u2019t just integrate well with different input types, however. SAM\u2019s output masks can also be used as inputs to other AI systems for even more complicated pipelines! 
In this tutorial, we\u2019ll demonstrate how to use SAM in conjunction with GroundingDINO and Stable Diffusion to create a pipeline that accepts text as input to perform image inpainting and outpainting with generative AI.<\/span><\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter\"><a href=\"https:\/\/colab.research.google.com\/drive\/1B7L4cork9UFTtIB02EntjiZRLYuqJS2b#scrollTo=3djVqDbQz4RO\"><img loading=\"lazy\" decoding=\"async\" width=\"2002\" height=\"888\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/06\/Screen-Shot-2023-06-19-at-2.40.49-PM.png\" alt=\"A diagram of our text-to-image inpainting pipeline, including GroundingDINO for object detection, Segment Anything for segmentation masks, and Stable Diffusion for image inpainting\/generative AI.\" class=\"wp-image-6387\" srcset=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/06\/Screen-Shot-2023-06-19-at-2.40.49-PM.png 2002w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/06\/Screen-Shot-2023-06-19-at-2.40.49-PM-300x133.png 300w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/06\/Screen-Shot-2023-06-19-at-2.40.49-PM-1024x454.png 1024w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/06\/Screen-Shot-2023-06-19-at-2.40.49-PM-768x341.png 768w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/06\/Screen-Shot-2023-06-19-at-2.40.49-PM-1536x681.png 1536w\" sizes=\"auto, (max-width: 2002px) 100vw, 2002px\" \/><\/a><figcaption class=\"wp-element-caption\">We\u2019ll create a pipeline using GroundingDINO, Segment Anything, and Stable Diffusion to perform image inpainting with text prompts; Image by author.<\/figcaption><\/figure>\n\n\n\n<p><span style=\"font-weight: 400;\">To do this, we\u2019ll be leveraging three separate models. First, we&#8217;ll use GroundingDINO to interpret our text input prompt and perform object detection for those input labels. 
Next, we&#8217;ll use SAM to segment the masks within those bounding box predictions. Finally, we&#8217;ll use the masks generated from SAM to isolate regions of the image for either inpainting or outpainting with Stable Diffusion. We\u2019ll also use Comet to log the images at each step in the pipeline so we can track exactly how we got from our input image to our output image.<\/span><\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">In the end, we should be able to provide an input image, a few input text prompts specifying what we\u2019d like the model to do, and end up with a transformation like the one below:<\/span><\/p>\n\n\n\n<p>&nbsp;<\/p>\n\n\n\n<p><script src=\"https:\/\/gist.github.com\/anmorgan24\/6a08b7700ee2ca54aeba732e3e1d95c3.js\"><\/script><\/p>\n\n\n\n<p>&nbsp;<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter wp-image-6394 size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"2344\" height=\"902\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/06\/Screen-Shot-2023-06-17-at-8.34.41-AM.png\" alt=\"Two images side-by-side. The one of the left is a picture of an orange fox on a snowy hill looking out into the distance and is captioned with word &quot;input.&quot; The image on the right is the same image, but generative artificial intelligence has been used to replace the fox with an image of a brown bulldog. The bulldog is realistic, but the viewer can tell it isn't real. 
Image made with SAM + Stable Diffusion.\" class=\"wp-image-6394\" srcset=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/06\/Screen-Shot-2023-06-17-at-8.34.41-AM.png 2344w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/06\/Screen-Shot-2023-06-17-at-8.34.41-AM-300x115.png 300w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/06\/Screen-Shot-2023-06-17-at-8.34.41-AM-1024x394.png 1024w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/06\/Screen-Shot-2023-06-17-at-8.34.41-AM-768x296.png 768w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/06\/Screen-Shot-2023-06-17-at-8.34.41-AM-1536x591.png 1536w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/06\/Screen-Shot-2023-06-17-at-8.34.41-AM-2048x788.png 2048w\" sizes=\"auto, (max-width: 2344px) 100vw, 2344px\" \/><figcaption class=\"wp-element-caption\">Our goal is to provide our pipeline with an image like the one on the left, and text prompt like the one above, and generate an output image like the one on the right; image by author.<\/figcaption><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\">Object detection with \ud83e\udd95GroundingDINO<\/h2>\n\n\n\n<p><span style=\"font-weight: 400;\">We\u2019ll use four example images in this tutorial, which can be downloaded from <\/span><a href=\"https:\/\/www.kaggle.com\/datasets\/abbymorgan\/animals-toy-dataset\"><span style=\"font-weight: 400;\">Kaggle here<\/span><\/a><span style=\"font-weight: 400;\">. 
These images were all taken from Unsplash and links to the original photographers can be found at the bottom of this blog.<\/span><\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter wp-image-6414\"><img loading=\"lazy\" decoding=\"async\" width=\"1306\" height=\"946\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/06\/Screen-Shot-2023-06-17-at-8.41.36-AM.png\" alt=\"\" class=\"wp-image-6414\" srcset=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/06\/Screen-Shot-2023-06-17-at-8.41.36-AM.png 1306w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/06\/Screen-Shot-2023-06-17-at-8.41.36-AM-300x217.png 300w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/06\/Screen-Shot-2023-06-17-at-8.41.36-AM-1024x742.png 1024w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/06\/Screen-Shot-2023-06-17-at-8.41.36-AM-768x556.png 768w\" sizes=\"auto, (max-width: 1306px) 100vw, 1306px\" \/><figcaption class=\"wp-element-caption\">Our toy dataset consists of four animal images; image by author.<\/figcaption><\/figure>\n\n\n\n<p><span style=\"font-weight: 400;\">Once our environment is set up, we start by defining our input image and providing a text prompt that specifies which objects we want to detect. Note the format of the text prompt and make sure to separate each object with a period. We don\u2019t have to choose from any particular categories here, so feel free to experiment with this prompt and add more categories if you\u2019d like.<\/span><\/p>\n\n\n\n<p>&nbsp;<\/p>\n\n\n\n<p><script src=\"https:\/\/gist.github.com\/anmorgan24\/adf2d866e68be253d4156e19ed26455d.js\"><\/script><\/p>\n\n\n\n<p>&nbsp;<\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">After some very simple preprocessing, we use the GroundingDINO model to predict bounding boxes for our input labels. We log these results to Comet to examine later. 
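As a quick illustration of the period-separated prompt format described above, here is a minimal, hypothetical helper (not part of the tutorial code) that joins a list of object labels into a single GroundingDINO-style text prompt:

```python
# Hypothetical helper: GroundingDINO expects one text prompt in which
# each object category is separated by a period, e.g. "dog . shirt ."
def build_text_prompt(labels):
    """Join object labels into a single period-separated prompt string."""
    return " . ".join(label.strip() for label in labels) + " ."

print(build_text_prompt(["dog", "shirt", "necklace"]))  # dog . shirt . necklace .
```

You can add or remove categories freely; the model isn't limited to any fixed label set.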
This way we\u2019ll be able to see the images at each step in the pipeline, which will not only help us understand the process, but will also help us debug if anything goes wrong.<\/span><\/p>\n\n\n\n<p>&nbsp;<\/p>\n\n\n\n<p><script src=\"https:\/\/gist.github.com\/anmorgan24\/731097336d6544d7725c0653d69de8ba.js\"><\/script><\/p>\n\n\n\n<p>&nbsp;<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter wp-image-6396\"><img loading=\"lazy\" decoding=\"async\" width=\"1910\" height=\"1286\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/06\/Screen-Shot-2023-06-16-at-7.48.14-PM.png\" alt=\"An image of two dogs sitting in front of a gray background, looking obediently at the photographer as if he is their owner. The dog on the left is a tan and white French Bulldog wearing a lavish gold and black bomber jacket. The dog on the right is a tan, brown, and white Shih Tzu wearing a flamboyant hot pink and black bomber jacket with a heavy gold chain necklace. \" class=\"wp-image-6396\" srcset=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/06\/Screen-Shot-2023-06-16-at-7.48.14-PM.png 1910w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/06\/Screen-Shot-2023-06-16-at-7.48.14-PM-300x202.png 300w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/06\/Screen-Shot-2023-06-16-at-7.48.14-PM-1024x689.png 1024w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/06\/Screen-Shot-2023-06-16-at-7.48.14-PM-768x517.png 768w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/06\/Screen-Shot-2023-06-16-at-7.48.14-PM-1536x1034.png 1536w\" sizes=\"auto, (max-width: 1910px) 100vw, 1910px\" \/><figcaption class=\"wp-element-caption\">Our original image with the predicted bounding boxes, as visualized in Comet; image by author<\/figcaption><\/figure>\n\n\n\n<p><span style=\"font-weight: 400;\">We will now use these bounding box coordinates to indicate which items we would like to segment in SAM. 
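One practical detail worth noting before the next step: GroundingDINO typically returns boxes as normalized (cx, cy, w, h) values, while SAM's box prompt expects absolute-pixel (x1, y1, x2, y2) corners. A minimal sketch of that conversion (assuming this box convention; double-check your model's output format):

```python
def dino_box_to_sam(box, width, height):
    """Convert a normalized (cx, cy, w, h) box to absolute (x1, y1, x2, y2)."""
    cx, cy, w, h = box
    return (
        (cx - w / 2) * width,   # x1: left edge in pixels
        (cy - h / 2) * height,  # y1: top edge in pixels
        (cx + w / 2) * width,   # x2: right edge in pixels
        (cy + h / 2) * height,  # y2: bottom edge in pixels
    )

print(dino_box_to_sam((0.5, 0.5, 0.5, 0.5), 100, 100))  # (25.0, 25.0, 75.0, 75.0)
```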
<\/span><\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Masks with SAM<\/h2>\n\n\n\n<p><span style=\"font-weight: 400;\">As mentioned, SAM can either detect all masks automatically within an image, or it can accept prompts that guide it to only detect specific masks within an image. Now that we have our bounding box predictions, we\u2019ll use these coordinates as input prompts to SAM and plot the resulting list of binary masks:&nbsp;<\/span><\/p>\n\n\n\n<p>&nbsp;<\/p>\n\n\n\n<p><script src=\"https:\/\/gist.github.com\/anmorgan24\/96586d19ee8159a8269fbc49fc9ce444.js\"><\/script><\/p>\n\n\n\n<p>&nbsp;<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full wp-image-6397\"><img loading=\"lazy\" decoding=\"async\" width=\"1866\" height=\"1070\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/06\/Screen-Shot-2023-06-18-at-3.40.43-PM.png\" alt=\"A grid of six black and white binary masks corresponding to the six objects detected in the original picture of two dogs. From the top left is the lefthand dog, labeled &quot;mask 0,&quot; the righthand dog, labeled &quot;mask 1&quot;, the shirt on the righthand dog, labeled &quot;mask 2&quot;, the necklace on the righthand dog, marked &quot;mask 3,&quot; the background of the picture, marked &quot;mask 4,&quot; and finally the shirt of the lefthand dog, marked &quot;mask 5.&quot;\" class=\"wp-image-6397\" srcset=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/06\/Screen-Shot-2023-06-18-at-3.40.43-PM.png 1866w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/06\/Screen-Shot-2023-06-18-at-3.40.43-PM-300x172.png 300w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/06\/Screen-Shot-2023-06-18-at-3.40.43-PM-1024x587.png 1024w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/06\/Screen-Shot-2023-06-18-at-3.40.43-PM-768x440.png 768w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/06\/Screen-Shot-2023-06-18-at-3.40.43-PM-1536x881.png 1536w\" sizes=\"auto, 
(max-width: 1866px) 100vw, 1866px\" \/><figcaption class=\"wp-element-caption\">The binary masks generated from SAM. Note that the number corresponds to the position of the mask in the list of masks; image by author.<\/figcaption><\/figure>\n\n\n\n<p><span style=\"font-weight: 400;\">Note that by default, SAM has performed instance segmentation, rather than semantic segmentation, which gives us a lot more flexibility when it comes time for inpainting. Let\u2019s also visualize these masks within the Comet UI:<\/span><\/p>\n\n\n\n<p>&nbsp;<\/p>\n\n\n\n<p><script src=\"https:\/\/gist.github.com\/anmorgan24\/a77161709592716a336021f19a9aa3e9.js\"><\/script><\/p>\n\n\n\n<p>&nbsp;<\/p>\n\n\n\n<p>&nbsp;<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full wp-image-6398\"><img loading=\"lazy\" decoding=\"async\" width=\"1322\" height=\"622\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/06\/dogs_sam_masks4.gif\" alt=\"A GIF of our image of the two dogs, as seen in the Comet UI, with each individual segmentation mask plotted.\" class=\"wp-image-6398\"\/><figcaption class=\"wp-element-caption\">Examining the segmentation masks generated by SAM, as logged to Comet; GIF by author.<\/figcaption><\/figure>\n\n\n\n<p><span style=\"font-weight: 400;\">Finally, let\u2019s isolate the masks we want to use for our next task: image inpainting. We\u2019ll be replacing the dog on the right with an old man, so we\u2019ll need the following three masks (we can grab their indices from the binary mask plot above):<\/span><\/p>\n\n\n\n<p>&nbsp;<\/p>\n\n\n\n<p><script src=\"https:\/\/gist.github.com\/anmorgan24\/6064634cca81ac5f825f5e77edb5d008.js\"><\/script><\/p>\n\n\n\n<p>&nbsp;<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Isolating part of a mask with SAM<\/h2>\n\n\n\n<p><span style=\"font-weight: 400;\">Now, let\u2019s say we\u2019ve decided we want to replace the dog on the right with an old man, but just the head. 
If we were detecting masks with points (either interactively or programmatically), we could isolate just the dog\u2019s face from the rest of his body using a positive and negative prompt like so:<\/span><\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter wp-image-6399\"><img loading=\"lazy\" decoding=\"async\" width=\"1554\" height=\"1038\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/06\/Screen-Shot-2023-06-18-at-4.14.23-PM.png\" alt=\"A picture of our two dogs using a positive point prompt and a negative point prompt to indicate we just want to segment the dog's face.\" class=\"wp-image-6399\" srcset=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/06\/Screen-Shot-2023-06-18-at-4.14.23-PM.png 1554w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/06\/Screen-Shot-2023-06-18-at-4.14.23-PM-300x200.png 300w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/06\/Screen-Shot-2023-06-18-at-4.14.23-PM-1024x684.png 1024w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/06\/Screen-Shot-2023-06-18-at-4.14.23-PM-768x513.png 768w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/06\/Screen-Shot-2023-06-18-at-4.14.23-PM-1536x1026.png 1536w\" sizes=\"auto, (max-width: 1554px) 100vw, 1554px\" \/><figcaption class=\"wp-element-caption\">The green star represents a \u201cpositive\u201d input point and the red star represents a \u201cnegative\u201d input point. This combination indicates to SAM that we want to segment the dog on the right, but not the body (just the face); image by author.<\/figcaption><\/figure>\n\n\n\n<p><span style=\"font-weight: 400;\">But since we already have our masks arrays, we\u2019ll isolate the dog\u2019s face using <code>np.where<\/code>. Below, we start with the mask of the dog on the right and subtract the masks for its shirt and necklace. 
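The subtraction step can be sketched with toy boolean arrays (real SAM masks are image-sized; the shapes, values, and mask assignments here are purely illustrative):

```python
import numpy as np

# Toy stand-ins for SAM's boolean masks (real masks are image-sized arrays).
dog = np.array([[1, 1, 1], [1, 1, 1], [1, 1, 1]], dtype=bool)
shirt = np.array([[0, 0, 0], [1, 1, 1], [0, 0, 0]], dtype=bool)
necklace = np.array([[0, 0, 0], [0, 0, 0], [1, 1, 1]], dtype=bool)

# Keep only dog pixels that belong to neither the shirt nor the necklace.
face = np.where(shirt | necklace, False, dog)
print(face.astype(int))  # [[1 1 1] [0 0 0] [0 0 0]]
```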
Then we convert the array back to a PIL Image.<\/span><\/p>\n\n\n\n<p>&nbsp;<\/p>\n\n\n\n<p><script src=\"https:\/\/gist.github.com\/anmorgan24\/4792d7b5a4159d50c9ec0b72ebfacd37.js\"><\/script><\/p>\n\n\n\n<p>&nbsp;<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Image generation with Stable Diffusion<\/h2>\n\n\n\n<p><span style=\"font-weight: 400;\">For our final step, we\u2019ll use Stable Diffusion, a latent text-to-image deep learning model capable of generating photo-realistic images from any text input. Specifically, we\u2019ll use the Stable Diffusion Inpainting Pipeline, which takes as input a prompt, an image, and a binary mask image. This pipeline will generate an image from the text prompt only for the white pixels (\u201c1\u201ds) of the mask image.<\/span><\/p>\n\n\n\n<p>&nbsp;<\/p>\n\n\n\n<p><script src=\"https:\/\/gist.github.com\/anmorgan24\/10c4a149a0f1799f8e2952618ae7b7a6.js\"><\/script><\/p>\n\n\n\n<p>&nbsp;<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">What is inpainting?<\/h3>\n\n\n\n<p><span style=\"font-weight: 400;\">Image inpainting refers to the process of filling in missing data in a designated region of an image. Originally, image inpainting was used to restore damaged regions of a photo to look more like the original, but it is now commonly used with masks to intentionally alter regions of an image.<\/span><\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">Like SAM, the Stable Diffusion Inpainting Pipeline accepts both a positive and a negative input prompt. Here, we instruct it to use the mask corresponding to the right dog\u2019s face and generate \u201can old man with curly hair\u201d in its place. Our negative prompt instructs the model to exclude specific objects or characteristics from the image it generates. Finally, we set the random seed so we can reproduce our results later on. <\/span><\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">Pro tip: Stable Diffusion can be hit or miss. 
If you don\u2019t like your results the first time, try adjusting the random seed and running the model again. If you still don\u2019t like your results, try adjusting your prompts. For more on prompt engineering, <\/span><a href=\"https:\/\/www.comet.com\/site\/blog\/prompt-engineering\/\"><span style=\"font-weight: 400;\">read here<\/span><\/a><span style=\"font-weight: 400;\">.<\/span><\/p>\n\n\n\n<p>&nbsp;<\/p>\n\n\n\n<p><script src=\"https:\/\/gist.github.com\/anmorgan24\/1991f259aaee454e6040873703d706bb.js\"><\/script><\/p>\n\n\n\n<p>&nbsp;<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter wp-image-6401\"><img loading=\"lazy\" decoding=\"async\" width=\"1852\" height=\"1234\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/06\/Screen-Shot-2023-06-18-at-4.42.59-PM.png\" alt=\"An image of our two dogs but the righthand dog's face has been replaced with the face of an old man.\" class=\"wp-image-6401\" srcset=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/06\/Screen-Shot-2023-06-18-at-4.42.59-PM.png 1852w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/06\/Screen-Shot-2023-06-18-at-4.42.59-PM-300x200.png 300w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/06\/Screen-Shot-2023-06-18-at-4.42.59-PM-1024x682.png 1024w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/06\/Screen-Shot-2023-06-18-at-4.42.59-PM-768x512.png 768w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/06\/Screen-Shot-2023-06-18-at-4.42.59-PM-1536x1023.png 1536w\" sizes=\"auto, (max-width: 1852px) 100vw, 1852px\" \/><figcaption class=\"wp-element-caption\">Our final output image with the dog&#8217;s face replaced by an old man; image by author<\/figcaption><\/figure>\n\n\n\n<p><span style=\"font-weight: 400;\">That was simple! 
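Under the hood, the mask convention is easy to sketch with plain arrays: pixels where the mask is white are regenerated, and everything else is kept from the original image (toy values, purely illustrative):

```python
import numpy as np

# Toy sketch of the inpainting mask convention: pixels where the mask is
# white (1) are regenerated; pixels where it is black (0) are kept as-is.
original = np.full((2, 2), 10)      # stand-in for the input image
generated = np.full((2, 2), 99)     # stand-in for Stable Diffusion's output
mask = np.array([[1, 0], [0, 0]])   # only the top-left pixel is white

result = np.where(mask == 1, generated, original)
print(result)  # [[99 10] [10 10]]
```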
Now let\u2019s try outpainting.<\/span><\/p>\n\n\n\n<h3 class=\"wp-block-heading\">What is outpainting?<\/h3>\n\n\n\n<p><span style=\"font-weight: 400;\">Image outpainting is the process of using generative AI to extend images beyond their original borders, thereby generating parts of the image that didn\u2019t exist before. We\u2019ll effectively do this by masking the original background and using the same Stable Diffusion Inpainting Pipeline.&nbsp;<\/span><\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">The only difference here will be the input mask (now the background), and the input prompt. Let\u2019s bring the dogs to Las Vegas!<\/span><\/p>\n\n\n\n<p>&nbsp;<\/p>\n\n\n\n<p><script src=\"https:\/\/gist.github.com\/anmorgan24\/b3d530b5bc3fa06782a5374c666bd198.js\"><\/script><\/p>\n\n\n\n<p>&nbsp;<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter wp-image-6402\"><img loading=\"lazy\" decoding=\"async\" width=\"1850\" height=\"1224\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/06\/Screen-Shot-2023-06-18-at-5.22.34-PM.png\" alt=\"A picture of our two dogs but the background has been replaced with a casino in Las Vegas. 
Image made with with SAM + Stable Diffusion.\" class=\"wp-image-6402\" srcset=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/06\/Screen-Shot-2023-06-18-at-5.22.34-PM.png 1850w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/06\/Screen-Shot-2023-06-18-at-5.22.34-PM-300x198.png 300w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/06\/Screen-Shot-2023-06-18-at-5.22.34-PM-1024x678.png 1024w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/06\/Screen-Shot-2023-06-18-at-5.22.34-PM-768x508.png 768w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/06\/Screen-Shot-2023-06-18-at-5.22.34-PM-1536x1016.png 1536w\" sizes=\"auto, (max-width: 1850px) 100vw, 1850px\" \/><figcaption class=\"wp-element-caption\">Our input image after the background has been replaced by &#8220;a casino in Las Vegas&#8221;; image by author<\/figcaption><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\">Inpainting multiple objects with Stable Diffusion<\/h2>\n\n\n\n<p><span style=\"font-weight: 400;\">Now let\u2019s try segmenting more than one object in an image. In the next image we\u2019ll ask the model to detect both the frog and the flower. We\u2019ll then instruct the model to replace the frog with a koala bear, and replace the flower with the Empire State Building.<\/span><\/p>\n\n\n\n<p>&nbsp;<\/p>\n\n\n\n<p><script src=\"https:\/\/gist.github.com\/anmorgan24\/b3d530b5bc3fa06782a5374c666bd198.js\"><\/script><\/p>\n\n\n\n<p>&nbsp;<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter wp-image-6403 size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1612\" height=\"1210\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/06\/Screen-Shot-2023-06-18-at-5.29.03-PM.png\" alt=\"Two side-by-side images. The lefthand image is the original image of a frog clinging to a tropical-colored flower with a bright, blurry background. The image on the right is the same image, only the frog has been replaced by an AI-generated koala bear. 
Image made with with SAM + Stable Diffusion.\" class=\"wp-image-6403\" srcset=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/06\/Screen-Shot-2023-06-18-at-5.29.03-PM.png 1612w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/06\/Screen-Shot-2023-06-18-at-5.29.03-PM-300x225.png 300w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/06\/Screen-Shot-2023-06-18-at-5.29.03-PM-1024x769.png 1024w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/06\/Screen-Shot-2023-06-18-at-5.29.03-PM-768x576.png 768w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/06\/Screen-Shot-2023-06-18-at-5.29.03-PM-1536x1153.png 1536w\" sizes=\"auto, (max-width: 1612px) 100vw, 1612px\" \/><figcaption class=\"wp-element-caption\">On the left, our original image of a frog. On the right, our output image where the frog has been replaced by a koala bear; image by author.<\/figcaption><\/figure>\n\n\n\n<p><span style=\"font-weight: 400;\">The model thinks the flower includes the frog, but we can work around that by subtracting the frog mask and then converting the new mask to a PIL image.<\/span><\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full wp-image-6405\"><img loading=\"lazy\" decoding=\"async\" width=\"2264\" height=\"1086\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/06\/Screen-Shot-2023-06-16-at-10.50.51-PM.png\" alt=\"Three binary masks in a row showing the result of subtracting one mask from another.\" class=\"wp-image-6405\" srcset=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/06\/Screen-Shot-2023-06-16-at-10.50.51-PM.png 2264w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/06\/Screen-Shot-2023-06-16-at-10.50.51-PM-300x144.png 300w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/06\/Screen-Shot-2023-06-16-at-10.50.51-PM-1024x491.png 1024w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/06\/Screen-Shot-2023-06-16-at-10.50.51-PM-768x368.png 768w, 
https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/06\/Screen-Shot-2023-06-16-at-10.50.51-PM-1536x737.png 1536w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/06\/Screen-Shot-2023-06-16-at-10.50.51-PM-2048x982.png 2048w\" sizes=\"auto, (max-width: 2264px) 100vw, 2264px\" \/><figcaption class=\"wp-element-caption\">From left to right: the original flower mask, minus the frog mask, equals our new, corrected flower mask; image by author.<\/figcaption><\/figure>\n\n\n\n<p><span style=\"font-weight: 400;\">Once we\u2019ve separated the flower, let\u2019s replace it with the Empire State Building:<\/span><\/p>\n\n\n\n<p>&nbsp;<\/p>\n\n\n\n<p><script src=\"https:\/\/gist.github.com\/anmorgan24\/8169634acb423f71322708ae7ac363e6.js\"><\/script><\/p>\n\n\n\n<p>&nbsp;<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter wp-image-6406 size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"2452\" height=\"1300\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/06\/Screen-Shot-2023-06-16-at-10.20.06-PM.png\" alt=\"Three images in a row. The first is a tropical frog, the second replaces the frog with a koala bear using generative AI, and the third replaces the flower with a skyscraper (the Empire State Building) using generative AI. 
Image made with SAM + Stable Diffusion.\" class=\"wp-image-6406\" srcset=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/06\/Screen-Shot-2023-06-16-at-10.20.06-PM.png 2452w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/06\/Screen-Shot-2023-06-16-at-10.20.06-PM-300x159.png 300w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/06\/Screen-Shot-2023-06-16-at-10.20.06-PM-1024x543.png 1024w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/06\/Screen-Shot-2023-06-16-at-10.20.06-PM-768x407.png 768w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/06\/Screen-Shot-2023-06-16-at-10.20.06-PM-1536x814.png 1536w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/06\/Screen-Shot-2023-06-16-at-10.20.06-PM-2048x1086.png 2048w\" sizes=\"auto, (max-width: 2452px) 100vw, 2452px\" \/><figcaption class=\"wp-element-caption\">On the left, our original image. In the center, we&#8217;ve used SAM + Stable Diffusion to replace the frog with a koala bear, and on the right we&#8217;ve also replaced the flower with a skyscraper; image by author.<\/figcaption><\/figure>\n\n\n\n<p><span style=\"font-weight: 400;\">Our model isn\u2019t perfect; it looks like our koala may have a fifth leg, and there are still some remnants of frog on the skyscraper, but generally, our pipeline performed pretty well!<\/span><\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Defining the background for Stable Diffusion<\/h2>\n\n\n\n<p><span style=\"font-weight: 400;\">Sometimes our object detector, GroundingDINO, won\u2019t detect the background. But we can still easily perform outpainting!<\/span><\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">To create a background mask when one isn\u2019t detected, we can just take the inverse of the object mask. 
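<\/span><\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">As a rough sketch (illustrative NumPy and PIL code, not necessarily the exact implementation from the Colab), inverting a single object mask might look like this:<\/span><\/p>\n\n\n\n

```python
import numpy as np
from PIL import Image

# Toy HxW boolean object mask; in the real pipeline this would come from SAM.
object_mask = np.zeros((4, 4), dtype=bool)
object_mask[1:3, 1:3] = True  # mark a small object region

# The background is simply everything that is not the object.
background = ~object_mask

# Convert to a PIL image (255 = background, 0 = object) for the inpainting step.
background_pil = Image.fromarray(background.astype(np.uint8) * 255)
```

\n\n\n\n<p><span style=\"font-weight: 400;\">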
If multiple objects are in the image, we would just add these masks together, and then take the inverse of this sum.<\/span><\/p>\n\n\n\n<p>&nbsp;<\/p>\n\n\n\n<p><script src=\"https:\/\/gist.github.com\/anmorgan24\/8fa00ad837b737c327d94938308ecbc6.js\"><\/script><\/p>\n\n\n\n<p>&nbsp;<\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">We can then follow the same process as in the previous examples.<\/span><\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Viewing our SAM + Stable Diffusion results in Comet<\/h2>\n\n\n\n<p><span style=\"font-weight: 400;\">As you can probably imagine, keeping track of which input images, prompts, masks, and random seeds were used to create which output images can get confusing, fast! That\u2019s why we logged all of our images to Comet as we went.&nbsp;<\/span><\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">Let\u2019s head on over to the Comet UI now and take a look at each of our input images and the resulting output images after inpainting and outpainting:<\/span><\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter wp-image-6408 size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"2320\" height=\"1438\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/06\/Screen-Shot-2023-06-18-at-6.00.44-PM.png\" alt=\"A grid of input and output images as represented in the Comet UI. 
Images made with SAM + Stable Diffusion.\" class=\"wp-image-6408\" srcset=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/06\/Screen-Shot-2023-06-18-at-6.00.44-PM.png 2320w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/06\/Screen-Shot-2023-06-18-at-6.00.44-PM-300x186.png 300w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/06\/Screen-Shot-2023-06-18-at-6.00.44-PM-1024x635.png 1024w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/06\/Screen-Shot-2023-06-18-at-6.00.44-PM-768x476.png 768w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/06\/Screen-Shot-2023-06-18-at-6.00.44-PM-1536x952.png 1536w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/06\/Screen-Shot-2023-06-18-at-6.00.44-PM-2048x1269.png 2048w\" sizes=\"auto, (max-width: 2320px) 100vw, 2320px\" \/><figcaption class=\"wp-element-caption\">We create a clean, simple dashboard to track our input images and the final output images. Which is your favorite? Image by author.<\/figcaption><\/figure>\n\n\n\n<p><span style=\"font-weight: 400;\">That\u2019s a nice, clean dashboard, but sometimes we want to get a deeper understanding of how we went from point A to point B. Or maybe something has gone wrong and we need to take a deeper look at each step of the process to debug. 
For this, we\u2019ll check our custom Debugging dashboard:<\/span><\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full wp-image-6410\"><img loading=\"lazy\" decoding=\"async\" width=\"1015\" height=\"770\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/06\/step_by_step_dashboard.gif\" alt=\"Our debugging dashboard for our text-to-image inpainting pipeline, as seen in the Comet UI\" class=\"wp-image-6410\"\/><figcaption class=\"wp-element-caption\">Our dashboard showing each step of each image in our pipeline; GIF by author.<\/figcaption><\/figure>\n\n\n\n<p><span style=\"font-weight: 400;\">We can also take a closer look at each step of an individual experiment:<\/span><\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full wp-image-6409\"><img loading=\"lazy\" decoding=\"async\" width=\"1309\" height=\"617\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/06\/closer_look_dogs.gif\" alt=\"A screenshot of all steps of the image pipeline, as visualized in the Comet UI.\" class=\"wp-image-6409\"\/><figcaption class=\"wp-element-caption\">We also create a second dashboard detailing each step of the process for each image. This is helpful for debugging if something goes wrong in our pipeline; GIF by author.<\/figcaption><\/figure>\n\n\n\n<h2 class=\"wp-block-heading oa ob fw be oc od oe gw of og oh gz oi oj ok ol om on oo op oq or os ot ou ov bj\" id=\"4a4a\"><strong class=\"al\">Tracking our prompts with Comet<\/strong><\/h2>\n\n\n\n<p class=\"pw-post-body-paragraph nd ne fw nf b gu ow nh ni gx ox nk nl nm oy no np nq oz ns nt nu pa nw nx ny fp bj\" id=\"d0a9\">We\u2019ll also want to make sure to keep track of how we created each output so we can reproduce any of the results later on. Maybe we\u2019ve run different versions of the same prompt multiple times. Or maybe we\u2019ve tried different random seeds and want to pick our favorite result. 
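<\/p>\n\n\n\n<p>One lightweight way to keep each run reproducible is to gather the prompt, seed, and sampler settings into a dictionary and log it with Comet\u2019s <code>log_parameters<\/code>. The sketch below is illustrative; the field names are examples, not a required schema:<\/p>\n\n\n\n

```python
# Illustrative sketch: record the settings that produced one output image.
# The field names below are examples, not a fixed schema.
run_metadata = {
    'prompt': 'a koala bear',
    'negative_prompt': '',
    'seed': 42,
    'guidance_scale': 7.5,
}

# With a live Comet experiment, this metadata can be logged directly
# (assumes comet_ml is installed and an API key is configured):
# from comet_ml import Experiment
# experiment = Experiment(project_name='text-to-inpainting')
# experiment.log_parameters(run_metadata)
```

\n\n\n\n<p>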
By logging our prompts to&nbsp;<a class=\"af nz\" href=\"https:\/\/www.comet.com\/examples\/demo-text-to-inpainting-sam-stablediffusion\/view\/bRnI022tXQUdKGsVCFmjFRRtT\/panels?utm_source=Medium&amp;utm_medium=referral&amp;utm_content=SAM+SD_blog\" target=\"_blank\" rel=\"noopener ugc nofollow\">Comet\u2019s Data Panel<\/a>, we can easily retrieve all the relevant information to recreate any of our image outputs.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full wp-image-6582\"><img loading=\"lazy\" decoding=\"async\" width=\"1129\" height=\"305\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/06\/data_panel_SAM.gif\" alt=\"A GIF of Comet's Data Panel used to track text prompts and random seeds used to create Stable Diffusion image outputs.\" class=\"wp-image-6582\"\/><figcaption class=\"wp-element-caption\">Tracking and organizing our prompt and seed information with Comet\u2019s Data Panels; GIF by author<\/figcaption><\/figure>\n\n\n\n<p>Now that you&#8217;re an inpainting pro, <a href=\"https:\/\/colab.research.google.com\/drive\/1B7L4cork9UFTtIB02EntjiZRLYuqJS2b#scrollTo=LtZghyHoJabf\">try out the pipeline on your own images<\/a>!<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter wp-image-6483 size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"2018\" height=\"1052\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/06\/Screen-Shot-2023-06-26-at-3.06.15-PM.png\" alt=\"Three pictures; on the left, a picture of a seal underwater looking straight at the camera. In the center, the seal has been replaced by a bright purple octopus. On the right, the background has been replaced by a comfortable bed. 
Image made with SAM + Stable Diffusion.\" class=\"wp-image-6483\" srcset=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/06\/Screen-Shot-2023-06-26-at-3.06.15-PM.png 2018w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/06\/Screen-Shot-2023-06-26-at-3.06.15-PM-300x156.png 300w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/06\/Screen-Shot-2023-06-26-at-3.06.15-PM-1024x534.png 1024w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/06\/Screen-Shot-2023-06-26-at-3.06.15-PM-768x400.png 768w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/06\/Screen-Shot-2023-06-26-at-3.06.15-PM-1536x801.png 1536w\" sizes=\"auto, (max-width: 2018px) 100vw, 2018px\" \/><figcaption class=\"wp-element-caption\"><a href=\"https:\/\/colab.research.google.com\/drive\/1B7L4cork9UFTtIB02EntjiZRLYuqJS2b#scrollTo=LtZghyHoJabf\">Try out the pipeline on your own images!<\/a><\/figcaption><\/figure>\n\n\n\n<h3 class=\"wp-block-heading graf graf--h3\">Conclusion<\/h3>\n\n\n\n<p class=\"graf graf--p\">Thanks for making it all the way to the end, and I hope you found this SAM + Stable Diffusion tutorial helpful! For questions, comments, or feedback, feel free to drop a note in the comments below. 
Happy coding!<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Frequently Asked Questions (FAQs)<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\"><span style=\"font-weight: 400;\">Where can I find the full code for this tutorial?<\/span><\/h3>\n\n\n\n<p><span style=\"font-weight: 400;\">Find the full code tutorial in <\/span><a href=\"https:\/\/colab.research.google.com\/drive\/1B7L4cork9UFTtIB02EntjiZRLYuqJS2b#scrollTo=LtZghyHoJabf\"><span style=\"font-weight: 400;\">this Colab here<\/span><\/a><span style=\"font-weight: 400;\">.&nbsp;<\/span><\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><span style=\"font-weight: 400;\">Where can I find the images used in this tutorial?<\/span><\/h3>\n\n\n\n<p><span style=\"font-weight: 400;\">Download the dataset on <\/span><a href=\"https:\/\/www.kaggle.com\/datasets\/abbymorgan\/animals-toy-dataset\"><span style=\"font-weight: 400;\">Kaggle here<\/span><\/a><span style=\"font-weight: 400;\">.<\/span><\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><span style=\"font-weight: 400;\">Is Segment Anything open source?<\/span><\/h3>\n\n\n\n<p><span style=\"font-weight: 400;\">Yes, the Segment Anything model and dataset are both open source! Find the code for the <\/span><a href=\"https:\/\/github.com\/facebookresearch\/segment-anything\"><span style=\"font-weight: 400;\">Segment Anything Model here<\/span><\/a><span style=\"font-weight: 400;\"> and download the full <\/span><a href=\"https:\/\/ai.facebook.com\/datasets\/segment-anything\/\"><span style=\"font-weight: 400;\">Segment Anything dataset here<\/span><\/a><span style=\"font-weight: 400;\">.<\/span><\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><span style=\"font-weight: 400;\">How many images were used to train SAM?<\/span><\/h3>\n\n\n\n<p><span style=\"font-weight: 400;\">The Segment Anything dataset consists of 11 million images with 1.1 billion segmentation masks. 
It\u2019s roughly 400x larger than the next largest segmentation mask dataset and 6x larger than OpenImages V5.<\/span><\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><span style=\"font-weight: 400;\">Is the Segment Anything Model a foundational model?<\/span><\/h3>\n\n\n\n<p><span style=\"font-weight: 400;\">Yes, the Segment Anything model is considered the first foundational model for computer vision.<\/span><\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><span style=\"font-weight: 400;\">Is there a Segment Anything paper?<\/span><\/h3>\n\n\n\n<p><span style=\"font-weight: 400;\">Yes, there is a Segment Anything paper available for free on <\/span><a href=\"https:\/\/arxiv.org\/abs\/2304.02643?ref=blog.roboflow.com\"><span style=\"font-weight: 400;\">arxiv.org here<\/span><\/a><span style=\"font-weight: 400;\">.<\/span><\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><span style=\"font-weight: 400;\">Is there a Segment Anything demo?<\/span><\/h3>\n\n\n\n<p><span style=\"font-weight: 400;\">There is a free interactive <\/span><a href=\"https:\/\/segment-anything.com\/\"><span style=\"font-weight: 400;\">Segment Anything demo<\/span><\/a><span style=\"font-weight: 400;\"> available online here.<\/span><\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><span style=\"font-weight: 400;\">Is there a Segment Anything API?<\/span><\/h3>\n\n\n\n<p><span style=\"font-weight: 400;\">There is not currently a Segment Anything API, but the code is open source and <\/span><a href=\"https:\/\/github.com\/facebookresearch\/segment-anything\"><span style=\"font-weight: 400;\">available on GitHub<\/span><\/a><span style=\"font-weight: 400;\">.<\/span><\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><span style=\"font-weight: 400;\">Is there a Stable Diffusion repo?<\/span><\/h3>\n\n\n\n<p>Yes, there is a <a href=\"https:\/\/github.com\/Stability-AI\/stablediffusion\">Stable Diffusion GitHub repository.<\/a><\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><span style=\"font-weight: 400;\">Is there a Stable 
Diffusion API?<\/span><\/h3>\n\n\n\n<p>Yes, there is a Stable Diffusion and Dreambooth API. Learn more <a href=\"https:\/\/stablediffusionapi.com\/\">here<\/a>.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><span style=\"font-weight: 400;\">Where can I create a Comet account?<\/span><\/h3>\n\n\n\n<p><span style=\"font-weight: 400;\">Sign up for a <\/span><a href=\"\/signup\/?utm_source=Comet_blog&amp;utm_medium=referral&amp;utm_content=SAM_blog\"><span style=\"font-weight: 400;\">free Comet account here<\/span><\/a><span style=\"font-weight: 400;\">.&nbsp;<\/span><\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Image Credits<\/h2>\n\n\n\n<p><span style=\"font-weight: 400;\">The sample images used in this tutorial were all downloaded originally from <\/span><a href=\"https:\/\/unsplash.com\/\"><span style=\"font-weight: 400;\">Unsplash<\/span><\/a><span style=\"font-weight: 400;\">:<\/span><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><span style=\"font-weight: 400;\">Dog image by <\/span><a href=\"https:\/\/unsplash.com\/fr\/@karsten116?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText\"><span style=\"font-weight: 400;\">Karsten Winegeart<\/span><\/a><span style=\"font-weight: 400;\">.<\/span><\/li>\n\n\n\n<li><span style=\"font-weight: 400;\">Fox image by <\/span><a href=\"https:\/\/unsplash.com\/@rayhennessy?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText\"><span style=\"font-weight: 400;\">Ray Hennessy<\/span><\/a><span style=\"font-weight: 400;\">.<\/span><\/li>\n\n\n\n<li><span style=\"font-weight: 400;\">Frog image by <\/span><a href=\"https:\/\/unsplash.com\/@sleblanc01?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText\"><span style=\"font-weight: 400;\">Stephanie LeBlanc<\/span><\/a><span style=\"font-weight: 400;\">.<\/span><\/li>\n\n\n\n<li>Panda image by <a href=\"https:\/\/unsplash.com\/@jasonsung?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText\">Jason 
Sung<\/a>.<\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>In this article, we\u2019ll leverage the power of SAM, the first foundational model for computer vision, along with Stable Diffusion, a popular generative AI tool, to create a text-to-image inpainting pipeline that we&#8217;ll track in Comet. Feel free to follow along with the full code tutorial in this Colab and get the Kaggle dataset here. [&hellip;]<\/p>\n","protected":false},"author":22,"featured_media":6985,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"customer_name":"","customer_description":"","customer_industry":"","customer_technologies":"","customer_logo":"","footnotes":""},"categories":[8,6,7],"tags":[40,14,29,27,30,15,42,36,16,37,38,43,44,45,46,39],"coauthors":[133],"class_list":["post-6366","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-comet-community-hub","category-machine-learning","category-tutorials","tag-comet","tag-comet-ml","tag-computer-vision","tag-data-panels","tag-deep-learning","tag-deep-learning-experiment-management","tag-image-inpainting","tag-image-panels","tag-ml-experiment-management","tag-object-detection","tag-pytorch","tag-sam","tag-segment-anything","tag-stable-diffusion","tag-text-to-image","tag-torchvision"],"yoast_head":"<!-- This site is optimized with the Yoast SEO Premium plugin v25.9 (Yoast SEO v25.9) - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>SAM + Stable Diffusion for Text-to-Image Inpainting<\/title>\n<meta name=\"description\" content=\"In this full-code tutorial learn how to use SAM + Stable Diffusion to create an image inpainting pipeline for your next generative AI project\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.comet.com\/site\/blog\/sam-stable-diffusion-for-text-to-image-inpainting\/\" 
\/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"SAM + Stable Diffusion for Text-to-Image Inpainting\" \/>\n<meta property=\"og:description\" content=\"In this full-code tutorial learn how to use SAM + Stable Diffusion to create an image inpainting pipeline for your next generative AI project\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.comet.com\/site\/blog\/sam-stable-diffusion-for-text-to-image-inpainting\/\" \/>\n<meta property=\"og:site_name\" content=\"Comet\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/cometdotml\" \/>\n<meta property=\"article:published_time\" content=\"2023-06-25T21:30:08+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2025-04-29T14:03:44+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/06\/Screen-Shot-2023-07-28-at-11.58.03-AM.png\" \/>\n\t<meta property=\"og:image:width\" content=\"300\" \/>\n\t<meta property=\"og:image:height\" content=\"304\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"Abby Morgan\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@anmorgan2414\" \/>\n<meta name=\"twitter:site\" content=\"@Cometml\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Abby Morgan\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"16 minutes\" \/>\n<!-- \/ Yoast SEO Premium plugin. 
-->","yoast_head_json":{"title":"SAM + Stable Diffusion for Text-to-Image Inpainting","description":"In this full-code tutorial learn how to use SAM + Stable Diffusion to create an image inpainting pipeline for your next generative AI project","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.comet.com\/site\/blog\/sam-stable-diffusion-for-text-to-image-inpainting\/","og_locale":"en_US","og_type":"article","og_title":"SAM + Stable Diffusion for Text-to-Image Inpainting","og_description":"In this full-code tutorial learn how to use SAM + Stable Diffusion to create an image inpainting pipeline for your next generative AI project","og_url":"https:\/\/www.comet.com\/site\/blog\/sam-stable-diffusion-for-text-to-image-inpainting\/","og_site_name":"Comet","article_publisher":"https:\/\/www.facebook.com\/cometdotml","article_published_time":"2023-06-25T21:30:08+00:00","article_modified_time":"2025-04-29T14:03:44+00:00","og_image":[{"width":300,"height":304,"url":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/06\/Screen-Shot-2023-07-28-at-11.58.03-AM.png","type":"image\/png"}],"author":"Abby Morgan","twitter_card":"summary_large_image","twitter_creator":"@anmorgan2414","twitter_site":"@Cometml","twitter_misc":{"Written by":"Abby Morgan","Est. 
reading time":"16 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/www.comet.com\/site\/blog\/sam-stable-diffusion-for-text-to-image-inpainting\/#article","isPartOf":{"@id":"https:\/\/www.comet.com\/site\/blog\/sam-stable-diffusion-for-text-to-image-inpainting\/"},"author":{"name":"Abby Morgan","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/826ee39a2e30cf9d8d73155de09bb7b2"},"headline":"SAM + Stable Diffusion for Text-to-Image Inpainting","datePublished":"2023-06-25T21:30:08+00:00","dateModified":"2025-04-29T14:03:44+00:00","mainEntityOfPage":{"@id":"https:\/\/www.comet.com\/site\/blog\/sam-stable-diffusion-for-text-to-image-inpainting\/"},"wordCount":2624,"publisher":{"@id":"https:\/\/www.comet.com\/site\/#organization"},"image":{"@id":"https:\/\/www.comet.com\/site\/blog\/sam-stable-diffusion-for-text-to-image-inpainting\/#primaryimage"},"thumbnailUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/06\/Screen-Shot-2023-07-28-at-11.58.03-AM.png","keywords":["Comet","Comet ML","Computer Vision","Data Panels","Deep Learning","Deep Learning Experiment Management","Image inpainting","Image Panels","ML Experiment Management","Object Detection","PyTorch","SAM","Segment Anything","Stable Diffusion","Text-to-image","TorchVision"],"articleSection":["Comet Community Hub","Machine Learning","Tutorials"],"inLanguage":"en-US"},{"@type":"WebPage","@id":"https:\/\/www.comet.com\/site\/blog\/sam-stable-diffusion-for-text-to-image-inpainting\/","url":"https:\/\/www.comet.com\/site\/blog\/sam-stable-diffusion-for-text-to-image-inpainting\/","name":"SAM + Stable Diffusion for Text-to-Image 
Inpainting","isPartOf":{"@id":"https:\/\/www.comet.com\/site\/#website"},"primaryImageOfPage":{"@id":"https:\/\/www.comet.com\/site\/blog\/sam-stable-diffusion-for-text-to-image-inpainting\/#primaryimage"},"image":{"@id":"https:\/\/www.comet.com\/site\/blog\/sam-stable-diffusion-for-text-to-image-inpainting\/#primaryimage"},"thumbnailUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/06\/Screen-Shot-2023-07-28-at-11.58.03-AM.png","datePublished":"2023-06-25T21:30:08+00:00","dateModified":"2025-04-29T14:03:44+00:00","description":"In this full-code tutorial learn how to use SAM + Stable Diffusion to create an image inpainting pipeline for your next generative AI project","breadcrumb":{"@id":"https:\/\/www.comet.com\/site\/blog\/sam-stable-diffusion-for-text-to-image-inpainting\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.comet.com\/site\/blog\/sam-stable-diffusion-for-text-to-image-inpainting\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/blog\/sam-stable-diffusion-for-text-to-image-inpainting\/#primaryimage","url":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/06\/Screen-Shot-2023-07-28-at-11.58.03-AM.png","contentUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/06\/Screen-Shot-2023-07-28-at-11.58.03-AM.png","width":300,"height":304,"caption":"a side by side picture of a koala and a frog holding a leaf"},{"@type":"BreadcrumbList","@id":"https:\/\/www.comet.com\/site\/blog\/sam-stable-diffusion-for-text-to-image-inpainting\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.comet.com\/site\/"},{"@type":"ListItem","position":2,"name":"SAM + Stable Diffusion for Text-to-Image Inpainting"}]},{"@type":"WebSite","@id":"https:\/\/www.comet.com\/site\/#website","url":"https:\/\/www.comet.com\/site\/","name":"Comet","description":"Build Better Models 
Faster","publisher":{"@id":"https:\/\/www.comet.com\/site\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.comet.com\/site\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/www.comet.com\/site\/#organization","name":"Comet ML, Inc.","alternateName":"Comet","url":"https:\/\/www.comet.com\/site\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/#\/schema\/logo\/image\/","url":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/01\/logo_comet_square.png","contentUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/01\/logo_comet_square.png","width":310,"height":310,"caption":"Comet ML, Inc."},"image":{"@id":"https:\/\/www.comet.com\/site\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/cometdotml","https:\/\/x.com\/Cometml","https:\/\/www.youtube.com\/channel\/UCmN63HKvfXSCS-UwVwmK8Hw"]},{"@type":"Person","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/826ee39a2e30cf9d8d73155de09bb7b2","name":"Abby Morgan","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/image\/dbbf1ae921ee179c768f508340415946","url":"https:\/\/secure.gravatar.com\/avatar\/28d4934d14261b4afe12e226f0eaa57c4fb0c2761ad4586eb9a5bec3b8160bc9?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/28d4934d14261b4afe12e226f0eaa57c4fb0c2761ad4586eb9a5bec3b8160bc9?s=96&d=mm&r=g","caption":"Abby Morgan"},"description":"AI\/ML Growth Engineer @ 
Comet","sameAs":["https:\/\/www.comet.com\/","https:\/\/www.linkedin.com\/in\/anmorgan24\/","https:\/\/x.com\/anmorgan2414"],"url":"https:\/\/www.comet.com\/site\/blog\/author\/abigailmcomet-com\/"}]}},"_links":{"self":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/6366","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/users\/22"}],"replies":[{"embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/comments?post=6366"}],"version-history":[{"count":2,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/6366\/revisions"}],"predecessor-version":[{"id":15812,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/6366\/revisions\/15812"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/media\/6985"}],"wp:attachment":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/media?parent=6366"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/categories?post=6366"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/tags?post=6366"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/coauthors?post=6366"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}