{"id":4572,"date":"2022-11-14T10:44:49","date_gmt":"2022-11-14T18:44:49","guid":{"rendered":"https:\/\/live-cometml.pantheonsite.io\/?p=4572"},"modified":"2025-04-24T17:16:25","modified_gmt":"2025-04-24T17:16:25","slug":"text-to-color-from-scratch-with-clip-pytorch-and-hugging-face-spaces","status":"publish","type":"post","link":"https:\/\/www.comet.com\/site\/blog\/text-to-color-from-scratch-with-clip-pytorch-and-hugging-face-spaces\/","title":{"rendered":"\u201cText-to-Color\u201d from Scratch with CLIP, PyTorch, and Hugging Face Spaces"},"content":{"rendered":"\n<figure class=\"wp-block-image aligncenter\"><img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/700\/1*ncJodqmMQphnD-IyY9N71A.png\" alt=\"\"\/><\/figure>\n\n\n\n<p><\/p>\n\n\n\n<p class=\"has-text-align-center\">Example input and output from the Gradio app built using the Text to Color model. Moving from left to right, we can see how each progressive training step updates the color to match the prompt \u201cthe color of a banana\u201d.<\/p>\n\n\n\n<p>&nbsp;<\/p>\n\n\n\n<div class=\"ir is it iu iv\">\n<h1 id=\"5c1a\" class=\"kj kk iy bm kl km kn ko kp kq kr ks kt ku kv kw kx ky kz la lb lc ld le lf lg ga\" data-selectable-paragraph=\"\">Introduction<\/h1>\n<p id=\"f0ba\" class=\"pw-post-body-paragraph lh li iy bm b lj lk ll lm ln lo lp lq lr ls lt lu lv lw lx ly lz ma mb mc md ir ga\" data-selectable-paragraph=\"\">This is&nbsp;<strong class=\"bm me\">part two&nbsp;<\/strong>in a series on using CLIP from scratch to evaluate and manipulate images by comparing them to text prompts. Part one can be found&nbsp;<a class=\"au mf\" href=\"https:\/\/heartbeat.comet.ml\/using-clip-and-gradio-to-assess-similarity-between-text-prompts-and-ranges-of-colors-a9a8fc0b0a08\" target=\"_blank\" rel=\"noopener ugc nofollow\">here.<\/a><\/p>\n<p id=\"7201\" class=\"pw-post-body-paragraph lh li iy bm b lj mx ll lm ln my lp lq lr mz lt lu lv na lx ly lz nb mb mc md ir ga\" data-selectable-paragraph=\"\">In the last post, I demonstrated how to compare a text prompt across a range of colors and visualize how well each individual shade matched the text prompt. In this tutorial, I\u2019ll demonstrate how we can optimize a color to match text as well as possible. To do so, we\u2019ll write a custom&nbsp;<em class=\"nc\">Module<\/em>&nbsp;using PyTorch.<\/p>\n<p id=\"4c90\" class=\"pw-post-body-paragraph lh li iy bm b lj mx ll lm ln my lp lq lr mz lt lu lv na lx ly lz nb mb mc md ir ga\" data-selectable-paragraph=\"\">You can follow&nbsp;<a class=\"au mf\" href=\"https:\/\/colab.research.google.com\/drive\/11fHW5-U-3BPOft-7MCFZNAHWzdtBb8Rm#scrollTo=C1hkDT38hSaP\" target=\"_blank\" rel=\"noopener ugc nofollow\">this Colab notebook<\/a>&nbsp;to work with the code interactively, and you can also try the model in action at this&nbsp;<a class=\"au mf\" href=\"https:\/\/huggingface.co\/spaces\/miccull\/clip-text-to-rgb\" target=\"_blank\" rel=\"noopener ugc nofollow\">Hugging Face Space<\/a>, which I built using&nbsp;<a class=\"au mf\" href=\"http:\/\/gradio.app\/\" target=\"_blank\" rel=\"noopener ugc nofollow\">Gradio<\/a>. 
In this post, I'll provide some commentary and explanation on the code needed to write the model and training loop.

## Subclass from `torch.nn.Module`

The first thing we do is create a new class, `RGBModel`, as a *subclass* of PyTorch's `Module` class. If you're not familiar with the idea of classes and inheritance in Python (or another language), this is like creating our own recipe for a model by adapting some fundamental building blocks.

The `Module` class takes care of a lot of low-level functionality in PyTorch, and we just add a few custom things on top of it.

```python
class RGBModel(torch.nn.Module):
    pass
```

### Define the `__init__` method

First, we need to define our *initializer*, which gets called whenever we create a new instance of this class, i.e., when we write something like `model = RGBModel()`.

```python
class RGBModel(torch.nn.Module):
    def __init__(self, device):
        # Call nn.Module.__init__() to set up standard torch.nn.Module machinery
        super(RGBModel, self).__init__()
        # Start at mid-gray: a (1, 3, 1, 1) tensor with every value at 0.5
        color = torch.ones(size=(1, 3, 1, 1), device=device) / 2
        self.color = torch.nn.Parameter(color)
```

The first thing our `__init__` method does is call the standard `__init__` method from `torch.nn.Module`, which is our "parent" class, or superclass. That's what `super(RGBModel, self).__init__()` is doing. It handles all sorts of standard PyTorch initialization that we need to get off the ground.

Then, we define a `Parameter` for our model.
This will hold the RGB value that we optimize in the training loop. We first create a tensor of ones with shape `(1, 3, 1, 1)` using `torch.ones`, then divide by two so the color starts at mid-gray. Remember that PyTorch typically expects images in `NCHW` format, so we're setting our tensor up as a batch containing one RGB image whose width and height are both a single pixel. We could handle reshaping this parameter later, but this shape will be more convenient downstream when the time comes to resize the pixel to the input resolution of CLIP's image encoder.

Next, we pass this tensor into `torch.nn.Parameter` and store the result as an attribute. That way, it will persist over time, and we can access it from other methods.

### Define the `forward` pass

```python
class RGBModel(torch.nn.Module):
    def __init__(self, device):
        # Call nn.Module.__init__() to set up standard torch.nn.Module machinery
        super(RGBModel, self).__init__()
        # Start at mid-gray: a (1, 3, 1, 1) tensor with every value at 0.5
        color = torch.ones(size=(1, 3, 1, 1), device=device) / 2
        self.color = torch.nn.Parameter(color)

    def forward(self):
        # Clamp values to the closed interval [0, 1]
        self.color.data = self.color.data.clamp(0, 1)
        return self.color
```

Next, we define what the model actually *does* when it's called. If `__init__` is what happens when we write `model = RGBModel()`, then `forward` dictates what happens when we then call `model()`. In many cases we might think of this as a "prediction" or "generation" step, but ultimately it is whatever the model outputs.

For us, the forward pass is quite simple: the model should simply output its color. We do *not* want `forward` to handle turning that color into an image or anything like that. The only thing we need to do is ensure that the color stays within an appropriate range during training. As such, we write `self.color.data = self.color.data.clamp(0, 1)` to restrict our parameter to the closed interval `[0, 1]`.

There are some issues we could run into with the `clamp` method during training, but this is a toy model, so we're going to ignore that for now.
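
For completeness, one common workaround (not used in this tutorial) is to keep an unconstrained parameter and squash it with a sigmoid in `forward`, so the constraint stays differentiable rather than overwriting `.data` outside of autograd. A minimal sketch, with a hypothetical `SigmoidRGBModel` name:

```python
import torch

class SigmoidRGBModel(torch.nn.Module):
    def __init__(self, device):
        super().__init__()
        # sigmoid(0) = 0.5, so zeros give the same mid-gray starting color
        self.logits = torch.nn.Parameter(torch.zeros(size=(1, 3, 1, 1), device=device))

    def forward(self):
        # The output always lies in (0, 1), so no clamping is required
        return torch.sigmoid(self.logits)
```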
> Want to see the evolution of AI-generated art projects? [Visit our public project](https://www.comet.com/team-comet-ml/clipdraw/view/Y4aT3gy6IrPQKBi5wncFXCYLR?utm_campaign=clipdraw-gradio&utm_source=blog&utm_medium=summary) to see time-lapses, experiment evolutions, and more!

## Create an Optimizer

With our model ready to go, it's time to create an optimizer object. We'll use the `AdamW` optimizer. For more information, this [blog post](https://towardsdatascience.com/why-adamw-matters-736223f31b5d) is a great rundown of the AdamW algorithm and its predecessor, Adam.

```python
# Create optimizer
# rgb_model() returns the color Parameter, so the optimizer receives [color]
opt = torch.optim.AdamW([rgb_model()],
                        lr=adam_learning_rate,
                        weight_decay=adam_weight_decay)
```
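
The snippet above assumes that `rgb_model`, `adam_learning_rate`, and `adam_weight_decay` have already been defined. A minimal setup might look like the following; the values are illustrative, not the notebook's actual settings:

```python
import torch

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
rgb_model = RGBModel(device=DEVICE)

# Illustrative hyperparameters; tune to taste
adam_learning_rate = 0.05
adam_weight_decay = 0.01
```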
Basically, what we need to know is that `AdamW` defines a strategy for running incremental, iterative updates to our `color` parameter during the training process.

Here, we provide two *hyperparameters* to the optimizer when we create it: a learning rate and a weight decay value. Broadly speaking, the learning rate sets the magnitude of the update each training step makes (a higher rate means bigger increments), and the weight decay drives a process by which those update steps shrink over time.

In the context of our model, the optimizer will tell us something like "if you want to make your `color` match this prompt, you should turn up the red value." Or, more specifically, "if you nudge your color in the *direction* of, say, `(0.1, -0.1, 0.1)`, the similarity will increase the fastest." The learning rate then controls how large that nudge is. Over time, we want to take smaller, more precise steps, so the optimizer applies *weight decay* to do just that.
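
To make that concrete, here is a toy illustration of the underlying idea using plain gradient descent rather than AdamW; the target vector and step size are made up for the example:

```python
import torch

color = torch.tensor([0.5, 0.5, 0.5], requires_grad=True)
target = torch.tensor([1.0, 0.9, 0.1])  # a made-up "banana-ish" RGB direction

# Negative cosine similarity: lower loss means a closer match
loss = -torch.cosine_similarity(color, target, dim=0)
loss.backward()

learning_rate = 0.05
with torch.no_grad():
    # Step against the gradient, scaled by the learning rate
    color -= learning_rate * color.grad
```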
## Generate a Target Embedding

We have a model and an optimizer. What do we optimize towards? Let's set up our target.

```python
# Create target embedding
with torch.no_grad():
    tokenized_text = clip.tokenize(text_prompt).to(device=DEVICE)
    target_embedding = model.encode_text(tokenized_text).detach().clone()
```

This should look familiar if you've read [part one](https://heartbeat.comet.ml/using-clip-and-gradio-to-assess-similarity-between-text-prompts-and-ranges-of-colors-a9a8fc0b0a08) of this series. But I want to point out an optional step we've taken here: computing the encoded text inside a `torch.no_grad` *context manager*. What's that all about?

Basically, PyTorch and other deep learning libraries use something called *automatic differentiation* to keep track of the gradients/derivatives of tensors as they move through a *computational graph*. Automatic differentiation spares us *a lot* of computation by hand when gradients are needed, but tracking the graph uses more memory in the process.

We absolutely need this to be enabled for the `color` parameter of our `RGBModel`, since we need the gradient of the (not yet defined) loss function to update the color during training. However, we don't need the gradient of anything with respect to our target, so we can save some memory by creating it in an indented block under `with torch.no_grad():`.

For a model this simple, we're almost surely not concerned with how much memory we have, but this will be a helpful trick in future projects when we start pushing the limits of our machines.

## Define the Training Step

Now, we define the actual *training* process. What happens during each iteration of our training loop? At the heart of it, we need to encode our color as an image, then compare its CLIP embedding to the embedding of our text prompt. But there are a few more things going on here that you may or may not have seen before.

```python
def training_step():
    # Clear out any existing gradients
    opt.zero_grad()

    # Get the color parameter from the RGB model instance
    color = rgb_model()

    # Stretch the single pixel to CLIP's input resolution and embed it
    color_img = resizer(color)
    image_embedding = model.encode_image(color_img)

    # Use negative cosine similarity as the loss
    loss = -1 * torch.cosine_similarity(target_embedding, image_embedding, dim=-1)

    # Compute the gradient of the loss and backpropagate it through the graph
    loss.backward()

    # Perform a parameter update on the parameters registered with the optimizer
    opt.step()
```
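
Note that `training_step` leans on a few names defined elsewhere in the notebook: `model` (the CLIP model), `resizer`, `opt`, and `rgb_model`. A rough sketch of that glue, alongside the earlier snippets and assuming the ViT-B/32 CLIP variant, might look like this:

```python
import clip
import torch
import torchvision.transforms as T

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# Load CLIP; ViT-B/32 is an assumption, the notebook may use another variant
model, _ = clip.load("ViT-B/32", device=DEVICE)

# ViT-B/32 expects 224x224 inputs, so stretch the one-pixel color
# into a full-size, solid-color image
resizer = T.Resize((224, 224))

# Run the optimization for a fixed (illustrative) number of steps
for _ in range(200):
    training_step()
```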
### Notes: `opt.zero_grad()`

We want to compute the gradient for *each* step of the training loop separately, which is the standard way of doing things, but not the only way. It turns out that PyTorch optimizers store, or *accumulate*, gradients until we flush those values out with `opt.zero_grad()`.

It may seem like this step should happen automatically after a parameter update, but there are many techniques that benefit from accumulating gradients. Making this process manual gives us a lot of transparency and flexibility in defining how models train.

### Notes: `loss.backward()`

We compute our loss tensor `loss` as the *negative cosine similarity* between the CLIP embeddings of our text prompt and of our model's current `color` parameter. With loss functions, we want something where smaller is better, which is why we negate the cosine similarity.

Once we compute the loss, we need to compute its gradient. Don't be fooled: despite the term "automatic differentiation," this doesn't actually happen automatically!

Automatic differentiation refers to the accumulation of symbolic steps that can be combined using the *chain rule* to produce the gradient of a function/tensor. Thus, calling `loss.backward()` computes the gradient with respect to the graph's leaves (in this case, the `color` parameter of our model) so the optimizer can use it.
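
If you want to see this for yourself, you could inspect the gradient that `backward()` leaves on the parameter; this check is hypothetical and not part of the original notebook:

```python
# Inside training_step, after loss.backward() but before opt.step():
# a (1, 3, 1, 1) tensor giving the direction of steepest loss increase,
# which the optimizer will step against
print(rgb_model.color.grad)
```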
### Notes: `opt.step()`

So now we have our loss, and we've computed its gradient with respect to `color`. It's time to update our color, and calling `opt.step()` does just that. If we leave this out, `color` will never change.

## What's Next?

In this post, we used CLIP to drive the direct optimization of RGB values to match text prompts. Along the way, we covered some PyTorch fundamentals, working with the `Module` class to create models and unpacking some aspects of the training process. How do we build from here?

We could iterate on this work in any number of ways. For one, we could move to optimizing more than one pixel at a time; maybe we try to directly optimize an 8×8 RGB image with CLIP (a minimal sketch of that generalization follows below). If we simply use CLIP-driven cosine similarity as our loss function, we will find that optimizing pixel values directly gives increasingly unstable results. Instead, we could swap our `RGBModel` for another image-generating mechanism. For instance, we could use the generator from a GAN and let CLIP optimize its latent vectors, implicitly capturing changes in features that extend beyond individual pixels. In fact, that appears to be the most popular approach in CLIP-guided image generation. Not sure what all of that means? Stay tuned to learn more in the next installment of this series.
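
Here is what that first idea could look like: a hypothetical `ImageModel` that reuses the `RGBModel` recipe with a larger parameter, not code from the original notebook:

```python
import torch

class ImageModel(torch.nn.Module):
    def __init__(self, device, size=8):
        super().__init__()
        # One RGB image of shape (1, 3, size, size), randomly initialized
        pixels = torch.rand(size=(1, 3, size, size), device=device)
        self.pixels = torch.nn.Parameter(pixels)

    def forward(self):
        # Same clamping trick as RGBModel, applied to every pixel
        self.pixels.data = self.pixels.data.clamp(0, 1)
        return self.pixels
```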
For now, you can [try this model live](https://huggingface.co/spaces/miccull/clip-text-to-rgb) on Hugging Face Spaces, and also read through the [code](https://huggingface.co/spaces/miccull/clip-text-to-rgb/tree/main) that drives the demo. You can also find out more about Hugging Face and Gradio [here](https://huggingface.co/docs/hub/spaces#using-spaces).
-->","yoast_head_json":{"title":"\u201cText-to-Color\u201d from Scratch with CLIP, PyTorch, and Hugging Face Spaces - Comet","description":"This is\u00a0part two\u00a0in a series on using CLIP from scratch to evaluate and manipulate images by comparing them to text prompts.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.comet.com\/site\/blog\/text-to-color-from-scratch-with-clip-pytorch-and-hugging-face-spaces\/","og_locale":"en_US","og_type":"article","og_title":"\u201cText-to-Color\u201d from Scratch with CLIP, PyTorch, and Hugging Face Spaces","og_description":"This is\u00a0part two\u00a0in a series on using CLIP from scratch to evaluate and manipulate images by comparing them to text prompts.","og_url":"https:\/\/www.comet.com\/site\/blog\/text-to-color-from-scratch-with-clip-pytorch-and-hugging-face-spaces\/","og_site_name":"Comet","article_publisher":"https:\/\/www.facebook.com\/cometdotml","article_published_time":"2022-11-14T18:44:49+00:00","article_modified_time":"2025-04-24T17:16:25+00:00","og_image":[{"url":"https:\/\/miro.medium.com\/max\/700\/1*ncJodqmMQphnD-IyY9N71A.png","type":"","width":"","height":""}],"author":"Michael Cullan","twitter_card":"summary_large_image","twitter_creator":"@Cometml","twitter_site":"@Cometml","twitter_misc":{"Written by":"Michael Cullan","Est. reading time":"8 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/www.comet.com\/site\/blog\/text-to-color-from-scratch-with-clip-pytorch-and-hugging-face-spaces\/#article","isPartOf":{"@id":"https:\/\/www.comet.com\/site\/blog\/text-to-color-from-scratch-with-clip-pytorch-and-hugging-face-spaces\/"},"author":{"name":"Team Comet Digital","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/6266601170c60a7a82b3e0043fbe8ddf"},"headline":"\u201cText-to-Color\u201d from Scratch with CLIP, PyTorch, and Hugging Face Spaces","datePublished":"2022-11-14T18:44:49+00:00","dateModified":"2025-04-24T17:16:25+00:00","mainEntityOfPage":{"@id":"https:\/\/www.comet.com\/site\/blog\/text-to-color-from-scratch-with-clip-pytorch-and-hugging-face-spaces\/"},"wordCount":1701,"publisher":{"@id":"https:\/\/www.comet.com\/site\/#organization"},"image":{"@id":"https:\/\/www.comet.com\/site\/blog\/text-to-color-from-scratch-with-clip-pytorch-and-hugging-face-spaces\/#primaryimage"},"thumbnailUrl":"https:\/\/miro.medium.com\/max\/700\/1*ncJodqmMQphnD-IyY9N71A.png","articleSection":["Tutorials"],"inLanguage":"en-US"},{"@type":"WebPage","@id":"https:\/\/www.comet.com\/site\/blog\/text-to-color-from-scratch-with-clip-pytorch-and-hugging-face-spaces\/","url":"https:\/\/www.comet.com\/site\/blog\/text-to-color-from-scratch-with-clip-pytorch-and-hugging-face-spaces\/","name":"\u201cText-to-Color\u201d from Scratch with CLIP, PyTorch, and Hugging Face Spaces - Comet","isPartOf":{"@id":"https:\/\/www.comet.com\/site\/#website"},"primaryImageOfPage":{"@id":"https:\/\/www.comet.com\/site\/blog\/text-to-color-from-scratch-with-clip-pytorch-and-hugging-face-spaces\/#primaryimage"},"image":{"@id":"https:\/\/www.comet.com\/site\/blog\/text-to-color-from-scratch-with-clip-pytorch-and-hugging-face-spaces\/#primaryimage"},"thumbnailUrl":"https:\/\/miro.medium.com\/max\/700\/1*ncJodqmMQphnD-IyY9N71A.png","datePublished":"2022-11-14T18:44:49+00:00","dateModified":"2025-04-24T17:16:25+00:00","description":"This is\u00a0part two\u00a0in a series on using 
CLIP from scratch to evaluate and manipulate images by comparing them to text prompts.","breadcrumb":{"@id":"https:\/\/www.comet.com\/site\/blog\/text-to-color-from-scratch-with-clip-pytorch-and-hugging-face-spaces\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.comet.com\/site\/blog\/text-to-color-from-scratch-with-clip-pytorch-and-hugging-face-spaces\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/blog\/text-to-color-from-scratch-with-clip-pytorch-and-hugging-face-spaces\/#primaryimage","url":"https:\/\/miro.medium.com\/max\/700\/1*ncJodqmMQphnD-IyY9N71A.png","contentUrl":"https:\/\/miro.medium.com\/max\/700\/1*ncJodqmMQphnD-IyY9N71A.png"},{"@type":"BreadcrumbList","@id":"https:\/\/www.comet.com\/site\/blog\/text-to-color-from-scratch-with-clip-pytorch-and-hugging-face-spaces\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.comet.com\/site\/"},{"@type":"ListItem","position":2,"name":"\u201cText-to-Color\u201d from Scratch with CLIP, PyTorch, and Hugging Face Spaces"}]},{"@type":"WebSite","@id":"https:\/\/www.comet.com\/site\/#website","url":"https:\/\/www.comet.com\/site\/","name":"Comet","description":"Build Better Models Faster","publisher":{"@id":"https:\/\/www.comet.com\/site\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.comet.com\/site\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/www.comet.com\/site\/#organization","name":"Comet ML, Inc.","alternateName":"Comet","url":"https:\/\/www.comet.com\/site\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/#\/schema\/logo\/image\/","url":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/01\/logo_comet_square.png","contentUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/01\/logo_comet_square.png","width":310,"height":310,"caption":"Comet ML, Inc."},"image":{"@id":"https:\/\/www.comet.com\/site\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/cometdotml","https:\/\/x.com\/Cometml","https:\/\/www.youtube.com\/channel\/UCmN63HKvfXSCS-UwVwmK8Hw"]},{"@type":"Person","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/6266601170c60a7a82b3e0043fbe8ddf","name":"Team Comet Digital","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/image\/4f0c0a8cc7c0e87c636ff6a420a6647c","url":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/08\/Screen-Shot-2023-08-12-at-8.58.50-AM-96x96.png","contentUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/08\/Screen-Shot-2023-08-12-at-8.58.50-AM-96x96.png","caption":"Team Comet 
Digital"},"sameAs":["https:\/\/www.comet.ml\/"],"url":"https:\/\/www.comet.com\/site\/blog\/author\/teamcometdigital\/"}]}},"_links":{"self":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/4572","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/users\/8"}],"replies":[{"embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/comments?post=4572"}],"version-history":[{"count":1,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/4572\/revisions"}],"predecessor-version":[{"id":15644,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/4572\/revisions\/15644"}],"wp:attachment":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/media?parent=4572"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/categories?post=4572"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/tags?post=4572"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/coauthors?post=4572"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}