{"id":4428,"date":"2022-10-31T09:35:21","date_gmt":"2022-10-31T17:35:21","guid":{"rendered":"https:\/\/live-cometml.pantheonsite.io\/?p=4428"},"modified":"2025-04-24T17:16:48","modified_gmt":"2025-04-24T17:16:48","slug":"7-optimization-methods-used-in-deep-learning","status":"publish","type":"post","link":"https:\/\/www.comet.com\/site\/blog\/7-optimization-methods-used-in-deep-learning\/","title":{"rendered":"7 Optimization Methods Used In Deep Learning"},"content":{"rendered":"\n<div class=\"ir is it iu iv\" style=\"text-align: left;\">\n<figure class=\"ko kp kq kr gx ks gl gm paragraph-image\">\n<div class=\"kt ku do kv ce kw\" tabindex=\"0\" role=\"button\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"ce kx ky c aligncenter\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/max\/700\/0*HZIz4s0KrLsr0P4k\" alt=\"\" width=\"700\" height=\"467\"><\/figure><div class=\"gl gm kn\" style=\"text-align: center;\"><picture>Photo by <\/picture><a class=\"au lc\" href=\"https:\/\/unsplash.com\/@jo_coenen?utm_source=medium&amp;utm_medium=referral\" target=\"_blank\" rel=\"noopener ugc nofollow\">Jo Coenen &#8211; Studio Dries 2.6<\/a><picture>&nbsp;on&nbsp;<\/picture><a class=\"au lc\" href=\"https:\/\/unsplash.com\/?utm_source=medium&amp;utm_medium=referral\" target=\"_blank\" rel=\"noopener ugc nofollow\">Unsplash<\/a><\/div>\n<\/div>\n<\/figure>\n<p id=\"27ac\" class=\"pw-post-body-paragraph ld le iy bm b lf lg jz lh li lj kc lk ll lm ln lo lp lq lr ls lt lu lv lw lx ir ga\" data-selectable-paragraph=\"\">Optimization plays a vital role in the development of machine learning and deep learning algorithms \u2014 without it, our model would not have the best design. The procedure refers to finding the set of input parameters or arguments to an objective function that results in the minimum or maximum output of the function \u2014 usually the minimum in a machine learning\/deep learning context.<\/p>\n<p id=\"ba83\" class=\"pw-post-body-paragraph ld le iy bm b lf lg jz lh li lj kc lk ll lm ln lo lp lq lr ls lt lu lv lw lx ir ga\" data-selectable-paragraph=\"\">In this article, we are going to cover the top seven most common optimization methods used in deep learning.<\/p>\n<h3 id=\"dd18\" class=\"ly lz iy bm ma mb mc md me mf mg mh mi ke mj kf mk kh ml ki mm kk mn kl mo mp ga\">#1 Gradient Descent<\/h3>\n<p id=\"3ce5\" class=\"pw-post-body-paragraph ld le iy bm b lf mq jz lh li mr kc lk ll ms ln lo lp mt lr ls lt mu lv lw lx ir ga\" data-selectable-paragraph=\"\">Getting a sound understanding of the inner workings of&nbsp;<a class=\"au lc\" href=\"https:\/\/heartbeat.comet.ml\/using-the-gradient-descent-algorithm-in-machine-learning-b9d2175b8012\" target=\"_blank\" rel=\"noopener ugc nofollow\">gradient descent<\/a>&nbsp;is one of the best things you could do for your career in ML\/DL. It\u2019s one of the most popular optimization algorithms and comes up constantly in the field. Gradient descent is a first-order, iterative optimization method \u2014 first-order means we calculate only the first-order derivative.<\/p>\n<p id=\"4156\" class=\"pw-post-body-paragraph ld le iy bm b lf lg jz lh li lj kc lk ll lm ln lo lp lq lr ls lt lu lv lw lx ir ga\" data-selectable-paragraph=\"\">To find the minimum of the function, steps are taken in proportion to the negative gradient of the function at its current point. 
The initial point on the function is chosen at random.

*Example of minimizing J(w); [Source: [MLextend](http://rasbt.github.io/mlxtend/user_guide/general_concepts/gradient-optimization/)]*

Gradient descent isn't perfect, though. An update to our parameters can only be made once we've iterated through the whole dataset, which makes learning extremely slow on large datasets. Various [adaptations to gradient descent](https://towardsdatascience.com/gradient-descent-811efcc9f1d5) were made to overcome this problem. They include **mini-batch gradient descent** and **stochastic gradient descent**.
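To make the procedure concrete, here is a minimal NumPy sketch (not from the original post): it minimizes a toy objective J(w) = w², whose gradient is 2w. The objective, learning rate, and step count are illustrative assumptions, not prescriptions.

```python
import numpy as np

def grad(w):
    # Gradient of the toy objective J(w) = w^2
    return 2 * w

w = np.random.randn()   # random initial point on the function
lr = 0.1                # learning rate (step size); illustrative value

for step in range(100):
    w = w - lr * grad(w)  # step in proportion to the negative gradient

print(f"w after descent: {w:.6f}")  # approaches the minimum at w = 0
```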
### #2 Momentum

The two variations of gradient descent mentioned above, stochastic and mini-batch, oscillate as they step towards the minimum of the objective function. These oscillations occur because each step introduces variance: an update to the parameters is made after iterating through every *n* instances.

*Conceptual view of how different variations of gradient descent step towards the minimum of an objective function; [Source: [i2tutorials](https://www.i2tutorials.com/explain-brief-about-mini-batch-gradient-descent/)]*

[Momentum extends gradient descent](https://heartbeat.comet.ml/exploring-optimizers-in-machine-learning-7f18d94cd65b), hence we typically refer to it as gradient descent with momentum. The technique seeks to overcome the oscillation problem by adding history to the parameter update equation. The idea behind momentum is that if we understand the direction required to reach the minimum faster, we can make our gradient steps move in that direction and dampen the oscillations in irrelevant directions.

The update term from the previous step, $V_{t-1}$, is added to the gradient step, $\nabla_\theta J(\theta)$.
How much information is carried over from historical steps depends on the value of $\gamma$, and the step size is controlled by the learning rate, $\eta$:

$$V_t = \gamma V_{t-1} + \eta \nabla_\theta J(\theta)$$

To update the parameter with momentum, we would use the following equation:

$$\theta = \theta - V_t$$

This adjustment, applied to mini-batch and stochastic gradient descent, helps to reduce the oscillations in each gradient step, which in turn speeds up convergence.

### #3 Nesterov Accelerated Gradient (NAG)

Momentum-based optimization takes steps towards the minimum based on past steps, which reduces the oscillations we see in mini-batch and stochastic gradient descent. One problem that arises when we use momentum is that we may overshoot the minimum of the objective function. This is because, as we approach the minimum, the accumulated momentum is still high.

> "When the value of momentum is high while we are near to attain convergence, then the momentum actually pushes the gradient step high and it might miss out on the actual minimum value." – **Hands-on Deep Learning Algorithms with Python, P.101**

Nesterov introduced an intelligent method to help overcome this problem. The idea behind the method is to calculate the gradient at the position where the momentum is about to take us, instead of calculating the gradient at the current position.

> "Thus, before making gradient step with momentum and reaching a new position, if we understand which position the momentum will take us to, then we can avoid overshooting the minimum value." – [**Hands-on Deep Learning Algorithms with Python, P.102**](https://www.amazon.com/Hands-Deep-Learning-Algorithms-Python/dp/1789344158)

Upon learning that momentum is going to push us past the minimum value, we can reduce the speed of momentum to try to land on the minimum.

To identify the position we would arrive at with momentum, we calculate the gradients with respect to the approximate position of our next gradient step (the lookahead position):
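As a sketch of how the two equations above translate into code, here is an illustrative NumPy loop on the same toy objective J(w) = w²; the values of `gamma` and the learning rate are assumptions for demonstration only.

```python
import numpy as np

def grad(w):
    # Gradient of the toy objective J(w) = w^2
    return 2 * w

w, v = np.random.randn(), 0.0
gamma, lr = 0.9, 0.01   # momentum coefficient and learning rate; illustrative

for step in range(200):
    v = gamma * v + lr * grad(w)  # V_t = gamma * V_{t-1} + eta * grad J(theta)
    w = w - v                     # theta = theta - V_t

print(f"w: {w:.6f}")
```

The only change from plain gradient descent is that the step now blends the current gradient with the accumulated history in `v`.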
$$\nabla_\theta J(\theta - \gamma V_{t-1})$$

We can rewrite the $V_t$ equation presented in momentum as follows:

$$V_t = \gamma V_{t-1} + \eta \nabla_\theta J(\theta - \gamma V_{t-1})$$

The parameter update step remains the same.
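Below is a minimal NAG sketch on the same toy objective, assuming the standard lookahead formulation above; the hyperparameter values are illustrative.

```python
import numpy as np

def grad(w):
    # Gradient of the toy objective J(w) = w^2
    return 2 * w

w, v = np.random.randn(), 0.0
gamma, lr = 0.9, 0.01   # illustrative values

for step in range(200):
    lookahead = w - gamma * v            # where momentum is about to take us
    v = gamma * v + lr * grad(lookahead) # gradient evaluated at the lookahead
    w = w - v                            # parameter update step is unchanged

print(f"w: {w:.6f}")
```

Note that the single difference from plain momentum is where the gradient is evaluated.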
### #4 Adaptive Gradient (AdaGrad)

Adaptive gradient uses parameter-specific learning rates, which are continually adapted according to how frequently each parameter is updated during training.

> "Parameters that have frequent updates or high gradients will have a slower learning rate, while a parameter that has an infrequent update or small gradients will have a higher learning rate. [...] parameters that have infrequent updates implies that they are not trained enough, so we set a high learning rate for them, and parameters that have frequent updates implies they are trained enough, so we set their learning rate to a low value so we don't overshoot the minimum." – **Hands-on Deep Learning Algorithms with Python, P.103**

Let's see how this looks from a mathematical standpoint. For simplicity's sake, we are going to represent the gradient as $g$. Thus, the gradient of a parameter, $\theta_i$, at an iteration, $t$, can be represented as:

$$g_{t,i} = \nabla_\theta J(\theta_{t,i})$$

Our update equation can be rewritten as:
aligncenter\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/max\/357\/1*aRJPtJ8hv2cYPfhph-R7ig.png\" alt=\"\" width=\"357\" height=\"97\"><\/figure><div class=\"gl gm og\" style=\"text-align: center;\"><picture><source srcset=\"https:\/\/miro.medium.com\/max\/640\/1*aRJPtJ8hv2cYPfhph-R7ig.png 640w, https:\/\/miro.medium.com\/max\/720\/1*aRJPtJ8hv2cYPfhph-R7ig.png 720w, https:\/\/miro.medium.com\/max\/750\/1*aRJPtJ8hv2cYPfhph-R7ig.png 750w, https:\/\/miro.medium.com\/max\/786\/1*aRJPtJ8hv2cYPfhph-R7ig.png 786w, https:\/\/miro.medium.com\/max\/828\/1*aRJPtJ8hv2cYPfhph-R7ig.png 828w, https:\/\/miro.medium.com\/max\/1100\/1*aRJPtJ8hv2cYPfhph-R7ig.png 1100w, https:\/\/miro.medium.com\/max\/714\/1*aRJPtJ8hv2cYPfhph-R7ig.png 714w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 357px\" data-testid=\"og\">Revised update equation<\/picture><\/div>\n<\/figure>\n<p id=\"1e8e\" class=\"pw-post-body-paragraph ld le iy bm b lf lg jz lh li lj kc lk ll lm ln lo lp lq lr ls lt lu lv lw lx ir ga\" data-selectable-paragraph=\"\">For each iteration,&nbsp;<em class=\"mx\">t<\/em>, updating a parameter,<em class=\"mx\">&nbsp;\u03b8i,&nbsp;<\/em>involves diving the learning rate by the sum of squares of all previous gradients of the parameter<em class=\"mx\">, \u03b8i&nbsp;<\/em>(epsilon is added to avoid division by zero error):<\/p>\n<figure class=\"ko kp kq kr gx ks gl gm paragraph-image\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"ce kx ky c aligncenter\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/max\/546\/1*awzt0--DqzAjFaxej-YFAA.png\" alt=\"\" width=\"546\" height=\"161\"><\/figure><div class=\"gl gm oh\" style=\"text-align: center;\"><picture><source srcset=\"https:\/\/miro.medium.com\/max\/640\/1*awzt0--DqzAjFaxej-YFAA.png 640w, https:\/\/miro.medium.com\/max\/720\/1*awzt0--DqzAjFaxej-YFAA.png 720w, https:\/\/miro.medium.com\/max\/750\/1*awzt0--DqzAjFaxej-YFAA.png 750w, https:\/\/miro.medium.com\/max\/786\/1*awzt0--DqzAjFaxej-YFAA.png 786w, https:\/\/miro.medium.com\/max\/828\/1*awzt0--DqzAjFaxej-YFAA.png 828w, https:\/\/miro.medium.com\/max\/1100\/1*awzt0--DqzAjFaxej-YFAA.png 1100w, https:\/\/miro.medium.com\/max\/1092\/1*awzt0--DqzAjFaxej-YFAA.png 1092w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 546px\" data-testid=\"og\">Adaptive gradient<\/picture><\/div>\n<\/figure>\n<p id=\"fc7c\" class=\"pw-post-body-paragraph ld le iy bm b lf lg jz lh li lj kc lk ll lm ln lo lp lq lr ls lt lu lv lw lx ir ga\" data-selectable-paragraph=\"\">When the sum of squares past gradients is a large value, our learning rate will be scaled to a smaller number. 
### #5 AdaDelta

Adadelta seeks to improve on the Adagrad optimization method. Recall that in Adagrad all the past squared gradients are summed together. This means that the sum grows on every iteration, and when the sum of squared past gradients is high, we divide the learning rate by a large number, which causes the learning rate to decay. When the learning rate becomes very small, convergence takes longer.

> "Instead of accumulating all past squared gradients, Adadelta restricts the window of accumulated past gradients to some fixed size *w*." – [**An Overview of Gradient Descent Optimization Algorithms**](https://ruder.io/optimizing-gradient-descent/index.html#gradientdescentoptimizationalgorithms)**, Sebastian Ruder**

To avoid the inefficient computation of squaring and storing the gradients from the window, *w*, on each iteration, we take an exponentially decaying running average of the squared gradients:
data-selectable-paragraph=\"\">The \u03b3 we saw in momentum is similar to the one we see in this equation \u2014 it is used to decide how much information from the previous running average of gradients should be added. In Adadelta, it\u2019s referred to as the&nbsp;<strong class=\"bm mw\">exponential decaying rate<\/strong>.<\/p>\n<p id=\"01ce\" class=\"pw-post-body-paragraph ld le iy bm b lf lg jz lh li lj kc lk ll lm ln lo lp lq lr ls lt lu lv lw lx ir ga\" data-selectable-paragraph=\"\">Our final update equation is:<\/p>\n<figure class=\"ko kp kq kr gx ks gl gm paragraph-image\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"ce kx ky c aligncenter\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/max\/554\/1*A6RuZ8lEb7TegxH_s0QBrA.png\" alt=\"\" width=\"554\" height=\"136\"><\/figure><div class=\"gl gm oj\" style=\"text-align: center;\"><picture><source srcset=\"https:\/\/miro.medium.com\/max\/640\/1*A6RuZ8lEb7TegxH_s0QBrA.png 640w, https:\/\/miro.medium.com\/max\/720\/1*A6RuZ8lEb7TegxH_s0QBrA.png 720w, https:\/\/miro.medium.com\/max\/750\/1*A6RuZ8lEb7TegxH_s0QBrA.png 750w, https:\/\/miro.medium.com\/max\/786\/1*A6RuZ8lEb7TegxH_s0QBrA.png 786w, https:\/\/miro.medium.com\/max\/828\/1*A6RuZ8lEb7TegxH_s0QBrA.png 828w, https:\/\/miro.medium.com\/max\/1100\/1*A6RuZ8lEb7TegxH_s0QBrA.png 1100w, https:\/\/miro.medium.com\/max\/1108\/1*A6RuZ8lEb7TegxH_s0QBrA.png 1108w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 554px\" data-testid=\"og\">Final update equation<\/picture><\/div>\n<\/figure>\n<blockquote class=\"nb nc nd\"><p id=\"5507\" class=\"ld le mx bm b lf lg jz lh li lj kc lk ne lm ln lo nf lq lr ls ng lu lv lw lx ir ga\" data-selectable-paragraph=\"\"><strong class=\"bm mw\">Note<\/strong>: See&nbsp;<a class=\"au lc\" href=\"https:\/\/ruder.io\/optimizing-gradient-descent\/index.html#gradientdescentoptimizationalgorithms\" target=\"_blank\" rel=\"noopener ugc nofollow\"><strong class=\"bm mw\">An Overview of Gradient Descent Optimization Algorithms<\/strong><\/a>, for the mathematical proof.<\/p><\/blockquote>\n<p>&nbsp;<\/p>\n<h3 id=\"aadf\" class=\"ly lz iy bm ma mb mc md me mf mg mh mi ke mj kf mk kh ml ki mm kk mn kl mo mp ga\">#6 RMSProp<\/h3>\n<p id=\"8c6e\" class=\"pw-post-body-paragraph ld le iy bm b lf mq jz lh li mr kc lk ll ms ln lo lp mt lr ls lt mu lv lw lx ir ga\" data-selectable-paragraph=\"\">RMSprop was also introduced to combat the decaying learning rate problem faced in Adagrad. 
### #6 RMSProp

RMSprop was also introduced to combat the decaying learning rate problem faced in Adagrad. Like Adadelta, it uses the exponentially decaying running average of squared gradients:

$$E[g^2]_t = \gamma E[g^2]_{t-1} + (1-\gamma) g_t^2$$

We then divide the learning rate, $\eta$, by the square root of this running average (0.9 is the recommended value for the decay rate $\gamma$):

$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}} \, g_t$$
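As a sketch, the same toy loop with the RMSprop update; the decay rate of 0.9 follows the recommendation above, while the learning rate of 0.001 and other values are illustrative assumptions.

```python
import numpy as np

def grad(w):
    # Gradient of the toy objective J(w) = w^2
    return 2 * w

w = np.random.randn()
avg_sq_grad = 0.0
gamma, lr, eps = 0.9, 0.001, 1e-8  # decay rate, learning rate, stabilizer

for step in range(2000):
    g = grad(w)
    avg_sq_grad = gamma * avg_sq_grad + (1 - gamma) * g ** 2
    w = w - lr / np.sqrt(avg_sq_grad + eps) * g

print(f"w: {w:.6f}")
```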
### #7 Adaptive Moment Estimation (Adam)

Adam is probably the most widely used optimization algorithm for neural networks. It combines RMSprop and momentum: it not only stores the exponentially decaying average of past squared gradients (as we see in RMSprop and Adadelta), $V_t$, but also an exponentially decaying average of the past gradients, $M_t$, similar to momentum.

> "Whereas momentum can be seen as a ball running down a slope, Adam behaves like a heavy ball with friction, which thus prefers flat minima in the error surface [15]." – **An Overview of Gradient Descent Optimization Algorithms, Sebastian Ruder**

We compute the decaying averages of the past, $M_t$, and past squared, $V_t$, gradients as follows:

$$M_t = \beta_1 M_{t-1} + (1-\beta_1) g_t$$

$$V_t = \beta_2 V_{t-1} + (1-\beta_2) g_t^2$$

The first moment (mean), $M_t$, and second moment (uncentered variance), $V_t$, are both estimates of the gradients' moments, hence the name of the method.
data-selectable-paragraph=\"\">\u201cWhen the initial estimates are set to 0, they remain very small, even after many iterations. This means that they would be biased towards 0, especially when&nbsp;<em class=\"iy\">\u03b21<\/em>&nbsp;and&nbsp;<em class=\"iy\">\u03b22 are close to 1.\u201d \u2014&nbsp;<\/em><strong class=\"bm mw\">Hands-on Deep Learning Algorithms with Python,<\/strong>&nbsp;<strong class=\"bm mw\">P.112<\/strong><\/p><\/blockquote>\n<p id=\"4951\" class=\"pw-post-body-paragraph ld le iy bm b lf lg jz lh li lj kc lk ll lm ln lo lp lq lr ls lt lu lv lw lx ir ga\" data-selectable-paragraph=\"\">We can counteract the biases by computing the bias-corrected first and second-moment estimates:<\/p>\n<figure class=\"ko kp kq kr gx ks gl gm paragraph-image\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"ce kx ky c aligncenter\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/max\/321\/1*OqCNBcWqZ-q-i3RbW7H2MQ.png\" alt=\"\" width=\"321\" height=\"213\"><\/figure><div class=\"gl gm om\" style=\"text-align: center;\"><picture><source srcset=\"https:\/\/miro.medium.com\/max\/640\/1*OqCNBcWqZ-q-i3RbW7H2MQ.png 640w, https:\/\/miro.medium.com\/max\/720\/1*OqCNBcWqZ-q-i3RbW7H2MQ.png 720w, https:\/\/miro.medium.com\/max\/750\/1*OqCNBcWqZ-q-i3RbW7H2MQ.png 750w, https:\/\/miro.medium.com\/max\/786\/1*OqCNBcWqZ-q-i3RbW7H2MQ.png 786w, https:\/\/miro.medium.com\/max\/828\/1*OqCNBcWqZ-q-i3RbW7H2MQ.png 828w, https:\/\/miro.medium.com\/max\/1100\/1*OqCNBcWqZ-q-i3RbW7H2MQ.png 1100w, https:\/\/miro.medium.com\/max\/642\/1*OqCNBcWqZ-q-i3RbW7H2MQ.png 642w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 321px\" data-testid=\"og\">Bias corrected first and second moments<\/picture><\/div>\n<\/figure>\n<p id=\"58ad\" class=\"pw-post-body-paragraph ld le iy bm b lf lg jz lh li lj kc lk ll lm ln lo lp lq lr ls lt lu lv lw lx ir ga\" data-selectable-paragraph=\"\">Our final update parameter is given as:<\/p>\n<figure class=\"ko kp kq kr gx ks gl gm paragraph-image\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"ce kx ky c aligncenter\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/max\/430\/1*J2rqKAsmlvZ00ePVqfcfKQ.png\" alt=\"\" width=\"430\" height=\"133\"><\/figure><div class=\"gl gm on\" style=\"text-align: center;\"><picture><source srcset=\"https:\/\/miro.medium.com\/max\/640\/1*J2rqKAsmlvZ00ePVqfcfKQ.png 640w, https:\/\/miro.medium.com\/max\/720\/1*J2rqKAsmlvZ00ePVqfcfKQ.png 720w, https:\/\/miro.medium.com\/max\/750\/1*J2rqKAsmlvZ00ePVqfcfKQ.png 750w, https:\/\/miro.medium.com\/max\/786\/1*J2rqKAsmlvZ00ePVqfcfKQ.png 786w, https:\/\/miro.medium.com\/max\/828\/1*J2rqKAsmlvZ00ePVqfcfKQ.png 828w, https:\/\/miro.medium.com\/max\/1100\/1*J2rqKAsmlvZ00ePVqfcfKQ.png 1100w, https:\/\/miro.medium.com\/max\/860\/1*J2rqKAsmlvZ00ePVqfcfKQ.png 860w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 
In this article, we covered gradient descent, momentum, NAG, Adagrad, Adadelta, RMSprop, and Adam. This is by no means an exhaustive list, but it serves as a good foundation for anyone interested in learning more about the optimizers used in deep learning. For a more in-depth look at the optimizers discussed, I highly recommend [*An Overview of Gradient Descent Optimization Algorithms*](https://ruder.io/optimizing-gradient-descent/index.html#gradientdescentoptimizationalgorithms) by Sebastian Ruder.