{"id":4584,"date":"2022-11-10T17:49:59","date_gmt":"2022-11-11T01:49:59","guid":{"rendered":"https:\/\/live-cometml.pantheonsite.io\/?p=4584"},"modified":"2025-04-24T17:16:34","modified_gmt":"2025-04-24T17:16:34","slug":"vanishing-exploding-gradients-in-deep-neural-networks","status":"publish","type":"post","link":"https:\/\/www.comet.com\/site\/blog\/vanishing-exploding-gradients-in-deep-neural-networks\/","title":{"rendered":"Vanishing\/Exploding Gradients in Deep Neural Networks"},"content":{"rendered":"\n<div class=\"ir is it iu iv\">\n<p id=\"3ce5\" class=\"pw-post-body-paragraph jv jw iy bm b jx jy jz ka kb kc kd ke kf kg kh ki kj kk kl km kn ko kp kq kr ir ga\" data-selectable-paragraph=\"\">Building a Neural Network model can be very complicated, and tuning it can make things even more confusing. One of the most common problems when working with Deep Neural Networks is the Vanishing and\/or Exploding Gradient problem. In order to prevent this from happening, one solution is careful weight initialization.<\/p>\n<p id=\"435a\" class=\"pw-post-body-paragraph jv jw iy bm b jx jy jz ka kb kc kd ke kf kg kh ki kj kk kl km kn ko kp kq kr ir ga\" data-selectable-paragraph=\"\">Initializing weights in Neural Networks helps to prevent layer activation outputs from Vanishing or Exploding during the forward pass. If either occurs, the loss gradients become either too small or too large, and the network requires more time to converge.<\/p>\n<h1 id=\"badd\" class=\"ks kt iy bm ku kv kw kx ky kz la lb lc ld le lf lg lh li lj lk ll lm ln lo lp ga\" data-selectable-paragraph=\"\">Background on Deep Neural Networks:<\/h1>\n<p id=\"e0ec\" class=\"pw-post-body-paragraph jv jw iy bm b jx lq jz ka kb lr kd ke kf ls kh ki kj lt kl km kn lu kp kq kr ir ga\" data-selectable-paragraph=\"\">A&nbsp;<strong class=\"bm lv\">Neural Network<\/strong>&nbsp;is modeled on a network of biological neurons. 
In Artificial Intelligence, a Deep Neural Network is built from artificial neurons, or nodes.<\/p>\n<p id=\"3f2c\" class=\"pw-post-body-paragraph jv jw iy bm b jx jy jz ka kb kc kd ke kf kg kh ki kj kk kl km kn ko kp kq kr ir ga\" data-selectable-paragraph=\"\"><a class=\"au lw\" href=\"https:\/\/heartbeat.comet.ml\/deep-learning-how-it-works-ace144f750db\" target=\"_blank\" rel=\"noopener ugc nofollow\"><strong class=\"bm lv\">Deep Neural Networks<\/strong><\/a>&nbsp;are made up of nodes organized into three different types of layers: an input layer, one or more hidden layers, and an output layer. Each node is connected to other nodes and is where computation happens.<\/p>\n<ol class=\"\">\n<li id=\"d70e\" class=\"lx ly iy bm b jx jy kb kc kf lz kj ma kn mb kr mc md me mf ga\" data-selectable-paragraph=\"\"><strong class=\"bm lv\">Input Layer<\/strong>&nbsp;\u2014 receives the input data<\/li>\n<li id=\"edfa\" class=\"lx ly iy bm b jx mg kb mh kf mi kj mj kn mk kr mc md me mf ga\" data-selectable-paragraph=\"\"><strong class=\"bm lv\">Hidden Layer(s)<\/strong>&nbsp;\u2014 perform mathematical computations on the input data<\/li>\n<li id=\"2d8e\" class=\"lx ly iy bm b jx mg kb mh kf mi kj mj kn mk kr mc md me mf ga\" data-selectable-paragraph=\"\"><strong class=\"bm lv\">Output Layer<\/strong>&nbsp;\u2014 returns the output data.<\/li>\n<\/ol>\n<p id=\"5c95\" class=\"pw-post-body-paragraph jv jw iy bm b jx jy jz ka kb kc kd ke kf kg kh ki kj kk kl km kn ko kp kq kr ir ga\" data-selectable-paragraph=\"\">The nodes in a Neural Network hold parameters called weights, which are used to calculate a weighted sum of the inputs.<\/p>\n<p id=\"92db\" class=\"pw-post-body-paragraph jv jw iy bm b jx jy jz ka kb kc kd ke kf kg kh ki kj kk kl km kn ko kp kq kr ir ga\" data-selectable-paragraph=\"\"><strong class=\"bm lv\">Weight<\/strong>&nbsp;controls the strength of the connection between two neurons. 
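<\/p>
<p>The weighted sum described above can be sketched in plain Python; the sigmoid squashing at the end and all names here are illustrative additions, not taken from the original post:<\/p>

```python
import math

def neuron_output(inputs, weights, bias):
    # Weighted sum of the inputs plus the bias term...
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    # ...squashed by a sigmoid activation into the range (0, 1)
    return 1 / (1 + math.exp(-z))

# Two inputs feeding a single neuron
print(neuron_output(inputs=[0.5, -1.0], weights=[0.8, 0.2], bias=1.0))
```

<p>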
The weight is a big factor in deciding how much influence the input has on the output.<\/p>\n<p id=\"09db\" class=\"pw-post-body-paragraph jv jw iy bm b jx jy jz ka kb kc kd ke kf kg kh ki kj kk kl km kn ko kp kq kr ir ga\" data-selectable-paragraph=\"\"><strong class=\"bm lv\">Bias&nbsp;<\/strong>guarantees that there will always be activation in the neurons, even if the input is 0. The bias always has a value of 1 and acts as an additional input into the next layer.<\/p>\n<h1 id=\"0400\" class=\"ks kt iy bm ku kv kw kx ky kz la lb lc ld le lf lg lh li lj lk ll lm ln lo lp ga\" data-selectable-paragraph=\"\">Feedforward and Backpropagation<\/h1>\n<p id=\"89d5\" class=\"pw-post-body-paragraph jv jw iy bm b jx lq jz ka kb lr kd ke kf ls kh ki kj lt kl km kn lu kp kq kr ir ga\" data-selectable-paragraph=\"\">A&nbsp;<strong class=\"bm lv\">Cost Function<\/strong>&nbsp;is a mathematical formula used to calculate the error: the difference between our predicted value and the actual value. Ideally, we would want a Cost Function of 0, telling us that our outputs are the same as the data set outputs.<\/p>\n<p id=\"0e86\" class=\"pw-post-body-paragraph jv jw iy bm b jx jy jz ka kb kc kd ke kf kg kh ki kj kk kl km kn ko kp kq kr ir ga\" data-selectable-paragraph=\"\">Neural Network models use a Cost Function optimization algorithm called Stochastic Gradient Descent. Its aim is to minimize the Cost Function by incrementally changing the weights of the network, aiming to produce a set of weights that is capable of making useful predictions. 
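<\/p>
<p>A single Gradient Descent update of this kind can be sketched as follows; the learning rate and the gradient values are illustrative, not from the original post:<\/p>

```python
def sgd_step(weights, gradients, learning_rate=0.1):
    # Nudge every weight a small step against its cost gradient
    return [w - learning_rate * g for w, g in zip(weights, gradients)]

weights = [0.5, -0.3]      # a starting point in weight space
gradients = [0.2, -0.4]    # dCost/dw for each weight
print(sgd_step(weights, gradients))
```

<p>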
In order to begin the optimization process, the algorithm requires a starting point in the space of possible weight values.<\/p>\n<p id=\"462b\" class=\"pw-post-body-paragraph jv jw iy bm b jx jy jz ka kb kc kd ke kf kg kh ki kj kk kl km kn ko kp kq kr ir ga\" data-selectable-paragraph=\"\"><strong class=\"bm lv\">Feedforward Network<\/strong>&nbsp;is the process in which the output of one neuron becomes the input of the next neuron, as the information always moves in one direction (forward).<\/p>\n<p id=\"8d37\" class=\"pw-post-body-paragraph jv jw iy bm b jx jy jz ka kb kc kd ke kf kg kh ki kj kk kl km kn ko kp kq kr ir ga\" data-selectable-paragraph=\"\">In Neural Networks, there is also a process called Backpropagation, often abbreviated as \u201cbackprop.\u201d&nbsp;<strong class=\"bm lv\">Backpropagation<\/strong>&nbsp;is the messenger that tells the Neural Network whether it made a mistake when it made a prediction.<\/p>\n<p id=\"2f47\" class=\"pw-post-body-paragraph jv jw iy bm b jx jy jz ka kb kc kd ke kf kg kh ki kj kk kl km kn ko kp kq kr ir ga\" data-selectable-paragraph=\"\">Backpropagation goes through these steps:<\/p>\n<ol class=\"\">\n<li id=\"cb81\" class=\"lx ly iy bm b jx jy kb kc kf lz kj ma kn mb kr mc md me mf ga\" data-selectable-paragraph=\"\">The Neural Network makes a guess about the data<\/li>\n<li id=\"dc89\" class=\"lx ly iy bm b jx mg kb mh kf mi kj mj kn mk kr mc md me mf ga\" data-selectable-paragraph=\"\">The error of that guess is measured with a loss function<\/li>\n<li id=\"af22\" class=\"lx ly iy bm b jx mg kb mh kf mi kj mj kn mk kr mc md me mf ga\" data-selectable-paragraph=\"\">The error is backpropagated so the weights can be adjusted and corrected<\/li>\n<\/ol>\n<h1 id=\"3c52\" class=\"ks kt iy bm ku kv kw kx ky kz la lb lc ld le lf lg lh li lj lk ll lm ln lo lp ga\" data-selectable-paragraph=\"\">The Process:<\/h1>\n<p id=\"2ccc\" class=\"pw-post-body-paragraph jv jw iy bm b jx lq jz ka kb lr kd ke kf ls kh ki kj lt kl km kn lu kp kq kr ir 
ga\" data-selectable-paragraph=\"\">As the input features are propagated through the various hidden layers, each with the same or different activation functions, we produce a set of predicted probabilities.<\/p>\n<p id=\"f408\" class=\"pw-post-body-paragraph jv jw iy bm b jx jy jz ka kb kc kd ke kf kg kh ki kj kk kl km kn ko kp kq kr ir ga\" data-selectable-paragraph=\"\">The backpropagation algorithm then moves from the output layer towards the input layer, calculating error gradients along the way.<\/p>\n<\/div>\n\n\n\n<div class=\"o dx ml mm id mn\" role=\"separator\"><\/div>\n\n\n\n<div class=\"ir is it iu iv\">\n<blockquote class=\"ms\"><p id=\"208d\" class=\"mt mu iy bm mv mw mx my mz na nb kr cn\" data-selectable-paragraph=\"\">Most projects fail before they get to production.&nbsp;<a class=\"au lw\" href=\"https:\/\/go.comet.ml\/ebook-Building-Effective-ML-Teams.html\" target=\"_blank\" rel=\"noopener ugc nofollow\">Check out our free ebook<\/a>&nbsp;to learn how to implement an MLOps lifecycle to better monitor, train, and deploy your machine learning models to increase output and iteration.<\/p><\/blockquote>\n<\/div>\n\n\n\n<div class=\"o dx ml mm id mn\" role=\"separator\"><\/div>\n\n\n\n<div class=\"ir is it iu iv\">\n<p id=\"5ca2\" class=\"pw-post-body-paragraph jv jw iy bm b jx jy jz ka kb kc kd ke kf kg kh ki kj kk kl km kn ko kp kq kr ir ga\" data-selectable-paragraph=\"\">Backpropagation computes the gradient of the Cost Function with respect to each parameter: the weights and biases. 
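<\/p>
<p>For a toy example, these gradients can be computed by hand for a single linear neuron; the squared-error cost and all names here are illustrative assumptions, since the post does not fix a specific cost function:<\/p>

```python
# One linear neuron with squared-error cost: C = 0.5 * (w*x + b - y)**2
def gradients(w, b, x, y):
    error = (w * x + b) - y    # dC/d(prediction)
    return error * x, error    # (dC/dw, dC/db) by the chain rule

dC_dw, dC_db = gradients(w=0.5, b=0.1, x=2.0, y=1.0)
print(dC_dw, dC_db)
```

<p>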
The algorithm then takes Gradient Descent steps towards the minimum cost and updates the value of each parameter in the Neural Network using these gradients.<\/p>\n<figure class=\"nd ne nf ng gx nh gl gm paragraph-image\">\n<div class=\"ni nj do nk ce nl\" tabindex=\"0\" role=\"button\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"ce nm nn c aligncenter\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/max\/700\/0*FrY8ou2OPu0w0wGp\" alt=\"\" width=\"700\" height=\"409\"><\/figure><div class=\"gl gm nc\" style=\"text-align: center;\"><picture><source srcset=\"https:\/\/miro.medium.com\/max\/640\/0*FrY8ou2OPu0w0wGp 640w, https:\/\/miro.medium.com\/max\/720\/0*FrY8ou2OPu0w0wGp 720w, https:\/\/miro.medium.com\/max\/750\/0*FrY8ou2OPu0w0wGp 750w, https:\/\/miro.medium.com\/max\/786\/0*FrY8ou2OPu0w0wGp 786w, https:\/\/miro.medium.com\/max\/828\/0*FrY8ou2OPu0w0wGp 828w, https:\/\/miro.medium.com\/max\/1100\/0*FrY8ou2OPu0w0wGp 1100w, https:\/\/miro.medium.com\/max\/1400\/0*FrY8ou2OPu0w0wGp 1400w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\" data-testid=\"og\">Source: <\/picture><a class=\"au lw\" href=\"https:\/\/www.researchgate.net\/figure\/Schematic-diagram-of-backpropagation-training-algorithm-and-typical-neuron-model_fig2_275721804\" target=\"_blank\" rel=\"noopener ugc nofollow\">researchgate<\/a><\/div>\n<\/div>\n<\/figure>\n<h1 id=\"ebe7\" class=\"ks kt iy bm ku kv kw kx ky kz la lb lc ld le lf lg lh li lj lk ll lm ln lo lp ga\" data-selectable-paragraph=\"\">Why do the Gradients Vanish or 
Explode?<\/h1>\n<h1 id=\"53ce\" class=\"ks kt iy bm ku kv kw kx ky kz la lb lc ld le lf lg lh li lj lk ll lm ln lo lp ga\" data-selectable-paragraph=\"\">Vanish<\/h1>\n<p id=\"9b33\" class=\"pw-post-body-paragraph jv jw iy bm b jx lq jz ka kb lr kd ke kf ls kh ki kj lt kl km kn lu kp kq kr ir ga\" data-selectable-paragraph=\"\"><strong class=\"bm lv\">Vanishing<\/strong>&nbsp;happens when, as backpropagation proceeds, the gradients get smaller and smaller, gradually approaching zero. This leaves the weights of the initial or lower layers nearly unchanged, so Gradient Descent never converges to the optimum.<\/p>\n<p id=\"2fbe\" class=\"pw-post-body-paragraph jv jw iy bm b jx jy jz ka kb kc kd ke kf kg kh ki kj kk kl km kn ko kp kq kr ir ga\" data-selectable-paragraph=\"\">For example, Activation Functions such as the sigmoid function have a very prominent difference between the variance of their inputs and outputs. They shrink and transform a large input space into a smaller output space, which lies between [0,1].<\/p>\n<p id=\"8ff9\" class=\"pw-post-body-paragraph jv jw iy bm b jx jy jz ka kb kc kd ke kf kg kh ki kj kk kl km kn ko kp kq kr ir ga\" data-selectable-paragraph=\"\">Looking at the graph of the Sigmoid Function below, we can see that for large inputs, regardless of whether they are negative or positive, the output saturates at either 0 or 1. In these saturated regions the curve is nearly flat, so when Backpropagation runs, there is almost no gradient to propagate backward in the Neural Network. 
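<\/p>
<p>This saturation can be checked numerically: the sigmoid's derivative peaks at 0.25 and collapses towards zero for large inputs. A small illustrative sketch (names are ours, not from the post):<\/p>

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def sigmoid_gradient(z):
    s = sigmoid(z)
    return s * (1 - s)   # derivative of the sigmoid

# The gradient is largest at z = 0 and shrinks rapidly as |z| grows
for z in [0, 2, 10, 20]:
    print(z, sigmoid_gradient(z))
```

<p>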
The little gradient that does exist keeps diluting as the algorithm processes through the top layers, leaving nothing for the lower layers.<\/p>\n<figure class=\"nd ne nf ng gx nh gl gm paragraph-image\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"ce nm nn c aligncenter\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/max\/640\/0*pv_pqIsvvG_8IkvD\" alt=\"\" width=\"640\" height=\"480\"><\/figure><div class=\"gl gm nr\" style=\"text-align: center;\"><picture><source srcset=\"https:\/\/miro.medium.com\/max\/640\/0*pv_pqIsvvG_8IkvD 640w, https:\/\/miro.medium.com\/max\/720\/0*pv_pqIsvvG_8IkvD 720w, https:\/\/miro.medium.com\/max\/750\/0*pv_pqIsvvG_8IkvD 750w, https:\/\/miro.medium.com\/max\/786\/0*pv_pqIsvvG_8IkvD 786w, https:\/\/miro.medium.com\/max\/828\/0*pv_pqIsvvG_8IkvD 828w, https:\/\/miro.medium.com\/max\/1100\/0*pv_pqIsvvG_8IkvD 1100w, https:\/\/miro.medium.com\/max\/1280\/0*pv_pqIsvvG_8IkvD 1280w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 640px\" data-testid=\"og\">Source: <\/picture><a class=\"au lw\" href=\"https:\/\/machinelearningjourney.com\/index.php\/2020\/08\/07\/vanishing-and-exploding-gradients\/\" target=\"_blank\" rel=\"noopener ugc nofollow\">Machine Learning Journey<\/a><\/div>\n<\/figure>\n<h1 id=\"733c\" class=\"ks kt iy bm ku kv kw kx ky kz la lb lc ld le lf lg lh li lj lk ll lm ln lo lp ga\" data-selectable-paragraph=\"\">Explode<\/h1>\n<p id=\"2a67\" class=\"pw-post-body-paragraph jv jw iy bm b jx lq jz ka kb lr kd ke kf ls kh ki kj lt kl km kn lu kp kq 
kr ir ga\" data-selectable-paragraph=\"\"><strong class=\"bm lv\">Exploding<\/strong>&nbsp;is the opposite of Vanishing: the gradient continues to get larger, which causes large weight updates and makes Gradient Descent diverge.<\/p>\n<p id=\"2de9\" class=\"pw-post-body-paragraph jv jw iy bm b jx jy jz ka kb kc kd ke kf kg kh ki kj kk kl km kn ko kp kq kr ir ga\" data-selectable-paragraph=\"\">Exploding gradients occur due to the weights in the Neural Network, not the activation function.<\/p>\n<p id=\"e5bc\" class=\"pw-post-body-paragraph jv jw iy bm b jx jy jz ka kb kc kd ke kf kg kh ki kj kk kl km kn ko kp kq kr ir ga\" data-selectable-paragraph=\"\">The gradient linked to each weight in the Neural Network is equal to a product of numbers. If this product contains many values greater than one, the gradients can become too large.<\/p>\n<p id=\"7fc9\" class=\"pw-post-body-paragraph jv jw iy bm b jx jy jz ka kb kc kd ke kf kg kh ki kj kk kl km kn ko kp kq kr ir ga\" data-selectable-paragraph=\"\">The weights in the lower layers of the Neural Network are more likely to be affected by Exploding Gradients, as their associated gradients are products of more values. 
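<\/p>
<p>A quick sketch of this compounding effect, assuming each of 30 layers contributes an illustrative factor of 1.5 to the product (a factor below 1 would instead shrink the gradient towards zero):<\/p>

```python
# Backpropagating through 30 layers whose per-layer factor is 1.5:
# the factors multiply, so the gradient reaching the lowest layer explodes.
factor = 1.5
gradient = 1.0
for _ in range(30):
    gradient *= factor
print(gradient)
```

<p>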
This leads to the gradients of the lower layers being more unstable, causing the algorithm to diverge.<\/p>\n<figure class=\"nd ne nf ng gx nh gl gm paragraph-image\">\n<div class=\"ni nj do nk ce nl\" tabindex=\"0\" role=\"button\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"ce nm nn c aligncenter\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/max\/700\/0*pq5wlxZW4zvD9iJH\" alt=\"\" width=\"700\" height=\"390\"><\/figure><div class=\"gl gm ns\" style=\"text-align: center;\"><picture><source srcset=\"https:\/\/miro.medium.com\/max\/640\/0*pq5wlxZW4zvD9iJH 640w, https:\/\/miro.medium.com\/max\/720\/0*pq5wlxZW4zvD9iJH 720w, https:\/\/miro.medium.com\/max\/750\/0*pq5wlxZW4zvD9iJH 750w, https:\/\/miro.medium.com\/max\/786\/0*pq5wlxZW4zvD9iJH 786w, https:\/\/miro.medium.com\/max\/828\/0*pq5wlxZW4zvD9iJH 828w, https:\/\/miro.medium.com\/max\/1100\/0*pq5wlxZW4zvD9iJH 1100w, https:\/\/miro.medium.com\/max\/1400\/0*pq5wlxZW4zvD9iJH 1400w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\" data-testid=\"og\">Source: <\/picture><a class=\"au lw\" href=\"https:\/\/www.udacity.com\/\" target=\"_blank\" rel=\"noopener ugc nofollow\">Udacity<\/a><\/div>\n<\/div>\n<\/figure>\n<h1 id=\"3fc9\" class=\"ks kt iy bm ku kv kw kx ky kz la lb lc ld le lf lg lh li lj lk ll lm ln lo lp ga\" data-selectable-paragraph=\"\">Weight Initialization<\/h1>\n<p id=\"0d5b\" class=\"pw-post-body-paragraph jv jw iy bm b jx lq jz ka kb lr kd ke kf ls kh ki kj lt kl km kn lu kp kq kr ir ga\" data-selectable-paragraph=\"\">Weight Initialization 
is the process of setting the weights of a Neural Network to small random values that define the starting point for the optimization of the model.<\/p>\n<h1 id=\"bba1\" class=\"ks kt iy bm ku kv kw kx ky kz la lb lc ld le lf lg lh li lj lk ll lm ln lo lp ga\" data-selectable-paragraph=\"\">Weight Initialization Techniques:<\/h1>\n<h3 id=\"ae3a\" class=\"ks kt iy bm ku kv kw kx ky kz la lb lc ld le lf lg lh li lj lk ll lm ln lo lp ga\">Initializing weights to zero<\/h3>\n<p id=\"c3cc\" class=\"pw-post-body-paragraph jv jw iy bm b jx lq jz ka kb lr kd ke kf ls kh ki kj lt kl km kn lu kp kq kr ir ga\" data-selectable-paragraph=\"\">If we initialize all our weights to zero, our Neural Network will act as a linear model, because all the layers learn the same thing.<\/p>\n<p id=\"461c\" class=\"pw-post-body-paragraph jv jw iy bm b jx jy jz ka kb kc kd ke kf kg kh ki kj kk kl km kn ko kp kq kr ir ga\" data-selectable-paragraph=\"\">Therefore, the important thing to note when initializing your weights for Neural Networks is to not initialize all the weights to zero.<\/p>\n<h3 id=\"1b33\" class=\"ks kt iy bm ku kv kw kx ky kz la lb lc ld le lf lg lh li lj lk ll lm ln lo lp ga\">Initializing weights randomly<\/h3>\n<p id=\"8589\" class=\"pw-post-body-paragraph jv jw iy bm b jx lq jz ka kb lr kd ke kf ls kh ki kj lt kl km kn lu kp kq kr ir ga\" data-selectable-paragraph=\"\">Using random initialization solves the problem caused by initializing weights to zero, as it prevents the neurons from learning the exact same features of their inputs. Our aim is for each neuron to learn a different function of its input.<\/p>\n<p id=\"010a\" class=\"pw-post-body-paragraph jv jw iy bm b jx jy jz ka kb kc kd ke kf kg kh ki kj kk kl km kn ko kp kq kr ir ga\" data-selectable-paragraph=\"\">However, using this technique can also lead to vanishing or exploding gradients if an unsuitable activation function is used. 
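<\/p>
<p>A minimal random-initialization sketch in plain Python; the function name, layer shape, and scale are illustrative assumptions, not from the original post:<\/p>

```python
import random

def init_layer(n_inputs, n_neurons, scale=0.01):
    # Small random values break the symmetry that zero-initialization causes,
    # so each neuron can learn different features of its input.
    return [[random.gauss(0.0, scale) for _ in range(n_inputs)]
            for _ in range(n_neurons)]

random.seed(0)  # only to make the example reproducible
layer = init_layer(n_inputs=4, n_neurons=3)
```

<p>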
Random initialization currently works effectively with the RELU activation function.<\/p>\n<h3 id=\"7eba\" class=\"ks kt iy bm ku kv kw kx ky kz la lb lc ld le lf lg lh li lj lk ll lm ln lo lp ga\">Initializing weights using heuristics<\/h3>\n<p id=\"205b\" class=\"pw-post-body-paragraph jv jw iy bm b jx lq jz ka kb lr kd ke kf ls kh ki kj lt kl km kn lu kp kq kr ir ga\" data-selectable-paragraph=\"\">This is considered the best technique to initialize weights for Neural Networks.<\/p>\n<p id=\"005d\" class=\"pw-post-body-paragraph jv jw iy bm b jx jy jz ka kb kc kd ke kf kg kh ki kj kk kl km kn ko kp kq kr ir ga\" data-selectable-paragraph=\"\">Heuristics serve as good starting points for weight initialization because they reduce the chances of vanishing or exploding gradients occurring: the weights are neither much bigger than 1 nor much smaller than 1. They also help avoid slow convergence.<\/p>\n<p id=\"afdb\" class=\"pw-post-body-paragraph jv jw iy bm b jx jy jz ka kb kc kd ke kf kg kh ki kj kk kl km kn ko kp kq kr ir ga\" data-selectable-paragraph=\"\">The most common heuristics used are:<\/p>\n<h2 id=\"f435\" class=\"nt kt iy bm ku nu nv nw ky nx ny nz lc kf oa ob lg kj oc od lk kn oe of lo og ga\" data-selectable-paragraph=\"\">1. 
He-et-al Initialization.<\/h2>\n<p id=\"1c6f\" class=\"pw-post-body-paragraph jv jw iy bm b jx lq jz ka kb lr kd ke kf ls kh ki kj lt kl km kn lu kp kq kr ir ga\" data-selectable-paragraph=\"\">When using the RELU activation function, this heuristic is used by multiplying the randomly generated values of W by:<\/p>\n<figure class=\"nd ne nf ng gx nh gl gm paragraph-image\">\n<div class=\"ni nj do nk ce nl\" tabindex=\"0\" role=\"button\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"ce nm nn c aligncenter\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/max\/700\/0*FtNF0ACeb4GROeb9\" alt=\"\" width=\"700\" height=\"123\"><\/figure><div class=\"gl gm oh\" style=\"text-align: center;\"><picture><source srcset=\"https:\/\/miro.medium.com\/max\/640\/0*FtNF0ACeb4GROeb9 640w, https:\/\/miro.medium.com\/max\/720\/0*FtNF0ACeb4GROeb9 720w, https:\/\/miro.medium.com\/max\/750\/0*FtNF0ACeb4GROeb9 750w, https:\/\/miro.medium.com\/max\/786\/0*FtNF0ACeb4GROeb9 786w, https:\/\/miro.medium.com\/max\/828\/0*FtNF0ACeb4GROeb9 828w, https:\/\/miro.medium.com\/max\/1100\/0*FtNF0ACeb4GROeb9 1100w, https:\/\/miro.medium.com\/max\/1400\/0*FtNF0ACeb4GROeb9 1400w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\" data-testid=\"og\"><\/picture><a class=\"au lw\" href=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*-vY3G0W-4nJo-dQ1jm0p0w.png\" rel=\"noopener\">Source<\/a><\/div>\n<\/div>\n<\/figure>\n<h2 id=\"7a97\" class=\"nt kt iy bm ku nu nv nw ky nx ny nz lc kf oa ob lg kj oc od lk kn oe of lo og ga\" 
data-selectable-paragraph=\"\">2. Xavier initialization<\/h2>\n<p id=\"8f89\" class=\"pw-post-body-paragraph jv jw iy bm b jx lq jz ka kb lr kd ke kf ls kh ki kj lt kl km kn lu kp kq kr ir ga\" data-selectable-paragraph=\"\">When using the Tanh activation function, this heuristic is used by multiplying the randomly generated values of W by:<\/p>\n<figure class=\"nd ne nf ng gx nh gl gm paragraph-image\">\n<div class=\"ni nj do nk ce nl\" tabindex=\"0\" role=\"button\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"ce nm nn c aligncenter\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/max\/700\/0*N_M3jNUQOxud9rCt\" alt=\"\" width=\"700\" height=\"119\"><\/figure><div class=\"gl gm oh\" style=\"text-align: center;\"><picture><source srcset=\"https:\/\/miro.medium.com\/max\/640\/0*N_M3jNUQOxud9rCt 640w, https:\/\/miro.medium.com\/max\/720\/0*N_M3jNUQOxud9rCt 720w, https:\/\/miro.medium.com\/max\/750\/0*N_M3jNUQOxud9rCt 750w, https:\/\/miro.medium.com\/max\/786\/0*N_M3jNUQOxud9rCt 786w, https:\/\/miro.medium.com\/max\/828\/0*N_M3jNUQOxud9rCt 828w, https:\/\/miro.medium.com\/max\/1100\/0*N_M3jNUQOxud9rCt 1100w, https:\/\/miro.medium.com\/max\/1400\/0*N_M3jNUQOxud9rCt 1400w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\" data-testid=\"og\"><\/picture><a class=\"au lw\" href=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*El7FG2KM4zMRCV9w7diFTg.png\" rel=\"noopener\">Source<\/a><\/div>\n<\/div>\n<\/figure>\n<h1 id=\"2057\" class=\"ks kt iy bm ku kv kw kx ky kz la lb lc ld le lf lg lh li lj lk ll lm ln lo lp 
ga\" data-selectable-paragraph=\"\">Conclusion<\/h1>\n<p id=\"5b70\" class=\"pw-post-body-paragraph jv jw iy bm b jx lq jz ka kb lr kd ke kf ls kh ki kj lt kl km kn lu kp kq kr ir ga\" data-selectable-paragraph=\"\">The rise of Machine Learning and the implementation and application of models in our day-to-day lives raise concerns about how efficiently these models work.<\/p>\n<p id=\"cf14\" class=\"pw-post-body-paragraph jv jw iy bm b jx jy jz ka kb kc kd ke kf kg kh ki kj kk kl km kn ko kp kq kr ir ga\" data-selectable-paragraph=\"\">A lot of time and money is spent on models that don\u2019t accurately produce the outputs we expect. Therefore, it\u2019s important to understand the Cost Function and how Stochastic Gradient Descent minimizes it by changing the weights of the network. This will benefit the overall process of building your model, producing accurate outputs to make the right conclusions and decisions, at a reduced cost.<\/p>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>Building a Neural Network model can be very complicated and tuning the Neural Network model can make it even more confusing. One of the most common problems when working with Deep Neural Networks is the Vanishing and\/or Exploding Gradient problem. In order to prevent this from happening, one solution is initializing weights. 
Initializing weights in [&hellip;]<\/p>\n","protected":false},"author":8,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"customer_name":"","customer_description":"","customer_industry":"","customer_technologies":"","customer_logo":"","footnotes":""},"categories":[6],"tags":[],"coauthors":[139],"class_list":["post-4584","post","type-post","status-publish","format-standard","hentry","category-machine-learning"],"yoast_head":"<!-- This site is optimized with the Yoast SEO Premium plugin v25.9 (Yoast SEO v25.9) - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Vanishing\/Exploding Gradients in Deep Neural Networks - Comet<\/title>\n<meta name=\"description\" content=\"Initializing weights in Neural Networks helps to prevent layer activation outputs from Vanishing or Exploding during forward feedback.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.comet.com\/site\/blog\/vanishing-exploding-gradients-in-deep-neural-networks\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Vanishing\/Exploding Gradients in Deep Neural Networks\" \/>\n<meta property=\"og:description\" content=\"Initializing weights in Neural Networks helps to prevent layer activation outputs from Vanishing or Exploding during forward feedback.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.comet.com\/site\/blog\/vanishing-exploding-gradients-in-deep-neural-networks\/\" \/>\n<meta property=\"og:site_name\" content=\"Comet\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/cometdotml\" \/>\n<meta property=\"article:published_time\" content=\"2022-11-11T01:49:59+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2025-04-24T17:16:34+00:00\" \/>\n<meta 
property=\"og:image\" content=\"https:\/\/miro.medium.com\/max\/700\/0*FrY8ou2OPu0w0wGp\" \/>\n<meta name=\"author\" content=\"Nisha Arya Ahmed\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@Cometml\" \/>\n<meta name=\"twitter:site\" content=\"@Cometml\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Nisha Arya Ahmed\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"7 minutes\" \/>\n<!-- \/ Yoast SEO Premium plugin. -->","yoast_head_json":{"title":"Vanishing\/Exploding Gradients in Deep Neural Networks - Comet","description":"Initializing weights in Neural Networks helps to prevent layer activation outputs from Vanishing or Exploding during forward feedback.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.comet.com\/site\/blog\/vanishing-exploding-gradients-in-deep-neural-networks\/","og_locale":"en_US","og_type":"article","og_title":"Vanishing\/Exploding Gradients in Deep Neural Networks","og_description":"Initializing weights in Neural Networks helps to prevent layer activation outputs from Vanishing or Exploding during forward feedback.","og_url":"https:\/\/www.comet.com\/site\/blog\/vanishing-exploding-gradients-in-deep-neural-networks\/","og_site_name":"Comet","article_publisher":"https:\/\/www.facebook.com\/cometdotml","article_published_time":"2022-11-11T01:49:59+00:00","article_modified_time":"2025-04-24T17:16:34+00:00","og_image":[{"url":"https:\/\/miro.medium.com\/max\/700\/0*FrY8ou2OPu0w0wGp","type":"","width":"","height":""}],"author":"Nisha Arya Ahmed","twitter_card":"summary_large_image","twitter_creator":"@Cometml","twitter_site":"@Cometml","twitter_misc":{"Written by":"Nisha Arya Ahmed","Est. 
reading time":"7 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/www.comet.com\/site\/blog\/vanishing-exploding-gradients-in-deep-neural-networks\/#article","isPartOf":{"@id":"https:\/\/www.comet.com\/site\/blog\/vanishing-exploding-gradients-in-deep-neural-networks\/"},"author":{"name":"Team Comet Digital","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/6266601170c60a7a82b3e0043fbe8ddf"},"headline":"Vanishing\/Exploding Gradients in Deep Neural Networks","datePublished":"2022-11-11T01:49:59+00:00","dateModified":"2025-04-24T17:16:34+00:00","mainEntityOfPage":{"@id":"https:\/\/www.comet.com\/site\/blog\/vanishing-exploding-gradients-in-deep-neural-networks\/"},"wordCount":1282,"publisher":{"@id":"https:\/\/www.comet.com\/site\/#organization"},"image":{"@id":"https:\/\/www.comet.com\/site\/blog\/vanishing-exploding-gradients-in-deep-neural-networks\/#primaryimage"},"thumbnailUrl":"https:\/\/miro.medium.com\/max\/700\/0*FrY8ou2OPu0w0wGp","articleSection":["Machine Learning"],"inLanguage":"en-US"},{"@type":"WebPage","@id":"https:\/\/www.comet.com\/site\/blog\/vanishing-exploding-gradients-in-deep-neural-networks\/","url":"https:\/\/www.comet.com\/site\/blog\/vanishing-exploding-gradients-in-deep-neural-networks\/","name":"Vanishing\/Exploding Gradients in Deep Neural Networks - Comet","isPartOf":{"@id":"https:\/\/www.comet.com\/site\/#website"},"primaryImageOfPage":{"@id":"https:\/\/www.comet.com\/site\/blog\/vanishing-exploding-gradients-in-deep-neural-networks\/#primaryimage"},"image":{"@id":"https:\/\/www.comet.com\/site\/blog\/vanishing-exploding-gradients-in-deep-neural-networks\/#primaryimage"},"thumbnailUrl":"https:\/\/miro.medium.com\/max\/700\/0*FrY8ou2OPu0w0wGp","datePublished":"2022-11-11T01:49:59+00:00","dateModified":"2025-04-24T17:16:34+00:00","description":"Initializing weights in Neural Networks helps to prevent layer activation outputs from Vanishing or Exploding during forward 
feedback.","breadcrumb":{"@id":"https:\/\/www.comet.com\/site\/blog\/vanishing-exploding-gradients-in-deep-neural-networks\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.comet.com\/site\/blog\/vanishing-exploding-gradients-in-deep-neural-networks\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/blog\/vanishing-exploding-gradients-in-deep-neural-networks\/#primaryimage","url":"https:\/\/miro.medium.com\/max\/700\/0*FrY8ou2OPu0w0wGp","contentUrl":"https:\/\/miro.medium.com\/max\/700\/0*FrY8ou2OPu0w0wGp"},{"@type":"BreadcrumbList","@id":"https:\/\/www.comet.com\/site\/blog\/vanishing-exploding-gradients-in-deep-neural-networks\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.comet.com\/site\/"},{"@type":"ListItem","position":2,"name":"Vanishing\/Exploding Gradients in Deep Neural Networks"}]},{"@type":"WebSite","@id":"https:\/\/www.comet.com\/site\/#website","url":"https:\/\/www.comet.com\/site\/","name":"Comet","description":"Build Better Models Faster","publisher":{"@id":"https:\/\/www.comet.com\/site\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.comet.com\/site\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/www.comet.com\/site\/#organization","name":"Comet ML, Inc.","alternateName":"Comet","url":"https:\/\/www.comet.com\/site\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/#\/schema\/logo\/image\/","url":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/01\/logo_comet_square.png","contentUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/01\/logo_comet_square.png","width":310,"height":310,"caption":"Comet ML, 
Inc."},"image":{"@id":"https:\/\/www.comet.com\/site\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/cometdotml","https:\/\/x.com\/Cometml","https:\/\/www.youtube.com\/channel\/UCmN63HKvfXSCS-UwVwmK8Hw"]},{"@type":"Person","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/6266601170c60a7a82b3e0043fbe8ddf","name":"Team Comet Digital","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/image\/4f0c0a8cc7c0e87c636ff6a420a6647c","url":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/08\/Screen-Shot-2023-08-12-at-8.58.50-AM-96x96.png","contentUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/08\/Screen-Shot-2023-08-12-at-8.58.50-AM-96x96.png","caption":"Team Comet Digital"},"sameAs":["https:\/\/www.comet.ml\/"],"url":"https:\/\/www.comet.com\/site\/blog\/author\/teamcometdigital\/"}]}},"_links":{"self":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/4584","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/users\/8"}],"replies":[{"embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/comments?post=4584"}],"version-history":[{"count":1,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/4584\/revisions"}],"predecessor-version":[{"id":15651,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/4584\/revisions\/15651"}],"wp:attachment":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/media?parent=4584"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/categories?post=4584"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/tags?post=4584"},{"taxonomy":"author","embeddable":true
,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/coauthors?post=4584"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}