{"id":4437,"date":"2022-10-28T09:08:36","date_gmt":"2022-10-28T17:08:36","guid":{"rendered":"https:\/\/live-cometml.pantheonsite.io\/?p=4437"},"modified":"2025-04-24T17:16:50","modified_gmt":"2025-04-24T17:16:50","slug":"activation-functions-in-neural-networks","status":"publish","type":"post","link":"https:\/\/www.comet.com\/site\/blog\/activation-functions-in-neural-networks\/","title":{"rendered":"Activation Functions In Neural Networks"},"content":{"rendered":"\n<div class=\"ir is it iu iv\" style=\"text-align: left;\">\n<figure class=\"ko kp kq kr gx ks gl gm paragraph-image\">\n<div class=\"kt ku do kv ce kw\" tabindex=\"0\" role=\"button\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"ce kx ky c aligncenter\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/max\/700\/0*d31lOUMVCqEND9th\" alt=\"\" width=\"700\" height=\"525\"><\/figure><div class=\"gl gm kn\" style=\"text-align: center;\"><picture>Photo by <\/picture><a class=\"au lc\" href=\"https:\/\/unsplash.com\/@john_smit?utm_source=medium&amp;utm_medium=referral\" target=\"_blank\" rel=\"noopener ugc nofollow\">John Smit<\/a><picture>&nbsp;on&nbsp;<\/picture><a class=\"au lc\" href=\"https:\/\/unsplash.com\/?utm_source=medium&amp;utm_medium=referral\" target=\"_blank\" rel=\"noopener ugc nofollow\">Unsplash<\/a><\/div>\n<\/div>\n<\/figure>\n<p id=\"5482\" class=\"pw-post-body-paragraph ld le iy bm b lf lg jz lh li lj kc lk ll lm ln lo lp lq lr ls lt lu lv lw lx ir ga\" style=\"text-align: left;\" data-selectable-paragraph=\"\">An activation function plays an important role in a neural network. It\u2019s a function used in artificial neurons to non-linearly transform inputs that come from the previous cell and provide an output. Failing to apply an activation function would mean the neurons would resemble&nbsp;<a class=\"au lc\" href=\"https:\/\/towardsdatascience.com\/algorithms-from-scratch-linear-regression-c654353d1e7c\" target=\"_blank\" rel=\"noopener\">linear regression<\/a>. Thus, activation functions are required to introduce non-linearity into neural networks so they are capable of learning the complex underlying patterns that exist within data.<\/p>\n<p id=\"44aa\" class=\"pw-post-body-paragraph ld le iy bm b lf lg jz lh li lj kc lk ll lm ln lo lp lq lr ls lt lu lv lw lx ir ga\" style=\"text-align: left;\" data-selectable-paragraph=\"\">In this article, I am going to explore various activation functions used when implementing neural networks.<\/p>\n<blockquote class=\"ly lz ma\"><p id=\"db27\" class=\"ld le mb bm b lf lg jz lh li lj kc lk mc lm ln lo md lq lr ls me lu lv lw lx ir ga\" style=\"text-align: left;\" data-selectable-paragraph=\"\"><strong class=\"bm mf\">Note<\/strong>: See the full code in&nbsp;<a class=\"au lc\" href=\"https:\/\/github.com\/kurtispykes\/demo\/blob\/master\/deepl_learning_activation_functions.ipynb\" target=\"_blank\" rel=\"noopener ugc nofollow\">GitHub<\/a>.<\/p><\/blockquote>\n<h3 id=\"824c\" class=\"mg mh iy bm mi mj mk ml mm mn mo mp mq ke mr kf ms kh mt ki mu kk mv kl mw mx ga\">Sigmoid Function<\/h3>\n<figure class=\"ko kp kq kr gx ks gl gm paragraph-image\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"ce kx ky c aligncenter\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/max\/311\/1*rOIreMLQIzPhc9mrhMxJXQ.png\" alt=\"\" width=\"311\" height=\"111\"><\/figure><div class=\"gl gm my\" style=\"text-align: center;\"><picture><source srcset=\"https:\/\/miro.medium.com\/max\/640\/1*rOIreMLQIzPhc9mrhMxJXQ.png 640w, https:\/\/miro.medium.com\/max\/720\/1*rOIreMLQIzPhc9mrhMxJXQ.png 720w, https:\/\/miro.medium.com\/max\/750\/1*rOIreMLQIzPhc9mrhMxJXQ.png 750w, https:\/\/miro.medium.com\/max\/786\/1*rOIreMLQIzPhc9mrhMxJXQ.png 786w, https:\/\/miro.medium.com\/max\/828\/1*rOIreMLQIzPhc9mrhMxJXQ.png 828w, https:\/\/miro.medium.com\/max\/1100\/1*rOIreMLQIzPhc9mrhMxJXQ.png 1100w, https:\/\/miro.medium.com\/max\/622\/1*rOIreMLQIzPhc9mrhMxJXQ.png 622w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 311px\" data-testid=\"og\">Mathematical expression for sigmoid function<\/picture><\/div>\n<\/figure>\n<p id=\"d464\" class=\"pw-post-body-paragraph ld le iy bm b lf lg jz lh li lj kc lk ll lm ln lo lp lq lr ls lt lu lv lw lx ir ga\" style=\"text-align: left;\" data-selectable-paragraph=\"\">The sigmoid function is a popular, bounded, differentiable, monotonic activation function used in neural networks. This means its values are scaled between 0 and 1 (bounded), the slope of the curve can be found at any two points (differentiable), and it\u2019s neither entirely increasing nor decreasing (monotonic). It has a characteristic \u201cS\u201d shaped curve which may be seen in the image below:<\/p>\n<pre class=\"ko kp kq kr gx mz bs na nb dz nc\"><span id=\"bc04\" class=\"ga nd mh iy nc b dm ne nf l ng nh\" data-selectable-paragraph=\"\"><strong class=\"nc iz\"><em class=\"mb\"># Sigmoid function in Python<\/em><\/strong>\n<strong class=\"nc iz\">import<\/strong> matplotlib.pyplot <strong class=\"nc iz\">as<\/strong> plt\n<strong class=\"nc iz\">import <\/strong>numpy <strong class=\"nc iz\">as <\/strong>np<\/span><span id=\"ad13\" class=\"ga nd mh iy nc b dm ni nf l ng nh\" data-selectable-paragraph=\"\">x = np.linspace(<em class=\"mb\">-5<\/em>, <em class=\"mb\">5<\/em>,<em class=\"mb\"> 50<\/em>)\nz = <em class=\"mb\">1<\/em>\/(<em class=\"mb\">1<\/em> + np.exp(<em class=\"mb\">-x<\/em>))<\/span><span id=\"5453\" class=\"ga nd mh iy nc b dm ni nf l ng nh\" data-selectable-paragraph=\"\">plt.subplots(figsize=(<em class=\"mb\">8<\/em>, <em class=\"mb\">5<\/em>))\nplt.plot(<em class=\"mb\">x<\/em>, <em class=\"mb\">z<\/em>)\nplt.grid()<\/span><span id=\"0d7a\" class=\"ga nd mh iy nc b dm ni nf l ng nh\" data-selectable-paragraph=\"\">plt.show()<\/span><\/pre>\n<figure class=\"ko kp kq kr gx ks gl gm paragraph-image\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"ce kx ky c aligncenter\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/max\/540\/1*kYLr2Ftw7cMPS1qmXsOJSQ.png\" alt=\"\" width=\"540\" height=\"334\"><\/figure><div class=\"gl gm nj\" style=\"text-align: center;\"><picture><source srcset=\"https:\/\/miro.medium.com\/max\/640\/1*kYLr2Ftw7cMPS1qmXsOJSQ.png 640w, https:\/\/miro.medium.com\/max\/720\/1*kYLr2Ftw7cMPS1qmXsOJSQ.png 720w, https:\/\/miro.medium.com\/max\/750\/1*kYLr2Ftw7cMPS1qmXsOJSQ.png 750w, https:\/\/miro.medium.com\/max\/786\/1*kYLr2Ftw7cMPS1qmXsOJSQ.png 786w, https:\/\/miro.medium.com\/max\/828\/1*kYLr2Ftw7cMPS1qmXsOJSQ.png 828w, https:\/\/miro.medium.com\/max\/1100\/1*kYLr2Ftw7cMPS1qmXsOJSQ.png 1100w, https:\/\/miro.medium.com\/max\/1080\/1*kYLr2Ftw7cMPS1qmXsOJSQ.png 1080w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 540px\" data-testid=\"og\">Visual representation of the sigmoid function<\/picture><\/div>\n<\/figure>\n<p id=\"b71f\" class=\"pw-post-body-paragraph ld le iy bm b lf lg jz lh li lj kc lk ll lm ln lo lp lq lr ls lt lu lv lw lx ir ga\" style=\"text-align: left;\" data-selectable-paragraph=\"\">This activation function is typically used for binary classification, however, it is not without fault. The sigmoid function suffers from the&nbsp;<a class=\"au lc\" href=\"https:\/\/towardsdatascience.com\/the-vanishing-exploding-gradient-problem-in-deep-neural-networks-191358470c11\" target=\"_blank\" rel=\"noopener\">vanishing gradient problem<\/a>: a scenario during backpropagation in which the gradient decreases exponentially until it becomes extremely close to 0. As a result, the weights are not updated sufficiently which leads to extremely slow convergence \u2014 once the gradient reaches 0, learning stops.<\/p>\n<h3 id=\"b5b8\" class=\"mg mh iy bm mi mj mk ml mm mn mo mp mq ke mr kf ms kh mt ki mu kk mv kl mw mx ga\" style=\"text-align: left;\">Softmax Function<\/h3>\n<p id=\"7190\" class=\"pw-post-body-paragraph ld le iy bm b lf nk jz lh li nl kc lk ll nm ln lo lp nn lr ls lt no lv lw lx ir ga\" style=\"text-align: left;\" data-selectable-paragraph=\"\">The softmax function is a generalization of the sigmoid function to multiple dimensions. Thus, it\u2019s typically used as the last activation function in a neural network (the output layer) to predict multinomial probability distributions. In other words, we use it for classification problems in which class membership is required on more than two labels.<\/p>\n<figure class=\"ko kp kq kr gx ks gl gm paragraph-image\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"ce kx ky c aligncenter\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/max\/290\/1*HMCZi-vrYcZic8z5gztzhg.png\" alt=\"\" width=\"290\" height=\"119\"><\/figure><div class=\"gl gm np\" style=\"text-align: center;\"><picture><source srcset=\"https:\/\/miro.medium.com\/max\/640\/1*HMCZi-vrYcZic8z5gztzhg.png 640w, https:\/\/miro.medium.com\/max\/720\/1*HMCZi-vrYcZic8z5gztzhg.png 720w, https:\/\/miro.medium.com\/max\/750\/1*HMCZi-vrYcZic8z5gztzhg.png 750w, https:\/\/miro.medium.com\/max\/786\/1*HMCZi-vrYcZic8z5gztzhg.png 786w, https:\/\/miro.medium.com\/max\/828\/1*HMCZi-vrYcZic8z5gztzhg.png 828w, https:\/\/miro.medium.com\/max\/1100\/1*HMCZi-vrYcZic8z5gztzhg.png 1100w, https:\/\/miro.medium.com\/max\/580\/1*HMCZi-vrYcZic8z5gztzhg.png 580w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 290px\" data-testid=\"og\">Mathematical expression for softmax function<\/picture><\/div>\n<\/figure>\n<div><\/div>\n<h3 id=\"7df6\" class=\"mg mh iy bm mi mj mk ml mm mn mo mp mq ke mr kf ms kh mt ki mu kk mv kl mw mx ga\" style=\"text-align: left;\">Hyperbolic Tangent (tanh) Function<\/h3>\n<p id=\"c7fe\" class=\"pw-post-body-paragraph ld le iy bm b lf nk jz lh li nl kc lk ll nm ln lo lp nn lr ls lt no lv lw lx ir ga\" style=\"text-align: left;\" data-selectable-paragraph=\"\">The hyperbolic tangent function (tanh) is similar to the sigmoid function in a sense: they share the \u201cS\u201d shaped curve characteristic. In contrast, the hyperbolic tangent function transforms the values between -1 and 1. This means inputs that are small (more negative) will be mapped closer to -1 (strongly negative) and 0 inputs will be mapped near 0.<\/p>\n<figure class=\"ko kp kq kr gx ks gl gm paragraph-image\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"ce kx ky c aligncenter\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/max\/310\/1*bVfgTIMHuOgkmTUaBLehkQ.png\" alt=\"\" width=\"310\" height=\"116\"><\/figure><div class=\"gl gm nq\" style=\"text-align: center;\"><picture><source srcset=\"https:\/\/miro.medium.com\/max\/640\/1*bVfgTIMHuOgkmTUaBLehkQ.png 640w, https:\/\/miro.medium.com\/max\/720\/1*bVfgTIMHuOgkmTUaBLehkQ.png 720w, https:\/\/miro.medium.com\/max\/750\/1*bVfgTIMHuOgkmTUaBLehkQ.png 750w, https:\/\/miro.medium.com\/max\/786\/1*bVfgTIMHuOgkmTUaBLehkQ.png 786w, https:\/\/miro.medium.com\/max\/828\/1*bVfgTIMHuOgkmTUaBLehkQ.png 828w, https:\/\/miro.medium.com\/max\/1100\/1*bVfgTIMHuOgkmTUaBLehkQ.png 1100w, https:\/\/miro.medium.com\/max\/620\/1*bVfgTIMHuOgkmTUaBLehkQ.png 620w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 310px\" data-testid=\"og\">Mathematical expression for tanh function<\/picture><\/div><figcaption class=\"kz bl gn gl gm la lb bm b bn bo cn\" data-selectable-paragraph=\"\"><\/figcaption><\/figure>\n<pre class=\"ko kp kq kr gx mz bs na nb dz nc\"><span id=\"d950\" class=\"ga nd mh iy nc b dm ne nf l ng nh\" data-selectable-paragraph=\"\"><strong class=\"nc iz\"><em class=\"mb\"># tanh function in Python<\/em><\/strong>\n<strong class=\"nc iz\">import<\/strong> matplotlib.pyplot <strong class=\"nc iz\">as<\/strong> plt\n<strong class=\"nc iz\">import <\/strong>numpy <strong class=\"nc iz\">as <\/strong>np<\/span><span id=\"3b26\" class=\"ga nd mh iy nc b dm ni nf l ng nh\" data-selectable-paragraph=\"\">x = np.linspace(<em class=\"mb\">-5<\/em>, <em class=\"mb\">5<\/em>, <em class=\"mb\">50<\/em>)\nz = np.tanh(<em class=\"mb\">x<\/em>)<\/span><span id=\"5c14\" class=\"ga nd mh iy nc b dm ni nf l ng nh\" data-selectable-paragraph=\"\">plt.subplots(figsize=(<em class=\"mb\">8<\/em>, <em class=\"mb\">5<\/em>))\nplt.plot(<em class=\"mb\">x<\/em>, <em class=\"mb\">z<\/em>)\nplt.grid()<\/span><span id=\"9d48\" class=\"ga nd mh iy nc b dm ni nf l ng nh\" data-selectable-paragraph=\"\">plt.show()<\/span><\/pre>\n<figure class=\"ko kp kq kr gx ks gl gm paragraph-image\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"ce kx ky c aligncenter\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/max\/554\/1*9EQzxDP4qiYbY-UNDBKtoQ.png\" alt=\"\" width=\"554\" height=\"333\"><\/figure><div class=\"gl gm nr\" style=\"text-align: center;\"><picture><source srcset=\"https:\/\/miro.medium.com\/max\/640\/1*9EQzxDP4qiYbY-UNDBKtoQ.png 640w, https:\/\/miro.medium.com\/max\/720\/1*9EQzxDP4qiYbY-UNDBKtoQ.png 720w, https:\/\/miro.medium.com\/max\/750\/1*9EQzxDP4qiYbY-UNDBKtoQ.png 750w, https:\/\/miro.medium.com\/max\/786\/1*9EQzxDP4qiYbY-UNDBKtoQ.png 786w, https:\/\/miro.medium.com\/max\/828\/1*9EQzxDP4qiYbY-UNDBKtoQ.png 828w, https:\/\/miro.medium.com\/max\/1100\/1*9EQzxDP4qiYbY-UNDBKtoQ.png 1100w, https:\/\/miro.medium.com\/max\/1108\/1*9EQzxDP4qiYbY-UNDBKtoQ.png 1108w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 554px\" data-testid=\"og\">Visual representation of the tanh function<\/picture><\/div>\n<\/figure>\n<p id=\"705b\" class=\"pw-post-body-paragraph ld le iy bm b lf lg jz lh li lj kc lk ll lm ln lo lp lq lr ls lt lu lv lw lx ir ga\" data-selectable-paragraph=\"\">Ian Goodfellow&#8217;s book,&nbsp;<em class=\"mb\">Deep Learning (P.195)<\/em>, states \u201c\u2026the hyperbolic tangent activation function typically performs better than the logistic sigmoid.\u201d However, it suffers a similar fate to the sigmoid function,&nbsp;<a class=\"au lc\" href=\"https:\/\/towardsdatascience.com\/the-vanishing-exploding-gradient-problem-in-deep-neural-networks-191358470c11\" target=\"_blank\" rel=\"noopener\">the vanishing gradient<\/a>&nbsp;problem. It\u2019s also computationally expensive due to its exponential operation.<\/p>\n<\/div>\n\n\n\n<div class=\"o dx ns nt id nu\" role=\"separator\"><\/div>\n\n\n\n<div class=\"ir is it iu iv\">\n<blockquote class=\"nz\"><p id=\"c55e\" class=\"oa ob iy bm oc od oe of og oh oi lx cn\" data-selectable-paragraph=\"\">How does the team at Uber manage to keep their data organized and their team united? Comet\u2019s experiment tracking.&nbsp;<a class=\"au lc\" href=\"https:\/\/www.comet.com\/site\/how-uber-manages-machine-learning-experiments-with-comet\/\" target=\"_blank\" rel=\"noopener ugc nofollow\">Learn more from Uber\u2019s Olcay Cirit<\/a>.<\/p><\/blockquote>\n<\/div>\n\n\n\n<div class=\"o dx ns nt id nu\" role=\"separator\"><\/div>\n\n\n\n<div class=\"ir is it iu iv\">\n<h3 id=\"16e7\" class=\"mg mh iy bm mi mj oj ml mm mn ok mp mq ke ol kf ms kh om ki mu kk on kl mw mx ga\">Rectified Linear Unit (ReLU) Function<\/h3>\n<p id=\"cdbd\" class=\"pw-post-body-paragraph ld le iy bm b lf nk jz lh li nl kc lk ll nm ln lo lp nn lr ls lt no lv lw lx ir ga\" data-selectable-paragraph=\"\">We tend to avoid using sigmoid and tanh functions when building neural networks with many layers due to the&nbsp;<a class=\"au lc\" href=\"https:\/\/towardsdatascience.com\/the-vanishing-exploding-gradient-problem-in-deep-neural-networks-191358470c11\" target=\"_blank\" rel=\"noopener\">vanishing gradient problem<\/a>. One solution could be to use the ReLU function which overcomes this problem and has become the default activation function for various neural networks.<\/p>\n<p id=\"55ec\" class=\"pw-post-body-paragraph ld le iy bm b lf lg jz lh li lj kc lk ll lm ln lo lp lq lr ls lt lu lv lw lx ir ga\" data-selectable-paragraph=\"\">The ReLU activation function returns 0 if the input is 0 or less and returns the value provided as input directly if it&#8217;s greater than 0. Thus, the values range from 0 to infinity:<\/p>\n<figure class=\"ko kp kq kr gx ks gl gm paragraph-image\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"ce kx ky c aligncenter\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/max\/432\/1*IQbeK7bO86bw2kpdgAo5ZA.png\" alt=\"\" width=\"432\" height=\"131\"><\/figure><div class=\"gl gm oo\" style=\"text-align: center;\"><picture><source srcset=\"https:\/\/miro.medium.com\/max\/640\/1*IQbeK7bO86bw2kpdgAo5ZA.png 640w, https:\/\/miro.medium.com\/max\/720\/1*IQbeK7bO86bw2kpdgAo5ZA.png 720w, https:\/\/miro.medium.com\/max\/750\/1*IQbeK7bO86bw2kpdgAo5ZA.png 750w, https:\/\/miro.medium.com\/max\/786\/1*IQbeK7bO86bw2kpdgAo5ZA.png 786w, https:\/\/miro.medium.com\/max\/828\/1*IQbeK7bO86bw2kpdgAo5ZA.png 828w, https:\/\/miro.medium.com\/max\/1100\/1*IQbeK7bO86bw2kpdgAo5ZA.png 1100w, https:\/\/miro.medium.com\/max\/864\/1*IQbeK7bO86bw2kpdgAo5ZA.png 864w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 432px\" data-testid=\"og\">Mathematical expression for ReLU function<\/picture><\/div>\n<\/figure>\n<p id=\"3f72\" class=\"pw-post-body-paragraph ld le iy bm b lf lg jz lh li lj kc lk ll lm ln lo lp lq lr ls lt lu lv lw lx ir ga\" data-selectable-paragraph=\"\">In Python, it looks as follows\u2026<\/p>\n<pre class=\"ko kp kq kr gx mz bs na nb dz nc\"><span id=\"8064\" class=\"ga nd mh iy nc b dm ne nf l ng nh\" data-selectable-paragraph=\"\"><strong class=\"nc iz\"><em class=\"mb\"># ReLU in Python<\/em><\/strong>\n<strong class=\"nc iz\">import <\/strong>matplotlib.pyplot <strong class=\"nc iz\">as <\/strong>plt\n<strong class=\"nc iz\">import<\/strong> numpy <strong class=\"nc iz\">as <\/strong>np<\/span><span id=\"17b2\" class=\"ga nd mh iy nc b dm ni nf l ng nh\" data-selectable-paragraph=\"\">x = np.linspace(<em class=\"mb\">-5<\/em>, <em class=\"mb\">5<\/em>, <em class=\"mb\">50<\/em>)\nz = [<strong class=\"nc iz\">max<\/strong>(<em class=\"mb\">0<\/em>, i) for i <strong class=\"nc iz\">in <\/strong>x]<\/span><span id=\"fc33\" class=\"ga nd mh iy nc b dm ni nf l ng nh\" data-selectable-paragraph=\"\">plt.subplots(figsize=(<em class=\"mb\">8<\/em>, <em class=\"mb\">5<\/em>))\nplt.plot(<em class=\"mb\">x<\/em>, <em class=\"mb\">z<\/em>)\nplt.grid()<\/span><span id=\"7ae3\" class=\"ga nd mh iy nc b dm ni nf l ng nh\" data-selectable-paragraph=\"\">plt.show()<\/span><\/pre>\n<figure class=\"ko kp kq kr gx ks gl gm paragraph-image\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"ce kx ky c aligncenter\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/max\/527\/1*dP8wGHkD19O8oSNJisoiBQ.png\" alt=\"\" width=\"527\" height=\"334\"><\/figure><div class=\"gl gm op\" style=\"text-align: center;\"><picture><source srcset=\"https:\/\/miro.medium.com\/max\/640\/1*dP8wGHkD19O8oSNJisoiBQ.png 640w, https:\/\/miro.medium.com\/max\/720\/1*dP8wGHkD19O8oSNJisoiBQ.png 720w, https:\/\/miro.medium.com\/max\/750\/1*dP8wGHkD19O8oSNJisoiBQ.png 750w, https:\/\/miro.medium.com\/max\/786\/1*dP8wGHkD19O8oSNJisoiBQ.png 786w, https:\/\/miro.medium.com\/max\/828\/1*dP8wGHkD19O8oSNJisoiBQ.png 828w, https:\/\/miro.medium.com\/max\/1100\/1*dP8wGHkD19O8oSNJisoiBQ.png 1100w, https:\/\/miro.medium.com\/max\/1054\/1*dP8wGHkD19O8oSNJisoiBQ.png 1054w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 527px\" data-testid=\"og\">Visual representation of ReLU function<\/picture><\/div>\n<\/figure>\n<p id=\"2d9b\" class=\"pw-post-body-paragraph ld le iy bm b lf lg jz lh li lj kc lk ll lm ln lo lp lq lr ls lt lu lv lw lx ir ga\" data-selectable-paragraph=\"\">Since the ReLU function looks and behaves mostly like a linear function, the neural network is much easier to optimize. It\u2019s also very easy to implement.<\/p>\n<p id=\"0073\" class=\"pw-post-body-paragraph ld le iy bm b lf lg jz lh li lj kc lk ll lm ln lo lp lq lr ls lt lu lv lw lx ir ga\" data-selectable-paragraph=\"\">Where the ReLU activation function suffers is when there are many negative values \u2014 they will all be outputted as 0. When this occurs, learning will be severely impacted and our model would be prohibited from properly learning complex patterns in the data. This is known as the dying ReLU problem. One solution to the dying ReLU problem is using Leaky ReLU.<\/p>\n<h3 id=\"0923\" class=\"mg mh iy bm mi mj mk ml mm mn mo mp mq ke mr kf ms kh mt ki mu kk mv kl mw mx ga\">Leaky ReLU Function<\/h3>\n<p id=\"301c\" class=\"pw-post-body-paragraph ld le iy bm b lf nk jz lh li nl kc lk ll nm ln lo lp nn lr ls lt no lv lw lx ir ga\" data-selectable-paragraph=\"\">Leaky ReLU is an improvement on ReLU that attempts to combat the dying ReLU problem. Instead of defining the ReLU function as 0 for all negative values of x, it is defined as an extremely small linear component of x.<\/p>\n<figure class=\"ko kp kq kr gx ks gl gm paragraph-image\">\n<div class=\"kt ku do kv ce kw\" tabindex=\"0\" role=\"button\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"ce kx ky c aligncenter\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/max\/440\/1*FdWQ1l0j11YChpVXAh58YA.png\" alt=\"\" width=\"440\" height=\"126\"><\/figure><div class=\"gl gm oq\" style=\"text-align: center;\"><picture><source srcset=\"https:\/\/miro.medium.com\/max\/640\/1*FdWQ1l0j11YChpVXAh58YA.png 640w, https:\/\/miro.medium.com\/max\/720\/1*FdWQ1l0j11YChpVXAh58YA.png 720w, https:\/\/miro.medium.com\/max\/750\/1*FdWQ1l0j11YChpVXAh58YA.png 750w, https:\/\/miro.medium.com\/max\/786\/1*FdWQ1l0j11YChpVXAh58YA.png 786w, https:\/\/miro.medium.com\/max\/828\/1*FdWQ1l0j11YChpVXAh58YA.png 828w, https:\/\/miro.medium.com\/max\/1100\/1*FdWQ1l0j11YChpVXAh58YA.png 1100w, https:\/\/miro.medium.com\/max\/880\/1*FdWQ1l0j11YChpVXAh58YA.png 880w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 440px\" data-testid=\"og\">Mathematical expression for leaky ReLU function<\/picture><\/div>\n<\/div>\n<\/figure>\n<p id=\"2ba9\" class=\"pw-post-body-paragraph ld le iy bm b lf lg jz lh li lj kc lk ll lm ln lo lp lq lr ls lt lu lv lw lx ir ga\" data-selectable-paragraph=\"\">The modification to the ReLU function alters the gradient for values equal to or less than 0 to be values that are non-zero. As a result, extremely small inputs would no longer encounter dead neurons.<\/p>\n<p id=\"562d\" class=\"pw-post-body-paragraph ld le iy bm b lf lg jz lh li lj kc lk ll lm ln lo lp lq lr ls lt lu lv lw lx ir ga\" data-selectable-paragraph=\"\">Let\u2019s see this in Python:<\/p>\n<pre class=\"ko kp kq kr gx mz bs na nb dz nc\"><span id=\"9a92\" class=\"ga nd mh iy nc b dm ne nf l ng nh\" data-selectable-paragraph=\"\"><strong class=\"nc iz\"><em class=\"mb\"># leaky ReLU in Python<\/em><\/strong>\n<strong class=\"nc iz\">import <\/strong>matplotlib.pyplot <strong class=\"nc iz\">as <\/strong>plt\n<strong class=\"nc iz\">import<\/strong> numpy <strong class=\"nc iz\">as <\/strong>np<\/span><span id=\"bdeb\" class=\"ga nd mh iy nc b dm ni nf l ng nh\" data-selectable-paragraph=\"\">x = np.linspace(<em class=\"mb\">-5<\/em>, <em class=\"mb\">5<\/em>, <em class=\"mb\">50<\/em>)\nz = [<strong class=\"nc iz\">max<\/strong>((<em class=\"mb\">0.3<\/em>*<em class=\"mb\">0)<\/em>, i) for i <strong class=\"nc iz\">in <\/strong>x]<\/span><span id=\"5705\" class=\"ga nd mh iy nc b dm ni nf l ng nh\" data-selectable-paragraph=\"\">plt.subplots(figsize=(<em class=\"mb\">8<\/em>, <em class=\"mb\">5<\/em>))\nplt.plot(<em class=\"mb\">x<\/em>, <em class=\"mb\">z<\/em>)\nplt.grid()<\/span><span id=\"8752\" class=\"ga nd mh iy nc b dm ni nf l ng nh\" data-selectable-paragraph=\"\">plt.show()<\/span><\/pre>\n<figure class=\"ko kp kq kr gx ks gl gm paragraph-image\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"ce kx ky c aligncenter\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/max\/535\/1*aufb7eD66KaIea9XLMg6Zg.png\" alt=\"\" width=\"535\" height=\"333\"><\/figure><div class=\"gl gm or\" style=\"text-align: center;\"><picture><source srcset=\"https:\/\/miro.medium.com\/max\/640\/1*aufb7eD66KaIea9XLMg6Zg.png 640w, https:\/\/miro.medium.com\/max\/720\/1*aufb7eD66KaIea9XLMg6Zg.png 720w, https:\/\/miro.medium.com\/max\/750\/1*aufb7eD66KaIea9XLMg6Zg.png 750w, https:\/\/miro.medium.com\/max\/786\/1*aufb7eD66KaIea9XLMg6Zg.png 786w, https:\/\/miro.medium.com\/max\/828\/1*aufb7eD66KaIea9XLMg6Zg.png 828w, https:\/\/miro.medium.com\/max\/1100\/1*aufb7eD66KaIea9XLMg6Zg.png 1100w, https:\/\/miro.medium.com\/max\/1070\/1*aufb7eD66KaIea9XLMg6Zg.png 1070w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 535px\" data-testid=\"og\">Visual representation of the Leaky ReLU function<\/picture><\/div>\n<div><\/div>\n<\/figure>\n<blockquote class=\"ly lz ma\"><p id=\"7e9e\" class=\"ld le mb bm b lf lg jz lh li lj kc lk mc lm ln lo md lq lr ls me lu lv lw lx ir ga\" data-selectable-paragraph=\"\"><strong class=\"bm mf\">Note<\/strong>: We typically set the \u03b1 value to 0.01. Rarely ever is it set to 1 (or close to it) since that would make Leaky ReLU a linear function.<\/p><\/blockquote>\n<h3 id=\"5eed\" class=\"mg mh iy bm mi mj mk ml mm mn mo mp mq ke mr kf ms kh mt ki mu kk mv kl mw mx ga\">Exponential Linear Unit (ELU) Function<\/h3>\n<p id=\"fdfd\" class=\"pw-post-body-paragraph ld le iy bm b lf nk jz lh li nl kc lk ll nm ln lo lp nn lr ls lt no lv lw lx ir ga\" data-selectable-paragraph=\"\">The Exponential Linear Unit (ELU) is an activation function for neural networks. In contrast to ReLUs, ELUs have negative values which allow them to push mean unit activations closer to zero like batch normalization but with lower computational complexity. Mean shifts toward zero speed up learning by bringing the normal gradient closer to the unit natural gradient because of a reduced bias shift effect. [Credit:&nbsp;<a class=\"au lc\" href=\"https:\/\/paperswithcode.com\/method\/elu\" target=\"_blank\" rel=\"noopener ugc nofollow\">PaperWithCode<\/a>]<\/p>\n<figure class=\"ko kp kq kr gx ks gl gm paragraph-image\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"ce kx ky c aligncenter\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/max\/602\/1*Hzy-YH4gH-ZjguupcQOTsw.png\" alt=\"\" width=\"602\" height=\"118\"><\/figure><div class=\"gl gm os\" style=\"text-align: center;\"><picture><source srcset=\"https:\/\/miro.medium.com\/max\/640\/1*Hzy-YH4gH-ZjguupcQOTsw.png 640w, https:\/\/miro.medium.com\/max\/720\/1*Hzy-YH4gH-ZjguupcQOTsw.png 720w, https:\/\/miro.medium.com\/max\/750\/1*Hzy-YH4gH-ZjguupcQOTsw.png 750w, https:\/\/miro.medium.com\/max\/786\/1*Hzy-YH4gH-ZjguupcQOTsw.png 786w, https:\/\/miro.medium.com\/max\/828\/1*Hzy-YH4gH-ZjguupcQOTsw.png 828w, https:\/\/miro.medium.com\/max\/1100\/1*Hzy-YH4gH-ZjguupcQOTsw.png 1100w, https:\/\/miro.medium.com\/max\/1204\/1*Hzy-YH4gH-ZjguupcQOTsw.png 1204w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 602px\" data-testid=\"og\">Mathematical expression for ELU function<\/picture><\/div>\n<\/figure>\n<\/div>\n\n\n\n<div><\/div>\n\n\n\n<div><\/div>\n\n\n\n<div class=\"ir is it iu iv\">\n<pre class=\"ko kp kq kr gx mz bs na nb dz nc\"><span id=\"ddfd\" class=\"ga nd mh iy nc b dm ne nf l ng nh\" data-selectable-paragraph=\"\"># ELU function in Python\nimport matplotlib.pyplot as plt\nfrom tensorflow.keras.activations import elu<\/span><span id=\"6d1b\" class=\"ga nd mh iy nc b dm ni nf l ng nh\" data-selectable-paragraph=\"\">x = np.linspace(-5, 5, 50)\nz = elu(x, alpha=1)<\/span><span id=\"1766\" class=\"ga nd mh iy nc b dm ni nf l ng nh\" data-selectable-paragraph=\"\">plt.subplots(figsize=(8, 5))\nplt.plot(x, z)\nplt.grid()<\/span><span id=\"3e29\" class=\"ga nd mh iy nc b dm ni nf l ng nh\" data-selectable-paragraph=\"\">plt.show()<\/span><\/pre>\n<figure class=\"ko kp kq kr gx ks gl gm paragraph-image\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"ce kx ky c aligncenter\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/max\/538\/1*uVHwFPl2er2tC265KVE9LQ.png\" alt=\"\" width=\"538\" height=\"335\"><\/figure><div class=\"gl gm ot\" style=\"text-align: center;\"><picture><source srcset=\"https:\/\/miro.medium.com\/max\/640\/1*uVHwFPl2er2tC265KVE9LQ.png 640w, https:\/\/miro.medium.com\/max\/720\/1*uVHwFPl2er2tC265KVE9LQ.png 720w, https:\/\/miro.medium.com\/max\/750\/1*uVHwFPl2er2tC265KVE9LQ.png 750w, https:\/\/miro.medium.com\/max\/786\/1*uVHwFPl2er2tC265KVE9LQ.png 786w, https:\/\/miro.medium.com\/max\/828\/1*uVHwFPl2er2tC265KVE9LQ.png 828w, https:\/\/miro.medium.com\/max\/1100\/1*uVHwFPl2er2tC265KVE9LQ.png 1100w, https:\/\/miro.medium.com\/max\/1076\/1*uVHwFPl2er2tC265KVE9LQ.png 1076w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 538px\" data-testid=\"og\">Visual representation of the ELU function<\/picture><\/div>\n<\/figure>\n<div><\/div>\n<h3 id=\"1b1d\" class=\"mg mh iy bm mi mj mk ml mm mn mo mp mq ke mr kf ms kh mt ki mu kk mv kl mw mx ga\">Swish Function<\/h3>\n<p id=\"3c89\" class=\"pw-post-body-paragraph ld le iy bm b lf nk jz lh li nl kc lk ll nm ln lo lp nn lr ls lt no lv lw lx ir ga\" data-selectable-paragraph=\"\">The swish function was announced as an alternative to ReLU by Google in 2017. It tends to perform better than ReLU in deeper networks, across a number of challenging datasets. This comes about after the authors showed that simply substituting ReLU activations with Swish functions improved the classification accuracy of ImageNet.<\/p>\n<p id=\"b256\" class=\"pw-post-body-paragraph ld le iy bm b lf lg jz lh li lj kc lk ll lm ln lo lp lq lr ls lt lu lv lw lx ir ga\" data-selectable-paragraph=\"\"><em class=\"mb\">\u201cQfter performing analysis on ImageNet data, researchers from Google alleged that using the function as an activation function in artificial neural networks improves the performance, compared to ReLU and sigmoid functions. It is believed that one reason for the improvement is that the swish function helps alleviate the vanishing gradient problem during backpropagation.\u201d<\/em><br>\n\u2014 [<strong class=\"bm mf\">Source<\/strong>:&nbsp;<a class=\"au lc\" href=\"https:\/\/en.wikipedia.org\/wiki\/Swish_function\" target=\"_blank\" rel=\"noopener ugc nofollow\">Wikipedia<\/a>]<\/p>\n<p id=\"85c9\" class=\"pw-post-body-paragraph ld le iy bm b lf lg jz lh li lj kc lk ll lm ln lo lp lq lr ls lt lu lv lw lx ir ga\" data-selectable-paragraph=\"\">One drawback of the swish function is that it\u2019s computationally expensive in comparison to ReLU and its variants.<\/p>\n<figure class=\"ko kp kq kr gx ks gl gm paragraph-image\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"ce kx ky c aligncenter\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/max\/248\/1*rlep3njwAwN4zcjzoLo-MQ.png\" alt=\"\" width=\"248\" height=\"82\"><\/figure><div class=\"gl gm ou\" style=\"text-align: center;\"><picture><source srcset=\"https:\/\/miro.medium.com\/max\/640\/1*rlep3njwAwN4zcjzoLo-MQ.png 640w, https:\/\/miro.medium.com\/max\/720\/1*rlep3njwAwN4zcjzoLo-MQ.png 720w, https:\/\/miro.medium.com\/max\/750\/1*rlep3njwAwN4zcjzoLo-MQ.png 750w, https:\/\/miro.medium.com\/max\/786\/1*rlep3njwAwN4zcjzoLo-MQ.png 786w, https:\/\/miro.medium.com\/max\/828\/1*rlep3njwAwN4zcjzoLo-MQ.png 828w, https:\/\/miro.medium.com\/max\/1100\/1*rlep3njwAwN4zcjzoLo-MQ.png 1100w, https:\/\/miro.medium.com\/max\/496\/1*rlep3njwAwN4zcjzoLo-MQ.png 496w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 248px\" data-testid=\"og\">Mathematical expression for the swish function<\/picture><\/div>\n<\/figure>\n<\/div>\n\n\n\n<div><\/div>\n\n\n\n<div><\/div>\n\n\n\n<div class=\"ir is it iu iv\">\n<pre class=\"ko kp kq kr gx mz bs na nb dz nc\"><span id=\"7b9e\" class=\"ga nd mh iy nc b dm ne nf l ng nh\" data-selectable-paragraph=\"\"><strong class=\"nc iz\"><em class=\"mb\"># Swish function in Python<\/em><\/strong>\n<strong class=\"nc iz\">import<\/strong> matplotlib.pyplot <strong class=\"nc iz\">as<\/strong> plt\n<strong class=\"nc iz\">import <\/strong>numpy <strong class=\"nc iz\">as <\/strong>np<\/span><span id=\"1953\" class=\"ga nd mh iy nc b dm ni nf l ng nh\" data-selectable-paragraph=\"\">x = np.linspace(<em class=\"mb\">-5<\/em>, <em class=\"mb\">5<\/em>,<em class=\"mb\"> 50<\/em>)\nz = x * (<em class=\"mb\">1<\/em>\/(<em class=\"mb\">1<\/em> + np.exp(<em class=\"mb\">-x<\/em>)))<\/span><span id=\"8872\" class=\"ga nd mh iy nc b dm ni nf l ng nh\" data-selectable-paragraph=\"\">plt.subplots(figsize=(<em class=\"mb\">8<\/em>, <em class=\"mb\">5<\/em>))\nplt.plot(<em class=\"mb\">x<\/em>, <em class=\"mb\">z<\/em>)\nplt.grid()<\/span><span id=\"6400\" class=\"ga nd mh iy nc b dm ni nf l ng nh\" data-selectable-paragraph=\"\">plt.show()<\/span><\/pre>\n<figure class=\"ko kp kq kr gx ks gl gm paragraph-image\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"ce kx ky c aligncenter\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/max\/527\/1*YsAoFYY_xeHd6kzqYKDoLw.png\" alt=\"\" width=\"527\" height=\"334\"><\/figure><div class=\"gl gm op\" style=\"text-align: center;\"><picture><source srcset=\"https:\/\/miro.medium.com\/max\/640\/1*YsAoFYY_xeHd6kzqYKDoLw.png 640w, https:\/\/miro.medium.com\/max\/720\/1*YsAoFYY_xeHd6kzqYKDoLw.png 720w, https:\/\/miro.medium.com\/max\/750\/1*YsAoFYY_xeHd6kzqYKDoLw.png 750w, https:\/\/miro.medium.com\/max\/786\/1*YsAoFYY_xeHd6kzqYKDoLw.png 786w, https:\/\/miro.medium.com\/max\/828\/1*YsAoFYY_xeHd6kzqYKDoLw.png 828w, https:\/\/miro.medium.com\/max\/1100\/1*YsAoFYY_xeHd6kzqYKDoLw.png 1100w, https:\/\/miro.medium.com\/max\/1054\/1*YsAoFYY_xeHd6kzqYKDoLw.png 1054w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 527px\" data-testid=\"og\">Visual representation of the Swish function<\/picture><\/div>\n<\/figure>\n<\/div>\n\n\n\n<div class=\"o dx ns nt id nu\" role=\"separator\"><\/div>\n\n\n\n<div role=\"separator\"><\/div>\n\n\n\n<div class=\"ir is it iu iv\">\n<p id=\"37d7\" class=\"pw-post-body-paragraph ld le iy bm b lf lg jz lh li lj kc lk ll lm ln lo lp lq lr ls lt lu lv lw lx ir ga\" data-selectable-paragraph=\"\">Many activation functions exist to help neural networks learn complex patterns in data. It\u2019s common practice to use the same activation through all the hidden layers. For some activation functions, like the softmax activation function, it\u2019s rare to see them used in hidden layers \u2014 we typically use the softmax function in the output layer when there are two or more labels.<\/p>\n<p id=\"b36e\" class=\"pw-post-body-paragraph ld le iy bm b lf lg jz lh li lj kc lk ll lm ln lo lp lq lr ls lt lu lv lw lx ir ga\" data-selectable-paragraph=\"\">The activation function you decide to use for your model may significantly impact the performance of the neural network.&nbsp;<a class=\"au lc\" href=\"https:\/\/machinelearningmastery.com\/choose-an-activation-function-for-deep-learning\/\" target=\"_blank\" rel=\"noopener ugc nofollow\"><em class=\"mb\">How to Choose an Activation Function for Deep Learning<\/em><\/a>&nbsp;is a good article to learn more about choosing activation functions when building neural networks.<\/p>\n<p id=\"9888\" class=\"pw-post-body-paragraph ld le iy bm b lf lg jz lh li lj kc lk ll lm ln lo lp lq lr ls lt lu lv lw lx ir ga\" data-selectable-paragraph=\"\"><em class=\"mb\">Thanks for reading.<\/em><\/p>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>Photo by John Smit&nbsp;on&nbsp;Unsplash An activation function plays an important role in a neural network. It\u2019s a function used in artificial neurons to non-linearly transform inputs that come from the previous cell and provide an output. Failing to apply an activation function would mean the neurons would resemble&nbsp;linear regression. Thus, activation functions are required to [&hellip;]<\/p>\n","protected":false},"author":8,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"customer_name":"","customer_description":"","customer_industry":"","customer_technologies":"","customer_logo":"","footnotes":""},"categories":[6],"tags":[],"coauthors":[138],"class_list":["post-4437","post","type-post","status-publish","format-standard","hentry","category-machine-learning"],"yoast_head":"<!-- This site is optimized with the Yoast SEO Premium plugin v25.9 (Yoast SEO v25.9) - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Activation Functions In Neural Networks - Comet<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.comet.com\/site\/blog\/activation-functions-in-neural-networks\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Activation Functions In Neural Networks\" \/>\n<meta property=\"og:description\" content=\"Photo by John Smit&nbsp;on&nbsp;Unsplash An activation function plays an important role in a neural network. It\u2019s a function used in artificial neurons to non-linearly transform inputs that come from the previous cell and provide an output. Failing to apply an activation function would mean the neurons would resemble&nbsp;linear regression. Thus, activation functions are required to [&hellip;]\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.comet.com\/site\/blog\/activation-functions-in-neural-networks\/\" \/>\n<meta property=\"og:site_name\" content=\"Comet\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/cometdotml\" \/>\n<meta property=\"article:published_time\" content=\"2022-10-28T17:08:36+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2025-04-24T17:16:50+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/miro.medium.com\/max\/700\/0*d31lOUMVCqEND9th\" \/>\n<meta name=\"author\" content=\"Kurtis Pykes\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@Cometml\" \/>\n<meta name=\"twitter:site\" content=\"@Cometml\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Kurtis Pykes\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"9 minutes\" \/>\n<!-- \/ Yoast SEO Premium plugin. -->","yoast_head_json":{"title":"Activation Functions In Neural Networks - Comet","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.comet.com\/site\/blog\/activation-functions-in-neural-networks\/","og_locale":"en_US","og_type":"article","og_title":"Activation Functions In Neural Networks","og_description":"Photo by John Smit&nbsp;on&nbsp;Unsplash An activation function plays an important role in a neural network. It\u2019s a function used in artificial neurons to non-linearly transform inputs that come from the previous cell and provide an output. Failing to apply an activation function would mean the neurons would resemble&nbsp;linear regression. Thus, activation functions are required to [&hellip;]","og_url":"https:\/\/www.comet.com\/site\/blog\/activation-functions-in-neural-networks\/","og_site_name":"Comet","article_publisher":"https:\/\/www.facebook.com\/cometdotml","article_published_time":"2022-10-28T17:08:36+00:00","article_modified_time":"2025-04-24T17:16:50+00:00","og_image":[{"url":"https:\/\/miro.medium.com\/max\/700\/0*d31lOUMVCqEND9th","type":"","width":"","height":""}],"author":"Kurtis Pykes","twitter_card":"summary_large_image","twitter_creator":"@Cometml","twitter_site":"@Cometml","twitter_misc":{"Written by":"Kurtis Pykes","Est. reading time":"9 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/www.comet.com\/site\/blog\/activation-functions-in-neural-networks\/#article","isPartOf":{"@id":"https:\/\/www.comet.com\/site\/blog\/activation-functions-in-neural-networks\/"},"author":{"name":"Team Comet Digital","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/6266601170c60a7a82b3e0043fbe8ddf"},"headline":"Activation Functions In Neural Networks","datePublished":"2022-10-28T17:08:36+00:00","dateModified":"2025-04-24T17:16:50+00:00","mainEntityOfPage":{"@id":"https:\/\/www.comet.com\/site\/blog\/activation-functions-in-neural-networks\/"},"wordCount":1093,"publisher":{"@id":"https:\/\/www.comet.com\/site\/#organization"},"image":{"@id":"https:\/\/www.comet.com\/site\/blog\/activation-functions-in-neural-networks\/#primaryimage"},"thumbnailUrl":"https:\/\/miro.medium.com\/max\/700\/0*d31lOUMVCqEND9th","articleSection":["Machine Learning"],"inLanguage":"en-US"},{"@type":"WebPage","@id":"https:\/\/www.comet.com\/site\/blog\/activation-functions-in-neural-networks\/","url":"https:\/\/www.comet.com\/site\/blog\/activation-functions-in-neural-networks\/","name":"Activation Functions In Neural Networks - Comet","isPartOf":{"@id":"https:\/\/www.comet.com\/site\/#website"},"primaryImageOfPage":{"@id":"https:\/\/www.comet.com\/site\/blog\/activation-functions-in-neural-networks\/#primaryimage"},"image":{"@id":"https:\/\/www.comet.com\/site\/blog\/activation-functions-in-neural-networks\/#primaryimage"},"thumbnailUrl":"https:\/\/miro.medium.com\/max\/700\/0*d31lOUMVCqEND9th","datePublished":"2022-10-28T17:08:36+00:00","dateModified":"2025-04-24T17:16:50+00:00","breadcrumb":{"@id":"https:\/\/www.comet.com\/site\/blog\/activation-functions-in-neural-networks\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.comet.com\/site\/blog\/activation-functions-in-neural-networks\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/blog\/activation-functions-in-neural-networks\/#primaryimage","url":"https:\/\/miro.medium.com\/max\/700\/0*d31lOUMVCqEND9th","contentUrl":"https:\/\/miro.medium.com\/max\/700\/0*d31lOUMVCqEND9th"},{"@type":"BreadcrumbList","@id":"https:\/\/www.comet.com\/site\/blog\/activation-functions-in-neural-networks\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.comet.com\/site\/"},{"@type":"ListItem","position":2,"name":"Activation Functions In Neural Networks"}]},{"@type":"WebSite","@id":"https:\/\/www.comet.com\/site\/#website","url":"https:\/\/www.comet.com\/site\/","name":"Comet","description":"Build Better Models Faster","publisher":{"@id":"https:\/\/www.comet.com\/site\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.comet.com\/site\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/www.comet.com\/site\/#organization","name":"Comet ML, Inc.","alternateName":"Comet","url":"https:\/\/www.comet.com\/site\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/#\/schema\/logo\/image\/","url":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/01\/logo_comet_square.png","contentUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/01\/logo_comet_square.png","width":310,"height":310,"caption":"Comet ML, Inc."},"image":{"@id":"https:\/\/www.comet.com\/site\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/cometdotml","https:\/\/x.com\/Cometml","https:\/\/www.youtube.com\/channel\/UCmN63HKvfXSCS-UwVwmK8Hw"]},{"@type":"Person","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/6266601170c60a7a82b3e0043fbe8ddf","name":"Team Comet Digital","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/image\/4f0c0a8cc7c0e87c636ff6a420a6647c","url":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/08\/Screen-Shot-2023-08-12-at-8.58.50-AM-96x96.png","contentUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/08\/Screen-Shot-2023-08-12-at-8.58.50-AM-96x96.png","caption":"Team Comet Digital"},"sameAs":["https:\/\/www.comet.ml\/"],"url":"https:\/\/www.comet.com\/site\/blog\/author\/teamcometdigital\/"}]}},"_links":{"self":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/4437","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/users\/8"}],"replies":[{"embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/comments?post=4437"}],"version-history":[{"count":1,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/4437\/revisions"}],"predecessor-version":[{"id":15663,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/4437\/revisions\/15663"}],"wp:attachment":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/media?parent=4437"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/categories?post=4437"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/tags?post=4437"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/coauthors?post=4437"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}