{"id":7128,"date":"2023-08-14T04:54:40","date_gmt":"2023-08-14T12:54:40","guid":{"rendered":"https:\/\/live-cometml.pantheonsite.io\/?p=7128"},"modified":"2025-04-24T17:14:50","modified_gmt":"2025-04-24T17:14:50","slug":"how-to-make-your-machine-learning-models-robust-to-outliers","status":"publish","type":"post","link":"https:\/\/www.comet.com\/site\/blog\/how-to-make-your-machine-learning-models-robust-to-outliers\/","title":{"rendered":"How to Make Your Machine Learning Models Robust to Outliers"},"content":{"rendered":"\n<link rel=\"canonical\" href=\"https:\/\/www.comet.com\/site\/blog\/how-to-make-your-machine-learning-models-robust-to-outliers\">\n\n\n\n<div class=\"fh fi fj fk fl\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg my mz c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:1500\/1*Ocxi71mUXvhgo3rNg1XR4g.png\" alt=\"\" width=\"1500\" height=\"996\"><\/figure>\n<div class=\"ab ca\">\n<div class=\"ch bg et eu ev ew\">\n<p id=\"c993\" class=\"pw-post-body-paragraph lu lv fo be b lw lx ly lz ma mb mc md me mf mg mh mi mj mk ml mm mn mo mp mq fh bj\" data-selectable-paragraph=\"\">According to <a class=\"af na\" href=\"https:\/\/en.wikipedia.org\/wiki\/Outlier\" target=\"_blank\" rel=\"noopener ugc nofollow\">Wikipedia<\/a>, an <strong class=\"be nb\">outlier<\/strong> is an observation point that is distant from other observations. This definition is vague because it doesn\u2019t quantify the word \u201cdistant\u201d. In this blog, we\u2019ll try to understand the different interpretations of this \u201cdistant\u201d notion. 
We will also look at outlier detection and treatment techniques, and at their impact on different types of machine learning models.<\/p>\n<p id=\"1a37\" class=\"pw-post-body-paragraph lu lv fo be b lw lx ly lz ma mb mc md me mf mg mh mi mj mk ml mm mn mo mp mq fh bj\" data-selectable-paragraph=\"\">Outliers arise due to changes in system behavior, fraudulent behavior, human error, instrument error, or simply through natural deviations in populations. A sample may have been contaminated with elements from outside the population being examined.<\/p>\n<p id=\"b3ec\" class=\"pw-post-body-paragraph lu lv fo be b lw lx ly lz ma mb mc md me mf mg mh mi mj mk ml mm mn mo mp mq fh bj\" data-selectable-paragraph=\"\">Many machine learning models, like <a class=\"af na\" href=\"https:\/\/www.comet.com\/site\/blog\/5-regression-loss-functions-all-machine-learners-should-know\/\" target=\"_blank\" rel=\"noopener ugc nofollow\">linear &amp; logistic regression<\/a>, are easily impacted by outliers in the training data. Models like <a class=\"af na\" href=\"https:\/\/machinelearningmastery.com\/boosting-and-adaboost-for-machine-learning\/\" target=\"_blank\" rel=\"noopener ugc nofollow\">AdaBoost<\/a> increase the weights of misclassified points on every iteration and can therefore put high weights on these outliers, since they tend to be misclassified often. This becomes an issue if the outlier is an error of some kind, or if we want our model to generalize well rather than fit extreme values.<\/p>\n<p id=\"f9a2\" class=\"pw-post-body-paragraph lu lv fo be b lw lx ly lz ma mb mc md me mf mg mh mi mj mk ml mm mn mo mp mq fh bj\" data-selectable-paragraph=\"\">To overcome this issue, we can either change the model or the metric, or we can make some changes in the data and keep the same models. 
For the analysis, we will use the <a class=\"af na\" href=\"https:\/\/www.kaggle.com\/c\/house-prices-advanced-regression-techniques\/data\" target=\"_blank\" rel=\"noopener ugc nofollow\">House Prices Kaggle Data<\/a>. All the code for the plots and implementation can be found in this <a class=\"af na\" href=\"https:\/\/github.com\/aswalin\/Outlier-Impact-Treatment\" target=\"_blank\" rel=\"noopener ugc nofollow\">GitHub Repository<\/a>.<\/p>\n<h1 id=\"7177\" class=\"nc nd fo be ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx ny nz bj\" data-selectable-paragraph=\"\">What do we mean by outliers?<\/h1>\n<p id=\"6eb5\" class=\"pw-post-body-paragraph lu lv fo be b lw oa ly lz ma ob mc md me oc mg mh mi od mk ml mm oe mo mp mq fh bj\" data-selectable-paragraph=\"\">In supervised learning, extreme values can be present in both the dependent &amp; independent variables.<\/p>\n<p id=\"a5da\" class=\"pw-post-body-paragraph lu lv fo be b lw lx ly lz ma mb mc md me mf mg mh mi mj mk ml mm mn mo mp mq fh bj\" data-selectable-paragraph=\"\">These extreme values need not necessarily impact the model performance or accuracy, but when they do, they are called <strong class=\"be nb\">\u201cInfluential\u201d<\/strong> points.<\/p>\n<h2 id=\"c1cb\" class=\"of nd fo be ne og oh oi ni oj ok ol nm me om on oo mi op oq or mm os ot ou ov bj\" data-selectable-paragraph=\"\">Extreme Values in Independent Variables<\/h2>\n<p id=\"b6b0\" class=\"pw-post-body-paragraph lu lv fo be b lw oa ly lz ma ob mc md me oc mg mh mi od mk ml mm oe mo mp mq fh bj\" data-selectable-paragraph=\"\">These are called points of <strong class=\"be nb\">\u201chigh leverage\u201d<\/strong>. With a single predictor, an extreme value is simply one that is particularly high or low. 
With multiple predictors, extreme values may be particularly high or low for one or more predictors <strong class=\"be nb\"><em class=\"mr\">(univariate analysis \u2014 analysis of one variable at a time)<\/em><\/strong> or may be \u201cunusual\u201d combinations of predictor values <strong class=\"be nb\"><em class=\"mr\">(multivariate analysis)<\/em><\/strong>.<\/p>\n<p id=\"0fc3\" class=\"pw-post-body-paragraph lu lv fo be b lw lx ly lz ma mb mc md me mf mg mh mi mj mk ml mm mn mo mp mq fh bj\" data-selectable-paragraph=\"\">In the following figure, all the points on the right-hand side of the orange line are leverage points.<\/p>\n<figure class=\"mt mu mv mw mx ms ow ox paragraph-image\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg my mz c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:446\/0*mxWaQXIrh0NHwSnH.\" alt=\"\" width=\"446\" height=\"424\"><\/figure>\n<\/figure>\n<h2 id=\"6447\" class=\"of nd fo be ne og oh oi ni oj ok ol nm me om on oo mi op oq or mm os ot ou ov bj\" data-selectable-paragraph=\"\">Extreme Values in Target Variables<\/h2>\n<p id=\"af1c\" class=\"pw-post-body-paragraph lu lv fo be b lw oa ly lz ma ob mc md me oc mg mh mi od mk ml mm oe mo mp mq fh bj\" data-selectable-paragraph=\"\">In regression, these extreme values are termed <strong class=\"be nb\">\u201coutliers\u201d<\/strong>. They may or may not be influential points, which we will see later. In the following figure, all the points above the orange line can be classified as outliers.<\/p>\n<figure class=\"mt mu mv mw mx ms ow ox paragraph-image\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg my mz c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:446\/0*QaO-iDZqbr9eh9xL.\" alt=\"\" width=\"446\" height=\"424\"><\/figure>\n<\/figure>\n<p id=\"d43e\" class=\"pw-post-body-paragraph lu lv fo be b lw lx ly lz ma mb mc md me mf mg mh mi mj mk ml mm mn mo mp mq fh bj\" data-selectable-paragraph=\"\">In classification, we have two types of extreme values:<\/p>\n<p id=\"2203\" class=\"pw-post-body-paragraph lu lv fo be b lw lx ly lz ma mb mc md me mf mg mh mi mj mk ml mm mn mo mp mq fh bj\" data-selectable-paragraph=\"\"><strong class=\"be nb\">1. Outliers:<\/strong> For example, in an image classification problem in which we\u2019re trying to identify dogs\/cats, one of the images in the training set has a gorilla (or any other category not part of the goal of the problem) by mistake. Here, the gorilla image is clearly noise. Detecting outliers here does not make sense, because we already know which categories we want to focus on and which to discard.<\/p>\n<p id=\"88b5\" class=\"pw-post-body-paragraph lu lv fo be b lw lx ly lz ma mb mc md me mf mg mh mi mj mk ml mm mn mo mp mq fh bj\" data-selectable-paragraph=\"\"><strong class=\"be nb\">2. Novelties:<\/strong> Many times we\u2019re dealing with novelties, and the problem is often called <strong class=\"be nb\">supervised anomaly detection<\/strong>. In this case, the goal is not to remove outliers or reduce their impact; rather, we are interested in detecting anomalies in new observations. Therefore, we won\u2019t be discussing it in this post. 
It is especially used for fraud detection in credit-card transactions, fake calls, etc.<\/p>\n<p id=\"2fcf\" class=\"pw-post-body-paragraph lu lv fo be b lw lx ly lz ma mb mc md me mf mg mh mi mj mk ml mm mn mo mp mq fh bj\" data-selectable-paragraph=\"\">All the points we have discussed above, including influential points, will become very clear once we visualize the following figure.<\/p>\n<figure class=\"mt mu mv mw mx ms ow ox paragraph-image\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg my mz c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:446\/0*OWYbfo8gMrwpLxrO.\" alt=\"\" width=\"446\" height=\"424\"><\/figure>\n<\/figure>\n<blockquote class=\"oz pa pb\"><p id=\"3965\" class=\"lu lv mr be b lw lx ly lz ma mb mc md pc mf mg mh pd mj mk ml pe mn mo mp mq fh bj\" data-selectable-paragraph=\"\"><strong class=\"be nb\"><em class=\"fo\">Inference<br>\n<\/em><\/strong><em class=\"fo\">&#8211; Points in Q1: Outliers<br>\n&#8211; Points in Q3: Leverage points<br>\n&#8211; Points in Q2: Both outliers &amp; leverage points, but non-influential<br>\n&#8211; Circled points: Examples of influential points. There can be more, but these are the prominent ones<\/em><\/p><\/blockquote>\n<p id=\"773d\" class=\"pw-post-body-paragraph lu lv fo be b lw lx ly lz ma mb mc md me mf mg mh mi mj mk ml mm mn mo mp mq fh bj\" data-selectable-paragraph=\"\">Our major focus will be outliers (extreme values in the <strong class=\"be nb\">target variable<\/strong>) for further investigation and treatment. 
We\u2019ll see the impact of these extreme values on the model\u2019s performance.<\/p>\n<\/div>\n<\/div>\n<\/div>\n\n\n\n<div class=\"fh fi fj fk fl\">\n<div class=\"ab ca\">\n<div class=\"ch bg et eu ev ew\">\n<h1 id=\"e5b9\" class=\"nc nd fo be ne nf px nh ni nj py nl nm nn pz np nq nr qa nt nu nv qb nx ny nz bj\" data-selectable-paragraph=\"\">Common Methods for Detecting Outliers<\/h1>\n<p id=\"4e0f\" class=\"pw-post-body-paragraph lu lv fo be b lw oa ly lz ma ob mc md me oc mg mh mi od mk ml mm oe mo mp mq fh bj\" data-selectable-paragraph=\"\">When detecting outliers, we are either doing univariate analysis or multivariate analysis. When your linear model has a single predictor, then you can use univariate analysis. However, it can give misleading results if you use it for multiple predictors. One common way of performing outlier detection is <strong class=\"be nb\">to assume that the regular data come from a known distribution<\/strong> (e.g. data are Gaussian distributed). This assumption is discussed in the Z-Score method section below.<\/p>\n<h2 id=\"74aa\" class=\"of nd fo be ne og oh oi ni oj ok ol nm me om on oo mi op oq or mm os ot ou ov bj\" data-selectable-paragraph=\"\">Box-Plot<\/h2>\n<p id=\"0805\" class=\"pw-post-body-paragraph lu lv fo be b lw oa ly lz ma ob mc md me oc mg mh mi od mk ml mm oe mo mp mq fh bj\" data-selectable-paragraph=\"\">The quickest and easiest way to identify outliers is by <a class=\"af na\" href=\"https:\/\/heartbeat.comet.ml\/introduction-to-matplotlib-data-visualization-in-python-d9143287ae39\" target=\"_blank\" rel=\"noopener ugc nofollow\">visualizing them<\/a> using plots. If your dataset is not huge (approx. up to 10k observations &amp; 100 features), I would highly recommend you build scatter plots &amp; box-plots of variables. 
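<\/p>
<p>To make this concrete, here is a minimal sketch of the box-plot rule (the 1.5*IQR whisker fences described below) on a small hypothetical array of values:</p>

```python
import numpy as np

# Hypothetical values; 500 is an obvious outlier.
values = np.array([100, 102, 98, 105, 110, 95, 500])

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# The points a box-plot would draw beyond its whiskers.
outliers = values[(values < lower) | (values > upper)]
```

<p>This is exactly what the box-plot draws as individual points beyond the whiskers.</p>
<p>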
Even if there aren\u2019t outliers, you\u2019ll gain other insights, like correlations, variability, or the impact of external factors such as a world war or recession on economic variables. However, this method is not recommended for high-dimensional data, where the power of visualization fails.<\/p>\n<figure class=\"mt mu mv mw mx ms ow ox paragraph-image\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg my mz c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:356\/0*ZAyuXvalL7MVjnXd.\" alt=\"\" width=\"356\" height=\"266\"><\/figure>\n<\/figure>\n<p id=\"9868\" class=\"pw-post-body-paragraph lu lv fo be b lw lx ly lz ma mb mc md me mf mg mh mi mj mk ml mm mn mo mp mq fh bj\" data-selectable-paragraph=\"\">The box plot uses the inter-quartile range to detect outliers. Here, we first determine the quartiles <em class=\"mr\">Q<\/em>1 and <em class=\"mr\">Q<\/em>3.<\/p>\n<p id=\"fabf\" class=\"pw-post-body-paragraph lu lv fo be b lw lx ly lz ma mb mc md me mf mg mh mi mj mk ml mm mn mo mp mq fh bj\" data-selectable-paragraph=\"\">The interquartile range is given by IQR = Q3 - Q1<\/p>\n<p id=\"f39c\" class=\"pw-post-body-paragraph lu lv fo be b lw lx ly lz ma mb mc md me mf mg mh mi mj mk ml mm mn mo mp mq fh bj\" data-selectable-paragraph=\"\">Upper limit = Q3 + 1.5*IQR<\/p>\n<p id=\"b2c9\" class=\"pw-post-body-paragraph lu lv fo be b lw lx ly lz ma mb mc md me mf mg mh mi mj mk ml mm mn mo mp mq fh bj\" data-selectable-paragraph=\"\">Lower limit = Q1 - 1.5*IQR<\/p>\n<p id=\"5fd7\" class=\"pw-post-body-paragraph lu lv fo be b lw lx ly lz ma mb mc md me mf mg mh mi mj mk ml mm mn mo mp mq fh bj\" data-selectable-paragraph=\"\">Anything below the lower limit or above the upper limit is considered an outlier.<\/p>\n<h2 id=\"77ff\" class=\"of nd fo be ne og 
oh oi ni oj ok ol nm me om on oo mi op oq or mm os ot ou ov bj\" data-selectable-paragraph=\"\">Cook\u2019s Distance<\/h2>\n<p id=\"40b1\" class=\"pw-post-body-paragraph lu lv fo be b lw oa ly lz ma ob mc md me oc mg mh mi od mk ml mm oe mo mp mq fh bj\" data-selectable-paragraph=\"\">This is a multivariate approach for finding influential points. These points may or may not be outliers, as explained above, but they have the power to influence the regression model. We will see their impact later in the blog.<\/p>\n<p id=\"0090\" class=\"pw-post-body-paragraph lu lv fo be b lw lx ly lz ma mb mc md me mf mg mh mi mj mk ml mm mn mo mp mq fh bj\" data-selectable-paragraph=\"\">This method is used only for linear regression and therefore has limited application. <a class=\"af na\" href=\"https:\/\/en.wikipedia.org\/wiki\/Cook%27s_distance\" target=\"_blank\" rel=\"noopener ugc nofollow\">Cook\u2019s distance<\/a> measures the effect of deleting a given observation. It represents the sum of all the changes in the regression model when observation <strong class=\"be nb\">\u201ci\u201d<\/strong> is removed from it.<\/p>\n<figure class=\"mt mu mv mw mx ms ow ox paragraph-image\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg my mz c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:486\/0*sGu8fP6M6y1j5ZZf.\" alt=\"\" width=\"486\" height=\"138\"><\/figure>\n<\/figure>\n<p id=\"acb1\" class=\"pw-post-body-paragraph lu lv fo be b lw lx ly lz ma mb mc md me mf mg mh mi mj mk ml mm mn mo mp mq fh bj\" data-selectable-paragraph=\"\">Here, p is the number of predictors and s\u00b2 is the mean squared error of the regression model. There are different views regarding the cut-off values to use for spotting highly influential points. 
A rule of thumb is that D(i) &gt; 4\/n, can be good cut off for influential points.<\/p>\n<p id=\"e945\" class=\"pw-post-body-paragraph lu lv fo be b lw lx ly lz ma mb mc md me mf mg mh mi mj mk ml mm mn mo mp mq fh bj\" data-selectable-paragraph=\"\">R has the <a class=\"af na\" href=\"http:\/\/cran.r-project.org\/web\/packages\/car\/index.html\" target=\"_blank\" rel=\"noopener ugc nofollow\">car<\/a> (Companion to Applied Regression) package where you can directly find outliers using Cook\u2019s distance. Implementation is provided in this <a class=\"af na\" href=\"https:\/\/www.statmethods.net\/stats\/rdiagnostics.html\" target=\"_blank\" rel=\"noopener ugc nofollow\">R-Tutorial<\/a>. Another similar approach is <strong class=\"be nb\">DFFITS<\/strong>, which you can see details of <a class=\"af na\" href=\"https:\/\/newonlinecourses.science.psu.edu\/stat501\/node\/340\/\" target=\"_blank\" rel=\"noopener ugc nofollow\">here<\/a>.<\/p>\n<h2 id=\"1945\" class=\"of nd fo be ne og oh oi ni oj ok ol nm me om on oo mi op oq or mm os ot ou ov bj\" data-selectable-paragraph=\"\">Z-Score<\/h2>\n<p id=\"09d7\" class=\"pw-post-body-paragraph lu lv fo be b lw oa ly lz ma ob mc md me oc mg mh mi od mk ml mm oe mo mp mq fh bj\" data-selectable-paragraph=\"\">This method assumes that the variable has a Gaussian distribution. It represents the number of standard deviations an observation is away from the mean:<\/p>\n<figure class=\"mt mu mv mw mx ms ow ox paragraph-image\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg my mz c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:85\/0*05VVM2yHIiWyrvjM.\" alt=\"\" width=\"85\" height=\"39\"><\/figure><div class=\"ow ox qe\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/0*05VVM2yHIiWyrvjM. 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/0*05VVM2yHIiWyrvjM. 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/0*05VVM2yHIiWyrvjM. 
750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/0*05VVM2yHIiWyrvjM. 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/0*05VVM2yHIiWyrvjM. 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/0*05VVM2yHIiWyrvjM. 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:170\/0*05VVM2yHIiWyrvjM. 170w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 85px\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/0*05VVM2yHIiWyrvjM. 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/0*05VVM2yHIiWyrvjM. 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/0*05VVM2yHIiWyrvjM. 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/0*05VVM2yHIiWyrvjM. 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/0*05VVM2yHIiWyrvjM. 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/0*05VVM2yHIiWyrvjM. 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:170\/0*05VVM2yHIiWyrvjM. 
170w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 85px\" data-testid=\"og\"><\/picture><\/div>\n<\/figure>\n<p id=\"4812\" class=\"pw-post-body-paragraph lu lv fo be b lw lx ly lz ma mb mc md me mf mg mh mi mj mk ml mm mn mo mp mq fh bj\" data-selectable-paragraph=\"\">Here, we normally define outliers as points whose modulus of z-score is greater than a threshold value. This threshold value is usually greater than 2 (3 is a common value).<\/p>\n<figure class=\"mt mu mv mw mx ms ow ox paragraph-image\">\n<div class=\"qg qh eb qi bg qj\" tabindex=\"0\" role=\"button\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg my mz c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/0*2NlsLGlMtgtII_hN.\" alt=\"\" width=\"700\" height=\"525\"><\/figure><div class=\"ow ox qf\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/0*2NlsLGlMtgtII_hN. 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/0*2NlsLGlMtgtII_hN. 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/0*2NlsLGlMtgtII_hN. 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/0*2NlsLGlMtgtII_hN. 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/0*2NlsLGlMtgtII_hN. 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/0*2NlsLGlMtgtII_hN. 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/0*2NlsLGlMtgtII_hN. 
1400w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/0*2NlsLGlMtgtII_hN. 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/0*2NlsLGlMtgtII_hN. 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/0*2NlsLGlMtgtII_hN. 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/0*2NlsLGlMtgtII_hN. 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/0*2NlsLGlMtgtII_hN. 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/0*2NlsLGlMtgtII_hN. 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/0*2NlsLGlMtgtII_hN. 
1400w\" data-testid=\"og\"><\/picture><\/div>\n<\/div><figcaption class=\"qk ql qm ow ox qn qo be b bf z dv\" data-selectable-paragraph=\"\">Reference: <a class=\"af na\" href=\"http:\/\/slideplayer.com\/slide\/6394283\/\" target=\"_blank\" rel=\"noopener ugc nofollow\">http:\/\/slideplayer.com\/slide\/6394283\/<\/a><\/figcaption><\/figure>\n<p id=\"e307\" class=\"pw-post-body-paragraph lu lv fo be b lw lx ly lz ma mb mc md me mf mg mh mi mj mk ml mm mn mo mp mq fh bj\" data-selectable-paragraph=\"\">All the above methods are useful for an initial analysis of the data, but they offer little value in multivariate settings or with high-dimensional data. For such datasets, we have to use advanced methods like <strong class=\"be nb\">PCA, LOF (Local Outlier Factor) &amp; HiCS (High Contrast Subspaces for Density-Based Outlier Ranking)<\/strong>.<\/p>\n<p id=\"b38d\" class=\"pw-post-body-paragraph lu lv fo be b lw lx ly lz ma mb mc md me mf mg mh mi mj mk ml mm mn mo mp mq fh bj\" data-selectable-paragraph=\"\">We won\u2019t discuss these methods in this blog, as they are beyond its scope. Our focus here is to see how various outlier treatment techniques affect the performance of models. 
You can read <a class=\"af na\" href=\"https:\/\/www.datasciencecentral.com\/profiles\/blogs\/introduction-to-outlier-detection-methods\" target=\"_blank\" rel=\"noopener ugc nofollow\">this blog<\/a> for details on these methods.<\/p>\n<h1 id=\"1b2c\" class=\"nc nd fo be ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx ny nz bj\" data-selectable-paragraph=\"\">Impact &amp; Treatment of Outliers<\/h1>\n<p id=\"f680\" class=\"pw-post-body-paragraph lu lv fo be b lw oa ly lz ma ob mc md me oc mg mh mi od mk ml mm oe mo mp mq fh bj\" data-selectable-paragraph=\"\">The impact of outliers can be seen not only in predictive modeling but also in statistical tests, where they reduce the power of the tests. Most parametric statistics, like means, standard deviations, and correlations, and every statistic based on these, are highly sensitive to outliers. In this post, however, we focus only on the impact of outliers in predictive modeling.<\/p>\n<h2 id=\"1e80\" class=\"of nd fo be ne og oh oi ni oj ok ol nm me om on oo mi op oq or mm os ot ou ov bj\" data-selectable-paragraph=\"\">To Drop or Not to Drop<\/h2>\n<p id=\"fd47\" class=\"pw-post-body-paragraph lu lv fo be b lw oa ly lz ma ob mc md me oc mg mh mi od mk ml mm oe mo mp mq fh bj\" data-selectable-paragraph=\"\">I believe dropping data is always a harsh step and should be taken only in extreme cases when we are very sure that the <strong class=\"be nb\">outlier is a measurement error<\/strong>, which we generally do not know; details of the data collection process are rarely available. When we drop data, we lose information about the variability in the data. 
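As a concrete sketch of the z-score rule and of dropping flagged rows, the snippet below shows one minimal way to do it (the toy values, column name, and the threshold of 3 are illustrative assumptions, not the case-study data):

```python
import numpy as np
import pandas as pd

def zscore_outliers(values, threshold=3.0):
    """Flag points whose z-score exceeds the threshold in absolute value."""
    values = np.asarray(values, dtype=float)
    z = (values - values.mean()) / values.std()
    return np.abs(z) > threshold

# 19 well-behaved points plus one extreme value.
y = [10.1, 9.8, 10.5, 10.0, 9.9, 10.2, 10.4, 9.7, 10.3, 10.0,
     9.6, 10.1, 9.9, 10.2, 10.0, 9.8, 10.3, 10.1, 9.9, 100.0]
df = pd.DataFrame({"y": y})

mask = zscore_outliers(df["y"])
clean = df[~mask]  # dropping the flagged rows discards their variability
```

Note that with very few observations a single outlier inflates the standard deviation so much that its own z-score may stay below the threshold, which is one reason the threshold choice matters.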
When we have many observations and the <strong class=\"be nb\">outliers are few<\/strong>, we can consider dropping those observations.<\/p>\n<p id=\"9ab0\" class=\"pw-post-body-paragraph lu lv fo be b lw lx ly lz ma mb mc md me mf mg mh mi mj mk ml mm mn mo mp mq fh bj\" data-selectable-paragraph=\"\">In the following example, the slope of the regression line changes considerably in the presence of the extreme values at the top. Hence, it is reasonable to drop them and get a better fit &amp; a more general solution.<\/p>\n<figure class=\"mt mu mv mw mx ms ow ox paragraph-image\">\n<div class=\"qg qh eb qi bg qj\" tabindex=\"0\" role=\"button\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg my mz c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/0*_SHBO9KupWceAJsa.\" alt=\"\" width=\"700\" height=\"350\"><\/figure><div class=\"ow ox qf\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:1400\/0*_SHBO9KupWceAJsa. 
1400w\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:1400\/0*_SHBO9KupWceAJsa. 
1400w\" data-testid=\"og\"><\/picture><\/div>\n<\/div>\n<figcaption class=\"qk ql qm ow ox qn qo be b bf z dv\" data-selectable-paragraph=\"\">Source: <a class=\"af na\" href=\"https:\/\/www.r-bloggers.com\/outlier-detection-and-treatment-with-r\/\" target=\"_blank\" rel=\"noopener ugc nofollow\">https:\/\/www.r-bloggers.com\/outlier-detection-and-treatment-with-r\/<\/a><\/figcaption>\n<\/figure>\n<h2 id=\"7bd9\" class=\"of nd fo be ne og oh oi ni oj ok ol nm me om on oo mi op oq or mm os ot ou ov bj\" data-selectable-paragraph=\"\">Other Data-Based Methods<\/h2>\n<ul class=\"\">\n<li id=\"cf3a\" class=\"lu lv fo be b lw oa ly lz ma ob mc md pc oc mg mh pd od mk ml pe oe mo mp mq qp qq qr bj\" data-selectable-paragraph=\"\"><strong class=\"be nb\">Winsorizing:<\/strong> This method sets the extreme values of an attribute to specified percentile values. For example, for a 90% winsorization, values below the 5th percentile are set equal to the value at the 5th percentile, while values above the 95th percentile are set equal to the value at the 95th percentile. This is gentler than trimming, where we simply exclude the extreme values.<\/li>\n<li id=\"eea1\" class=\"lu lv fo be b lw qs ly lz ma qt mc md pc qu mg mh pd qv mk ml pe qw mo mp mq qp qq qr bj\" data-selectable-paragraph=\"\"><strong class=\"be nb\">Log-Scale Transformation:<\/strong> This method is often used to reduce the variability of the data, including outlying observations. Here, each y value is replaced by log(y). 
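As a rough sketch of these two data-based fixes, winsorizing at the 5th/95th percentiles and a log transform might look like this (np.clip and log1p are illustrative choices, not necessarily what the case-study code uses):

```python
import numpy as np

y = np.array([1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 50.0])  # right-skewed toy data

# 90% winsorization: clip values beyond the 5th/95th percentiles
# to the percentile values themselves (replaced, not removed).
lo, hi = np.percentile(y, [5, 95])
y_wins = np.clip(y, lo, hi)

# Log-scale transformation: compresses the long right tail.
# log1p(y) = log(1 + y) also tolerates zeros.
y_log = np.log1p(y)
```

Either way the extreme value is pulled toward the bulk of the data, but winsorizing hard-replaces it while the log transform reshapes the whole distribution.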
It\u2019s often preferred when the response variable follows an <strong class=\"be nb\">exponential distribution or is right-skewed<\/strong>.<\/li>\n<li id=\"9722\" class=\"lu lv fo be b lw qs ly lz ma qt mc md pc qu mg mh pd qv mk ml pe qw mo mp mq qp qq qr bj\" data-selectable-paragraph=\"\">However, it\u2019s a controversial step and <strong class=\"be nb\">does not necessarily reduce<\/strong> the variance. For example, this <a class=\"af na\" href=\"https:\/\/stats.stackexchange.com\/questions\/130262\/why-not-log-transform-all-variables-that-are-not-of-main-interest\" target=\"_blank\" rel=\"noopener ugc nofollow\">answer<\/a> beautifully captures those cases.<\/li>\n<li id=\"59dd\" class=\"lu lv fo be b lw qs ly lz ma qt mc md pc qu mg mh pd qv mk ml pe qw mo mp mq qp qq qr bj\" data-selectable-paragraph=\"\">A poor example of such a transformation &#8211;<\/li>\n<\/ul>\n<figure class=\"mt mu mv mw mx ms ow ox paragraph-image\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg my mz c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:474\/0*cOM2KZVXSTLkFDC4.\" alt=\"\" width=\"474\" height=\"265\"><\/figure><div class=\"ow ox qx\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:948\/0*cOM2KZVXSTLkFDC4. 
948w\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:948\/0*cOM2KZVXSTLkFDC4. 
948w\" data-testid=\"og\"><\/picture><\/div>\n<figcaption class=\"qk ql qm ow ox qn qo be b bf z dv\" data-selectable-paragraph=\"\">An initially left-skewed distribution becomes even more skewed after the log-transform<\/figcaption>\n<\/figure>\n<ul class=\"\">\n<li id=\"4f60\" class=\"lu lv fo be b lw lx ly lz ma mb mc md pc mf mg mh pd mj mk ml pe mn mo mp mq qp qq qr bj\" data-selectable-paragraph=\"\"><strong class=\"be nb\">Binning:<\/strong> This refers to dividing the range of a continuous variable into groups. We do this to discover patterns in continuous variables that are otherwise difficult to analyze. However, it also leads to <strong class=\"be nb\">loss of information<\/strong> and loss of power.<\/li>\n<\/ul>\n<h2 id=\"5ab9\" class=\"of nd fo be ne og oh oi ni oj ok ol nm me om on oo mi op oq or mm os ot ou ov bj\" data-selectable-paragraph=\"\">Model-Based Methods<\/h2>\n<ul class=\"\">\n<li id=\"aaac\" class=\"lu lv fo be b lw oa ly lz ma ob mc md pc oc mg mh pd od mk ml pe oe mo mp mq qp qq qr bj\" data-selectable-paragraph=\"\">Use a different model: Instead of linear models, we can use <a class=\"af na\" href=\"https:\/\/heartbeat.comet.ml\/introduction-to-decision-tree-learning-cd604f85e236\" target=\"_blank\" rel=\"noopener ugc nofollow\">tree-based methods<\/a> like Random Forests and Gradient Boosting, which are less impacted by outliers. 
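One way to sanity-check this claim is to fit a linear model and a tree ensemble on the same contaminated data; the sketch below uses scikit-learn on synthetic data (the seed, sample size, and outlier magnitude are arbitrary assumptions, not the house-price case study):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 3.0 * X.ravel() + rng.normal(0, 1, size=200)  # true y(5) is about 15
y[:5] += 300.0  # inject a few extreme outliers

lin = LinearRegression().fit(X, y)
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
# criterion="absolute_error" would make the forest's splits even less
# sensitive to the injected outliers, at a higher computational cost.

pred_lin = lin.predict([[5.0]])  # dragged upward by the outliers
pred_rf = rf.predict([[5.0]])
```

The 2.5% contamination shifts the least-squares line noticeably at every x, while the forest mostly isolates the outliers into their own leaves.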
This <a class=\"af na\" href=\"https:\/\/www.quora.com\/Why-are-tree-based-models-robust-to-outliers\" target=\"_blank\" rel=\"noopener ugc nofollow\">answer<\/a> clearly explains why tree-based methods are robust to outliers.<\/li>\n<li id=\"965d\" class=\"lu lv fo be b lw qs ly lz ma qt mc md pc qu mg mh pd qv mk ml pe qw mo mp mq qp qq qr bj\" data-selectable-paragraph=\"\"><mark class=\"adk adl ao\">Metrics: Use MAE instead of RMSE as the loss function. We can also use a truncated loss:<\/mark><\/li>\n<\/ul>\n<figure class=\"mt mu mv mw mx ms ow ox paragraph-image\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg my mz c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:401\/0*fUWNxn5eq6346V24.\" alt=\"\" width=\"401\" height=\"54\"><\/figure><div class=\"ow ox qy\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/0*fUWNxn5eq6346V24. 
640w\" data-testid=\"og\"><\/picture><\/div>\n<\/figure>\n<figure class=\"mt mu mv mw mx ms ow ox paragraph-image\">\n<div class=\"qg qh eb qi bg qj\" tabindex=\"0\" role=\"button\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg my mz c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/0*iVLpXsc-Oz1URLBG.\" alt=\"\" width=\"700\" height=\"544\"><\/figure><div class=\"ow ox qf\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:1400\/0*iVLpXsc-Oz1URLBG. 
1400w\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:1400\/0*iVLpXsc-Oz1URLBG. 
1400w\" data-testid=\"og\"><\/picture><\/div>\n<\/div>\n<figcaption class=\"qk ql qm ow ox qn qo be b bf z dv\" data-selectable-paragraph=\"\">Source: <a class=\"af na\" href=\"https:\/\/eranraviv.com\/outliers-and-loss-functions\/\" target=\"_blank\" rel=\"noopener ugc nofollow\">https:\/\/eranraviv.com\/outliers-and-loss-functions\/<\/a><\/figcaption>\n<\/figure>\n<h2 id=\"5550\" class=\"of nd fo be ne og oh oi ni oj ok ol nm me om on oo mi op oq or mm os ot ou ov bj\" data-selectable-paragraph=\"\">Case Study Comparison<\/h2>\n<p id=\"6f88\" class=\"pw-post-body-paragraph lu lv fo be b lw oa ly lz ma ob mc md me oc mg mh mi od mk ml mm oe mo mp mq fh bj\" data-selectable-paragraph=\"\">For this comparison, I chose only four important predictors (Overall Quality, MSSubClass, Total Basement Area, Ground Living Area) out of a total of 80 predictors and tried to predict Sale Price using them. The idea is to see how outliers affect linear &amp; tree-based methods.<\/p>\n<figure class=\"mt mu mv mw mx ms ow ox paragraph-image\">\n<div class=\"qg qh eb qi bg qj\" tabindex=\"0\" role=\"button\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg my mz c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/0*W0bucej8NH4wUrRL.\" alt=\"\" width=\"700\" height=\"467\"><\/figure><div class=\"ow ox qf\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/0*W0bucej8NH4wUrRL. 
640w\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:1400\/0*W0bucej8NH4wUrRL. 
1400w\" data-testid=\"og\"><\/picture><\/div>\n<\/div>\n<\/figure>\n<h2 id=\"e9b3\" class=\"of nd fo be ne og oh oi ni oj ok ol nm me om on oo mi op oq or mm os ot ou ov bj\" data-selectable-paragraph=\"\">End Notes<\/h2>\n<ul class=\"\">\n<li id=\"93de\" class=\"lu lv fo be b lw oa ly lz ma ob mc md pc oc mg mh pd od mk ml pe oe mo mp mq qp qq qr bj\" data-selectable-paragraph=\"\">Since there are only about 1,400 observations in the dataset, the impact of outliers on a linear regression model is considerable, as we can see from the RMSE scores of \u201c<strong class=\"be nb\">With outliers<\/strong>\u201d (0.93) and \u201c<strong class=\"be nb\">Without outliers<\/strong>\u201d (0.18) \u2014 a significant drop.<\/li>\n<li id=\"9497\" class=\"lu lv fo be b lw qs ly lz ma qt mc md pc qu mg mh pd qv mk ml pe qw mo mp mq qp qq qr bj\" data-selectable-paragraph=\"\">For this dataset, the target variable is right-skewed, so the log-transformation works better than removing outliers. Hence, we should always try transforming the data before removing it. Winsorizing, however, is not as effective as outlier removal. 
This might be because the hard replacement introduces some inaccuracy into the data.<\/li>\n<li id=\"6aaa\" class=\"lu lv fo be b lw qs ly lz ma qt mc md pc qu mg mh pd qv mk ml pe qw mo mp mq qp qq qr bj\" data-selectable-paragraph=\"\">Random Forest is clearly robust to outliers: after removing them, the RMSE actually increased. This might be why changing the criterion from MSE to MAE did not help much (0.188 to 0.186). Even in this case, the log-transformation turned out to be the winner, owing to the skewed nature of the target variable. After transformation, the data become more uniform and the splits in the Random Forest improve.<\/li>\n<\/ul>\n<p id=\"8c06\" class=\"pw-post-body-paragraph lu lv fo be b lw lx ly lz ma mb mc md me mf mg mh mi mj mk ml mm mn mo mp mq fh bj\" data-selectable-paragraph=\"\">From the above results, we can conclude that transformation techniques generally work better than dropping for improving the predictive accuracy of both linear &amp; tree-based models. If you are using a linear regression model, it is very important to treat outliers by either dropping or transforming them.<\/p>\n<\/div>\n<\/div>\n<\/div>\n\n\n\n<div class=\"fh fi fj fk fl\">\n<div class=\"ab ca\">\n<div class=\"ch bg et eu ev ew\">\n<p id=\"bbde\" class=\"pw-post-body-paragraph lu lv fo be b lw lx ly lz ma mb mc md me mf mg mh mi mj mk ml mm mn mo mp mq fh bj\" data-selectable-paragraph=\"\">If I have missed any important techniques for outlier treatment, I would love to hear about them in the comments. Thank you for reading.<\/p>\n<p id=\"b80d\" class=\"pw-post-body-paragraph lu lv fo be b lw lx ly lz ma mb mc md me mf mg mh mi mj mk ml mm mn mo mp mq fh bj\" data-selectable-paragraph=\"\"><strong class=\"be nb\">About Me:<\/strong> Graduated with a Masters in Data Science at USF. 
Interested in working with cross-functional groups to derive insights from data, and in applying Machine Learning knowledge to solve complicated data science problems. <a class=\"af na\" href=\"https:\/\/alviraswalin.wixsite.com\/alvira\" target=\"_blank\" rel=\"noopener ugc nofollow\">https:\/\/alviraswalin.wixsite.com\/alvira<\/a><\/p>\n<p id=\"5159\" class=\"pw-post-body-paragraph lu lv fo be b lw lx ly lz ma mb mc md me mf mg mh mi mj mk ml mm mn mo mp mq fh bj\" data-selectable-paragraph=\"\">Check out my other blogs <a class=\"af na\" href=\"https:\/\/medium.com\/@aswalin\" rel=\"noopener\">here<\/a>!<\/p>\n<p id=\"f06e\" class=\"pw-post-body-paragraph lu lv fo be b lw lx ly lz ma mb mc md me mf mg mh mi mj mk ml mm mn mo mp mq fh bj\" data-selectable-paragraph=\"\"><strong class=\"be nb\">LinkedIn: <\/strong><a class=\"af na\" href=\"http:\/\/www.linkedin.com\/in\/alvira-swalin\" target=\"_blank\" rel=\"noopener ugc nofollow\"><strong class=\"be nb\">www.linkedin.com\/in\/alvira-swalin<\/strong><\/a><\/p>\n<h1 id=\"bf88\" class=\"nc nd fo be ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx ny nz bj\" data-selectable-paragraph=\"\">References:<\/h1>\n<ol class=\"\">\n<li id=\"16f3\" class=\"lu lv fo be b lw oa ly lz ma ob mc md pc oc mg mh pd od mk ml pe oe mo mp mq qz qq qr bj\" data-selectable-paragraph=\"\">The treatment methods were taught by <a class=\"af na\" href=\"https:\/\/www.usfca.edu\/faculty\/yannet-interian\" target=\"_blank\" rel=\"noopener ugc nofollow\">Yannet Interian at USF<\/a><\/li>\n<li id=\"31b3\" class=\"lu lv fo be b lw qs ly lz ma qt mc md pc qu mg mh pd qv mk ml pe qw mo mp mq qz qq qr bj\" data-selectable-paragraph=\"\"><a class=\"af na\" href=\"https:\/\/github.com\/aswalin\/Outlier-Impact-Treatment\" target=\"_blank\" rel=\"noopener ugc nofollow\">GitHub Repo for Codes<\/a><\/li>\n<li id=\"38d0\" class=\"lu lv fo be b lw qs ly lz ma qt mc md pc qu mg mh pd qv mk ml pe qw mo mp mq qz qq qr bj\" 
data-selectable-paragraph=\"\"><a class=\"af na\" href=\"https:\/\/www.kaggle.com\/c\/house-prices-advanced-regression-techniques\/data\" target=\"_blank\" rel=\"noopener ugc nofollow\">Data For House Price Analysis<\/a><\/li>\n<li id=\"2bc1\" class=\"lu lv fo be b lw qs ly lz ma qt mc md pc qu mg mh pd qv mk ml pe qw mo mp mq qz qq qr bj\" data-selectable-paragraph=\"\"><a class=\"af na\" href=\"https:\/\/newonlinecourses.science.psu.edu\/stat462\/node\/170\/\" target=\"_blank\" rel=\"noopener ugc nofollow\">Lesson on Distinction Between Outliers and High Leverage Observations<\/a><\/li>\n<li id=\"65c6\" class=\"lu lv fo be b lw qs ly lz ma qt mc md pc qu mg mh pd qv mk ml pe qw mo mp mq qz qq qr bj\" data-selectable-paragraph=\"\"><a class=\"af na\" href=\"https:\/\/www.datasciencecentral.com\/profiles\/blogs\/introduction-to-outlier-detection-methods\" target=\"_blank\" rel=\"noopener ugc nofollow\">Introduction to Outlier Detection Methods<\/a><\/li>\n<li id=\"ea0d\" class=\"lu lv fo be b lw qs ly lz ma qt mc md pc qu mg mh pd qv mk ml pe qw mo mp mq qz qq qr bj\" data-selectable-paragraph=\"\"><a class=\"af na\" href=\"https:\/\/www.analyticsvidhya.com\/blog\/2016\/01\/guide-data-exploration\/\" target=\"_blank\" rel=\"noopener ugc nofollow\">A Comprehensive Guide to Data Exploration<\/a><\/li>\n<li id=\"cf63\" class=\"lu lv fo be b lw qs ly lz ma qt mc md pc qu mg mh pd qv mk ml pe qw mo mp mq qz qq qr bj\" data-selectable-paragraph=\"\"><a class=\"af na\" href=\"https:\/\/www.statmethods.net\/stats\/rdiagnostics.html\" target=\"_blank\" rel=\"noopener ugc nofollow\">Cook\u2019s D Implementation in R<\/a><\/li>\n<\/ol>\n<p id=\"1b04\" class=\"pw-post-body-paragraph lu lv fo be b lw lx ly lz ma mb mc md me mf mg mh mi mj mk ml mm mn mo mp mq fh bj\" data-selectable-paragraph=\"\"><strong class=\"be nb\">Discuss this post on <\/strong><a class=\"af na\" href=\"https:\/\/news.ycombinator.com\/item?id=17197027\" target=\"_blank\" rel=\"noopener ugc 
nofollow\"><strong class=\"be nb\">Hacker News<\/strong><\/a><strong class=\"be nb\">.<\/strong><\/p>\n<\/div>\n<\/div>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>According to Wikipedia, an outlier is an observation point that is distant from other observations. This definition is vague because it doesn\u2019t quantify the word \u201cdistant\u201d. In this blog, we\u2019ll try to understand the different interpretations of this \u201cdistant\u201d notion. We will also look into the outlier detection and treatment techniques while seeing their impact [&hellip;]<\/p>\n","protected":false},"author":73,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"customer_name":"","customer_description":"","customer_industry":"","customer_technologies":"","customer_logo":"","footnotes":""},"categories":[6],"tags":[],"coauthors":[170],"class_list":["post-7128","post","type-post","status-publish","format-standard","hentry","category-machine-learning"],"yoast_head":"<!-- This site is optimized with the Yoast SEO Premium plugin v25.9 (Yoast SEO v25.9) - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>How to Make Your Machine Learning Models Robust to Outliers - Comet<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.comet.com\/site\/blog\/how-to-make-your-machine-learning-models-robust-to-outliers\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"How to Make Your Machine Learning Models Robust to Outliers\" \/>\n<meta property=\"og:description\" content=\"According to Wikipedia, an outlier is an observation point that is distant from other observations. This definition is vague because it doesn\u2019t quantify the word \u201cdistant\u201d. 
In this blog, we\u2019ll try to understand the different interpretations of this \u201cdistant\u201d notion. We will also look into the outlier detection and treatment techniques while seeing their impact [&hellip;]\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.comet.com\/site\/blog\/how-to-make-your-machine-learning-models-robust-to-outliers\/\" \/>\n<meta property=\"og:site_name\" content=\"Comet\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/cometdotml\" \/>\n<meta property=\"article:published_time\" content=\"2023-08-14T12:54:40+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2025-04-24T17:14:50+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/miro.medium.com\/v2\/resize:fit:1500\/1*Ocxi71mUXvhgo3rNg1XR4g.png\" \/>\n<meta name=\"author\" content=\"Alvira Swalin\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@Cometml\" \/>\n<meta name=\"twitter:site\" content=\"@Cometml\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Alvira Swalin\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"12 minutes\" \/>\n<!-- \/ Yoast SEO Premium plugin. -->"}