{"id":7617,"date":"2023-09-22T12:13:38","date_gmt":"2023-09-22T20:13:38","guid":{"rendered":"https:\/\/live-cometml.pantheonsite.io\/?p=7617"},"modified":"2025-04-24T17:13:52","modified_gmt":"2025-04-24T17:13:52","slug":"resampling-to-properly-handle-imbalanced-datasets-in-machine-learning","status":"publish","type":"post","link":"https:\/\/www.comet.com\/site\/blog\/resampling-to-properly-handle-imbalanced-datasets-in-machine-learning\/","title":{"rendered":"Resampling to Properly Handle Imbalanced Datasets in Machine Learning"},"content":{"rendered":"\n<link rel=\"canonical\" href=\"https:\/\/www.comet.com\/site\/blog\/resampling-to-properly-handle-imbalanced-datasets-in-machine-learning\">\n\n\n\n<div class=\"fh fi fj fk fl\">\n<div class=\"lu bg\">\n<figure class=\"lv lw lx ly lz lu bg paragraph-image\"><picture><img loading=\"lazy\" decoding=\"async\" class=\"bg ma mb c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:2560\/1*ME0iNAfWGHek4qaPkpJUEw.jpeg\" alt=\"\" width=\"2400\" height=\"1706\"><\/picture><figcaption class=\"mc md me mf mg mh mi be b bf z dv\" data-selectable-paragraph=\"\">Photo by <a class=\"af mj\" href=\"https:\/\/unsplash.com\/@martinsanchez?utm_source=medium&amp;utm_medium=referral\" target=\"_blank\" rel=\"noopener ugc nofollow\">Martin Sanchez<\/a> on <a class=\"af mj\" href=\"https:\/\/unsplash.com\/?utm_source=medium&amp;utm_medium=referral\" target=\"_blank\" rel=\"noopener ugc nofollow\">Unsplash<\/a><\/figcaption><\/figure>\n<\/div>\n<div class=\"ab ca\">\n<div class=\"ch bg et eu ev ew\">\n<p id=\"65b5\" class=\"pw-post-body-paragraph mk ml fo be b mm mn mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng fh bj\" data-selectable-paragraph=\"\">Often in machine learning, and specifically with classification problems, we encounter imbalanced datasets. 
This typically refers to an issue where the classes are not represented equally, which can cause huge problems for some algorithms.<\/p>\n<p id=\"33fd\" class=\"pw-post-body-paragraph mk ml fo be b mm mn mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng fh bj\" data-selectable-paragraph=\"\">In this article, we\u2019ll explore a technique called <strong class=\"be nh\">resampling<\/strong>, which is used to reduce this effect on our machine learning algorithms.<\/p>\n<blockquote class=\"ni nj nk\"><p id=\"9836\" class=\"mk ml nl be b mm mn mo mp mq mr ms mt nm mv mw mx nn mz na nb no nd ne nf ng fh bj\" data-selectable-paragraph=\"\">This article presumes that you know some machine learning concepts and are familiar with Python and its data science libraries.<\/p><\/blockquote>\n<h1 id=\"f7b1\" class=\"np nq fo be nr ns nt nu nv nw nx ny nz oa ob oc od oe of og oh oi oj ok ol om bj\" data-selectable-paragraph=\"\">What is an Imbalanced Dataset?<\/h1>\n<p id=\"a22c\" class=\"pw-post-body-paragraph mk ml fo be b mm on mo mp mq oo ms mt mu op mw mx my oq na nb nc or ne nf ng fh bj\" data-selectable-paragraph=\"\">The best way to learn something is through an example:<\/p>\n<p id=\"c690\" class=\"pw-post-body-paragraph mk ml fo be b mm mn mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng fh bj\" data-selectable-paragraph=\"\">Say you have a fraud detection binary classification problem (two classes \u2014 \u201cFraud\u201d or \u201cNot-Fraud\u201d) with 100 instances (rows). A total of 80 instances are labeled as <strong class=\"be nh\">Fraud <\/strong>and the remaining 20 instances are labeled as <strong class=\"be nh\">Not-Fraud. 
<\/strong>This is an imbalanced dataset, and the ratio of Fraud to Not-Fraud instances is 80:20, or 4:1.<\/p>\n<p id=\"933d\" class=\"pw-post-body-paragraph mk ml fo be b mm mn mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng fh bj\" data-selectable-paragraph=\"\">Most classification datasets don\u2019t have exactly equal numbers of records in each class, but a small difference doesn\u2019t matter as much.<\/p>\n<p id=\"94d6\" class=\"pw-post-body-paragraph mk ml fo be b mm mn mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng fh bj\" data-selectable-paragraph=\"\">This class imbalance problem can occur in binary classification problems as well as multi-class classification problems, but most techniques can be used on either.<\/p>\n<h1 id=\"ec46\" class=\"np nq fo be nr ns nt nu nv nw nx ny nz oa ob oc od oe of og oh oi oj ok ol om bj\" data-selectable-paragraph=\"\">Project setup<\/h1>\n<p id=\"9542\" class=\"pw-post-body-paragraph mk ml fo be b mm on mo mp mq oo ms mt mu op mw mx my oq na nb nc or ne nf ng fh bj\" data-selectable-paragraph=\"\">In addition to using the core Python libraries like NumPy, Pandas, and scikit-learn, we\u2019re going to use another great library called <a class=\"af mj\" href=\"http:\/\/contrib.scikit-learn.org\/imbalanced-learn\/stable\/\" target=\"_blank\" rel=\"noopener ugc nofollow\">imbalanced-learn<\/a>, which is a part of scikit-learn-contrib projects.<\/p>\n<p id=\"7f5f\" class=\"pw-post-body-paragraph mk ml fo be b mm mn mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng fh bj\" data-selectable-paragraph=\"\"><strong class=\"be nh\">imbalanced-learn<\/strong> provides more advanced methods to handle imbalanced datasets like SMOTE and Tomek Links.<\/p>\n<p id=\"c667\" class=\"pw-post-body-paragraph mk ml fo be b mm mn mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng fh bj\" data-selectable-paragraph=\"\">Here are the commands to install it via pip or conda:<\/p>\n<pre class=\"os ot ou ov ow ox oy oz 
">
pa ax pb bj\"><span id=\"200f\" class=\"pc nq fo oy b ho pd pe l ie pf\" data-selectable-paragraph=\"\"># using pip\npip install -U imbalanced-learn<\/span><span id=\"e87c\" class=\"pc nq fo oy b ho pg pe l ie pf\" data-selectable-paragraph=\"\"># using conda\nconda install -c conda-forge imbalanced-learn<\/span><\/pre>\n<h1 id=\"cd76\" class=\"np nq fo be nr ns nt nu nv nw nx ny nz oa ob oc od oe of og oh oi oj ok ol om bj\" data-selectable-paragraph=\"\">The Metric Problem<\/h1>\n<p id=\"1530\" class=\"pw-post-body-paragraph mk ml fo be b mm on mo mp mq oo ms mt mu op mw mx my oq na nb nc or ne nf ng fh bj\" data-selectable-paragraph=\"\">Most beginners struggle when dealing with imbalanced datasets for the first time. They tend to use accuracy as a metric to evaluate their machine learning models. This intuitively makes sense, as classification accuracy is often the first measure we use when evaluating such models.<\/p>\n<p id=\"bfdd\" class=\"pw-post-body-paragraph mk ml fo be b mm mn mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng fh bj\" data-selectable-paragraph=\"\">Nevertheless, accuracy can be misleading: because most of the algorithms we use are designed to maximize it, a classifier can simply \u201cpredict\u201d the most common class without performing any real feature analysis. It will still achieve a high accuracy rate, yet its predictions for the minority class will be worthless.<\/p>\n<h2 id=\"af38\" class=\"pc nq fo be nr ph pi pj nv pk pl pm nz mu pn po pp my pq pr ps nc pt pu pv pw bj\" data-selectable-paragraph=\"\">Example<\/h2>\n<p id=\"1867\" class=\"pw-post-body-paragraph mk ml fo be b mm on mo mp mq oo ms mt mu op mw mx my oq na nb nc or ne nf ng fh bj\" data-selectable-paragraph=\"\">Say we have 1000 emails as a dataset: 990 are spam emails and 10 aren\u2019t. 
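">
To see the illusion in numbers, here is a minimal plain-Python sketch of this example (the labels are invented to match the 990-to-10 split):

```python
# 990 spam emails ("spam") and 10 legitimate emails ("ham")
y_true = ["spam"] * 990 + ["ham"] * 10

# a "model" that always predicts the majority class
y_pred = ["spam"] * 1000

# overall accuracy: fraction of matching labels
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# recall on the minority class: fraction of "ham" emails actually caught
ham_recall = sum(t == p == "ham" for t, p in zip(y_true, y_pred)) / y_true.count("ham")

print(accuracy)    # 0.99 -- looks impressive at first glance
print(ham_recall)  # 0.0  -- not a single legitimate email is identified
```

Accuracy alone would rate this do-nothing model at 99%; a per-class metric such as recall exposes it immediately.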
If you build a simple model you\u2019ll get ~99% accuracy, which at first glance seems great.<\/p>\n<p id=\"bfb3\" class=\"pw-post-body-paragraph mk ml fo be b mm mn mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng fh bj\" data-selectable-paragraph=\"\">But the algorithm doesn\u2019t perform any learning \u2014 the accuracy here only reflects the underlying class distribution, because the model looks at the data and decides that the best thing to do is to always predict spam. As such, the model\u2019s success is just an illusion.<\/p>\n<p id=\"1668\" class=\"pw-post-body-paragraph mk ml fo be b mm mn mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng fh bj\" data-selectable-paragraph=\"\">This is why <strong class=\"be nh\">the choice of metrics used <\/strong>when working with imbalanced datasets is extremely important.<\/p>\n<h1 id=\"ffe4\" class=\"np nq fo be nr ns nt nu nv nw nx ny nz oa ob oc od oe of og oh oi oj ok ol om bj\" data-selectable-paragraph=\"\">Investigate your dataset<\/h1>\n<p id=\"e4d8\" class=\"pw-post-body-paragraph mk ml fo be b mm on mo mp mq oo ms mt mu op mw mx my oq na nb nc or ne nf ng fh bj\" data-selectable-paragraph=\"\">You should have an imbalanced dataset to apply the methods described here \u2014 you can get started with <a class=\"af mj\" href=\"https:\/\/www.kaggle.com\/mlg-ulb\/creditcardfraud\" target=\"_blank\" rel=\"noopener ugc nofollow\"><strong class=\"be nh\">this dataset<\/strong><\/a> from Kaggle.<\/p>\n<p id=\"7603\" class=\"pw-post-body-paragraph mk ml fo be b mm mn mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng fh bj\" data-selectable-paragraph=\"\">You can use <a class=\"af mj\" href=\"https:\/\/heartbeat.comet.ml\/seaborn-heatmaps-13-ways-to-customize-correlation-matrix-visualizations-f1c49c816f07\" target=\"_blank\" rel=\"noopener ugc nofollow\">Seaborn<\/a> to plot the count of each class to see if your dataset presents an imbalanced 
dataset problem like the following:<\/p>\n<pre># import the data science libraries.\nimport pandas as pd\nimport matplotlib.pyplot as plt\nimport seaborn as sns\n\n# read the dataset\ndata = pd.read_csv('training.csv')\n\n# print the count of each class from the target variable\nprint(data.FraudResult.value_counts())\n\n# plot the count of each class from the target variable\nsns.countplot(x=data.FraudResult)<\/pre>\n<p id=\"4211\" class=\"pw-post-body-paragraph mk ml fo be b mm mn mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng fh bj\" data-selectable-paragraph=\"\">Or you can use scikit-learn to compute the class weights and get the class ratios as follows:<\/p>\n<pre># import the function to compute the class weights\nfrom sklearn.utils.class_weight import compute_class_weight\nimport numpy as np\n\n# calculate the class weights by passing the 'balanced' strategy.\nclass_weight = compute_class_weight(class_weight='balanced', classes=np.unique(data.FraudResult), y=data.FraudResult)\n\n# print the result\nprint(class_weight)<\/pre>\n<\/div>\n<\/div>\n<\/div>\n\n\n\n<div class=\"fh fi fj fk fl\">\n<div class=\"ab ca\">\n<div class=\"ch bg et eu ev ew\">\n<blockquote class=\"qi\"><p id=\"e4c6\" class=\"qj qk fo be ql qm qn qo qp qq qr ng dv\" data-selectable-paragraph=\"\">Join more than 14,000 of your fellow machine learners and data scientists. 
<a class=\"af mj\" href=\"https:\/\/www.deeplearningweekly.com\/?utm_campaign=dlweekly-newsletter-peers2&amp;utm_source=heartbeat\" target=\"_blank\" rel=\"noopener ugc nofollow\">Subscribe to the premier newsletter for all things deep learning<\/a>.<\/p><\/blockquote>\n<\/div>\n<\/div>\n<\/div>\n\n\n\n<div class=\"fh fi fj fk fl\">\n<div class=\"ab ca\">\n<div class=\"ch bg et eu ev ew\">\n<h1 id=\"332c\" class=\"np nq fo be nr ns qs nu nv nw qt ny nz oa qu oc od oe qv og oh oi qw ok ol om bj\" data-selectable-paragraph=\"\">Resampling<\/h1>\n<p id=\"5518\" class=\"pw-post-body-paragraph mk ml fo be b mm on mo mp mq oo ms mt mu op mw mx my oq na nb nc or ne nf ng fh bj\" data-selectable-paragraph=\"\">There are multiple ways to handle the issue of imbalanced datasets. The techniques we\u2019re going to use in this tutorials is called resampling.<\/p>\n<p id=\"be74\" class=\"pw-post-body-paragraph mk ml fo be b mm mn mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng fh bj\" data-selectable-paragraph=\"\">Resampling is a widely-adopted technique for dealing with imbalanced datasets, and it is often very easy to implement, fast to run, and an excellent starting point.<\/p>\n<p id=\"e933\" class=\"pw-post-body-paragraph mk ml fo be b mm mn mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng fh bj\" data-selectable-paragraph=\"\">Resampling changes the dataset into a more balanced one by adding instances to the minority class or deleting ones from the majority class, that way we build better machine learning models.<\/p>\n<p id=\"112e\" class=\"pw-post-body-paragraph mk ml fo be b mm mn mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng fh bj\" data-selectable-paragraph=\"\">The way to introduce these changes in a given dataset is achieved via two main methods: <strong class=\"be nh\">Oversampling<\/strong> and <strong class=\"be nh\">Undersampling<\/strong>.<\/p>\n<ul class=\"\">\n<li id=\"bfdb\" class=\"mk ml fo be b mm mn mo mp mq mr ms mt mu qx mw 
mx my qy na nb nc qz ne nf ng ra rb rc bj\" data-selectable-paragraph=\"\"><strong class=\"be nh\">Oversampling: <\/strong>This method adds copies of instances from the under-represented class (minority class) to obtain a balanced dataset. There are multiple ways you can oversample a dataset, like random oversampling. We\u2019ll cover some of these methods in this article.<\/li>\n<\/ul>\n<figure class=\"os ot ou ov ow lu mf mg paragraph-image\">\n<div class=\"re rf eb rg bg rh\" tabindex=\"0\" role=\"button\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg ma mb c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/1*o_KfyMzF7LITK2DlYm_wHw.png\" alt=\"\" width=\"700\" height=\"504\"><\/figure><div class=\"mf mg rd\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/format:webp\/1*o_KfyMzF7LITK2DlYm_wHw.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/format:webp\/1*o_KfyMzF7LITK2DlYm_wHw.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/format:webp\/1*o_KfyMzF7LITK2DlYm_wHw.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/format:webp\/1*o_KfyMzF7LITK2DlYm_wHw.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/format:webp\/1*o_KfyMzF7LITK2DlYm_wHw.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/format:webp\/1*o_KfyMzF7LITK2DlYm_wHw.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/format:webp\/1*o_KfyMzF7LITK2DlYm_wHw.png 1400w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\"><source 
srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/1*o_KfyMzF7LITK2DlYm_wHw.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/1*o_KfyMzF7LITK2DlYm_wHw.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/1*o_KfyMzF7LITK2DlYm_wHw.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/1*o_KfyMzF7LITK2DlYm_wHw.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/1*o_KfyMzF7LITK2DlYm_wHw.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/1*o_KfyMzF7LITK2DlYm_wHw.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/1*o_KfyMzF7LITK2DlYm_wHw.png 1400w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\" data-testid=\"og\"><\/picture><\/div>\n<\/div>\n<figcaption class=\"mc md me mf mg mh mi be b bf z dv\" data-selectable-paragraph=\"\"><strong class=\"be nh\">Oversampling Method<\/strong><\/figcaption>\n<\/figure>\n<ul class=\"\">\n<li id=\"e641\" class=\"mk ml fo be b mm mn mo mp mq mr ms mt mu qx mw mx my qy na nb nc qz ne nf ng ra rb rc bj\" data-selectable-paragraph=\"\"><strong class=\"be nh\">Undersampling methods: <\/strong>These methods simply delete instances from the over-represented class (majority class) in different ways. 
The most obvious way is to delete instances randomly.<\/li>\n<\/ul>\n<figure class=\"os ot ou ov ow lu mf mg paragraph-image\">\n<div class=\"re rf eb rg bg rh\" tabindex=\"0\" role=\"button\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg ma mb c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/1*gHW_PLz7kWrhdl5t1sJRRA.png\" alt=\"\" width=\"700\" height=\"504\"><\/figure><div class=\"mf mg rd\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/format:webp\/1*gHW_PLz7kWrhdl5t1sJRRA.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/format:webp\/1*gHW_PLz7kWrhdl5t1sJRRA.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/format:webp\/1*gHW_PLz7kWrhdl5t1sJRRA.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/format:webp\/1*gHW_PLz7kWrhdl5t1sJRRA.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/format:webp\/1*gHW_PLz7kWrhdl5t1sJRRA.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/format:webp\/1*gHW_PLz7kWrhdl5t1sJRRA.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/format:webp\/1*gHW_PLz7kWrhdl5t1sJRRA.png 1400w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/1*gHW_PLz7kWrhdl5t1sJRRA.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/1*gHW_PLz7kWrhdl5t1sJRRA.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/1*gHW_PLz7kWrhdl5t1sJRRA.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/1*gHW_PLz7kWrhdl5t1sJRRA.png 
786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/1*gHW_PLz7kWrhdl5t1sJRRA.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/1*gHW_PLz7kWrhdl5t1sJRRA.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/1*gHW_PLz7kWrhdl5t1sJRRA.png 1400w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\" data-testid=\"og\"><\/picture><\/div>\n<\/div>\n<figcaption class=\"mc md me mf mg mh mi be b bf z dv\" data-selectable-paragraph=\"\">Undersampling Method<\/figcaption>\n<\/figure>\n<h2 id=\"646c\" class=\"pc nq fo be nr ph pi pj nv pk pl pm nz mu pn po pp my pq pr ps nc pt pu pv pw bj\" data-selectable-paragraph=\"\">Disadvantages<\/h2>\n<p id=\"d8d5\" class=\"pw-post-body-paragraph mk ml fo be b mm on mo mp mq oo ms mt mu op mw mx my oq na nb nc or ne nf ng fh bj\" data-selectable-paragraph=\"\">Notwithstanding the advantage of balancing classes, these techniques also have some drawbacks:<\/p>\n<ul class=\"\">\n<li id=\"6d6e\" class=\"mk ml fo be b mm mn mo mp mq mr ms mt mu qx mw mx my qy na nb nc qz ne nf ng ra rb rc bj\" data-selectable-paragraph=\"\">If you duplicate random records from the minority class to do oversampling, this will cause overfitting.<\/li>\n<li id=\"bf3c\" class=\"mk ml fo be b mm ri mo mp mq rj ms mt mu rk mw mx my rl na nb nc rm ne nf ng ra rb rc bj\" data-selectable-paragraph=\"\">By undersampling and removing random records from the majority class, you risk losing some important information for the machine learning algorithm to use while training and predicting.<\/li>\n<\/ul>\n<p id=\"1ba8\" 
class=\"pw-post-body-paragraph mk ml fo be b mm mn mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng fh bj\" data-selectable-paragraph=\"\">We\u2019ll now show the underlying techniques in each method, along with some code snippets.<\/p>\n<\/div>\n<\/div>\n<\/div>\n\n\n\n<div class=\"fh fi fj fk fl\">\n<div class=\"ab ca\">\n<div class=\"ch bg et eu ev ew\">\n<h1 id=\"6f8b\" class=\"np nq fo be nr ns qs nu nv nw qt ny nz oa qu oc od oe qv og oh oi qw ok ol om bj\" data-selectable-paragraph=\"\">Undersampling<\/h1>\n<h2 id=\"1c32\" class=\"pc nq fo be nr ph pi pj nv pk pl pm nz mu pn po pp my pq pr ps nc pt pu pv pw bj\" data-selectable-paragraph=\"\">Random Undersampling<\/h2>\n<p id=\"8929\" class=\"pw-post-body-paragraph mk ml fo be b mm on mo mp mq oo ms mt mu op mw mx my oq na nb nc or ne nf ng fh bj\" data-selectable-paragraph=\"\">Random undersampling randomly deletes records from the majority class. You should consider trying this technique when you have a lot of data.<\/p>\n<p id=\"e6a3\" class=\"pw-post-body-paragraph mk ml fo be b mm mn mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng fh bj\" data-selectable-paragraph=\"\">A simple undersampling technique is to undersample the majority class randomly and uniformly. This can potentially lead to information loss, though. 
But if the examples of the majority class lie close to one another in the feature space, this method might yield good results.<\/p>\n<p id=\"e454\" class=\"pw-post-body-paragraph mk ml fo be b mm mn mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng fh bj\" data-selectable-paragraph=\"\">Here is a code snippet:<\/p>\n<pre># import the Random Under Sampler object.\nfrom imblearn.under_sampling import RandomUnderSampler\n\n# create the object.\nunder_sampler = RandomUnderSampler()\n\n# fit the object to the training data and resample it.\nx_train_under, y_train_under = under_sampler.fit_resample(x_train, y_train)<\/pre>\n<p id=\"4ddf\" class=\"pw-post-body-paragraph mk ml fo be b mm mn mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng fh bj\" data-selectable-paragraph=\"\">Here\u2019s the set of parameters you can specify to the <code class=\"cw rn ro rp oy b\">RandomUnderSampler<\/code> object (the same applies to the other objects from the imblearn library):<\/p>\n<ul class=\"\">\n<li id=\"456c\" class=\"mk ml fo be b mm mn mo mp mq mr ms mt mu qx mw mx my qy na nb nc qz ne nf ng ra rb rc bj\" data-selectable-paragraph=\"\"><code class=\"cw rn ro rp oy b\">sampling_strategy<\/code> : This parameter tells the object how to perform the undersampling. It can be <code class=\"cw rn ro rp oy b\">majority<\/code> to resample only the majority class, <code class=\"cw rn ro rp oy b\">not minority<\/code> to resample all classes but the minority class, or <code class=\"cw rn ro rp oy b\">auto<\/code>, the default, which is equivalent to <code class=\"cw rn ro rp oy b\">not minority<\/code>. 
You can check out the documentation (included below in \u201cResources\u201d) to learn more.<\/li>\n<li id=\"452d\" class=\"mk ml fo be b mm ri mo mp mq rj ms mt mu rk mw mx my rl na nb nc rm ne nf ng ra rb rc bj\" data-selectable-paragraph=\"\"><code class=\"cw rn ro rp oy b\">return_indices<\/code> : A Boolean indicating whether to also return the indices of the selected samples (deprecated in recent versions of imbalanced-learn \u2014 use the <code class=\"cw rn ro rp oy b\">sample_indices_<\/code> attribute instead).<\/li>\n<li id=\"2b2c\" class=\"mk ml fo be b mm ri mo mp mq rj ms mt mu rk mw mx my rl na nb nc rm ne nf ng ra rb rc bj\" data-selectable-paragraph=\"\"><code class=\"cw rn ro rp oy b\">random_state<\/code> : An integer that controls the randomness of the procedure, allowing you to reproduce the results.<\/li>\n<\/ul>\n<h2 id=\"7662\" class=\"pc nq fo be nr ph pi pj nv pk pl pm nz mu pn po pp my pq pr ps nc pt pu pv pw bj\" data-selectable-paragraph=\"\">NearMiss Undersampling<\/h2>\n<p id=\"770d\" class=\"pw-post-body-paragraph mk ml fo be b mm on mo mp mq oo ms mt mu op mw mx my oq na nb nc or ne nf ng fh bj\" data-selectable-paragraph=\"\">The NearMiss algorithm has been proposed to solve the issue of potential information loss. 
It\u2019s based on the nearest neighbor algorithm and has several variations, which we\u2019ll see in this section.<\/p>\n<p id=\"0afa\" class=\"pw-post-body-paragraph mk ml fo be b mm mn mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng fh bj\" data-selectable-paragraph=\"\">The basics of the NearMiss algorithms include the following:<\/p>\n<ol class=\"\">\n<li id=\"7100\" class=\"mk ml fo be b mm mn mo mp mq mr ms mt mu qx mw mx my qy na nb nc qz ne nf ng rq rb rc bj\" data-selectable-paragraph=\"\">The method starts by calculating the distances between all instances of the majority class and the instances of the minority class.<\/li>\n<li id=\"1b5a\" class=\"mk ml fo be b mm ri mo mp mq rj ms mt mu rk mw mx my rl na nb nc rm ne nf ng rq rb rc bj\" data-selectable-paragraph=\"\">Then <strong class=\"be nh\">k<\/strong> instances of the majority class that have the <strong class=\"be nh\">smallest <\/strong>distances to those in the minority class are selected to be retained.<\/li>\n<li id=\"6e45\" class=\"mk ml fo be b mm ri mo mp mq rj ms mt mu rk mw mx my rl na nb nc rm ne nf ng rq rb rc bj\" data-selectable-paragraph=\"\">If there are <strong class=\"be nh\">n<\/strong> instances in the minority class, NearMiss will result in <strong class=\"be nh\">k \u00d7 n<\/strong> instances of the majority class.<\/li>\n<\/ol>\n<figure class=\"os ot ou ov ow lu mf mg paragraph-image\">\n<div class=\"re rf eb rg bg rh\" tabindex=\"0\" role=\"button\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg ma mb c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/1*8WM0gsh_naPEa9HTpE2c1A.png\" alt=\"\" width=\"700\" height=\"305\"><\/figure><div class=\"mf mg rr\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/format:webp\/1*8WM0gsh_naPEa9HTpE2c1A.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/format:webp\/1*8WM0gsh_naPEa9HTpE2c1A.png 720w, 
https:\/\/miro.medium.com\/v2\/resize:fit:750\/format:webp\/1*8WM0gsh_naPEa9HTpE2c1A.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/format:webp\/1*8WM0gsh_naPEa9HTpE2c1A.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/format:webp\/1*8WM0gsh_naPEa9HTpE2c1A.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/format:webp\/1*8WM0gsh_naPEa9HTpE2c1A.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/format:webp\/1*8WM0gsh_naPEa9HTpE2c1A.png 1400w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/1*8WM0gsh_naPEa9HTpE2c1A.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/1*8WM0gsh_naPEa9HTpE2c1A.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/1*8WM0gsh_naPEa9HTpE2c1A.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/1*8WM0gsh_naPEa9HTpE2c1A.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/1*8WM0gsh_naPEa9HTpE2c1A.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/1*8WM0gsh_naPEa9HTpE2c1A.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/1*8WM0gsh_naPEa9HTpE2c1A.png 1400w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 
700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\" data-testid=\"og\"><\/picture><\/div>\n<\/div>\n<figcaption class=\"mc md me mf mg mh mi be b bf z dv\" data-selectable-paragraph=\"\">NearMiss Algorithm<\/figcaption>\n<\/figure>\n<p id=\"30f1\" class=\"pw-post-body-paragraph mk ml fo be b mm mn mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng fh bj\" data-selectable-paragraph=\"\">Here are the different versions of this algorithm:<\/p>\n<ul class=\"\">\n<li id=\"8431\" class=\"mk ml fo be b mm mn mo mp mq mr ms mt mu qx mw mx my qy na nb nc qz ne nf ng ra rb rc bj\" data-selectable-paragraph=\"\"><strong class=\"be nh\">NearMiss-1<\/strong> chooses instances of the majority class where their average distances to the three closest instances of the minority class are the <strong class=\"be nh\">smallest<\/strong>.<\/li>\n<li id=\"a2a5\" class=\"mk ml fo be b mm ri mo mp mq rj ms mt mu rk mw mx my rl na nb nc rm ne nf ng ra rb rc bj\" data-selectable-paragraph=\"\"><strong class=\"be nh\">NearMiss-2<\/strong> chooses instances of the majority class whose average distances to the three <strong class=\"be nh\">farthest<\/strong> samples of the minority class are the smallest.<\/li>\n<li id=\"5b31\" class=\"mk ml fo be b mm ri mo mp mq rj ms mt mu rk mw mx my rl na nb nc rm ne nf ng ra rb rc bj\" data-selectable-paragraph=\"\"><strong class=\"be nh\">NearMiss-3 <\/strong>picks a given number of the closest samples of the majority class for each sample of the minority class.<\/li>\n<\/ul>\n<p id=\"ff0f\" class=\"pw-post-body-paragraph mk ml fo be b mm mn mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng fh bj\" data-selectable-paragraph=\"\">Here is a code snippet:<\/p>\n<pre># import the NearMiss object.\nfrom imblearn.under_sampling import NearMiss\n\n# create the object; 'not minority' resamples all classes but the minority class.\nnear = NearMiss(sampling_strategy=\"not minority\")\n\n# fit the object to the training data and resample it.\nx_train_near, y_train_near = near.fit_resample(x_train, y_train)<\/pre>\n<p id=\"4023\" class=\"pw-post-body-paragraph mk ml fo be b mm mn mo mp mq mr ms mt 
mu mv mw mx my mz na nb nc nd ne nf ng fh bj\" data-selectable-paragraph=\"\">You can tune also the following parameters:<\/p>\n<ul class=\"\">\n<li id=\"6e6a\" class=\"mk ml fo be b mm mn mo mp mq mr ms mt mu qx mw mx my qy na nb nc qz ne nf ng ra rb rc bj\" data-selectable-paragraph=\"\"><code class=\"cw rn ro rp oy b\">version<\/code> : the version of the near-miss algorithm, which can be 3,1, or 2.<\/li>\n<li id=\"5451\" class=\"mk ml fo be b mm ri mo mp mq rj ms mt mu rk mw mx my rl na nb nc rm ne nf ng ra rb rc bj\" data-selectable-paragraph=\"\"><code class=\"cw rn ro rp oy b\">n_neighbors<\/code> : the number of neighbors to consider to compute the average distance\u2014three is the default.<\/li>\n<\/ul>\n<h2 id=\"2344\" class=\"pc nq fo be nr ph pi pj nv pk pl pm nz mu pn po pp my pq pr ps nc pt pu pv pw bj\" data-selectable-paragraph=\"\">Undersampling with Tomek links<\/h2>\n<p id=\"3e18\" class=\"pw-post-body-paragraph mk ml fo be b mm on mo mp mq oo ms mt mu op mw mx my oq na nb nc or ne nf ng fh bj\" data-selectable-paragraph=\"\">Tomek links are pairs of very close instances that belong to different classes. They\u2019re samples near the borderline between classes. 
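To make the definition concrete, here is a rough hand-rolled check on a toy one-dimensional dataset (the points and labels are invented for illustration; on real data, imbalanced-learn finds the links for you):

```python
# toy 1-D dataset: (feature value, class label)
points = [(0.0, "A"), (1.0, "A"), (1.9, "A"), (2.0, "B"), (3.5, "B")]

def nearest(i):
    """Index of the nearest other point, by absolute distance."""
    return min((j for j in range(len(points)) if j != i),
               key=lambda j: abs(points[i][0] - points[j][0]))

# a Tomek link: two points of different classes that are each other's nearest neighbors
tomek_links = [(i, j)
               for i in range(len(points))
               for j in range(i + 1, len(points))
               if nearest(i) == j and nearest(j) == i and points[i][1] != points[j][1]]

print(tomek_links)  # [(2, 3)] -- the "A" at 1.9 and the "B" at 2.0 form a link
```

Only the pair straddling the class border qualifies, which is exactly why removing its majority-class member cleans up the boundary.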
By removing the example of the <strong class=\"be nh\">majority<\/strong> class in each pair, we increase the space between the two classes and move toward a more balanced dataset.<\/p>\n<figure class=\"os ot ou ov ow lu mf mg paragraph-image\">\n<div class=\"re rf eb rg bg rh\" tabindex=\"0\" role=\"button\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg ma mb c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/1*pR35KsLpz7-_zvbvdm0frg.png\" alt=\"\" width=\"700\" height=\"306\"><\/figure><div class=\"mf mg rs\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/format:webp\/1*pR35KsLpz7-_zvbvdm0frg.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/format:webp\/1*pR35KsLpz7-_zvbvdm0frg.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/format:webp\/1*pR35KsLpz7-_zvbvdm0frg.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/format:webp\/1*pR35KsLpz7-_zvbvdm0frg.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/format:webp\/1*pR35KsLpz7-_zvbvdm0frg.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/format:webp\/1*pR35KsLpz7-_zvbvdm0frg.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/format:webp\/1*pR35KsLpz7-_zvbvdm0frg.png 1400w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/1*pR35KsLpz7-_zvbvdm0frg.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/1*pR35KsLpz7-_zvbvdm0frg.png 720w, 
https:\/\/miro.medium.com\/v2\/resize:fit:750\/1*pR35KsLpz7-_zvbvdm0frg.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/1*pR35KsLpz7-_zvbvdm0frg.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/1*pR35KsLpz7-_zvbvdm0frg.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/1*pR35KsLpz7-_zvbvdm0frg.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/1*pR35KsLpz7-_zvbvdm0frg.png 1400w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\" data-testid=\"og\"><\/picture><\/div>\n<\/div>\n<figcaption class=\"mc md me mf mg mh mi be b bf z dv\" data-selectable-paragraph=\"\">TomekLinks Algorithm<\/figcaption>\n<\/figure>\n<p id=\"edca\" class=\"pw-post-body-paragraph mk ml fo be b mm mn mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng fh bj\" data-selectable-paragraph=\"\">Here is a code snippet to resample the majority class:<\/p>\n<pre># import the TomekLinks object.\nfrom imblearn.under_sampling import TomekLinks\n\n# instantiate the object with the right sampling strategy.\ntomek_links = TomekLinks(sampling_strategy='majority')\n\n# fit and resample the training data.\nx_train_tl, y_train_tl = tomek_links.fit_resample(x_train, y_train)<\/pre>\n<h2 id=\"7e4f\" class=\"pc nq fo be nr ph pi pj nv pk pl pm nz mu pn po pp my pq pr ps nc pt pu pv pw bj\" data-selectable-paragraph=\"\">Undersampling with Cluster Centroids<\/h2>\n<p id=\"a1be\" class=\"pw-post-body-paragraph mk ml fo be b mm on mo mp mq oo ms mt mu op mw mx my oq na nb nc or ne nf ng fh bj\" data-selectable-paragraph=\"\">The idea here is 
basically to remove unimportant instances from the majority class. To decide whether an instance is important, we use the concept of <strong class=\"be nh\">clustering<\/strong> on the geometry of the feature space.<\/p>\n<p id=\"97e4\" class=\"pw-post-body-paragraph mk ml fo be b mm mn mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng fh bj\" data-selectable-paragraph=\"\">Clustering is an unsupervised learning approach in which data points are grouped into clusters according to their proximity in the feature space.<\/p>\n<p id=\"b2ad\" class=\"pw-post-body-paragraph mk ml fo be b mm mn mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng fh bj\" data-selectable-paragraph=\"\">We will use it only to find the cluster <strong class=\"be nh\">centroids<\/strong>, which are obtained by averaging the feature vectors of all the data points in a cluster.<\/p>\n<p id=\"a8b9\" class=\"pw-post-body-paragraph mk ml fo be b mm mn mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng fh bj\" data-selectable-paragraph=\"\">After finding the cluster centroid of the majority class, we decide the following:<\/p>\n<ul class=\"\">\n<li id=\"0c18\" class=\"mk ml fo be b mm mn mo mp mq mr ms mt mu qx mw mx my qy na nb nc qz ne nf ng ra rb rc bj\" data-selectable-paragraph=\"\">The instance belonging to the majority-class cluster that is <strong class=\"be nh\">farthest from the cluster centroid in the feature space<\/strong> is considered the least important instance.<\/li>\n<li id=\"83e3\" class=\"mk ml fo be b mm ri mo mp mq rj ms mt mu rk mw mx my rl na nb nc rm ne nf ng ra rb rc bj\" data-selectable-paragraph=\"\">The instance belonging to the majority class that is <strong class=\"be nh\">nearest to the cluster centroid in the feature space<\/strong> is considered the most important instance.<\/li>\n<\/ul>\n<figure class=\"os ot ou ov ow lu mf mg paragraph-image\">\n<div class=\"re rf eb rg bg rh\" tabindex=\"0\" role=\"button\">\n<figure><img loading=\"lazy\" 
decoding=\"async\" class=\"bg ma mb c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/1*1XlHmnc9hKn1oPz48lrn7Q.png\" alt=\"\" width=\"700\" height=\"257\"><\/figure><div class=\"mf mg rt\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/format:webp\/1*1XlHmnc9hKn1oPz48lrn7Q.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/format:webp\/1*1XlHmnc9hKn1oPz48lrn7Q.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/format:webp\/1*1XlHmnc9hKn1oPz48lrn7Q.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/format:webp\/1*1XlHmnc9hKn1oPz48lrn7Q.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/format:webp\/1*1XlHmnc9hKn1oPz48lrn7Q.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/format:webp\/1*1XlHmnc9hKn1oPz48lrn7Q.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/format:webp\/1*1XlHmnc9hKn1oPz48lrn7Q.png 1400w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/1*1XlHmnc9hKn1oPz48lrn7Q.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/1*1XlHmnc9hKn1oPz48lrn7Q.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/1*1XlHmnc9hKn1oPz48lrn7Q.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/1*1XlHmnc9hKn1oPz48lrn7Q.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/1*1XlHmnc9hKn1oPz48lrn7Q.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/1*1XlHmnc9hKn1oPz48lrn7Q.png 1100w, 
https:\/\/miro.medium.com\/v2\/resize:fit:1400\/1*1XlHmnc9hKn1oPz48lrn7Q.png 1400w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\" data-testid=\"og\"><\/picture><\/div>\n<\/div>\n<figcaption class=\"mc md me mf mg mh mi be b bf z dv\" data-selectable-paragraph=\"\">Cluster Centroids Algorithm<\/figcaption>\n<\/figure>\n<p id=\"3542\" class=\"pw-post-body-paragraph mk ml fo be b mm mn mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng fh bj\" data-selectable-paragraph=\"\">Here is a code snippet for using cluster centroids:<\/p>\n<pre># import the ClusterCentroids object.\nfrom imblearn.under_sampling import ClusterCentroids\n\n# instantiate the object with the right sampling strategy.\ncluster_centroids = ClusterCentroids(sampling_strategy=\"auto\")\n\n# fit and resample the training data.\nx_train_cc, y_train_cc = cluster_centroids.fit_resample(x_train, y_train)<\/pre>\n<p id=\"1a88\" class=\"pw-post-body-paragraph mk ml fo be b mm mn mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng fh bj\" data-selectable-paragraph=\"\">Besides the previous parameter, here\u2019s another one you can tune to get better results:<\/p>\n<ul class=\"\">\n<li id=\"eb37\" class=\"mk ml fo be b mm mn mo mp mq mr ms mt mu qx mw mx my qy na nb nc qz ne nf ng ra rb rc bj\" data-selectable-paragraph=\"\"><code class=\"cw rn ro rp oy b\">estimator<\/code>: an object that performs the clustering process for this method\u2014K-Means is the default here.<\/li>\n<\/ul>\n<h2 id=\"444a\" class=\"pc nq fo be nr ph pi pj nv pk pl pm nz mu pn po pp my pq pr ps 
nc pt pu pv pw bj\" data-selectable-paragraph=\"\">Undersampling with Edited Nearest Neighbor Rule<\/h2>\n<p id=\"40a9\" class=\"pw-post-body-paragraph mk ml fo be b mm on mo mp mq oo ms mt mu op mw mx my oq na nb nc or ne nf ng fh bj\" data-selectable-paragraph=\"\">The Edited Nearest Neighbor Rule (or ENN) was proposed in 1972 to remove instances from the majority class (undersampling).<\/p>\n<p id=\"4505\" class=\"pw-post-body-paragraph mk ml fo be b mm mn mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng fh bj\" data-selectable-paragraph=\"\">This technique removes any instance from the majority class whose class label differs from the class label of at least two of its three nearest neighbors. In general terms, the removed instances lie near or around the borderline between the classes.<\/p>\n<p id=\"e693\" class=\"pw-post-body-paragraph mk ml fo be b mm mn mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng fh bj\" data-selectable-paragraph=\"\">The point here is to increase the classification accuracy of minority instances rather than majority instances.<\/p>\n<p id=\"09fd\" class=\"pw-post-body-paragraph mk ml fo be b mm mn mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng fh bj\" data-selectable-paragraph=\"\">Here is a sample code snippet:<\/p>\n<pre># import the EditedNearestNeighbours object.\nfrom imblearn.under_sampling import EditedNearestNeighbours\n\n# create the object to resample the majority class.\nenn = EditedNearestNeighbours(sampling_strategy=\"majority\")\n\n# fit and resample the training data.\nx_train_enn, y_train_enn = enn.fit_resample(x_train, y_train)<\/pre>\n<h2 id=\"8b1f\" class=\"pc nq fo be nr ph pi pj nv pk pl pm nz mu pn po pp my pq pr ps nc pt pu pv pw bj\" data-selectable-paragraph=\"\">Undersampling with Neighborhood Cleaning Rule<\/h2>\n<p id=\"7540\" class=\"pw-post-body-paragraph mk ml fo be b mm on mo mp mq oo ms mt mu op mw mx my oq na nb nc or ne nf ng fh bj\" data-selectable-paragraph=\"\">Neighborhood Cleaning 
Rule (or NCR) deals with the majority and minority samples separately when resampling the dataset.<\/p>\n<p id=\"f9d9\" class=\"pw-post-body-paragraph mk ml fo be b mm mn mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng fh bj\" data-selectable-paragraph=\"\">NCR starts by calculating the three nearest neighbors for all instances in the training set. We then do the following:<\/p>\n<ul class=\"\">\n<li id=\"005b\" class=\"mk ml fo be b mm mn mo mp mq mr ms mt mu qx mw mx my qy na nb nc qz ne nf ng ra rb rc bj\" data-selectable-paragraph=\"\">If the instance belongs to the <strong class=\"be nh\">majority<\/strong> class and its three nearest neighbors classify it as the opposite class, the instance is <strong class=\"be nh\">removed<\/strong>.<\/li>\n<li id=\"d973\" class=\"mk ml fo be b mm ri mo mp mq rj ms mt mu rk mw mx my rl na nb nc rm ne nf ng ra rb rc bj\" data-selectable-paragraph=\"\">If the instance belongs to the <strong class=\"be nh\">minority<\/strong> class and it\u2019s misclassified by its three nearest neighbors, the <strong class=\"be nh\">nearest neighbors<\/strong> that belong to the majority class are <strong class=\"be nh\">removed<\/strong>.<\/li>\n<\/ul>\n<p id=\"eaae\" class=\"pw-post-body-paragraph mk ml fo be b mm mn mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng fh bj\" data-selectable-paragraph=\"\">Here is a sample code snippet:<\/p>\n<pre># import the NeighbourhoodCleaningRule object.\nfrom imblearn.under_sampling import NeighbourhoodCleaningRule\n\n# create the object to resample the majority class.\nncr = NeighbourhoodCleaningRule(sampling_strategy=\"majority\")\n\n# fit and resample the training data.\nx_train_ncr, y_train_ncr = ncr.fit_resample(x_train, y_train)<\/pre>\n<p id=\"eb9f\" class=\"pw-post-body-paragraph mk ml fo be b mm mn mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng fh bj\" data-selectable-paragraph=\"\">An 
important parameter here is <code class=\"cw rn ro rp oy b\">threshold_cleaning<\/code>, a float used after applying ENN that tells the algorithm whether a given class should be considered during the cleaning step.<\/p>\n<\/div>\n<\/div>\n<\/div>\n\n\n\n<div class=\"fh fi fj fk fl\">\n<div class=\"ab ca\">\n<div class=\"ch bg et eu ev ew\">\n<h1 id=\"0a01\" class=\"np nq fo be nr ns qs nu nv nw qt ny nz oa qu oc od oe qv og oh oi qw ok ol om bj\" data-selectable-paragraph=\"\">Oversampling<\/h1>\n<h2 id=\"0639\" class=\"pc nq fo be nr ph pi pj nv pk pl pm nz mu pn po pp my pq pr ps nc pt pu pv pw bj\" data-selectable-paragraph=\"\">Random Oversampling<\/h2>\n<p id=\"b5d6\" class=\"pw-post-body-paragraph mk ml fo be b mm on mo mp mq oo ms mt mu op mw mx my oq na nb nc or ne nf ng fh bj\" data-selectable-paragraph=\"\">Random oversampling randomly duplicates records from the minority class. Try this technique when you don\u2019t have a lot of data.<\/p>\n<p id=\"8481\" class=\"pw-post-body-paragraph mk ml fo be b mm mn mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng fh bj\" data-selectable-paragraph=\"\">Random oversampling simply replicates random minority class examples. 
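The replication idea can be sketched without any library (a toy example with made-up data, not the imblearn implementation): draw minority indices with replacement until both classes have the same count.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.arange(10, dtype=float).reshape(-1, 1)
y = np.array([0] * 8 + [1] * 2)  # 8 majority vs. 2 minority samples

minority = np.flatnonzero(y == 1)
majority = np.flatnonzero(y == 0)

# duplicate random minority rows (with replacement) to close the gap
extra = rng.choice(minority, size=len(majority) - len(minority), replace=True)
keep = np.concatenate([np.arange(len(y)), extra])
X_res, y_res = X[keep], y[keep]

print(np.bincount(y_res))  # both classes now have 8 samples
```

Because the extra rows are exact copies of existing minority samples, the model sees the same points repeatedly, which is where the overfitting risk comes from.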
It\u2019s known to increase the likelihood of overfitting, which is a major drawback.<\/p>\n<p id=\"79ce\" class=\"pw-post-body-paragraph mk ml fo be b mm mn mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng fh bj\" data-selectable-paragraph=\"\">Here is a sample code snippet:<\/p>\n<pre># import the RandomOverSampler object.\nfrom imblearn.over_sampling import RandomOverSampler\n\n# create the object.\nover_sampler = RandomOverSampler()\n\n# fit and resample the training data.\nx_train_over, y_train_over = over_sampler.fit_resample(x_train, y_train)<\/pre>\n<h2 id=\"121d\" class=\"pc nq fo be nr ph pi pj nv pk pl pm nz mu pn po pp my pq pr ps nc pt pu pv pw bj\" data-selectable-paragraph=\"\">SMOTE Oversampling<\/h2>\n<p id=\"25a4\" class=\"pw-post-body-paragraph mk ml fo be b mm on mo mp mq oo ms mt mu op mw mx my oq na nb nc or ne nf ng fh bj\" data-selectable-paragraph=\"\">SMOTE stands for <strong class=\"be nh\">S<\/strong>ynthetic <strong class=\"be nh\">M<\/strong>inority <strong class=\"be nh\">O<\/strong>versampling <strong class=\"be nh\">Te<\/strong>chnique \u2014 it synthesizes new samples from the minority class rather than creating copies of those that already exist. This helps avoid model overfitting.<\/p>\n<p id=\"34a2\" class=\"pw-post-body-paragraph mk ml fo be b mm mn mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng fh bj\" data-selectable-paragraph=\"\">To create a synthetic instance, SMOTE finds the K-nearest neighbors of each minority instance, randomly selects one of them, and then linearly interpolates between the two to produce a new minority instance in the neighborhood. 
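The interpolation step just described boils down to one line; here is a toy sketch (the instance, its neighbor, and the seed are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)
x = np.array([1.0, 2.0])       # a minority instance
x_nn = np.array([3.0, 2.5])    # one of its K nearest minority neighbors
lam = rng.random()             # random fraction in [0, 1)
x_new = x + lam * (x_nn - x)   # synthetic point on the segment between them
print(x_new)
```

Because `lam` lies in [0, 1), the synthetic point always falls on the line segment between the instance and the chosen neighbor.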
Put differently, each of the instance\u2019s features is shifted toward the neighbor by a random amount, so the new points land on the segments between the instance and its neighbors.<\/p>\n<figure class=\"os ot ou ov ow lu mf mg paragraph-image\">\n<div class=\"re rf eb rg bg rh\" tabindex=\"0\" role=\"button\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg ma mb c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/1*bSOwLuDleEEGiuw7PtooOQ.png\" alt=\"\" width=\"700\" height=\"311\"><\/figure><div class=\"mf mg ru\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/format:webp\/1*bSOwLuDleEEGiuw7PtooOQ.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/format:webp\/1*bSOwLuDleEEGiuw7PtooOQ.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/format:webp\/1*bSOwLuDleEEGiuw7PtooOQ.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/format:webp\/1*bSOwLuDleEEGiuw7PtooOQ.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/format:webp\/1*bSOwLuDleEEGiuw7PtooOQ.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/format:webp\/1*bSOwLuDleEEGiuw7PtooOQ.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/format:webp\/1*bSOwLuDleEEGiuw7PtooOQ.png 1400w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/1*bSOwLuDleEEGiuw7PtooOQ.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/1*bSOwLuDleEEGiuw7PtooOQ.png 720w, 
https:\/\/miro.medium.com\/v2\/resize:fit:750\/1*bSOwLuDleEEGiuw7PtooOQ.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/1*bSOwLuDleEEGiuw7PtooOQ.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/1*bSOwLuDleEEGiuw7PtooOQ.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/1*bSOwLuDleEEGiuw7PtooOQ.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/1*bSOwLuDleEEGiuw7PtooOQ.png 1400w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\" data-testid=\"og\"><\/picture><\/div>\n<\/div>\n<figcaption class=\"mc md me mf mg mh mi be b bf z dv\" data-selectable-paragraph=\"\">SMOTE Algorithm<\/figcaption>\n<\/figure>\n<p id=\"de55\" class=\"pw-post-body-paragraph mk ml fo be b mm mn mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng fh bj\" data-selectable-paragraph=\"\">Here is a code snippet showing how to resample the minority class:<\/p>\n<pre># import the SMOTE object.\nfrom imblearn.over_sampling import SMOTE\n\n# create the object with the desired sampling strategy.\nsmote = SMOTE(sampling_strategy='minority')\n\n# fit and resample the training data.\nx_train_smote, y_train_smote = smote.fit_resample(x_train, y_train)<\/pre>\n<h2 id=\"c406\" class=\"pc nq fo be nr ph pi pj nv pk pl pm nz mu pn po pp my pq pr ps nc pt pu pv pw bj\" data-selectable-paragraph=\"\">ADASYN Oversampling<\/h2>\n<p id=\"0a5d\" class=\"pw-post-body-paragraph mk ml fo be b mm on mo mp mq oo ms mt mu op mw mx my oq na nb nc or ne nf ng fh bj\" data-selectable-paragraph=\"\">ADASYN stands for<strong class=\"be nh\"> 
Ada<\/strong>ptive<strong class=\"be nh\"> Syn<\/strong>thetic sampling. Like SMOTE, ADASYN generates synthetic samples of the minority class, but it uses the <strong class=\"be nh\">density distribution<\/strong> of the minority instances to decide how many synthetic samples to generate for each one.<\/p>\n<p id=\"a1d5\" class=\"pw-post-body-paragraph mk ml fo be b mm mn mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng fh bj\" data-selectable-paragraph=\"\">Its purpose is to generate more data for the minority class samples that are harder to learn than for those that are easier to learn.<\/p>\n<p id=\"51bc\" class=\"pw-post-body-paragraph mk ml fo be b mm mn mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng fh bj\" data-selectable-paragraph=\"\">It finds the K-nearest neighbors of each minority instance, then uses the ratio of majority to minority instances in that neighborhood to decide how many new samples to create there.<\/p>\n<p id=\"a535\" class=\"pw-post-body-paragraph mk ml fo be b mm mn mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng fh bj\" data-selectable-paragraph=\"\">Repeating this process, we will <strong class=\"be nh\">adaptively<\/strong> shift the decision boundary to focus on those samples that are hard to learn.<\/p>\n<p id=\"899c\" class=\"pw-post-body-paragraph mk ml fo be b mm mn mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng fh bj\" data-selectable-paragraph=\"\">Here is a code snippet:<\/p>\n<pre># import the ADASYN object.\nfrom imblearn.over_sampling import ADASYN\n\n# create the object to resample the minority class.\nadasyn = ADASYN(sampling_strategy=\"minority\")\n\n# fit and resample the training data.\nx_train_adasyn, y_train_adasyn = adasyn.fit_resample(x_train, y_train)<\/pre>\n<h1 id=\"52e9\" class=\"np nq fo be nr ns nt nu nv nw nx ny nz oa ob oc od oe of og oh oi oj ok ol om bj\" data-selectable-paragraph=\"\">Combining Oversampling and Undersampling<\/h1>\n<p id=\"97a5\" class=\"pw-post-body-paragraph mk ml fo be b mm on mo mp mq oo ms mt 
mu op mw mx my oq na nb nc or ne nf ng fh bj\" data-selectable-paragraph=\"\">We can combine oversampling and undersampling techniques to get better sampling results. Here are two ways that <code class=\"cw rn ro rp oy b\">imblearn<\/code> provides:<\/p>\n<p id=\"a553\" class=\"pw-post-body-paragraph mk ml fo be b mm mn mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng fh bj\" data-selectable-paragraph=\"\"><strong class=\"be nh\">SMOTE &amp; Tomek Links<\/strong> \u2014 Here\u2019s a code snippet:<\/p>\n<pre># import the SMOTETomek object.\nfrom imblearn.combine import SMOTETomek\n\n# create the object with the desired sampling strategy.\nsmote_tomek = SMOTETomek(sampling_strategy='auto')\n\n# fit and resample the training data.\nx_train_smt, y_train_smt = smote_tomek.fit_resample(x_train, y_train)<\/pre>\n<p id=\"1908\" class=\"pw-post-body-paragraph mk ml fo be b mm mn mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng fh bj\" data-selectable-paragraph=\"\"><strong class=\"be nh\">SMOTE &amp; Edited Nearest Neighbor<\/strong> \u2014 Here\u2019s a code snippet:<\/p>\n<pre># import the SMOTEENN object.\nfrom imblearn.combine import SMOTEENN\n\n# create the object with the desired sampling strategy.\nsmote_enn = SMOTEENN(sampling_strategy='minority')\n\n# fit and resample the training data.\nx_train_smtenn, y_train_smtenn = smote_enn.fit_resample(x_train, y_train)<\/pre>\n<\/div>\n<\/div>\n<\/div>\n\n\n\n<div class=\"fh fi fj fk fl\">\n<div class=\"ab ca\">\n<div class=\"ch bg et eu ev ew\">\n<h1 id=\"4d2d\" class=\"np nq fo be nr ns qs nu nv nw qt ny nz oa qu oc od oe qv og oh oi qw ok ol om bj\" data-selectable-paragraph=\"\">Other Techniques<\/h1>\n<p id=\"8639\" class=\"pw-post-body-paragraph mk ml fo be b mm on mo mp mq oo ms mt mu op mw mx my oq na nb nc or ne nf ng fh bj\" data-selectable-paragraph=\"\">Besides the resampling methodologies we\u2019ve covered in this article, there are other intuitive and advanced techniques you can employ to deal with this problem. 
Here are some of them:<\/p>\n<ol class=\"\">\n<li id=\"df04\" class=\"mk ml fo be b mm mn mo mp mq mr ms mt mu qx mw mx my qy na nb nc qz ne nf ng rq rb rc bj\" data-selectable-paragraph=\"\"><strong class=\"be nh\">Collect more data:<\/strong> You can always collect more data from other sources to build a more robust model.<\/li>\n<li id=\"5afa\" class=\"mk ml fo be b mm ri mo mp mq rj ms mt mu rk mw mx my rl na nb nc rm ne nf ng rq rb rc bj\" data-selectable-paragraph=\"\"><strong class=\"be nh\">Changing the performance metric<\/strong>: We\u2019ve seen that accuracy is misleading \u2014 it\u2019s not the metric to use when dealing with imbalanced datasets. Some metrics have been designed for such a case, including: Confusion Matrix, Precision &amp; Recall, F1 Score, ROC Curves.<\/li>\n<li id=\"b19e\" class=\"mk ml fo be b mm ri mo mp mq rj ms mt mu rk mw mx my rl na nb nc rm ne nf ng rq rb rc bj\" data-selectable-paragraph=\"\"><strong class=\"be nh\">Use different algorithms<\/strong>: Some algorithms are better than others when dealing with imbalanced datasets. Generally, in machine learning, we test a variety of different types of algorithms on a given problem to see which ones provide better performance.<\/li>\n<li id=\"d7c7\" class=\"mk ml fo be b mm ri mo mp mq rj ms mt mu rk mw mx my rl na nb nc rm ne nf ng rq rb rc bj\" data-selectable-paragraph=\"\"><strong class=\"be nh\">Use penalized models<\/strong>: Some algorithms allow you to give them a different perspective on the problem. For instance, with some algorithms, we can add costs to force the model to pay attention to the minority class. 
There are penalized versions of algorithms, such as penalized-SVM and penalized logistic regression; you can even weight the classes in deep learning models through the <code class=\"cw rn ro rp oy b\">class_weight<\/code> parameter.<\/li>\n<\/ol>\n<h1 id=\"f5a6\" class=\"np nq fo be nr ns nt nu nv nw nx ny nz oa ob oc od oe of og oh oi oj ok ol om bj\" data-selectable-paragraph=\"\">Resources<\/h1>\n<p id=\"8bca\" class=\"pw-post-body-paragraph mk ml fo be b mm on mo mp mq oo ms mt mu op mw mx my oq na nb nc or ne nf ng fh bj\" data-selectable-paragraph=\"\">There are more resources out there to handle your imbalanced dataset. Here are a few to help you get started:<\/p>\n<ul class=\"\">\n<li id=\"294a\" class=\"mk ml fo be b mm mn mo mp mq mr ms mt mu qx mw mx my qy na nb nc qz ne nf ng ra rb rc bj\" data-selectable-paragraph=\"\"><a class=\"af mj\" href=\"https:\/\/androidkt.com\/set-class-weight-for-imbalance-dataset-in-keras\/\" target=\"_blank\" rel=\"noopener ugc nofollow\">How to set class weights for the imbalanced dataset in Keras<\/a><\/li>\n<li id=\"39f5\" class=\"mk ml fo be b mm ri mo mp mq rj ms mt mu rk mw mx my rl na nb nc rm ne nf ng ra rb rc bj\" data-selectable-paragraph=\"\">The <code class=\"cw rn ro rp oy b\"><a class=\"af mj\" href=\"http:\/\/contrib.scikit-learn.org\/imbalanced-learn\/stable\/index.html\" target=\"_blank\" rel=\"noopener ugc nofollow\">imbalanced-learn<\/a><\/code> documentation.<\/li>\n<li id=\"0fa3\" class=\"mk ml fo be b mm ri mo mp mq rj ms mt mu rk mw mx my rl na nb nc rm ne nf ng ra rb rc bj\" data-selectable-paragraph=\"\">Another undersampling method called <a class=\"af mj\" href=\"https:\/\/imbalanced-learn.readthedocs.io\/en\/stable\/generated\/imblearn.under_sampling.CondensedNearestNeighbour.html\" target=\"_blank\" rel=\"noopener ugc nofollow\">Condensed Nearest Neighbour<\/a>.<\/li>\n<\/ul>\n<h1 id=\"c01f\" class=\"np nq fo be nr ns nt nu nv nw nx ny nz oa ob oc od oe of og oh oi oj ok ol om bj\" 
data-selectable-paragraph=\"\">Conclusion<\/h1>\n<p id=\"19c7\" class=\"pw-post-body-paragraph mk ml fo be b mm on mo mp mq oo ms mt mu op mw mx my oq na nb nc or ne nf ng fh bj\" data-selectable-paragraph=\"\">In this article, we confronted the problem of imbalanced datasets by exploring several different resampling techniques, which allow you to change your dataset\u2019s balance so your model can learn more effectively.<\/p>\n<p id=\"9ef4\" class=\"pw-post-body-paragraph mk ml fo be b mm mn mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng fh bj\" data-selectable-paragraph=\"\">You\u2019ll need to experiment with these techniques on your specific machine learning problem to see what best fits your case \u2014 there is no single technique that always produces the best-performing model. If there were, machine learning engineering would be a much simpler job.<\/p>\n<p id=\"a5ec\" class=\"pw-post-body-paragraph mk ml fo be b mm mn mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng fh bj\" data-selectable-paragraph=\"\">You can combine these methods to obtain more reliable models, but I suggest starting small and building upon what you learn.<\/p>\n<\/div>\n<\/div>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>Photo by Martin Sanchez on Unsplash Often in machine learning, and specifically with classification problems, we encounter imbalanced datasets. This typically refers to an issue where the classes are not represented equally, which can cause huge problems for some algorithms. 
In this article, we\u2019ll explore a technique called resampling, which is used to reduce this [&hellip;]<\/p>\n","protected":false},"author":66,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"customer_name":"","customer_description":"","customer_industry":"","customer_technologies":"","customer_logo":"","footnotes":""},"categories":[6],"tags":[],"coauthors":[195],"class_list":["post-7617","post","type-post","status-publish","format-standard","hentry","category-machine-learning"],"yoast_head":"<!-- This site is optimized with the Yoast SEO Premium plugin v25.9 (Yoast SEO v25.9) - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Resampling to Properly Handle Imbalanced Datasets in Machine Learning - Comet<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.comet.com\/site\/blog\/resampling-to-properly-handle-imbalanced-datasets-in-machine-learning\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Resampling to Properly Handle Imbalanced Datasets in Machine Learning\" \/>\n<meta property=\"og:description\" content=\"Photo by Martin Sanchez on Unsplash Often in machine learning, and specifically with classification problems, we encounter imbalanced datasets. This typically refers to an issue where the classes are not represented equally, which can cause huge problems for some algorithms. 