{"id":7878,"date":"2023-10-06T15:38:45","date_gmt":"2023-10-06T23:38:45","guid":{"rendered":"https:\/\/live-cometml.pantheonsite.io\/?p=7878"},"modified":"2025-04-24T17:05:38","modified_gmt":"2025-04-24T17:05:38","slug":"major-problems-of-machine-learning-datasets-part-3","status":"publish","type":"post","link":"https:\/\/www.comet.com\/site\/blog\/major-problems-of-machine-learning-datasets-part-3\/","title":{"rendered":"Major Problems of Machine Learning Datasets: Part 3"},"content":{"rendered":"\n<link rel=\"canonical\" href=\"https:\/\/www.comet.com\/site\/blog\/major-problems-of-machine-learning-datasets-part-3\">\n\n\n\n<div class=\"fi fj fk fl fm\">\n<div class=\"ab ca\">\n<div class=\"ch bg eu ev ew ex\">\n<figure class=\"mi mj mk ml mm mn mf mg paragraph-image\">\n<div class=\"mo mp ec mq bg mr\" tabindex=\"0\" role=\"button\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg ms mt c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/1*2zagnBfllq2sIoAhcKeK3A.jpeg\" alt=\"\" width=\"700\" height=\"468\"><\/figure><div class=\"mf mg mh\"><picture><\/picture><\/div>\n<\/div><figcaption class=\"mu mv mw mf mg mx my be b bf z dw\" data-selectable-paragraph=\"\">Photo by <a class=\"af mz\" href=\"https:\/\/unsplash.com\/@punttim?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText\" target=\"_blank\" rel=\"noopener ugc nofollow\">Tim Gouw<\/a> on <a class=\"af mz\" href=\"https:\/\/unsplash.com\/?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText\" target=\"_blank\" rel=\"noopener ugc nofollow\">Unsplash<\/a><\/figcaption><\/figure>\n<p id=\"0424\" class=\"pw-post-body-paragraph na nb fp be b gn nc nd ne gq nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu fi bj\" data-selectable-paragraph=\"\">Note: This is part 3 of the series <strong class=\"be nv\">Major Problems of Machine Learning Datasets<\/strong>. 
You can read <a class=\"af mz\" href=\"https:\/\/heartbeat.comet.ml\/major-problems-of-machine-learning-datasets-part-1-5d5a06221c90\" target=\"_blank\" rel=\"noopener ugc nofollow\"><strong class=\"be nv\">part 1 here<\/strong><\/a> and <a class=\"af mz\" href=\"https:\/\/heartbeat.comet.ml\/major-problems-of-machine-learning-datasets-part-2-ba82e551fee2\" target=\"_blank\" rel=\"noopener ugc nofollow\"><strong class=\"be nv\">part 2 here<\/strong><\/a>.<\/p>\n<h1 id=\"458c\" class=\"nw nx fp be ny nz oa gp ob oc od gs oe of og oh oi oj ok ol om on oo op oq or bj\" data-selectable-paragraph=\"\">Imbalanced data<\/h1>\n<p id=\"0d30\" class=\"pw-post-body-paragraph na nb fp be b gn os nd ne gq ot ng nh ni ou nk nl nm ov no np nq ow ns nt nu fi bj\" data-selectable-paragraph=\"\">Imbalanced data occurs when there is an uneven distribution of classes or labels. For example, in a credit card fraud detection task, the number of non-fraudulent transactions will likely be much greater than the number of fraudulent ones.<\/p>\n<figure class=\"mi mj mk ml mm mn mf mg paragraph-image\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg ms mt c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:300\/1*k-XSoDCCqlhFi-iu36tqKQ.png\" alt=\"\" width=\"300\" height=\"300\"><\/figure><div class=\"mf mg ox\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/format:webp\/1*k-XSoDCCqlhFi-iu36tqKQ.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/format:webp\/1*k-XSoDCCqlhFi-iu36tqKQ.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/format:webp\/1*k-XSoDCCqlhFi-iu36tqKQ.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/format:webp\/1*k-XSoDCCqlhFi-iu36tqKQ.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/format:webp\/1*k-XSoDCCqlhFi-iu36tqKQ.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/format:webp\/1*k-XSoDCCqlhFi-iu36tqKQ.png 1100w, 
https:\/\/miro.medium.com\/v2\/resize:fit:600\/format:webp\/1*k-XSoDCCqlhFi-iu36tqKQ.png 600w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 300px\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/1*k-XSoDCCqlhFi-iu36tqKQ.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/1*k-XSoDCCqlhFi-iu36tqKQ.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/1*k-XSoDCCqlhFi-iu36tqKQ.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/1*k-XSoDCCqlhFi-iu36tqKQ.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/1*k-XSoDCCqlhFi-iu36tqKQ.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/1*k-XSoDCCqlhFi-iu36tqKQ.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:600\/1*k-XSoDCCqlhFi-iu36tqKQ.png 600w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 300px\" data-testid=\"og\"><\/picture><\/div>\n<\/figure>\n<p id=\"e6bb\" class=\"pw-post-body-paragraph na nb fp be b gn nc nd ne gq nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu fi bj\" data-selectable-paragraph=\"\">Particularly with classification tasks, class balance is extremely important. 
When classes are imbalanced, the majority class will influence the output of the model more, making our classifier biased towards the majority class.<\/p>\n<p id=\"e88a\" class=\"pw-post-body-paragraph na nb fp be b gn nc nd ne gq nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu fi bj\" data-selectable-paragraph=\"\">Models trained with imbalanced data usually have high precision and recall scores for the majority class, whereas these scores will likely drop significantly for the minority class.<\/p>\n<p id=\"58bd\" class=\"pw-post-body-paragraph na nb fp be b gn nc nd ne gq nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu fi bj\" data-selectable-paragraph=\"\">Let\u2019s take a look at this example of a <a class=\"af mz\" href=\"https:\/\/www.kaggle.com\/datasets\/mlg-ulb\/creditcardfraud\" target=\"_blank\" rel=\"noopener ugc nofollow\">credit card fraud dataset<\/a>, and build a model on this imbalanced data. Our goal is to correctly identify fraudulent transactions:<\/p>\n<pre class=\"mi mj mk ml mm oy oz pa pb ax pc bj\"><span id=\"4f40\" class=\"pd nx fp oz b ia pe pf l iq pg\" data-selectable-paragraph=\"\">import pandas as pd\nimport seaborn as sns\nimport matplotlib.pyplot as plt<\/span><span id=\"8603\" class=\"pd nx fp oz b ia ph pf l iq pg\" data-selectable-paragraph=\"\">df = pd.read_csv(\"<a class=\"af mz\" href=\"https:\/\/www.kaggle.com\/datasets\/mlg-ulb\/creditcardfraud\" target=\"_blank\" rel=\"noopener ugc nofollow\">creditcard.csv<\/a>\")<\/span><span id=\"f9d6\" class=\"pd nx fp oz b ia ph pf l iq pg\" data-selectable-paragraph=\"\"># checking class distribution of column `Class`\nsns.countplot(data= df, x = \"Class\")<\/span><\/pre>\n<figure class=\"mi mj mk ml mm mn mf mg paragraph-image\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg ms mt c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:407\/1*K82lx6CD9dZ8m1wNcCLijg.png\" alt=\"\" width=\"407\" height=\"262\"><\/figure><div class=\"mf mg 
pi\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/format:webp\/1*K82lx6CD9dZ8m1wNcCLijg.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/format:webp\/1*K82lx6CD9dZ8m1wNcCLijg.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/format:webp\/1*K82lx6CD9dZ8m1wNcCLijg.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/format:webp\/1*K82lx6CD9dZ8m1wNcCLijg.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/format:webp\/1*K82lx6CD9dZ8m1wNcCLijg.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/format:webp\/1*K82lx6CD9dZ8m1wNcCLijg.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:814\/format:webp\/1*K82lx6CD9dZ8m1wNcCLijg.png 814w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 407px\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/1*K82lx6CD9dZ8m1wNcCLijg.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/1*K82lx6CD9dZ8m1wNcCLijg.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/1*K82lx6CD9dZ8m1wNcCLijg.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/1*K82lx6CD9dZ8m1wNcCLijg.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/1*K82lx6CD9dZ8m1wNcCLijg.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/1*K82lx6CD9dZ8m1wNcCLijg.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:814\/1*K82lx6CD9dZ8m1wNcCLijg.png 814w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, 
(-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 407px\" data-testid=\"og\"><\/picture><\/div>\n<\/figure>\n<p id=\"5518\" class=\"pw-post-body-paragraph na nb fp be b gn nc nd ne gq nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu fi bj\" data-selectable-paragraph=\"\">We have a clear class imbalance here. Let\u2019s see how much this will affect model performance:<\/p>\n<pre class=\"mi mj mk ml mm oy oz pa pb ax pc bj\"><span id=\"e35a\" class=\"pd nx fp oz b ia pe pf l iq pg\" data-selectable-paragraph=\"\"># Separate the features from the target column `Class`\nX = df.drop(columns=\"Class\")\ny = df[\"Class\"]\n\n# Split the data into training and test sets\nfrom sklearn.model_selection import train_test_split\nX_train, X_test, y_train, y_test = train_test_split(X,\n                                                    y,\n                                                    test_size=0.2,\n                                                    random_state=42)\nprint(X_train.shape, X_test.shape)\n------------\n(227845, 30) (56962, 30)<\/span><\/pre>\n<p id=\"9f69\" class=\"pw-post-body-paragraph na nb fp be b gn nc nd ne gq nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu fi bj\" data-selectable-paragraph=\"\">Now that the data is divided into training and test sets, let\u2019s build a baseline logistic regression model.<\/p>\n<pre class=\"mi mj mk ml mm oy oz pa pb ax pc bj\"><span id=\"4bfa\" class=\"pd nx fp oz b ia pe pf l iq pg\" data-selectable-paragraph=\"\">from sklearn.linear_model import LogisticRegression\nlr = LogisticRegression(solver='liblinear')\nlr.fit(X_train, y_train)<\/span><\/pre>\n<p id=\"8e21\" class=\"pw-post-body-paragraph na nb fp be b gn nc nd ne gq nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu fi bj\" data-selectable-paragraph=\"\">Now we evaluate the performance of the model with a 
classification report and confusion matrix:<\/p>\n<pre class=\"mi mj mk ml mm oy oz pa pb ax pc bj\"><span id=\"5a75\" class=\"pd nx fp oz b ia pe pf l iq pg\" data-selectable-paragraph=\"\">from sklearn.metrics import (ConfusionMatrixDisplay,\n                             classification_report)\n\ndef gen_report(model):\n  preds = model.predict(X_test)\n  # classification_report expects (y_true, y_pred); swapping them\n  # exchanges the reported precision and recall\n  print(classification_report(y_test, preds))\n  # plot_confusion_matrix was removed in scikit-learn 1.2\n  ConfusionMatrixDisplay.from_estimator(model, X_test, y_test)<\/span><span id=\"ed02\" class=\"pd nx fp oz b ia ph pf l iq pg\" data-selectable-paragraph=\"\">print(\"---LOGISTIC REGRESSION MODEL---\")\ngen_report(model=lr)<\/span><\/pre>\n<figure class=\"mi mj mk ml mm mn mf mg paragraph-image\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg ms mt c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:549\/1*HoQTLbNM09ubUrcDSexNQg.png\" alt=\"\" width=\"549\" height=\"227\"><\/figure><div class=\"mf mg pj\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/format:webp\/1*HoQTLbNM09ubUrcDSexNQg.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/format:webp\/1*HoQTLbNM09ubUrcDSexNQg.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/format:webp\/1*HoQTLbNM09ubUrcDSexNQg.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/format:webp\/1*HoQTLbNM09ubUrcDSexNQg.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/format:webp\/1*HoQTLbNM09ubUrcDSexNQg.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/format:webp\/1*HoQTLbNM09ubUrcDSexNQg.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1098\/format:webp\/1*HoQTLbNM09ubUrcDSexNQg.png 1098w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and 
(max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 549px\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/1*HoQTLbNM09ubUrcDSexNQg.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/1*HoQTLbNM09ubUrcDSexNQg.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/1*HoQTLbNM09ubUrcDSexNQg.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/1*HoQTLbNM09ubUrcDSexNQg.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/1*HoQTLbNM09ubUrcDSexNQg.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/1*HoQTLbNM09ubUrcDSexNQg.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1098\/1*HoQTLbNM09ubUrcDSexNQg.png 1098w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 549px\" data-testid=\"og\"><\/picture><\/div>\n<\/figure>\n<figure class=\"mi mj mk ml mm mn mf mg paragraph-image\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg ms mt c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:325\/1*dAuO3eyAqVc7z1xESI2uYw.png\" alt=\"\" width=\"325\" height=\"262\"><\/figure><div class=\"mf mg pk\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/format:webp\/1*dAuO3eyAqVc7z1xESI2uYw.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/format:webp\/1*dAuO3eyAqVc7z1xESI2uYw.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/format:webp\/1*dAuO3eyAqVc7z1xESI2uYw.png 750w, 
https:\/\/miro.medium.com\/v2\/resize:fit:786\/format:webp\/1*dAuO3eyAqVc7z1xESI2uYw.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/format:webp\/1*dAuO3eyAqVc7z1xESI2uYw.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/format:webp\/1*dAuO3eyAqVc7z1xESI2uYw.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:650\/format:webp\/1*dAuO3eyAqVc7z1xESI2uYw.png 650w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 325px\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/1*dAuO3eyAqVc7z1xESI2uYw.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/1*dAuO3eyAqVc7z1xESI2uYw.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/1*dAuO3eyAqVc7z1xESI2uYw.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/1*dAuO3eyAqVc7z1xESI2uYw.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/1*dAuO3eyAqVc7z1xESI2uYw.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/1*dAuO3eyAqVc7z1xESI2uYw.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:650\/1*dAuO3eyAqVc7z1xESI2uYw.png 650w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 325px\" 
data-testid=\"og\"><\/picture><\/div>\n<\/figure>\n<p id=\"8e3a\" class=\"pw-post-body-paragraph na nb fp be b gn nc nd ne gq nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu fi bj\" data-selectable-paragraph=\"\">At first glance, our model seems to perform amazingly: 100% accuracy, 100% precision, and 100% recall! But if we dig a little deeper, we\u2019ll see that on fraudulent data (the minority class) precision drops to 53% and recall drops to 83%.<\/p>\n<h2 id=\"161a\" class=\"pd nx fp be ny pl pm pn ob po pp pq oe ni pr ps pt nm pu pv pw nq px py pz qa bj\" data-selectable-paragraph=\"\">How to deal with imbalanced data<\/h2>\n<p id=\"a0eb\" class=\"pw-post-body-paragraph na nb fp be b gn os nd ne gq ot ng nh ni ou nk nl nm ov no np nq ow ns nt nu fi bj\" data-selectable-paragraph=\"\">The first thing we can do is change our sampling method. Instead of random sampling, we can use <strong class=\"be nv\">stratified sampling<\/strong>, which makes sure the train and test sets have nearly the same ratio of fraudulent transactions. The downside to this method, however, is that with so few minority-class cases in the data overall, the model will still have very few fraudulent cases to learn from.<\/p>\n<p id=\"7edb\" class=\"pw-post-body-paragraph na nb fp be b gn nc nd ne gq nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu fi bj\" data-selectable-paragraph=\"\">Instead, we might try a resampling method that tackles the underlying problem of an underrepresented minority class. 
<strong class=\"be nv\">Resampling<\/strong> includes both oversampling and under-sampling methods.<\/p>\n<p id=\"8e0a\" class=\"pw-post-body-paragraph na nb fp be b gn nc nd ne gq nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu fi bj\" data-selectable-paragraph=\"\"><strong class=\"be nv\">Oversampling<\/strong> is a resampling technique that generates more instances of the underrepresented class by randomly sampling from its existing instances.<\/p>\n<p id=\"e240\" class=\"pw-post-body-paragraph na nb fp be b gn nc nd ne gq nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu fi bj\" data-selectable-paragraph=\"\"><strong class=\"be nv\">Under-sampling<\/strong> is a resampling technique in which the majority class is reduced to the size of the minority class by randomly sampling from the majority class.<\/p>\n<figure class=\"mi mj mk ml mm mn mf mg paragraph-image\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg ms mt c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/1*9E8o5Zo290yKRg2uwZZ_Ig.png\" alt=\"\" width=\"700\" height=\"300\"><\/figure><div class=\"mf mg qb\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/format:webp\/1*9E8o5Zo290yKRg2uwZZ_Ig.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/format:webp\/1*9E8o5Zo290yKRg2uwZZ_Ig.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/format:webp\/1*9E8o5Zo290yKRg2uwZZ_Ig.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/format:webp\/1*9E8o5Zo290yKRg2uwZZ_Ig.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/format:webp\/1*9E8o5Zo290yKRg2uwZZ_Ig.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/format:webp\/1*9E8o5Zo290yKRg2uwZZ_Ig.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/format:webp\/1*9E8o5Zo290yKRg2uwZZ_Ig.png 1400w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 
3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/1*9E8o5Zo290yKRg2uwZZ_Ig.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/1*9E8o5Zo290yKRg2uwZZ_Ig.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/1*9E8o5Zo290yKRg2uwZZ_Ig.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/1*9E8o5Zo290yKRg2uwZZ_Ig.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/1*9E8o5Zo290yKRg2uwZZ_Ig.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/1*9E8o5Zo290yKRg2uwZZ_Ig.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/1*9E8o5Zo290yKRg2uwZZ_Ig.png 1400w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\" data-testid=\"og\"><\/picture><\/div>\n<\/figure>\n<p id=\"3f3c\" class=\"pw-post-body-paragraph na nb fp be b gn nc nd ne gq nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu fi bj\" data-selectable-paragraph=\"\"><code class=\"cw qc qd qe oz b\">imblearn<\/code> (imbalanced-learn) is an open-source, MIT-licensed Python library, built to work with scikit-learn, that provides tools to deal with imbalanced classes. 
It has a class named <code class=\"cw qc qd qe oz b\">SMOTETomek<\/code> that combines oversampling (SMOTE) with under-sampling (Tomek links), theoretically providing a good compromise between the pros and cons of each (though in practice this is not always the case; you may need to experiment with each sampling method to see which performs best for your dataset).<\/p>\n<pre class=\"mi mj mk ml mm oy oz pa pb ax pc bj\"><span id=\"e6aa\" class=\"pd nx fp oz b ia pe pf l iq pg\" data-selectable-paragraph=\"\">from imblearn.combine import SMOTETomek\nsmt = SMOTETomek()<\/span><span id=\"3cd4\" class=\"pd nx fp oz b ia ph pf l iq pg\" data-selectable-paragraph=\"\"># resample the full dataset so both classes are balanced\nX_res, y_res = smt.fit_resample(X, y)<\/span><\/pre>\n<p id=\"51e0\" class=\"pw-post-body-paragraph na nb fp be b gn nc nd ne gq nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu fi bj\" data-selectable-paragraph=\"\">Let\u2019s again create a model on our resampled data and see if performance improves:<\/p>\n<pre class=\"mi mj mk ml mm oy oz pa pb ax pc bj\"><span id=\"1d90\" class=\"pd nx fp oz b ia pe pf l iq pg\" data-selectable-paragraph=\"\">from sklearn.model_selection import train_test_split\n# stratify on y_res (not y): the labels must match the resampled rows\n# note: X_res includes synthetic samples, so this test set is resampled too\nX_train, X_test, y_train, y_test = train_test_split(X_res,\n                                                    y_res,\n                                                    test_size=0.2,\n                                                    stratify=y_res,\n                                                    random_state=42\n                                                    )\nfrom sklearn.linear_model import LogisticRegression\nre_lrm = LogisticRegression(solver='liblinear')\nre_lrm.fit(X_train, y_train)\n<\/span><span id=\"2d7f\" class=\"pd nx fp oz b ia ph pf l iq pg\" data-selectable-paragraph=\"\">print(\"---LOGISTIC REGRESSION MODEL (RESAMPLED)---\")\ngen_report(model=re_lrm)<\/span><\/pre>\n<figure class=\"mi mj mk ml mm mn mf mg paragraph-image\">\n<figure><img loading=\"lazy\" 
decoding=\"async\" class=\"bg ms mt c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:559\/1*HholJlJQ4UxUwRHLdrTZRQ.png\" alt=\"\" width=\"559\" height=\"216\"><\/figure><div class=\"mf mg qf\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/format:webp\/1*HholJlJQ4UxUwRHLdrTZRQ.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/format:webp\/1*HholJlJQ4UxUwRHLdrTZRQ.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/format:webp\/1*HholJlJQ4UxUwRHLdrTZRQ.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/format:webp\/1*HholJlJQ4UxUwRHLdrTZRQ.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/format:webp\/1*HholJlJQ4UxUwRHLdrTZRQ.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/format:webp\/1*HholJlJQ4UxUwRHLdrTZRQ.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1118\/format:webp\/1*HholJlJQ4UxUwRHLdrTZRQ.png 1118w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 559px\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/1*HholJlJQ4UxUwRHLdrTZRQ.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/1*HholJlJQ4UxUwRHLdrTZRQ.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/1*HholJlJQ4UxUwRHLdrTZRQ.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/1*HholJlJQ4UxUwRHLdrTZRQ.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/1*HholJlJQ4UxUwRHLdrTZRQ.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/1*HholJlJQ4UxUwRHLdrTZRQ.png 1100w, 
https:\/\/miro.medium.com\/v2\/resize:fit:1118\/1*HholJlJQ4UxUwRHLdrTZRQ.png 1118w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 559px\" data-testid=\"og\"><\/picture><\/div>\n<\/figure>\n<figure class=\"mi mj mk ml mm mn mf mg paragraph-image\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg ms mt c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:334\/1*gusjMAnyEqGnuSj3Eti6yQ.png\" alt=\"\" width=\"334\" height=\"262\"><\/figure><div class=\"mf mg qg\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/format:webp\/1*gusjMAnyEqGnuSj3Eti6yQ.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/format:webp\/1*gusjMAnyEqGnuSj3Eti6yQ.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/format:webp\/1*gusjMAnyEqGnuSj3Eti6yQ.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/format:webp\/1*gusjMAnyEqGnuSj3Eti6yQ.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/format:webp\/1*gusjMAnyEqGnuSj3Eti6yQ.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/format:webp\/1*gusjMAnyEqGnuSj3Eti6yQ.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:668\/format:webp\/1*gusjMAnyEqGnuSj3Eti6yQ.png 668w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and 
(max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 334px\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/1*gusjMAnyEqGnuSj3Eti6yQ.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/1*gusjMAnyEqGnuSj3Eti6yQ.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/1*gusjMAnyEqGnuSj3Eti6yQ.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/1*gusjMAnyEqGnuSj3Eti6yQ.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/1*gusjMAnyEqGnuSj3Eti6yQ.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/1*gusjMAnyEqGnuSj3Eti6yQ.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:668\/1*gusjMAnyEqGnuSj3Eti6yQ.png 668w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 334px\" data-testid=\"og\"><\/picture><\/div>\n<\/figure>\n<p id=\"94c3\" class=\"pw-post-body-paragraph na nb fp be b gn nc nd ne gq nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu fi bj\" data-selectable-paragraph=\"\">As you can see, we\u2019ve had a vast improvement in model performance!<\/p>\n<h1 id=\"152c\" class=\"nw nx fp be ny nz oa gp ob oc od gs oe of og oh oi oj ok ol om on oo op oq or bj\" data-selectable-paragraph=\"\">High-dimensional data<\/h1>\n<p id=\"c3d7\" class=\"pw-post-body-paragraph na nb fp be b gn os nd ne gq ot ng nh ni ou nk nl nm ov no np nq ow ns nt nu fi bj\" data-selectable-paragraph=\"\">In machine learning, the number (or, degree) of features in a dataset is referred to as its dimensionality. 
High-dimensional machine learning problems raise a variety of issues.<\/p>\n<p id=\"4d98\" class=\"pw-post-body-paragraph na nb fp be b gn nc nd ne gq nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu fi bj\" data-selectable-paragraph=\"\">As the number of features in a dataset increases, the time needed to train the model will also increase. But that\u2019s not all. High-dimensional datasets are extremely difficult (if not impossible) to visualize (can you imagine a 6-dimensional plot? Neither can I!). The harder it is to visualize a dataset, the harder it can be to explain, which contributes to the \u201cblack box\u201d problem in machine learning. Furthermore, the amount of data needed to train a model typically grows exponentially with the number of features (or dimensions) in a dataset. This is often referred to as the Curse of Dimensionality and can lead to a slew of statistical phenomena that do not occur in low-dimensional settings. To save our model from this problem, we can perform dimensionality reduction.<\/p>\n<\/div>\n<\/div>\n<\/div>\n\n\n\n<div class=\"fi fj fk fl fm\">\n<div class=\"ab ca\">\n<div class=\"ch bg eu ev ew ex\">\n<h2 id=\"ea05\" class=\"pd nx fp be ny pl pm pn ob po pp pq oe ni pr ps pt nm pu pv pw nq px py pz qa bj\" data-selectable-paragraph=\"\">How to deal with high dimensionality<\/h2>\n<p id=\"9ad0\" class=\"pw-post-body-paragraph na nb fp be b gn os nd ne gq ot ng nh ni ou nk nl nm ov no np nq ow ns nt nu fi bj\" data-selectable-paragraph=\"\">Dimensionality reduction attempts to reduce the number of features in a dataset, while still preserving as much variation as possible from the original data. 
Dimensionality reduction can reduce the chances of overfitting, mitigate multicollinearity, remove noise from data, and help transform non-linear data into a linearly separable form.<\/p>\n<p id=\"0fab\" class=\"pw-post-body-paragraph na nb fp be b gn nc nd ne gq nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu fi bj\" data-selectable-paragraph=\"\">There are two main techniques to perform dimensionality reduction: we can either remove the least important features (feature selection), or we can combine the original features into a smaller set of new features (feature extraction). One of the most popular methods of linear dimensionality reduction is PCA (Principal Component Analysis).<\/p>\n<p id=\"8b95\" class=\"pw-post-body-paragraph na nb fp be b gn nc nd ne gq nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu fi bj\" data-selectable-paragraph=\"\">PCA transforms a set of correlated features into a smaller number of uncorrelated features, called principal components. PCA uses the correlation between features to reduce dimensions. 
Note that it is important to perform feature scaling before applying PCA, because PCA is very sensitive to the relative ranges of the features.<\/p>\n<p id=\"2396\" class=\"pw-post-body-paragraph na nb fp be b gn nc nd ne gq nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu fi bj\" data-selectable-paragraph=\"\">Scikit-learn provides a built-in tool for this process with <code class=\"cw qc qd qe oz b\">sklearn.decomposition.PCA<\/code>.<\/p>\n<p id=\"fdca\" class=\"pw-post-body-paragraph na nb fp be b gn nc nd ne gq nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu fi bj\" data-selectable-paragraph=\"\">Let\u2019s first scale the data using <code class=\"cw qc qd qe oz b\">StandardScaler<\/code>:<\/p>\n<pre class=\"mi mj mk ml mm oy oz pa pb ax pc bj\"><span id=\"98b2\" class=\"pd nx fp oz b ia pe pf l iq pg\" data-selectable-paragraph=\"\">from sklearn.preprocessing import StandardScaler<\/span><span id=\"b283\" class=\"pd nx fp oz b ia ph pf l iq pg\" data-selectable-paragraph=\"\"># standardize each feature to zero mean and unit variance\nscaler = StandardScaler()\nscaler.fit(X)\nX_scaled = scaler.transform(X)<\/span><\/pre>\n<p id=\"9f40\" class=\"pw-post-body-paragraph na nb fp be b gn nc nd ne gq nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu fi bj\" data-selectable-paragraph=\"\">And now let\u2019s apply PCA:<\/p>\n<pre class=\"mi mj mk ml mm oy oz pa pb ax pc bj\"><span id=\"bf39\" class=\"pd nx fp oz b ia pe pf l iq pg\" data-selectable-paragraph=\"\">from sklearn.decomposition import PCA<\/span><span id=\"96ee\" class=\"pd nx fp oz b ia ph pf l iq pg\" data-selectable-paragraph=\"\"># keep the first 10 principal components\npca_10 = PCA(n_components=10, random_state=42)\nX_pca_10 = pca_10.fit_transform(X_scaled)<\/span><span id=\"66b6\" class=\"pd nx fp oz b ia ph pf l iq pg\" data-selectable-paragraph=\"\"># comparing before and after PCA shape\nprint(X.shape, X_pca_10.shape)\n------------\n(284807, 30) (284807, 10)<\/span><\/pre>\n<figure class=\"mi mj mk ml mm mn mf mg paragraph-image\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg ms mt c\" role=\"presentation\" 
src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/1*auxmPRrummFB__vTOoEWZw.png\" alt=\"\" width=\"700\" height=\"168\"><\/figure><div class=\"mf mg qb\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/format:webp\/1*auxmPRrummFB__vTOoEWZw.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/format:webp\/1*auxmPRrummFB__vTOoEWZw.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/format:webp\/1*auxmPRrummFB__vTOoEWZw.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/format:webp\/1*auxmPRrummFB__vTOoEWZw.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/format:webp\/1*auxmPRrummFB__vTOoEWZw.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/format:webp\/1*auxmPRrummFB__vTOoEWZw.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/format:webp\/1*auxmPRrummFB__vTOoEWZw.png 1400w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/1*auxmPRrummFB__vTOoEWZw.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/1*auxmPRrummFB__vTOoEWZw.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/1*auxmPRrummFB__vTOoEWZw.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/1*auxmPRrummFB__vTOoEWZw.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/1*auxmPRrummFB__vTOoEWZw.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/1*auxmPRrummFB__vTOoEWZw.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/1*auxmPRrummFB__vTOoEWZw.png 1400w\" sizes=\"(min-resolution: 4dppx) and 
(max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\" data-testid=\"og\"><\/picture><\/div>\n<figcaption class=\"mu mv mw mf mg mx my be b bf z dw\" data-selectable-paragraph=\"\">Principal Components<\/figcaption>\n<\/figure>\n<h1 id=\"6929\" class=\"nw nx fp be ny nz oa gp ob oc od gs oe of og oh oi oj ok ol om on oo op oq or bj\" data-selectable-paragraph=\"\">Non-normal distribution of data<\/h1>\n<p id=\"fadf\" class=\"pw-post-body-paragraph na nb fp be b gn os nd ne gq ot ng nh ni ou nk nl nm ov no np nq ow ns nt nu fi bj\" data-selectable-paragraph=\"\">The normal distribution is a distribution in which data points are spread in a <strong class=\"be nv\">symmetrical <\/strong>manner around the mean. Its density looks like a bell-shaped curve.<\/p>\n<p id=\"ccc3\" class=\"pw-post-body-paragraph na nb fp be b gn nc nd ne gq nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu fi bj\" data-selectable-paragraph=\"\">Many machine learning algorithms work best when features follow an approximately normal distribution. 
Non-normally distributed data can degrade model performance and lead to poor predictions, because normality is an important assumption for many machine learning models.<\/p>\n<p id=\"277d\" class=\"pw-post-body-paragraph na nb fp be b gn nc nd ne gq nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu fi bj\" data-selectable-paragraph=\"\">In this post, we will discuss the two most common types of non-normal distribution: left-skewed and right-skewed.<\/p>\n<p id=\"903f\" class=\"pw-post-body-paragraph na nb fp be b gn nc nd ne gq nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu fi bj\" data-selectable-paragraph=\"\">A left-skewed (negatively skewed) distribution has a long tail on the left side of the distribution, whereas a right-skewed (positively skewed) distribution has a long tail on the right side.<\/p>\n<h2 id=\"e041\" class=\"pd nx fp be ny pl pm pn ob po pp pq oe ni pr ps pt nm pu pv pw nq px py pz qa bj\" data-selectable-paragraph=\"\">How to Deal With Non-Normal Distribution of Data<\/h2>\n<p id=\"c152\" class=\"pw-post-body-paragraph na nb fp be b gn os nd ne gq ot ng nh ni ou nk nl nm ov no np nq ow ns nt nu fi bj\" data-selectable-paragraph=\"\">Several transformation methods can be used to bring a non-normal distribution closer to a normal one.<\/p>\n<ol class=\"\">\n<li id=\"41a9\" class=\"na nb fp be b gn nc nd ne gq nf ng nh ni qz nk nl nm ra no np nq rb ns nt nu rc rd re bj\" data-selectable-paragraph=\"\"><strong class=\"be nv\">Log Normal Transformation: <\/strong>In this technique, we take the logarithm of the values of a particular feature, which requires those values to be strictly positive.<\/li>\n<\/ol>\n<pre class=\"mi mj mk ml mm oy oz pa pb ax pc bj\"><span id=\"e8c8\" class=\"pd nx fp oz b ia pe pf l iq pg\" data-selectable-paragraph=\"\">import numpy as np\nimport matplotlib.pyplot as plt\nfrom scipy import stats<\/span><span id=\"93d6\" class=\"pd nx fp oz b ia ph pf l iq pg\" data-selectable-paragraph=\"\"># sample data generation\nnp.random.seed(42)\ndata = sorted(stats.lognorm.rvs(s=0.5,\n                              
  loc=1,\n                                scale=1000,\n                                size=1000))<\/span><span id=\"72cb\" class=\"pd nx fp oz b ia ph pf l iq pg\" data-selectable-paragraph=\"\"># fit lognormal distribution\nshape, loc, scale = stats.lognorm.fit(data, loc=0)\npdf_lognorm = stats.lognorm.pdf(data, shape, loc, scale)<\/span><\/pre>\n<figure class=\"mi mj mk ml mm mn mf mg paragraph-image\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg ms mt c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:516\/1*yW8_phz1fS5EbanJjnP-xw.png\" alt=\"\" width=\"516\" height=\"268\"><\/figure><div class=\"mf mg rf\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/format:webp\/1*yW8_phz1fS5EbanJjnP-xw.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/format:webp\/1*yW8_phz1fS5EbanJjnP-xw.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/format:webp\/1*yW8_phz1fS5EbanJjnP-xw.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/format:webp\/1*yW8_phz1fS5EbanJjnP-xw.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/format:webp\/1*yW8_phz1fS5EbanJjnP-xw.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/format:webp\/1*yW8_phz1fS5EbanJjnP-xw.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1032\/format:webp\/1*yW8_phz1fS5EbanJjnP-xw.png 1032w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 516px\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/1*yW8_phz1fS5EbanJjnP-xw.png 640w, 
https:\/\/miro.medium.com\/v2\/resize:fit:720\/1*yW8_phz1fS5EbanJjnP-xw.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/1*yW8_phz1fS5EbanJjnP-xw.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/1*yW8_phz1fS5EbanJjnP-xw.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/1*yW8_phz1fS5EbanJjnP-xw.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/1*yW8_phz1fS5EbanJjnP-xw.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1032\/1*yW8_phz1fS5EbanJjnP-xw.png 1032w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 516px\" data-testid=\"og\"><\/picture><\/div>\n<\/figure>\n<p id=\"cec1\" class=\"pw-post-body-paragraph na nb fp be b gn nc nd ne gq nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu fi bj\" data-selectable-paragraph=\"\"><strong class=\"be nv\">2. Box-Cox Transformation<\/strong>: Box-Cox transformation is a part of the power transformers family. 
It uses an exponent lambda (\u03bb), typically searched over the range -5 to 5, to transform a non-normally distributed variable into an approximately normal one. Box-Cox requires strictly positive data.<\/p>\n<pre class=\"mi mj mk ml mm oy oz pa pb ax pc bj\"><span id=\"5b8e\" class=\"pd nx fp oz b ia pe pf l iq pg\" data-selectable-paragraph=\"\">import numpy as np\nimport matplotlib.pyplot as plt\nfrom scipy import stats<\/span><span id=\"6d31\" class=\"pd nx fp oz b ia ph pf l iq pg\" data-selectable-paragraph=\"\"># sample data generation\nnp.random.seed(42)\ndata = sorted(stats.lognorm.rvs(s=0.7, loc=3, scale=1000, size=1000))<\/span><span id=\"0eee\" class=\"pd nx fp oz b ia ph pf l iq pg\" data-selectable-paragraph=\"\"># boxcox returns the transformed data and the fitted lambda\ndata_boxcox, fitted_lambda = stats.boxcox(data)<\/span><\/pre>\n<figure class=\"mi mj mk ml mm mn mf mg paragraph-image\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg ms mt c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:516\/1*uvAWwV-aqRHDUEo_hkVo4Q.png\" alt=\"\" width=\"516\" height=\"264\"><\/figure><div class=\"mf mg rf\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/format:webp\/1*uvAWwV-aqRHDUEo_hkVo4Q.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/format:webp\/1*uvAWwV-aqRHDUEo_hkVo4Q.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/format:webp\/1*uvAWwV-aqRHDUEo_hkVo4Q.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/format:webp\/1*uvAWwV-aqRHDUEo_hkVo4Q.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/format:webp\/1*uvAWwV-aqRHDUEo_hkVo4Q.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/format:webp\/1*uvAWwV-aqRHDUEo_hkVo4Q.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1032\/format:webp\/1*uvAWwV-aqRHDUEo_hkVo4Q.png 1032w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 
700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 516px\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/1*uvAWwV-aqRHDUEo_hkVo4Q.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/1*uvAWwV-aqRHDUEo_hkVo4Q.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/1*uvAWwV-aqRHDUEo_hkVo4Q.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/1*uvAWwV-aqRHDUEo_hkVo4Q.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/1*uvAWwV-aqRHDUEo_hkVo4Q.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/1*uvAWwV-aqRHDUEo_hkVo4Q.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1032\/1*uvAWwV-aqRHDUEo_hkVo4Q.png 1032w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 516px\" data-testid=\"og\"><\/picture><\/div>\n<\/figure>\n<\/div>\n<\/div>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>Photo by Tim Gouw on Unsplash Note: This is the part 3 of the series Major Problems of Machine Learning Datasets. You can read part 1 here and part 2 here. Imbalanced data Imbalanced data occurs when there is an uneven distribution of classes or labels. 
For example, in a credit card detection task, the [&hellip;]<\/p>\n","protected":false},"author":8,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"customer_name":"","customer_description":"","customer_industry":"","customer_technologies":"","customer_logo":"","footnotes":""},"categories":[6],"tags":[],"coauthors":[140],"class_list":["post-7878","post","type-post","status-publish","format-standard","hentry","category-machine-learning"],"yoast_head":"<!-- This site is optimized with the Yoast SEO Premium plugin v25.9 (Yoast SEO v25.9) - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Major Problems of Machine Learning Datasets: Part 3 - Comet<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.comet.com\/site\/blog\/major-problems-of-machine-learning-datasets-part-3\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Major Problems of Machine Learning Datasets: Part 3\" \/>\n<meta property=\"og:description\" content=\"Photo by Tim Gouw on Unsplash Note: This is the part 3 of the series Major Problems of Machine Learning Datasets. You can read part 1 here and part 2 here. Imbalanced data Imbalanced data occurs when there is an uneven distribution of classes or labels. 
For example, in a credit card detection task, the [&hellip;]\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.comet.com\/site\/blog\/major-problems-of-machine-learning-datasets-part-3\/\" \/>\n<meta property=\"og:site_name\" content=\"Comet\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/cometdotml\" \/>\n<meta property=\"article:published_time\" content=\"2023-10-06T23:38:45+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2025-04-24T17:05:38+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/1*2zagnBfllq2sIoAhcKeK3A.jpeg\" \/>\n<meta name=\"author\" content=\"Abhay Parashar\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@Cometml\" \/>\n<meta name=\"twitter:site\" content=\"@Cometml\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Abhay Parashar\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"8 minutes\" \/>\n<!-- \/ Yoast SEO Premium plugin. -->","yoast_head_json":{"title":"Major Problems of Machine Learning Datasets: Part 3 - Comet","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.comet.com\/site\/blog\/major-problems-of-machine-learning-datasets-part-3\/","og_locale":"en_US","og_type":"article","og_title":"Major Problems of Machine Learning Datasets: Part 3","og_description":"Photo by Tim Gouw on Unsplash Note: This is the part 3 of the series Major Problems of Machine Learning Datasets. You can read part 1 here and part 2 here. Imbalanced data Imbalanced data occurs when there is an uneven distribution of classes or labels. 
For example, in a credit card detection task, the [&hellip;]","og_url":"https:\/\/www.comet.com\/site\/blog\/major-problems-of-machine-learning-datasets-part-3\/","og_site_name":"Comet","article_publisher":"https:\/\/www.facebook.com\/cometdotml","article_published_time":"2023-10-06T23:38:45+00:00","article_modified_time":"2025-04-24T17:05:38+00:00","og_image":[{"url":"https:\/\/miro.medium.com\/v2\/resize:fit:700\/1*2zagnBfllq2sIoAhcKeK3A.jpeg","type":"","width":"","height":""}],"author":"Abhay Parashar","twitter_card":"summary_large_image","twitter_creator":"@Cometml","twitter_site":"@Cometml","twitter_misc":{"Written by":"Abhay Parashar","Est. reading time":"8 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/www.comet.com\/site\/blog\/major-problems-of-machine-learning-datasets-part-3\/#article","isPartOf":{"@id":"https:\/\/www.comet.com\/site\/blog\/major-problems-of-machine-learning-datasets-part-3\/"},"author":{"name":"Team Comet Digital","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/6266601170c60a7a82b3e0043fbe8ddf"},"headline":"Major Problems of Machine Learning Datasets: Part 3","datePublished":"2023-10-06T23:38:45+00:00","dateModified":"2025-04-24T17:05:38+00:00","mainEntityOfPage":{"@id":"https:\/\/www.comet.com\/site\/blog\/major-problems-of-machine-learning-datasets-part-3\/"},"wordCount":1064,"publisher":{"@id":"https:\/\/www.comet.com\/site\/#organization"},"image":{"@id":"https:\/\/www.comet.com\/site\/blog\/major-problems-of-machine-learning-datasets-part-3\/#primaryimage"},"thumbnailUrl":"https:\/\/miro.medium.com\/v2\/resize:fit:700\/1*2zagnBfllq2sIoAhcKeK3A.jpeg","articleSection":["Machine Learning"],"inLanguage":"en-US"},{"@type":"WebPage","@id":"https:\/\/www.comet.com\/site\/blog\/major-problems-of-machine-learning-datasets-part-3\/","url":"https:\/\/www.comet.com\/site\/blog\/major-problems-of-machine-learning-datasets-part-3\/","name":"Major Problems of Machine Learning Datasets: 
Part 3 - Comet","isPartOf":{"@id":"https:\/\/www.comet.com\/site\/#website"},"primaryImageOfPage":{"@id":"https:\/\/www.comet.com\/site\/blog\/major-problems-of-machine-learning-datasets-part-3\/#primaryimage"},"image":{"@id":"https:\/\/www.comet.com\/site\/blog\/major-problems-of-machine-learning-datasets-part-3\/#primaryimage"},"thumbnailUrl":"https:\/\/miro.medium.com\/v2\/resize:fit:700\/1*2zagnBfllq2sIoAhcKeK3A.jpeg","datePublished":"2023-10-06T23:38:45+00:00","dateModified":"2025-04-24T17:05:38+00:00","breadcrumb":{"@id":"https:\/\/www.comet.com\/site\/blog\/major-problems-of-machine-learning-datasets-part-3\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.comet.com\/site\/blog\/major-problems-of-machine-learning-datasets-part-3\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/blog\/major-problems-of-machine-learning-datasets-part-3\/#primaryimage","url":"https:\/\/miro.medium.com\/v2\/resize:fit:700\/1*2zagnBfllq2sIoAhcKeK3A.jpeg","contentUrl":"https:\/\/miro.medium.com\/v2\/resize:fit:700\/1*2zagnBfllq2sIoAhcKeK3A.jpeg"},{"@type":"BreadcrumbList","@id":"https:\/\/www.comet.com\/site\/blog\/major-problems-of-machine-learning-datasets-part-3\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.comet.com\/site\/"},{"@type":"ListItem","position":2,"name":"Major Problems of Machine Learning Datasets: Part 3"}]},{"@type":"WebSite","@id":"https:\/\/www.comet.com\/site\/#website","url":"https:\/\/www.comet.com\/site\/","name":"Comet","description":"Build Better Models 
Faster","publisher":{"@id":"https:\/\/www.comet.com\/site\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.comet.com\/site\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/www.comet.com\/site\/#organization","name":"Comet ML, Inc.","alternateName":"Comet","url":"https:\/\/www.comet.com\/site\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/#\/schema\/logo\/image\/","url":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/01\/logo_comet_square.png","contentUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/01\/logo_comet_square.png","width":310,"height":310,"caption":"Comet ML, Inc."},"image":{"@id":"https:\/\/www.comet.com\/site\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/cometdotml","https:\/\/x.com\/Cometml","https:\/\/www.youtube.com\/channel\/UCmN63HKvfXSCS-UwVwmK8Hw"]},{"@type":"Person","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/6266601170c60a7a82b3e0043fbe8ddf","name":"Team Comet Digital","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/image\/4f0c0a8cc7c0e87c636ff6a420a6647c","url":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/08\/Screen-Shot-2023-08-12-at-8.58.50-AM-96x96.png","contentUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/08\/Screen-Shot-2023-08-12-at-8.58.50-AM-96x96.png","caption":"Team Comet 
Digital"},"sameAs":["https:\/\/www.comet.ml\/"],"url":"https:\/\/www.comet.com\/site\/blog\/author\/teamcometdigital\/"}]}},"_links":{"self":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/7878","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/users\/8"}],"replies":[{"embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/comments?post=7878"}],"version-history":[{"count":1,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/7878\/revisions"}],"predecessor-version":[{"id":15502,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/7878\/revisions\/15502"}],"wp:attachment":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/media?parent=7878"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/categories?post=7878"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/tags?post=7878"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/coauthors?post=7878"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}