{"id":7874,"date":"2023-10-06T15:34:59","date_gmt":"2023-10-06T23:34:59","guid":{"rendered":"https:\/\/live-cometml.pantheonsite.io\/?p=7874"},"modified":"2025-04-24T17:05:41","modified_gmt":"2025-04-24T17:05:41","slug":"major-problems-of-machine-learning-datasets-part-1","status":"publish","type":"post","link":"https:\/\/www.comet.com\/site\/blog\/major-problems-of-machine-learning-datasets-part-1\/","title":{"rendered":"Major Problems of Machine Learning Datasets: Part 1"},"content":{"rendered":"\n<link rel=\"canonical\" href=\"https:\/\/www.comet.com\/site\/blog\/major-problems-of-machine-learning-datasets-part-1\">\n\n\n\n<div class=\"fi fj fk fl fm\">\n<div class=\"ab ca\">\n<div class=\"ch bg eu ev ew ex\">\n<figure class=\"mi mj mk ml mm mn mf mg paragraph-image\">\n<div class=\"mo mp ec mq bg mr\" tabindex=\"0\" role=\"button\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg ms mt c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/0*-sPGj0m_sjFaJBbU\" alt=\"\" width=\"700\" height=\"467\"><\/figure><div class=\"mf mg mh\"><picture><\/picture><\/div>\n<\/div><figcaption class=\"mu mv mw mf mg mx my be b bf z dw\" data-selectable-paragraph=\"\">Photo by <a class=\"af mz\" href=\"https:\/\/unsplash.com\/@jeshoots?utm_source=medium&amp;utm_medium=referral\" target=\"_blank\" rel=\"noopener ugc nofollow\">JESHOOTS.COM<\/a> on <a class=\"af mz\" href=\"https:\/\/unsplash.com\/?utm_source=medium&amp;utm_medium=referral\" target=\"_blank\" rel=\"noopener ugc nofollow\">Unsplash<\/a><\/figcaption><\/figure>\n<p id=\"2ce7\" class=\"pw-post-body-paragraph na nb fp be b gn nc nd ne gq nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu fi bj\" data-selectable-paragraph=\"\">Data play a key role in machine learning, and the better and more relevant data you have, the more accurate the model you will build. Getting the perfect data, however, is still a dream for many data scientists. 
A lot of data comes from web scraping, APIs and other external sources, and most real-world datasets will just look like an ugly stack of information, at least at first. However, data will speak for itself, if you keep it organized.<\/p>\n<p id=\"8835\" class=\"pw-post-body-paragraph na nb fp be b gn nc nd ne gq nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu fi bj\" data-selectable-paragraph=\"\">In this blog, I would love to share some major problems that occur with many supervised machine learning datasets, as well as how to deal with them.<\/p>\n<h1 id=\"5176\" class=\"nv nw fp be nx ny nz gp oa ob oc gs od oe of og oh oi oj ok ol om on oo op oq bj\" data-selectable-paragraph=\"\">Missing Values<\/h1>\n<h2 id=\"5097\" class=\"or nw fp be nx os ot ou oa ov ow ox od ni oy oz pa nm pb pc pd nq pe pf pg ph bj\" data-selectable-paragraph=\"\">How to Deal with Missing Values In Datasets?<\/h2>\n<p id=\"3234\" class=\"pw-post-body-paragraph na nb fp be b gn pi nd ne gq pj ng nh ni pk nk nl nm pl no np nq pm ns nt nu fi bj\" data-selectable-paragraph=\"\">There are various ways of dealing with missing values, and you\u2019ll likely need to determine which method is right for your task at hand on a case-by-case basis. If only a very small percentage of your data is missing, then you might be able to simply drop all missing values. In some cases of extremely large amounts of missing data, it may even be better to consider finding a new dataset, additional datasets, increasing your domain knowledge, or reframing your problem. 
When about 10\u201350% of your data are missing, however, you may also consider imputation.<\/p>\n<p id=\"930a\" class=\"pw-post-body-paragraph na nb fp be b gn nc nd ne gq nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu fi bj\" data-selectable-paragraph=\"\">There are two types of imputation: numerical and categorical.<\/p>\n<h2 id=\"b46f\" class=\"or nw fp be nx os ot ou oa ov ow ox od ni oy oz pa nm pb pc pd nq pe pf pg ph bj\" data-selectable-paragraph=\"\">Imputing Missing Numerical Data<\/h2>\n<ol class=\"\">\n<li id=\"57c0\" class=\"na nb fp be b gn pi nd ne gq pj ng nh ni pn nk nl nm po no np nq pp ns nt nu pq pr ps bj\" data-selectable-paragraph=\"\"><strong class=\"be pt\">Mean, median, or mode imputation<\/strong>: If the distribution of your data appears normal, you might consider calculating the mean of a particular feature to replace missing values. However, if the distribution of data appears left- or right-skewed, then you might be better off with median imputation. The following implementation assumes that your missing values are represented by <code class=\"cw pu pv pw px b\">NaN<\/code>s:<\/li>\n<\/ol>\n<pre class=\"mi mj mk ml mm py px pz qa ax qb bj\"><span id=\"e282\" class=\"or nw fp px b ia qc qd l iq qe\" data-selectable-paragraph=\"\"># import numpy as np<\/span><span id=\"5e20\" class=\"or nw fp px b ia qf qd l iq qe\" data-selectable-paragraph=\"\"># Normally-distributed data (fill with mean)\ndf['col'] = df['col'].fillna(df['col'].mean()) <\/span><span id=\"992d\" class=\"or nw fp px b ia qf qd l iq qe\" data-selectable-paragraph=\"\"># Skewed data (fill with median)\ndf['col'] = df['col'].fillna(df['col'].median())<\/span><\/pre>\n<p id=\"27f4\" class=\"pw-post-body-paragraph na nb fp be b gn nc nd ne gq nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu fi bj\" data-selectable-paragraph=\"\">2.<strong class=\"be pt\"> Treat <\/strong><code class=\"cw pu pv pw px b\"><strong class=\"be pt\">NaN<\/strong><\/code><strong class=\"be pt\"> as 
new category<\/strong>: This technique is useful when the fact that a value is missing carries information of its own. In this method, we add a new column to our DataFrame that marks missing values as <code class=\"cw pu pv pw px b\">1<\/code> and non-missing values as <code class=\"cw pu pv pw px b\">0<\/code>. Note that this method creates a categorical indicator feature out of whether or not a value is missing in a particular numerical feature, but does not replace, or otherwise handle, the original missing values. We still need to fill the missing values in the original column afterwards.<\/p>\n<pre class=\"mi mj mk ml mm py px pz qa ax qb bj\"><span id=\"9dba\" class=\"or nw fp px b ia qc qd l iq qe\" data-selectable-paragraph=\"\">import numpy as np<\/span><span id=\"0f43\" class=\"or nw fp px b ia qf qd l iq qe\" data-selectable-paragraph=\"\">df['col_nan']= np.where(df['col'].isnull(),1,0) <\/span><span id=\"1f9e\" class=\"or nw fp px b ia qf qd l iq qe\" data-selectable-paragraph=\"\">df['col'] = df['col'].fillna(df['col'].mean()) <\/span><\/pre>\n<p id=\"5cc5\" class=\"pw-post-body-paragraph na nb fp be b gn nc nd ne gq nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu fi bj\" data-selectable-paragraph=\"\">3. <strong class=\"be pt\">KNN Imputer<\/strong>: KNN Imputer is a distance-based imputation method that utilizes the <code class=\"cw pu pv pw px b\">k<\/code>-Nearest Neighbors algorithm to replace each missing value with the mean of that feature across the <code class=\"cw pu pv pw px b\">k<\/code> nearest neighbors found in the training data. The value of <code class=\"cw pu pv pw px b\">k<\/code> is set with the parameter <code class=\"cw pu pv pw px b\">n_neighbors<\/code>. By default, KNN Imputer uses a NaN-aware Euclidean distance metric (<code class=\"cw pu pv pw px b\">nan_euclidean<\/code>) to find the nearest neighbors. 
One thing to remember is that we need to normalize the data before passing it to KNN Imputer; otherwise the distance calculation, and therefore the replacements, will be dominated by the features with the largest ranges.<\/p>\n<pre class=\"mi mj mk ml mm py px pz qa ax qb bj\"><span id=\"11de\" class=\"or nw fp px b ia qc qd l iq qe\" data-selectable-paragraph=\"\"># Loading DataFrame\nimport pandas as pd\ndf = pd.read_csv('titanic.csv')<\/span><span id=\"62a6\" class=\"or nw fp px b ia qf qd l iq qe\" data-selectable-paragraph=\"\"># Extracting all Columns With Numerical Data\nnum = [col for col in df.columns if df[col].dtypes != 'O']<\/span><span id=\"f1be\" class=\"or nw fp px b ia qf qd l iq qe\" data-selectable-paragraph=\"\"># Scaling Data Using MinMaxScaler\nfrom sklearn.preprocessing import MinMaxScaler\nscaler = MinMaxScaler()\nnorm_df = pd.DataFrame(scaler.fit_transform(df[num]), columns = num)<\/span><span id=\"47f2\" class=\"or nw fp px b ia qf qd l iq qe\" data-selectable-paragraph=\"\"># Initialize Imputer\nfrom sklearn.impute import KNNImputer\nknn = KNNImputer(n_neighbors=5)<\/span><span id=\"c4fa\" class=\"or nw fp px b ia qf qd l iq qe\" data-selectable-paragraph=\"\"># Fit Imputer\nknn.fit(norm_df)<\/span><span id=\"444e\" class=\"or nw fp px b ia qf qd l iq qe\" data-selectable-paragraph=\"\"># Transform Data and Save It In a New Instance of DataFrame\nnon_nan_df=pd.DataFrame(knn.transform(norm_df), columns=num)<\/span><\/pre>\n<figure class=\"mi mj mk ml mm mn mf mg paragraph-image\">\n<div class=\"mo mp ec mq bg mr\" tabindex=\"0\" role=\"button\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg ms mt c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/1*PeHDBtThIfTcZtIPztXJSA.png\" alt=\"\" width=\"700\" height=\"221\"><\/figure><div class=\"mf mg qg\"><picture><\/picture><\/div>\n<\/div>\n<figcaption class=\"mu mv mw mf mg mx my be b bf z dw\" data-selectable-paragraph=\"\">Image by author<\/figcaption>\n<\/figure>\n<p id=\"a4cb\" class=\"pw-post-body-paragraph na nb fp be b gn nc nd ne gq nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu fi bj\" data-selectable-paragraph=\"\">This method can also be used for categorical data, but for that, we first need to convert the data to numerical form using label- or one-hot-encoding.<\/p>\n<p id=\"2251\" class=\"pw-post-body-paragraph na nb fp be b gn nc nd ne gq nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu fi bj\" data-selectable-paragraph=\"\">4.<strong class=\"be pt\"> Filling categories proportionally using their probability weights<\/strong>: In this technique, we replace missing values with existing categories, sampled in proportion to how often each category occurs in the feature. 
This method leaves the category proportions nearly unchanged: if category A makes up 90% of a feature, B 7%, and C 3%, then after filling the missing values the proportions of A, B, and C will stay almost the same.<\/p>\n<pre>def fill_proportionally(col, dataset):\n    import random\n    random.seed(0)\n\n    # counting each category (NaN is excluded) so that values and weights share one order\n    counts = dataset[col].value_counts()\n    values = counts.index.tolist()\n\n    # getting weights for probability weighting\n    weights = counts.values \/ counts.values.sum()\n    print('Before Imputation Probability Weights\\n',weights)\n    # filling\n    dataset[col] = dataset[col].apply(lambda x: random.choices(values, weights=weights)[0] if pd.isnull(x) else x)\n\nimport pandas as pd\ndf = pd.read_csv('https:\/\/raw.githubusercontent.com\/Abhayparashar31\/datasets\/master\/titanic.csv')\n\n### Imputing Missing Categories\nfill_proportionally('Embarked', df)<\/pre>\n<figure class=\"mi mj mk ml mm mn mf mg paragraph-image\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg ms mt c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:324\/1*ki1HqTCTJjHVG1QTH0kqyQ.png\" alt=\"\" width=\"324\" height=\"110\"><\/figure><div class=\"mf mg qk\"><picture><\/picture><\/div>\n<\/figure>\n<p id=\"021b\" class=\"pw-post-body-paragraph na nb fp be b gn nc nd ne gq nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu fi bj\" data-selectable-paragraph=\"\">There are other methods as well that make use of machine learning models to predict missing values. 
You can check them by visiting my <a class=\"af mz\" href=\"https:\/\/www.kaggle.com\/code\/abhayparashar31\/feature-engineering-handling-missing-values\" target=\"_blank\" rel=\"noopener ugc nofollow\"><strong class=\"be pt\"><em class=\"ql\">Kaggle notebook<\/em><\/strong><\/a>.<\/p>\n<\/div>\n<\/div>\n<\/div>\n\n\n\n<div class=\"fi fj fk fl fm\">\n<div class=\"ab ca\">\n<div class=\"ch bg eu ev ew ex\">\n<blockquote class=\"qu\"><p id=\"a438\" class=\"qv qw fp be qx qy qz ra rb rc rd nu dw\" data-selectable-paragraph=\"\">Innovation and academia go hand-in-hand. <a class=\"af mz\" href=\"https:\/\/www.youtube.com\/watch?v=7XCsi64HLQ8\" target=\"_blank\" rel=\"noopener ugc nofollow\">Listen to our own CEO Gideon Mendels chat with the Stanford MLSys Seminar Series<\/a> team about the future of MLOps and give the Comet platform a try for free!<\/p><\/blockquote>\n<\/div>\n<\/div>\n<\/div>\n\n\n\n<div class=\"fi fj fk fl fm\">\n<div class=\"ab ca\">\n<div class=\"ch bg eu ev ew ex\">\n<h1 id=\"8318\" class=\"nv nw fp be nx ny re gp oa ob rf gs od oe rg og oh oi rh ok ol om ri oo op oq bj\" data-selectable-paragraph=\"\">Categorical Data<\/h1>\n<p id=\"f08f\" class=\"pw-post-body-paragraph na nb fp be b gn pi nd ne gq pj ng nh ni pk nk nl nm pl no np nq pm ns nt nu fi bj\" data-selectable-paragraph=\"\">Many machine learning algorithms, including regression models, work only with numerical data. For this reason, it is important to convert categorical data to a numerical form before feeding them to a machine learning model. 
Categorical data refer to data with labels as values (e.g., sex, city, or position).<\/p>\n<h2 id=\"cc62\" class=\"or nw fp be nx os ot ou oa ov ow ox od ni oy oz pa nm pb pc pd nq pe pf pg ph bj\" data-selectable-paragraph=\"\">How to deal with categorical data in datasets?<\/h2>\n<p id=\"af12\" class=\"pw-post-body-paragraph na nb fp be b gn pi nd ne gq pj ng nh ni pk nk nl nm pl no np nq pm ns nt nu fi bj\" data-selectable-paragraph=\"\">One option when utilizing categorical data is to choose a tree-based model, when appropriate. If the situation calls for another type of model, however, then another option is to convert the categorical values into numerical form.<\/p>\n<p id=\"12e4\" class=\"pw-post-body-paragraph na nb fp be b gn nc nd ne gq nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu fi bj\" data-selectable-paragraph=\"\"><strong class=\"be pt\">Encoding Categorical Data (with no inherent order)<\/strong><\/p>\n<p id=\"8e0f\" class=\"pw-post-body-paragraph na nb fp be b gn nc nd ne gq nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu fi bj\" data-selectable-paragraph=\"\">Inherent order means that the categories have a meaningful ranking (for example, low, medium, and high). For categories with no inherent order, we can use nominal encoding methods. One of the most common methods of nominal encoding is One-Hot Encoding.<\/p>\n<ul class=\"\">\n<li id=\"e7cf\" class=\"na nb fp be b gn nc nd ne gq nf ng nh ni rj nk nl nm rk no np nq rl ns nt nu rm pr ps bj\" data-selectable-paragraph=\"\"><strong class=\"be pt\">One-Hot Encoding<\/strong>: In One-Hot encoding, we create a new column for each individual class label, and assign each row a <code class=\"cw pu pv pw px b\">0<\/code> or <code class=\"cw pu pv pw px b\">1<\/code> in each new column, depending on whether or not the row belongs to that class. This approach can be used for both single- and multi-class categorical values. 
One major downside of this method, however, is that it increases the dimensionality of the data.<\/li>\n<\/ul>\n<pre>import pandas as pd\nimport numpy as np\n\n### Sample Data Creation\ncountries = ['india','uk','usa','canada']\ncol = np.random.choice(countries,100)\ndf = pd.DataFrame(col,columns=['Country'])\n\n### One Hot Encoding\ndummies = pd.get_dummies(df['Country'])\npd.concat([df,dummies],axis=1).head()<\/pre>\n<figure class=\"mi mj mk ml mm mn\"><\/figure>\n<figure class=\"mi mj mk ml mm mn mf mg paragraph-image\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg ms mt c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:362\/1*lzLJepFA5-wIfPfMOYEDrA.png\" alt=\"\" width=\"362\" height=\"240\"><\/figure><div class=\"mf mg rn\"><picture><\/picture><\/div>\n<figcaption class=\"mu mv mw mf mg mx my be b bf z dw\" data-selectable-paragraph=\"\">Country Column Before and After One Hot Encoding \u2014 Screenshot Taken By Author<\/figcaption>\n<\/figure>\n<p id=\"d8e5\" class=\"pw-post-body-paragraph na nb fp be b gn nc nd ne gq nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu fi bj\" data-selectable-paragraph=\"\"><strong class=\"be pt\">Encoding Categorical Data (with inherent order)<\/strong><\/p>\n<p id=\"7e50\" class=\"pw-post-body-paragraph na nb fp be b gn nc nd ne gq nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu fi bj\" data-selectable-paragraph=\"\">If the categories have a meaningful order, we refer to them as ordinal. 
For these situations, we can use methods like label encoding and target-guided encoding.<\/p>\n<ul class=\"\">\n<li id=\"067f\" class=\"na nb fp be b gn nc nd ne gq nf ng nh ni rj nk nl nm rk no np nq rl ns nt nu rm pr ps bj\" data-selectable-paragraph=\"\"><strong class=\"be pt\">Label Encoding<\/strong>: Label encoding refers to the process of converting labels into a range of numbers (not just <code class=\"cw pu pv pw px b\">0<\/code> and <code class=\"cw pu pv pw px b\">1<\/code>), thereby preserving some of their ordinality. The major downside of label encoding, however, is that models don\u2019t just infer the order of the categories, but also the values of the numbers themselves, sometimes giving higher weight or importance to categories assigned larger numbers.<\/li>\n<\/ul>\n<pre>import pandas as pd\nimport numpy as np\n\n### Sample Data Creation\nlevels = ['Low','Medium','High']\ncol = np.random.choice(levels,100)\ndf = pd.DataFrame(col,columns=['Level'])\n\n### Label Encoding (Sklearn)\n# Note: LabelEncoder assigns codes in alphabetical order (High=0, Low=1, Medium=2),\n# so it does not follow the Low &lt; Medium &lt; High ranking\nfrom sklearn.preprocessing import LabelEncoder\nle = LabelEncoder()\nle.fit_transform(df['Level'])\n\n### Label Encoding (Manual Method, keeps the intended order)\ndf['Level'].map({\n    'High':2,\n    'Medium':1,\n    'Low':0\n})<\/pre>\n<figure class=\"mi mj mk ml mm mn\"><\/figure>\n<figure class=\"mi mj mk ml mm mn mf mg paragraph-image\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg ms mt c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:258\/1*Qgoylw_od7nN6zA6piZM8Q.png\" alt=\"\" width=\"258\" height=\"245\"><\/figure><div class=\"mf mg ro\"><picture><\/picture><\/div>\n<figcaption class=\"mu mv mw mf mg mx my be b bf z dw\" data-selectable-paragraph=\"\">Level Column Before and After Label Encoding \u2014 Screenshot Taken By Author<\/figcaption>\n<\/figure>\n<ul class=\"\">\n<li id=\"6151\" class=\"na nb fp be b gn nc nd ne gq nf ng nh ni rj nk nl nm rk no np nq rl ns nt nu rm pr ps bj\" data-selectable-paragraph=\"\"><strong class=\"be pt\">Count Encoding<\/strong>: Count encoding is a simple method in which we replace each category with the number of times it occurs in the column:<\/li>\n<\/ul>\n<pre class=\"mi mj mk ml mm py px pz qa ax qb bj\"><span id=\"07f8\" class=\"or nw fp px b ia qc qd l iq qe\" data-selectable-paragraph=\"\">Cat_Count = df['col'].value_counts().to_dict()\ndf['col'].map(Cat_Count)<\/span><\/pre>\n<h1 id=\"bafc\" class=\"nv nw fp be nx ny nz gp oa ob oc gs od oe of og oh oi oj ok ol om on oo op oq bj\" data-selectable-paragraph=\"\">Different ranges<\/h1>\n<p id=\"62ae\" class=\"pw-post-body-paragraph na nb fp be b gn pi nd ne gq pj ng nh ni pk nk nl nm pl no np nq pm ns nt nu fi bj\" data-selectable-paragraph=\"\">Datasets often include multiple features, each with its own range of values. For example, a dataset containing information about salaries of different employees based on age and years of experience will have a very different range in the <code class=\"cw pu pv pw px b\">salary<\/code> values than it will in the <code class=\"cw pu pv pw px b\">age<\/code> values. Due to this difference, the features with higher value ranges may influence the output more.<\/p>\n<p id=\"9af7\" class=\"pw-post-body-paragraph na nb fp be b gn nc nd ne gq nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu fi bj\" data-selectable-paragraph=\"\">One way to overcome this quirk is to use tree-based models like Random Forest. 
However, if your problem requires the use of regularized linear models or neural networks, then you should scale your features to a common range (e.g., 0 to 1).<\/p>\n<h2 id=\"8656\" class=\"or nw fp be nx os ot ou oa ov ow ox od ni oy oz pa nm pb pc pd nq pe pf pg ph bj\" data-selectable-paragraph=\"\"><strong class=\"al\">Feature scaling techniques:<\/strong><\/h2>\n<ol class=\"\">\n<li id=\"3563\" class=\"na nb fp be b gn pi nd ne gq pj ng nh ni pn nk nl nm po no np nq pp ns nt nu pq pr ps bj\" data-selectable-paragraph=\"\"><strong class=\"be pt\">Min Max Scaler: <\/strong>By default, this method scales all data between <strong class=\"be pt\">0<\/strong> and <strong class=\"be pt\">1<\/strong>. However, you can also use a min-max scaler to scale values within a custom range.<\/li>\n<\/ol>\n<figure class=\"mi mj mk ml mm mn mf mg paragraph-image\">\n<div class=\"mo mp ec mq bg mr\" tabindex=\"0\" role=\"button\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg ms mt c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/1*8VqQpECGZ43yhgKFeb8Q_g.png\" alt=\"\" width=\"700\" height=\"113\"><\/figure><div class=\"mf mg rp\"><picture><\/picture><\/div>\n<\/div>\n<\/figure>\n<pre class=\"mi mj mk ml mm py px pz qa ax qb bj\"><span id=\"ed2c\" class=\"or nw fp px b ia qc qd l iq qe\" data-selectable-paragraph=\"\">from sklearn.preprocessing import MinMaxScaler<\/span><span id=\"d2a4\" class=\"or nw fp px b ia qf qd l iq qe\" data-selectable-paragraph=\"\"># Defining Scaler\nscaler = MinMaxScaler()<\/span><span id=\"aa7a\" class=\"or nw fp px b ia qf qd l iq qe\" data-selectable-paragraph=\"\"># Scaling Columns Values\ncolumn_names = 
['salary_col', 'age_col']\n# Work on a copy so the assignment below does not touch a view of df\nfeatures = df[column_names].copy()\nfeatures[column_names] = scaler.fit_transform(features.values)<\/span><\/pre>\n<figure class=\"mi mj mk ml mm mn mf mg paragraph-image\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg ms mt c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:142\/0*cXaSCzv9fWR1gEpy.png\" alt=\"\" width=\"142\" height=\"214\"><\/figure><div class=\"mf mg rq\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/format:webp\/0*cXaSCzv9fWR1gEpy.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/format:webp\/0*cXaSCzv9fWR1gEpy.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/format:webp\/0*cXaSCzv9fWR1gEpy.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/format:webp\/0*cXaSCzv9fWR1gEpy.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/format:webp\/0*cXaSCzv9fWR1gEpy.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/format:webp\/0*cXaSCzv9fWR1gEpy.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:284\/format:webp\/0*cXaSCzv9fWR1gEpy.png 284w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 142px\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/0*cXaSCzv9fWR1gEpy.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/0*cXaSCzv9fWR1gEpy.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/0*cXaSCzv9fWR1gEpy.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/0*cXaSCzv9fWR1gEpy.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/0*cXaSCzv9fWR1gEpy.png 
828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/0*cXaSCzv9fWR1gEpy.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:284\/0*cXaSCzv9fWR1gEpy.png 284w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 142px\" data-testid=\"og\"><\/picture><\/div>\n<\/figure>\n<p id=\"12f1\" class=\"pw-post-body-paragraph na nb fp be b gn nc nd ne gq nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu fi bj\" data-selectable-paragraph=\"\">You can change the scaling range by specifying <code class=\"cw pu pv pw px b\">feature_range = (lower, upper)<\/code> when you create the scaler.<\/p>\n<p id=\"ed56\" class=\"pw-post-body-paragraph na nb fp be b gn nc nd ne gq nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu fi bj\" data-selectable-paragraph=\"\">2. <strong class=\"be pt\">Standard Scaler<\/strong>: This method works best when the values of the column are approximately normally distributed, although it does not strictly require it. 
It rescales each column so that its mean is 0 and its standard deviation is 1.<\/p>\n<pre class=\"mi mj mk ml mm py px pz qa ax qb bj\"><span id=\"bd79\" class=\"or nw fp px b ia qc qd l iq qe\" data-selectable-paragraph=\"\">from sklearn.preprocessing import StandardScaler<\/span><span id=\"48f6\" class=\"or nw fp px b ia qf qd l iq qe\" data-selectable-paragraph=\"\"># Defining Scaler\nscaler = StandardScaler()<\/span><span id=\"1bfd\" class=\"or nw fp px b ia qf qd l iq qe\" data-selectable-paragraph=\"\">col_names = ['salary', 'age']\n# Work on a copy so the assignment below does not touch a view of df\nfeatures = df[col_names].copy()<\/span><span id=\"d021\" class=\"or nw fp px b ia qf qd l iq qe\" data-selectable-paragraph=\"\"># Scaling Values\nfeatures[col_names] = scaler.fit_transform(features.values)\nfeatures<\/span><\/pre>\n<figure class=\"mi mj mk ml mm mn mf mg paragraph-image\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg ms mt c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:176\/0*r3CR_YqgPt5KA96u.png\" alt=\"\" width=\"176\" height=\"188\"><\/figure><div class=\"mf mg rr\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/format:webp\/0*r3CR_YqgPt5KA96u.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/format:webp\/0*r3CR_YqgPt5KA96u.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/format:webp\/0*r3CR_YqgPt5KA96u.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/format:webp\/0*r3CR_YqgPt5KA96u.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/format:webp\/0*r3CR_YqgPt5KA96u.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/format:webp\/0*r3CR_YqgPt5KA96u.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:352\/format:webp\/0*r3CR_YqgPt5KA96u.png 352w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 
700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 176px\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/0*r3CR_YqgPt5KA96u.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/0*r3CR_YqgPt5KA96u.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/0*r3CR_YqgPt5KA96u.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/0*r3CR_YqgPt5KA96u.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/0*r3CR_YqgPt5KA96u.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/0*r3CR_YqgPt5KA96u.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:352\/0*r3CR_YqgPt5KA96u.png 352w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 176px\" data-testid=\"og\"><\/picture><\/div>\n<\/figure>\n<p id=\"452c\" class=\"pw-post-body-paragraph na nb fp be b gn nc nd ne gq nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu fi bj\" data-selectable-paragraph=\"\">There are other scalers as well that are somewhat less famous, but still useful, including: <a class=\"af mz\" href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.preprocessing.RobustScaler.html\" target=\"_blank\" rel=\"noopener ugc nofollow\">Robust Scaler<\/a>, <a class=\"af mz\" href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.preprocessing.MaxAbsScaler.html\" target=\"_blank\" rel=\"noopener ugc nofollow\">MaxAbsScaler<\/a>, <a 
class=\"af mz\" href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.preprocessing.QuantileTransformer.html\" target=\"_blank\" rel=\"noopener ugc nofollow\">Quantile Transformer Scaler<\/a>, and many more. You can learn about all of them by reading my previous article about <a class=\"af mz\" href=\"https:\/\/pub.towardsai.net\/feature-transformation-and-scaling-techniques-f9645cb538e\" target=\"_blank\" rel=\"noopener ugc nofollow\">feature scaling and transformation<\/a>.<\/p>\n<h1 id=\"f369\" class=\"nv nw fp be nx ny nz gp oa ob oc gs od oe of og oh oi oj ok ol om on oo op oq bj\" data-selectable-paragraph=\"\">Too little training data<\/h1>\n<p id=\"f2a5\" class=\"pw-post-body-paragraph na nb fp be b gn pi nd ne gq pj ng nh ni pk nk nl nm pl no np nq pm ns nt nu fi bj\" data-selectable-paragraph=\"\">Too little training data is a major problem for computer vision datasets, a popular area of deep learning. These models require incredibly large amounts of labeled training data, which tends to be expensive to produce and limited in availability. The simple and straightforward approach would be to collect more data, but in reality, this is not always possible. Instead, we\u2019ll often use data augmentation.<\/p>\n<h2 id=\"90e6\" class=\"or nw fp be nx os ot ou oa ov ow ox od ni oy oz pa nm pb pc pd nq pe pf pg ph bj\" data-selectable-paragraph=\"\">Image data augmentation<\/h2>\n<p id=\"b195\" class=\"pw-post-body-paragraph na nb fp be b gn pi nd ne gq pj ng nh ni pk nk nl nm pl no np nq pm ns nt nu fi bj\" data-selectable-paragraph=\"\">Image data augmentation is a technique in which we apply certain transformations to existing image data in order to generate multiple, distinct variants for training purposes. 
Common transformations include rotating, cropping, padding, scaling, flipping, changing brightness, and adding noise.<\/p>\n<p id=\"d7ff\" class=\"pw-post-body-paragraph na nb fp be b gn nc nd ne gq nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu fi bj\" data-selectable-paragraph=\"\">We can perform image augmentation manually in Python with libraries such as Pillow and OpenCV. A more automated option is the deep learning library <a class=\"af mz\" href=\"https:\/\/keras.io\/\" target=\"_blank\" rel=\"noopener ugc nofollow\">Keras<\/a>. Its image preprocessing module provides the <a class=\"af mz\" href=\"https:\/\/keras.io\/ja\/preprocessing\/image\/\" target=\"_blank\" rel=\"noopener ugc nofollow\">ImageDataGenerator<\/a> class, which offers different options for position and color augmentation.<\/p>\n<pre>from keras.preprocessing.image import ImageDataGenerator, array_to_img, img_to_array, load_img\n\ndatagen = ImageDataGenerator(\n        rotation_range=40,\n        width_shift_range=0.2,\n        height_shift_range=0.2,\n        brightness_range= [0.5, 1.5],\n        rescale=1.\/255,\n        shear_range=0.2,\n        zoom_range=0.4,\n        horizontal_flip=True,\n        fill_mode='nearest')\n\npath = '\/content\/drive\/MyDrive\/cat.jpg' ## Image Path\nimg = load_img(path)\nx = img_to_array(img)\nx = x.reshape((1,) + x.shape)  ## Add a batch dimension\ni = 0\n\n### Create 25 Augmented Images and Save Them In the `aug_imgs` directory\nfor batch in datagen.flow(x, batch_size=1,\n                      save_to_dir=\"\/content\/drive\/MyDrive\/aug_imgs\", save_prefix='img', save_format='jpeg'):\n    i += 1\n    if i &gt;= 25:   ## Stop after 25 Augmented Images\n        break<\/pre>\n<h1 id=\"7376\" class=\"nv nw fp be nx ny nz gp oa ob oc gs od oe of og oh oi oj ok ol om on oo op oq bj\" data-selectable-paragraph=\"\">Different representations of data<\/h1>\n<h2 id=\"130d\" class=\"or nw fp be nx os ot ou oa ov ow ox od ni oy oz pa nm pb pc pd nq pe pf pg 
ph bj\" data-selectable-paragraph=\"\">1. Range instead of a single integer value<\/h2>\n<pre class=\"mi mj mk ml mm py px pz qa ax qb bj\"><span id=\"c13c\" class=\"or nw fp px b ia qc qd l iq qe\" data-selectable-paragraph=\"\">import pandas as pd<\/span><span id=\"b477\" class=\"or nw fp px b ia qf qd l iq qe\" data-selectable-paragraph=\"\">df = pd.DataFrame(['10-12','15-17','20-23','18-25','26-28'],columns=['Age'])<\/span><\/pre>\n<ul class=\"\">\n<li id=\"e628\" class=\"na nb fp be b gn nc nd ne gq nf ng nh ni rj nk nl nm rk no np nq rl ns nt nu rm pr ps bj\" data-selectable-paragraph=\"\"><strong class=\"be pt\">Replace with lower limit:<\/strong><\/li>\n<\/ul>\n<pre class=\"mi mj mk ml mm py px pz qa ax qb bj\"><span id=\"02c6\" class=\"or nw fp px b ia qc qd l iq qe\" data-selectable-paragraph=\"\">df['Age'].apply(lambda x: int(x.split('-')[0]))<\/span><\/pre>\n<ul class=\"\">\n<li id=\"eaf2\" class=\"na nb fp be b gn nc nd ne gq nf ng nh ni rj nk nl nm rk no np nq rl ns nt nu rm pr ps bj\" data-selectable-paragraph=\"\"><strong class=\"be pt\">Replace with upper limit:<\/strong><\/li>\n<\/ul>\n<pre class=\"mi mj mk ml mm py px pz qa ax qb bj\"><span id=\"4be0\" class=\"or nw fp px b ia qc qd l iq qe\" data-selectable-paragraph=\"\">df['Age'].apply(lambda x: int(x.split('-')[1]))<\/span><\/pre>\n<ul class=\"\">\n<li id=\"46b0\" class=\"na nb fp be b gn nc nd ne gq nf ng nh ni rj nk nl nm rk no np nq rl ns nt nu rm pr ps bj\" data-selectable-paragraph=\"\"><strong class=\"be pt\">Mean of range<\/strong>:<\/li>\n<\/ul>\n<pre class=\"mi mj mk ml mm py px pz qa ax qb bj\"><span id=\"ae95\" class=\"or nw fp px b ia qc qd l iq qe\" data-selectable-paragraph=\"\">import numpy as np\ndf['Age'].apply(lambda x: np.mean([int(v) for v in x.split('-')]))<\/span><\/pre>\n<ul class=\"\">\n<li id=\"f914\" class=\"na nb fp be b gn nc nd ne gq nf ng nh ni rj nk nl nm rk no np nq rl ns nt nu rm pr ps bj\" data-selectable-paragraph=\"\"><strong class=\"be pt\">Random value imputation from 
range<\/strong><\/li>\n<\/ul>\n<pre class=\"mi mj mk ml mm py px pz qa ax qb bj\"><span id=\"a563\" class=\"or nw fp px b ia qc qd l iq qe\" data-selectable-paragraph=\"\">import random\n# Pick a random integer within each range (both limits inclusive)\ndf['Age'].apply(lambda x: random.randint(*map(int, x.split('-'))))<\/span><\/pre>\n<h2 id=\"3953\" class=\"or nw fp be nx os ot ou oa ov ow ox od ni oy oz pa nm pb pc pd nq pe pf pg ph bj\" data-selectable-paragraph=\"\">2. Conversion using <code class=\"cw pu pv pw px b\">numerizer<\/code><\/h2>\n<pre class=\"mi mj mk ml mm py px pz qa ax qb bj\"><span id=\"9472\" class=\"or nw fp px b ia qc qd l iq qe\" data-selectable-paragraph=\"\">import pandas as pd<\/span><span id=\"dacf\" class=\"or nw fp px b ia qf qd l iq qe\" data-selectable-paragraph=\"\">df = pd.DataFrame(['Twenty Two','fifteen','Twenty','Twenty Six','Thirty'],columns=['Age'])<\/span><\/pre>\n<p id=\"6e8e\" class=\"pw-post-body-paragraph na nb fp be b gn nc nd ne gq nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu fi bj\" data-selectable-paragraph=\"\">When numerical data is written out as text, as in the column above, we can use the <code class=\"cw pu pv pw px b\">numerizer<\/code> library to quickly and simply transform our data:<\/p>\n<pre class=\"mi mj mk ml mm py px pz qa ax qb bj\"><span id=\"1b75\" class=\"or nw fp px b ia qc qd l iq qe\" data-selectable-paragraph=\"\">from numerizer import numerize\ndf['Age'].apply(lambda x: numerize(x))<\/span><\/pre>\n<figure class=\"mi mj mk ml mm mn mf mg paragraph-image\">\n<div class=\"mo mp ec mq bg mr\" tabindex=\"0\" role=\"button\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg ms mt c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:211\/1*2ZQcPjGO4eRxGUCqsRJ2EA.png\" alt=\"\" width=\"211\" height=\"246\"><\/figure><div class=\"mf mg rs\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/format:webp\/1*2ZQcPjGO4eRxGUCqsRJ2EA.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/format:webp\/1*2ZQcPjGO4eRxGUCqsRJ2EA.png 720w, 
https:\/\/miro.medium.com\/v2\/resize:fit:750\/format:webp\/1*2ZQcPjGO4eRxGUCqsRJ2EA.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/format:webp\/1*2ZQcPjGO4eRxGUCqsRJ2EA.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/format:webp\/1*2ZQcPjGO4eRxGUCqsRJ2EA.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/format:webp\/1*2ZQcPjGO4eRxGUCqsRJ2EA.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:422\/format:webp\/1*2ZQcPjGO4eRxGUCqsRJ2EA.png 422w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 211px\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/1*2ZQcPjGO4eRxGUCqsRJ2EA.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/1*2ZQcPjGO4eRxGUCqsRJ2EA.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/1*2ZQcPjGO4eRxGUCqsRJ2EA.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/1*2ZQcPjGO4eRxGUCqsRJ2EA.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/1*2ZQcPjGO4eRxGUCqsRJ2EA.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/1*2ZQcPjGO4eRxGUCqsRJ2EA.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:422\/1*2ZQcPjGO4eRxGUCqsRJ2EA.png 422w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 
100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 211px\" data-testid=\"og\"><\/picture><\/div>\n<\/div>\n<figcaption class=\"mu mv mw mf mg mx my be b bf z dw\" data-selectable-paragraph=\"\">String and numerical conversion using numerizer<\/figcaption>\n<\/figure>\n<p id=\"ddb2\" class=\"pw-post-body-paragraph na nb fp be b gn nc nd ne gq nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu fi bj\" data-selectable-paragraph=\"\">Read more in <a class=\"af mz\" href=\"https:\/\/heartbeat.comet.ml\/major-problems-of-machine-learning-datasets-part-2-ba82e551fee2\" target=\"_blank\" rel=\"noopener ugc nofollow\"><strong class=\"be pt\">Major Problems of Machine Learning Datasets: Part 2<\/strong><\/a> and <a class=\"af mz\" href=\"https:\/\/heartbeat.comet.ml\/major-problems-of-machine-learning-datasets-part-3-eae18ab40eda\" target=\"_blank\" rel=\"noopener ugc nofollow\"><strong class=\"be pt\">Major Problems of Machine Learning Datasets: Part 3<\/strong><\/a><strong class=\"be pt\">!<\/strong><\/p>\n<\/div>\n<\/div>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>Photo by JESHOOTS.COM on Unsplash Data play a key role in machine learning, and the better and more relevant data you have, the more accurate the model you will build. Getting the perfect data, however, is still a dream for many data scientists. 
A lot of data comes from web scraping, APIs and other external [&hellip;]<\/p>\n","protected":false},"author":8,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"customer_name":"","customer_description":"","customer_industry":"","customer_technologies":"","customer_logo":"","footnotes":""},"categories":[6],"tags":[],"coauthors":[140],"class_list":["post-7874","post","type-post","status-publish","format-standard","hentry","category-machine-learning"],"yoast_head":"<!-- This site is optimized with the Yoast SEO Premium plugin v25.9 (Yoast SEO v25.9) - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Major Problems of Machine Learning Datasets: Part 1 - Comet<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.comet.com\/site\/blog\/major-problems-of-machine-learning-datasets-part-1\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Major Problems of Machine Learning Datasets: Part 1\" \/>\n<meta property=\"og:description\" content=\"Photo by JESHOOTS.COM on Unsplash Data play a key role in machine learning, and the better and more relevant data you have, the more accurate the model you will build. Getting the perfect data, however, is still a dream for many data scientists. 
A lot of data comes from web scraping, APIs and other external [&hellip;]\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.comet.com\/site\/blog\/major-problems-of-machine-learning-datasets-part-1\/\" \/>\n<meta property=\"og:site_name\" content=\"Comet\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/cometdotml\" \/>\n<meta property=\"article:published_time\" content=\"2023-10-06T23:34:59+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2025-04-24T17:05:41+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/0*-sPGj0m_sjFaJBbU\" \/>\n<meta name=\"author\" content=\"Abhay Parashar\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@Cometml\" \/>\n<meta name=\"twitter:site\" content=\"@Cometml\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Abhay Parashar\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"10 minutes\" \/>\n<!-- \/ Yoast SEO Premium plugin. -->","yoast_head_json":{"title":"Major Problems of Machine Learning Datasets: Part 1 - Comet","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.comet.com\/site\/blog\/major-problems-of-machine-learning-datasets-part-1\/","og_locale":"en_US","og_type":"article","og_title":"Major Problems of Machine Learning Datasets: Part 1","og_description":"Photo by JESHOOTS.COM on Unsplash Data play a key role in machine learning, and the better and more relevant data you have, the more accurate the model you will build. Getting the perfect data, however, is still a dream for many data scientists. 
A lot of data comes from web scraping, APIs and other external [&hellip;]","og_url":"https:\/\/www.comet.com\/site\/blog\/major-problems-of-machine-learning-datasets-part-1\/","og_site_name":"Comet","article_publisher":"https:\/\/www.facebook.com\/cometdotml","article_published_time":"2023-10-06T23:34:59+00:00","article_modified_time":"2025-04-24T17:05:41+00:00","og_image":[{"url":"https:\/\/miro.medium.com\/v2\/resize:fit:700\/0*-sPGj0m_sjFaJBbU","type":"","width":"","height":""}],"author":"Abhay Parashar","twitter_card":"summary_large_image","twitter_creator":"@Cometml","twitter_site":"@Cometml","twitter_misc":{"Written by":"Abhay Parashar","Est. reading time":"10 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/www.comet.com\/site\/blog\/major-problems-of-machine-learning-datasets-part-1\/#article","isPartOf":{"@id":"https:\/\/www.comet.com\/site\/blog\/major-problems-of-machine-learning-datasets-part-1\/"},"author":{"name":"Team Comet Digital","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/6266601170c60a7a82b3e0043fbe8ddf"},"headline":"Major Problems of Machine Learning Datasets: Part 1","datePublished":"2023-10-06T23:34:59+00:00","dateModified":"2025-04-24T17:05:41+00:00","mainEntityOfPage":{"@id":"https:\/\/www.comet.com\/site\/blog\/major-problems-of-machine-learning-datasets-part-1\/"},"wordCount":1395,"publisher":{"@id":"https:\/\/www.comet.com\/site\/#organization"},"image":{"@id":"https:\/\/www.comet.com\/site\/blog\/major-problems-of-machine-learning-datasets-part-1\/#primaryimage"},"thumbnailUrl":"https:\/\/miro.medium.com\/v2\/resize:fit:700\/0*-sPGj0m_sjFaJBbU","articleSection":["Machine Learning"],"inLanguage":"en-US"},{"@type":"WebPage","@id":"https:\/\/www.comet.com\/site\/blog\/major-problems-of-machine-learning-datasets-part-1\/","url":"https:\/\/www.comet.com\/site\/blog\/major-problems-of-machine-learning-datasets-part-1\/","name":"Major Problems of Machine Learning Datasets: Part 1 
- Comet","isPartOf":{"@id":"https:\/\/www.comet.com\/site\/#website"},"primaryImageOfPage":{"@id":"https:\/\/www.comet.com\/site\/blog\/major-problems-of-machine-learning-datasets-part-1\/#primaryimage"},"image":{"@id":"https:\/\/www.comet.com\/site\/blog\/major-problems-of-machine-learning-datasets-part-1\/#primaryimage"},"thumbnailUrl":"https:\/\/miro.medium.com\/v2\/resize:fit:700\/0*-sPGj0m_sjFaJBbU","datePublished":"2023-10-06T23:34:59+00:00","dateModified":"2025-04-24T17:05:41+00:00","breadcrumb":{"@id":"https:\/\/www.comet.com\/site\/blog\/major-problems-of-machine-learning-datasets-part-1\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.comet.com\/site\/blog\/major-problems-of-machine-learning-datasets-part-1\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/blog\/major-problems-of-machine-learning-datasets-part-1\/#primaryimage","url":"https:\/\/miro.medium.com\/v2\/resize:fit:700\/0*-sPGj0m_sjFaJBbU","contentUrl":"https:\/\/miro.medium.com\/v2\/resize:fit:700\/0*-sPGj0m_sjFaJBbU"},{"@type":"BreadcrumbList","@id":"https:\/\/www.comet.com\/site\/blog\/major-problems-of-machine-learning-datasets-part-1\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.comet.com\/site\/"},{"@type":"ListItem","position":2,"name":"Major Problems of Machine Learning Datasets: Part 1"}]},{"@type":"WebSite","@id":"https:\/\/www.comet.com\/site\/#website","url":"https:\/\/www.comet.com\/site\/","name":"Comet","description":"Build Better Models 
Faster","publisher":{"@id":"https:\/\/www.comet.com\/site\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.comet.com\/site\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/www.comet.com\/site\/#organization","name":"Comet ML, Inc.","alternateName":"Comet","url":"https:\/\/www.comet.com\/site\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/#\/schema\/logo\/image\/","url":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/01\/logo_comet_square.png","contentUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/01\/logo_comet_square.png","width":310,"height":310,"caption":"Comet ML, Inc."},"image":{"@id":"https:\/\/www.comet.com\/site\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/cometdotml","https:\/\/x.com\/Cometml","https:\/\/www.youtube.com\/channel\/UCmN63HKvfXSCS-UwVwmK8Hw"]},{"@type":"Person","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/6266601170c60a7a82b3e0043fbe8ddf","name":"Team Comet Digital","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/image\/4f0c0a8cc7c0e87c636ff6a420a6647c","url":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/08\/Screen-Shot-2023-08-12-at-8.58.50-AM-96x96.png","contentUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/08\/Screen-Shot-2023-08-12-at-8.58.50-AM-96x96.png","caption":"Team Comet 
Digital"},"sameAs":["https:\/\/www.comet.ml\/"],"url":"https:\/\/www.comet.com\/site\/blog\/author\/teamcometdigital\/"}]}},"_links":{"self":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/7874","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/users\/8"}],"replies":[{"embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/comments?post=7874"}],"version-history":[{"count":1,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/7874\/revisions"}],"predecessor-version":[{"id":15504,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/7874\/revisions\/15504"}],"wp:attachment":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/media?parent=7874"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/categories?post=7874"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/tags?post=7874"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/coauthors?post=7874"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}