{"id":7876,"date":"2023-10-06T15:37:09","date_gmt":"2023-10-06T23:37:09","guid":{"rendered":"https:\/\/live-cometml.pantheonsite.io\/?p=7876"},"modified":"2025-04-24T17:05:39","modified_gmt":"2025-04-24T17:05:39","slug":"major-problems-of-machine-learning-datasets-part-2","status":"publish","type":"post","link":"https:\/\/www.comet.com\/site\/blog\/major-problems-of-machine-learning-datasets-part-2\/","title":{"rendered":"Major Problems of Machine Learning Datasets: Part 2"},"content":{"rendered":"\n<link rel=\"canonical\" href=\"https:\/\/www.comet.com\/site\/blog\/major-problems-of-machine-learning-datasets-part-2\">\n\n\n\n<div class=\"fi fj fk fl fm\">\n<div class=\"ab ca\">\n<div class=\"ch bg eu ev ew ex\">\n<figure class=\"mi mj mk ml mm mn mf mg paragraph-image\">\n<div class=\"mo mp ec mq bg mr\" tabindex=\"0\" role=\"button\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg ms mt c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/0*mJOf8ke5gq4iCFFk\" alt=\"\" width=\"700\" height=\"467\"><\/figure><div class=\"mf mg mh\"><picture><\/picture><\/div>\n<\/div><figcaption class=\"mu mv mw mf mg mx my be b bf z dw\" data-selectable-paragraph=\"\">Photo by <a class=\"af mz\" href=\"https:\/\/unsplash.com\/@elisa_ventur?utm_source=medium&amp;utm_medium=referral\" target=\"_blank\" rel=\"noopener ugc nofollow\">Elisa Ventur<\/a> on <a class=\"af mz\" href=\"https:\/\/unsplash.com\/?utm_source=medium&amp;utm_medium=referral\" target=\"_blank\" rel=\"noopener ugc nofollow\">Unsplash<\/a><\/figcaption><\/figure>\n<h1 id=\"91e8\" class=\"nw nx fp be ny nz oa gp ob oc od gs oe of og oh oi oj ok ol om on oo op oq or bj\" data-selectable-paragraph=\"\">Outliers in data<\/h1>\n<p id=\"8117\" class=\"pw-post-body-paragraph na nb fp be b gn os nd ne gq ot ng nh ni ou nk nl nm ov no np nq ow ns nt nu fi bj\" data-selectable-paragraph=\"\">Outliers are unusual data points that differ significantly from other values in the sample of a 
population. Outliers sometimes represent errors in measurement or data collection, and can have significant effects on descriptive statistics and machine learning model outcomes. There are several ways to detect outliers in our data, and here we will discuss two methods: standard deviation and box plots.<\/p>\n<p>1. <strong class=\"be nv\">Standard deviation<\/strong>: In this method, we choose a threshold expressed as a number of standard deviations from the mean (three is a common choice), and data points that fall farther from the mean than this threshold are considered outliers.<\/p>\n<p id=\"e1bf\" class=\"pw-post-body-paragraph na nb fp be b gn nc nd ne gq nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu fi bj\" data-selectable-paragraph=\"\">2. <strong class=\"be nv\">Box Plots<\/strong>: Box plots are widely used graphical representations that use the 25th, 50th (median), and 75th percentiles of a dataset\u2019s distribution to show outliers visually.<\/p>\n<h2 id=\"1ca3\" class=\"ox nx fp be ny oy oz pa ob pb pc pd oe ni pe pf pg nm ph pi pj nq pk pl pm pn bj\" data-selectable-paragraph=\"\">How To Deal With Outliers?<\/h2>\n<p id=\"800d\" class=\"pw-post-body-paragraph na nb fp be b gn os nd ne gq ot ng nh ni ou nk nl nm ov no np nq ow ns nt nu fi bj\" data-selectable-paragraph=\"\">Once we\u2019ve detected outliers, we can handle them either by removing them or by replacing them with another value.<\/p>\n<p id=\"b35c\" class=\"pw-post-body-paragraph na nb fp be b gn nc nd ne gq nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu fi bj\" data-selectable-paragraph=\"\"><strong class=\"be nv\">1. Dropping outliers: <\/strong>This is one of the simplest methods to handle outliers. Here, we remove all data points that fall outside a specified threshold or boundary. 
Typically, these thresholds are defined as a number of standard deviations from the mean, but for simplicity, in the example below, we set the lower limit at the 5th percentile and the upper limit at the 95th percentile of the values.<\/p>\n<p id=\"122d\" class=\"pw-post-body-paragraph na nb fp be b gn nc nd ne gq nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu fi bj\" data-selectable-paragraph=\"\"><strong class=\"be nv\">Disadvantage<\/strong>: The biggest disadvantage of dropping outliers is the loss of data. Outliers that are not due to error may contain valuable information.<\/p>\n<pre class=\"mi mj mk ml mm po pp pq pr ax ps bj\"><span id=\"eefb\" class=\"ox nx fp pp b ia pt pu l iq pv\" data-selectable-paragraph=\"\">import matplotlib.pyplot as plt\nimport seaborn as sns\nimport pandas as pd<\/span><span id=\"48a7\" class=\"ox nx fp pp b ia pw pu l iq pv\" data-selectable-paragraph=\"\">fig, axes = plt.subplots(1, 2)\nplt.tight_layout(pad=0.2) <\/span><span id=\"3146\" class=\"ox nx fp pp b ia pw pu l iq pv\" data-selectable-paragraph=\"\">df = pd.read_csv('<a class=\"af mz\" href=\"https:\/\/raw.githubusercontent.com\/Abhayparashar31\/datasets\/master\/titanic_with_no_nan.csv\" target=\"_blank\" rel=\"noopener ugc nofollow\">titanic_with_no_nan.csv<\/a>')\nprint(\"Before Shape:\", df.shape) <\/span><span id=\"db24\" class=\"ox nx fp pp b ia pw pu l iq pv\" data-selectable-paragraph=\"\"># Use the 5th and 95th percentiles of Age as the cut-offs\nmax_val = df.Age.quantile(0.95)\nmin_val = df.Age.quantile(0.05) <\/span><span id=\"308c\" class=\"ox nx fp pp b ia pw pu l iq pv\" data-selectable-paragraph=\"\"># Keep only the rows strictly inside the cut-offs\ndf2 = df[(df['Age'] &gt; min_val) &amp; (df['Age'] &lt; max_val)]\nprint(\"After Shape:\", df2.shape)<\/span><span id=\"34be\" class=\"ox nx fp pp b ia pw pu l iq pv\" data-selectable-paragraph=\"\">sns.boxplot(y=df['Age'], ax=axes[0])\naxes[0].title.set_text(\"Before\")\nsns.boxplot(y=df2['Age'], ax=axes[1])\naxes[1].title.set_text(\"After\")\nplt.show()<\/span><\/pre>\n<figure class=\"mi mj mk ml mm mn mf mg 
paragraph-image\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg ms mt c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:612\/1*Hv9s5NBXWseFWd3yLXZmTg.png\" alt=\"\" width=\"612\" height=\"405\"><\/figure><div class=\"mf mg px\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/format:webp\/1*Hv9s5NBXWseFWd3yLXZmTg.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/format:webp\/1*Hv9s5NBXWseFWd3yLXZmTg.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/format:webp\/1*Hv9s5NBXWseFWd3yLXZmTg.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/format:webp\/1*Hv9s5NBXWseFWd3yLXZmTg.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/format:webp\/1*Hv9s5NBXWseFWd3yLXZmTg.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/format:webp\/1*Hv9s5NBXWseFWd3yLXZmTg.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1224\/format:webp\/1*Hv9s5NBXWseFWd3yLXZmTg.png 1224w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 612px\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/1*Hv9s5NBXWseFWd3yLXZmTg.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/1*Hv9s5NBXWseFWd3yLXZmTg.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/1*Hv9s5NBXWseFWd3yLXZmTg.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/1*Hv9s5NBXWseFWd3yLXZmTg.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/1*Hv9s5NBXWseFWd3yLXZmTg.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/1*Hv9s5NBXWseFWd3yLXZmTg.png 1100w, 
https:\/\/miro.medium.com\/v2\/resize:fit:1224\/1*Hv9s5NBXWseFWd3yLXZmTg.png 1224w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 612px\" data-testid=\"og\"><\/picture><\/div>\n<\/figure>\n<p id=\"93e4\" class=\"pw-post-body-paragraph na nb fp be b gn nc nd ne gq nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu fi bj\" data-selectable-paragraph=\"\"><strong class=\"be nv\">2. Replacing outliers with custom percentiles<\/strong>: Using this method, instead of dropping values outside the threshold range, we replace values below the lower threshold with the lower threshold, and values above the upper threshold with the upper threshold.<\/p>\n<pre class=\"mi mj mk ml mm po pp pq pr ax ps bj\"><span id=\"6593\" class=\"ox nx fp pp b ia pt pu l iq pv\" data-selectable-paragraph=\"\">import matplotlib.pyplot as plt\nimport seaborn as sns\nimport pandas as pd<\/span><span id=\"bea1\" class=\"ox nx fp pp b ia pw pu l iq pv\" data-selectable-paragraph=\"\">fig, axes = plt.subplots(1, 2)\nplt.tight_layout(pad=0.2) <\/span><span id=\"fb9e\" class=\"ox nx fp pp b ia pw pu l iq pv\" data-selectable-paragraph=\"\">df = pd.read_csv('<a class=\"af mz\" href=\"https:\/\/raw.githubusercontent.com\/Abhayparashar31\/datasets\/master\/titanic_with_no_nan.csv\" target=\"_blank\" rel=\"noopener ugc nofollow\">titanic_with_no_nan.csv<\/a>')\nprint(\"Before Shape:\", df.shape) <\/span><span id=\"63fd\" class=\"ox nx fp pp b ia pw pu l iq pv\" data-selectable-paragraph=\"\">max_val = df.Age.quantile(0.95)\nmin_val = df.Age.quantile(0.05) <\/span><span id=\"dc6f\" class=\"ox nx fp pp b ia pw pu l iq pv\" 
data-selectable-paragraph=\"\"># Cap Age at the percentile limits instead of dropping rows\ndf2 = df.copy()\ndf2['Age'] = df2['Age'].clip(lower=min_val, upper=max_val)\nprint(\"After Shape:\", df2.shape)<\/span><span id=\"b00f\" class=\"ox nx fp pp b ia pw pu l iq pv\" data-selectable-paragraph=\"\">sns.boxplot(y=df['Age'], ax=axes[0])\naxes[0].title.set_text(\"Before\")\nsns.boxplot(y=df2['Age'], ax=axes[1])\naxes[1].title.set_text(\"After\")\nplt.show()<\/span><\/pre>\n<p id=\"32c5\" class=\"pw-post-body-paragraph na nb fp be b gn nc nd ne gq nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu fi bj\" data-selectable-paragraph=\"\"><strong class=\"be nv\">3. Replacing outliers using IQR<\/strong>: The interquartile range (IQR) is a measure of statistical dispersion that describes the spread of the middle 50% of the data.<\/p>\n<p id=\"d946\" class=\"pw-post-body-paragraph na nb fp be b gn nc nd ne gq nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu fi bj\" data-selectable-paragraph=\"\">The IQR is based on three quartiles: Q1, which is the 25th percentile, Q2, which is the 50th percentile, and Q3, which is the 75th percentile. Q2 is the median of the whole dataset, while Q1 and Q3 are the medians of the lower and upper halves of the data, respectively.<\/p>\n<p id=\"b6ac\" class=\"pw-post-body-paragraph na nb fp be b gn nc nd ne gq nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu fi bj\" data-selectable-paragraph=\"\">The interquartile range (IQR) is calculated by subtracting Q1 from Q3 (IQR = Q3 - Q1). 
One method of handling outliers is to place thresholds 1.5 IQRs below the first quartile (Q1) and 1.5 IQRs above the third quartile (Q3), and to replace any value outside these thresholds with the nearest threshold.<\/p>\n<pre class=\"mi mj mk ml mm po pp pq pr ax ps bj\"><span id=\"4fa9\" class=\"ox nx fp pp b ia pt pu l iq pv\" data-selectable-paragraph=\"\">import numpy as np\nimport pandas as pd\nimport seaborn as sns\nimport matplotlib.pyplot as plt\nimport warnings\nwarnings.filterwarnings(\"ignore\")\nfig, axes = plt.subplots(1, 2)\nplt.tight_layout(pad=0.2)<\/span><span id=\"32e3\" class=\"ox nx fp pp b ia pw pu l iq pv\" data-selectable-paragraph=\"\">df = pd.read_csv('data\/titanic_with_no_nan.csv')\nprint(\"Previous Shape With Outlier: \", df.shape)\nsns.boxplot(y=df['Age'], ax=axes[0])\naxes[0].title.set_text(\"Before\")<\/span><span id=\"91c9\" class=\"ox nx fp pp b ia pw pu l iq pv\" data-selectable-paragraph=\"\">Q1 = df.Age.quantile(0.25)\nQ3 = df.Age.quantile(0.75)\nprint(Q1, Q3)<\/span><span id=\"0212\" class=\"ox nx fp pp b ia pw pu l iq pv\" data-selectable-paragraph=\"\">IQR = Q3-Q1\nprint(IQR)<\/span><span id=\"09f4\" class=\"ox nx fp pp b ia pw pu l iq pv\" data-selectable-paragraph=\"\">lower_limit = Q1 - 1.5*IQR\nupper_limit = Q3 + 1.5*IQR\nprint(lower_limit, upper_limit)<\/span><span id=\"d0bc\" class=\"ox nx fp pp b ia pw pu l iq pv\" data-selectable-paragraph=\"\"># Cap values outside the limits at the nearest limit\ndf2 = df.copy()\ndf2['Age'] = np.where(df2['Age']&gt;upper_limit,upper_limit,df2['Age'])\ndf2['Age'] = np.where(df2['Age']&lt;lower_limit,lower_limit,df2['Age'])<\/span><span id=\"45ae\" class=\"ox nx fp pp b ia pw pu l iq pv\" data-selectable-paragraph=\"\">print(\"Shape After Capping Outliers:\", df2.shape)\nsns.boxplot(y=df2['Age'], ax=axes[1])\naxes[1].title.set_text(\"After\")\nplt.show()<\/span><\/pre>\n<p id=\"c13e\" class=\"pw-post-body-paragraph na nb fp be b gn nc nd ne gq nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu fi bj\" data-selectable-paragraph=\"\">There are other methods that make use of unsupervised machine learning to detect outliers, like <a class=\"af mz\" 
href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.covariance.EllipticEnvelope.html\" target=\"_blank\" rel=\"noopener ugc nofollow\">Elliptic Envelope<\/a>, <a class=\"af mz\" href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.ensemble.IsolationForest.html\" target=\"_blank\" rel=\"noopener ugc nofollow\">Isolation Forest<\/a>, <a class=\"af mz\" href=\"https:\/\/scikit-learn.org\/stable\/auto_examples\/svm\/plot_oneclass.html\" target=\"_blank\" rel=\"noopener ugc nofollow\">One-class SVM<\/a>, and <a class=\"af mz\" href=\"https:\/\/scikit-learn.org\/stable\/auto_examples\/neighbors\/plot_lof_outlier_detection.html\" target=\"_blank\" rel=\"noopener ugc nofollow\">Local Outlier Factor<\/a>. You can read about all of them <a class=\"af mz\" href=\"https:\/\/towardsdatascience.com\/4-machine-learning-techniques-for-outlier-detection-in-python-21e9cfacb81d\" target=\"_blank\" rel=\"noopener\"><strong class=\"be nv\">here<\/strong><\/a>.<\/p>\n<h1 id=\"ed13\" class=\"nw nx fp be ny nz oa gp ob oc od gs oe of og oh oi oj ok ol om on oo op oq or bj\" data-selectable-paragraph=\"\">Feature Selection<\/h1>\n<h2 id=\"aba0\" class=\"ox nx fp be ny oy oz pa ob pb pc pd oe ni pe pf pg nm ph pi pj nq pk pl pm pn bj\" data-selectable-paragraph=\"\">Unwanted features<\/h2>\n<p id=\"aeee\" class=\"pw-post-body-paragraph na nb fp be b gn os nd ne gq ot ng nh ni ou nk nl nm ov no np nq ow ns nt nu fi bj\" data-selectable-paragraph=\"\">Machine learning datasets are often collected with the help of web scraping or APIs, and are likely to contain features we are not interested in. For example, a dataset for predicting health outcomes might contain names or other PII we\u2019d rather not include in our project. Or maybe we\u2019re looking to predict income, and have found a dataset that also contains information about individuals\u2019 pets. 
In situations where your data contains features that are irrelevant or potentially unethical to use, it is better to drop all of these columns and not include them in the model-building process. Later in the article we will also discuss how to handle features that may be correlated.<\/p>\n<p id=\"72e6\" class=\"pw-post-body-paragraph na nb fp be b gn nc nd ne gq nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu fi bj\" data-selectable-paragraph=\"\">Determining which features are most relevant to a particular task is not always as straightforward as in the examples above, however. In these situations, it may be helpful to use a feature selection technique to help us identify features with less importance. These techniques often rank features based on their importance in predicting an outcome. You can then keep only the features whose importance exceeds a chosen threshold.<\/p>\n<ol class=\"\">\n<li id=\"047b\" class=\"na nb fp be b gn nc nd ne gq nf ng nh ni py nk nl nm pz no np nq qa ns nt nu qb qc qd bj\" data-selectable-paragraph=\"\"><strong class=\"be nv\">Extra Trees model<\/strong>: The Extra Trees estimator fits <code class=\"cw qe qf qg pp b\">n<\/code> randomized decision trees and averages their predictions, which produces a more generalized model that is less prone to overfitting. 
The fitted Extra Trees estimator exposes an attribute called <code class=\"cw qe qf qg pp b\">feature_importances_<\/code>, which scores each feature by how much it reduces impurity (Gini impurity by default for the classifier) when predicting the outcome.<\/li>\n<\/ol>\n<pre class=\"mi mj mk ml mm po pp pq pr ax ps bj\"><span id=\"e63a\" class=\"ox nx fp pp b ia pt pu l iq pv\" data-selectable-paragraph=\"\">from sklearn.ensemble import ExtraTreesClassifier\n# X (feature DataFrame) and y (target labels) are assumed to be defined already\nmodel=ExtraTreesClassifier()\nmodel.fit(X,y)\nprint(model.feature_importances_)<\/span><\/pre>\n<figure class=\"mi mj mk ml mm mn mf mg paragraph-image\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg ms mt c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:517\/1*Di8BMsKnUfOCCvit7HBZHw.png\" alt=\"\" width=\"517\" height=\"66\"><\/figure><div class=\"mf mg qh\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/format:webp\/1*Di8BMsKnUfOCCvit7HBZHw.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/format:webp\/1*Di8BMsKnUfOCCvit7HBZHw.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/format:webp\/1*Di8BMsKnUfOCCvit7HBZHw.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/format:webp\/1*Di8BMsKnUfOCCvit7HBZHw.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/format:webp\/1*Di8BMsKnUfOCCvit7HBZHw.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/format:webp\/1*Di8BMsKnUfOCCvit7HBZHw.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1034\/format:webp\/1*Di8BMsKnUfOCCvit7HBZHw.png 1034w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) 
and (max-width: 700px) 100vw, 517px\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/1*Di8BMsKnUfOCCvit7HBZHw.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/1*Di8BMsKnUfOCCvit7HBZHw.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/1*Di8BMsKnUfOCCvit7HBZHw.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/1*Di8BMsKnUfOCCvit7HBZHw.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/1*Di8BMsKnUfOCCvit7HBZHw.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/1*Di8BMsKnUfOCCvit7HBZHw.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1034\/1*Di8BMsKnUfOCCvit7HBZHw.png 1034w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 517px\" data-testid=\"og\"><\/picture><\/div>\n<\/figure>\n<p id=\"7231\" class=\"pw-post-body-paragraph na nb fp be b gn nc nd ne gq nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu fi bj\" data-selectable-paragraph=\"\">We can make this a bit more attractive, by plotting the results above with <code class=\"cw qe qf qg pp b\">matplotlib<\/code>.<\/p>\n<pre class=\"mi mj mk ml mm po pp pq pr ax ps bj\"><span id=\"9a79\" class=\"ox nx fp pp b ia pt pu l iq pv\" data-selectable-paragraph=\"\">from matplotlib import pyplot as plt\ntop_features=pd.Series(model.feature_importances_, index=X.columns)\ntop_features.nlargest(10).plot(kind='barh')\nplt.show()<\/span><\/pre>\n<figure class=\"mi mj mk ml mm mn mf mg paragraph-image\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg ms mt c\" role=\"presentation\" 
src=\"https:\/\/miro.medium.com\/v2\/resize:fit:434\/1*gG9Hcw0O3VZTdQahTuA7wg.png\" alt=\"\" width=\"434\" height=\"246\"><\/figure><div class=\"mf mg qi\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/format:webp\/1*gG9Hcw0O3VZTdQahTuA7wg.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/format:webp\/1*gG9Hcw0O3VZTdQahTuA7wg.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/format:webp\/1*gG9Hcw0O3VZTdQahTuA7wg.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/format:webp\/1*gG9Hcw0O3VZTdQahTuA7wg.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/format:webp\/1*gG9Hcw0O3VZTdQahTuA7wg.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/format:webp\/1*gG9Hcw0O3VZTdQahTuA7wg.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:868\/format:webp\/1*gG9Hcw0O3VZTdQahTuA7wg.png 868w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 434px\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/1*gG9Hcw0O3VZTdQahTuA7wg.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/1*gG9Hcw0O3VZTdQahTuA7wg.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/1*gG9Hcw0O3VZTdQahTuA7wg.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/1*gG9Hcw0O3VZTdQahTuA7wg.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/1*gG9Hcw0O3VZTdQahTuA7wg.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/1*gG9Hcw0O3VZTdQahTuA7wg.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:868\/1*gG9Hcw0O3VZTdQahTuA7wg.png 868w\" sizes=\"(min-resolution: 4dppx) and 
(max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 434px\" data-testid=\"og\"><\/picture><\/div>\n<\/figure>\n<p id=\"1967\" class=\"pw-post-body-paragraph na nb fp be b gn nc nd ne gq nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu fi bj\" data-selectable-paragraph=\"\"><strong class=\"be nv\">2. Mutual Information<\/strong>: In probability and information theory, mutual information measures the mutual dependence between two variables. A value of 0 suggests the two variables are completely independent of each other, and higher values indicate stronger dependence. Note that <code class=\"cw qe qf qg pp b\">mutual_info_classif<\/code> expects numerical features and a discrete (class label) target.<\/p>\n<pre class=\"mi mj mk ml mm po pp pq pr ax ps bj\"><span id=\"aade\" class=\"ox nx fp pp b ia pt pu l iq pv\" data-selectable-paragraph=\"\">from sklearn.feature_selection import mutual_info_classif\nmutual_info = mutual_info_classif(X,y)\nmutual_data = pd.Series(mutual_info,index=X.columns)\nmutual_data.sort_values(ascending=False)[:10]<\/span><\/pre>\n<figure class=\"mi mj mk ml mm mn mf mg paragraph-image\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg ms mt c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:202\/1*_y49jlIgkboDoCpPnREuRQ.png\" alt=\"\" width=\"202\" height=\"168\"><\/figure><div class=\"mf mg qj\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/format:webp\/1*_y49jlIgkboDoCpPnREuRQ.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/format:webp\/1*_y49jlIgkboDoCpPnREuRQ.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/format:webp\/1*_y49jlIgkboDoCpPnREuRQ.png 750w, 
https:\/\/miro.medium.com\/v2\/resize:fit:786\/format:webp\/1*_y49jlIgkboDoCpPnREuRQ.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/format:webp\/1*_y49jlIgkboDoCpPnREuRQ.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/format:webp\/1*_y49jlIgkboDoCpPnREuRQ.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:404\/format:webp\/1*_y49jlIgkboDoCpPnREuRQ.png 404w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 202px\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/1*_y49jlIgkboDoCpPnREuRQ.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/1*_y49jlIgkboDoCpPnREuRQ.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/1*_y49jlIgkboDoCpPnREuRQ.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/1*_y49jlIgkboDoCpPnREuRQ.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/1*_y49jlIgkboDoCpPnREuRQ.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/1*_y49jlIgkboDoCpPnREuRQ.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:404\/1*_y49jlIgkboDoCpPnREuRQ.png 404w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 202px\" 
data-testid=\"og\"><\/picture><\/div>\n<\/figure>\n<p id=\"5083\" class=\"pw-post-body-paragraph na nb fp be b gn nc nd ne gq nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu fi bj\" data-selectable-paragraph=\"\"><strong class=\"be nv\">3. <\/strong><a class=\"af mz\" href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.feature_selection.SelectKBest.html\" target=\"_blank\" rel=\"noopener ugc nofollow\"><strong class=\"be nv\">SelectKBest<\/strong><\/a>: Scikit-learn provides the <code class=\"cw qe qf qg pp b\">SelectKBest<\/code> class for selecting the most important features from a given dataset according to a user-defined scoring function (the default is the ANOVA F-value). It keeps the <code class=\"cw qe qf qg pp b\">k<\/code> features with the highest scores. Below, we use this method with the <code class=\"cw qe qf qg pp b\">chi2<\/code> scoring function, which requires non-negative feature values.<\/p>\n<pre class=\"mi mj mk ml mm po pp pq pr ax ps bj\"><span id=\"6c76\" class=\"ox nx fp pp b ia pt pu l iq pv\" data-selectable-paragraph=\"\">from sklearn.feature_selection import SelectKBest\nfrom sklearn.feature_selection import chi2<\/span><span id=\"5991\" class=\"ox nx fp pp b ia pw pu l iq pv\" data-selectable-paragraph=\"\">ordered_rank_features= SelectKBest(score_func= chi2, k= 10)\nordered_feature= ordered_rank_features.fit(X, y)<\/span><span id=\"e1a6\" class=\"ox nx fp pp b ia pw pu l iq pv\" data-selectable-paragraph=\"\">dfscores= pd.DataFrame(ordered_feature.scores_, columns= [\"Score\"])\ndfcolumns= pd.DataFrame(X.columns)<\/span><span id=\"a90e\" class=\"ox nx fp pp b ia pw pu l iq pv\" data-selectable-paragraph=\"\">features_rank= pd.concat([dfcolumns, dfscores], axis=1)\nfeatures_rank.columns= ['Features', 'Score']\ntop_10 = features_rank.nlargest(10, 'Score')<\/span><span id=\"b353\" class=\"ox nx fp pp b ia pw pu l iq pv\" data-selectable-paragraph=\"\">plt.figure(figsize= (15, 8))\nplt.bar(data= top_10,x= 'Features', height= 'Score')\nplt.show()<\/span><\/pre>\n<figure 
class=\"mi mj mk ml mm mn mf mg paragraph-image\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg ms mt c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:203\/1*--x98tQKxbOME4bk48Cekg.png\" alt=\"\" width=\"203\" height=\"297\"><\/figure><div class=\"mf mg ql\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/format:webp\/1*--x98tQKxbOME4bk48Cekg.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/format:webp\/1*--x98tQKxbOME4bk48Cekg.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/format:webp\/1*--x98tQKxbOME4bk48Cekg.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/format:webp\/1*--x98tQKxbOME4bk48Cekg.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/format:webp\/1*--x98tQKxbOME4bk48Cekg.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/format:webp\/1*--x98tQKxbOME4bk48Cekg.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:406\/format:webp\/1*--x98tQKxbOME4bk48Cekg.png 406w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 203px\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/1*--x98tQKxbOME4bk48Cekg.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/1*--x98tQKxbOME4bk48Cekg.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/1*--x98tQKxbOME4bk48Cekg.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/1*--x98tQKxbOME4bk48Cekg.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/1*--x98tQKxbOME4bk48Cekg.png 828w, 
https:\/\/miro.medium.com\/v2\/resize:fit:1100\/1*--x98tQKxbOME4bk48Cekg.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:406\/1*--x98tQKxbOME4bk48Cekg.png 406w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 203px\" data-testid=\"og\"><\/picture><\/div>\n<\/figure>\n<p id=\"04d5\" class=\"pw-post-body-paragraph na nb fp be b gn nc nd ne gq nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu fi bj\" data-selectable-paragraph=\"\">Other options for feature selection include <a class=\"af mz\" href=\"https:\/\/towardsdatascience.com\/feature-selection-in-machine-learning-using-lasso-regression-7809c7c2771a\" target=\"_blank\" rel=\"noopener\">Lasso regression<\/a>, the <a class=\"af mz\" href=\"https:\/\/towardsdatascience.com\/anova-for-feature-selection-in-machine-learning-d9305e228476\" target=\"_blank\" rel=\"noopener\">ANOVA test<\/a>, and more.<\/p>\n<\/div>\n<\/div>\n<\/div>\n\n\n\n<div class=\"fi fj fk fl fm\">\n<div class=\"ab ca\">\n<div class=\"ch bg eu ev ew ex\">\n<blockquote class=\"qu\"><p id=\"f39d\" class=\"qv qw fp be qx qy qz ra rb rc rd nu dw\" data-selectable-paragraph=\"\">Innovation and academia go hand-in-hand. 
Listen to our own CEO Gideon Mendels chat with the Stanford MLSys Seminar Series team about the future of MLOps and <a class=\"af mz\" href=\"https:\/\/www.youtube.com\/watch?v=7XCsi64HLQ8\" target=\"_blank\" rel=\"noopener ugc nofollow\">give the Comet platform a try for free<\/a>!<\/p><\/blockquote>\n<\/div>\n<\/div>\n<\/div>\n\n\n\n<div class=\"fi fj fk fl fm\">\n<div class=\"ab ca\">\n<div class=\"ch bg eu ev ew ex\">\n<h1 id=\"829a\" class=\"nw nx fp be ny nz re gp ob oc rf gs oe of rg oh oi oj rh ol om on ri op oq or bj\" data-selectable-paragraph=\"\">Correlation<\/h1>\n<p id=\"0dfb\" class=\"pw-post-body-paragraph na nb fp be b gn os nd ne gq ot ng nh ni ou nk nl nm ov no np nq ow ns nt nu fi bj\" data-selectable-paragraph=\"\">Correlation is a statistical measure that expresses the strength and direction of the relationship between two variables. Variables can be positively or negatively correlated with each other. A positive correlation occurs when an increase in variable A is associated with an increase in variable B. A negative correlation occurs when an increase in variable A is associated with a decrease in variable B.<\/p>\n<p id=\"ad0a\" class=\"pw-post-body-paragraph na nb fp be b gn nc nd ne gq nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu fi bj\" data-selectable-paragraph=\"\">Correlation values range from -1 to 1, where 1 represents perfectly positively correlated features, -1 represents perfectly negatively correlated features, and 0 indicates no linear relationship.<\/p>\n<p id=\"2c86\" class=\"pw-post-body-paragraph na nb fp be b gn nc nd ne gq nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu fi bj\" data-selectable-paragraph=\"\">Having two or more highly correlated features in our training data leads to the problem of <strong class=\"be nv\">multicollinearity<\/strong>, which can make model coefficients unstable and hard to interpret, and can hurt model performance.<\/p>\n<h2 id=\"f816\" class=\"ox nx fp be ny oy oz pa ob pb pc pd oe ni pe pf pg nm ph pi pj nq pk pl pm pn bj\" data-selectable-paragraph=\"\">How to deal with correlated features?<\/h2>\n<p 
id=\"a157\" class=\"pw-post-body-paragraph na nb fp be b gn os nd ne gq ot ng nh ni ou nk nl nm ov no np nq ow ns nt nu fi bj\" data-selectable-paragraph=\"\">The pandas DataFrame class offers the <code class=\"cw qe qf qg pp b\">corr<\/code> method, which computes the pairwise correlation of columns, excluding <code class=\"cw qe qf qg pp b\">NaN<\/code> or null values.<\/p>\n<pre class=\"mi mj mk ml mm po pp pq pr ax ps bj\"><span id=\"144c\" class=\"ox nx fp pp b ia pt pu l iq pv\" data-selectable-paragraph=\"\">DataFrame.corr(method='pearson', min_periods=1)<\/span><\/pre>\n<p id=\"99f0\" class=\"pw-post-body-paragraph na nb fp be b gn nc nd ne gq nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu fi bj\" data-selectable-paragraph=\"\">The <code class=\"cw qe qf qg pp b\">corr<\/code> method can calculate three types of correlation metrics: <a class=\"af mz\" href=\"https:\/\/en.wikipedia.org\/wiki\/Pearson_correlation_coefficient\" target=\"_blank\" rel=\"noopener ugc nofollow\">Pearson<\/a> (default), <a class=\"af mz\" href=\"https:\/\/towardsdatascience.com\/kendall-rank-correlation-explained-dee01d99c535\" target=\"_blank\" rel=\"noopener\">Kendall<\/a>, and <a class=\"af mz\" href=\"https:\/\/en.wikipedia.org\/wiki\/Spearman%27s_rank_correlation_coefficient\" target=\"_blank\" rel=\"noopener ugc nofollow\">Spearman<\/a><em class=\"qk\">.<\/em><\/p>\n<pre class=\"mi mj mk ml mm po pp pq pr ax ps bj\"><span id=\"83b3\" class=\"ox nx fp pp b ia pt pu l iq pv\" data-selectable-paragraph=\"\">import pandas as pd<\/span><span id=\"d2e0\" class=\"ox nx fp pp b ia pw pu l iq pv\" data-selectable-paragraph=\"\">import seaborn as sns\nimport matplotlib.pyplot as plt<\/span><span id=\"31eb\" class=\"ox nx fp pp b ia pw pu l iq pv\" data-selectable-paragraph=\"\">df = pd.read_csv('https:\/\/raw.githubusercontent.com\/Abhayparashar31\/datasets\/master\/titanic_with_no_nan.csv')<\/span><span id=\"b120\" class=\"ox nx fp pp b ia pw pu l iq pv\" 
data-selectable-paragraph=\"\">df = df.drop(['Name','PassengerId','Cabin'],axis=1)\ndf = pd.get_dummies(df,columns=['Embarked','Sex'])\ndf.corr()<\/span><\/pre>\n<figure class=\"mi mj mk ml mm mn mf mg paragraph-image\">\n<div class=\"mo mp ec mq bg mr\" tabindex=\"0\" role=\"button\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg ms mt c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/1*foPuFG5zrMn0W-8PTgAv5g.png\" alt=\"\" width=\"700\" height=\"262\"><\/figure><div class=\"mf mg rj\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/format:webp\/1*foPuFG5zrMn0W-8PTgAv5g.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/format:webp\/1*foPuFG5zrMn0W-8PTgAv5g.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/format:webp\/1*foPuFG5zrMn0W-8PTgAv5g.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/format:webp\/1*foPuFG5zrMn0W-8PTgAv5g.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/format:webp\/1*foPuFG5zrMn0W-8PTgAv5g.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/format:webp\/1*foPuFG5zrMn0W-8PTgAv5g.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/format:webp\/1*foPuFG5zrMn0W-8PTgAv5g.png 1400w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/1*foPuFG5zrMn0W-8PTgAv5g.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/1*foPuFG5zrMn0W-8PTgAv5g.png 720w, 
https:\/\/miro.medium.com\/v2\/resize:fit:750\/1*foPuFG5zrMn0W-8PTgAv5g.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/1*foPuFG5zrMn0W-8PTgAv5g.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/1*foPuFG5zrMn0W-8PTgAv5g.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/1*foPuFG5zrMn0W-8PTgAv5g.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/1*foPuFG5zrMn0W-8PTgAv5g.png 1400w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\" data-testid=\"og\"><\/picture><\/div>\n<\/div>\n<\/figure>\n<p id=\"9fa9\" class=\"pw-post-body-paragraph na nb fp be b gn nc nd ne gq nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu fi bj\" data-selectable-paragraph=\"\">Analyzing the correlation matrix in this format can be a bit hard. 
Let\u2019s visualize it using a seaborn heatmap instead:<\/p>\n<pre class=\"mi mj mk ml mm po pp pq pr ax ps bj\"><span id=\"6a9c\" class=\"ox nx fp pp b ia pt pu l iq pv\" data-selectable-paragraph=\"\">sns.heatmap(df.corr(),annot=True,cmap='RdYlGn',linewidths=0.2)\nfig=plt.gcf()\nfig.set_size_inches(20,12)\nplt.show()<\/span><\/pre>\n<figure class=\"mi mj mk ml mm mn mf mg paragraph-image\">\n<div class=\"mo mp ec mq bg mr\" tabindex=\"0\" role=\"button\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg ms mt c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/1*YhEk8gU08RYydePyG9beZw.png\" alt=\"\" width=\"700\" height=\"470\"><\/figure><div class=\"mf mg rk\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/format:webp\/1*YhEk8gU08RYydePyG9beZw.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/format:webp\/1*YhEk8gU08RYydePyG9beZw.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/format:webp\/1*YhEk8gU08RYydePyG9beZw.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/format:webp\/1*YhEk8gU08RYydePyG9beZw.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/format:webp\/1*YhEk8gU08RYydePyG9beZw.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/format:webp\/1*YhEk8gU08RYydePyG9beZw.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/format:webp\/1*YhEk8gU08RYydePyG9beZw.png 1400w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\"><source 
srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/1*YhEk8gU08RYydePyG9beZw.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/1*YhEk8gU08RYydePyG9beZw.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/1*YhEk8gU08RYydePyG9beZw.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/1*YhEk8gU08RYydePyG9beZw.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/1*YhEk8gU08RYydePyG9beZw.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/1*YhEk8gU08RYydePyG9beZw.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/1*YhEk8gU08RYydePyG9beZw.png 1400w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\" data-testid=\"og\"><\/picture><\/div>\n<\/div>\n<\/figure>\n<p id=\"66de\" class=\"pw-post-body-paragraph na nb fp be b gn nc nd ne gq nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu fi bj\" data-selectable-paragraph=\"\">By looking at the above heatmap, we can clearly see the highest positive correlation value is 0.41, which is between our features <code class=\"cw qe qf qg pp b\">Parch<\/code> and <code class=\"cw qe qf qg pp b\">SibSp<\/code>. 
On the other hand, the highest negative correlation value is -1, which is between the sex categories <code class=\"cw qe qf qg pp b\">female<\/code> and <code class=\"cw qe qf qg pp b\">male<\/code> (because they are encoded as complementary binary variables in this particular case).<\/p>\n<p id=\"0069\" class=\"pw-post-body-paragraph na nb fp be b gn nc nd ne gq nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu fi bj\" data-selectable-paragraph=\"\">To remove highly correlated features from our data, we can set a threshold value and filter out all the features whose absolute correlation with another feature exceeds it, for example 0.60.<\/p>\n<p id=\"c065\" class=\"pw-post-body-paragraph na nb fp be b gn nc nd ne gq nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu fi bj\" data-selectable-paragraph=\"\">Based on our data and heatmap, we should get one column from the pair (<code class=\"cw qe qf qg pp b\">Sex_male<\/code>, <code class=\"cw qe qf qg pp b\">Sex_female<\/code>) and one column from the pair (<code class=\"cw qe qf qg pp b\">Embarked_S<\/code>, <code class=\"cw qe qf qg pp b\">Embarked_C<\/code>):<\/p>\n<pre class=\"mi mj mk ml mm po pp pq pr ax ps bj\"><span id=\"d171\" class=\"ox nx fp pp b ia pt pu l iq pv\" data-selectable-paragraph=\"\"># Threshold value for the absolute correlation\nthreshold = 0.60<\/span><span id=\"53f4\" class=\"ox nx fp pp b ia pw pu l iq pv\" data-selectable-paragraph=\"\"># Empty list to store the names of columns to drop\ncol_corr = []<\/span><span id=\"3072\" class=\"ox nx fp pp b ia pw pu l iq pv\" data-selectable-paragraph=\"\"># Correlation matrix\ncorr_matrix = df.corr()<\/span><span id=\"2b38\" class=\"ox nx fp pp b ia pw pu l iq pv\" data-selectable-paragraph=\"\"># Scan the lower triangle so each pair of columns is checked once\nfor i in range(len(corr_matrix.columns)):\n  for j in range(i):\n    if abs(corr_matrix.iloc[i, j]) &gt; threshold:\n      colname = corr_matrix.columns[i]\n      # Record each column only once\n      if colname not in col_corr:\n        col_corr.append(colname)<\/span><span id=\"cc99\" class=\"ox nx fp pp b ia pw pu l iq pv\" data-selectable-paragraph=\"\">print(col_corr)<\/span><\/pre>\n<figure class=\"mi mj 
mk ml mm mn mf mg paragraph-image\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg ms mt c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:262\/1*LAEX8nboTjJTXnVElCRI6w.png\" alt=\"\" width=\"262\" height=\"31\"><\/figure><div class=\"mf mg rl\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/format:webp\/1*LAEX8nboTjJTXnVElCRI6w.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/format:webp\/1*LAEX8nboTjJTXnVElCRI6w.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/format:webp\/1*LAEX8nboTjJTXnVElCRI6w.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/format:webp\/1*LAEX8nboTjJTXnVElCRI6w.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/format:webp\/1*LAEX8nboTjJTXnVElCRI6w.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/format:webp\/1*LAEX8nboTjJTXnVElCRI6w.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:524\/format:webp\/1*LAEX8nboTjJTXnVElCRI6w.png 524w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 262px\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/1*LAEX8nboTjJTXnVElCRI6w.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/1*LAEX8nboTjJTXnVElCRI6w.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/1*LAEX8nboTjJTXnVElCRI6w.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/1*LAEX8nboTjJTXnVElCRI6w.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/1*LAEX8nboTjJTXnVElCRI6w.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/1*LAEX8nboTjJTXnVElCRI6w.png 
1100w, https:\/\/miro.medium.com\/v2\/resize:fit:524\/1*LAEX8nboTjJTXnVElCRI6w.png 524w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 262px\" data-testid=\"og\"><\/picture><\/div>\n<\/figure>\n<p id=\"05d1\" class=\"pw-post-body-paragraph na nb fp be b gn nc nd ne gq nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu fi bj\" data-selectable-paragraph=\"\">Now that we have the names of the highly correlated features (according to our pre-set threshold), we can remove them from our DataFrame using the <code class=\"cw qe qf qg pp b\">drop<\/code> method.<\/p>\n<pre class=\"mi mj mk ml mm po pp pq pr ax ps bj\"><span id=\"df37\" class=\"ox nx fp pp b ia pt pu l iq pv\" data-selectable-paragraph=\"\">df.drop(columns=col_corr, inplace=True)<\/span><\/pre>\n<\/div>\n<\/div>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>Photo by Elisa Ventur on Unsplash Outliers in data Outliers are unusual data points that differ significantly from other values in the sample of a population. Outliers sometimes represent errors in measurement or data collection, and can have significant effects on descriptive statistics and machine learning model outcomes. 
There are several ways to detect outliers [&hellip;]<\/p>\n","protected":false},"author":8,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"customer_name":"","customer_description":"","customer_industry":"","customer_technologies":"","customer_logo":"","footnotes":""},"categories":[6],"tags":[],"coauthors":[140],"class_list":["post-7876","post","type-post","status-publish","format-standard","hentry","category-machine-learning"],"yoast_head":"<!-- This site is optimized with the Yoast SEO Premium plugin v25.9 (Yoast SEO v25.9) - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Major Problems of Machine Learning Datasets: Part 2 - Comet<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.comet.com\/site\/blog\/major-problems-of-machine-learning-datasets-part-2\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Major Problems of Machine Learning Datasets: Part 2\" \/>\n<meta property=\"og:description\" content=\"Photo by Elisa Ventur on Unsplash Outliers in data Outliers are unusual data points that differ significantly from other values in the sample of a population. Outliers sometimes represent errors in measurement or data collection, and can have significant effects on descriptive statistics and machine learning model outcomes. 
There are several ways to detect outliers [&hellip;]\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.comet.com\/site\/blog\/major-problems-of-machine-learning-datasets-part-2\/\" \/>\n<meta property=\"og:site_name\" content=\"Comet\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/cometdotml\" \/>\n<meta property=\"article:published_time\" content=\"2023-10-06T23:37:09+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2025-04-24T17:05:39+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/0*mJOf8ke5gq4iCFFk\" \/>\n<meta name=\"author\" content=\"Abhay Parashar\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@Cometml\" \/>\n<meta name=\"twitter:site\" content=\"@Cometml\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Abhay Parashar\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"9 minutes\" \/>\n<!-- \/ Yoast SEO Premium plugin. -->","yoast_head_json":{"title":"Major Problems of Machine Learning Datasets: Part 2 - Comet","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.comet.com\/site\/blog\/major-problems-of-machine-learning-datasets-part-2\/","og_locale":"en_US","og_type":"article","og_title":"Major Problems of Machine Learning Datasets: Part 2","og_description":"Photo by Elisa Ventur on Unsplash Outliers in data Outliers are unusual data points that differ significantly from other values in the sample of a population. Outliers sometimes represent errors in measurement or data collection, and can have significant effects on descriptive statistics and machine learning model outcomes. 
There are several ways to detect outliers [&hellip;]","og_url":"https:\/\/www.comet.com\/site\/blog\/major-problems-of-machine-learning-datasets-part-2\/","og_site_name":"Comet","article_publisher":"https:\/\/www.facebook.com\/cometdotml","article_published_time":"2023-10-06T23:37:09+00:00","article_modified_time":"2025-04-24T17:05:39+00:00","og_image":[{"url":"https:\/\/miro.medium.com\/v2\/resize:fit:700\/0*mJOf8ke5gq4iCFFk","type":"","width":"","height":""}],"author":"Abhay Parashar","twitter_card":"summary_large_image","twitter_creator":"@Cometml","twitter_site":"@Cometml","twitter_misc":{"Written by":"Abhay Parashar","Est. reading time":"9 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/www.comet.com\/site\/blog\/major-problems-of-machine-learning-datasets-part-2\/#article","isPartOf":{"@id":"https:\/\/www.comet.com\/site\/blog\/major-problems-of-machine-learning-datasets-part-2\/"},"author":{"name":"Team Comet Digital","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/6266601170c60a7a82b3e0043fbe8ddf"},"headline":"Major Problems of Machine Learning Datasets: Part 2","datePublished":"2023-10-06T23:37:09+00:00","dateModified":"2025-04-24T17:05:39+00:00","mainEntityOfPage":{"@id":"https:\/\/www.comet.com\/site\/blog\/major-problems-of-machine-learning-datasets-part-2\/"},"wordCount":1086,"publisher":{"@id":"https:\/\/www.comet.com\/site\/#organization"},"image":{"@id":"https:\/\/www.comet.com\/site\/blog\/major-problems-of-machine-learning-datasets-part-2\/#primaryimage"},"thumbnailUrl":"https:\/\/miro.medium.com\/v2\/resize:fit:700\/0*mJOf8ke5gq4iCFFk","articleSection":["Machine Learning"],"inLanguage":"en-US"},{"@type":"WebPage","@id":"https:\/\/www.comet.com\/site\/blog\/major-problems-of-machine-learning-datasets-part-2\/","url":"https:\/\/www.comet.com\/site\/blog\/major-problems-of-machine-learning-datasets-part-2\/","name":"Major Problems of Machine Learning Datasets: Part 2 - 
Comet","isPartOf":{"@id":"https:\/\/www.comet.com\/site\/#website"},"primaryImageOfPage":{"@id":"https:\/\/www.comet.com\/site\/blog\/major-problems-of-machine-learning-datasets-part-2\/#primaryimage"},"image":{"@id":"https:\/\/www.comet.com\/site\/blog\/major-problems-of-machine-learning-datasets-part-2\/#primaryimage"},"thumbnailUrl":"https:\/\/miro.medium.com\/v2\/resize:fit:700\/0*mJOf8ke5gq4iCFFk","datePublished":"2023-10-06T23:37:09+00:00","dateModified":"2025-04-24T17:05:39+00:00","breadcrumb":{"@id":"https:\/\/www.comet.com\/site\/blog\/major-problems-of-machine-learning-datasets-part-2\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.comet.com\/site\/blog\/major-problems-of-machine-learning-datasets-part-2\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/blog\/major-problems-of-machine-learning-datasets-part-2\/#primaryimage","url":"https:\/\/miro.medium.com\/v2\/resize:fit:700\/0*mJOf8ke5gq4iCFFk","contentUrl":"https:\/\/miro.medium.com\/v2\/resize:fit:700\/0*mJOf8ke5gq4iCFFk"},{"@type":"BreadcrumbList","@id":"https:\/\/www.comet.com\/site\/blog\/major-problems-of-machine-learning-datasets-part-2\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.comet.com\/site\/"},{"@type":"ListItem","position":2,"name":"Major Problems of Machine Learning Datasets: Part 2"}]},{"@type":"WebSite","@id":"https:\/\/www.comet.com\/site\/#website","url":"https:\/\/www.comet.com\/site\/","name":"Comet","description":"Build Better Models 
Faster","publisher":{"@id":"https:\/\/www.comet.com\/site\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.comet.com\/site\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/www.comet.com\/site\/#organization","name":"Comet ML, Inc.","alternateName":"Comet","url":"https:\/\/www.comet.com\/site\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/#\/schema\/logo\/image\/","url":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/01\/logo_comet_square.png","contentUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/01\/logo_comet_square.png","width":310,"height":310,"caption":"Comet ML, Inc."},"image":{"@id":"https:\/\/www.comet.com\/site\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/cometdotml","https:\/\/x.com\/Cometml","https:\/\/www.youtube.com\/channel\/UCmN63HKvfXSCS-UwVwmK8Hw"]},{"@type":"Person","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/6266601170c60a7a82b3e0043fbe8ddf","name":"Team Comet Digital","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/image\/4f0c0a8cc7c0e87c636ff6a420a6647c","url":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/08\/Screen-Shot-2023-08-12-at-8.58.50-AM-96x96.png","contentUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/08\/Screen-Shot-2023-08-12-at-8.58.50-AM-96x96.png","caption":"Team Comet 
Digital"},"sameAs":["https:\/\/www.comet.ml\/"],"url":"https:\/\/www.comet.com\/site\/blog\/author\/teamcometdigital\/"}]}},"_links":{"self":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/7876","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/users\/8"}],"replies":[{"embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/comments?post=7876"}],"version-history":[{"count":1,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/7876\/revisions"}],"predecessor-version":[{"id":15503,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/7876\/revisions\/15503"}],"wp:attachment":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/media?parent=7876"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/categories?post=7876"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/tags?post=7876"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/coauthors?post=7876"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}