{"id":8084,"date":"2023-11-02T10:06:06","date_gmt":"2023-11-02T18:06:06","guid":{"rendered":"https:\/\/live-cometml.pantheonsite.io\/?p=8084"},"modified":"2025-04-24T17:04:48","modified_gmt":"2025-04-24T17:04:48","slug":"defining-marketing-strategy-using-comet","status":"publish","type":"post","link":"https:\/\/www.comet.com\/site\/blog\/defining-marketing-strategy-using-comet\/","title":{"rendered":"Defining Marketing Strategy Using Comet"},"content":{"rendered":"\n<figure class=\"wp-block-image mk ml mm mn mo mp mh mi paragraph-image\"><img decoding=\"async\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/0*FuhvmSLythRZgyM_\" alt=\"\"\/><figcaption class=\"wp-element-caption\">Photo by <a class=\"af nb\" href=\"https:\/\/unsplash.com\/@campaign_creators?utm_source=medium&amp;utm_medium=referral\" target=\"_blank\" rel=\"noopener ugc nofollow\">Campaign Creators<\/a> on <a class=\"af nb\" href=\"https:\/\/unsplash.com\/?utm_source=medium&amp;utm_medium=referral\" target=\"_blank\" rel=\"noopener ugc nofollow\">Unsplash<\/a><\/figcaption><\/figure>\n\n\n\n<p class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" id=\"aef4\">Without solid marketing efforts companies will have a hard time growing and sustaining their business.<\/p>\n\n\n\n<p class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" id=\"bd48\">The marketing department of any organization is crucial for building the company\u2019s brand and engaging customers with relevant content with the intention of increasing sales and revenue. 
But in order to serve their customer base best, the marketing team needs to understand them \u2014 what is it that their customers want, what do they need?<\/p>\n\n\n\n<p class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" id=\"1bc0\">If your company is able to determine this, then you can launch a targeted marketing campaign that aims to educate and engage the customer with content that is specific and tailored to their needs. If data about your customers is available, then data science can be applied to perform market segmentation.<\/p>\n\n\n\n<p class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" id=\"0e68\">Imagine that you\u2019ve been hired as a consultant for a credit card company. One of the objectives for the marketing team this quarter is to launch a targeted ad campaign. In order for marketers to launch a targeted marketing campaign, they\u2019ll need to learn about their customers: what are their spending habits, what patterns of credit usage present themselves, and so on. 
Over the years, your company has collected a lot of valuable data about how their customers use their credit card products.<\/p>\n\n\n\n<p class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" id=\"c497\">The marketing team wants to launch a campaign and, in order to target customers appropriately, the team wants to divide the customers into three to five distinctive segments.<\/p>\n\n\n\n<p class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" id=\"4eb8\">In this blog, we\u2019ll do the following activities:<\/p>\n\n\n\n<p class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" id=\"162b\">\u2022 Profile our data set<\/p>\n\n\n\n<p class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" id=\"7a33\">\u2022 Do some light data cleaning<\/p>\n\n\n\n<p class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" id=\"e836\">\u2022 Find initial segments using a clustering technique known as k-means<\/p>\n\n\n\n<p class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" id=\"325d\">\u2022 Use PCA for dimensionality reduction and use k-means to find clusters on the reduced dataset.<\/p>\n\n\n\n<p class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" id=\"1e6b\">\u2022 Experiment with autoencoders for dimensionality reduction and use k-means to find clusters on the reduced dataset.<\/p>\n\n\n\n<p class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" id=\"e3df\">We\u2019ll be tracking our experiments and logging any artifacts using <a class=\"af nb\" href=\"\/signup\" target=\"_blank\" rel=\"noopener 
ugc nofollow\">Comet<\/a>.<\/p>\n\n\n\n<p class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" id=\"02b1\">The dataset we\u2019re using for this post comes from the Kaggle Credit Card Dataset for Clustering, which you can download <a class=\"af nb\" href=\"https:\/\/www.kaggle.com\/arjunbhasin2013\/ccdata\" target=\"_blank\" rel=\"noopener ugc nofollow\">here<\/a>.<\/p>\n\n\n\n<blockquote class=\"wp-block-quote nx ny nz is-layout-flow wp-block-quote-is-layout-flow\">\n<p class=\"nc nd oa be b gp ne nf ng gs nh ni nj ob nl nm nn oc np nq nr od nt nu nv nw fk bj\" id=\"e6a9\">Go ahead and open up a fresh Jupyter notebook, adjust your window sizes according to your preference, and follow along with me.<\/p>\n<\/blockquote>\n\n\n\n<h1 class=\"wp-block-heading oe of fr be og oh oi gr oj ok ol gu om on oo op oq or os ot ou ov ow ox oy oz bj\" id=\"5b44\">Profiling our dataset<\/h1>\n\n\n\n<p class=\"pw-post-body-paragraph nc nd fr be b gp pa nf ng gs pb ni nj nk pc nm nn no pd nq nr ns pe nu nv nw fk bj\" id=\"2c55\">The dataset we\u2019re working with summarizes credit card usage behavior for roughly 9,000 active credit card holders over the previous six months. We\u2019ve got 18 behavioral features at our disposal, which are described in the table below.<\/p>\n\n\n\n<figure class=\"wp-block-image mk ml mm mn mo mp mh mi paragraph-image\"><img decoding=\"async\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/1*t9u0KBI7eJifIOz8z-5GOA.png\" alt=\"\"\/><figcaption class=\"wp-element-caption\">Description of features in raw data<\/figcaption><\/figure>\n\n\n\n<p class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" id=\"7c33\">Let\u2019s begin by downloading the dataset. 
Run the code below to download it to your workspace.<\/p>\n\n\n\n<figure class=\"mk ml mm mn mo mp\">\n<div class=\"pg iu l ee\">\n<div class=\"aec pi l\">\n<pre>%pip install gdown --upgrade --quiet\nimport gdown\n\ndef download_from_gdrive(gid, output):\n    \"\"\"Download csv file from Google.\n\n    Args:\n        gid (str): Google Drive's file ID.\n        output (str): Output filename.\n    \"\"\"\n    gdown.download(id=gid, output=output)\n\ndownload_from_gdrive(gid=\"1LIIW7rcLExMbsC5RFG87laIL_Nn_yy8Q\", output='cc_data.csv')<\/pre>\n<\/div>\n<\/div>\n<figcaption class=\"mw mx my mh mi mz na be b bf z dw\">Download raw data from Google drive<\/figcaption>\n<\/figure>\n\n\n\n<p class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" id=\"0a90\">We\u2019ll also install Comet and import it:<\/p>\n\n\n\n<figure class=\"mk ml mm mn mo mp\">\n<div class=\"pg iu l ee\">\n<div class=\"aed pi l\">\n<pre>%pip install comet_ml --quiet\nimport comet_ml\nfrom comet_ml import Experiment, Artifact<\/pre>\n<\/div>\n<\/div>\n<figcaption class=\"mw mx my mh mi mz na be b bf z dw\">Importing Comet<\/figcaption>\n<\/figure>\n\n\n\n<p class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" id=\"803b\">We\u2019ll also import all the other libraries that will help us in this project:<\/p>\n\n\n\n<figure class=\"mk ml mm mn mo mp\">\n<div class=\"pg iu l ee\">\n<div class=\"aee pi l\">\n<pre>%pip install sweetviz --quiet\nimport sweetviz\nimport pandas as pd\nimport numpy as np\nimport seaborn as sns\nimport pickle\nimport matplotlib.pyplot as plt\nfrom numpy import save\nfrom sklearn.preprocessing import StandardScaler, normalize\nfrom sklearn.cluster import KMeans\nfrom sklearn.decomposition import PCA\nfrom sklearn.metrics import silhouette_score\nfrom tensorflow.keras.layers import Input, Add, Dense, Activation\nfrom tensorflow.keras.models import Model, load_model\nfrom 
tensorflow.keras.initializers import glorot_uniform, lecun_normal\nfrom tensorflow.keras.models import Sequential\nfrom tensorflow.keras.optimizers import SGD<\/pre>\n<\/div>\n<\/div>\n<figcaption class=\"mw mx my mh mi mz na be b bf z dw\">Importing other useful libraries<\/figcaption>\n<\/figure>\n\n\n\n<p class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" id=\"f4dd\">We can now load in our dataset and inspect it:<\/p>\n\n\n\n<figure class=\"mk ml mm mn mo mp\">\n<div class=\"pg iu l ee\">\n<div class=\"aed pi l\">\n<pre>cc_df = pd.read_csv('cc_data.csv')\ncc_df.head()\ncc_df.info()<\/pre>\n<\/div>\n<\/div>\n<figcaption class=\"mw mx my mh mi mz na be b bf z dw\">Inspecting data<\/figcaption>\n<\/figure>\n\n\n\n<p class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" id=\"b088\">We\u2019ll now initialize an experiment in Comet. Note that you\u2019ll need to <a class=\"af nb\" href=\"https:\/\/www.comet.com\/docs\/rest-api\/getting-started\/\" target=\"_blank\" rel=\"noopener ugc nofollow\">pass your API key from Comet<\/a> and change the workspace name to your own.<\/p>\n\n\n\n<p class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" id=\"dae6\">Typically, your workspace name is the same as the username you registered with.<\/p>\n\n\n\n<blockquote class=\"wp-block-quote pj is-layout-flow wp-block-quote-is-layout-flow\">\n<p class=\"pk pl fr be pm pn po pp pq pr ps nw dw\" id=\"2f1f\">You\u2019ll need a Comet account to follow along. 
It\u2019s totally free, and <a class=\"af nb\" href=\"\/signup\" target=\"_blank\" rel=\"noopener ugc nofollow\">easy to sign up<\/a>.<\/p>\n<\/blockquote>\n\n\n\n<p class=\"pw-post-body-paragraph nc nd fr be b gp pt nf ng gs pu ni nj nk pv nm nn no pw nq nr ns px nu nv nw fk bj\" id=\"9b15\">Once you\u2019re up and running there, you\u2019re going to need your API key. To obtain your API key, navigate to your <a class=\"af nb\" href=\"https:\/\/www.comet.com\/\" target=\"_blank\" rel=\"noopener ugc nofollow\">Comet.ml<\/a> dashboard. In the top right corner click on your username and select <strong class=\"be py\">Settings<\/strong> from the dropdown menu. In the <strong class=\"be py\">Settings<\/strong> page, scroll down to the <strong class=\"be py\">Developer Information<\/strong> section and click \u201cGenerate API Key.\u201d<\/p>\n\n\n\n<p class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" id=\"bb59\">After running the following line of code, you\u2019ll be prompted to enter your API key.<\/p>\n\n\n\n<figure class=\"mk ml mm mn mo mp\">\n<div class=\"pg iu l ee\">\n<div class=\"aef pi l\">\n<pre>comet_ml.login()\nexperiment = Experiment(workspace='team-comet-ml', project_name='cc-clustering')<\/pre>\n<\/div>\n<\/div>\n<figcaption class=\"mw mx my mh mi mz na be b bf z dw\">Initializing Comet Experiment<\/figcaption>\n<\/figure>\n\n\n\n<p class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" id=\"5c72\">We\u2019ll now use <code class=\"cw pz qa qb qc b\">sweetviz<\/code> to perform some high level EDA and profiling of our dataset.<\/p>\n\n\n\n<figure class=\"mk ml mm mn mo mp\">\n<div class=\"pg iu l ee\">\n<div class=\"aeg pi l\">\n<pre>experiment.add_tag(\"sweetviz\")\nexperiment.set_name('profiling-data')\nreport = sweetviz.analyze(cc_df, pairwise_analysis = 
'on')\nreport.log_comet(experiment)\nexperiment.end()<\/pre>\n<\/div>\n<\/div>\n<figcaption class=\"mw mx my mh mi mz na be b bf z dw\">Comet and sweetviz for<\/figcaption>\n<\/figure>\n\n\n\n<p class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" id=\"9fd9\">By following <a class=\"af nb\" href=\"https:\/\/www.comet.com\/team-comet-ml\/cc-clustering\/238ea12bad00450da4f4893ae328649a?experiment-tab=chart&amp;showOutliers=true&amp;smoothing=0&amp;transformY=smoothing&amp;xAxis=step\" target=\"_blank\" rel=\"noopener ugc nofollow\">the link in the <\/a><code class=\"cw pz qa qb qc b\"><a class=\"af nb\" href=\"https:\/\/www.comet.com\/team-comet-ml\/cc-clustering\/238ea12bad00450da4f4893ae328649a?experiment-tab=chart&amp;showOutliers=true&amp;smoothing=0&amp;transformY=smoothing&amp;xAxis=step\" target=\"_blank\" rel=\"noopener ugc nofollow\">url<\/a><\/code><a class=\"af nb\" href=\"https:\/\/www.comet.com\/team-comet-ml\/cc-clustering\/238ea12bad00450da4f4893ae328649a?experiment-tab=chart&amp;showOutliers=true&amp;smoothing=0&amp;transformY=smoothing&amp;xAxis=step\" target=\"_blank\" rel=\"noopener ugc nofollow\"> section <\/a>and navigating to <code class=\"cw pz qa qb qc b\">HTML<\/code> on the bottom left in the panel, we can see that our profiling report was automatically saved to Comet. 
This allows us to easily share with our colleagues without them having to run this code themselves.<\/p>\n\n\n\n<p class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" id=\"eb07\">We can also view the report right here in our notebook by running the following line of code or following this <a class=\"af nb\" href=\"https:\/\/www.comet.com\/team-comet-ml\/cc-clustering\/238ea12bad00450da4f4893ae328649a?experiment-tab=html\" target=\"_blank\" rel=\"noopener ugc nofollow\">link<\/a>.<\/p>\n\n\n\n<figure class=\"mk ml mm mn mo mp\">\n<div class=\"pg iu l ee\">\n<div class=\"aeh pi l\">\n<pre>report.show_notebook(w=900, h=500, scale=0.8)<\/pre>\n<\/div>\n<\/div>\n<figcaption class=\"mw mx my mh mi mz na be b bf z dw\">You can show the EDA report right in your notebook by running the above line of code<\/figcaption>\n<\/figure>\n\n\n\n<p class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" id=\"2e14\">Some high level observations we can glean from the profiling report:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Average balance is 1,564<\/li>\n\n\n\n<li>On average balance frequency is 0.88, indicating that balances are typically frequently updated<\/li>\n\n\n\n<li>Average purchases are 1,003, though there does seem to be a wide range of purchase prices as indicated by the standard deviation of $2,137.<\/li>\n\n\n\n<li>One-off purchases have an average value of $592, though there does seem to be some large one-off purchases.<\/li>\n<\/ul>\n\n\n\n<p class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" id=\"d9aa\">I encourage you to go through the dataset on your own, exploring what seems interesting to you and seeing if you can find some interesting relationships that go beyond the surface.<\/p>\n\n\n\n<blockquote class=\"wp-block-quote nx ny nz is-layout-flow 
wp-block-quote-is-layout-flow\">\n<p class=\"nc nd oa be b gp ne nf ng gs nh ni nj ob nl nm nn oc np nq nr od nt nu nv nw fk bj\" id=\"190c\">I\u2019d encourage you to explore the characteristics of the customer(s) who made the largest one-off purchases and largest cash advances. It would also be interesting to bin the credit limit and examine how the spending habits of those with higher limits differ from those with lower limits. When it comes to exploring this dataset, you\u2019re only limited by your own creativity.<\/p>\n<\/blockquote>\n\n\n\n<p class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" id=\"95b0\">Since data exploration isn\u2019t the main focus of this post, I\u2019ll leave it up to you to uncover interesting facts and new features you could engineer. Make sure you drop a comment below to share what you\u2019ve learned and what features you decided to engineer.<\/p>\n\n\n\n<p class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" id=\"5a83\">Examining the profiling report above, we can see that we have 313 missing values for the <code class=\"cw pz qa qb qc b\">MINIMUM_PAYMENTS<\/code> feature. We also have one row where <code class=\"cw pz qa qb qc b\">CREDIT_LIMIT<\/code> is missing, so we&#8217;ll simply drop this row.<\/p>\n\n\n\n<p class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" id=\"e867\">I\u2019d be interested in looking at the distribution of the credit limit for those with missing values for minimum payments. 
My intuition is telling me that these people will have large credit limits.<\/p>\n\n\n\n<p class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" id=\"0979\">Running the following line of code, we can see a box plot of the credit limit for customers with missing minimum payments.<\/p>\n\n\n\n<figure class=\"mk ml mm mn mo mp\">\n<div class=\"pg iu l ee\">\n<div class=\"aeh pi l\">\n<pre>cc_df[cc_df['MINIMUM_PAYMENTS'].isnull()]['CREDIT_LIMIT'].plot(kind='box')<\/pre>\n<\/div>\n<\/div>\n<\/figure>\n\n\n\n<figure class=\"wp-block-image mk ml mm mn mo mp mh mi paragraph-image\"><img decoding=\"async\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:526\/1*fKauGkrUebIq1-FR9WBueQ.png\" alt=\"\"\/><\/figure>\n\n\n\n<p class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" id=\"de7a\"><a class=\"af nb\" href=\"https:\/\/gist.github.com\/harpreetsahota204\/402f3fa8bd4d45ac6d80dc0f15d9e554\" target=\"_blank\" rel=\"noopener ugc nofollow\">The credit limits for these people seem to be wide-ranging. How does this compare to those who do not have a missing value?<\/a><\/p>\n\n\n\n<figure class=\"wp-block-image mk ml mm mn mo mp mh mi paragraph-image\"><img decoding=\"async\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:526\/1*G6pY3Gj_1jdXC5Oijs1jZQ.png\" alt=\"\"\/><\/figure>\n\n\n\n<p class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" id=\"c6c5\">I don\u2019t see anything too surprising here. My initial thoughts were that folks who had no value for minimum payments would have extremely large credit limits. 
And if that were the case then perhaps the bank would expect them to pay off the entire balance every month, kind of like with those high-end American Express cards.<\/p>\n\n\n\n<p class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" id=\"a73c\">Questions like this are worth exploring and trying to answer with data. I encourage you to come up with some questions of your own and leave them in the comments below, bonus points if you can link to your notebook where we can see how you performed your analysis.<\/p>\n\n\n\n<blockquote class=\"wp-block-quote nx ny nz is-layout-flow wp-block-quote-is-layout-flow\">\n<p class=\"nc nd oa be b gp ne nf ng gs nh ni nj ob nl nm nn oc np nq nr od nt nu nv nw fk bj\" id=\"a0ab\">At this point I\u2019d encourage you to research missing data mechanisms and see if you can run some statistical tests to understand the missing data in our dataset.<\/p>\n<\/blockquote>\n\n\n\n<p class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" id=\"15d8\">Here\u2019s a handy visual from The Data Professor of various imputation techniques for the different missing data mechanisms.<\/p>\n\n\n\n<figure class=\"wp-block-image mk ml mm mn mo mp mh mi paragraph-image\"><img decoding=\"async\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/1*cxEyCc6E5B9mx3x0Jm6Eqg.jpeg\" alt=\"\"\/><figcaption class=\"wp-element-caption\">Shoutout to The Data Professor for this image.<\/figcaption><\/figure>\n\n\n\n<p class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" id=\"9c96\">I\u2019m going to make the assumption that our data is missing completely at random (MCAR) and impute the missing values with the median. 
If you\u2019re interested in learning more about the three missing data mechanisms, then check out <a class=\"af nb\" href=\"https:\/\/stefvanbuuren.name\/fimd\/sec-MCAR.html\" target=\"_blank\" rel=\"noopener ugc nofollow\">this post<\/a>.<\/p>\n\n\n\n<figure class=\"mk ml mm mn mo mp\">\n<div class=\"pg iu l ee\">\n<div class=\"aei pi l\">\n<pre># Impute missing MINIMUM_PAYMENTS values with the column median\ncc_df['MINIMUM_PAYMENTS'] = cc_df['MINIMUM_PAYMENTS'].fillna(cc_df['MINIMUM_PAYMENTS'].median())\n# Drop the single row with a missing CREDIT_LIMIT\ncc_df = cc_df[cc_df['CREDIT_LIMIT'].notnull()]<\/pre>\n<\/div>\n<\/div>\n<figcaption class=\"mw mx my mh mi mz na be b bf z dw\">Imputing missing data<\/figcaption>\n<\/figure>\n\n\n\n<p class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" id=\"935a\">The code above also drops the one row in which the credit limit is missing. We\u2019ll also drop the <code class=\"cw pz qa qb qc b\">CUST_ID<\/code> feature as it won&#8217;t be useful in determining clusters.<\/p>\n\n\n\n<p class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" id=\"479a\">Since we\u2019ve made some significant changes to our raw dataset, we\u2019ll go ahead and log our resulting dataset to Comet as an Artifact. This is useful because it allows us to share the modified dataset in a central location for our team to access. 
No more having to send files in email, Slack messages, or other such dodgy means.<\/p>\n\n\n\n<p class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" id=\"7953\">Go ahead and run the following code in your notebook:<\/p>\n\n\n\n<figure class=\"mk ml mm mn mo mp\">\n<div class=\"pg iu l ee\">\n<div class=\"aej pi l\">\n<pre>cc_df.drop(columns='CUST_ID', inplace=True)\ncc_df.to_csv('cc_df_imputed.csv')\n\n# Since k-means uses Euclidean distance, it is a good idea to scale the data\nscaler = StandardScaler()\ncreditcard_df_scaled = scaler.fit_transform(cc_df)\nsave('cc-data-scaled.npy', creditcard_df_scaled)\n\ndata_artifacts = {\n    'cc_df': {\n        'df': 'cc_df_imputed.csv',\n        'type': 'data-model',\n        'alias': ['raw-features'],\n        'metadata': {'filetype': 'csv', 'notes': 'This dataset contains median imputed values for MINIMUM_PAYMENTS'}\n    },\n    'cc_df_scaled': {\n        'df': 'cc-data-scaled.npy',\n        'type': 'numpy-array',\n        'alias': ['scaled-features'],\n        'metadata': {'filetype': 'npy', 'notes': 'Scaled dataset saved as numpy ndarray.'}\n    },\n}\n\ndef artifact_logger(artifact_dict: dict, key: str, ws: str, exp_name: str, exp_tag: str):\n    \"\"\"Log the artifact to Comet\n\n    Args:\n        artifact_dict (dict): dictionary containing metadata for artifact\n        key (str): The key from which to grab dictionary items\n        ws (str): Workspace name\n        exp_name (str): Name of the experiment on Comet\n        exp_tag (str): Experiment tag\n    \"\"\"\n    experiment = Experiment(workspace=ws, project_name=exp_name)\n    experiment.add_tag(exp_tag)\n    experiment.set_name('log_artifact_' + key)\n\n    artifact = Artifact(\n        name=key,\n        artifact_type=artifact_dict[key]['type'],\n        aliases=artifact_dict[key]['alias'],\n        metadata=artifact_dict[key]['metadata']\n    )\n\n    artifact.add(artifact_dict[key]['df'])\n\n    experiment.log_artifact(artifact)\n    experiment.end()\n\n# Log the imputed and scaled datasets to Comet as artifacts\nfor key in data_artifacts:\n    artifact_logger(data_artifacts, key, ws='team-comet-ml', 
exp_name='cc-clustering', exp_tag=\"imputed-data\")<\/pre>\n<\/div>\n<\/div>\n<figcaption class=\"mw mx my mh mi mz na be b bf z dw\">Logging Artifacts to Comet<\/figcaption>\n<\/figure>\n\n\n\n<h1 class=\"wp-block-heading oe of fr be og oh oi gr oj ok ol gu om on oo op oq or os ot ou ov ow ox oy oz bj\" id=\"aeec\">How we\u2019ll use k-means to find clusters<\/h1>\n\n\n\n<p class=\"pw-post-body-paragraph nc nd fr be b gp pa nf ng gs pb ni nj nk pc nm nn no pd nq nr ns pe nu nv nw fk bj\" id=\"b2d2\">The legendary Josh Starmer (who has also been a guest on <a class=\"af nb\" href=\"https:\/\/theartistsofdatascience.fireside.fm\/joshua-starmer-phd\" target=\"_blank\" rel=\"noopener ugc nofollow\">my podcast<\/a>) created an excellent overview of what k-means clustering is and how it works. I highly recommend <a class=\"af nb\" href=\"https:\/\/www.youtube.com\/watch?v=4b5d3muPQmA\" target=\"_blank\" rel=\"noopener ugc nofollow\">checking it out<\/a>.<\/p>\n\n\n\n<blockquote class=\"wp-block-quote nx ny nz is-layout-flow wp-block-quote-is-layout-flow\">\n<p class=\"nc nd oa be b gp ne nf ng gs nh ni nj ob nl nm nn oc np nq nr od nt nu nv nw fk bj\" id=\"ee04\">The k-means algorithm is an unsupervised learning algorithm that works by grouping similar data points together. This idea of similarity is based on the Euclidean distance between points.<\/p>\n<\/blockquote>\n\n\n\n<p class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" id=\"fa87\">At a high level, the algorithm involves five steps.<\/p>\n\n\n\n<p class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" id=\"bb1c\"><strong class=\"be py\">First<\/strong>, you choose the number of clusters you wish to identify; let\u2019s call this <code class=\"cw pz qa qb qc b\">k<\/code>. 
<strong class=\"be py\">Second<\/strong>, you select <code class=\"cw pz qa qb qc b\">k<\/code> random points to serve as the initial centroids, one per cluster. <strong class=\"be py\">Third<\/strong>, you assign each data point to the nearest centroid. This will allow you to create <code class=\"cw pz qa qb qc b\">k<\/code> clusters. <strong class=\"be py\">Fourth<\/strong>, you calculate a new centroid for each cluster as the mean of the points assigned to it. <strong class=\"be py\">Fifth<\/strong>, you reassign each data point to the nearest new centroid, repeating the fourth and fifth steps until the assignments stop changing.<\/p>\n\n\n\n<p class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" id=\"3015\">In our case we know that the marketing department wants to identify between three and five clusters, but what if you didn\u2019t know how many clusters you were looking for beforehand?<\/p>\n\n\n\n<p class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" id=\"7c8b\">Then you\u2019d want to use a technique called the elbow method. We won\u2019t discuss that in detail in this post, but you can learn more about it <a class=\"af nb\" href=\"https:\/\/www.geeksforgeeks.org\/elbow-method-for-optimal-value-of-k-in-kmeans\/\" target=\"_blank\" rel=\"noopener ugc nofollow\">here<\/a>.<\/p>\n\n\n\n<p class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" id=\"dbc0\">So, how can you tell if your clusters are good?<\/p>\n\n\n\n<p class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" id=\"c670\">There are two qualities that indicate whether your method of clustering is of high quality. 
First, you\u2019ll observe \u201chigh intra-class similarity.\u201d This means that the average squared distance of all the points within a cluster to its centroid is minimized (within clusters sum of squares, which is also captured by the <code class=\"cw pz qa qb qc b\">inertia_<\/code> attribute of the kmeans model). Second, you&#8217;ll observe &#8220;low inter-class similarity.&#8221; This means that the average squared distance between the centroids is maximized (between clusters sum of squares).<\/p>\n\n\n\n<p class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" id=\"a915\">One metric we can use to capture how good our clusters are is the <a class=\"af nb\" href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.metrics.silhouette_score.html\" target=\"_blank\" rel=\"noopener ugc nofollow\">silhouette score<\/a>. This score allows us to quantify how well samples are clustered with other samples that are similar to each other.<\/p>\n\n\n\n<p class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" id=\"59df\">According to the scikit-learn documentation:<\/p>\n\n\n\n<blockquote class=\"wp-block-quote nx ny nz is-layout-flow wp-block-quote-is-layout-flow\">\n<p class=\"nc nd oa be b gp ne nf ng gs nh ni nj ob nl nm nn oc np nq nr od nt nu nv nw fk bj\" id=\"df85\"><em class=\"fr\">The Silhouette Coefficient is calculated using the mean intra-cluster distance (<\/em><code class=\"cw pz qa qb qc b\"><em class=\"fr\">a<\/em><\/code><em class=\"fr\">) and the mean nearest-cluster distance (<\/em><code class=\"cw pz qa qb qc b\"><em class=\"fr\">b<\/em><\/code><em class=\"fr\">) for each sample.<\/em><\/p>\n\n\n\n<p class=\"nc nd oa be b gp ne nf ng gs nh ni nj ob nl nm nn oc np nq nr od nt nu nv nw fk bj\" id=\"1009\"><em class=\"fr\">The Silhouette Coefficient for a sample is <\/em><code class=\"cw pz qa qb qc b\"><em 
class=\"fr\">(b - a) \/ max(a, b)<\/em><\/code><em class=\"fr\">. To clarify, <\/em><code class=\"cw pz qa qb qc b\"><em class=\"fr\">b<\/em><\/code><em class=\"fr\"> is the distance between a sample and the nearest cluster that the sample is not a part of.<\/em><\/p>\n\n\n\n<p class=\"nc nd oa be b gp ne nf ng gs nh ni nj ob nl nm nn oc np nq nr od nt nu nv nw fk bj\" id=\"c41d\"><em class=\"fr\">The best value is 1 and the worst value is -1.<\/em><\/p>\n\n\n\n<p class=\"nc nd oa be b gp ne nf ng gs nh ni nj ob nl nm nn oc np nq nr od nt nu nv nw fk bj\" id=\"551f\"><em class=\"fr\">Values near 0 indicate overlapping clusters. Negative values generally indicate that a sample has been assigned to the wrong cluster, as a different cluster is more similar.<\/em><\/p>\n<\/blockquote>\n\n\n\n<p class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" id=\"c676\">What we\u2019ll do from here is use k-means on the full dataset to find three, four, and five clusters (we can consider this as our baseline methodology). 
Then we\u2019ll save the results to Comet.<\/p>\n\n\n\n<p class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" id=\"8d6a\">After that we\u2019ll apply PCA to our dataset to find the top two principal components (to make it easier to visualize) and use k-means to find three, four, and five clusters.<\/p>\n\n\n\n<p class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" id=\"4d2c\">Finally, we\u2019ll use an autoencoder network to perform dimensionality reduction down to two features and then use k-means to find three, four, and five clusters.<\/p>\n\n\n\n<p class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" id=\"c7ed\">The methodology that results in the best silhouette score (as close to 1 as possible) will be our chosen method. We can then use some descriptive statistics to try to understand the customer behavior for those clusters.<\/p>\n\n\n\n<p class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" id=\"ac58\">The following function defines how we will find clusters:<\/p>\n\n\n\n<figure class=\"mk ml mm mn mo mp\">\n<div class=\"pg iu l ee\">\n<div class=\"aek pi l\">\n<pre>def find_clusters(df:pd.DataFrame, file:str):\n    \"\"\"\n    Run an experiment to find 3, 4, and 5 clusters.\n    Parameters:\n    df: The dataframe on which clustering will take place\n    file: A string to help add tags, and identifying information for the experiment\n    \"\"\"\n    for k in range(3,6,1):\n        file_string = file + \"_\" + str(k)\n        experiment = Experiment(workspace='team-comet-ml', project_name='cc-clustering')\n        experiment.add_tag(file + \"_\" + str(k) + \"_clusters\")\n\n        kmeans = KMeans(k, random_state=42, algorithm='elkan', n_init = 100)\n        kmeans.fit(df)\n        # Pickle the model after fitting so the saved file contains the learned centroids\n        with open(file_string + \".pkl\", \"wb\") as f:\n            pickle.dump(kmeans, f)\n        labels = kmeans.labels_\n        clusters = 
pd.DataFrame(labels, columns = [\"cluster_label\"])\ncc_df_clusters=pd.concat([cc_df, clusters], axis=1)\ncc_df_clusters.to_csv(f'cc_df_{k}_clusters.csv')\nscore = silhouette_score(df, labels, metric='euclidean')\nmetrics = {\"silhouette_score\": score, \"inertia\": kmeans.inertia_}\n\nexperiment.log_model(file_string, file_string + \".pkl\")\nexperiment.log_parameters(k)\nexperiment.log_metrics(metrics)\nexperiment.log_table(f'cc_df_{k}_clusters.csv', tabular_data=True, headers=True)\nexperiment.end()<\/pre>\n<\/div>\n<\/div>\n<figcaption class=\"mw mx my mh mi mz na be b bf z dw\">Running an experiment on Comet to find clusters using the k-means algorithm<\/figcaption>\n<\/figure>\n\n\n\n<p class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" id=\"121e\">Clicking into each individual experiment, for example <a class=\"af nb\" href=\"https:\/\/www.comet.com\/team-comet-ml\/cc-clustering\/aa782b8353144ef9bdec0f912811473b?experiment-tab=chart&amp;showOutliers=true&amp;smoothing=0&amp;transformY=smoothing&amp;xAxis=wall\" target=\"_blank\" rel=\"noopener ugc nofollow\">this one<\/a>, we can see all that we\u2019ve logged to Comet.<\/p>\n\n\n\n<p class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" id=\"d974\">Comet automatically tracks all of our hyperparameters for us. In addition to that, we\u2019ve logged our silhouette score metric. 
If you look at the last panel, <code class=\"cw pz qa qb qc b\">Assets and Artifacts<\/code>, you&#8217;ll see that we&#8217;ve saved the resulting k-means model and a CSV file that has our cluster definitions appended to the raw data.<\/p>\n\n\n\n<h1 class=\"wp-block-heading oe of fr be og oh oi gr oj ok ol gu om on oo op oq or os ot ou ov ow ox oy oz bj\" id=\"23ce\">Using autoencoders to find clusters<\/h1>\n\n\n\n<p class=\"pw-post-body-paragraph nc nd fr be b gp pa nf ng gs pb ni nj nk pc nm nn no pd nq nr ns pe nu nv nw fk bj\" id=\"dd69\">An autoencoder is a type of neural network that is used for feature learning (representation learning).<\/p>\n\n\n\n<p class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" id=\"6f4e\">At a high level, here\u2019s how they work: an autoencoder network takes an input, breaks it down to a compressed version, and uses that to reconstruct the original input. An interesting feature of these networks is that they use the same data as both the input and the target output. These networks work by adding a bottleneck layer, which forces the network to create a compressed (or encoded) version of the original input data. The hope is that the bottleneck layer encodes useful characteristics of the original data in some compressed format. 
This works much like principal component analysis in that we can take a higher-dimensional feature space and reduce our data to a lower-dimensional latent space.<\/p>\n\n\n\n<p class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" id=\"5eea\">For more detail on how autoencoders work, I recommend <a class=\"af nb\" href=\"https:\/\/www.youtube.com\/watch?v=3jmcHZq3A5s\" target=\"_blank\" rel=\"noopener ugc nofollow\">this easy-to-understand video from WelcomeAIOverlords on YouTube<\/a>.<\/p>\n\n\n\n<figure class=\"wp-block-image mk ml mm mn mo mp mh mi paragraph-image\"><img decoding=\"async\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/1*i-DqWf-ySLZ8_gCycLH7qA.png\" alt=\"\"\/><figcaption class=\"wp-element-caption\"><a class=\"af nb\" href=\"https:\/\/miro.medium.com\/max\/829\/1*ViBG49eTCKqqO2UVRL9mEw.png\" rel=\"noopener\">Source<\/a><\/figcaption><\/figure>\n\n\n\n<p class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" id=\"ad10\">We\u2019ll experiment with one autoencoder architecture and track our experimental runs in Comet.<\/p>\n\n\n\n<h1 class=\"wp-block-heading oe of fr be og oh oi gr oj ok ol gu om on oo op oq or os ot ou ov ow ox oy oz bj\" id=\"a95f\">How to train autoencoders<\/h1>\n\n\n\n<p class=\"pw-post-body-paragraph nc nd fr be b gp pa nf ng gs pb ni nj nk pc nm nn no pd nq nr ns pe nu nv nw fk bj\" id=\"ae46\">You need to set four hyperparameters before training an autoencoder:<\/p>\n\n\n\n<h2 class=\"wp-block-heading qr of fr be og qs qt qu oj qv qw qx om nk qy qz ra no rb rc rd ns re rf rg rh bj\" id=\"da65\">Bottleneck size<\/h2>\n\n\n\n<p class=\"pw-post-body-paragraph nc nd fr be b gp pa nf ng gs pb ni nj nk pc nm nn no pd nq nr ns pe nu nv nw fk bj\" id=\"c598\">The bottleneck size, which is the width of the last layer in the encoder, is the most important hyperparameter for tuning the autoencoder.<\/p>\n\n\n\n<p 
class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" id=\"e519\">This determines the number of features that the input data will be compressed into, and it can also act as a form of regularization.<\/p>\n\n\n\n<h2 class=\"wp-block-heading qr of fr be og qs qt qu oj qv qw qx om nk qy qz ra no rb rc rd ns re rf rg rh bj\" id=\"8646\">Number of layers<\/h2>\n\n\n\n<p class=\"pw-post-body-paragraph nc nd fr be b gp pa nf ng gs pb ni nj nk pc nm nn no pd nq nr ns pe nu nv nw fk bj\" id=\"2102\">As with every neural network out there, an important hyperparameter for autoencoders is the depth of the encoder network and the depth of the decoder network.<\/p>\n\n\n\n<p class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" id=\"7bfa\">Deeper networks increase model complexity and training time, while shallower networks are faster to train.<\/p>\n\n\n\n<h2 class=\"wp-block-heading qr of fr be og qs qt qu oj qv qw qx om nk qy qz ra no rb rc rd ns re rf rg rh bj\" id=\"197f\">Number of nodes per layer<\/h2>\n\n\n\n<p class=\"pw-post-body-paragraph nc nd fr be b gp pa nf ng gs pb ni nj nk pc nm nn no pd nq nr ns pe nu nv nw fk bj\" id=\"e6cf\">The number of nodes per layer determines how many weights each layer has.<\/p>\n\n\n\n<p class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" id=\"c68e\">Typically, the number of nodes decreases with each subsequent layer in the encoder, as the input to each of these layers becomes smaller, and then increases again layer by layer in the decoder.<\/p>\n\n\n\n<h2 class=\"wp-block-heading qr of fr be og qs qt qu oj qv qw qx om nk qy qz ra no rb rc rd ns re rf rg rh bj\" id=\"c25e\">Loss<\/h2>\n\n\n\n<p class=\"pw-post-body-paragraph nc nd fr be b gp pa nf ng gs pb ni nj nk pc nm nn no pd nq nr ns pe nu nv nw fk bj\" id=\"95c3\">The loss function you use to train the autoencoder will depend on the 
type of input and output you want the autoencoder to learn a representation for.<\/p>\n\n\n\n<p class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" id=\"ea2c\">Since we\u2019re working with tabular data, the most popular loss functions for reconstruction are MSE Loss and L1 Loss.<\/p>\n\n\n\n<h1 class=\"wp-block-heading oe of fr be og oh oi gr oj ok ol gu om on oo op oq or os ot ou ov ow ox oy oz bj\" id=\"2281\">Encoder<\/h1>\n\n\n\n<p class=\"pw-post-body-paragraph nc nd fr be b gp pa nf ng gs pb ni nj nk pc nm nn no pd nq nr ns pe nu nv nw fk bj\" id=\"70d0\">Let\u2019s define the encoder part of the model, which compresses input data into an encoded representation that has fewer features than the original data.<\/p>\n\n\n\n<p class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" id=\"68cf\">It takes an array of size 17 (because that\u2019s how many features our full dataset has) as input and passes it through a multi-layer dense network. The final layer of the encoder has only two neurons, so this layer is expected to represent each given example with two floating-point numbers. We\u2019ll use the <code class=\"cw pz qa qb qc b\"><a class=\"af nb\" href=\"https:\/\/arxiv.org\/abs\/1706.02515v5\" target=\"_blank\" rel=\"noopener ugc nofollow\">selu<\/a><\/code> activation function with the <code class=\"cw pz qa qb qc b\">lecun_normal<\/code> kernel initializer.<\/p>\n\n\n\n<p class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" id=\"3fb5\">The final layer in the encoder network is the \u201cbottleneck\u201d layer that will contain the compressed representation of our full feature set. 
This is the most important part of this network.<\/p>\n\n\n\n<p class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" id=\"451b\">Note that this is a point where you can experiment.<\/p>\n\n\n\n<p class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" id=\"66cd\">I encourage you to play around with this code and try different numbers of layers, different numbers of neurons in each layer, different activation functions, and different kernel initializers.<\/p>\n\n\n\n<blockquote class=\"wp-block-quote pj is-layout-flow wp-block-quote-is-layout-flow\">\n<p class=\"pk pl fr be pm pn po pp pq pr ps nw dw\" id=\"de74\">When you play around with the code and log your experiment to <a class=\"af nb\" href=\"https:\/\/www.comet.com\/site\/\" target=\"_blank\" rel=\"noopener ugc nofollow\">Comet<\/a>, you won\u2019t have to remember all the details yourself.<\/p>\n<\/blockquote>\n\n\n\n<p class=\"pw-post-body-paragraph nc nd fr be b gp pt nf ng gs pu ni nj nk pv nm nn no pw nq nr ns px nu nv nw fk bj\" id=\"b546\">You simply build out the network, take the resulting dataset, use the <code class=\"cw pz qa qb qc b\">find_clusters<\/code> function to find clusters, and log results to Comet.<\/p>\n\n\n\n<figure class=\"mk ml mm mn mo mp\">\n<div class=\"pg iu l ee\">\n<div class=\"ael pi l\">\n<pre>encoder_network = Sequential(\n    [\n        Dense(17, activation=\"selu\", kernel_initializer='lecun_normal'),\n        Dense(8, activation=\"selu\", kernel_initializer='lecun_normal'),\n        Dense(4, activation=\"selu\", kernel_initializer='lecun_normal'),\n        Dense(2, activation=\"selu\", kernel_initializer='lecun_normal'),\n    ]\n)<\/pre>\n<\/div>\n<\/div>\n<figcaption class=\"mw mx my mh mi mz na be b bf z dw\">Defining the layers of the encoder network<\/figcaption>\n<\/figure>\n\n\n\n<h1 class=\"wp-block-heading oe of fr be og oh oi gr oj ok ol gu om on oo op oq or os ot 
ou ov ow ox oy oz bj\" id=\"84d5\">Decoder<\/h1>\n\n\n\n<p class=\"pw-post-body-paragraph nc nd fr be b gp pa nf ng gs pb ni nj nk pc nm nn no pd nq nr ns pe nu nv nw fk bj\" id=\"c14b\">The decoder part of the autoencoder network is typically a mirror image of the encoder model.<\/p>\n\n\n\n<p class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" id=\"2681\">The decoder takes the reduced feature space as its input and reconstructs the original features by expanding the dimensions out to the original number of inputs. It essentially decompresses the information that was captured in the encoded data. The end result in this case will be an array of dimension 17 as output. This output array is expected to be similar to the input array.<\/p>\n\n\n\n<p class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" id=\"b00b\">As with the encoder part of the network, I encourage you to try different values for the various hyperparameters.<\/p>\n\n\n\n<figure class=\"mk ml mm mn mo mp\">\n<div class=\"pg iu l ee\">\n<div class=\"ael pi l\">\n<pre>decoder_network = Sequential(\n    [\n        Dense(2, activation=\"selu\", kernel_initializer='lecun_normal'),\n        Dense(4, activation=\"selu\", kernel_initializer='lecun_normal'),\n        Dense(8, activation=\"selu\", kernel_initializer='lecun_normal'),\n        Dense(17, activation=\"selu\", kernel_initializer='lecun_normal'),\n    ]\n)<\/pre>\n<\/div>\n<\/div>\n<figcaption class=\"mw mx my mh mi mz na be b bf z dw\">Defining the layers of the decoder network, which mirror the encoder.<\/figcaption>\n<\/figure>\n\n\n\n<p class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" id=\"b5fd\">We can go ahead and log this experiment to Comet with the following code:<\/p>\n\n\n\n<figure class=\"mk ml mm mn mo mp\">\n<div class=\"pg iu l ee\">\n<div class=\"aem pi 
l\">\n<pre>experiment = Experiment(workspace='team-comet-ml', project_name='cc-clustering')\nexperiment.add_tag(\"autoencoder\")\n\nautoencoder_network = Sequential([encoder_network, decoder_network])\nautoencoder_network.compile(optimizer= 'adam', loss='mean_squared_error')\nautoencoder_network.fit(creditcard_df_scaled, creditcard_df_scaled, batch_size = 128, epochs = 150,  verbose = 0)\n\npred_df = pd.DataFrame(encoder_network.predict(creditcard_df_scaled), columns=['encoding1', 'encoding2'])\npred_df.to_csv('encoded_df.csv')\n\nautoencoder_network.save_weights('autoencoder.h5')\nexperiment.log_model(\"autoencoder\", \"autoencoder.h5\")\n\nexperiment.log_table(\"encoded_df.csv\", tabular_data=True, headers=True)\nexperiment.end()<\/pre>\n<\/div>\n<\/div>\n<figcaption class=\"mw mx my mh mi mz na be b bf z dw\">Training the autoencoder and logging the experiment to Comet<\/figcaption>\n<\/figure>\n\n\n\n<p class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" id=\"304f\"><a class=\"af nb\" href=\"https:\/\/www.comet.com\/team-comet-ml\/cc-clustering\/view\/new\/experiments\" target=\"_blank\" rel=\"noopener ugc nofollow\">Examining the output from our experimental runs<\/a>, we can see that the best silhouette score (the value closest to 1) is 0.61, which is achieved using the autoencoder with three clusters.<\/p>\n\n\n\n<p class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" id=\"0725\">As a starting point for understanding our clusters, we can examine the average values of all our features for each cluster, which we can then pass along to the marketing team for further analysis.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><span id=\"f612\" class=\"qr of fr qc b ic rm rn l is ro\" data-selectable-paragraph=\"\">cc_df_3_clusters.groupby('cluster_label').agg('mean')<\/span><\/pre>\n\n\n\n<p class=\"pw-post-body-paragraph nc nd fr be b gp ne nf 
ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" id=\"45f8\">That\u2019s all there is to it!<\/p>\n\n\n\n<p class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" id=\"3f2e\">You\u2019ve done a lot in this mini lesson. Let\u2019s recap:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Performed some automated data exploration and light data cleaning.<\/li>\n\n\n\n<li>Built baseline clustering models using the k-means algorithm with the full feature set.<\/li>\n\n\n\n<li>Used PCA to reduce the number of dimensions to two and applied k-means clustering to the resulting dataset.<\/li>\n\n\n\n<li>Built an autoencoder network to perform dimensionality reduction and applied k-means to find clusters.<\/li>\n\n\n\n<li>Determined that using the autoencoder network with three clusters resulted in the best silhouette score.<\/li>\n<\/ul>\n\n\n\n<p class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" id=\"5581\">Go ahead and play around with the code examples here: build a different network with a different number of layers, try various activation functions and kernel initializers, try finding more clusters, or perform some feature engineering or feature selection.<\/p>\n\n\n\n<p class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" id=\"28f9\">I\u2019m excited to see what you come up with, so be sure to share your results in the comments!<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Without solid marketing efforts companies will have a hard time growing and sustaining their business. The marketing department of any organization is crucial for building the company\u2019s brand and engaging customers with relevant content with the intention of increasing sales and revenue. 
But in order to serve their customer base best, the marketing team needs [&hellip;]<\/p>\n","protected":false},"author":68,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"customer_name":"","customer_description":"","customer_industry":"","customer_technologies":"","customer_logo":"","footnotes":""},"categories":[9],"tags":[],"coauthors":[166],"class_list":["post-8084","post","type-post","status-publish","format-standard","hentry","category-product"],"yoast_head":"<!-- This site is optimized with the Yoast SEO Premium plugin v25.9 (Yoast SEO v25.9) - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Defining Marketing Strategy Using Comet - Comet<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.comet.com\/site\/blog\/defining-marketing-strategy-using-comet\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Defining Marketing Strategy Using Comet\" \/>\n<meta property=\"og:description\" content=\"Without solid marketing efforts companies will have a hard time growing and sustaining their business. The marketing department of any organization is crucial for building the company\u2019s brand and engaging customers with relevant content with the intention of increasing sales and revenue. 
But in order to serve their customer base best, the marketing team needs [&hellip;]\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.comet.com\/site\/blog\/defining-marketing-strategy-using-comet\" \/>\n<meta property=\"og:site_name\" content=\"Comet\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/cometdotml\" \/>\n<meta property=\"article:published_time\" content=\"2023-11-02T18:06:06+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2025-04-24T17:04:48+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/0*FuhvmSLythRZgyM_\" \/>\n<meta name=\"author\" content=\"Harpreet Sahota\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@Cometml\" \/>\n<meta name=\"twitter:site\" content=\"@Cometml\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Harpreet Sahota\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"16 minutes\" \/>\n<!-- \/ Yoast SEO Premium plugin. -->","yoast_head_json":{"title":"Defining Marketing Strategy Using Comet - Comet","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.comet.com\/site\/blog\/defining-marketing-strategy-using-comet","og_locale":"en_US","og_type":"article","og_title":"Defining Marketing Strategy Using Comet","og_description":"Without solid marketing efforts companies will have a hard time growing and sustaining their business. The marketing department of any organization is crucial for building the company\u2019s brand and engaging customers with relevant content with the intention of increasing sales and revenue. 
But in order to serve their customer base best, the marketing team needs [&hellip;]","og_url":"https:\/\/www.comet.com\/site\/blog\/defining-marketing-strategy-using-comet","og_site_name":"Comet","article_publisher":"https:\/\/www.facebook.com\/cometdotml","article_published_time":"2023-11-02T18:06:06+00:00","article_modified_time":"2025-04-24T17:04:48+00:00","og_image":[{"url":"https:\/\/miro.medium.com\/v2\/resize:fit:700\/0*FuhvmSLythRZgyM_","type":"","width":"","height":""}],"author":"Harpreet Sahota","twitter_card":"summary_large_image","twitter_creator":"@Cometml","twitter_site":"@Cometml","twitter_misc":{"Written by":"Harpreet Sahota","Est. reading time":"16 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/www.comet.com\/site\/blog\/defining-marketing-strategy-using-comet#article","isPartOf":{"@id":"https:\/\/www.comet.com\/site\/blog\/defining-marketing-strategy-using-comet\/"},"author":{"name":"Harpreet Sahota","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/46036ab474aa916e2873daece26a28d6"},"headline":"Defining Marketing Strategy Using Comet","datePublished":"2023-11-02T18:06:06+00:00","dateModified":"2025-04-24T17:04:48+00:00","mainEntityOfPage":{"@id":"https:\/\/www.comet.com\/site\/blog\/defining-marketing-strategy-using-comet\/"},"wordCount":2906,"publisher":{"@id":"https:\/\/www.comet.com\/site\/#organization"},"image":{"@id":"https:\/\/www.comet.com\/site\/blog\/defining-marketing-strategy-using-comet#primaryimage"},"thumbnailUrl":"https:\/\/miro.medium.com\/v2\/resize:fit:700\/0*FuhvmSLythRZgyM_","articleSection":["Product"],"inLanguage":"en-US"},{"@type":"WebPage","@id":"https:\/\/www.comet.com\/site\/blog\/defining-marketing-strategy-using-comet\/","url":"https:\/\/www.comet.com\/site\/blog\/defining-marketing-strategy-using-comet","name":"Defining Marketing Strategy Using Comet - 
Comet","isPartOf":{"@id":"https:\/\/www.comet.com\/site\/#website"},"primaryImageOfPage":{"@id":"https:\/\/www.comet.com\/site\/blog\/defining-marketing-strategy-using-comet#primaryimage"},"image":{"@id":"https:\/\/www.comet.com\/site\/blog\/defining-marketing-strategy-using-comet#primaryimage"},"thumbnailUrl":"https:\/\/miro.medium.com\/v2\/resize:fit:700\/0*FuhvmSLythRZgyM_","datePublished":"2023-11-02T18:06:06+00:00","dateModified":"2025-04-24T17:04:48+00:00","breadcrumb":{"@id":"https:\/\/www.comet.com\/site\/blog\/defining-marketing-strategy-using-comet#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.comet.com\/site\/blog\/defining-marketing-strategy-using-comet"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/blog\/defining-marketing-strategy-using-comet#primaryimage","url":"https:\/\/miro.medium.com\/v2\/resize:fit:700\/0*FuhvmSLythRZgyM_","contentUrl":"https:\/\/miro.medium.com\/v2\/resize:fit:700\/0*FuhvmSLythRZgyM_"},{"@type":"BreadcrumbList","@id":"https:\/\/www.comet.com\/site\/blog\/defining-marketing-strategy-using-comet#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.comet.com\/site\/"},{"@type":"ListItem","position":2,"name":"Defining Marketing Strategy Using Comet"}]},{"@type":"WebSite","@id":"https:\/\/www.comet.com\/site\/#website","url":"https:\/\/www.comet.com\/site\/","name":"Comet","description":"Build Better Models Faster","publisher":{"@id":"https:\/\/www.comet.com\/site\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.comet.com\/site\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/www.comet.com\/site\/#organization","name":"Comet ML, 
Inc.","alternateName":"Comet","url":"https:\/\/www.comet.com\/site\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/#\/schema\/logo\/image\/","url":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/01\/logo_comet_square.png","contentUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/01\/logo_comet_square.png","width":310,"height":310,"caption":"Comet ML, Inc."},"image":{"@id":"https:\/\/www.comet.com\/site\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/cometdotml","https:\/\/x.com\/Cometml","https:\/\/www.youtube.com\/channel\/UCmN63HKvfXSCS-UwVwmK8Hw"]},{"@type":"Person","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/46036ab474aa916e2873daece26a28d6","name":"Harpreet Sahota","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/image\/2d21512be19ba7e19a71a803309e2a88","url":"https:\/\/secure.gravatar.com\/avatar\/a6ca5a533fc9f143a0a7428037ff652aa0633d66bf27e76ae89b955ae72a0f2d?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/a6ca5a533fc9f143a0a7428037ff652aa0633d66bf27e76ae89b955ae72a0f2d?s=96&d=mm&r=g","caption":"Harpreet 
Sahota"},"url":"https:\/\/www.comet.com\/site\/blog\/author\/theartistsofdatasciencegmail-com\/"}]}},"_links":{"self":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/8084","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/users\/68"}],"replies":[{"embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/comments?post=8084"}],"version-history":[{"count":1,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/8084\/revisions"}],"predecessor-version":[{"id":15467,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/8084\/revisions\/15467"}],"wp:attachment":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/media?parent=8084"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/categories?post=8084"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/tags?post=8084"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/coauthors?post=8084"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}