{"id":368,"date":"2020-10-28T15:40:38","date_gmt":"2020-10-28T23:40:38","guid":{"rendered":"https:\/\/live-cometml.pantheonsite.io\/?p=368"},"modified":"2025-04-24T17:30:36","modified_gmt":"2025-04-24T17:30:36","slug":"metrics-in-production-ml-models","status":"publish","type":"post","link":"https:\/\/www.comet.com\/site\/blog\/metrics-in-production-ml-models\/","title":{"rendered":"Industry Q&#038;A: Tracking Metrics for In-production ML Models"},"content":{"rendered":"\n<p class=\"wp-block-paragraph\"><em>Comet recently hosted the online panel,&nbsp;<a href=\"https:\/\/info.comet.ml\/panel-addressing-ml-challenges\/\">\u201cHow do top AI researchers from Google, Stanford and Hugging Face approach new ML problems?\u201d<\/a>&nbsp;This is the second post in a series where we recap the questions, answers, and approaches that top AI teams in the world are taking to critical machine learning challenges. You can access the&nbsp;<a href=\"https:\/\/www.comet.com\/site\/how-to-start-the-machine-learning-research-process\/\">first post here,<\/a><\/em>&nbsp;<em>and the&nbsp;<a href=\"https:\/\/www.comet.com\/site\/industry-qa-where-most-machine-learning-projects-fail\/\">second here.<\/a><\/em><\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><em>We would like to thank&nbsp;<a href=\"https:\/\/twitter.com\/ambarish_jash?lang=en\">Ambarish Jash<\/a>,&nbsp;<a href=\"https:\/\/ai.google\/\">Google<\/a><\/em>;&nbsp;<em><a href=\"https:\/\/twitter.com\/w4nderlus7?lang=en\">Piero Molino<\/a>,&nbsp;<a href=\"https:\/\/ai.stanford.edu\/\">Stanford<\/a><\/em>&nbsp;+&nbsp;<em><a href=\"https:\/\/twitter.com\/ludwig_ai\">Ludwig<\/a><\/em>;&nbsp;<em>and&nbsp;<a href=\"https:\/\/twitter.com\/sanhestpasmoi?lang=en\">Victor Sanh<\/a>,&nbsp;<a href=\"https:\/\/huggingface.co\/\">Hugging Face<\/a><\/em>;<em>&nbsp;for their participation.<\/em><\/p>\n\n\n\n<figure class=\"wp-block-embed-vimeo aligncenter wp-block-embed is-type-video is-provider-vimeo wp-embed-aspect-16-9 
wp-has-aspect-ratio\">\n<div class=\"wp-block-embed__wrapper\"><iframe loading=\"lazy\" id=\"vm-6b5c2822-70bb-4b69-9999-cccebf582942\" title=\"Industry Q&amp;amp;A - What metrics do you track for in-production ML models?\" src=\"https:\/\/player.vimeo.com\/video\/472724384?dnt=1&amp;app_id=122963\" width=\"500\" height=\"281\" frameborder=\"0\" allowfullscreen=\"allowfullscreen\" data-ready=\"true\" data-mce-fragment=\"1\"><\/iframe><\/div>\n<\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">You\u2019ve built a model. It\u2019s been trained. It\u2019s going to production. But how do you ensure it\u2019s working as expected? It\u2019s not enough to simply send your model to production. You have to understand how it performs, whether it\u2019s successful, and where adjustments need to be made. These steps are critical to the long-term success of every machine learning team. It also raises the question: what metrics should you be tracking once your model is in production?<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Gideon Mendels, Comet<\/strong><br>\nWhat do you all consider when you monitor models in production? Are you looking at distribution of features? The business OKR? Everything? Is there a certain process where you say, \u201cThis is the point where it makes sense to train a model\u201d?<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Piero Molino, Stanford &amp; Ludwig<\/strong><br>\nI can give you an example. One project at Uber that I worked on was for customer support. The model was helping customer support reps by classifying tickets and suggesting answers, the actions needed, and the templates to use.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">In that case, what we cared about was \u201chow much faster can we make customer support representatives without sacrificing accuracy?\u201d The more accurate your model is, the more you can help them be fast because the suggestions are impactful. 
But we had a situation where we could be 95-97% accurate for the top three questions, or 97%+ for a single question. In this case, being able to support three questions at 95-97% accuracy was more impactful to making customer support faster than getting that additional 2-3% increase for a single question.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">In terms of monitoring and retraining, we would run the experiments based on a certain amount of data. We\u2019d then separate the data into bins, using some as training data and others for prediction. Then we\u2019d shift the window to get an understanding of how much older data you can add to your model before it becomes noise due to the change in distribution. Eventually we understood that if we used more than 1.5 months of data, the older data would become noise. We were also able to understand that we would see a drop in prediction quality after about a month, so we learned we needed to retrain the model every month or so.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">This general approach for the data and how long you need to wait for retraining is a pretty good one, and it\u2019s dynamic. There will be months where it shifts more, and others where it shifts less.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Ambarish Jash, Google AI<br>\n<\/strong>I agree with Piero. He alluded to some long-term pull back as well. You want to know how your model ages. Does it age well or is it all noise after two weeks?<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">What happens many times is you deploy a model in the wild, all of your metrics light up, and you\u2019re very happy. But after two weeks, those metrics turn red. There could be any number of reasons. Sometimes it just doesn\u2019t age well. Other times the model is optimized for something, but the user learns to ignore things. 
The long-term pull back is important to understand.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The other item is to have a continuous retraining pipeline. This will help you understand how much fresh content you need to serve, or how much is coming to you. For example, in the restaurant recommendation game, there may be hundreds of new restaurants every week, so you need to retrain weekly. But if you\u2019re in the YouTube recommendation game, and there are millions of videos every minute, you may need to retrain a few times a day.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Freshness of content is a business metric that can drive how often you want to retrain your model.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><em>Want to watch the full panel? It\u2019s available&nbsp;<a href=\"https:\/\/info.comet.ml\/panel-addressing-ml-challenges\/\">on-demand here.<\/a><\/em><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Comet recently hosted the online panel,&nbsp;\u201cHow do top AI researchers from Google, Stanford and Hugging Face approach new ML problems?\u201d&nbsp;This is the second post in a series where we recap the questions, answers, and approaches that top AI teams in the world are taking to critical machine learning challenges. 
You can access the&nbsp;first post here,&nbsp;and [&hellip;]<\/p>\n","protected":false},"author":4,"featured_media":370,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"customer_name":"","customer_description":"","customer_industry":"","customer_technologies":"","customer_logo":"","_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[10],"tags":[],"coauthors":[109],"class_list":["post-368","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-industry"],"yoast_head":"<!-- This site is optimized with the Yoast SEO Premium plugin v25.9 (Yoast SEO v25.9) - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Q&amp;A What metrics do you track for in-production ML models?<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.comet.com\/site\/blog\/metrics-in-production-ml-models\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Industry Q&amp;A: Tracking Metrics for In-production ML Models\" \/>\n<meta property=\"og:description\" content=\"Comet recently hosted the online panel,&nbsp;\u201cHow do top AI researchers from Google, Stanford and Hugging Face approach new ML problems?\u201d&nbsp;This is the second post in a series where we recap the questions, answers, and approaches that top AI teams in the world are taking to critical machine learning challenges. 
You can access the&nbsp;first post here,&nbsp;and [&hellip;]\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.comet.com\/site\/blog\/metrics-in-production-ml-models\/\" \/>\n<meta property=\"og:site_name\" content=\"Comet\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/cometdotml\" \/>\n<meta property=\"article:published_time\" content=\"2020-10-28T23:40:38+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2025-04-24T17:30:36+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2022\/02\/Screen-Shot-2020-10-27-at-9.41.26-AM.png\" \/>\n\t<meta property=\"og:image:width\" content=\"1085\" \/>\n\t<meta property=\"og:image:height\" content=\"590\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"Ken Hoyle\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@Cometml\" \/>\n<meta name=\"twitter:site\" content=\"@Cometml\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Ken Hoyle\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"4 minutes\" \/>\n<!-- \/ Yoast SEO Premium plugin. 
-->","yoast_head_json":{"title":"Q&A What metrics do you track for in-production ML models?","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.comet.com\/site\/blog\/metrics-in-production-ml-models\/","og_locale":"en_US","og_type":"article","og_title":"Industry Q&A: Tracking Metrics for In-production ML Models","og_description":"Comet recently hosted the online panel,&nbsp;\u201cHow do top AI researchers from Google, Stanford and Hugging Face approach new ML problems?\u201d&nbsp;This is the second post in a series where we recap the questions, answers, and approaches that top AI teams in the world are taking to critical machine learning challenges. You can access the&nbsp;first post here,&nbsp;and [&hellip;]","og_url":"https:\/\/www.comet.com\/site\/blog\/metrics-in-production-ml-models\/","og_site_name":"Comet","article_publisher":"https:\/\/www.facebook.com\/cometdotml","article_published_time":"2020-10-28T23:40:38+00:00","article_modified_time":"2025-04-24T17:30:36+00:00","og_image":[{"width":1085,"height":590,"url":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2022\/02\/Screen-Shot-2020-10-27-at-9.41.26-AM.png","type":"image\/png"}],"author":"Ken Hoyle","twitter_card":"summary_large_image","twitter_creator":"@Cometml","twitter_site":"@Cometml","twitter_misc":{"Written by":"Ken Hoyle","Est. 
reading time":"4 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/www.comet.com\/site\/blog\/metrics-in-production-ml-models\/#article","isPartOf":{"@id":"https:\/\/www.comet.com\/site\/blog\/metrics-in-production-ml-models\/"},"author":{"name":"Matt Peternell","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/85aa446f8be987e848ea929ef524b67b"},"headline":"Industry Q&#038;A: Tracking Metrics for In-production ML Models","datePublished":"2020-10-28T23:40:38+00:00","dateModified":"2025-04-24T17:30:36+00:00","mainEntityOfPage":{"@id":"https:\/\/www.comet.com\/site\/blog\/metrics-in-production-ml-models\/"},"wordCount":736,"commentCount":0,"publisher":{"@id":"https:\/\/www.comet.com\/site\/#organization"},"image":{"@id":"https:\/\/www.comet.com\/site\/blog\/metrics-in-production-ml-models\/#primaryimage"},"thumbnailUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2022\/02\/Screen-Shot-2020-10-27-at-9.41.26-AM.png","articleSection":["Industry"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/www.comet.com\/site\/blog\/metrics-in-production-ml-models\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/www.comet.com\/site\/blog\/metrics-in-production-ml-models\/","url":"https:\/\/www.comet.com\/site\/blog\/metrics-in-production-ml-models\/","name":"Q&A What metrics do you track for in-production ML 
models?","isPartOf":{"@id":"https:\/\/www.comet.com\/site\/#website"},"primaryImageOfPage":{"@id":"https:\/\/www.comet.com\/site\/blog\/metrics-in-production-ml-models\/#primaryimage"},"image":{"@id":"https:\/\/www.comet.com\/site\/blog\/metrics-in-production-ml-models\/#primaryimage"},"thumbnailUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2022\/02\/Screen-Shot-2020-10-27-at-9.41.26-AM.png","datePublished":"2020-10-28T23:40:38+00:00","dateModified":"2025-04-24T17:30:36+00:00","breadcrumb":{"@id":"https:\/\/www.comet.com\/site\/blog\/metrics-in-production-ml-models\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.comet.com\/site\/blog\/metrics-in-production-ml-models\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/blog\/metrics-in-production-ml-models\/#primaryimage","url":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2022\/02\/Screen-Shot-2020-10-27-at-9.41.26-AM.png","contentUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2022\/02\/Screen-Shot-2020-10-27-at-9.41.26-AM.png","width":1085,"height":590,"caption":"Industry Q And A | Comet ML"},{"@type":"BreadcrumbList","@id":"https:\/\/www.comet.com\/site\/blog\/metrics-in-production-ml-models\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.comet.com\/site\/"},{"@type":"ListItem","position":2,"name":"Industry Q&#038;A: Tracking Metrics for In-production ML Models"}]},{"@type":"WebSite","@id":"https:\/\/www.comet.com\/site\/#website","url":"https:\/\/www.comet.com\/site\/","name":"Comet","description":"Build Better Models 
Faster","publisher":{"@id":"https:\/\/www.comet.com\/site\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.comet.com\/site\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/www.comet.com\/site\/#organization","name":"Comet ML, Inc.","alternateName":"Comet","url":"https:\/\/www.comet.com\/site\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/#\/schema\/logo\/image\/","url":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/01\/logo_comet_square.png","contentUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/01\/logo_comet_square.png","width":310,"height":310,"caption":"Comet ML, Inc."},"image":{"@id":"https:\/\/www.comet.com\/site\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/cometdotml","https:\/\/x.com\/Cometml","https:\/\/www.youtube.com\/channel\/UCmN63HKvfXSCS-UwVwmK8Hw"]},{"@type":"Person","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/85aa446f8be987e848ea929ef524b67b","name":"Matt Peternell","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/image\/da003ee51bbeeccfb95147ec69139879","url":"https:\/\/secure.gravatar.com\/avatar\/36058153d701caaf237a96d5d6fb9c2d1678325c3ed0d8e88bf5e487019a2a53?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/36058153d701caaf237a96d5d6fb9c2d1678325c3ed0d8e88bf5e487019a2a53?s=96&d=mm&r=g","caption":"Matt Peternell"},"description":"We re-implemented the architecture of this model to incorporate patient and study information. By comparing our updated model to the original Github repository, we were able to quantify the benefits of classifying by patient as opposed to classifying by individual X-ray. 
We observed a 0.0254 increase in AUROC when evaluating the DenseNet121 on patients instead of on individual scans.","sameAs":["http:\/\/atre.net"],"url":"https:\/\/www.comet.com\/site\/blog\/author\/mpeternellatre-net\/"}]}},"jetpack_featured_media_url":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2022\/02\/Screen-Shot-2020-10-27-at-9.41.26-AM.png","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/368","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/users\/4"}],"replies":[{"embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/comments?post=368"}],"version-history":[{"count":1,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/368\/revisions"}],"predecessor-version":[{"id":15692,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/368\/revisions\/15692"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/media\/370"}],"wp:attachment":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/media?parent=368"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/categories?post=368"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/tags?post=368"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/coauthors?post=368"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}