{"id":3792,"date":"2022-08-15T08:46:59","date_gmt":"2022-08-15T16:46:59","guid":{"rendered":"https:\/\/live-cometml.pantheonsite.io\/?p=3792"},"modified":"2025-04-29T12:22:00","modified_gmt":"2025-04-29T12:22:00","slug":"reclist-the-better-way-to-evaluate-recommender-systems","status":"publish","type":"post","link":"https:\/\/www.comet.com\/site\/blog\/reclist-the-better-way-to-evaluate-recommender-systems\/","title":{"rendered":"RecList: The better way to evaluate recommender systems"},"content":{"rendered":"\n<h2 class=\"wp-block-heading\"><span style=\"font-weight: 400;\">How the team behind RecList is moving ML forward<\/span><\/h2>\n\n\n\n<p><span style=\"font-weight: 400;\">When it comes to evaluating ML models, there\u2019s debate about which metrics are the best to check and optimize for. There\u2019s always another F1 or mAP score. There\u2019s also a very healthy debate about how the metrics should be customized for their respective use cases. This debate exists because of how complex the real world is. We strive to get the best out of ML so that it delivers great end-user experiences and reaps the business ROI that our stakeholders are looking for.<\/span><\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">While measuring the performance of the model is a core activity, as a community, we don\u2019t have it all figured out yet. That\u2019s okay. We move ML forward by working together.<\/span><\/p>\n\n\n\n<h3 class=\"wp-block-heading\">The challenge with recommender systems<\/h3>\n\n\n\n<p><span style=\"font-weight: 400;\">With model evaluation, the typical toolkit often involves looking at a variety of metrics. Depending on the project and use case, some metrics will be more relevant than others. 
The truly rigorous evaluations will also ensure good performance on unseen data, review for overfitting and underfitting, describe the complete performance of a model and set up data drift detection.&nbsp;&nbsp;<\/span><\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">What\u2019s still missing from this is a rounded evaluation. As we are all painfully aware, the metrics rarely tell the whole story. No single number will help us catch silent failures or avoid racial bias or reveal all the intricacies of data or concept drift. Moreover, the content of your test set may seriously overestimate the performance of your model in the real world: <\/span><a href=\"https:\/\/aclanthology.org\/2020.acl-main.442\/\"><span style=\"font-weight: 400;\">researchers in NLP<\/span><\/a><span style=\"font-weight: 400;\"> found that state-of-the-art models with \u201chuman performance\u201d actually fail at very simple NLP tasks. If you were to only know their accuracy, you would significantly misjudge the ability of the model to generalize and not produce harmful responses.<\/span><\/p>\n\n\n\n<h3 class=\"wp-block-heading\">What is RecList<\/h3>\n\n\n\n<p><span style=\"font-weight: 400;\">Of all ML systems in production, recommender systems are arguably some of the most impactful ones. They help us navigate most aspects of our digital life from what movies to watch, what book to read, what shoes to buy for that special handbag, and what news articles to open. How can we be sure (or \u201cmore sure\u201d) that recommender systems in production generalize properly? <\/span><\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">This is where <\/span><a href=\"https:\/\/reclist.io\/\"><span style=\"font-weight: 400;\">RecList<\/span><\/a><span style=\"font-weight: 400;\"> comes in. It\u2019s an open source library with plug-and-play test cases and datasets that make it easy to scale up behavioral testing. 
Behavioral testing is not new, but this project does provide another great tool for your model evaluation toolbox. It lets anyone test their models against a wide variety of metrics, which provides a more holistic evaluation of model performance. It\u2019s designed for recommender systems, with ready-made connectors for popular datasets in the field. In the future, it could be applied to other types of models as well. How cool is that?<\/span><\/p>\n\n\n\n<h5 class=\"wp-block-heading\"><span style=\"font-weight: 400;\">RecList is built on two fundamental principles: <\/span><\/h5>\n\n\n\n<ol class=\"wp-block-list\">\n<li><span style=\"font-weight: 400;\">No single test will tell you how the system behaves in the wild. <\/span><\/li>\n\n\n\n<li><span style=\"font-weight: 400;\">Writing tests is mostly a boring, hard-to-scale activity. Testing needs to be fun and easy so that doing the right thing scales. <\/span><\/li>\n<\/ol>\n\n\n\n<p><span style=\"font-weight: 400;\">In a nutshell, RecList won\u2019t tell you if model A or B is better (that\u2019s for you to say), but it will remove the repetitive boilerplate code, which helps you quickly compare and debug models from a variety of perspectives. For example, does your model treat genders equally? Is it robust to small perturbations?<\/span><\/p>\n\n\n\n<h5 class=\"wp-block-heading\"><a href=\"https:\/\/www.linkedin.com\/in\/jacopotagliabue\/\"><span style=\"font-weight: 400;\">Jacopo Tagliabue<\/span><\/a><span style=\"font-weight: 400;\"> is leading the charge with RecList. <\/span><\/h5>\n\n\n\n<p><span style=\"font-weight: 400;\"><a href=\"https:\/\/www.linkedin.com\/in\/jacopotagliabue\/\">Jacopo<\/a> and his colleagues bring deep expertise in building recommender systems and putting them into production. 
When we asked Jacopo why they\u2019re building RecList, he said:<\/span><\/p>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p><span style=\"font-weight: 400;\">Everybody agrees that behavioral testing is useful, but then in practice it is just hard to do it well, so in the best case you end up writing lots of ad-hoc, untested code for error analysis and debugging, in the worst, you just don\u2019t do it and hope for the best. We didn\u2019t set out to write \u201cyet another package\u201d, but we couldn\u2019t find anything that was good enough for our B2B scenario, with hundreds of models in production; so we started RecList as a fully open source tool, and summarized our findings for the <\/span><a href=\"https:\/\/arxiv.org\/abs\/2111.09963\"><span style=\"font-weight: 400;\">academic<\/span><\/a><span style=\"font-weight: 400;\"> and <\/span><a href=\"https:\/\/towardsdatascience.com\/ndcg-is-not-all-you-need-24eb6d2f1227\"><span style=\"font-weight: 400;\">industry community<\/span><\/a><span style=\"font-weight: 400;\">.<\/span><\/p>\n<\/blockquote>\n\n\n\n<h3 class=\"wp-block-heading\">RecList is now supported by Comet<\/h3>\n\n\n\n<p><span style=\"font-weight: 400;\">Building RecList as open source means that Jacopo and the team need support. That\u2019s why Comet is excited to sponsor RecList and support the development of a beta of the library, with a focus on ease of use. 
<\/span><\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">Comet\u2019s VP of Strategic Projects, <\/span><a href=\"https:\/\/www.linkedin.com\/in\/nikolaskaris\/\"><span style=\"font-weight: 400;\">Niko Laskaris<\/span><\/a><span style=\"font-weight: 400;\">, shared: \u201cWhen we first met Jacopo, we knew he was up to great things, and we\u2019re excited to support him in these endeavors.\u201d<\/span><\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">Jacopo added, \u201cI\u2019m so moved by the positive response in the MLOps community, and I\u2019m proud of Comet\u2019s support and excited to connect RecList with the platform!\u201d<\/span><\/p>\n\n\n\n<h5 class=\"wp-block-heading\"><span style=\"font-weight: 400;\">Here\u2019s how you can participate, contribute or just see more of RecList:<\/span><\/h5>\n\n\n\n<ol class=\"wp-block-list\">\n<li><span style=\"font-weight: 400;\">Check out <\/span><a href=\"https:\/\/github.com\/jacopotagliabue\/RecList\">RecList\u2019s GitHub repo<\/a><span style=\"font-weight: 400;\"> and give it a star<\/span><\/li>\n\n\n\n<li>Follow Jacopo Tagliabue <a href=\"https:\/\/www.linkedin.com\/in\/jacopotagliabue\/\"><span style=\"font-weight: 400;\">on LinkedIn<\/span><\/a><span style=\"font-weight: 400;\"> and <\/span><a href=\"https:\/\/github.com\/jacopotagliabue\"><span style=\"font-weight: 400;\">GitHub<\/span><\/a><\/li>\n\n\n\n<li><span style=\"font-weight: 400;\">Join the <strong>CIKM data challenge<\/strong> happening now through October 2022. The challenge is the first of its kind and is intended to make a long-lasting contribution to the community. Over 30 teams have been formed! The challenge is open to anyone, and there are prizes for the best systems and student work. 
Winners will receive $5K <\/span><span style=\"font-weight: 400;\">\ud83c\udfc6<\/span><span style=\"font-weight: 400;\"><br><br><\/span><a href=\"https:\/\/reclist.io\/cikm2022-cup\/\"><span style=\"font-weight: 400;\">https:\/\/reclist.io\/cikm2022-cup\/<\/span><\/a><\/li>\n<\/ol>\n","protected":false},"excerpt":{"rendered":"<p>How the team behind RecList is moving ML forward When it comes to evaluating ML models, there\u2019s debate about which metrics are the best to check and optimize for. There\u2019s always another F1 or mAP score. There\u2019s also a very healthy debate about how the metrics should be customized for their respective use cases. This [&hellip;]<\/p>\n","protected":false},"author":112,"featured_media":3984,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"customer_name":"","customer_description":"","customer_industry":"","customer_technologies":"","customer_logo":"","footnotes":""},"categories":[10,5,12],"tags":[],"coauthors":[131],"class_list":["post-3792","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-industry","category-partners-integrations","category-thought-leadership"],"yoast_head":"<!-- This site is optimized with the Yoast SEO Premium plugin v25.9 (Yoast SEO v25.9) - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>RecList: The better way to evaluate recommender systems - Comet<\/title>\n<meta name=\"description\" content=\"Evaluating recommender systems is not easy. 
Metrics rarely tell the whole story, so the team behind RecList found a better way.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.comet.com\/site\/blog\/reclist-the-better-way-to-evaluate-recommender-systems\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"RecList: The better way to evaluate recommender systems\" \/>\n<meta property=\"og:description\" content=\"Evaluating recommender systems is not easy. Metrics rarely tell the whole story, so the team behind RecList found a better way.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.comet.com\/site\/blog\/reclist-the-better-way-to-evaluate-recommender-systems\/\" \/>\n<meta property=\"og:site_name\" content=\"Comet\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/cometdotml\" \/>\n<meta property=\"article:published_time\" content=\"2022-08-15T16:46:59+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2025-04-29T12:22:00+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2022\/08\/RecList_LinkedIn_Post_1200x627.png\" \/>\n\t<meta property=\"og:image:width\" content=\"1200\" \/>\n\t<meta property=\"og:image:height\" content=\"627\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"Claire Pena\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@Cometml\" \/>\n<meta name=\"twitter:site\" content=\"@Cometml\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Claire Pena\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"4 minutes\" \/>\n<!-- \/ Yoast SEO Premium plugin. -->","yoast_head_json":{"title":"RecList: The better way to evaluate recommender systems - Comet","description":"Evaluating recommender systems is not easy. Metrics rarely tell the whole story, so the team behind RecList found a better way.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.comet.com\/site\/blog\/reclist-the-better-way-to-evaluate-recommender-systems\/","og_locale":"en_US","og_type":"article","og_title":"RecList: The better way to evaluate recommender systems","og_description":"Evaluating recommender systems is not easy. Metrics rarely tell the whole story, so the team behind RecList found a better way.","og_url":"https:\/\/www.comet.com\/site\/blog\/reclist-the-better-way-to-evaluate-recommender-systems\/","og_site_name":"Comet","article_publisher":"https:\/\/www.facebook.com\/cometdotml","article_published_time":"2022-08-15T16:46:59+00:00","article_modified_time":"2025-04-29T12:22:00+00:00","og_image":[{"width":1200,"height":627,"url":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2022\/08\/RecList_LinkedIn_Post_1200x627.png","type":"image\/png"}],"author":"Claire Pena","twitter_card":"summary_large_image","twitter_creator":"@Cometml","twitter_site":"@Cometml","twitter_misc":{"Written by":"Claire Pena","Est. 
reading time":"4 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/www.comet.com\/site\/blog\/reclist-the-better-way-to-evaluate-recommender-systems\/#article","isPartOf":{"@id":"https:\/\/www.comet.com\/site\/blog\/reclist-the-better-way-to-evaluate-recommender-systems\/"},"author":{"name":"Claire Pena","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/b73b3ffc304cf8bec8866340329c5e89"},"headline":"RecList: The better way to evaluate recommender systems","datePublished":"2022-08-15T16:46:59+00:00","dateModified":"2025-04-29T12:22:00+00:00","mainEntityOfPage":{"@id":"https:\/\/www.comet.com\/site\/blog\/reclist-the-better-way-to-evaluate-recommender-systems\/"},"wordCount":916,"publisher":{"@id":"https:\/\/www.comet.com\/site\/#organization"},"image":{"@id":"https:\/\/www.comet.com\/site\/blog\/reclist-the-better-way-to-evaluate-recommender-systems\/#primaryimage"},"thumbnailUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2022\/08\/RecList_LinkedIn_Post_1200x627.png","articleSection":["Industry","Partners &amp; Integrations","Thought Leadership"],"inLanguage":"en-US"},{"@type":"WebPage","@id":"https:\/\/www.comet.com\/site\/blog\/reclist-the-better-way-to-evaluate-recommender-systems\/","url":"https:\/\/www.comet.com\/site\/blog\/reclist-the-better-way-to-evaluate-recommender-systems\/","name":"RecList: The better way to evaluate recommender systems - Comet","isPartOf":{"@id":"https:\/\/www.comet.com\/site\/#website"},"primaryImageOfPage":{"@id":"https:\/\/www.comet.com\/site\/blog\/reclist-the-better-way-to-evaluate-recommender-systems\/#primaryimage"},"image":{"@id":"https:\/\/www.comet.com\/site\/blog\/reclist-the-better-way-to-evaluate-recommender-systems\/#primaryimage"},"thumbnailUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2022\/08\/RecList_LinkedIn_Post_1200x627.png","datePublished":"2022-08-15T16:46:59+00:00","dateModified":"2025-04-29T12:22:00+00:00","description":"Evaluating 
recommender systems is not easy. Metrics rarely tell the whole story, so the team behind RecList found a better way.","breadcrumb":{"@id":"https:\/\/www.comet.com\/site\/blog\/reclist-the-better-way-to-evaluate-recommender-systems\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.comet.com\/site\/blog\/reclist-the-better-way-to-evaluate-recommender-systems\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/blog\/reclist-the-better-way-to-evaluate-recommender-systems\/#primaryimage","url":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2022\/08\/RecList_LinkedIn_Post_1200x627.png","contentUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2022\/08\/RecList_LinkedIn_Post_1200x627.png","width":1200,"height":627,"caption":"RecList graphic"},{"@type":"BreadcrumbList","@id":"https:\/\/www.comet.com\/site\/blog\/reclist-the-better-way-to-evaluate-recommender-systems\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.comet.com\/site\/"},{"@type":"ListItem","position":2,"name":"RecList: The better way to evaluate recommender systems"}]},{"@type":"WebSite","@id":"https:\/\/www.comet.com\/site\/#website","url":"https:\/\/www.comet.com\/site\/","name":"Comet","description":"Build Better Models Faster","publisher":{"@id":"https:\/\/www.comet.com\/site\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.comet.com\/site\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/www.comet.com\/site\/#organization","name":"Comet ML, 
Inc.","alternateName":"Comet","url":"https:\/\/www.comet.com\/site\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/#\/schema\/logo\/image\/","url":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/01\/logo_comet_square.png","contentUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/01\/logo_comet_square.png","width":310,"height":310,"caption":"Comet ML, Inc."},"image":{"@id":"https:\/\/www.comet.com\/site\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/cometdotml","https:\/\/x.com\/Cometml","https:\/\/www.youtube.com\/channel\/UCmN63HKvfXSCS-UwVwmK8Hw"]},{"@type":"Person","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/b73b3ffc304cf8bec8866340329c5e89","name":"Claire Pena","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/image\/6c42de20d82274b5bcc55f12d2480401","url":"https:\/\/secure.gravatar.com\/avatar\/0158b496f72fba29753917da405441fa923b21dec99134ee8818143fc4113fe4?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/0158b496f72fba29753917da405441fa923b21dec99134ee8818143fc4113fe4?s=96&d=mm&r=g","caption":"Claire 
Pena"},"url":"https:\/\/www.comet.com\/site\/blog\/author\/clairep\/"}]}},"_links":{"self":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/3792","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/users\/112"}],"replies":[{"embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/comments?post=3792"}],"version-history":[{"count":2,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/3792\/revisions"}],"predecessor-version":[{"id":15785,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/3792\/revisions\/15785"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/media\/3984"}],"wp:attachment":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/media?parent=3792"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/categories?post=3792"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/tags?post=3792"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/coauthors?post=3792"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}