How the team behind RecList is moving ML forward
When it comes to evaluating ML models, there’s debate about which metrics are best to track and optimize for. There’s always another F1 or mAP score. There’s also a very healthy debate about how metrics should be customized for each use case. This debate exists because the real world is complex. We strive to get the best out of ML so that it delivers great end-user experiences and the business ROI that our stakeholders are looking for.
While measuring the performance of the model is a core activity, as a community, we don’t have it all figured out yet. That’s okay. We move ML forward by working together.
The challenge with recommender systems
With model evaluation, the typical toolkit involves looking at a variety of metrics. Depending on the project and use case, some metrics will be more relevant than others. Truly rigorous evaluations also verify performance on unseen data, check for overfitting and underfitting, describe the complete performance of a model, and set up data drift detection.
What’s still missing from this is a rounded evaluation. As we are all painfully aware, the metrics rarely tell the whole story. No single number will help us catch silent failures or avoid racial bias or reveal all the intricacies of data or concept drift. Moreover, the content of your test set may seriously overestimate the performance of your model in the real world: researchers in NLP found that state-of-the-art models with “human performance” actually fail at very simple NLP tasks. If you were to only know their accuracy, you would significantly misjudge the ability of the model to generalize and not produce harmful responses.
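To make the "no single number" point concrete, here is a minimal sketch in plain Python. It does not use RecList's API. The helper names (`hit_rate_at_k`, `hit_rate_by_slice`), the toy predictions, and the "popular" / "niche" slice labels are all assumptions invented for illustration. The sketch shows how a healthy-looking aggregate hit rate can hide a slice on which the model fails completely.

```python
from collections import defaultdict


def hit_rate_at_k(predictions, targets, k=3):
    """Fraction of cases where the true item appears in the top-k predictions."""
    hits = sum(1 for preds, target in zip(predictions, targets) if target in preds[:k])
    return hits / len(targets)


def hit_rate_by_slice(predictions, targets, slices, k=3):
    """Same metric, computed separately for each slice label."""
    grouped = defaultdict(lambda: ([], []))
    for preds, target, label in zip(predictions, targets, slices):
        grouped[label][0].append(preds)
        grouped[label][1].append(target)
    return {label: hit_rate_at_k(p, t, k) for label, (p, t) in grouped.items()}


# Toy data (hypothetical): four "popular" test cases and one "niche" case.
predictions = [
    ["a", "b", "c"],
    ["a", "b", "d"],
    ["b", "a", "c"],
    ["a", "c", "b"],
    ["a", "b", "c"],
]
targets = ["a", "b", "a", "c", "z"]
slices = ["popular", "popular", "popular", "popular", "niche"]

overall = hit_rate_at_k(predictions, targets)
per_slice = hit_rate_by_slice(predictions, targets, slices)

print(overall)    # 0.8
print(per_slice)  # {'popular': 1.0, 'niche': 0.0}
```

The aggregate score looks fine, yet every niche case fails. A test set dominated by popular items would never surface that.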
What is RecList
Of all ML systems in production, recommender systems are arguably some of the most impactful. They help us navigate most aspects of our digital life, from what movies to watch and what books to read to what shoes to buy for that special handbag and what news articles to open. How can we be sure (or “more sure”) that recommender systems in production generalize properly?
This is where RecList comes in. It’s an open-source library with plug-and-play test cases and datasets that make it easy to scale up behavioral testing. Behavioral testing is not new, but this project adds another great tool to your model evaluation toolbox. It lets anyone test their models on a wide variety of metrics, which provides a more holistic evaluation of model performance. It’s designed for recommender systems, with ready-made connectors for popular datasets in the field. In the future, it could be applied to other types of models as well. How cool is that?
RecList is built on two fundamental principles:
- No single test will tell you how the system behaves in the wild.
- Writing tests is mostly a boring, hard-to-scale activity. It needs to be fun and easy so that doing the right thing is scalable.
In a nutshell, RecList won’t tell you if model A or B is better (that’s for you to say), but it will remove the repetitive, boilerplate code. This helps you quickly compare and debug models from a variety of perspectives. For example, does your model treat genders equally? Is it robust to small perturbations?
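The two questions above can each be phrased as a small behavioral test. The sketch below is plain Python and is not RecList's API. `toy_recommender`, `perturbation_test`, `hit_rate_gap`, the catalog, and the group labels are hypothetical names and data made up for illustration. It shows the kind of check a behavioral testing library would package up for you.

```python
def toy_recommender(query, k=3):
    """Hypothetical stand-in model: normalizes the query, then looks up a fixed catalog."""
    catalog = {
        "running shoes": ["shoe_1", "shoe_2", "shoe_3"],
        "handbag": ["bag_1", "bag_2", "bag_3"],
    }
    return catalog.get(query.strip().lower(), [])[:k]


def overlap(a, b):
    """Share of items two recommendation lists have in common (1.0 if both are empty)."""
    if not a and not b:
        return 1.0
    return len(set(a) & set(b)) / max(len(a), len(b))


def perturbation_test(model, pairs, min_overlap=0.5):
    """Return the (original, perturbed, score) triples whose outputs diverge too much."""
    failures = []
    for original, perturbed in pairs:
        score = overlap(model(original), model(perturbed))
        if score < min_overlap:
            failures.append((original, perturbed, score))
    return failures


def hit_rate_gap(hits_by_group):
    """Largest difference in hit rate between any two user groups, plus the per-group rates."""
    rates = {group: sum(hits) / len(hits) for group, hits in hits_by_group.items()}
    return max(rates.values()) - min(rates.values()), rates


# Robustness: small changes to the input should not change the recommendations much.
pairs = [("running shoes", "Running Shoes "), ("handbag", "hand bag")]
failures = perturbation_test(toy_recommender, pairs)
print(failures)  # [('handbag', 'hand bag', 0.0)]

# Parity: 1 = the recommendation was a hit, 0 = a miss, recorded per (hypothetical) group.
gap, rates = hit_rate_gap({"group_a": [1, 1, 0, 1], "group_b": [1, 0, 0, 1]})
print(gap, rates)  # 0.25 {'group_a': 0.75, 'group_b': 0.5}
```

Neither check says which model is better. Each one only flags behavior worth a closer look, and what counts as an acceptable overlap or gap is a judgment call for your use case.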
Jacopo Tagliabue is leading the charge with RecList.
Jacopo and his colleagues bring deep expertise in building recommender systems and putting them into production. When we asked Jacopo why they’re building RecList, he said:
“Everybody agrees that behavioral testing is useful, but then in practice it is just hard to do it well, so in the best case you end up writing lots of ad-hoc, untested code for error analysis and debugging, in the worst, you just don’t do it and hope for the best. We didn’t set out to write ‘yet another package’, but we couldn’t find anything that was good enough for our B2B scenario, with hundreds of models in production; so we started RecList as a fully open source tool, and summarized our findings for the academic and industry community.”
RecList is now supported by Comet
RecList is fully open source, which means Jacopo and the team need support to keep building it. That’s why Comet is excited to sponsor the development of a beta of the RecList library, with a focus on ease of use.
Comet’s VP of Strategic Projects, Niko Laskaris, shared: “When we first met Jacopo, we knew he was up to great things, and we’re excited to support him in these endeavors.”
Jacopo added, “I’m so moved by the positive response in the MLOps community, and I’m proud of Comet’s support and excited to connect RecList with the platform!”
Here’s how you can participate, contribute or just see more of RecList:
- Check out RecList’s GitHub repo and give it a star
- Follow Jacopo Tagliabue on LinkedIn and GitHub
- Join the CIKM data challenge, happening now through October 2022. The challenge is a first of its kind, with the intent of making a long-lasting contribution to the community. Over 30 teams have been formed! The challenge is open to anyone, and there are prizes for the best systems and student work. Winners will receive $5K 🏆
 https://reclist.io/cikm2022-cup/