January 13, 2023
Deploying your models into production is only half the battle in machine learning. Once a model moves to the production environment, you expect it to deliver value. However, ML models can fail and often do. The real-world environment presents factors that can cause models to underperform and fall short of business expectations. Changing customer behavior and market conditions could lead your model astray. Your model might encounter data that diverges from what was used during training, inevitably leading to decreased accuracy and reliability. ML models often struggle to meet business KPIs even after optimization and deployment. Model monitoring helps ensure that only effective and useful ML models are in production.
Model monitoring is a stage in the ML lifecycle that ensures the model performs as expected in the production environment. It is the process of watching a deployed model for data drift, concept drift, and model degradation. Model monitoring is complex. Without proper monitoring tools, undiscovered bugs or erroneous models may slip through the cracks.
ML models may degrade over time and lose their predictive power due to various factors, such as data skew, model staleness, and negative feedback loops.
Data skew occurs when the data used to train the model is not representative of the live data. When this happens, the model becomes less accurate and produces unexpected results. There are multiple reasons why data skews occur. For example, the distribution of a variable in the training data may not match its distribution in the live data. You could have built a model with a feature that is unavailable at prediction time. Or your model may ingest features created by other systems, and those systems may have changed how they produce data.
Data skew may also develop gradually over time. Luckily, model monitoring helps you catch this issue early, before it noticeably degrades your results.
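One common way to quantify a mismatch between training and live distributions is the Population Stability Index (PSI). The sketch below is a minimal, standard-library illustration and not a prescribed method; the function name, the bin count, and the rule-of-thumb thresholds in the comments are assumptions chosen for illustration.

```python
import math
from typing import Sequence


def population_stability_index(
    expected: Sequence[float], actual: Sequence[float], bins: int = 10
) -> float:
    """PSI between a training (expected) sample and a live (actual) sample.

    Bins are equal-width and derived from the training sample only.
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if all values are equal

    def bin_fractions(values: Sequence[float]) -> list:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            # Clamp live values outside the training range into the edge bins.
            counts[min(max(idx, 0), bins - 1)] += 1
        # A small floor avoids log(0) when a bin is empty.
        return [max(c / len(values), 1e-6) for c in counts]

    e = bin_fractions(expected)
    a = bin_fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ai, ei in zip(a, e))


# Toy data (made up): a feature that shifts upward in production.
train_sample = [i / 100 for i in range(100)]
live_sample = [x + 0.5 for x in train_sample]

psi_same = population_stability_index(train_sample, train_sample)
psi_shifted = population_stability_index(train_sample, live_sample)
# A frequently quoted convention (an assumption here, not a hard rule):
# PSI below 0.1 is stable, above 0.25 signals a significant shift.
print(f"same={psi_same:.3f} shifted={psi_shifted:.3f}")
```

Running a check like this per feature on a schedule gives an early signal of skew, even before ground-truth labels are available.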
ML engineers and data scientists often face model staleness. Model staleness happens when there are environmental shifts, changes in consumer behavior, or adversarial scenarios. For example, when we train models on historical data, we need to anticipate that the same behavior may not hold today. Training financial models using data from a recession may not produce useful results when the economy is healthy. Shopping behaviors and customer preferences changed drastically during the pandemic, so models trained on pre-pandemic data may perform differently than expected today.
Monitoring your models helps you determine when there are shifts so you can evaluate what input data has changed and retrain the model with a new dataset.
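As a rough illustration of detecting such a shift, the sketch below tracks accuracy over a sliding window of recent labeled predictions and flags the model once accuracy falls a set margin below its baseline. The class name, window size, and tolerance are hypothetical choices, and the approach assumes ground-truth labels eventually arrive for production predictions.

```python
from collections import deque
from typing import Optional


class RollingAccuracyMonitor:
    """Flags a model for retraining when recent accuracy drops below baseline."""

    def __init__(self, baseline_accuracy: float, window: int = 500,
                 tolerance: float = 0.05) -> None:
        self.baseline_accuracy = baseline_accuracy
        self.tolerance = tolerance
        # Only the most recent `window` outcomes are kept.
        self._hits = deque(maxlen=window)

    def record(self, prediction, ground_truth) -> None:
        self._hits.append(prediction == ground_truth)

    @property
    def accuracy(self) -> Optional[float]:
        return sum(self._hits) / len(self._hits) if self._hits else None

    def needs_retraining(self) -> bool:
        # Wait for a full window so a few early mistakes do not trigger a flag.
        if len(self._hits) < self._hits.maxlen:
            return False
        return self.accuracy < self.baseline_accuracy - self.tolerance


monitor = RollingAccuracyMonitor(baseline_accuracy=0.9, window=10)
for _ in range(10):
    monitor.record(prediction=1, ground_truth=1)  # model is still correct
healthy = monitor.needs_retraining()

for _ in range(5):
    monitor.record(prediction=1, ground_truth=0)  # behavior has shifted
stale = monitor.needs_retraining()
print(healthy, stale)
```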
Sometimes, models are trained based on data collected in the production environment, which can create a negative feedback loop that corrupts training data quality. This negative feedback loop makes the subsequent models perform poorly.
For example, a fraud detection system requires users with a high-risk score to complete additional verification. If the verification is effective and discourages fraudulent users from completing the extra step, they will never get to commit fraud and get identified as such. This tendency will cause a model’s future training to be biased because the existing system will likely stop the majority of fraud from happening, reducing the number of (already few) positive examples in the training set.
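The effect can be seen in a toy calculation. The sketch below uses made-up transactions and assumes, as in the example above, that fraudulent users who hit the extra verification step abandon the transaction and therefore never produce a labeled positive. The names and numbers are purely illustrative.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Transaction:
    risk_score: float
    is_fraud: bool


def observed_training_labels(transactions: List[Transaction],
                             threshold: float) -> List[bool]:
    """Labels that would be logged for future retraining.

    Assumption: a fraudulent user scored at or above `threshold` drops out
    at verification, so that transaction never yields a labeled example.
    """
    return [
        t.is_fraud
        for t in transactions
        if not (t.is_fraud and t.risk_score >= threshold)
    ]


# Made-up population: 95 legitimate users and 5 fraudsters, 4 of them high-risk.
population = (
    [Transaction(0.1, False)] * 95
    + [Transaction(0.9, True)] * 4
    + [Transaction(0.2, True)]
)

true_rate = sum(t.is_fraud for t in population) / len(population)
labels = observed_training_labels(population, threshold=0.8)
observed_rate = sum(labels) / len(labels)
print(f"true fraud rate={true_rate:.3f}, observed in logs={observed_rate:.3f}")
```

In this toy setup the logged data contains one positive out of 96 examples instead of five out of 100, which is the shrinkage in positive examples described above.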
A lot of work still needs to be done after deployment. Monitoring helps you understand if your model addresses the problem you initially sought to solve. Establishing best practices can help your team streamline model monitoring and fix issues before they negatively affect end users. Here are some best practices your team can use.
ML systems can go wrong in two ways: data science and operations. Stakeholders in each area should collaborate to establish an effective monitoring workflow. Let everyone handle their areas of expertise and enable them to communicate effectively. Additionally, sharing the responsibility of model monitoring helps build knowledge sharing and prevents one person or team from getting overwhelmed when an issue arises.
Logging everything in your machine learning pipeline is critical. Pay attention to data versions, performance visualizations, and, most importantly, hyperparameters. Metadata stores can hold much of this information, which improves auditing, lineage traceability, compliance, and troubleshooting. They can also capture the configuration of the feature preprocessing steps in the model training pipeline. And when done right, using a metadata store for hyperparameters can make the training process more reproducible.
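To make this concrete, here is a minimal sketch of the kind of record a team might log for each training run. It is not tied to any particular metadata store; the function name, field names, and hyperparameter values are hypothetical, and the data version is simply a content hash of the training rows.

```python
import hashlib
import json
from datetime import datetime, timezone


def build_run_record(training_rows: list, hyperparameters: dict,
                     preprocessing: dict) -> dict:
    """Assemble the metadata for one training run as a plain dictionary."""
    # Hash a canonical JSON dump so identical data always gets the same version.
    data_bytes = json.dumps(training_rows, sort_keys=True).encode("utf-8")
    return {
        "logged_at": datetime.now(timezone.utc).isoformat(),
        "data_version": hashlib.sha256(data_bytes).hexdigest()[:12],
        "hyperparameters": hyperparameters,
        "preprocessing": preprocessing,
    }


# Illustrative values only.
rows = [{"amount": 12.5, "country": "DE"}, {"amount": 99.0, "country": "US"}]
record = build_run_record(
    training_rows=rows,
    hyperparameters={"learning_rate": 0.01, "max_depth": 6},
    preprocessing={"scaler": "standard", "impute": "median"},
)
print(json.dumps(record, indent=2, sort_keys=True))
```

Because the data version is derived from the content, two runs trained on the same rows can be matched later, and any change to the data shows up as a different version string.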
In model monitoring, there are several metrics to track, but you don’t need to check them constantly. Every ML model is unique, so a metric that matters for one model may matter less for another. However, the most common metrics to track are data drift, missing values, accuracy, RMSE, output distribution, number of predictions, and number of ingestion errors.
Aside from data and model quality metrics, tracking the system’s health is essential. Teams can track latency, system reliability, and memory/disk utilization. It’s also vital to add business KPIs, but it will entirely depend on the nature of your model and organization.
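A few of the data and model quality metrics listed above are simple to compute by hand. The sketch below shows missing-value rate, RMSE, and output distribution using only the standard library; the function names and sample values are illustrative, and a production setup would more likely compute these inside a monitoring tool or a scheduled job.

```python
import math
from collections import Counter
from typing import Dict, Hashable, List, Sequence


def missing_rate(rows: List[dict], feature: str) -> float:
    """Fraction of rows where `feature` is absent or None."""
    return sum(1 for r in rows if r.get(feature) is None) / len(rows)


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean squared error between ground truth and predictions."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def output_distribution(predictions: Sequence[Hashable]) -> Dict[Hashable, float]:
    """Share of each predicted label among all predictions."""
    counts = Counter(predictions)
    return {label: n / len(predictions) for label, n in counts.items()}


# Made-up batch of production inputs and outputs.
batch = [{"age": 31}, {"age": None}, {"age": 45}, {"age": 27}]
print("missing rate:", missing_rate(batch, "age"))
print("rmse:", rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
print("outputs:", output_distribution(["approve", "approve", "deny", "approve"]))
```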
Setting alerts is crucial to know when something has gone wrong. If you track your metrics but don’t get notified when there is a problem, then why even track them?
Test your alerts before they go into production to ensure they’re effective. Clarify with your team who gets which alert. For example, data quality issues should be directed to the data science team, while system performance alerts should be sent to the engineering or IT team. Agree on the channel where these alerts get sent. Does your team prefer emails, text messages, or Slack? It’s also helpful to give each alert context by including short, descriptive information. Remember to reserve alerts for issues that require intervention and have a real business impact; alerts for other application or system issues are optional, depending on your team’s preferences.
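The routing described above can be sketched as a small set of rules, each tying a metric threshold to an owning team, a channel, and a short context message. Everything here (rule fields, metric names, thresholds, team and channel labels) is a hypothetical example, and actual delivery to email or Slack is left out.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class AlertRule:
    metric: str
    threshold: float
    team: str
    channel: str
    context: str


def evaluate_alerts(metrics: Dict[str, float],
                    rules: List[AlertRule]) -> List[dict]:
    """Return one alert per rule whose metric exceeds its threshold."""
    alerts = []
    for rule in rules:
        value = metrics.get(rule.metric)
        # Skip metrics that were not reported in this evaluation cycle.
        if value is not None and value > rule.threshold:
            alerts.append({
                "team": rule.team,
                "channel": rule.channel,
                "message": f"{rule.metric}={value} exceeds {rule.threshold}: {rule.context}",
            })
    return alerts


rules = [
    AlertRule("feature_psi", 0.25, "data-science", "slack",
              "input distribution has shifted from training data"),
    AlertRule("p95_latency_ms", 500.0, "engineering", "email",
              "prediction service is responding slowly"),
]

# Made-up snapshot: drift is high, latency is fine.
fired = evaluate_alerts({"feature_psi": 0.31, "p95_latency_ms": 120.0}, rules)
for alert in fired:
    print(alert["team"], alert["channel"], alert["message"])
```

Keeping rules as data makes it easy to test them before they go live: feed in a known-bad metrics snapshot and check that the right team would be notified.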
You may encounter faults, errors, and inconsistencies that need immediate troubleshooting when pushing a model to production. A model production monitoring (MPM) platform can help you catch issues like data drift and track accuracy against ground truth labels, along with other metrics, before it’s too late.
An MPM can also provide drill-down capabilities to fix anomalies and pinpoint data augmentation issues, saving ML teams a lot of time and resources. Using a robust production monitoring platform, you can track systems, ML pipelines, and even costs.
Many organizations that rapidly jumped on the ML bandwagon ended up frustrated because their models didn’t perform as expected after real-world deployment. Moreover, since machine learning is relatively nascent, there’s yet to be a clear foolproof playbook for effectively monitoring models. It can also be challenging to gauge the expected results without the tools to understand your model’s performance in production.
Comet’s Model Production Monitoring provides insight into how production models perform in real-time. With Comet, you can track and visualize model performance at any scale. See the Comet difference today.