Retraining is Not All You Need

Your machine learning (ML) model’s performance will likely decrease over time. In this blog, we explore which steps you can take to remedy your model and get it back on track.

Introduction

According to a 2022 paper by MIT, Cambridge, and Harvard researchers, 91% of the Machine Learning (ML) models they studied degraded over time. It is therefore crucial to continuously monitor your models in production. When a drop in performance is detected, you must take the right steps to restore it so that your predictions remain reliable. This raises the question: what is the best course of action to fix a degraded model?
Most data scientists keep their model's performance steady by retraining their models regularly. In this blog, we explore whether retraining is always the best solution for handling performance drops, or whether the right remedy depends on what caused the drop.

How model performance drops are usually addressed

Model retraining is a vast topic that we won't be able to cover in great detail in this blog. However, we will discuss some common approaches, best practices, and pitfalls associated with retraining for supervised learning. This explanation draws heavily on chapter 9 of Chip Huyen's excellent book Designing Machine Learning Systems, which describes an approach called continual learning: a set of practices aimed at updating ML models in micro-batches as new data becomes available.
Retraining is traditionally done using new data, either by retraining the model from scratch or by updating the current model. Afterwards, the updated model is compared to the existing one to decide whether it should replace the current model. Retraining also involves setting a schedule, maintaining a robust streaming infrastructure, and addressing common challenges. We will explore some of these aspects in more detail later in this blog.
Meme by author
A critical difference between traditional retraining methods and our approach is the triggers for updating the model. Companies often retrain based on the following triggers:
  • Time-based: Retraining at periodic intervals.
  • Performance-based: Retraining whenever the model's performance drops in production. Realized performance can be calculated when labeled data is available; when targets are unavailable, performance can be estimated with DLE or PAPE, methods developed by NannyML (a short estimation sketch follows this list).
  • Volume-based: Retraining when the amount of new data available for training represents a certain percentage of the total data.
  • Drift-based: Retraining when data drift is observed (though we will explain why this might not be the best trigger for retraining).
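To make the performance-based trigger concrete, below is a minimal sketch using NannyML OSS and its bundled synthetic car-loan dataset. It uses CBPE, NannyML's estimator for classification performance without targets (DLE plays the same role for regression); the column names come from that synthetic dataset, and argument names may differ slightly between library versions.
```python
import nannyml as nml

# Reference data (labeled, from model validation) and analysis data (production, no targets)
reference_df, analysis_df, _ = nml.load_synthetic_car_loan_dataset()

estimator = nml.CBPE(
    y_pred_proba='y_pred_proba',
    y_pred='y_pred',
    y_true='repaid',
    timestamp_column_name='timestamp',
    problem_type='classification_binary',
    metrics=['roc_auc'],
    chunk_size=5000,
)
estimator.fit(reference_df)                  # learn calibration on the reference period
estimated = estimator.estimate(analysis_df)  # estimate ROC AUC without ground truth
estimated.plot().show()                      # alerts flag chunks with an estimated drop
```
An alert on the estimated metric is what would trigger the rest of the workflow: root cause analysis first, and only then a decision on whether to retrain.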
Retraining models can be time-consuming and resource-intensive. It may also require preparing new data so that the retrained model performs well, and even then there is no guarantee that the new model will match the previous one's performance. Therefore, it is crucial to limit retraining to when it is truly necessary.
Concept drift is a reliable trigger because its detection indicates a change in the relationship between the model's inputs and outputs, which the model must learn to make accurate predictions. Other triggers do not always justify the need for retraining. For instance, while covariate shift often causes performance degradation, this is not always the case. Furthermore, even if covariate shift did cause a performance drop, retraining might not solve the problem, as we will see.

How you should address performance drop

The performance-based monitoring workflow developed by NannyML comprises three steps:
  1. Continuous performance monitoring
  2. Automated root cause analysis (carried out once a performance drop is detected)
  3. Issue resolution
NannyML’s performance-centric monitoring workflow
In this section, we focus on step 3 of this workflow and explore different ways to address performance drops, depending on their root cause. We concentrate on the three common causes of model degradation: covariate shift, concept drift, and data quality.
Note that we do not address failures of ML systems caused by software issues such as server downtime, dependency failures, deployment failures, or other similar problems. Our focus is on ML-specific failures.

When performance drop is caused by covariate shift

The labeled training data for an ML model can be considered as samples from the joint distribution P(X, Y), where X represents the input features and Y denotes the output. Covariate shift is defined as a change in the distribution of the input features, P(X), while the conditional distribution P(Y|X) remains unchanged. Covariate shift detection can further be broken down into univariate and multivariate drift. The former deals with shifts in a single feature, while the latter refers to shifts in the joint distribution of some or all features.
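As a toy numerical illustration of this definition (the income feature and decision rule below are invented), the distribution of the inputs P(X) shifts between training and production, while the rule standing in for P(Y|X) stays exactly the same:
```python
import numpy as np

rng = np.random.default_rng(7)

def is_positive(income):
    # A fixed rule standing in for P(Y|X): it never changes
    return income >= 60_000

# Only P(X) changes: production incomes are drawn from a shifted distribution
income_train = rng.normal(loc=55_000, scale=10_000, size=50_000)
income_prod = rng.normal(loc=65_000, scale=10_000, size=50_000)

print(f"Positive rate at training time: {is_positive(income_train).mean():.2%}")
print(f"Positive rate in production:    {is_positive(income_prod).mean():.2%}")
# The mapping from X to Y is identical in both periods; only where the inputs fall has moved.
```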
Univariate drift visualized. Plot by author
While univariate and multivariate drift can each occur on their own, in practice we often observe both at the same time. Moreover, covariate shift is a common cause of model degradation. Several methods are available for detecting both univariate and multivariate drift, and you can apply them easily using NannyML OSS or NannyML Cloud.
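As a rough sketch of what drift detection looks like with the NannyML OSS API (the feature names are placeholders, and reference_df / analysis_df stand for your training-time and production data; check the docs for your installed version):
```python
import nannyml as nml

feature_columns = ['feature_1', 'feature_2', 'feature_3']  # replace with your model's inputs

# Univariate drift: compares each feature's distribution against the reference period
univariate_calc = nml.UnivariateDriftCalculator(
    column_names=feature_columns,
    timestamp_column_name='timestamp',
    continuous_methods=['jensen_shannon'],
    categorical_methods=['jensen_shannon'],
    chunk_size=5000,
)
univariate_calc.fit(reference_df)
univariate_results = univariate_calc.calculate(analysis_df)

# Multivariate drift: PCA reconstruction error over the joint feature space
multivariate_calc = nml.DataReconstructionDriftCalculator(
    column_names=feature_columns,
    timestamp_column_name='timestamp',
    chunk_size=5000,
)
multivariate_calc.fit(reference_df)
multivariate_results = multivariate_calc.calculate(analysis_df)
```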
When covariate shift causes your model to deteriorate in production, retraining might only partially solve the issue or might not solve it at all. This is because a shift in the distribution of the input data doesn't necessarily mean that the relationship between X and Y that the model is trying to learn has changed. In supervised learning, the model is trying to learn P(Y|X). If this relationship has not changed, retraining might not remedy your degraded model.
Instead, adjusting prediction thresholds might be the best solution for ensuring your model in production can still make key predictions and deliver value. The approach to adjusting prediction thresholds varies depending on the specific task at hand. We discuss the steps you can take when dealing with classification models.
For binary classification problems, most models output a probability that determines the classification. The default threshold is often set at 0.5, meaning an instance is classified as class 1 if the predicted probability is greater than or equal to 0.5, and as class 0 otherwise. However, this threshold can be adjusted to improve model performance and ensure more accurate predictions, especially in the presence of covariate shift.
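A minimal sketch of that idea: pick the threshold that maximizes F1 on a recent labeled slice of production data instead of sticking with the default 0.5 (the toy arrays below only stand in for such a slice, and F1 is just one reasonable choice of objective):
```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def best_f1_threshold(y_true, y_scores):
    """Pick the probability threshold that maximizes F1 on a recent labeled slice."""
    precision, recall, thresholds = precision_recall_curve(y_true, y_scores)
    # precision/recall have one more entry than thresholds; align them before computing F1
    f1 = 2 * precision[:-1] * recall[:-1] / np.clip(precision[:-1] + recall[:-1], 1e-12, None)
    return thresholds[np.argmax(f1)]

# Toy stand-ins for recent production labels and predicted probabilities
rng = np.random.default_rng(0)
y_recent = rng.integers(0, 2, size=1_000)
scores_recent = np.clip(y_recent * 0.3 + rng.uniform(0.0, 0.7, size=1_000), 0.0, 1.0)

new_threshold = best_f1_threshold(y_recent, scores_recent)
print(f"Tuned threshold: {new_threshold:.2f} (instead of the default 0.50)")
```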
Other approaches for adjusting the decision boundary include modifying loss functions, penalizing false positives or false negatives, and adjusting class weights. Adjusting class weights is particularly effective for handling imbalanced datasets.
For multiclass classification problems, methods for adjusting the decision threshold include:
  • Adjusting the threshold of each binary classifier when using a One-vs-Rest strategy.
  • Adjusting the threshold of each class when dealing with models like softmax classification (a short sketch of this option follows the list).
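As a sketch of the second option, here is one way to apply per-class thresholds to softmax-style probabilities; the threshold values and the fallback-to-argmax rule are purely illustrative choices:
```python
import numpy as np

def predict_with_class_thresholds(proba, thresholds):
    """Apply a separate decision threshold per class; fall back to plain argmax
    when no class clears its own threshold."""
    thresholds = np.asarray(thresholds)
    masked = np.where(proba >= thresholds, proba, 0.0)  # zero out classes below their cut-off
    preds = masked.argmax(axis=1)
    none_cleared = masked.max(axis=1) == 0.0
    preds[none_cleared] = proba[none_cleared].argmax(axis=1)
    return preds

proba = np.array([[0.50, 0.30, 0.20],
                  [0.40, 0.35, 0.25]])
# Demand more confidence for class 0 than for classes 1 and 2
print(predict_with_class_thresholds(proba, thresholds=[0.45, 0.30, 0.30]))  # -> [0 1]
```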
When adjusting the decision boundary for classification problems, it is important to consider the precision-recall tradeoff.

When performance drop is caused by concept drift

Concept drift refers to the phenomenon where the input distribution remains the same, but the conditional distribution of the output given an input changes. Mathematically, this can be expressed as follows: given that the training data with labels can be described as samples from the joint distribution P(X, Y), where X are the input features and Y is the output, in supervised learning we aim to model P(Y|X). Concept drift occurs when P(Y|X) changes while P(X) remains unchanged.
Intuitively, this means that the output changes while the input doesn’t. Consider, for example, a model tasked with predicting someone’s eligibility for a loan. The model might make predictions based on features such as credit score, debt-to-income ratio, and employment status. If the eligibility criteria for the loan are updated, some people who used to be eligible for a loan may no longer qualify, and vice versa. In this scenario, the distribution of the input features hasn’t changed, but the conditional distribution of the output given an input has changed. This is a mathematical way of saying that while the characteristics of the applicants haven’t changed, their eligibility has.
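Here is a toy simulation of that loan example (the policy thresholds are invented): the applicant population, standing in for P(X), is generated once and never changes, yet many labels flip because the eligibility rule, standing in for P(Y|X), does:
```python
import numpy as np

rng = np.random.default_rng(42)

# The same applicant population before and after the policy change: P(X) is unchanged
credit_score = rng.normal(loc=650, scale=80, size=10_000)
dti_ratio = rng.uniform(low=0.1, high=0.6, size=10_000)

# Old policy: eligible if credit score >= 620 and debt-to-income <= 0.45
eligible_old = (credit_score >= 620) & (dti_ratio <= 0.45)

# New, stricter policy: the labeling rule P(Y|X) changes, the inputs do not
eligible_new = (credit_score >= 680) & (dti_ratio <= 0.35)

print(f"Eligible under the old policy:  {eligible_old.mean():.2%}")
print(f"Eligible under the new policy:  {eligible_new.mean():.2%}")
print(f"Applicants whose label flipped: {(eligible_old != eligible_new).mean():.2%}")
```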
Visualizing concept drift. Image by Michał Oleszak
One challenge associated with detecting concept drift in production is the need for access to labeled data (also called target data or ground truth). The problem is that labels are often unavailable or become available only after the model’s predictions have impacted business decisions. However, if and when ground truth becomes available, you can use Reverse Concept Drift (RCD), an algorithm developed by NannyML, to detect concept drift.
When concept drift is detected, we recommend retraining your model as an effective way to rectify performance degradation. The reason is that when concept drift occurs, the relationship between the model’s inputs and output changes. This change must be learned, which can involve finding a new decision boundary for classification problems or learning a new regression line for regression tasks.

How you should retrain your model

Above, we discussed how most companies use time-based, performance-based, volume-based, and drift-based signals to trigger retraining. While our approach uses concept drift detection as a trigger for retraining, the methodology remains the same.
When updating a model, we refer to the original model as the champion model. We create a replica of the champion model, which we then update and call the challenger model. Much like in boxing, if the challenger performs better than the champion, it takes its place. This simple workflow ensures that you don’t push an underperforming model to production. It should also be noted that in reality, this process can be more complicated, as there might be several challenger models competing for the champion's position.
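A minimal sketch of that promotion rule, assuming both models expose predict_proba and you hold out a recent labeled slice to compare them on (the min_uplift margin is an optional guard against promoting a challenger that is only marginally better):
```python
from sklearn.metrics import roc_auc_score

def pick_winner(champion, challenger, X_holdout, y_holdout, min_uplift=0.0):
    """Promote the challenger only if it beats the champion on a recent labeled holdout set."""
    champion_auc = roc_auc_score(y_holdout, champion.predict_proba(X_holdout)[:, 1])
    challenger_auc = roc_auc_score(y_holdout, challenger.predict_proba(X_holdout)[:, 1])
    return challenger if challenger_auc > champion_auc + min_uplift else champion
```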
Simplification of continual learning in production. Image reproduced from Chip Huyen’s Designing Machine Learning Systems
There are several ways to compare the challenger and champion models. One of them is shadow deployment, a practice where the challenger model is deployed in parallel to the champion model, and their predictions are compared. However, only the champion model’s predictions are used in production.
Now, when I say updating the model, I mean retraining it using newly available data. Here we consider two approaches: stateless retraining and stateful training. The former involves retraining the model from scratch, while the latter involves the model continuing to learn from new data. Stateful training requires less data to update your model than stateless retraining and also requires fewer computational resources. Sometimes, using both approaches can be beneficial. For example, you might opt for stateful training but still retrain your model from scratch occasionally.
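To make the two options concrete, here is a small sketch using scikit-learn's SGDClassifier, chosen only because it supports partial_fit; not every model family allows incremental updates, and the random arrays merely stand in for your historical and newly arrived batches:
```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
X_old, y_old = rng.normal(size=(5_000, 4)), rng.integers(0, 2, size=5_000)  # historical data
X_new, y_new = rng.normal(size=(500, 4)), rng.integers(0, 2, size=500)      # new micro-batch

# Stateless retraining: fit a brand-new model from scratch on all available data
stateless_model = SGDClassifier(loss="log_loss", random_state=0)
stateless_model.fit(np.vstack([X_old, X_new]), np.concatenate([y_old, y_new]))

# Stateful training: the existing (deployed) model keeps learning from the new batch only
stateful_model = SGDClassifier(loss="log_loss", random_state=0)
stateful_model.partial_fit(X_old, y_old, classes=np.array([0, 1]))  # stands in for the champion
stateful_model.partial_fit(X_new, y_new)                            # incremental update
```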
For a deeper dive into optimizing your retraining schedules, Chip Huyen’s book is an excellent resource. Specifically, chapter 9 discusses the challenges associated with retraining, different levels of continual learning maturity, how to evaluate updated models, etc.

When performance drop is caused by data quality issues

Often, the cause of a degrading model can be traced back to data quality issues. These issues include (but are not limited to):
  • Broken data pipelines that fail to stream all the required data necessary for making predictions.
  • Changes in data collection processes, leading to discrepancies between the available data and the data the model was trained on. This might occur due to changes made by your data vendor.
  • For categorical features, encountering values in production that were not present during training, referred to as unseen values.
  • Data inconsistencies within the same dataset.
For the most part, such issues should be handled upstream. For example, a data engineer can fix broken data pipelines to ensure you have access to the necessary data. If there are changes in data collection, you might need to update data pipelines to accommodate those changes.
Sometimes, as a data scientist, you can fix these issues on your own. For instance, if data that used to be delivered in Celsius is suddenly given in Fahrenheit, a simple feature transformation can resolve this. However, if the data source decides to stop recording temperature altogether, you might need to build a new model.
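A minimal sketch of such a fix, assuming a hypothetical 'temperature' feature whose upstream unit changed from Celsius to Fahrenheit:
```python
import pandas as pd

def normalize_temperature(df: pd.DataFrame, unit: str) -> pd.DataFrame:
    """Make sure the 'temperature' feature reaches the model in Celsius,
    whichever unit the upstream source currently delivers."""
    df = df.copy()
    if unit == "fahrenheit":
        df["temperature"] = (df["temperature"] - 32) * 5 / 9
    return df

batch = pd.DataFrame({"temperature": [32.0, 98.6, 212.0]})
print(normalize_temperature(batch, unit="fahrenheit"))  # 0.0, 37.0, 100.0
```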

Conclusion

In this blog, we discussed how to address model performance degradation. Instead of triggering retraining runs periodically, as is often done in practice, we propose a more nuanced approach that chooses the appropriate course of action based on the cause of the performance drop. While retraining can help get your model back on track, it can also lead to unnecessary resource use and model overfitting. Hence the title of the blog: Retraining Is Not All You Need 😉.
How to address performance degradation based on its root cause
Remedying a degraded model is only a small part of a robust monitoring strategy. At NannyML, we advocate for a performance-centric monitoring workflow as an easy, repeatable, and effective way to ensure your model keeps delivering value. NannyML Cloud provides all the tools required to implement such a workflow with ease, so you don’t need to worry about your model's performance drops going unnoticed. To find out how NannyML Cloud can suit your organization’s needs, book a call with one of our founders.

Further reading

This blog focuses on one part of the monitoring workflow devised by NannyML, namely issue resolution. To learn more about this performance-centric workflow, read the following blogs. The first one explains the different steps of the workflow in depth, while the second one provides a hands-on example.

References

Designing Machine Learning Systems by Chip Huyen (O’Reilly). Copyright 2022 Huyen Thi Khanh Nguyen, 978-1-098-10796-3


Written by

Miles Weberman

Data Science Writer at NannyML