
Parmenides argued that change is an illusion and that reality is unchanging and eternal. I guess he never deployed a model into production. If he had, he would have observed how a static machine learning model fails to generalize on production data. But why does this actually happen?
This is what I would like to explore in this blog post.
Nature of ML models
A machine learning model captures the probability of the target variable based on the patterns in the features from the training data. That's why it's important that your training data is representative, balanced, accurate, diverse, varied, relevant, non-redundant, and up-to-date. If all of these conditions are met, your model will be a robust reflection of reality and will be able to generalize well on the production data.
Well, at least to a certain point - as long as the production data is a sample from the same reality that the training data was extracted from. But contrary to what Parmenides claimed, reality is changing, and the production data is like a river - it might look the same, but it's different.
In this blog post we will have a look at what can change about the river, how it can influence our model performance and how to detect and fix it.
Why does a model's performance degrade over time?
Data Quality
Data quality is a measure that tells us if the data meets the following requirements:
- Accuracy - Is the data free from errors?
- Completeness - Are all necessary data points present?
- Consistency - Does the data remain coherent across different datasets and systems?
- Validity - Does the data conform to the required formats?
- Uniqueness - Are there any duplicate records or entries in the dataset?
- Timeliness - Is the data up-to-date?

If your training data doesn’t fulfill those requirements, the resulting model will likely struggle to provide any value whatsoever and won’t even get the chance to fail in production.
Not meeting those criteria in your production data can lead to inaccurate predictions and increased uncertainty in the model's outputs. You can infer the importance of data quality from the amount of money and effort invested in data engineering teams to ensure that the collected data is not garbage.
Inaccurate or erroneous production data may be caused by failing hardware or by mistakes in the data processing pipelines. For example, a broken sensor could pass data that doesn't reflect the real state of a process. Predictions based on such false data would be misleading.
The same holds true for incomplete data. If you want to make reliable predictions, you should ensure that the rate of missing values is similar during training and at inference time. This issue can arise not only due to errors or mistakes but also because of regulations that can affect which types of data are legal to use. For example, if you trained your model using a rich dataset but later, in production, you’re not allowed to use a certain column due to regulatory changes, it will impact your model’s performance.
The opposite scenario can also happen if you suddenly gain access to more specific data. For instance, if the data collection process for a gender variable changes, moving from binary gender options to a more inclusive system that accepts non-binary values, it will influence the distribution of that variable and affect the accuracy of your predictions. That's why data consistency and validity are so important. After all, your model can't generalize well to values it has never seen.
Two important things to bear in mind when handling the quality of your production data are keeping the rate of missing values stable and accounting for unseen values.
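Here's a minimal sketch of how such a check might look with pandas. The function name and the 5% alerting threshold are my own choices for illustration; a real monitoring setup would tune these to the dataset.

```python
import pandas as pd

def check_data_quality(train: pd.DataFrame, prod: pd.DataFrame,
                       threshold: float = 0.05) -> dict:
    """Flag columns whose missing-value rate shifted or that contain unseen categories."""
    report = {}

    # Compare the missing-value rate per column between training and production
    for col in train.columns:
        train_rate = train[col].isna().mean()
        prod_rate = prod[col].isna().mean()
        if abs(prod_rate - train_rate) > threshold:
            report[col] = f"missing rate changed from {train_rate:.1%} to {prod_rate:.1%}"

    # Look for categorical values present in production but never seen in training
    for col in train.select_dtypes(include=["object", "category"]).columns:
        unseen = set(prod[col].dropna().unique()) - set(train[col].dropna().unique())
        if unseen:
            report[col] = (report.get(col, "") + f" unseen values: {sorted(unseen)}").strip()

    return report
```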
How data quality issues impact model performance depends on the severity of the issues. The higher the rate of missing values, the stronger the negative impact on the model's performance.
[Figure: model performance drops as the rate of missing values increases]
However, it also depends on feature importance. If you leave an uninformative, correlated, or duplicated feature in your dataset, a higher rate of missing values might actually improve your model’s performance. In such cases, you might want to rethink your approach to feature selection.
[Figure: with an uninformative feature, a higher rate of missing values can actually improve performance]
Data Drift
Changes in the data that you pass to the model for predictions can also be more subtle. There are two types of data drift that can occur in production data: univariate and multivariate.
- Univariate drift means the distribution of a single variable has changed.
- Multivariate drift means the joint distribution of multiple variables has changed, even though the distributions of individual variables have not.
Univariate Data Drift

What are the reasons for this kind of covariate shift? To understand it, we have to realize that batches of production data are small samples, much smaller than the training data our model was trained on. Even if the training data was representative, balanced, diverse, and varied, a production batch can come from different regions of the distribution: some that were well represented during training and some that were less represented.
Let's imagine that we try to predict whether someone will buy insurance based on their age and earnings. If a new campaign is aimed at younger target groups, the production batches will suddenly contain a lot of records from the 16–18 age group working part-time jobs. That group was probably underrepresented in the training data, so the predictions for them would be less certain. Conversely, if another campaign focused on middle-aged men, a group that dominated our training dataset and now dominates even more, it wouldn't have a negative influence on the results from our model. In fact, it might even improve its predictions and decrease model uncertainty.
So, the sampling effect can influence the distributions of our data but will only impact our model’s performance if the data points come from underrepresented regions.
Univariate drift can occur not only because of sampling effects but also due to seasonal changes. Think about the summer holidays, when young people decide to work more; the distribution of their income will change for those two months.
If you want to keep track of univariate data drift, use methods like the Kolmogorov-Smirnov test, Jensen-Shannon divergence, Wasserstein distance, and Hellinger distance for continuous variables, and chi-squared test, Jensen-Shannon divergence, L-infinity distance, and Hellinger distance for categorical variables.
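Here's a rough sketch of a few of those tests using SciPy. The sample data and the 0.05 significance level are made up for illustration; in practice you would run this per feature against a stored reference sample.

```python
import numpy as np
from scipy.stats import ks_2samp, chi2_contingency
from scipy.spatial.distance import jensenshannon

# Continuous feature: Kolmogorov-Smirnov test on reference vs. production samples
reference = np.random.normal(loc=35, scale=10, size=5_000)   # e.g. age at training time
production = np.random.normal(loc=30, scale=10, size=1_000)  # a shifted production batch
stat, p_value = ks_2samp(reference, production)
print(f"KS statistic={stat:.3f}, p={p_value:.4f}, drift={'yes' if p_value < 0.05 else 'no'}")

# Categorical feature: chi-squared test on the contingency table of value counts
ref_counts = np.array([800, 150, 50])    # counts per category in reference data
prod_counts = np.array([500, 350, 150])  # counts per category in production data
stat, p_value, dof, expected = chi2_contingency(np.stack([ref_counts, prod_counts]))
print(f"chi2={stat:.1f}, p={p_value:.4f}")

# Jensen-Shannon distance between the two normalized category distributions
js = jensenshannon(ref_counts / ref_counts.sum(), prod_counts / prod_counts.sum())
print(f"Jensen-Shannon distance={js:.3f}")
```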
How bad is univariate data drift? You definitely shouldn't panic once you discover data drift in your production data. Its impact on your model's performance will largely depend on where the data drifted. Did it drift to regions that are well represented in your training dataset? That's great! It can actually improve your model's performance and how certain the model is about its predictions. Only when the production data drifts to regions that were poorly represented during training might your model's performance drop. What will drop for sure in that case, however, is how certain your model will be about its predictions.


Multivariate Data Drift
Another type of covariate shift is multivariate drift. This means that while the distributions of individual variables don’t change, the relationship between them does. Imagine we trained our model on data that had a positive correlation between age and income—the older a person, the more they earn per year.
However, one of the data batches consists of data drawn from another region, where the older you get, the less you earn. The distributions of age and income would remain the same, but the correlation between the two is different, which can impact the certainty and accuracy of your model’s predictions.

To detect that tricky kind of drift, you can use Principal Component Analysis (PCA), which captures the correlation structure between the variables. You reduce the dimensionality of your dataset and then reverse the process using the inverse_transform method. From the reconstructed data you can compute the reconstruction error, which is a good proxy for the presence of multivariate data drift.
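A minimal sketch of that PCA-based approach with scikit-learn, assuming purely numeric features; the number of components and the synthetic data are my own choices for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def reconstruction_error(pca, scaler, X):
    """Mean Euclidean distance between points and their PCA reconstruction."""
    X_scaled = scaler.transform(X)
    X_reconstructed = pca.inverse_transform(pca.transform(X_scaled))
    return np.mean(np.linalg.norm(X_scaled - X_reconstructed, axis=1))

# Fit the scaler and PCA on the reference data only
rng = np.random.default_rng(42)
X_ref = rng.multivariate_normal([0, 0], [[1, 0.8], [0.8, 1]], size=5_000)
scaler = StandardScaler().fit(X_ref)
pca = PCA(n_components=1).fit(scaler.transform(X_ref))

# Production batch with the correlation flipped: same marginals, different joint
X_prod = rng.multivariate_normal([0, 0], [[1, -0.8], [-0.8, 1]], size=1_000)

print("reference error: ", reconstruction_error(pca, scaler, X_ref))
print("production error:", reconstruction_error(pca, scaler, X_prod))
# A clearly higher error on the production batch hints at multivariate drift.
```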
Another approach is to use a binary classifier. In this method, we assign labels 0 and 1 to our reference dataset (the test set) and our analysis dataset (the production data) and check how well an estimator can differentiate between the two. If it has difficulties discerning the test data from the production data, the two look alike; if it separates them easily, we might have found multivariate data drift.
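A bare-bones version of that idea might look like this, assuming both datasets are already numeric. Any classifier with probabilistic outputs would do; here I use a random forest.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def domain_classifier_auc(X_reference, X_production):
    """ROC AUC of a classifier trying to tell reference and production data apart.
    An AUC close to 0.5 means the datasets look alike; close to 1.0 suggests drift."""
    X = np.vstack([X_reference, X_production])
    y = np.concatenate([np.zeros(len(X_reference)), np.ones(len(X_production))])
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    return cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()
```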
How bad is it for your model's performance? Just as in the case of univariate shift, it depends.
I created a simple dataset to test the influence of multivariate data drift on model performance. This is how I changed the reference dataset using a correlation factor ranging from -1 to 1: the distributions of the single covariates don't change, or change only slightly, but their joint distribution changes significantly, and I control the change with the correlation factor.
[Animation: the joint distribution of the two features changing with the correlation factor while the marginal distributions stay roughly the same]
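If you want to reproduce a similar setup, here is a rough sketch of how such batches can be drawn from a bivariate Gaussian with a chosen correlation factor. This illustrates the idea rather than the exact code behind my plots.

```python
import numpy as np

def sample_with_correlation(rho: float, n: int = 2_000, seed: int = 0):
    """Draw two standard-normal features with correlation rho.
    The marginals stay the same for every rho; only the joint distribution moves."""
    rng = np.random.default_rng(seed)
    cov = [[1.0, rho], [rho, 1.0]]
    return rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=n)

# Batches spanning correlation factors from -0.9 to 0.9
batches = {rho: sample_with_correlation(rho) for rho in np.linspace(-0.9, 0.9, 7)}
```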
Let's now plot the model's performance for the data samples I created. We can see that the negative correlation between the features had a positive effect on model performance, while on the samples with positive correlation the ROC AUC score dropped from 0.88 to 0.75, possibly because a high positive correlation aligns the data points close to the decision boundary, where the model is less confident.
[Figure: ROC AUC score across correlation factors]
And we can quickly confirm this by having a look at the model confidence heatmap for that specific correlation factor of 0.9.
[Figure: model confidence heatmap for a correlation factor of 0.9]
Concept Drift
The final boss that your model can encounter while serving is concept drift. And there is no good news: your model won’t survive the encounter.
To understand what concept drift is, we need to start with the 'concept' itself. The concept is what the model learns and captures: the relationship between the features and the target. We talk about concept shift when the (joint) probability distribution of the features doesn't change, but the probability of the target given the input features does.
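Stated a little more formally, writing $X$ for the features, $y$ for the target, and $t$ for time, the contrast between data drift and concept drift can be sketched as:

```latex
% Data drift: the input distribution moves, the concept stays
P_{t+1}(X) \neq P_{t}(X), \qquad P_{t+1}(y \mid X) = P_{t}(y \mid X)

% Concept drift: the inputs look the same, but the concept moves
P_{t+1}(X) = P_{t}(X), \qquad P_{t+1}(y \mid X) \neq P_{t}(y \mid X)
```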

In the case of binary classification, the model learns a decision boundary that divides the inputs into two groups, represented by labels 0 and 1. In our case, diamonds and stars are the ground truth labels. When we talk about pure concept drift, the distributions of covariates don’t change, but the way targets are distributed in relation to the covariates does. Our concept no longer reflects the reality in the new data, resulting in false predictions.
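To make that concrete, here's a toy simulation in which the feature distribution stays put while the labeling rule rotates between training and production. The data and labeling rules are entirely synthetic, for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(7)
X_train = rng.normal(size=(5_000, 2))
X_prod = rng.normal(size=(2_000, 2))  # same feature distribution as training

# Original concept: the label depends on the first feature
y_train = (X_train[:, 0] > 0).astype(int)

# Drifted concept: same features, but the label now depends on the second feature
y_prod = (X_prod[:, 1] > 0).astype(int)

model = LogisticRegression().fit(X_train, y_train)
print("accuracy on the old concept:", model.score(X_train, y_train))  # close to 1.0
print("accuracy after concept drift:", model.score(X_prod, y_prod))   # ~0.5, a coin flip
```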
How does that happen, and what can cause concept drift? Remember the river? Yes, the changing reality or moving your model from one reality to another. For example, emerging new trends can change the preferences of users in the streaming industry or customers in e-commerce. Inflation can lead to a concept shift in loan default predictions.
Whenever the relationship between your covariates and the target changes, and the decision boundary drifts away from its position at the time of training, your model's performance will drop. The only thing that can bring your model back to life is retraining.

Conclusion
Data quality issues, data drift, and concept drift might sound like the plagues of Egypt. They can lead to a "datastrophe," but not necessarily. These challenges are just part of the reality of working with machine learning models in production, and we should be aware of their impact and learn to manage them.
To recap:
- Data Quality refers to how accurate, complete, consistent, valid, unique, and timely your data is. Poor data quality can lead to unreliable models from the start. Once you notice increased rates of missing or unseen values, you should investigate the root cause in your upstream data pipelines.
- Data Drift happens when the distribution of your input data changes over time. This can be univariate (the distribution of a single feature changes) or multivariate (the relationships between features change). Its impact on model performance depends on where the data drifted to: better-represented regions can improve model confidence and performance, while poorly represented ones can have a negative influence.
- Concept Drift occurs when the relationship between your input features and target variable shifts, meaning the model's learned "concept" no longer matches reality. Once you detect this kind of drift, you need to prepare for model retraining.
By keeping a close eye on your data and understanding how and why it changes, you can catch problems early and keep your model performing well.
Reality is always changing—just like the river that’s never the same twice. Your model won’t last forever without updates, and that’s okay. The key is to embrace the change and stay ready to flow with it.