Table of Contents
- Introduction
- Data drift in a nutshell
- DRE vs. DC
- Experiment one: quantifying the magnitude of multivariate drift
- Experiment two: detecting small shifts
- Experiment three: performance with categorical variables
- Edge cases where DRE doesn’t capture multivariate drift
- Conclusions and further research
- Conclusion
- Further reading
Introduction
Unfortunately, your machine learning models in production will not age like fine wine. ML models are trained on historical data, and as the world changes rapidly, so does the data. Fortunately, many tools are available to help monitor your models in production and detect when and why things go awry. Integrating these tools into an effective monitoring strategy is essential to ensure your model doesn’t age like milk.
One common cause of ML model degradation is data drift. Several tools can help detect multivariate drift, a specific type of data drift. In this blog, we explain data drift and multivariate drift, and experimentally compare two multivariate drift detection methods developed by our team at NannyML: Data Reconstruction Error (DRE) and Domain Classifier (DC).
Data drift in a nutshell
Data drift is defined as a change in the statistical properties of data over time. In machine learning, data drift typically refers to changes between the statistical properties of the data used to train and test the model and the data the model encounters in production.
Mathematically, we can express this as follows: given that X represents the features and Y the target of an ML model, data drift is described as a change in their joint distribution P(X, Y).
There are several types of data drift. In this article, we focus on covariate shift and, more specifically, on multivariate drift.
Covariate shift is a type of data drift that refers to changes in the statistical properties of a model's features, that is, changes in the distribution of P(X). Covariate shift can be further categorised into univariate drift and multivariate drift. The former refers to changes in the distribution of a single feature, while the latter refers to changes in the joint distribution of some or all of the model's features.
While it is common to monitor univariate drift, multivariate drift detection tends to be overlooked. However, it is crucial to monitor both, as multivariate drift can sneakily occur even when univariate drift doesn’t. Here’s an example to illustrate how this might happen:
Imagine a bank using an ML model to predict whether an applicant will be approved for a loan. Two features that the dataset might contain are applicant age and applicant credit score. Individually, these features are not particularly prone to drift. After all, we can expect the age distribution of loan applicants to remain relatively stable over time, and the historical distribution of credit scores to be stable as well. However, their joint distribution might change. For example, due to technological advancements and increasingly high-paying technical jobs, often held by younger individuals, the relationship between age and credit score might shift, as we may observe younger applicants with higher credit scores. This is an example of multivariate drift occurring without univariate drift, and it could lead to a degrading model in production.
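The sketch below makes this concrete with illustrative numbers (not taken from any real dataset): the marginal distributions of age and credit score stay fixed, while the relationship between them changes.

```python
# Minimal sketch: multivariate drift without univariate drift.
# The means, standard deviations, and correlation values are illustrative.
import numpy as np

rng = np.random.default_rng(42)

def sample_applicants(correlation, n=10_000):
    """Draw (age, credit_score) from a bivariate normal with fixed marginals."""
    mean = [40, 650]                      # mean age and mean credit score
    std_age, std_score = 10, 80
    cov = [
        [std_age**2, correlation * std_age * std_score],
        [correlation * std_age * std_score, std_score**2],
    ]
    return rng.multivariate_normal(mean, cov, size=n)

reference = sample_applicants(correlation=0.6)   # older applicants tend to score higher
monitored = sample_applicants(correlation=-0.2)  # the relationship has shifted

# The marginals barely move, but the joint structure has changed:
print(reference.mean(axis=0), reference.std(axis=0))
print(monitored.mean(axis=0), monitored.std(axis=0))
print(np.corrcoef(reference.T)[0, 1], np.corrcoef(monitored.T)[0, 1])
```

A univariate drift detector looking at each column in isolation would see nothing unusual here, while a multivariate method can pick up the change in the joint distribution.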
NannyML has developed two methods for multivariate drift detection: Domain Classifier (DC) and Data Reconstruction Error (DRE). Both methods have dedicated blogs and documentation, so we will not delve into explaining them here, as the focus of this blog is to compare both methods experimentally.
DRE vs. DC
In this section, we describe the experiments and findings from our comparison of DRE and DC with the objective of evaluating the suitability of these methods for different models.
We hypothesize that there are differences in:
- The ability to quantify the magnitude of drift.
- The ability to detect small drifts.
- Compatibility with categorical and continuous data.
Further, we also explore edge cases where DRE is unable to detect multivariate drift.
All experiments and their results can be reproduced using the notebook provided. The notebook contains detailed steps, code, and data necessary to replicate our findings and further investigate the nuances of DRE and DC methods.
Experiment one: quantifying the magnitude of multivariate drift
For the first experiment, we create a reference dataset that we compare with a monitored dataset in which we introduce drift. Both datasets contain four features sampled from normal distributions.
The reference data is sampled from a standard normal distribution, with a mean of 0 and a standard deviation of 1. We then conduct two sets of experiments on the monitored dataset. In the first, we increase the mean, first across all features and then for only one feature. In the second, we do the same with the standard deviation, increasing it across all features and then for just one feature.
NannyML algorithms are evaluated at the chunk level, meaning we assess multivariate drift for each chunk in the dataset based on a given chunk size. For our experiment, we chose a chunk size of 3,000, and for every 3,000 entries in our monitored dataset, we increase the mean or standard deviation by 0.1, depending on the respective experiment.
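The sketch below shows how this setup could look with NannyML OSS. The calculator class names are from the library, but argument names can differ between versions, and the number of chunks here is illustrative rather than the exact value used in the notebook.

```python
# Sketch of the experiment-one setup (simplified; not the notebook's exact code).
import numpy as np
import pandas as pd
import nannyml as nml

rng = np.random.default_rng(0)
chunk_size, n_features, n_chunks = 3_000, 4, 11   # n_chunks is illustrative
cols = [f"f{i}" for i in range(n_features)]

# Reference data: four standard normal features (mean 0, std 1).
reference = pd.DataFrame(
    rng.normal(0, 1, size=(chunk_size * n_chunks, n_features)), columns=cols
)

# Monitored data: the mean grows by 0.1 with every chunk of 3,000 rows.
monitored = pd.DataFrame(
    np.vstack([rng.normal(0.1 * i, 1, size=(chunk_size, n_features)) for i in range(n_chunks)]),
    columns=cols,
)

# Data Reconstruction Error (PCA-based) drift calculator.
dre_calc = nml.DataReconstructionDriftCalculator(column_names=cols, chunk_size=chunk_size)
dre_calc.fit(reference)
dre_results = dre_calc.calculate(monitored)

# Domain Classifier drift calculator (argument names may vary by version).
dc_calc = nml.DomainClassifierCalculator(feature_column_names=cols, chunk_size=chunk_size)
dc_calc.fit(reference)
dc_results = dc_calc.calculate(monitored)

dre_results.plot().show()
dc_results.plot().show()
```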
The results are visualized below, with the scores output by DRE and DC shown on the y-axis. The red diamonds indicate alerts when the DRE and DC scores surpass a predefined threshold.
Results for DRE:
Results for DC:
Visually, it appears that DRE is better at quantifying the magnitude of drift, as the DRE value increases linearly with the size of the shift.
We confirm this numerically by calculating the correlation between the magnitude of the drift and the DRE and DC scores, respectively, where we observe a near-perfect correlation for DRE.
Correlation between shift magnitude and DRE result:
- Mean shift in all features: 0.9969
- Std shift in all features: 0.9994
- Mean shift in one feature: 0.9822
- Std shift in one feature: 0.9973
Correlation between shift magnitude and DC result:
- Mean shift in all features: 0.7141
- Std shift in all features: 0.7665
- Mean shift in one feature: 0.8908
- Std shift in one feature: 0.9277
Experiment two: detecting small shifts
The experiments that test whether DRE and DC differ in their ability to capture small shifts follow a methodology similar to the first set. However, instead of increasing the mean or standard deviation by 0.1 for each chunk, we increase it by 0.02 and examine shifts of up to 1 in the mean and 1 in the standard deviation.
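For reference, the monitored data for this experiment can be generated as a small variation on the earlier sketch; the number of chunks is again illustrative.

```python
# Experiment-two variant: same setup, but the per-chunk increment drops from
# 0.1 to 0.02, so roughly 50 chunks cover shifts up to 1.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
chunk_size, n_features, increment = 3_000, 4, 0.02
cols = [f"f{i}" for i in range(n_features)]

chunks = [
    rng.normal(increment * i, 1, size=(chunk_size, n_features))
    for i in range(51)                      # shifts of 0.00, 0.02, ..., 1.00
]
monitored_small_shifts = pd.DataFrame(np.vstack(chunks), columns=cols)
```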
Visually, both DRE and DC scores appear to increase roughly linearly with these small shifts, with DC picking up even smaller shifts than DRE. Note that in the images below, the alert threshold is set to its default value, but lowering it could be useful if detecting tiny shifts is important for a specific use case.
Results for DRE:
Results for DC:
We validate these findings numerically once again and find that, while the results of both methods are highly correlated with the magnitudes of the small shifts, DC slightly outperforms DRE. Additionally, we observe that for the smallest shifts, DC's results increase slightly faster than DRE's, indicating that DC may be more suitable for detecting tiny shifts.
Correlation between shift magnitude and DRE result:
- Small mean shift in all features: 0.9577
- Small std shift in all features: 0.9983
- Small mean shift in one feature: 0.9398
- Small std shift in one feature: 0.9893
Correlation between shift magnitude and DC result:
- Small mean shift in all features: 0.9939
- Small std shift in all features: 0.9784
- Small mean shift in one feature: 0.9925
- Small std shift in one feature: 0.9853
Experiment three: performance with categorical variables
To analyze how well both methods detect multivariate drift in categorical data, we created reference and monitored datasets, each with four categorical features containing eight categories. We generated an instance of the monitored dataset where all features had shifted, as well as an instance where only one feature had shifted.
Additionally, we analyzed a case with two categorical features and two continuous features, where all the features had shifted in the monitored dataset.
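Below is a sketch of how such categorical data could be generated. The category labels and probability vectors are illustrative rather than the ones used in the notebook, and the resulting data frames can be passed to the same calculators as in the earlier sketch.

```python
# Sketch: four categorical features with eight categories each, where the
# monitored data draws from shifted category probabilities (illustrative values).
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
categories = list("ABCDEFGH")
p_reference = np.full(8, 1 / 8)                                  # uniform over 8 categories
p_shifted = np.array([.02, .03, .05, .10, .15, .15, .20, .30])   # drifted category mix

def sample_categorical(p, n=30_000, n_features=4):
    data = {f"c{i}": rng.choice(categories, size=n, p=p) for i in range(n_features)}
    return pd.DataFrame(data)

reference_cat = sample_categorical(p_reference)
monitored_all_shifted = sample_categorical(p_shifted)       # all features drifted

monitored_one_shifted = sample_categorical(p_reference)     # only one feature drifted
monitored_one_shifted["c0"] = rng.choice(categories, size=len(monitored_one_shifted), p=p_shifted)
```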
We observed that while DRE was effective at detecting shifts when all categorical features had shifted, it struggled to detect the smaller multivariate drift where only one feature had shifted. It was also unable to detect shifts in the instance with mixed categorical and continuous features, although it has proven effective on datasets with a larger number of features.
DC, on the other hand, was able to detect multivariate drift in all of the above-mentioned instances. This could be because the algorithm underlying DC is LightGBM, which handles categorical features well and can capture non-linear relationships between them.
Results for DRE:
Results for DC:
Edge cases where DRE doesn’t capture multivariate drift
DRE works by fitting PCA on the reference data and then performing a transform followed by an inverse transform on chunks of both the reference and monitored datasets. We then measure the reconstruction errors and compare them between the reference and monitored datasets to determine if there is multivariate drift. If a drift leaves the reconstruction error unchanged, for instance because the linear structure that PCA captures is preserved, DRE will not detect it.
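To illustrate the mechanism, here is a from-scratch sketch using scikit-learn's PCA. It mirrors the idea described above rather than NannyML's exact implementation, which may differ in preprocessing and in how the number of components is chosen.

```python
# Sketch of the DRE idea: fit PCA on reference data, then compare per-chunk
# reconstruction errors between reference and monitored data.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def reconstruction_error(pca, scaler, data):
    """Mean Euclidean distance between points and their PCA reconstructions."""
    scaled = scaler.transform(data)
    reconstructed = pca.inverse_transform(pca.transform(scaled))
    return np.mean(np.linalg.norm(scaled - reconstructed, axis=1))

rng = np.random.default_rng(0)
# Reference: two correlated features (illustrative covariance).
reference = rng.multivariate_normal([0, 0], [[1, 0.8], [0.8, 1]], size=10_000)

scaler = StandardScaler().fit(reference)
pca = PCA(n_components=1).fit(scaler.transform(reference))  # keep fewer components than features

# A drifted chunk where the correlation structure has changed:
drifted = rng.multivariate_normal([0, 0], [[1, -0.8], [-0.8, 1]], size=3_000)

print("reference error:", reconstruction_error(pca, scaler, reference))
print("drifted error:  ", reconstruction_error(pca, scaler, drifted))
```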
This can happen, for example, when the data has undergone a non-linear transformation that preserves the correlations between features as well as each feature's mean and standard deviation. The following plots show an example where each feature was transformed using the square root and then scaled to preserve the standard deviation. The provided notebook contains the code to reproduce these results. We observe that DRE is unable to detect the multivariate drift, whereas DC successfully identifies it.
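Here is a sketch of how such a transformation might be constructed, assuming non-negative reference features so that the square root is well defined; the notebook's exact construction may differ. Each feature is rescaled afterwards so its mean and standard deviation match the reference, and the resulting data can be passed to the calculators from the earlier sketch to compare the two methods.

```python
# Sketch of the edge case: a non-linear (square-root) transformation that
# preserves each feature's mean and standard deviation. Illustrative only.
import numpy as np

rng = np.random.default_rng(2)
reference = rng.uniform(0, 1, size=(10_000, 4))   # non-negative features

transformed = np.sqrt(reference)
# Rescale each column to match the reference mean and standard deviation.
monitored = (
    (transformed - transformed.mean(axis=0)) / transformed.std(axis=0)
) * reference.std(axis=0) + reference.mean(axis=0)

# First and second moments match, yet the distribution's shape has changed.
print(reference.mean(axis=0), monitored.mean(axis=0))
print(reference.std(axis=0), monitored.std(axis=0))
```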
Conclusions and further research
The experiments outlined above demonstrate that DRE and DC behave differently depending on the properties of the data and the drift present within it.
We conclude that DRE is more suitable than DC for quantifying the magnitude of the drift. If, for a particular model, it is important not only to detect the presence of multivariate drift but also to measure the strength of the drift, then DRE is likely the better choice for you.
For our second experiment, we conclude that DC slightly outperforms DRE in detecting small distribution shifts. It also appears that both algorithms are better at detecting small shifts in standard deviation than shifts in the mean, although this has not been thoroughly examined and remains an area for future research.
Our third experiment demonstrated that, in instances where we are dealing mostly with categorical features, DC appears to be more effective.
We also note that DRE is more computationally efficient, and therefore faster, than DC on a given dataset. For a more comprehensive analysis of the speed of NannyML’s algorithms, check out the blog written by Taliya.
For the purpose of this blog, we focused on four hypotheses for which we conducted experiments. However, a more rigorous approach would be necessary to gain a deeper understanding of the performance of both methods. For example, we only tested our methods on data drawn from normal distributions for continuous features. For categorical features, we used eight categories with arbitrarily chosen distributions. A more comprehensive study could extend these experiments to other distributions that are commonly encountered in practice.
Another hypothesis we omitted from this study is that DC is likely more robust to noisy data, which presents an interesting area for future research. Additionally, it would be valuable to explore other edge cases where either method proves ineffective, as well as to examine the scalability of both methods by introducing more features into the analysis.
Conclusion
In this article, we explained multivariate drift and highlighted why detecting it is so important. We compared two methods, Domain Classifier (DC) and Data Reconstruction Error (DRE), and outlined the pros and cons of each to help you decide which method is right for you.
If you want to try these methods yourself, you can easily implement them using NannyML OSS in Python. Moreover, you can head over to NannyML’s blog to learn about other available methods and explore topics related to post-deployment data science.
For more advanced capabilities, NannyML Cloud offers a variety of data drift detection methods, along with a suite of tools to help you implement a comprehensive model monitoring system with round-the-clock checks and alerts. To find out how NannyML can support your organization, schedule a call with one of our founders.
Further reading
This blog highlighted the importance of monitoring multivariate drift and provided insight into two methods for detecting it. Detecting data drift is a small yet crucial part of an effective post-deployment monitoring strategy. At NannyML, we have developed a comprehensive monitoring workflow. If you want to learn more about it, the following blogs are a great resource:
If you’re interested in learning more about the appropriate steps to take after discovering that covariate shift is the cause of your degrading model, check out the following blog: