There's Data Drift, But Does It Matter?

A couple of months ago, a paper comparing NannyML with other ML monitoring tools (Evidently and Alibi Detect) popped up on arXiv. Naturally, I read it a couple of times—cover to cover—to really understand their comparison methodology and look for ways to improve NannyML. After going through it with my super-biased lens, I couldn’t help but feel like there were a few things the authors could’ve done better when comparing the tools.
This week, I finally had some time to write those thoughts down and even ran a few experiments using the same data and models they did. In this post, I’ll point out what I liked about the paper and discuss what I think could’ve been done better. I’ll back up my opinions with experimental results to show how I’d approach things differently, and compare the outcomes.
But before jumping into that, let me summarize what the paper is about.

TLDR of the paper

In it, the authors examined the capabilities of NannyML, Evidently AI, and Alibi-Detect. They used two real-world datasets and analyzed the tools’ ability to detect univariate data drift. They also compared non-functional criteria, such as integrability with ML pipelines, adaptability to diverse data types, user-friendliness, computational efficiency, and resource demands.
[Figure: General architecture of the comparison framework.]
[Figure: Runtime and RAM consumption results of the compared tools.]
It's a pretty easy-to-read paper. I encourage you not only to stick with what I mention here but also to take a look at the paper yourself!

What was great about the paper?

What I liked about this paper was how the authors used a research methodology to compare industry-relevant tools. Another standout for me was their focus on non-functional comparisons. You hardly ever see benchmarks digging into stuff like runtime and RAM consumption between tools with a “researchy” lens. That’s a huge plus, especially since those are the kind of details that can make or break tool adoption in actual production environments.

Things I wish had been done better.

While reading the paper, I noticed four main patterns that I consider red flags in the context of ML monitoring. Let’s go over them one by one.

Red flag 1: using training data as reference data

The first one, and probably the most problematic, is that the authors used training data as the reference data for monitoring. This is one of the most common mistakes when monitoring an ML model. ML monitoring tools typically need a reference dataset, a period where things worked as expected, to fit their internal methods. This gives the tool a baseline of what the data usually looks like, so that when future data starts deviating from that baseline, the tool can flag the change.
Many people and tools advocate using the training set as the reference dataset. At first glance, this doesn't sound like a bad idea. The issue is that machine learning models tend to overfit their training data, so if we use training data as the reference for monitoring, the expectations for model behaviour will be unrealistic. A much better approach is to use a dataset that is completely separate from model training but on which we already know how the model performs. For that, you can't go wrong with the model's test set!
ℹ️ For newly deployed models, the reference dataset is usually the test set. For a model that has been in production for some time, the reference dataset is usually a benchmark period selected from the model's production data during which the model performed as expected.
 
I ran some experiments using the models from the paper to show just how different things look when you use training data vs. testing data as the reference for monitoring.
[Figure: univariate drift results using training data (left) vs. test data (right) as the reference set.]
On the left, the method fitted on training data produces more alerts: fitting on training data sets unrealistic expectations, which narrows the thresholds and triggers multiple alarms during the monitored period. On the right, we get a more realistic result; this time, the monitoring method was fitted on the model's test set.
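To make the setup concrete, here's a minimal sketch of how this kind of comparison can be wired up with NannyML's open-source library. The feature name, chunk size, and synthetic dataframes below are placeholders I made up for illustration, not the paper's actual data or configuration.

```python
import numpy as np
import pandas as pd
import nannyml as nml

# Synthetic stand-in data: the paper's datasets aren't public, so we fake a
# single numeric feature. test_df plays the role of the model's test set and
# analysis_df the production data we want to monitor.
rng = np.random.default_rng(42)
test_df = pd.DataFrame({"feature_1": rng.normal(0.0, 1.0, 5_000)})
analysis_df = pd.DataFrame({"feature_1": rng.normal(0.3, 1.0, 20_000)})

calc = nml.UnivariateDriftCalculator(
    column_names=["feature_1"],
    continuous_methods=["jensen_shannon"],
    chunk_size=2_000,
)

# The key point of red flag 1: fit the calculator on data the model never
# trained on (the test set), not on the training set, so that the drift
# thresholds reflect realistic expectations.
calc.fit(test_df)

results = calc.calculate(analysis_df)
print(results.to_df())  # per-chunk drift values and alert flags
# results.plot(kind="drift").show()  # interactive plot, if running locally
```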

Red flag 2: not providing the chunk size of their experiments

Chunk size is how you tell NannyML, or most monitoring software, to aggregate your monitoring results. The chunk size might vary depending on how frequently you want to monitor or how much data your model sees. One team might be interested in monitoring the model every week. Others might want more granular results and monitor the model every day. The relevant bit to keep in mind is that the chunk size should be big enough so that we get reliable results.
When the chunks are too small, statistical results become unreliable. In such cases, results are governed by sampling noise rather than the actual signal. For example, when the chunk size is small, what could look like a significant drift may only be a sampling effect.
The authors didn’t provide the chunk size they used in the analysis, so the overall results are hard to trust.
[Figure: univariate drift results with a chunk size of 500 data points (left) vs. 2,000 data points (right).]
Here, we use the same data as the paper to show how chunk size can completely change the results. On the left, with a chunk size of 500 data points, the results are wiggly and not very reliable. On the right, with a chunk size of 2,000 data points, the reference period looks much more stable, so we can be more confident about the results seen in the analysis period.
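The effect of sampling noise is easy to reproduce without any monitoring tool at all. The short simulation below (my own synthetic example, not the paper's data) draws chunks from the exact same distribution as a reference sample and measures the two-sample Kolmogorov-Smirnov statistic: nothing has drifted, yet small chunks still produce large, noisy drift scores.

```python
import numpy as np
from scipy.stats import ks_2samp

# Draw chunks from the SAME distribution as the reference and see how big the
# KS drift statistic looks purely due to sampling noise.
rng = np.random.default_rng(0)
reference = rng.normal(0.0, 1.0, 50_000)

for chunk_size in (500, 2_000):
    stats = [
        ks_2samp(reference, rng.normal(0.0, 1.0, chunk_size)).statistic
        for _ in range(200)
    ]
    print(
        f"chunk_size={chunk_size}: mean KS={np.mean(stats):.3f}, "
        f"max KS={np.max(stats):.3f}"
    )

# Smaller chunks yield larger and more variable statistics even though the
# data never changed, which is exactly how spurious drift alerts are born.
```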

Red flag 3: comparing only univariate drift methods

Univariate drift detection is just a small part of what ML monitoring is about. From my perspective, it would have been nice to see a comparison where the authors followed a typical monitoring workflow to compare the three tools.
[Figure: a typical ML monitoring workflow.]
I understand that the scope of such a comparison would be too big and that the authors probably wanted to check data drift methods only. But in that case, it would have been nice to also add multivariate drift detection to the mix.
While techniques for detecting univariate drift are relatively straightforward, they might overlook relationship changes between features. That is, while the distributions of individual features remain unchanged, their joint distributions might have shifted. This is why we need multivariate drift methods to identify changes in the joint distributions of some or all features.
[Figure: two features whose individual distributions stay the same while their joint distribution shifts.]
The image above illustrates multivariate drift between two features without any univariate drift. The univariate distribution of each feature remains practically unchanged, as seen in the overlapping density curves of both features. However, the joint distributions differ significantly. This example demonstrates how the relationship between variables can change (multivariate drift) even when their individual distributions do not.
NannyML has two methods to help data scientists detect multivariate drift when it occurs. The first employs Principal Component Analysis (PCA) and is discussed in depth, with a practical example, in another article on NannyML's blog. The second is the Domain Classifier approach, which is explained in a separate article.
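As a rough sketch of the PCA-based approach, the snippet below builds a toy dataset like the one in the image above (two features whose marginals stay the same while their correlation flips) and runs it through NannyML's data reconstruction calculator. The data, column names, and chunk size are made up for illustration.

```python
import numpy as np
import pandas as pd
import nannyml as nml

# Toy example of multivariate drift without univariate drift: both features
# keep N(0, 1) marginals, but their correlation flips from +0.8 to -0.8.
rng = np.random.default_rng(7)
cov_reference = [[1.0, 0.8], [0.8, 1.0]]
cov_analysis = [[1.0, -0.8], [-0.8, 1.0]]
reference_df = pd.DataFrame(
    rng.multivariate_normal([0, 0], cov_reference, 10_000), columns=["x1", "x2"]
)
analysis_df = pd.DataFrame(
    rng.multivariate_normal([0, 0], cov_analysis, 10_000), columns=["x1", "x2"]
)

# PCA learns the joint structure of the reference data; when that structure
# changes, the reconstruction error rises even if the marginals don't move.
calc = nml.DataReconstructionDriftCalculator(
    column_names=["x1", "x2"],
    chunk_size=2_000,
)
calc.fit(reference_df)
results = calc.calculate(analysis_df)
print(results.to_df())  # reconstruction error per chunk, with alert flags
```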

Red flag 4: focusing only on data drift and not on model performance

This! This is a really important point. Every day, I see LinkedIn and Twitter posts about people asking for solutions to monitor data drift and retrain ML models when drift occurs. But hear me out: Not. Every. Drift. Affects. Model. Performance.
When we monitor our models by tracking data drift alone, we don't know how it will affect model performance.
Three possible outcomes can happen when you have data distribution drift:
  1. Model performance stays the same ― (shift occurs in similar regions)
  2. Model performance improves ↗️ (likely, the data moved to more certain regions)
  3. Model performance degrades ↘️ (likely, the data moved closer to the class boundary)
Most people and tools assume that data drift always correlates with performance degradation. However, just by looking at drift results, it's impossible to tell in which direction a change will affect model performance. That's why monitoring estimated performance metrics, rather than data drift, can be more valuable.
I used one of NannyML’s performance estimation methods, DLE (Direct Loss Estimation), to check whether we could estimate any performance degradation caused by the data drift. To my surprise, there was none.
[Figure: estimated performance (DLE) for the regression model, with no performance alerts.]
Since we have the ground truth data, I could compare how well DLE estimated the metric RMSE—and it did a pretty decent job. Look how close the estimated and actual RMSE values are to each other.
[Figure: estimated vs. actual RMSE.]
It is impressive how DLE is able to estimate RMSE without knowing the actual values.
Under the hood, DLE trains an internal ML model that predicts the loss of the monitored model. Both models use the same input features, but the internal model uses the monitored model's loss as its target. After estimating the loss, DLE converts it into a performance metric. This gives us a way to observe the expected performance of an ML model even when we don't have access to the actual target values. Check out our docs to learn more about how DLE works.
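Here is a minimal, self-contained sketch of how DLE is used in practice. Since I can't share the paper's data, the snippet generates a toy regression problem; the feature names, the simulated "monitored model", and the chunk size are all assumptions made up for illustration.

```python
import numpy as np
import pandas as pd
import nannyml as nml

rng = np.random.default_rng(1)

def make_split(n: int, shift: float = 0.0) -> pd.DataFrame:
    """Toy regression data plus predictions from a fake 'monitored model'."""
    X = pd.DataFrame({
        "feature_1": rng.normal(shift, 1.0, n),
        "feature_2": rng.normal(0.0, 1.0, n),
    })
    y_true = 3 * X["feature_1"] + 2 * X["feature_2"] + rng.normal(0.0, 1.0, n)
    y_pred = 3 * X["feature_1"] + 2 * X["feature_2"]
    return X.assign(y_true=y_true, y_pred=y_pred)

reference_df = make_split(10_000)             # e.g. the model's test set
analysis_df = make_split(20_000, shift=0.5)   # production data with some drift

estimator = nml.DLE(
    feature_column_names=["feature_1", "feature_2"],
    y_pred="y_pred",
    y_true="y_true",
    metrics=["rmse"],
    chunk_size=2_000,
)

estimator.fit(reference_df)          # actual targets are only needed here
estimated = estimator.estimate(analysis_df)
print(estimated.to_df())             # estimated RMSE per chunk + alert flags
```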
I ran the same process with the other model used in the paper and got a similar result. This time, it was a classification model, so I used NannyML's PAPE method. And I got zero estimated alerts!
[Figure: estimated performance (PAPE) for the classification model, with no performance alerts.]
Again, since we have the ground truth data, I can compare how well PAPE estimated the metric F1. And it worked like a charm!
[Figure: estimated vs. actual F1.]
This means that none of the data drift alerts reported in the paper were actually relevant, as none of them affected model performance!
Hopefully, this clarifies why monitoring data drift alone is pointless. It’s like expecting to go north by following directions from a broken compass.

Conclusion

If there’s one thing I want you to take away from this post, it’s this: never monitor data drift alone. We care about model performance, so that’s what we should be tracking. And with methods to estimate performance metrics, the old excuse of “we don’t have ground truth yet” is no longer valid. Also, never use your training set as the reference for monitoring.
I’d like to wrap up by thanking the paper’s authors for being open to collaboration and for sharing the data with me. I hope my comments come across as constructive and not harsh.
Lastly, you can find all the code for these experiments and plots in this GitHub repository. Unfortunately, I can’t share the data since I don’t have permission from the authors.


Written by
Santiago Víquez
Machine Learning Developer Advocate at NannyML