Table of Contents
- Introduction
- The Zillow Example
- Monitoring Workflow
- Reasons for Model Failure
- Data Quality
- Univariate Data Drift
- Multivariate Data Drift
- NannyML to the Rescue
- Data Quality
- Missing Values Calculator
- Unseen Values Calculator
- Univariate Drift Detection
- Multivariate Drift Detection
- Data Reconstruction Drift Calculator
- Domain Classifier Calculator
- When to use data drift detection?
- Summary
Introduction
So you've just done the job. You gathered some data and put a lot of effort into cleaning and wrangling it to get the most out of it. After thorough experimentation, you built a model; you tuned the heck out of your hyperparameters and achieved satisfying performance. It's time to put the model into production and turn its predictions into business value. But then, sooner or later, the results start to deteriorate¹. The decisions made based on your model's predictions either bring no value or even cause losses. You need answers: what went wrong?
The Zillow Example
A similar thing happened at Zillow in 2021. The company, which provides real estate information to buyers, had to close its house-flipping department and lay off 25% of its workforce due to the failure of the Zestimate algorithm, reporting a loss of more than $600M². The Zestimate algorithm was proprietary software that underwent a serious revamping, but the details of why it failed have never been disclosed. It has been speculated that the problems arose from multiple factors, including data latency and a lack of robustness to the market volatility caused by the pandemic. What seems clear is that thorough monitoring of their systems could have caught the problems far earlier, if not prevented them.
Monitoring Workflow
By thoroughly monitoring the model's performance, you can mitigate its decay. You are able to counteract the impact of delayed or absent ground truth, and you can ensure that your model is robust enough to withstand volatility in the market you're aiming to model. With a good monitoring system at your disposal, you should be able to track your model's performance and, whenever it deteriorates, perform Root Cause Analysis, which is crucial for resolving the issues that have arisen.
Reasons for Model Failure
A machine learning model can fail in various ways and for different reasons. It can stop producing output due to a bug in the infrastructure, or it can produce inaccurate results. The reason might be hidden in poor project scoping; for example, if you overlook data quality issues when defining your project, the data at hand might be an inaccurate proxy for the reality you’re attempting to model, resulting in poor model performance and lost resources.
Even if you get ML-ready and clean data to build a project with promising results, data quality issues can arise again after deployment. Missing or new and unseen values can strain the adaptability of your estimator and lead to poor performance. Additionally, when reality changes—and thus, our data changes—our model may fail to make accurate predictions based on the new distributions of our variables and the relationships between them.
Whenever your model’s performance starts showing signs of degradation, it's crucial to investigate the root cause by checking the data quality and the presence of covariate shift. This can help identify and address the issues impacting your model's accuracy and reliability, ensuring that your model remains robust and effective.
In this blog, we will focus on identifying data quality issues and two types of data drift. I will show you how to use NannyML OSS to address these concerns whenever they pop up and endanger your project.
Data Quality
Collecting and extracting the right data is an art in itself. You either have to ensure that automated systems, like sensors, work properly, or make sure that data entry personnel don't make mistakes and introduce inaccuracies. Unfortunately, even if everything works perfectly, you can still face data quality issues because the world is constantly changing. New and unseen data will be passed to your estimator, testing its robustness. This can, in turn, degrade the accuracy of your predictions, increase maintenance and operation costs, and, in the worst-case scenario, undermine the trust and reliability you have earned with your stakeholders.
Univariate Data Drift
Univariate data drift means that there have been changes in the probability distributions of a certain variable. This happens quite often, as data is a reflection of the ever-changing reality. These changes can be due to seasonality, new trends, or expanding markets.
Once detected, univariate data drift doesn't necessarily mean trouble. If our production data contains more data points in the areas where our model is more certain, the drift won't have any negative impact on the model's performance. The situation changes if the production data shifts to regions that were under-represented during training or moves to less certain regions close to the decision boundary.
Thus, triggering alerts every time univariate data drift occurs can cause alert fatigue: a situation where the team starts to ignore the warnings, which is arguably the true danger of univariate data drift.
That's why it is crucial to track estimated performance metrics and measure the actual impact that data drift has on model performance.
If you’d like to learn more about it, have a look at this blog post:
Multivariate Data Drift
Data drift can also occur when there's no apparent change in the distributions of single covariates. This peculiar situation arises when the distribution of a single variable doesn't change, but the relationships between multiple variables do. Therefore, detecting multivariate data drift is trickier than comparing the probability distributions of individual variables.
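To make this concrete, here is a minimal synthetic sketch (plain NumPy/SciPy, not NannyML): the individual columns of both samples follow the same distribution, but the relationship between them flips, so only a multivariate check would notice.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# two samples whose marginal distributions are identical...
x_ref = rng.normal(size=10_000)
x_ana = rng.normal(size=10_000)
ref = np.column_stack([x_ref, x_ref + rng.normal(scale=0.1, size=10_000)])
ana = np.column_stack([x_ana, -x_ana + rng.normal(scale=0.1, size=10_000)])

# ...so univariate tests barely notice anything (large p-values)...
print(stats.ks_2samp(ref[:, 0], ana[:, 0]).pvalue)
print(stats.ks_2samp(ref[:, 1], ana[:, 1]).pvalue)

# ...while the correlation between the columns flips from roughly +1 to roughly -1
print(np.corrcoef(ref.T)[0, 1], np.corrcoef(ana.T)[0, 1])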
NannyML to the Rescue
NannyML is a powerful post-deployment Data Science tool that helps keep an eye on your model's performance, discover when your model is experiencing problems, and detect the root cause of those issues. Let's have a look at how it works.
For this purpose, I used a dataset I initially obtained from Sourcestack while participating in a hackathon last year. I tried to answer the following question: "Did ChatGPT replace interns and juniors in engineering jobs?" This year, I obtained another sample of the data, which makes it a perfect dataset for checking whether a model trained on last year's data has run into univariate or multivariate data drift, or into data quality issues such as an increased number of missing values or new, unseen categories.
The dataset contains information about vacancies, specifically engineering jobs. It consists of variables describing the job's hours type (full-time, part-time, etc.), whether the job is remote, the required education, the seniority expected from the candidate, and the country where the job offer is available. The target variable is the estimated compensation in dollars. I sampled the data on two different occasions, in June 2023 and May 2024, so both samples contain job postings published and active during the previous three months. This affects the continuity of the plots but shouldn't make the results difficult to interpret.
To run data quality checks and univariate and multivariate drift detection algorithms, we need a reference dataset and an analysis dataset. The reference dataset establishes the baseline that the analysis dataset is compared against; we should use our test set as the reference, not the training set. The analysis dataset consists of our production data, the data we want to monitor.
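If you're building these sets from your own production logs, a minimal, hypothetical sketch of such a split could look like this (the file and column names are made up for illustration; the preparation of this post's datasets is in the notebook mentioned below).
import pandas as pd

# split logged production data into reference and analysis sets by time
df = pd.read_csv('my_model_logs.csv', parse_dates=['prediction_time'])
reference = df[df['prediction_time'] < '2023-07-01']   # evaluation-period data with known performance
analysis = df[df['prediction_time'] >= '2023-07-01']    # production data we want to monitor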
Let’s start with installing NannyML, importing the libraries, and then loading the reference and analysis datasets.
If you’re curious about how to prepare both datasets, check my notebook.
pip install nannyml
import pandas as pd
import nannyml as nml
from IPython.display import display
reference = pd.read_csv('data/reference.csv')
analysis = pd.read_csv('data/analysis.csv')
reference['job_published_at'] = pd.to_datetime(reference['job_published_at'])
analysis['job_published_at'] = pd.to_datetime(analysis['job_published_at'])
Data Quality
NannyML gives you an automated way to monitor data quality. It consists of two calculators that allow you to track changes in the number of missing values and unseen values for categorical variables passed to the model in production.
Missing Values Calculator
The Missing Values Calculator counts the number of missing values in each data chunk and, by default, normalizes it to a rate. We specify the features we would like to monitor; the chunk size and timestamp column are optional. If you'd like to see the absolute count, set normalize to False.
mv_calc = nml.MissingValuesCalculator(
column_names=['hours', 'remote', 'education', 'seniority', 'country'],
timestamp_column_name='job_published_at',
chunk_size=500,
# uncomment the code below to see the absolute count
# normalize = False,
)
mv_calc.fit(reference)
results = mv_calc.calculate(analysis)
results.plot().show()
The results show the missing values rate (blue line) and the threshold (red dashed line) for that rate based on the reference dataset.
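If you prefer raw numbers over the plot, the results object can also be exported to a DataFrame; a small sketch, assuming the to_df() method available in recent NannyML releases.
# inspect the per-chunk values and thresholds as a (multi-level) DataFrame
results_df = results.to_df()
display(results_df.head())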
The results indicate that the missing values rate didn't change in the hours, remote, education, and seniority columns, as there have been no missing values in either the reference or the analysis dataset. The missing values rate crossed the threshold in the country column, although its relative share has actually decreased in the analysis dataset. This improvement may come from better data collection processes at the company, which would mean higher data quality. Another possibility is a change in how missing data is handled; for instance, the data team may have started imputing missing values instead of leaving them as is. Such a change can increase the signal-to-noise ratio in the data. Both scenarios should be investigated to identify the root cause of this behavior.
Unseen Values Calculator
The Unseen Values Calculator helps you keep track of changes in the categorical variables.
New values that our model was not trained on might be a result of an error but could also be an early sign of emerging trends. In both cases, it’s crucial to track these changes, identify the possible causes, and prevent performance degradation before it becomes serious.
This time, I'll look at the absolute count of unseen categories and set the normalize argument to False. I will pass the features that I want to monitor along with the optional timestamp and chunk size.
uv_calc = nml.UnseenValuesCalculator(
column_names=['hours', 'remote', 'education', 'seniority', 'country'],
timestamp_column_name='job_published_at',
chunk_size=500,
normalize = False,
)
uv_calc.fit(reference)
results = uv_calc.calculate(analysis)
results.plot().show()
Since the calculator detects unseen values by comparing the observed categories in the analysis set to the reference dataset, there will be no “unseen” values in the reference dataset.
In the analysis dataset, the remote, education, and seniority columns are free of unseen categories. However, in the hours column, a single occurrence of an unseen category appeared. Additionally, the country variable contains between 1 and 12 new values, depending on the chunk we're examining.
This insight is quite valuable. In the sample from April 2024, we encountered job offers from countries that were not in the sample from June 2023. This could indicate one of two things: either the company started scraping data from new sources, or the job market has changed, with engineering jobs beginning to appear in new countries. This behavior should be further investigated to ensure our model can generalize and capture these changes.
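NannyML flags that unseen values appeared; to see which countries they are, a quick pandas check outside NannyML is enough. A minimal sketch, assuming the same reference and analysis DataFrames loaded above.
# categories present in production but never seen in the reference data
new_countries = set(analysis['country'].dropna()) - set(reference['country'].dropna())
print(sorted(new_countries))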
Univariate Drift Detection
Now, we can start with the Univariate Drift Detection. I initialize the calculator, set the chunk size, and specify the categorical variables and the timestamp column. I also pass the statistical methods the calculator will use to measure the distance between the data chunks from my reference and analysis datasets. For the categorical features, I use the versatile Jensen-Shannon distance, which can handle both categorical and continuous distributions. Let's also inspect the distribution of our target variable, the only continuous variable in this dataset; for that, I employ the Kolmogorov-Smirnov test. If you'd like to dive deep into the details of the various methods, this blog post is a great source of knowledge:
After fitting the UnivariateDriftCalculator on the reference set, I can pass the analysis set to the calculate method and visualize the results.
ud_calc = nml.UnivariateDriftCalculator(
column_names=reference.columns.to_list(),
treat_as_categorical = ['hours', 'remote', 'education', 'seniority', 'country'],
timestamp_column_name='job_published_at',
continuous_methods=['kolmogorov_smirnov'],
categorical_methods=['jensen_shannon'],
chunk_size=500
)
ud_calc.fit(reference)
results = ud_calc.calculate(analysis)
figure = results.filter(
column_names=results.categorical_column_names,
methods=['jensen_shannon']
).plot(kind='drift')
figure.show()
The Jensen-Shannon distance for the remote column of our reference and analysis sets didn't trigger any alert, indicating that no data drift has been detected. However, the alert threshold was exceeded in the last chunk of the education variable and in three chunks of the seniority variable, and quite a few chunks showed signs of data drift in the hours and country variables. To better understand the distributions of our categorical covariates, let's examine them more closely.
figure = results.filter(
column_names=results.categorical_column_names,
methods=['jensen_shannon']
).plot(kind='distribution')
figure.show()
The distributions for remote changed slightly, but not enough to trigger a drift warning. We are also able to identify the cause of the drift elsewhere. In two data chunks of the hours feature, the number of data points in the 'part-time' and 'full-time' categories increased, while the number of occurrences in the 'unclear' category decreased. The same applies to seniority, where the number of occurrences of 'unclear seniority' increased, especially in the last two chunks. The most pronounced drift can be observed in the country column: there have been significant changes in the number of jobs in the United Kingdom and India, and the number of 'Missing' entries has decreased.
Let's now look at the side-by-side comparison of the distributions of our continuous columns: the predicted values (y_pred) and the ground-truth values (comp_dol).
figure = results.filter(
column_names=results.continuous_column_names,
methods=['kolmogorov_smirnov']
).plot(kind='distribution')
figure.show()
Again, we can notice some changes in the distribution patterns, yet they remain largely consistent across the reference and analysis sets, with an alert triggered only in the last chunk of our analysis dataset.
figure = results.filter(
column_names=results.continuous_column_names,
methods=['kolmogorov_smirnov']
).plot(kind='drift')
figure.show()
A quick look at the drift plot for the UnivariateDriftCalculator confirms that the Kolmogorov-Smirnov test resulted in a drift alert for the last 500 data points of our analysis dataset.
Multivariate Drift Detection
However, as mentioned earlier in this blog post, the absence of univariate drift doesn't mean that our model won't suffer from multivariate data drift. Unfortunately, detecting this type of drift isn't as straightforward as applying basic statistical tests. NannyML offers two methods to capture occurrences of multivariate data drift.
Does the sample from 2024 show signs of multivariate data drift?
Data Reconstruction Drift Calculator
The Data Reconstruction Drift Calculator uses Principal Component Analysis (PCA), a compression technique that reduces the dimensionality of the data, transforming it into a new set of uncorrelated variables, called principal components, which retain the most important information while discarding noise and redundancy.
The trick is to reverse that compression, compare the original and reconstructed data, and compute the reconstruction error: how much information has been lost. If we do that on our analysis dataset, we can compare the resulting error to the baseline error on our reference dataset. If the error exceeds a given threshold, the relationships between the variables in our dataset must have changed.
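Here is a simplified sketch of that mechanism using scikit-learn on synthetic data; it is not NannyML's exact implementation (NannyML also takes care of encoding categorical features, scaling, and thresholding for you).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def reconstruction_error(scaler, pca, X):
    # compress, decompress, and measure how much information is lost per row
    X_scaled = scaler.transform(X)
    X_hat = pca.inverse_transform(pca.transform(X_scaled))
    return np.mean(np.linalg.norm(X_scaled - X_hat, axis=1))

rng = np.random.default_rng(0)

# reference: 5 columns driven by 2 latent factors, i.e. strongly correlated
latent = rng.normal(size=(1_000, 2))
X_ref = latent @ rng.normal(size=(2, 5)) + 0.1 * rng.normal(size=(1_000, 5))

# analysis: same marginal scales, but the correlation structure is gone
X_ana = rng.normal(size=(1_000, 5)) * X_ref.std(axis=0)

scaler = StandardScaler().fit(X_ref)
pca = PCA(n_components=2).fit(scaler.transform(X_ref))

print(reconstruction_error(scaler, pca, X_ref))  # small baseline error
print(reconstruction_error(scaler, pca, X_ana))  # larger error signals changed relationships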
Let’s initialize the Data Reconstruction Drift Calculator, passing the column names we want to monitor, timestamp data, and chunk period or, in this case, chunk size. Let’s fit it on the reference dataset and calculate the reconstruction error for the analysis dataset.
drd_calc = nml.DataReconstructionDriftCalculator(
column_names=reference.columns.to_list(),
timestamp_column_name='job_published_at',
chunk_size=200
)
reference['job_published_at']=reference['job_published_at'].astype('object')
analysis['job_published_at']=analysis['job_published_at'].astype('object')
drd_calc.fit(reference)
results = drd_calc.calculate(analysis)
figure = results.plot()
figure.show()
The reconstruction error exceeded the threshold in two data chunks. The difference from the reconstruction error in our reference dataset isn’t huge. Let’s check the results for a chunk size of 500 to exclude the sampling effect as the culprit.
drd_calc = nml.DataReconstructionDriftCalculator(
    column_names=reference.columns.to_list(),
    timestamp_column_name='job_published_at',
    chunk_size=500
)
# refit and recalculate with the larger chunks
drd_calc.fit(reference)
results = drd_calc.calculate(analysis)
results.plot().show()
After changing the chunk size to 500, the alert isn’t triggered, although we can see that the error values are really close to the threshold. This highlights the importance of proper chunking to counteract the sampling effect. Let’s now check if the Domain Classifier Calculator will yield similar results.
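For time-stamped data, chunking by calendar period is often a more natural choice than a fixed row count. A small variation of the calculator above, assuming the chunk_period argument from NannyML's chunking options.
drd_calc = nml.DataReconstructionDriftCalculator(
    column_names=reference.columns.to_list(),
    timestamp_column_name='job_published_at',
    chunk_period='M',  # one chunk per calendar month instead of a fixed number of rows
)
drd_calc.fit(reference)
results = drd_calc.calculate(analysis)
results.plot().show()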
Domain Classifier Calculator
Another way of detecting multivariate data drift is the Domain Classifier Calculator. This method uses an LGBM classifier to discern between data samples coming from our reference and analysis datasets. The AUROC metric gives insight into how easy it is for the model to differentiate between the two datasets. If it's easy (a high AUROC value), multivariate data drift has occurred in our production data. If the model has trouble telling the difference, there haven't been any significant changes to our data.
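Conceptually, this works like the sketch below, where scikit-learn's gradient boosting stands in for the LGBM classifier NannyML uses; it is an illustration of the idea, not NannyML's implementation.
import pandas as pd
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import cross_val_score

features = ['hours', 'remote', 'education', 'seniority', 'country']

# label reference rows 0 and analysis rows 1, then check how separable they are
X = pd.get_dummies(pd.concat([reference[features], analysis[features]], ignore_index=True))
y = [0] * len(reference) + [1] * len(analysis)

auroc = cross_val_score(
    HistGradientBoostingClassifier(), X, y, scoring='roc_auc', cv=5
).mean()
print(auroc)  # ~0.5: sets are indistinguishable; close to 1.0: strong multivariate drift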
This time, I will initialize the calculator with a chunk size equal to 500.
dc_calc = nml.DomainClassifierCalculator(
feature_column_names=['hours', 'remote', 'education', 'seniority', 'country'],
timestamp_column_name='job_published_at',
chunk_size=500
)
dc_calc.fit(reference)
results = dc_calc.calculate(analysis)
figure = results.plot()
figure.show()
The Domain Classifier Calculator with a chunk size of 500 detected multivariate data drift in the last two chunks. Although the threshold was exceeded by only a few hundredths, which might seem negligible from a broader perspective, the Domain Classifier's heightened sensitivity to minor changes and its need for more data than the PCA-based detector suggest that we should approach the production data with caution in the near future.
Both algorithms provide a solution for identifying minor and subtle shifts in correlations between variables—shifts that univariate methods might overlook. However, it's important to note that these algorithms are less explainable, and pinpointing where the shift happened can require more work.
If you're interested in which algorithm suits your use case best, check this blog post:
Despite this, their ability to detect nuanced changes in data patterns makes them handy tools for maintaining model accuracy and reliability.
When to use data drift detection?
Once deployed, a machine learning model should be continuously monitored for its performance.
The moment a drop in performance is detected, the next step is to conduct a root cause analysis to find the culprit of the change. This is where data drift detection methods can be helpful. However, designing your monitoring system based on data drift detection only will be ineffective, as not every data drift leads to performance degradation, and performance degradation can also result from other causes such as data quality issues and concept drift.
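As a sketch of that first line of defense for this regression use case, NannyML's DLE (Direct Loss Estimation) estimator can estimate performance even before ground truth arrives; the snippet below assumes the same column names used throughout this post and the DLE API of recent NannyML releases.
# estimate RMSE and MAE on production data without waiting for ground truth
estimator = nml.DLE(
    feature_column_names=['hours', 'remote', 'education', 'seniority', 'country'],
    y_pred='y_pred',
    y_true='comp_dol',
    timestamp_column_name='job_published_at',
    metrics=['rmse', 'mae'],
    chunk_size=500,
)
estimator.fit(reference)
results = estimator.estimate(analysis)
results.plot().show()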
Summary
Thorough and continuous monitoring of deployed machine learning models is crucial to prevent performance degradation and ensure robustness, especially in volatile markets. The Zillow case is an example of how costly it can be to neglect it. The moment you see signs of degrading performance, it is essential to conduct root cause analysis to determine if the culprit is data quality or drift.
NannyML offers methods to detect these issues, providing safeguards against the risks of inconsistent data quality and changing data distributions. It provides tools for ensuring data consistency and out-of-the-box algorithms for detecting different types of covariate shift.
If you want to learn more about NannyML and how to leverage its functionalities to keep an eye on your business value, check out: