Introduction
Detecting data drift is an efficient way of getting to the bottom of why your model, which was doing so well in development, is suddenly underperforming. The Population Stability Index, or PSI for short, is a technique for quantifying data drift that has been widely used over the last 15-20 years. It is a univariate drift detection method popular in the financial sector, with many applications for credit scoring models.
Although this method has its merits, it definitely has shortcomings. In this blog, we explore and explain PSI, implement it in Python, and finally compare it to other univariate drift algorithms.
Covering the basics
NannyML’s blog contains a gold mine of information about ML monitoring and data drift, among other things. However, briefly defining some of the concepts covered in this blog is always a good way to kick things off.
Monitoring, in the context of MLOps, refers to the actions you take to continuously check your model’s performance once it has been pushed into production. The aim is to proactively ensure that your ML model is working correctly so that the intended outcome is being fulfilled. There are different techniques for monitoring your model's performance. Some methods don’t rely on access to target data (also called ground truth). These include DLE and CBPE, both developed by our team at NannyML.
Data drift is a term used to describe changes over time in the joint distribution of model inputs and model targets, denoted $P(X, Y)$. Simply put, this means that the relationship between your model’s input features and output has changed. When this change is specifically related to changes in the distribution of the input features, $P(X)$, we call this phenomenon covariate shift.
Population Stability Index (PSI)
The Population Stability Index, or PSI, is a method used to quantify covariate shifts between two univariate distributions. We are usually interested in finding changes in the distribution of an input feature in our ML model over time, as this might cause our model performance to drop.
Here, we will refer to the two distributions we compare using PSI as the reference and monitored data. The reference data represents the data (excluding the training set) on which we know our model performs well. We use this data as a baseline for comparing new data obtained in production. The monitored data, on the other hand, is the new data collected while the model is deployed in production.
Note that univariate drift detects data drift for a single input feature. When we deal with data drift of the joint distribution of some or all features, we call that multivariate drift.
One of PSI’s benefits is that it can be easily computed for continuous and categorical variables.
PSI for continuous variables
We first need to discretise the data to calculate the PSI for continuous variables. We split the data into bins to represent it as a histogram. There are many methods for binning. In the following example, we will use Doane’s formula to determine the number of bins.
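For reference, NumPy exposes Doane’s formula directly through its histogram utilities, which is what the implementation later in this blog relies on:
import numpy as np
# Doane's formula picks the number of bins from the data's size and skewness
data = np.random.default_rng(0).normal(size=1_000)
edges = np.histogram_bin_edges(data, bins="doane")
print(len(edges) - 1)  # number of bins suggested by Doane's formula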
For each bin $i$ we compute:
- $r_i$, the proportion of the reference data in the $i$-th bin,
- $m_i$, the proportion of the monitored data in the $i$-th bin.
Finally, we obtain the PSI with the following fairly straightforward equation:
$$\mathrm{PSI} = \sum_{i=1}^{B} (m_i - r_i) \ln\left(\frac{m_i}{r_i}\right)$$
where $B$ is the total number of bins.
The PSI gives us a non-negative value, which is commonly interpreted as follows:
- $\mathrm{PSI} < 0.1$ indicates there are no significant population changes,
- $0.1 \leq \mathrm{PSI} < 0.25$ points to moderate changes in the distribution of a given input feature,
- $\mathrm{PSI} \geq 0.25$ suggests significant changes in distribution, which can negatively impact your ML model’s performance.
Remember, these values are used as a rule of thumb and are not definitive benchmarks.
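To make the equation concrete, here is a small worked example with three bins and hypothetical proportions:
import numpy as np
# Hypothetical bin proportions for a three-bin example
reference_proportions = np.array([0.50, 0.30, 0.20])
monitored_proportions = np.array([0.40, 0.35, 0.25])
psi_value = np.sum(
    (monitored_proportions - reference_proportions)
    * np.log(monitored_proportions / reference_proportions)
)
print(psi_value)  # ~0.041, below 0.1: no significant population change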
PSI for categorical variables
Discretising the data is unnecessary to calculate the PSI for categorical variables. We simply consider $C$, the union of the categories from the reference and the monitored population. For each category $c \in C$, we compute:
- $r_c$, the proportion of the reference data in category $c$,
- $m_c$, the proportion of the monitored data in category $c$.
Now, we obtain the PSI with the following (slightly modified) equation:
$$\mathrm{PSI} = \sum_{c \in C} (m_c - r_c) \ln\left(\frac{m_c}{r_c}\right)$$
The same rules of thumb as in the continuous case are also commonly used for interpreting categorical PSI.
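As an illustration, here is a minimal sketch of categorical PSI (the provided notebook contains the implementation used in this blog; details such as the function name and the zero-replacement value below are our assumptions):
import numpy as np
import pandas as pd
def categorical_psi(reference, monitored, eps=1e-6):
    # C: union of the categories observed in either population
    categories = pd.Index(reference).union(pd.Index(monitored))
    # Per-category proportions, with 0 for categories absent from a population
    r = pd.Series(reference).value_counts(normalize=True).reindex(categories, fill_value=0)
    m = pd.Series(monitored).value_counts(normalize=True).reindex(categories, fill_value=0)
    # Replace zeroes to avoid taking the log of zero
    r = r.replace(0, eps)
    m = m.replace(0, eps)
    return float(np.sum((m - r) * np.log(m / r)))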
How to calculate PSI in Python
Here, we discuss a simple implementation of PSI in Python. We put the method to the test using the Credit Card Fraud Detection dataset available on Kaggle, beautifully assembled through a collaboration between Worldline and the Machine Learning Group of ULB. You can access the provided notebook to run the code yourself.
Note that this is an implementation and discussion of PSI for continuous variables; however, the notebook also contains an implementation of categorical PSI.
Each entry in the dataset represents one transaction. For security reasons, the dataset does not contain the original features. We have access to the 28 principal components obtained through PCA, as well as the transaction amount, transaction time, and the target, which indicates whether or not fraud occurred.
Univariate drift can occur in credit card transaction data for many reasons, including seasonal trends (spending behaviours change with holidays, for example, or during sales), economic conditions that can further impact spending behaviour, changes in the demographic adopting credit cards, and many other reasons.
First things first, we need to import the necessary libraries and our dataset:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime, timedelta
from google.colab import drive
# Mount Google Drive
drive.mount('/content/drive')
# Read the CSV files into pandas DataFrames
# Note: you might need to adjust the path depending on where the data is stored
df = pd.read_csv("/content/drive/My Drive/Colab Notebooks/PSI/creditcard_transactions.csv")
Then we proceed to do some minor data preprocessing. The Time column gives us the number of seconds elapsed between a given transaction and the first transaction in the dataset. The following code snippet converts it to DateTime format using an arbitrarily chosen start date.
# Define the base datetime
base_time = datetime(2024, 3, 20, 0, 0, 0)
# Convert 'Time' column to timedelta and add to the base_time
df['Time'] = pd.to_datetime(df['Time'].apply(lambda x: base_time + timedelta(seconds=x)))
We proceed to split the data in two: the first half of the transactions represents the reference data, while the second half represents the monitored data:
# Sort DataFrame by the new 'Time' column
df_sorted = df.sort_values(by='Time')
# Split DataFrame into reference_data and monitored_data with a 50% split
split_index = len(df_sorted) // 2
reference_data = df_sorted.iloc[:split_index]
monitored_data = df_sorted.iloc[split_index:]
The kernel density estimates for one of the features (V15) for both the reference and monitored data allow us to visually observe the univariate drift (the provided notebook contains the code for plotting the KDEs).
Finally, we implement the PSI method, which takes a reference and a monitored dataset, and optionally a number of bins. It outputs the PSI.
If the number of bins is not specified, Doane’s formula is used to compute it. In both cases, equal-width bins are used, although other strategies, including quantile binning, can be used.
def psi(reference, monitored, bins=None):
    """
    Calculate the Population Stability Index (PSI) between a reference dataset and a monitored dataset.

    Parameters:
        reference (numpy.array): The reference dataset, representing the baseline distribution.
        monitored (numpy.array): The monitored dataset, representing the distribution to compare against the reference.
        bins (int, optional): The number of bins to use for the histograms. If set to None, Doane's formula will be used to calculate the number of bins. Default is None.

    Returns:
        float: The calculated PSI value. A higher value indicates greater divergence between the two distributions.
    """
    # Get the full dataset
    full_dataset = np.concatenate((reference, monitored))

    # If bins is not parametrized, use Doane's formula to calculate the number of bins
    if bins is None:
        _, bin_edges = np.histogram(full_dataset, bins="doane")
    else:  # If the number of bins is specified, use equal-width bins over the full range
        bin_edges = np.linspace(min(min(reference), min(monitored)),
                                max(max(reference), max(monitored)), bins + 1)

    # Calculate the histogram for each dataset
    reference_hist, _ = np.histogram(reference, bins=bin_edges)
    monitored_hist, _ = np.histogram(monitored, bins=bin_edges)

    # Convert histograms to proportions
    reference_proportions = reference_hist / np.sum(reference_hist)
    monitored_proportions = monitored_hist / np.sum(monitored_hist)

    # Replace zeroes to avoid division by zero or log of zero errors
    monitored_proportions = np.where(monitored_proportions == 0, 1e-6, monitored_proportions)
    reference_proportions = np.where(reference_proportions == 0, 1e-6, reference_proportions)

    # Calculate PSI
    psi_values = (monitored_proportions - reference_proportions) * np.log(monitored_proportions / reference_proportions)
    return np.sum(psi_values)
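For instance, a call like the following computes the PSI for the V15 feature visualised above (a sketch; column access and formatting are our choices):
# Compute the PSI for V15 using Doane's formula for binning
psi_v15 = psi(reference_data["V15"].to_numpy(), monitored_data["V15"].to_numpy())
print(f"PSI for V15: {psi_v15:.3f}")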
When running this method using Doane’s formula for binning, we get a PSI value above 0.25, which, according to the benchmark values, indicates a significant drift. Now that we’ve shown how to implement PSI, we will test it against other univariate methods and compare the results to establish how good of a measure of covariate shift it is.
PSI vs other methods
NannyML OSS contains many univariate drift methods. To explore them, refer to our documentation or our comprehensive blog comparing the methods. Here, we see how PSI holds up against the Wasserstein distance and the Jensen-Shannon (JS) distance, both of which can easily be implemented using NannyML OSS.
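As a sketch of what this looks like (see the NannyML documentation for the exact parameters of UnivariateDriftCalculator; the configuration below is our assumption):
import nannyml as nml
# Compute JS and Wasserstein drift for V15 with NannyML OSS
calc = nml.UnivariateDriftCalculator(
    column_names=["V15"],
    continuous_methods=["jensen_shannon", "wasserstein"],
)
calc.fit(reference_data)
results = calc.calculate(monitored_data)
print(results.to_df())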
We conduct our experiments as follows: we compare JS distance, Wasserstein distance, and PSI for two normal distributions with varying means and standard deviations. The provided notebook contains the code for our experiments, including a more numerical approach and further experiments using other distributions, such as gamma distributions, but those results are not discussed here. Furthermore, previous research suggests that the PSI’s distribution is unaffected by the underlying distribution of the analysed variable. [1]
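A minimal sketch of this kind of experiment, reusing the psi function from above (the sample size, seed, and shift values are our assumptions; SciPy provides the Wasserstein distance):
from scipy.stats import wasserstein_distance
# Compare N(0, 1) against N(shift, 1) for increasing mean shifts
rng = np.random.default_rng(42)
reference_sample = rng.normal(loc=0, scale=1, size=100_000)
for shift in [0.1, 0.3, 0.6, 1.0]:
    monitored_sample = rng.normal(loc=shift, scale=1, size=100_000)
    print(
        f"shift={shift}: "
        f"PSI={psi(reference_sample, monitored_sample):.3f}, "
        f"Wasserstein={wasserstein_distance(reference_sample, monitored_sample):.3f}"
    )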
Both JS and Wasserstein distance are robust ways of detecting univariate drift. JS is known to detect small drift and will give a high value as soon as drift is detected. However, its value plateaus after a while, making it less suitable for quantifying the drift. Wasserstein, on the other hand, is less sensitive to small drifts than JS but increases linearly as the drift increases, making it perfect if your aim is to determine exactly how much drift there is in your data.
Our experiments reveal that PSI is less sensitive to small drifts than JS distance and less suitable than Wasserstein for quantifying drift intensity, as it does not increase linearly with drift. Additionally, we found that the commonly used threshold for interpreting drift (a PSI above 0.25 indicating significant drift) is often exceeded even with small drifts. For instance, this threshold is surpassed with two normal distributions having a mean shift of just 0.6.
However, we observe that the PSI grows slowly for small drifts, indicating that it might be suitable for obtaining more fine-grained results when quantifying very small shifts.
It should be noted that while these experiments provide a good indication of how PSI compares to other methods, they are not part of a rigorous research effort. These results serve as a good indication of how univariate methods can be compared, but further research is required before we decide whether to add PSI to NannyML OSS and NannyML Cloud.
Limitations of PSI
A limitation of PSI is that its performance depends on the binning strategy used. In this implementation, we used Doane’s formula by default to determine the number of bins. Doane’s formula is based on the data's skewness and sample size. When comparing it to fixed bin counts (10, 15, ..., 30 bins), we noticed only minor differences in the results.
However, a comprehensive paper on PSI discusses the importance of selecting an appropriate binning strategy and its impact on the value of PSI. Simulations carried out as part of the study suggest that percentile binning might be more appropriate than uniform binning (which we use in our implementation). The study suggests that the number of bins should be carefully chosen according to the sample size to ensure the method’s reliability and avoid bins with zero or few observations, which can impact the PSI. [1]
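For illustration, a percentile-binning variant only changes how the bin edges are computed; a minimal sketch (our own, not taken from the study):
def quantile_bin_edges(reference, n_bins=10):
    # Edges at the reference data's quantiles, so each bin holds
    # roughly the same number of reference observations
    edges = np.quantile(reference, np.linspace(0, 1, n_bins + 1))
    return np.unique(edges)  # drop duplicate edges caused by ties
These edges can then be passed to np.histogram in place of the equal-width edges used in psi().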
Further research is necessary before determining whether PSI should be implemented into NannyML OSS and NannyML Cloud.
Conclusion
Like every method, PSI has its strengths and weaknesses, and choosing to use it over other methods depends on your particular use case. The experiments described in this blog indicate that JS and Wasserstein distances might be more suitable than PSI for detecting small drifts and accurately quantifying the amount of drift. However, specific applications might find that PSI is more suitable.
One thing that is certain, however, is that detecting data drift is an important aspect of monitoring your deployed models, as it might reveal the cause of performance drops. NannyML OSS has many methods that allow you to monitor data drift easily. NannyML Cloud offers a way to monitor drift round-the-clock and receive alerts when necessary. NannyML Cloud also contains a suite of tools for monitoring models and is an incredible platform to integrate as part of your MLOps strategy.
Find out if NannyML is the right tool for your company today! Schedule a call with one of our founders to discover exactly what we can offer your company.
Want to learn more about post-deployment data science?
Here are a couple of other blogs for you to enjoy:
References
[1] Yurdakul, Bilal, "Statistical Properties of Population Stability Index" (2018). Dissertations. 3208. https://scholarworks.wmich.edu/dissertations/3208