How to Monitor a Credit Card Fraud Detection ML Model

Learn common reasons why fraud detection models degrade after deployment in production, and follow a hands-on tutorial to resolve these issues.

Introduction

Machine Learning (ML) is widely used for credit card fraud detection and has proven to be an effective tool. However, such models can degrade after deployment. Credit card fraud is a significant problem, with some findings indicating that worldwide losses could reach a staggering $38.5 billion by 2027, so fraud detection is a priority for many businesses. And while approaches to fraud detection continue to evolve, so do the fraudsters' methods. In this blog, we explore what can go wrong with these models in production and how to keep them performing effectively, so that fraud does not go undetected. Let's dive in!

What could cause your model to fail?

Concept drift, covariate shift, and data quality issues are common causes of model degradation over time. This section explores how these problems can occur in a credit card fraud detection model.

Concept drift

Intuitively, concept drift is a phenomenon where the relationship between the target variable that a model tries to predict and its input variables changes over time.
Given that the labeled training data can be described as samples from the joint distribution P(X, Y), where X represents the input features and Y is the target, an ML model learns the conditional distribution P(Y | X). Concept drift occurs when P(Y | X) changes over time.
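Written out, with t₁ and t₂ denoting two points in time (this is the standard formalization, not anything specific to fraud models):

$$P(X, Y) = P(Y \mid X)\,P(X)$$

$$\text{Concept drift:}\quad P_{t_1}(Y \mid X) \neq P_{t_2}(Y \mid X)\ \text{for some}\ t_1 \neq t_2,\ \text{even if}\ P(X)\ \text{stays the same.}$$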
Credit card fraud can broadly be defined as the use of credit cards where the cardholder does not authorize the charge. However, having a more detailed, unified definition in practice is challenging due to the various strategies fraudsters use. Here are just a few common examples of credit card fraud.
  • Card-present fraud: For example, creating counterfeit credit cards using real credit card information.
  • Card-not-present (CNP) fraud: Making transactions without physical possession of the credit card, typically online, over the phone, or by mail order.
  • Account takeover: Refers to instances where fraudsters gain unauthorized access to a legitimate user’s credit card account. This can happen because of phishing attacks or data breaches, for example.
A defining aspect of fraud detection is its adversarial nature. As methods for detecting fraud grow more robust and complex, so do the methods used by scammers, making the list above non-exhaustive and ever-changing.
Essentially, what is considered fraud changes over time; that is, our model’s output, given some input, changes. This is concept drift. The adversarial and ever-changing nature of fraud techniques makes fraud detection particularly prone to concept drift.

Covariate shift

Another phenomenon that can cause your model to drop in performance is the changing distribution of input features, known as covariate shift. These changes can occur due to seasonal variations or evolving trends in customer spending habits. They can also result from changes in transaction processes, often implemented to combat fraud.
For example, if a company notices a high number of fraudulent transactions occurring when the time between login and transaction is short, it might decide not to allow transactions within the first minute after login. Alternatively, the mix of transaction types (e.g., online, swipe, Apple Pay), which can be a feature of an ML model, might change. Such changes shift the distribution of the data the model sees at inference time. These are just two examples of how covariate shift can arise in transaction data and lead to a model deteriorating in production.
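In the same notation as before, covariate shift is the complementary condition to concept drift:

$$\text{Covariate shift:}\quad P_{t_1}(X) \neq P_{t_2}(X),\ \text{while}\ P(Y \mid X)\ \text{may remain unchanged.}$$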

Data quality issues

Credit card fraud detection is fundamentally a data-centric problem. Based on transactional data, the aim is to find patterns and abnormalities to identify fraudulent transactions, preferably as soon as they occur. The type of data used for credit card fraud detection presents a couple of challenges.
For example, inconsistent data formatting can occur when merchants use different vendors to facilitate credit card payments in their businesses.
Furthermore, we never have access to perfect data. Transactional data can contain errors, inconsistencies, and irrelevant information. Often, some transactions marked as non-fraudulent were actually fraudulent but were never identified as such.

Tutorial

For this tutorial, we will use a public dataset from Kaggle. The dataset consists of real transactions and was compiled by the Machine Learning Group at ULB and Worldline. For privacy reasons, the real features are unavailable; instead, we have access to the principal components obtained through PCA, the transaction amount, time, and whether or not the transaction was fraudulent. We will select the latest 30% of the transactions in the set to simulate a production environment where data is collected from live transactions.
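A minimal sketch of this setup is shown below. The column names Time, V1–V28, Amount, and Class come from the Kaggle dataset; the 70/30 chronological cut-off is the split described above.

```python
import pandas as pd

# Load the Kaggle "Credit Card Fraud Detection" dataset (creditcard.csv).
df = pd.read_csv("creditcard.csv")

# 'Time' holds seconds elapsed since the first transaction, so sorting by it
# orders the transactions chronologically.
df = df.sort_values("Time").reset_index(drop=True)

# Keep the earliest 70% as reference data and treat the latest 30%
# as simulated production traffic.
cutoff = int(len(df) * 0.70)
reference_df = df.iloc[:cutoff].copy()
production_df = df.iloc[cutoff:].copy()
```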
We will also use a model found on Kaggle as an example to demonstrate how to monitor it in production. This repository uses oversampling to address the imbalanced nature of the dataset and logistic regression as a binary classifier. Since this blog focuses not on building the model but on showing how to monitor it effectively, we will jump right to that part.
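For completeness, a minimal stand-in for such a model (oversampling plus logistic regression) could look like the sketch below; the exact preprocessing and resampling choices in the original Kaggle notebook may differ. The prediction columns it adds are the ones the monitoring steps below rely on.

```python
from imblearn.over_sampling import RandomOverSampler
from sklearn.linear_model import LogisticRegression

feature_columns = [c for c in reference_df.columns if c not in ("Time", "Class")]

# Oversample the minority (fraud) class in the reference data only.
X_res, y_res = RandomOverSampler(random_state=42).fit_resample(
    reference_df[feature_columns], reference_df["Class"]
)

model = LogisticRegression(max_iter=1000).fit(X_res, y_res)

# Score both data sets; the monitoring steps below rely on these columns.
for frame in (reference_df, production_df):
    frame["y_pred_proba"] = model.predict_proba(frame[feature_columns])[:, 1]
    frame["y_pred"] = (frame["y_pred_proba"] >= 0.5).astype(int)
```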
We follow NannyML’s monitoring workflow to monitor our model in NannyML Cloud.
The performance-based monitoring workflow developed by NannyML comprises three steps:
  1. Continuous performance monitoring
  2. Root cause analysis (carried out once a performance drop is detected)
  3. Issue resolution
The first step in our monitoring workflow is performance monitoring. This step needs to be carried out continuously and triggers us to take action when a performance drop is detected. As with many ML applications, access to ground truth is often delayed for credit card fraud detection models in production. In other words, we do not immediately know if a particular transaction was indeed fraudulent or not. This delay makes it particularly challenging to evaluate the performance of our model, as we cannot compare the model’s prediction to the real outcome in real time.
Luckily, NannyML offers tools for estimating model performance. For a classification model, we use PAPE, an algorithm developed by our team at NannyML. As seen below, we track two metrics: ROC AUC and recall. Recall is particularly important for credit card fraud detection because it measures the proportion of actual fraudulent transactions that are correctly identified by the model. A high recall indicates that the model is effectively catching most fraudulent transactions, which is crucial in minimizing financial losses and protecting customers. We notice that while the estimated performance is initially stable, it decreases at some point.
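PAPE is what NannyML Cloud uses here. In the open-source nannyml library, the closest readily available estimator is CBPE, so the sketch below uses it as a stand-in, with the reference and production frames built earlier; splitting into 10 chunks is an arbitrary illustrative choice.

```python
import nannyml as nml

# Estimate ROC AUC and recall on production data without ground-truth labels.
estimator = nml.CBPE(
    y_pred_proba="y_pred_proba",
    y_pred="y_pred",
    y_true="Class",
    problem_type="classification_binary",
    metrics=["roc_auc", "recall"],
    chunk_number=10,
)
estimator.fit(reference_df)  # the reference data still has labels
estimated_performance = estimator.estimate(production_df)
estimated_performance.plot().show()
```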
Estimated ROC AUC
Estimated recall
After noticing a drop in performance, we perform root cause analysis to understand why our model’s performance has dropped. Since we don’t have access to target data, we are unable to detect concept drift. Therefore, we investigate whether covariate shift has occurred.
We only have access to principal components (for privacy reasons), which limits our ability to interpret shifts in specific features. However, detecting covariate shifts can help us choose an appropriate resolution strategy to remedy our degraded model. When detecting univariate drift (changes in distributions of individual features), we observe that some of them do shift, and this corresponds with the drop in our model's performance. In the example below, we show just four of the features where we detect covariate shift using Jensen-Shannon distance. NannyML Cloud and NannyML OSS contain many other methods for detecting covariate shift. Find out all about them in our blog on the topic.
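A sketch of the same check with the open-source nannyml library, using the Jensen-Shannon method, is shown below; the column names follow the frames built earlier, and the four plotted components are the ones shown in the figure.

```python
import nannyml as nml

monitored_columns = [
    c for c in reference_df.columns
    if c not in ("Time", "Class", "y_pred", "y_pred_proba")
]

# Compare per-feature distributions between reference and production chunks.
drift_calculator = nml.UnivariateDriftCalculator(
    column_names=monitored_columns,
    continuous_methods=["jensen_shannon"],
    categorical_methods=["jensen_shannon"],
    chunk_number=10,
)
drift_calculator.fit(reference_df)
drift_results = drift_calculator.calculate(production_df)

# Inspect a few of the shifted principal components.
drift_results.filter(column_names=["V12", "V15", "V16", "V24"]).plot().show()
```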
Covariate shift detected in features V12, V15, V16, and V24 (Jensen-Shannon distance)
Next, we investigate whether there are any data quality issues. We notice that there is no missing data. However, this does not guarantee that our data is flawless; data scientists should always communicate with other teams to ensure the data is as reliable as possible.
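One way to run this check is sketched below, assuming a recent nannyml version that includes the data-quality calculators; a plain pandas check gives the same information.

```python
import nannyml as nml

# Track the number of missing values per chunk for the monitored columns.
mv_calculator = nml.MissingValuesCalculator(
    column_names=["Amount"] + [f"V{i}" for i in range(1, 29)],
    chunk_number=10,
)
mv_calculator.fit(reference_df)
mv_results = mv_calculator.calculate(production_df)
mv_results.filter(column_names=["Amount"]).plot().show()

# Quick pandas sanity check: count of missing values in the 'Amount' column.
print(production_df["Amount"].isna().sum())
```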
No missing values for the 'Amount' feature

Issue resolution

In a previous blog, I discussed how to fix underperforming models by exploring various issue resolutions depending on the cause of the performance drop. In this case, because covariate shift is likely the cause of the performance drop, simply retraining the model may not solve the issue. Furthermore, retraining may not be feasible at all if we don’t have access to the target data. We recommend either omitting features that are prone to shift from our predictive models or changing the decision thresholds. Here, we will discuss the second option.
Most binary classification models output a probability. Typically, if this probability is greater than 0.5, an instance is classified as fraud; otherwise, it is classified as a legitimate transaction. However, it is possible to change the decision boundary, meaning we can raise or lower the predicted probability at which an instance is classified as fraud. Let’s explore what changing the decision boundary would imply.
Possible outcomes for a credit card fraud detection model
Adjusting the decision boundary affects the balance between true positives (correctly identified frauds) and false positives (non-fraudulent transactions incorrectly flagged as fraud).
Raising the decision boundary will result in fewer false positives, meaning fewer transactions are incorrectly classified as fraudulent. This reduces inconvenience to customers and decreases the cost of manual reviews. On the flip side, it also means that more fraudulent transactions will go undetected. Conversely, lowering the boundary will catch more fraudulent transactions but will also increase the number of legitimate transactions flagged as fraud, potentially impacting customer satisfaction.
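As a small illustration of this trade-off, the sketch below scores the labeled reference data (production labels arrive with delay) at a lower, the default, and a higher boundary, reusing the model fitted earlier; the 0.3 and 0.7 cut-offs are arbitrary illustrative values.

```python
from sklearn.metrics import precision_score, recall_score

y_true = reference_df["Class"]
proba = reference_df["y_pred_proba"]

# Compare precision and recall as the decision boundary moves.
for threshold in (0.3, 0.5, 0.7):
    y_pred = (proba >= threshold).astype(int)
    print(
        f"threshold={threshold:.1f}  "
        f"precision={precision_score(y_true, y_pred, zero_division=0):.3f}  "
        f"recall={recall_score(y_true, y_pred, zero_division=0):.3f}"
    )
```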
Ultimately, finding the right balance when changing the decision boundary involves carefully considering business strategy and objectives. After adjusting the classification boundary, it is important to continue monitoring your model using the above-mentioned workflow to validate the changes made to it.

Conclusion

This blog illustrated what can go wrong with a credit card fraud detection model in production. We went over examples and showed how to resolve a failing model. Doing so keeps the model delivering value, which is particularly important in fraud detection, where a faltering model can have severe consequences for an organization.
At NannyML, we offer a wide suite of model monitoring tools. These are available both as an open-source Python library and as NannyML Cloud, a fully automated cloud application that provides round-the-clock automated monitoring, an alert system, and more. If you are interested in finding out how NannyML can help your organization fight against credit card fraud (or help you monitor any other ML model), book a call with one of our founders!

Further reading

In this blog, we focused on a common use case of ML: fraud detection. To ensure a repeatable and effective way of continuously monitoring your ML models and maintaining their performance after deployment into production, NannyML has developed a monitoring workflow. To learn more about this workflow, take a look at some of our other blogs.

Written by

Miles Weberman

Data Science Writer at NannyML