Top 3 Custom Metrics Data Scientists Should Build for Finance: A Tutorial


Introduction

As a data scientist, you've probably encountered scenarios where standard ML metrics fail to reflect the true complexity of your business KPIs. In those moments, a custom metric is needed to provide more meaningful insights, but knowing how to build and track one isn’t always straightforward.
With NannyML Cloud’s new Custom Metrics feature, you can now effortlessly define, monitor, and even estimate the impact of personalised metrics—all within a single platform.
In this blog, we’ll explore the differences between traditional and custom metrics and examine finance-specific classification and regression models. We’ll also walk through a step-by-step tutorial on how to set up and use these features in NannyML Cloud.

What you miss out on with traditional metrics

Traditional metrics typically provide a one-dimensional view of model performance, failing to account for the nuanced landscape of financial decision-making.
One significant limitation is the inability of standard metrics to reflect the true business impact of model decisions. They often lack sensitivity to the vastly different costs associated with various types of errors in financial predictions or classifications. In fraud detection, for instance, a missed fraudulent transaction typically costs far more than the manual review triggered by a false alarm. This cost-insensitivity can lead to models that appear statistically sound but fail to optimize for actual business outcomes such as profitability or risk mitigation.
For your finance models, custom metrics can be designed to reflect specific outcomes, such as profitability, risk management, or customer satisfaction. This provides a clearer picture of how well the model achieves its intended goals.
For the tutorial part, we will be using NannyML Cloud, a monitoring platform with state-of-the-art algorithms designed to catch and resolve post-deployment issues.
With this understanding, let's discuss some popular situations where the new Custom Metric feature might come in handy.

Credit Card Fraud Detection

Credit card fraud detection models, just like other models, are not immune to changes. Problems such as data drift, evolving fraud patterns, and changes in fraudster tactics can affect model performance over time.
💡
For a more in-depth look at these post-deployment challenges and strategies for addressing them, refer to this blog on maintaining and improving fraud detection systems.
Credit card fraud detection requires careful evaluation of multiple performance metrics to ensure the model's effectiveness across various aspects of prediction and real-world applicability. Let’s discuss a few metrics.

Accuracy

Accuracy measures the proportion of correct predictions among the total predictions made. While it provides an overall performance overview, it can be misleading, especially when dealing with imbalanced classes.
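As a quick reminder, in terms of the confusion matrix counts (TP, TN, FP, FN):

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$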
Accuracy Metric
You might wonder how the estimates for each metric are calculated. Here, they are derived using the Probabilistic Adaptive Performance Estimation (PAPE) method, which can estimate the performance of classification models without labels, even under covariate shift. Read this blog to learn how.
In fraud detection, where fraudulent transactions are much less frequent than legitimate ones, high accuracy might not tell the whole story. A model that predicts "no fraud" for all transactions could still achieve high accuracy but miss many actual fraud cases. Therefore, accuracy alone doesn’t fully reflect how well the model identifies fraudulent transactions.

Balanced Accuracy

Balanced Accuracy addresses the challenge of class imbalance by providing a fairer performance measure across different classes. Unlike accuracy, which the majority class can skew, Balanced Accuracy averages the sensitivity (true positive rate) and specificity (true negative rate) of the model.
With this metric, you can understand the model's performance in contexts where the minority class is just as important as the majority class.
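Concretely:

$$\text{Balanced Accuracy} = \frac{\text{Sensitivity} + \text{Specificity}}{2} = \frac{1}{2}\left(\frac{TP}{TP + FN} + \frac{TN}{TN + FP}\right)$$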
Balanced Accuracy is not a traditional metric, so it is not available in NannyML Cloud by default. The good news is that we can add it!
To add a metric, log into your NannyML dashboard and click the Custom Metrics button at the top right corner. This will help you find all your metric configurations in one place.
NannyML Dashboard
You can find them neatly separated by problem type, i.e., binary classification, multiclass classification, and regression.
Custom Metrics Dashboard
📌
The “Used by” button for each metric lists all the models that use it, so you can find your models easily.
To add a new metric for binary classification, you will have to prepare two functions: a calculate function and an estimate function.
The calculate function processes inputs such as true labels and predicted labels to compute the metric’s value for a specific chunk of data. It operates like any standard Python function and returns the aggregated result for that chunk. This function is crucial for evaluating the metric on individual data chunks.
import pandas as pd
from sklearn.metrics import balanced_accuracy_score
import numpy as np

def calculate(
    chunk_data,
    y_pred
) -> float:
    # Pull the ground-truth labels from the chunk and align them with the predictions
    y_true = chunk_data['target'].to_numpy()
    y_pred = np.asarray(y_pred)

    # Drop rows where the target is missing
    data = pd.DataFrame({'y_true': y_true, 'y_pred': y_pred})
    data.dropna(axis=0, inplace=True, subset=['y_true'])

    y_true, y_pred = data['y_true'].to_numpy(), data['y_pred'].to_numpy()

    # Nothing to score in this chunk
    if len(y_true) == 0:
        return np.nan

    return balanced_accuracy_score(y_true, y_pred)
The estimate function monitors and estimates the metric even when ground truth is unavailable. It relies on estimated probabilities to approximate the metric’s value. Similar to calculate, the estimate function should be a valid Python function that returns the aggregated result for each chunk.
import pandas as pd
import numpy as np

def estimate(
        estimated_target_probabilities,
        y_pred
) -> float:
    y_pred = np.asarray(y_pred)
    estimated_target_probabilities = estimated_target_probabilities.to_numpy().ravel()

    # Pair each prediction with its estimated probability of belonging to the positive class
    data = pd.DataFrame({
        'estimated_target_probabilities': estimated_target_probabilities,
        'y_pred': y_pred
    })
    data.dropna(axis=0, inplace=True)

    estimated_target_probabilities = data['estimated_target_probabilities'].to_numpy()
    y_pred = data['y_pred'].to_numpy()

    # Build the expected confusion matrix: each probability contributes fractionally
    # to TP/FP (when the model predicted 1) or to FN/TN (when it predicted 0)
    tp = np.sum(np.where(y_pred == 1, estimated_target_probabilities, 0))
    fp = np.sum(np.where(y_pred == 1, 1 - estimated_target_probabilities, 0))
    fn = np.sum(np.where(y_pred == 0, estimated_target_probabilities, 0))
    tn = np.sum(np.where(y_pred == 0, 1 - estimated_target_probabilities, 0))

    # Expected sensitivity and specificity, combined into balanced accuracy
    sensitivity = tp / (tp + fn) if (tp + fn) > 0 else 0
    specificity = tn / (tn + fp) if (tn + fp) > 0 else 0
    balanced_accuracy = (sensitivity + specificity) / 2

    return balanced_accuracy
It's also a good idea to define the metric's value range; for example, balanced accuracy lies between 0 and 1. NannyML Cloud uses this range to set thresholds, ensuring the metric's plot scale remains accurate. Without these thresholds, plots may become distorted if values grow too large.
Editing a metric
Once you have saved your metric, go to your model's dashboard and open the performance section under settings. Under the standard and custom metrics headers, you will find a list of all valid metrics for your model.
Adding custom metric to a model
Once you choose your metric, save your settings. Then, go to Model Summary and run the algorithms again with the new settings.
Model Summary page
After a successful run, your custom metrics will be added to your monitoring setup.
Balanced Accuracy Metric

Prevalence

Prevalence is a metric that represents the proportion of positive cases in a dataset. In binary classification problems, it is calculated as:

$$\text{Prevalence} = \frac{TP + FN}{TP + FP + TN + FN} = \frac{\text{Number of positive cases}}{\text{Total number of cases}}$$
In straightforward terms, prevalence tells you how often a particular outcome occurs in your data. If you think of your dataset as a snapshot of a specific problem, prevalence shows how common that problem is within that snapshot.
If you have a dataset of 10,000 transactions and 500 of these are fraudulent, the prevalence of fraud in this dataset is 5%. This metric is important because it helps assess the problem's overall scale.
In our model, a high prevalence indicates that many transactions are fraudulent, which might suggest the need for more thorough fraud detection measures.
Prevalence Metric
import pandas as pd
import numpy as np

def calculate(
    chunk_data,
    y_pred
) -> float:
    y_true = chunk_data['target'].to_numpy()
    y_pred = np.asarray(y_pred)

    data = pd.DataFrame({'y_true': y_true, 'y_pred': y_pred})
    data.dropna(axis=0, inplace=True, subset=['y_true'])

    y_true, y_pred = data['y_true'].to_numpy(), data['y_pred'].to_numpy()

    if len(y_true) == 0:
        return np.nan

    # Calculate Prevalence
    true_positives = np.sum((y_true == 1) & (y_pred == 1))
    false_negatives = np.sum((y_true == 1) & (y_pred == 0))
    total_positive_cases = true_positives + false_negatives
    total_instances = len(y_true)

    prevalence = total_positive_cases / total_instances if total_instances > 0 else np.nan
    return prevalence

import numpy as np
import pandas as pd

def estimate(
    estimated_target_probabilities: pd.DataFrame,
    y_pred: pd.Series,
    y_pred_proba: pd.DataFrame,
) -> float:
    
    estimated_target_probabilities = estimated_target_probabilities.to_numpy().ravel()
    y_pred = np.asarray(y_pred)
    data = pd.DataFrame({
        'estimated_target_probabilities': estimated_target_probabilities,
        'y_pred': y_pred
    })
    data.dropna(axis=0, inplace=True)
    if len(data) == 0:
        return np.nan
    # Expected prevalence: the mean estimated probability of the positive class
    prevalence = np.mean(data['estimated_target_probabilities'])
    return prevalence
In short, for a credit card fraud detection use case, balanced accuracy proves more informative than standard accuracy when prevalence is low, as it prevents the model from achieving artificially high scores by simply predicting the majority class.

Adding Business Context in the Mix

NannyML’s business value metric links a model’s performance to its real-world financial impact. Each type of classification outcome (true positive, false positive, true negative, false negative) has a different financial consequence for your business.
This metric helps quantify these outcomes by assigning a monetary value to each type of prediction.
When you have the ground truth labels (i.e., the actual outcomes), you can use the confusion matrix to calculate the business value directly. However, if you don’t have these labels, you can estimate the business value using performance estimation techniques like CBPE.
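To build intuition for how such a metric can be computed when ground truth is available, here is a minimal sketch; the per-outcome monetary values and the toy labels below are made-up illustrations, not NannyML defaults:
import numpy as np
from sklearn.metrics import confusion_matrix

# Illustrative monetary value of each outcome (rows: actual class, columns: predicted class)
value_matrix = np.array([
    [0,    -10],   # actual 0: true negative worth 0, false positive costs 10 (review effort)
    [-500,  50],   # actual 1: false negative costs 500 (missed fraud), true positive worth 50
])

y_true = [0, 0, 1, 1, 0, 1]
y_pred = [0, 1, 1, 0, 0, 1]

# Weight each confusion-matrix cell by its monetary value and sum
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
business_value = (cm * value_matrix).sum()
print(business_value)  # -410 for this toy example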
Benefit of the business value metric
You can find the business value metric in the performance section of the settings tab for your classification models. Configure it and save the model settings.
Adding Business Value to Model

Stock Market Analysis and Prediction

Stock market predictions involve using historical data—such as stock prices, trading volumes, and economic indicators—to model future price movements or returns. By applying techniques like linear regression, decision trees, or more advanced machine learning models, analysts aim to uncover relationships between these variables and forecast future stock prices.
The performance of such a predictive model must be rigorously evaluated before any financial decision is made based on its outputs.

RMSE

RMSE takes the square root of the average squared differences between predicted and actual values. It provides an absolute fit measure. If you predict a stock's closing price and get an RMSE of 2, your model's predicted stock prices are off by around two units from the actual prices.
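For reference, with $n$ predictions $\hat{y}_i$ and actual values $y_i$:

$$\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$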
The bigger the RMSE, the less accurate your model's predictions. Because RMSE squares the errors, it gives more weight to large mistakes.
Stock prices fluctuate wildly, so RMSE helps you see how well your model captures these movements. A model with a low RMSE would better predict stock prices, even with the inherent volatility.
RMSE metric
RMSE focuses purely on accuracy and does not account for uncertainty or the inherent risk in making predictions. In stock market predictions, where returns are uncertain and volatility plays a critical role, RMSE alone is not enough.

Sharpe Index

The Sharpe Index, or Sharpe Ratio, is a financial metric used to measure an investment's risk-adjusted return. It was developed by Nobel laureate William F. Sharpe.
It helps investors understand whether the returns from an investment are worth the risks they are taking. It compares the investment’s returns to the returns of a safe, low-risk investment (like government bonds) and adjusts for the volatility of the investment.
The Sharpe Ratio helps answer the question: “Am I being properly rewarded for the amount of risk I’m taking?”
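In formula form, where $R_p$ is the return of the investment, $R_f$ the risk-free rate, and $\sigma_p$ the standard deviation of the investment's excess returns:

$$\text{Sharpe Ratio} = \frac{R_p - R_f}{\sigma_p}$$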
📌
The book every data scientist needs on their desk.
Metrics are arguably the most important part of data science work, yet they are rarely taught in courses or university degrees.
The Little Book of ML Metrics - Business Section
Theoretically, this metric ranges from negative infinity to positive infinity, but for all practical purposes it is not expected to go beyond ±3.
import numpy as np
import pandas as pd

def loss(
        y_true: pd.Series,
        y_pred: pd.Series,
        chunk_data: pd.DataFrame
    ) -> np.ndarray:
    # Work directly with the series passed in, rather than hard-coded column names
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)

    # Annualised risk-free rate, converted to a daily rate below (252 trading days)
    risk_free_rate = 0.01

    # Calculate period-over-period returns without using pct_change
    actual_returns = np.diff(y_true) / y_true[:-1]
    predicted_returns = np.diff(y_pred) / y_pred[:-1]

    # Add a 0 at the beginning to match the original length
    actual_returns = np.insert(actual_returns, 0, 0)
    predicted_returns = np.insert(predicted_returns, 0, 0)

    # Excess returns of the predicted price series over the daily risk-free rate
    # (the realised returns above are kept for reference but not used in the score)
    excess_returns = predicted_returns - (risk_free_rate / 252)
    std_excess_return = np.std(excess_returns)

    if std_excess_return == 0:
        return np.zeros(len(excess_returns))

    # Per-row standardised excess returns; their mean gives the chunk's Sharpe Ratio
    excess_returns_by_std = excess_returns / std_excess_return

    return excess_returns_by_std
import numpy as np
import pandas as pd 

def aggregate(
    loss: np.ndarray,
) -> float:
    # The mean of the standardised excess returns is the chunk's Sharpe Ratio
    return loss.mean()
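If you want to sanity-check the metric locally before adding it to NannyML Cloud (for example, in a notebook), you can call the two functions yourself. The prices below are hypothetical values chosen purely for illustration, and this assumes the loss and aggregate functions above are defined in the same session:
import pandas as pd

# Hypothetical actual and predicted closing prices for one chunk
chunk = pd.DataFrame({
    'y_true': [100.0, 101.0, 103.0, 102.0, 104.0],
    'y_pred': [100.0, 102.0, 102.5, 101.0, 105.0],
})

per_row = loss(chunk['y_true'], chunk['y_pred'], chunk)
print(aggregate(per_row))  # chunk-level Sharpe Ratio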
 
Sharpe Ratio Metric

Conclusion

This blog introduced you to Custom Metrics, a feature available in NannyML Cloud that helps us make well-informed decisions and communicate them to stakeholders. We discussed three metrics, Balanced Accuracy, Prevalence, and the Sharpe Ratio, that every data scientist working with finance needs in their toolbox.
As you explore and develop custom metrics, remember that debugging is an essential part of the process. For a more streamlined experience, consider using the NannyML OSS library in a Jupyter Notebook to write and debug your metric code.
Meme by Author
Post-deployment Data Science is a data science vertical that deals with monitoring and maintaining production-grade machine learning models. Covariate Shift, Concept Drift, and Data Quality are a trinity of issues that your model can face over time.
To learn how you can get started, speak with a NannyML founder today and get a demo.
 

You Might Also Like…

NannyML has been at the forefront of researching and developing state-of-the-art algorithms to simplify data science workflows. These algorithms are part of the NannyML Python OSS package and enterprise Cloud platform to help you catch issues with your models before they hurt your business.
Read about them here:


Ready to learn how well your ML models are working?

Join 1,100+ other data scientists now!

Subscribe

Written by

Kavita Rana

Data Science Intern at NannyML