Monitoring Custom Metrics without Ground Truth

Introduction

Setting up custom metrics for your machine learning models can bring deeper insights beyond standard metrics. In this tutorial, we’ll walk through the process step-by-step, showing you how to create custom metrics tailored to classification and regression models.
If you'd prefer a video guide, be sure to check out our webinar.
There are two main ways to set up metrics: through the NannyML Cloud Dashboard or using the SDK. In this post, we’ll explore both options.

What traditional metrics won’t tell you…

Relying on traditional metrics alone limits your ability to fully understand how your model performs beyond raw numbers.
Data scientists need to monitor non-traditional metrics for several reasons. Firstly, custom metrics help communicate a model's value to stakeholders by linking performance directly to business outcomes. Secondly, metrics that focus on profit, cost, or user experience make it easier to align technical success with organisational goals. Lastly, these metrics provide a clearer signal when your model isn't delivering the desired results, even if it's technically sound.
 
📕
Curious about expanding your understanding of machine learning metrics? The Little Book of ML Metrics is an indispensable resource that you need on your desk!
This compact guide covers everything about metrics that can help improve your day-to-day data science work.

Creating a Custom Metric for Regression

To set up a custom regression metric, you need to implement two core functions: a loss function, which calculates the error or loss at the instance level, and an aggregate function, which summarises these instance-level losses into a single metric. Depending on the business need or the specific behavior of your model, this can be a simple average or a more complex calculation.
These functions can access the following:
  • y_true: A pandas.Series object containing the target column.
  • y_pred: A pandas.Series object containing the predictions column.
  • chunk_data: A pandas.DataFrame object containing all columns associated with the model. This allows you to use other columns from the provided data when calculating the custom metric.
Have a look at the standard template here:
import numpy as np
import pandas as pd

def loss(
    y_true: pd.Series,
    y_pred: pd.Series,
    chunk_data: pd.DataFrame,
    **kwargs
) -> np.ndarray:
    pass


def aggregate(
    loss: np.ndarray,
    chunk_data: pd.DataFrame,
    **kwargs
) -> float:
    pass
Okay, now that we know everything we need to implement a custom metric, let's try one by implementing the pinball loss.

Pinball loss

Pinball Loss, also known as Quantile Loss, measures the performance of quantile regression models. While traditional metrics like MSE provide a broad overview of model performance, they may not capture the nuances of predictions across different quantiles. Pinball Loss allows for targeted evaluation.
It penalizes predictions based on their distance from the actual quantile. For example, if you're interested in predicting the 90th percentile of a distribution, Pinball Loss will weigh errors differently depending on whether they fall below or above this threshold. This specificity can lead to better-informed decisions based on risk assessment.
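For a chosen quantile $\alpha$, the pinball loss for a single observation $(y, \hat{y})$ can be written as:

$$L_\alpha(y, \hat{y}) = \alpha \cdot \max(y - \hat{y},\, 0) + (1 - \alpha) \cdot \max(\hat{y} - y,\, 0)$$

This is exactly what the loss function below computes element-wise.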
Developing your metrics in a Jupyter Notebook will help you debug issues much faster. I will discuss issues you can run into later in the blog.
We start by creating the loss function with the alpha value set to 0.9.
import numpy as np
import pandas as pd

def loss(
    y_true: pd.Series,
    y_pred: pd.Series,
    **kwargs  # absorbs the remaining template arguments (e.g. chunk_data) that we don't need here
) -> np.ndarray:
    y_true = y_true.to_numpy()
    y_pred = y_pred.to_numpy()

    alpha = 0.9  # quantile of interest
    # Under-predictions are weighted by alpha, over-predictions by (1 - alpha)
    factor1 = alpha * np.maximum(y_true - y_pred, 0)
    factor2 = (1 - alpha) * np.maximum(y_pred - y_true, 0)
    return factor1 + factor2
The aggregate function is nothing fancy, just a simple mean.
import numpy as np

def aggregate(
    loss: np.ndarray,
    **kwargs  # absorbs the remaining template arguments (e.g. chunk_data) that we don't need here
) -> float:
    # Average the per-row pinball losses into a single chunk-level value
    return loss.mean()
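Before adding this to NannyML Cloud, it's worth sanity-checking both functions locally on a few rows of toy data, for example:
import numpy as np
import pandas as pd

# Hypothetical toy data for a quick local sanity check
y_true = pd.Series([10.0, 20.0, 30.0, 40.0])
y_pred = pd.Series([12.0, 18.0, 33.0, 35.0])

per_row_loss = loss(y_true, y_pred)
assert isinstance(per_row_loss, np.ndarray)   # the loss function should return a numpy array
assert len(per_row_loss) == len(y_true)       # one loss value per row

print("Pinball loss per row:", per_row_loss)
print("Aggregated pinball loss:", aggregate(per_row_loss))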
 
And that's it. With just two functions, you can monitor and estimate any metric you can think of. Now that you understand how to write and debug these functions, we'll learn how to add them to your models.

How to add any metric using your Cloud Dashboard

Log into your NannyML dashboard and find the Custom Metrics button in the top right corner.
Custom Metrics Button
You'll find all your metrics ordered according to their problem type.
Custom Metrics Dashboard
Clicking “Add a new metric” opens a window like the one below:
Create Custom Metric Panel
Once you’ve added the details, paste in the code for your well-tested metrics. NannyML Cloud lets you set Metric limits to control thresholds. If you skip this, the automatically calculated thresholds can grow too large and distort the plot scale. After filling in all the info, just save your metric.
Go to Model Dashboard > Management > Settings > Performance for your desired machine learning model. Clicking the + Add Custom Metric button shows a list of applicable metrics that you can add to your model.
Model Dashboard > Settings > Performance
Once added, go to Monitoring > Summary and run performance monitoring to have your custom metric calculated.
If you run into any issues, you can check the Logs section under Model Management. Click on the file icon to get a detailed .txt file that you can use to debug.
Logs
If everything goes well, your metric will be displayed on the Performance page, similar to this:
Pinball Loss Metric
It looks like there’s an estimated drop in the Pinball Loss. You can now take action to address this before it materialises.
But how are these estimations derived? For regression, NannyML uses Direct Loss Estimation (DLE). It involves training a secondary model, called the "nanny model", to predict the loss of the primary "child model" being monitored.
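Leaving out NannyML's refinements, a minimal, self-contained sketch of the idea could look like the code below. The synthetic data and the choice of GradientBoostingRegressor as the nanny model are illustrative assumptions, not NannyML's actual implementation:
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(42)

# Reference period: targets are known, so the per-row loss can be observed directly
X_ref = pd.DataFrame({"feature": rng.normal(size=1_000)})
y_ref_true = 3 * X_ref["feature"] + rng.normal(size=1_000)
y_ref_pred = 3 * X_ref["feature"]                    # predictions of the monitored ("child") model
observed_loss = np.abs(y_ref_true - y_ref_pred)      # e.g. absolute error per row

# Train the "nanny" model to predict that loss from the features and the prediction
nanny = GradientBoostingRegressor()
nanny.fit(X_ref.assign(y_pred=y_ref_pred), observed_loss)

# Analysis period: targets are unavailable, so the loss is estimated instead
X_ana = pd.DataFrame({"feature": rng.normal(loc=0.5, size=1_000)})   # mild covariate shift
y_ana_pred = 3 * X_ana["feature"]
estimated_loss = nanny.predict(X_ana.assign(y_pred=y_ana_pred))

print("Estimated MAE on the analysis set:", estimated_loss.mean())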

Creating a Custom Metric for Binary Classification

When designing a custom metric for classification, NannyML Cloud requires two key functions:
  • calculate function – mandatory and used to compute realized performance based on known target values.
  • estimate function – optional, used to estimate performance when target values are unavailable.
To write your functions, you have access to the following:
  • y_pred_proba: A pandas.DataFrame with the predicted probabilities for each class.
  • labels: A list of class labels (e.g., [0, 1] for binary classification).
import pandas as pd

def calculate(
    y_true: pd.Series,
    y_pred: pd.Series,
    y_pred_proba: pd.DataFrame,
    chunk_data: pd.DataFrame,
    labels: list[str],
    class_probability_columns: list[str],
    **kwargs
) -> float:
    pass


def estimate(
    estimated_target_probabilities: pd.DataFrame,
    y_pred: pd.Series,
    y_pred_proba: pd.DataFrame,
    chunk_data: pd.DataFrame,
    labels: list[str],
    class_probability_columns: list[str],
    **kwargs
) -> float:
    pass

F2 score

The F2 score is a variation of the F1 score, which is widely used in binary classification to balance precision and recall. The F2 score is part of the family of F-beta scores, where different values of the parameter β (beta) adjust the balance between precision and recall. Specifically, the F2 score gives more weight to recall, making it useful in cases where false negatives are more problematic than false positives.
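In terms of the confusion matrix, the F-beta score (with β = 2 for the F2 score) can be written as:

$$F_\beta = \frac{(1 + \beta^2)\,\mathrm{TP}}{(1 + \beta^2)\,\mathrm{TP} + \mathrm{FP} + \beta^2\,\mathrm{FN}}$$

This is the same expression the estimation code below applies to the estimated confusion matrix elements.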
Here's the Python code for the F2 score. First, the calculate function:
import numpy as np
import pandas as pd
from sklearn.metrics import fbeta_score

def calculate(
    y_true: pd.Series,
    y_pred: pd.Series,
    **kwargs
) -> float:
    return fbeta_score(y_true, y_pred, beta=2)
And here's the estimate function, which builds an expected confusion matrix from the estimated target probabilities:
import numpy as np
import pandas as pd

def estimate(
    estimated_target_probabilities: pd.DataFrame,
    y_pred: pd.Series,
    y_pred_proba: pd.DataFrame,
    chunk_data: pd.DataFrame,
    labels: list[str],
    class_probability_columns: list[str],
    **kwargs
) -> float:
   
    estimated_target_probabilities = estimated_target_probabilities.to_numpy().ravel()
    y_pred = y_pred.to_numpy()

    # Create estimated confusion matrix elements
    est_tp = np.sum(np.where(y_pred == 1, estimated_target_probabilities, 0))
    est_fp = np.sum(np.where(y_pred == 1, 1 - estimated_target_probabilities, 0))
    est_fn = np.sum(np.where(y_pred == 0, estimated_target_probabilities, 0))
    est_tn = np.sum(np.where(y_pred == 0, 1 - estimated_target_probabilities, 0))

    beta = 2
    fbeta =  (1 + beta**2) * est_tp / ( (1 + beta**2) * est_tp + est_fp + beta**2 * est_fn)
    fbeta = np.nan_to_num(fbeta)
    return fbeta
Follow the same steps as explained in the previous section and add this metric to a binary classification model in your setup.
F2 Metric
Probabilistic Adaptive Performance Estimation, or PAPE, is the algorithm behind the estimated performance here. Now that you know the F2 metric is dropping, you can easily compare it with the F1 metric, which is available by default.
F1 Metric
The fact that both F1 and F2 are decreasing indicates a general degradation in model performance. This could be due to concept drift, where the patterns of fraudulent behavior evolve over time and the model does not keep up.

Creating a Custom Metric for Multiclass Classification

Similar to binary classification, you need to create calculate and estimate functions here as well. Their purpose remains the same as before.
You can also access class_probability_columns: a list of column names in y_pred_proba corresponding to each class's predicted probability.
In this section, we will add the F2 score for a multiclass model using the NannyML Software Development Kit (SDK). The Custom Metrics module is part of the monitoring class, and it can be created by instantiating a new nml_sdk.monitoring.CustomMetric(). Before all this, you will need to set up the NannyML SDK address and your token.
Open a Jupyter notebook or an editor of your choice and install the SDK:
pip install git+https://github.com/NannyML/nannyml-cloud-sdk.git
Go to Account Settings and Create an API token for this task.
Creating an API token
import nannyml_cloud_sdk as nml_sdk

nml_sdk.url = "nannyml url here"
nml_sdk.api_token = r"api token goes here"

custom_metric = nml_sdk.monitoring.CustomMetric()

Add a new custom metric

Prepare your custom metric before the following steps.
Here's the Python code for the F2 score for multiclass classification. First, the calculate function:
import pandas as pd
from sklearn.metrics import fbeta_score

def calculate(
    y_true: pd.Series,
    y_pred: pd.Series,
    y_pred_proba: pd.DataFrame,
    chunk_data: pd.DataFrame,
    labels: list[str],
    class_probability_columns: list[str],
) -> float:
    return fbeta_score(y_true, y_pred, beta=2, average='macro')
And the estimate function, which applies the binary estimation per class (one-vs-rest) and then averages the results:
import numpy as np
import pandas as pd
from sklearn.preprocessing import label_binarize

def estimate(
    estimated_target_probabilities: pd.DataFrame,
    y_pred: pd.Series,
    y_pred_proba: pd.DataFrame,
    chunk_data: pd.DataFrame,
    labels: list[str],
    class_probability_columns: list[str],
):
    beta = 2

    def estimate_fb(_y_pred, _y_pred_proba, beta) -> float:
        est_tp = np.sum(np.where(_y_pred == 1, _y_pred_proba, 0))
        est_fp = np.sum(np.where(_y_pred == 1, 1 - _y_pred_proba, 0))
        est_fn = np.sum(np.where(_y_pred == 0, _y_pred_proba, 0))
        est_tn = np.sum(np.where(_y_pred == 0, 1 - _y_pred_proba, 0))

        fbeta =  (1 + beta**2) * est_tp / ( (1 + beta**2) * est_tp + est_fp + beta**2 * est_fn)
        fbeta = np.nan_to_num(fbeta)
        return fbeta

    estimated_target_probabilities = estimated_target_probabilities.to_numpy()
    y_preds = label_binarize(y_pred, classes=labels)

    ovr_estimates = []
    for idx, _  in enumerate(labels):
        ovr_estimates.append(
            estimate_fb(
                y_preds[:, idx],
                estimated_target_probabilities[:, idx],
                beta=2
            )
        )
    multiclass_metric = np.mean(ovr_estimates)

    return multiclass_metric
After defining the functions, you can call custom_metric.create to register them as a custom metric.
custom_metric = nml_sdk.monitoring.CustomMetric()

cm = custom_metric.create(
        name="F2_Score_Custom_Metric", 
        description="Implementation of F2_Score",
        problem_type="MULTICLASS_CLASSIFICATION",
        calculation_function=calculate,
        estimation_function=estimate,
        lower_value_limit=0.0, 
        upper_value_limit=1.0, 
    )
When you create a new custom metric, it isn't automatically linked to any model.
To assign it to a model, first retrieve the model's unique identifier (model_id). Then, use monitoring.Model.add_custom_metric with the model_id and the metric_id as parameters.
To get the model_id, call nml_sdk.monitoring.Model.list(). This function lists all available models or filters them by name or problem type. The model_id is found in the value of the id key in the returned dictionaries.
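For example, something along these lines should print the available models and their ids (the exact filter argument and the 'name' key are assumptions on my part; inspect the returned dictionaries to confirm the fields):
models = nml_sdk.monitoring.Model.list(problem_type='MULTICLASS_CLASSIFICATION')
for model in models:
    print(model['id'], model.get('name'))
With the right model_id in hand, attach the custom metric: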
nml_sdk.monitoring.Model.add_custom_metric(model_id=196, metric_id=cm['id'])
Custom metric added through SDK
Finally, trigger a monitoring run so the new metric gets calculated:
nml_sdk.monitoring.Run.trigger(model_id=196)
From now on, every monitoring run will calculate the new custom metric alongside the standard metrics.
F2 metric for multi-class classification

Removing and Deleting a custom metric

# Remove a custom metric from a model
nml_sdk.monitoring.Model.remove_custom_metric(model_id=196, metric_id=1)

# Delete the custom metric entirely
custom_metric.delete(metric_id=1)

Some mistakes to be mindful of

  1. Inconsistent Naming Convention: Variability in column names can lead to AttributeError issues. If you run code on a cloud platform, these discrepancies can disrupt execution. To mitigate this, adopt a consistent naming convention: use y_true for actual values and y_pred for predictions. This approach simplifies the addition of custom metrics across multiple models.
  2. Importing Packages: Remember to import the required packages for every function.
  3. Array Length Mismatch: Ensure that the lengths of your y_true and y_pred arrays align. A mismatch will cause errors during calculations. Additionally, be aware that the loss function should return a numpy array to avoid complications.
  4. Handling Missing Data: Implement checks to manage missing data effectively, either through imputation or removal, to maintain the integrity of your analysis.
  5. Validating Chunk Data: When working with chunked data, watch out for empty chunks. Use nml.chunker to ensure your function does not return NaN values, preserving the reliability of your computations.
      Metric getting calculated for only one chunk in the analysis set, rest are NaN values
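A quick, hypothetical pre-flight check along these lines catches most of these issues before you paste anything into the dashboard (it reuses the pinball loss and aggregate functions from the regression section):
import numpy as np
import pandas as pd

def preflight_check(y_true: pd.Series, y_pred: pd.Series) -> None:
    # Mistake 3: lengths must match before any calculation
    assert len(y_true) == len(y_pred), "y_true and y_pred lengths differ"
    # Mistake 4: surface missing data early instead of letting NaNs propagate
    assert not y_true.isna().any(), "y_true contains missing values"
    assert not y_pred.isna().any(), "y_pred contains missing values"

    per_row = loss(y_true, y_pred)
    # Mistake 3 (continued): the loss function should return a numpy array, one value per row
    assert isinstance(per_row, np.ndarray) and len(per_row) == len(y_true)
    # Mistake 5: the aggregated value should be a finite number, never NaN
    assert np.isfinite(aggregate(per_row)), "aggregate returned NaN or inf"

preflight_check(pd.Series([1.0, 2.0, 3.0]), pd.Series([1.1, 1.9, 3.2]))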

Conclusion

In this blog, we explored how to add custom metrics using both the NannyML Cloud Dashboard and SDK. You learned how to implement metrics, along with potential mistakes that you can run into. We also discussed how NannyML algorithms can estimate performance even in the absence of ground truth.
Post-deployment data science is a field focused on monitoring and maintaining machine learning models after they’ve been deployed in production. Over time, your models will face challenges such as covariate shift, concept drift, and data quality degradation. These issues arise from changes in the underlying data distribution or real-world context, which can lead to reduced model accuracy. The goal of post-deployment data science is to detect and address these problems before they affect the decisions your models are making.
If you're looking for expert guidance, you can schedule a demo with the NannyML founders. They’ll work with you to find tailored solutions for your specific use cases.
Learn how maintaining ML models is easier than ever!

You Might Also Like…

Custom metrics can be tailored to fit industry-specific models, addressing unique challenges and performance standards. To dive deeper into how models are monitored within different industry domains, check out these blogs:

Ready to learn how well your ML models are working?

Join 1100+ other data scientists now!

Subscribe

Written by

Kavita Rana

Data Science Intern at NannyML