Table of Contents
- Introduction
- What Traditional Metrics Won’t Tell You…
- Creating a Custom Metric for Regression
- Pinball Loss
- How to Add Any Metric Using Your Cloud Dashboard
- Creating a Custom Metric for Binary Classification
- F2 Score
- Creating a Custom Metric for Multiclass Classification
- Add a New Custom Metric
- Removing and Deleting a Custom Metric
- Some Mistakes to Be Mindful Of
- Conclusion
- Read More…
Introduction
Setting up custom metrics for your machine learning models can bring deeper insights beyond standard metrics. In this tutorial, we’ll walk through the process step-by-step, showing you how to create custom metrics tailored to classification and regression models.
If you'd prefer a video guide, be sure to check out our webinar.
There are two main ways to set up metrics: through the NannyML Cloud Dashboard or using the SDK. In this post, we’ll explore both options.
What Traditional Metrics Won’t Tell You…
Relying on traditional metrics alone limits your ability to fully understand how your model performs beyond raw numbers.
Data scientists need to monitor non-traditional metrics for several reasons. Firstly, custom metrics help communicate a model's value to stakeholders by linking performance directly to business outcomes. Secondly, metrics that focus on profit, cost, or user experience make it easier to align technical success with organisational goals.
Lastly, these metrics provide a clearer signal when your model isn’t delivering the desired results, even if it’s technically sound.
Curious about expanding your understanding of machine learning metrics? The Little Book of ML Metrics is an indispensable resource that you need on your desk!
This compact guide covers everything about metrics that can help improve your day-to-day data science work.
Get your copy here: https://www.nannyml.com/metrics
Creating a Custom Metric for Regression
To set up a custom regression metric, you need to implement two core functions: a `loss` function, which calculates the error or loss at the instance level, and an `aggregate` function, which summarises these instance-level losses into a single metric. Depending on the business need or the specific behavior of your model, this can be a simple average or a more complex calculation. The functions can access the following:
- `y_true`: a `pandas.Series` object containing the target column.
- `y_pred`: a `pandas.Series` object containing the predictions column.
- `chunk_data`: a `pandas.DataFrame` object containing all columns associated with the model. This lets you use other columns in the provided data when calculating the custom metric.
Have a look at the standard template here:
```python
import numpy as np
import pandas as pd

def loss(
    y_true: pd.Series,
    y_pred: pd.Series,
    chunk_data: pd.DataFrame,
    **kwargs
) -> np.ndarray:
    pass

def aggregate(
    loss: np.ndarray,
    chunk_data: pd.DataFrame,
    **kwargs
) -> float:
    pass
```
Okay, now that we know everything we need to implement a custom metric, let’s try one by implementing Pinball Loss.
Pinball Loss
Pinball Loss, also known as Quantile Loss, measures the performance of quantile regression models. While traditional metrics like MSE provide a broad overview of model performance, they may not capture the nuances of predictions across different quantiles. Pinball Loss allows for targeted evaluation.
It penalizes predictions based on their distance from the actual quantile. For example, if you're interested in predicting the 90th percentile of a distribution, Pinball Loss will weigh errors differently depending on whether they fall below or above this threshold. This specificity can lead to better-informed decisions based on risk assessment.
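Concretely, for a chosen quantile level α, the pinball loss for a single observation y with prediction ŷ (exactly what the code below implements) is:

$$
L_{\alpha}(y, \hat{y}) = \alpha \cdot \max(y - \hat{y},\, 0) + (1 - \alpha) \cdot \max(\hat{y} - y,\, 0)
$$

Under-predictions are weighted by α and over-predictions by 1 − α, so with α = 0.9 the metric penalises under-prediction nine times more heavily.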
Developing your metrics in a Jupyter Notebook will help you debug issues much faster. I will discuss common issues you might run into later in this blog.
We start by creating the `loss` function, with the alpha value set to 0.9:

```python
def loss(
    y_true: pd.Series,
    y_pred: pd.Series,
    **kwargs  # accept any additional arguments (e.g. chunk_data) that may be passed in
) -> np.ndarray:
    y_true = y_true.to_numpy()
    y_pred = y_pred.to_numpy()

    alpha = 0.9
    # Under-predictions (y_true > y_pred) are weighted by alpha,
    # over-predictions by (1 - alpha)
    factor1 = alpha * np.maximum(y_true - y_pred, 0)
    factor2 = (1 - alpha) * np.maximum(y_pred - y_true, 0)
    return factor1 + factor2
```
The aggregate function is nothing fancy, just a simple mean:

```python
def aggregate(
    loss: np.ndarray,
    **kwargs
) -> float:
    return loss.mean()
```
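Before uploading anything, it’s worth sanity-checking both functions locally in a notebook. A minimal check might look like this (the numbers are made up purely for illustration):

```python
import pandas as pd

y_true = pd.Series([10.0, 12.5, 9.0, 15.0])
y_pred = pd.Series([11.0, 12.0, 10.5, 13.0])

instance_losses = loss(y_true, y_pred)     # per-row pinball losses as a numpy array
chunk_metric = aggregate(instance_losses)  # single value reported per chunk
print(instance_losses, chunk_metric)
```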
And that’s it. With just two functions, you can monitor and estimate any metric you can think of. Now that you have understood how to write and debug these methods, we’ll learn how to add them to your models.
How to Add Any Metric Using Your Cloud Dashboard
Log into your NannyML dashboard and find the Custom Metrics button in the top right corner.
You'll find all your metrics ordered according to their problem type.
On clicking “Add a new metric” you will find a window like the one below:
Once you’ve added the details, paste in the code for your well-tested metrics. NannyML Cloud also lets you set metric limits to control the thresholds. If you skip this, the thresholds can grow too large and distort the plot scale. After filling in all the information, save your metric.
Go to Model Dashboard > Management > Settings > Performance for your desired machine learning model. Clicking the + Add Custom Metric button shows a list of applicable metrics that you can add to your model.
Once added, go to Monitoring > Summary and run performance monitoring to have your custom metric calculated.
If you run into any issues, you can check the Logs section under Model Management. Click on the file icon to get a detailed .txt file that you can use to debug.
If everything goes well, your metric will be displayed in the Performance page similar to this:
It looks like there’s an estimated drop in the Pinball Loss. You can now take action to address this before it materialises.
But how are these estimations derived? For Regression, we use Direct Loss Estimation. It involves training a secondary model, called the "nanny model," to predict the loss of the primary "child model" being monitored.
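The exact procedure is implemented inside NannyML Cloud, but the core idea can be sketched with a generic regressor. The snippet below is a simplified illustration only, not NannyML’s actual implementation; the `fit_nanny_model` and `estimate_performance` helpers are hypothetical names and reuse the `loss` and `aggregate` functions defined above:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

def fit_nanny_model(X_reference: pd.DataFrame, y_true_ref: pd.Series, y_pred_ref: pd.Series):
    """Reference period: targets are known, so the realized instance-level loss is available."""
    observed_loss = loss(y_true_ref, y_pred_ref)  # e.g. the pinball loss defined earlier
    # The nanny model learns to predict the monitored model's loss
    # from its input features and its predictions.
    features = np.column_stack([X_reference.to_numpy(), y_pred_ref.to_numpy()])
    return GradientBoostingRegressor().fit(features, observed_loss)

def estimate_performance(nanny_model, X_production: pd.DataFrame, y_pred_prod: pd.Series) -> float:
    """Production period: targets are unknown, so the loss is estimated instead of measured."""
    features = np.column_stack([X_production.to_numpy(), y_pred_prod.to_numpy()])
    estimated_losses = nanny_model.predict(features)
    return aggregate(estimated_losses)  # same aggregation as the realized metric
```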
Creating a Custom Metric for Binary Classification
When designing a custom metric for classification, NannyML Cloud requires two key functions:
- `calculate` function – mandatory, used to compute realized performance based on known target values.
- `estimate` function – optional, used to estimate performance when target values are unavailable.

In addition to `y_true`, `y_pred`, and `chunk_data`, your functions have access to the following:
- `y_pred_proba`: a `pandas.DataFrame` with the predicted probabilities for each class.
- `labels`: a list of class labels (e.g., [0, 1] for binary classification).
```python
import pandas as pd

def calculate(
    y_true: pd.Series,
    y_pred: pd.Series,
    y_pred_proba: pd.DataFrame,
    chunk_data: pd.DataFrame,
    labels: list[str],
    class_probability_columns: list[str],
    **kwargs
) -> float:
    pass

def estimate(
    estimated_target_probabilities: pd.DataFrame,
    y_pred: pd.Series,
    y_pred_proba: pd.DataFrame,
    chunk_data: pd.DataFrame,
    labels: list[str],
    class_probability_columns: list[str],
    **kwargs
) -> float:
    pass
```
F2 Score
The F2 score is a variation of the F1 score, which is widely used in binary classification to balance precision and recall. The F2 score is part of the family of F-beta scores, where different values of the parameter β (beta) adjust the balance between precision and recall. Specifically, the F2 score gives more weight to recall, making it useful in cases where false negatives are more problematic than false positives.
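In terms of precision and recall, the F-beta score is defined as:

$$
F_{\beta} = (1 + \beta^{2}) \cdot \frac{\text{precision} \cdot \text{recall}}{\beta^{2} \cdot \text{precision} + \text{recall}}
$$

With β = 2, recall is weighted more heavily than precision, which is exactly what we want when false negatives are costly.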
Here is the Python code for the F2 score:
```python
import numpy as np
import pandas as pd
from sklearn.metrics import fbeta_score

def calculate(
    y_true: pd.Series,
    y_pred: pd.Series,
    **kwargs
) -> float:
    return fbeta_score(y_true, y_pred, beta=2)
```
```python
import numpy as np
import pandas as pd

def estimate(
    estimated_target_probabilities: pd.DataFrame,
    y_pred: pd.Series,
    y_pred_proba: pd.DataFrame,
    chunk_data: pd.DataFrame,
    labels: list[str],
    class_probability_columns: list[str],
    **kwargs
) -> float:
    estimated_target_probabilities = estimated_target_probabilities.to_numpy().ravel()
    y_pred = y_pred.to_numpy()

    # Create estimated confusion matrix elements
    est_tp = np.sum(np.where(y_pred == 1, estimated_target_probabilities, 0))
    est_fp = np.sum(np.where(y_pred == 1, 1 - estimated_target_probabilities, 0))
    est_fn = np.sum(np.where(y_pred == 0, estimated_target_probabilities, 0))
    est_tn = np.sum(np.where(y_pred == 0, 1 - estimated_target_probabilities, 0))

    beta = 2
    fbeta = (1 + beta**2) * est_tp / ((1 + beta**2) * est_tp + est_fp + beta**2 * est_fn)
    fbeta = np.nan_to_num(fbeta)
    return fbeta
```
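A quick way to build confidence in `estimate` is to compare it with `calculate` on a labelled, synthetic sample where the probabilities are calibrated. In NannyML Cloud, `estimated_target_probabilities` is supplied by the platform; below we simply stand in for it with the synthetic probabilities, purely for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 5_000

# Calibrated synthetic probabilities and labels drawn from them (illustration only)
probs = rng.uniform(0, 1, size=n)
y_true = pd.Series((rng.uniform(0, 1, size=n) < probs).astype(int))
y_pred = pd.Series((probs >= 0.5).astype(int))
proba_df = pd.DataFrame({"prob_class_1": probs})

realized = calculate(y_true, y_pred)
estimated = estimate(
    estimated_target_probabilities=proba_df,
    y_pred=y_pred,
    y_pred_proba=proba_df,
    chunk_data=pd.DataFrame(),
    labels=[0, 1],
    class_probability_columns=["prob_class_1"],
)
print(realized, estimated)  # the two values should be close for calibrated probabilities
```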
Follow the same steps as explained in the previous section and add this metric to a binary classification model in your setup.
Probabilistic Adaptive Performance Estimation, or PAPE, is the algorithm behind the estimated performance here. Now that you know the F2 metric is dropping, you can easily compare it with the F1 metric, which is available by default.
The fact that both F1 and F2 are decreasing indicates a general degradation in model performance. This could be due to concept drift, where the patterns of fraudulent behavior evolve over time and the model does not keep up.
Creating a Custom Metric for Multiclass Classification
Similar to binary classification, you need to create `calculate` and `estimate` functions here as well. Their purpose remains the same as before.
You can also access `class_probability_columns`: a list of column names in `y_pred_proba` corresponding to each class's predicted probability.

In this section, we will add the F2 score for a multiclass model using the NannyML Software Development Kit (SDK). The Custom Metrics module is part of the monitoring class, and it can be created by instantiating a new `nml_sdk.monitoring.CustomMetric()`. Before all this, you will need to set up the NannyML SDK address and your token. Open a Jupyter file or an editor of your choice and run the following code:
```bash
pip install git+https://github.com/NannyML/nannyml-cloud-sdk.git
```
Go to Account Settings and create an API token for this task.
```python
import nannyml_cloud_sdk as nml_sdk

nml_sdk.url = "nannyml url here"
nml_sdk.api_token = r"api token goes here"

custom_metric = nml_sdk.monitoring.CustomMetric()
```
Add a New Custom Metric
Prepare your custom metric before the following steps.
Here is the Python code for the F2 score for multiclass classification:
```python
import pandas as pd
from sklearn.metrics import fbeta_score

def calculate(
    y_true: pd.Series,
    y_pred: pd.Series,
    y_pred_proba: pd.DataFrame,
    chunk_data: pd.DataFrame,
    labels: list[str],
    class_probability_columns: list[str],
) -> float:
    return fbeta_score(y_true, y_pred, beta=2, average='macro')
```

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import label_binarize

def estimate(
    estimated_target_probabilities: pd.DataFrame,
    y_pred: pd.Series,
    y_pred_proba: pd.DataFrame,
    chunk_data: pd.DataFrame,
    labels: list[str],
    class_probability_columns: list[str],
):
    beta = 2

    def estimate_fb(_y_pred, _y_pred_proba, beta) -> float:
        # Estimated confusion matrix elements for a single one-vs-rest split
        est_tp = np.sum(np.where(_y_pred == 1, _y_pred_proba, 0))
        est_fp = np.sum(np.where(_y_pred == 1, 1 - _y_pred_proba, 0))
        est_fn = np.sum(np.where(_y_pred == 0, _y_pred_proba, 0))
        est_tn = np.sum(np.where(_y_pred == 0, 1 - _y_pred_proba, 0))

        fbeta = (1 + beta**2) * est_tp / ((1 + beta**2) * est_tp + est_fp + beta**2 * est_fn)
        fbeta = np.nan_to_num(fbeta)
        return fbeta

    estimated_target_probabilities = estimated_target_probabilities.to_numpy()
    # Binarize predictions into one column per class (one-vs-rest)
    y_preds = label_binarize(y_pred, classes=labels)

    ovr_estimates = []
    for idx, _ in enumerate(labels):
        ovr_estimates.append(
            estimate_fb(
                y_preds[:, idx],
                estimated_target_probabilities[:, idx],
                beta=2,
            )
        )
    # Macro average across classes
    multiclass_metric = np.mean(ovr_estimates)
    return multiclass_metric
```
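As with the binary case, a quick local check on synthetic three-class data (made up purely for illustration) helps catch shape and ordering issues before registering the metric:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
labels = ["class_a", "class_b", "class_c"]

# Roughly calibrated synthetic class probabilities (illustration only)
raw = rng.uniform(0, 1, size=(1_000, 3))
probs = raw / raw.sum(axis=1, keepdims=True)
proba_df = pd.DataFrame(probs, columns=[f"prob_{c}" for c in labels])

y_pred = pd.Series(np.array(labels)[probs.argmax(axis=1)])
y_true = pd.Series([rng.choice(labels, p=p) for p in probs])

print(calculate(y_true, y_pred, proba_df, pd.DataFrame(), labels, list(proba_df.columns)))
print(estimate(proba_df, y_pred, proba_df, pd.DataFrame(), labels, list(proba_df.columns)))
```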
After defining the functions, you can call `custom_metric.create` to register them as a custom metric:

```python
custom_metric = nml_sdk.monitoring.CustomMetric()

cm = custom_metric.create(
    name="F2_Score_Custom_Metric",
    description="Implementation of F2_Score",
    problem_type="MULTICLASS_CLASSIFICATION",
    calculation_function=calculate,
    estimation_function=estimate,
    lower_value_limit=0.0,
    upper_value_limit=1.0,
)
```
When you create a new custom metric, it isn't automatically linked to any model.
To assign it to a model, first retrieve the model's unique identifier (`model_id`). Then, use `monitoring.Model.add_custom_metric` with the `model_id` and the `metric_id` as parameters.

To get the `model_id`, call `nml_sdk.monitoring.Model.list()`. This function lists all available models or filters them by name or problem type. The `model_id` is found in the value of the `id` key in the returned dictionaries.

```python
nml_sdk.monitoring.Model.add_custom_metric(model_id=196, metric_id=cm['id'])
nml_sdk.monitoring.Run.trigger(model_id=196)
```
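In the example above, the `model_id` (196) was already known. If you need to look it up first, a call like the one below returns it, assuming the name filter described above (the model name here is hypothetical):

```python
# Filter models by name; filtering by problem type is also supported
models = nml_sdk.monitoring.Model.list(name="demo-multiclass-model")
model_id = models[0]["id"]
print(model_id)
```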
From now on, every time you run your model, the new custom metric will be calculated alongside the standard metrics.
Removing and Deleting a Custom Metric
```python
# To remove a custom metric from a model
nml_sdk.monitoring.Model.remove_custom_metric(model_id=196, metric_id=1)

# To delete the custom metric entirely
custom_metric.delete(metric_id=1)
```
Some Mistakes to Be Mindful Of
- Inconsistent Naming Convention: Variability in column names can lead to `AttributeError` issues. If you run code on a cloud platform, these discrepancies can disrupt execution. To mitigate this, adopt a consistent naming convention: use `y_true` for actual values and `y_pred` for predictions. This approach simplifies the addition of custom metrics across multiple models.
- Importing Packages: Remember to import the required packages for every function.
- Array Length Mismatch: Ensure that the lengths of your `y_true` and `y_pred` arrays align. A mismatch will cause errors during calculations. Additionally, be aware that the loss function should return a numpy array to avoid complications.
- Handling Missing Data: Implement checks to manage missing data effectively, either through imputation or removal, to maintain the integrity of your analysis.
- Validating Chunk Data: When working with chunked data, watch out for empty chunks. Use `nml.chunker` to ensure your function does not return NaN values, preserving the reliability of your computations. A defensive version of the earlier pinball `loss` function, sketched after this list, illustrates several of these checks.
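To make several of these points concrete, here is a more defensive variant of the earlier pinball `loss` function. This is only a sketch of the kind of checks you might add; NannyML does not require them:

```python
import numpy as np
import pandas as pd

def loss(
    y_true: pd.Series,
    y_pred: pd.Series,
    **kwargs
) -> np.ndarray:
    # Fail early on length mismatches instead of producing confusing downstream errors
    if len(y_true) != len(y_pred):
        raise ValueError(
            f"Length mismatch: y_true has {len(y_true)} rows, y_pred has {len(y_pred)}"
        )

    # Drop rows where either the target or the prediction is missing
    mask = y_true.notna() & y_pred.notna()
    y_true = y_true[mask].to_numpy()
    y_pred = y_pred[mask].to_numpy()

    # An empty chunk would otherwise produce NaN when aggregated
    if len(y_true) == 0:
        return np.array([])

    alpha = 0.9
    factor1 = alpha * np.maximum(y_true - y_pred, 0)
    factor2 = (1 - alpha) * np.maximum(y_pred - y_true, 0)
    return factor1 + factor2  # always returns a numpy array
```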
Conclusion
In this blog, we explored how to add custom metrics using both the NannyML Cloud Dashboard and SDK. You learned how to implement metrics, along with potential mistakes that you can run into. We also discussed how NannyML algorithms can estimate performance even in the absence of ground truth.
Post-deployment data science is a field focused on monitoring and maintaining machine learning models after they’ve been deployed in production. Over time, your models will face challenges such as covariate shift, concept drift, and data quality degradation. These issues arise from changes in the underlying data distribution or real-world context, which can lead to reduced model accuracy. The goal of post-deployment data science is to detect and address these problems before they affect the decisions your models are making.
If you're looking for expert guidance, you can schedule a demo with the NannyML founders. They’ll work with you to find tailored solutions for your specific use cases.
Learn how maintaining ML models is easier than ever!
Read More…
Custom metrics can be tailored to fit industry-specific models, addressing unique challenges and performance standards. To dive deeper into how models are monitored within different industry domains, check out these blogs: