
In 2021, we invented our first performance estimation algorithm: CBPE, short for Confidence-Based Performance Estimation. CBPE estimates any classification metric, like accuracy, precision, and recall, all without needing ground truth data. For us, this invention was a huge milestone. It solidified our view on ML monitoring and gave us a way to track what truly matters: model performance, at any time. With CBPE, we can monitor a model without worrying about missing true labels.
However, there was still one issue: CBPE doesn’t handle regression models. The algorithm relies on calibrated predicted probabilities, which regression models don’t provide. So, performance estimation for regression remained unsolved.
It wasn’t until a year later that we cracked the analogous regression problem. Today, I’ll share the untold story behind developing our performance estimation algorithm for regression, DLE (Direct Loss Estimation), the methods we explored, and how we arrived at its discovery.
The research problem
We had one question: How can we measure a regression model’s performance when ground truth is unavailable? We got obsessed with this question. Mainly because we knew that every time an ML model makes a prediction, its business impact always happens before the ground truth arrives, so performance can only be measured later in the process.

If we could answer the question with a reliable method that can remove the gap between business impact and performance monitoring, we would make a big improvement in how ML models are currently monitored.
When thinking about a potential solution, we set ourselves some rules:
- The method doesn’t have access to the original model. We want a model-agnostic method that can work with any regression model.
- During fitting, the method can have access to test data.
- During inference, the method can only use model inputs and predictions.
Our first approach was inspired by the reasoning behind CBPE, which we initially developed for classification tasks. There, we used the calibrated posterior distribution for each class outcome to estimate any classification metric. Extending this logic to regression, we realized that having the posterior distribution would enable us to estimate any regression metric. As a result, most of our experimentation focused on methods to derive that distribution.
A fun toy experiment
Before jumping into complicated methods, let's see if we can estimate the performance of a single-point prediction, assuming we know the true targets’ distribution.
Let’s say we have a fixed point prediction $\hat{y}$, and know that the true values follow a certain probability distribution $p(y)$. In this case, we will assume it follows a normal distribution with mean equal to $\hat{y}$ plus some bias and standard deviation equal to some noise.
import numpy as np

n = 30_000     # draw 30k samples
y_hat = 0      # fixed point prediction
noise_std = 2  # standard deviation of the true-value distribution
bias = -1      # systematic offset between prediction and true mean
mu = y_hat + bias
y = np.random.normal(mu, noise_std, n)  # actual values (oracle knowledge)
Remembering our probability classes, we know that the expected value of a function $g(y)$ over a probability distribution $p(y)$ is defined as:

$$\mathbb{E}[g(y)] = \int_{-\infty}^{\infty} g(y)\, p(y)\, dy$$

If we are interested in estimating Mean Absolute Error (MAE), then our error function would be $g(y) = |\hat{y} - y|$. So, the expected MAE is:

$$\mathbb{E}\left[\,|\hat{y} - y|\,\right] = \int_{-\infty}^{\infty} |\hat{y} - y|\, p(y)\, dy$$
In code, the inner term of the integral can be implemented as:
from scipy import stats

def mae_distribution(y, y_hat, mean, std):
    metric = abs(y_hat - y)                     # g(y): absolute error
    probability = stats.norm.pdf(y, mean, std)  # p(y): normal density
    return metric * probability                 # g(y) * p(y)
In practice, we can't integrate from negative infinity to positive infinity, so we choose a reasonable range (in this case, from -10 to 10) and use SciPy’s `quad` function to compute a definite integral.
from scipy.integrate import quad

lower_limit = -10  # lower integration limit
upper_limit = 10   # upper integration limit

estimated_mae, _ = quad(
    mae_distribution,  # this is g(y) * p(y)
    lower_limit,
    upper_limit,
    args=(y_hat, mu, noise_std),
)
Checking our estimation against the actual MAE, we see that the two differ by only about 0.002. Pretty fun, right?
mae = np.mean(np.abs(y_hat - y))
print(f"MAE: {mae}")
print(f"Estimated MAE: {estimated_mae}")
print(f"Realized MAE vs Estimated MAE diff: {mae - estimated_mae}")
MAE: 1.7932928860190636
Estimated MAE: 1.7911506680442473
Realized MAE vs Estimated MAE diff: 0.002142217974816285
We can repeat the experiment, but this time for many different point predictions, and use our oracle knowledge to show that if we can design a method that approximates the true $p(y)$, then we would have solved performance estimation for regression.

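To make that repetition concrete, here is a minimal sketch of the loop. The helper name `expected_mae` and the grid of point predictions are our own illustrative choices, not part of the original experiment:

import numpy as np
from scipy import stats
from scipy.integrate import quad

def expected_mae(y_hat, mean, std):
    # integrate g(y) * p(y) = |y_hat - y| * N(y; mean, std) over a wide range
    integrand = lambda y: abs(y_hat - y) * stats.norm.pdf(y, mean, std)
    value, _ = quad(integrand, mean - 10 * std, mean + 10 * std)
    return value

bias, noise_std, n = -1, 2, 30_000
for y_hat in np.linspace(-3, 3, 7):  # many different point predictions
    mu = y_hat + bias
    y = np.random.normal(mu, noise_std, n)  # oracle knowledge of p(y)
    realized = np.mean(np.abs(y_hat - y))   # realized MAE
    estimated = expected_mae(y_hat, mu, noise_std)
    print(f"y_hat={y_hat:+.1f}  realized={realized:.3f}  estimated={estimated:.3f}")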
We got such a good result because we did a good job setting up $p(y)$. However, in real life, we don’t know $\mu$ nor $\sigma$. What this toy example confirms is that if we figure out a way to construct a decent $p(y)$, we could estimate the performance of any regression metric.
Explored Methods
We explored many ways to tackle the problem of estimating regression performance. We tried classic Bayesian methods. We tried conformalized quantile regression. We even ventured into some creative ideas in between. Let’s begin with where we started: Bayesian approaches.
Bayesian Approaches
In the toy example, we learned that if we have a decent $\hat{p}(y)$ that approximates $p(y)$, we could get a good estimation for any regression metric. Our first approach to estimating this posterior distribution was applying Bayesian statistics.
When thinking about linear regression, we generally think about it in frequentist terms, which means modeling the dependent variable as:

$$y = \beta_0 + \beta_1 x_1 + \dots + \beta_n x_n + \epsilon$$

where $x_1, \dots, x_n$ are the independent variables, $\beta_0, \dots, \beta_n$ the coefficients of the model, and $\epsilon$ an error term, which we assume is normally distributed.
On the other hand, Bayesian statistics takes a probabilistic point of view and expresses this model in terms of probability distributions. The above linear regression can then be reformulated as follows:

$$y \sim \mathcal{N}(\mu, \sigma^2), \qquad \mu = \beta_0 + \beta_1 x_1 + \dots + \beta_n x_n$$

where $y$ is now a random variable that follows a normal distribution whose mean $\mu$ is provided by our linear predictor and whose variance equals $\sigma^2$.
To make this work, we must provide good prior distributions for the unknown variables. For example, if our model takes two features $x_1$ and $x_2$, the Bayesian linear regression would look something like this:

$$y \sim \mathcal{N}(\beta_0 + \beta_1 x_1 + \beta_2 x_2,\ \sigma^2)$$

where we would need to assume the forms of the priors for $\beta_0$, $\beta_1$, $\beta_2$, and $\sigma$.
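As an illustration, here is a minimal sketch of such a model in PyMC. The specific priors (wide normals for the coefficients, a half-normal for the noise) and the toy data are assumptions we picked for the example, not the ones from our research:

import numpy as np
import pymc as pm

# toy data with two features (stand-ins for a real dataset)
rng = np.random.default_rng(42)
x1, x2 = rng.normal(size=500), rng.normal(size=500)
y_obs = 1.5 + 2.0 * x1 - 0.5 * x2 + rng.normal(0, 1, size=500)

with pm.Model() as bayesian_lr:
    # priors for the unknown variables
    beta0 = pm.Normal("beta0", mu=0, sigma=10)
    beta1 = pm.Normal("beta1", mu=0, sigma=10)
    beta2 = pm.Normal("beta2", mu=0, sigma=10)
    sigma = pm.HalfNormal("sigma", sigma=5)
    # the linear predictor gives the mean of the normal likelihood
    mu = beta0 + beta1 * x1 + beta2 * x2
    y = pm.Normal("y", mu=mu, sigma=sigma, observed=y_obs)
    # sample from the posterior
    idata = pm.sample(1000, tune=1000)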
Assuming the distribution of all the model features adds a lot of complexity and uncertainty, even more so if we expect the method to generalize well on any customer’s dataset and model.
While trying to build something that generalizes well, we explored many approaches, such as assuming that $p(y)$ is a mixture of Gaussians or that the relationship between parameters is more complex than linear (e.g., higher-order polynomials with interactions). This approach worked well for simple datasets, as seen in the image below, where the estimated MAE looks very similar to the realized one.

However, for much more complex datasets, we had issues setting good prior distributions, convergence wasn’t always guaranteed, and, computationally speaking, the method was very expensive. For these reasons, we suspended our research into Bayesian approaches and decided to look into Conformalized Quantile Regression (CQR) to estimate the posteriors.
Conformalized Quantile Regression (CQR)
The idea of exploring CQR came from thinking that maybe we don’t need to estimate the whole posterior distribution; a “low-resolution” version might be enough. CQR would let us approximate the posterior distribution through its quantiles.
Unlike least squares regression, which estimates the conditional mean of the response variable $y$ given some inputs $X$, quantile regression can estimate the median or any other quantile.
To ensure the accuracy of these quantiles, we used Conformal Prediction. This approach let us provide prediction intervals, offering more reliable estimates than just a single-point prediction.
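For context, here is a minimal sketch of CQR in the spirit of Romano et al., using scikit-learn’s gradient-boosted quantile regressors. The toy data, split sizes, and the 0.1/0.9 quantile levels are illustrative assumptions:

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 3))
y = X[:, 0] + 0.5 * X[:, 1] ** 2 + rng.normal(0, 1, size=2000)
X_train, X_cal, y_train, y_cal = train_test_split(X, y, test_size=0.5, random_state=0)

alpha = 0.2  # target miscoverage -> 0.1 and 0.9 quantile regressors
q_lo = GradientBoostingRegressor(loss="quantile", alpha=alpha / 2).fit(X_train, y_train)
q_hi = GradientBoostingRegressor(loss="quantile", alpha=1 - alpha / 2).fit(X_train, y_train)

# conformity scores: how far calibration points fall outside the quantile band
scores = np.maximum(q_lo.predict(X_cal) - y_cal, y_cal - q_hi.predict(X_cal))
n = len(y_cal)
correction = np.quantile(scores, np.ceil((n + 1) * (1 - alpha)) / n)

# conformalized interval for new points: widen both quantile estimates
X_new = rng.normal(size=(5, 3))
lower = q_lo.predict(X_new) - correction
upper = q_hi.predict(X_new) + correction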
However, there was a big challenge. When applying CQR to estimate model performance, we noticed that the quantiles with the greatest impact on performance were also the most uncertain. These quantiles, typically at the distribution's tails (0.1 and 0.9), are notoriously difficult to approximate, mainly due to data sparsity: by definition, the tails of the distribution contain fewer data points. For instance, only 10% of the data lie below the 0.1 quantile, and likewise only 10% lie above the 0.9 quantile. This sparsity reduces the model's ability to estimate the conditional quantile reliably.

We call this phenomenon “flapping tails”. The image above shows the Kernel Density Estimation (KDE) plots of the realized MAE and the estimated values at the 0.1 and 0.9 quantiles. We observe that the estimated quantile distributions are more “jumpy” and exhibit greater variance, hence the term “flapping tails”.
While toying with CQR, we asked ourselves, what if we don’t look at the posteriors and simply try to find points of the serving data that look similar in the test set? We came up with a method that, in a way, resembles Importance Weighting but with a few interesting (and probably unnecessary) additions. Check out the Appendix section if you want to learn more about this approach.
After experimenting with these approaches for a while, we realized we were overcomplicating things. We could skip the posterior estimation part entirely and Directly Estimate the Loss. This reasoning led us to invent DLE.
Direct Loss Estimation (DLE)
We developed DLE based on the idea that model loss patterns can be learned directly. Instead of trying to predict actual errors or assuming their distribution, we found that predicting the magnitude of errors (absolute error) was a significantly easier problem to solve.
Let's consider the following example to build intuition about how performance estimation for regression works. We have a simple model with one continuous feature, $x$, and aim to estimate the target variable, $y$. If we plot the model inputs against the target variable, one pattern we might notice is that at lower values of $x$, there is less dispersion in the target variable $y$. Conversely, at higher values of $x$, the dispersion along the y-axis increases.

Given that scenario, we would expect lower absolute errors in the region with low dispersion compared to the region with higher dispersion. We could even plot the rolling mean absolute error to observe how it increases as $x$ increases.

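A small sketch reproduces this intuition with synthetic heteroscedastic data; the data-generating process and the window size are our own illustrative choices:

import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
x = np.sort(rng.uniform(0, 10, 2000))
y = 2 * x + rng.normal(0, 0.2 + 0.3 * x)  # noise grows with x (heteroscedastic)

y_pred = 2 * x  # a simple model capturing the trend
abs_error = np.abs(y - y_pred)

# rolling mean absolute error along increasing x: it grows with the dispersion
rolling_mae = pd.Series(abs_error).rolling(window=200).mean()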
This example hints that the model’s absolute error can be modeled as a function of its features. We took this idea to design DLE and build a method that quantifies the uncertainty of the model in a single metric, given its inputs.
Implementation details
DLE quantifies model uncertainty in a single metric by training an internal ML model, known as the nanny model, to predict the loss of a monitored model. This internal model uses the same input features as the monitored model, along with the model's predictions as additional features. The key difference is that the internal model is trained with the loss of the monitored model as its target.
After estimating the loss, DLE turns it into a performance metric. This gives us a way to observe the expected performance of an ML model, even when we don't have access to the actual target values.
Let’s study it step by step. We’ll denote the monitored model with $f$ and the nanny model with $g$. For simplicity, let’s assume we are interested in estimating the Mean Absolute Error (MAE) of $f$ on some analysis data for which targets are not available. $f$ was trained on train data and used on reference data, providing predictions $\hat{y} = f(X)$. Targets $y$ for the reference set are available. The algorithm runs as follows (a minimal sketch follows the list):
- Loss Calculation: For each observation of the reference data, we calculate the loss, which in the case of MAE is the absolute error of the prediction, i.e. $AE = |y - \hat{y}|$.
- Nanny Model Training: Then, DLE trains a nanny model $g$ on reference data using the features $X$ plus the monitored model predictions $\hat{y}$ as inputs, while, as the target, it uses the absolute error calculated in the previous step. So, $g: (X, \hat{y}) \mapsto AE$.
- Performance Estimation: Finally, DLE estimates the performance of the monitored model on analysis data using the nanny model $g$, meaning that $\widehat{AE} = g(X, \hat{y})$; from here, we can finally calculate the mean of $\widehat{AE}$ to get the estimated MAE.
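To make the steps concrete, here is a minimal from-scratch sketch, assuming a generic scikit-learn regressor as the nanny model. It is a simplified stand-in, not NannyML's production implementation:

import numpy as np
from sklearn.ensemble import RandomForestRegressor

def fit_nanny_model(X_ref, y_ref, y_pred_ref):
    # Step 1: loss calculation on reference data, where targets are available
    abs_error = np.abs(y_ref - y_pred_ref)
    # Step 2: nanny model g maps (features, predictions) -> absolute error
    nanny_inputs = np.column_stack([X_ref, y_pred_ref])
    return RandomForestRegressor(random_state=0).fit(nanny_inputs, abs_error)

def estimate_mae(nanny, X_ana, y_pred_ana):
    # Step 3: estimate each observation's absolute error, then average
    nanny_inputs = np.column_stack([X_ana, y_pred_ana])
    return nanny.predict(nanny_inputs).mean()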
Unlike PESP (described in the Appendix), which relies on local distribution fitting, or Bayesian methods, which need priors for every feature, DLE learns a direct mapping from features and predictions to expected losses. This makes it particularly easy to implement and generalize across multiple use cases.
Conclusion
After testing Bayesian methods, Conformalized Quantile Regression, and other approaches, we realized that predicting error magnitude is simpler than estimating full error distributions. This realization led us to develop DLE, a more efficient and generalized solution. Since this approach is model-agnostic, it can estimate any regression metric without requiring access to ground truth data. Check out the open-source NannyML library to test DLE on your own model with your own data! Don’t forget to star it. 🌟
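As a quick taste, here is a usage sketch along the lines of the library's documented API; the column names, metric list, and chunk size are placeholders to adapt to your own data:

import nannyml as nml

# reference: data with targets available; analysis: production data without targets
estimator = nml.DLE(
    feature_column_names=['feature_1', 'feature_2'],  # your model's inputs
    y_pred='y_pred',
    y_true='y_true',
    metrics=['mae', 'rmse'],
    chunk_size=5000,
)
estimator.fit(reference_df)                # fitting needs targets
results = estimator.estimate(analysis_df)  # estimation does not
results.plot().show()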
Appendix
Performance Estimation from Similar Points (PESP)
After exploring Bayesian methods, Nikos, one of our research data scientists, had the idea that similar inputs tend to produce similar prediction errors. So, instead of trying to model the entire error distribution, we could look at how our model performs on similar data points and use that to estimate performance on new data.
The method consists of four main steps:
- Dimensionality Reduction: First, we combine our features and reference predictions into a single representation and reduce its dimensionality using PCA. This gives us a space where we can meaningfully measure the similarity between data points.
- Local Distribution Fitting: For each point in our reference dataset, we:
    - Find the $k$ closest neighbors in the reduced space
    - Fit a Laplace distribution to their residuals
    - Store the location ($\mu$) and scale ($b$) parameters
In math terms, we assumed that the residuals followed a Laplace distribution $\text{Laplace}(\mu, b)$. We also experimented with other distributions, such as Gaussian, Exponential, Cauchy, Asymmetric Laplace, etc. Laplace is featured here as it provided decent results.
- Distribution Parameter Modeling: We then train two ridge regression models:
    - One to predict the location parameter $\mu$ for any new point
    - Another to predict the scale parameter $b$
- Performance Estimation: For new data points, we (as sketched below):
    - Use the fitted ridge regression models to estimate the Laplace parameters $\mu$ and $b$
    - Sample from these distributions to estimate errors
    - Calculate our performance metric of interest
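The following minimal sketch strings the four steps together; the neighbor count, PCA dimensionality, and sample counts are illustrative assumptions, not the values we used:

import numpy as np
from scipy import stats
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge
from sklearn.neighbors import NearestNeighbors

def fit_pesp(X_ref, y_pred_ref, residuals, k=50, n_components=2):
    # Step 1: dimensionality reduction over combined features and predictions
    pca = PCA(n_components=n_components)
    Z = pca.fit_transform(np.column_stack([X_ref, y_pred_ref]))
    # Step 2: fit a Laplace distribution to each point's neighborhood residuals
    nn = NearestNeighbors(n_neighbors=k).fit(Z)
    _, idx = nn.kneighbors(Z)
    locs, scales = zip(*(stats.laplace.fit(residuals[i]) for i in idx))
    # Step 3: ridge models predicting the Laplace parameters from the reduced space
    loc_model = Ridge().fit(Z, np.array(locs))
    scale_model = Ridge().fit(Z, np.array(scales))
    return pca, loc_model, scale_model

def estimate_mae_pesp(pca, loc_model, scale_model, X_new, y_pred_new, n_samples=1000):
    # Step 4: sample residuals from the predicted Laplace distributions
    Z_new = pca.transform(np.column_stack([X_new, y_pred_new]))
    mu = loc_model.predict(Z_new)
    b = np.clip(scale_model.predict(Z_new), 1e-6, None)  # scale must stay positive
    sampled = stats.laplace.rvs(loc=mu[:, None], scale=b[:, None],
                                size=(len(mu), n_samples))
    return np.abs(sampled).mean()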
We didn’t pursue this method further, for several reasons. First, it was highly sensitive to assumptions about the residuals’ distribution: if those assumptions were off, so were the results. Second, it was costly; to make it work, we had to fit multiple models (PCA, a Laplace distribution per point, and two separate regression models), each adding to the computational load. Third, the method depended heavily on the number of neighbors ($k$) chosen, making it unstable if $k$ wasn’t just right. These challenges made us look for alternatives.