Quantile Regression on Synthetic DataĀ¶
This notebook shows the diagnostic tools applied to quantile regression. For details, see https://arxiv.org/abs/2202.12780.
First, we prepare some plotting settings.
from cycler import cycler
import matplotlib
import matplotlib.pyplot as plt
cmap = matplotlib.colormaps.get_cmap("tab10")
# Swap 3rd with 7th position to make 3rd gray.
color_cy = cycler("color", cmap([0, 1, 7, 2, 3, 4, 5, 6, 8, 9]))
# This can be checked by running
# from matplotlib.colors import ListedColormap
# display(cmap)
# display(ListedColormap(color_cy.by_key()["color"]))
new_cycler = color_cy + cycler(linestyle=["-"] * 2 + ['--'] + ['-'] * 7)
plt.rc('axes', prop_cycle=new_cycler)
1. Data and ModelsĀ¶
We start by creating an artificial dataset with an asymmetric distribution such that mean and median are very different. Then, we fit a linear model and gradient boosted trees for the 75%-quantile.
import numpy as np
import polars as pl
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.linear_model import QuantileRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import SplineTransformer
quantile_level = 0.75
def create_data(n, rng):
x = np.linspace(start=0, stop=10, num=200)
y_true_mean = 10 + 0.5 * x * np.sin(x)
a = 5
# Note that numpy uses the Lomax version of pareto with mean 1/(a-1).
y = y_true_mean + 10 * (rng.pareto(a, size=x.shape[0]) - 1 / (a - 1))
y_true_quantile = y_true_mean + 10 * (
np.power(1 - quantile_level, -1/a) - 1 - 1 / (a - 1)
)
return y, x, y_true_quantile
rng = np.random.RandomState(42)
y_train, x_train, y_quantile_train = create_data(200, rng)
y_test, x_test, y_quantile_test = create_data(100, rng)
X_train, X_test = x_train[:, np.newaxis], x_test[:, np.newaxis]
m_linear = GridSearchCV(
estimator=make_pipeline(
SplineTransformer(degree=3),
QuantileRegressor(quantile=quantile_level, solver="highs"),
),
param_grid={
"splinetransformer__n_knots": [3, 5, 8, 10],
"quantileregressor__alpha": np.logspace(-4, 0, 20)
},
cv=5,
).fit(X_train, y_train)
print(
"best (alpha, n_knots) for linear quantile regression = "
f"({m_linear.best_params_['quantileregressor__alpha']}, "
f"{m_linear.best_params_['splinetransformer__n_knots']})"
)
m_hgbt = HistGradientBoostingRegressor(
loss="quantile", quantile=quantile_level
).fit(X_train, y_train)
fig, ax = plt.subplots()
ax.scatter(x_train, y_train, color="black", alpha=0.5)
ax.plot(x_train, m_linear.predict(X_train), label=f"LinReg Quantile: {quantile_level}")
ax.plot(x_train, m_hgbt.predict(X_train), label=f"HGBT Quantile: {quantile_level}")
ax.plot(
x_train, y_quantile_train, label=f"True {quantile_level}-Quantile",
)
ax.legend()
best (alpha, n_knots) for linear quantile regression = (0.00026366508987303583, 5)
<matplotlib.legend.Legend at 0x13ae45dc0>
2. Calibration AssessmentĀ¶
First, we look at the calibration in total, i.e. neither conditioning on a feature nor on the predictions.
We want to predict the 75%-quantile, so we expect that 25% of the data lie above and 75% below the model predictions.
This is what the identification function for quantiles actually calculates such that a value of 0 is ideal.
Note that the identification function is implicitly called in compute_bias
as well as plot_bias
.
from model_diagnostics.calibration import compute_bias, plot_bias, plot_reliability_diagram
df = pl.DataFrame({
"linear_quant_reg": m_linear.predict(X_train),
"hgbt_quant_reg": m_hgbt.predict(X_train),
"true_quantile": y_quantile_train,
})
compute_bias(y_obs=y_train, y_pred=df, functional="quantile", level=quantile_level)
model | bias_mean | bias_count | bias_weights | bias_stderr | p_value |
---|---|---|---|---|---|
str | f64 | u32 | f64 | f64 | f64 |
"linear_quant_reg" | 0.005 | 200 | 200.0 | 0.030488 | 0.869899 |
"hgbt_quant_reg" | 0.0 | 200 | 200.0 | 0.030695 | 1.0 |
"true_quantile" | -0.005 | 200 | 200.0 | 0.030897 | 0.871607 |
ax = plot_bias(
y_obs=y_train,
y_pred=df,
functional="quantile",
level=quantile_level,
confidence_level=1 - 0.9,
)
ax.set_title("Bias Plot on Training Set")
Text(0.5, 1.0, 'Bias Plot on Training Set')
By setting a very small confidence level $\alpha$, such that $1-\alpha$ is in between the p-values, we see that the confidence interval of the linear model does not reach the 0-line while the one of the HGBT model does. Please note that this is a poor choice of a confidence level for this sample size because even the true quantile does not cover the 0-line due to sample uncertainty.
We go on by investigating auto-calibration with the classical tool: a reliability diagram.
%%time
ax = plot_reliability_diagram(
y_train,
df,
functional="quantile",
level=quantile_level,
n_bootstrap=100,
)
ax.set_ylim(None, 20)
ax.set_title("Reliability Diagram on Training Set")
CPU times: user 3.64 s, sys: 155 ms, total: 3.79 s Wall time: 3.75 s
Text(0.5, 1.0, 'Reliability Diagram on Training Set')
The reliability diagram shows reasonable auto-calibration for both models. For large predicted values, the linear model seems a bit off.
We conclude the assessment of calibration by a bias plot conditional on our single feature.
ax = plot_bias(y_obs=y_train, y_pred=df, feature=x_train, confidence_level=0.9)
ax.set_title("Bias Plot conditional on prediction on Training Set")
Text(0.5, 1.0, 'Bias Plot conditional on prediction on Training Set')
Here again, the HGBT model seems a bit better calibrated.
3. Comparison of Model PerformanceĀ¶
We use the standard loss function for quantile regression: the pinball loss. Furthermore, as is standard, we do this analysis out-of-sample, i.e. using the test data set.
We report not only the pinball loss or score, but also the additive score decomposition: score = miscalibration - discrimination + uncertainty
.
from model_diagnostics.scoring import decompose, plot_murphy_diagram, PinballLoss
pbl = PinballLoss(level=quantile_level)
df = pl.DataFrame({
"linear_quant_reg": m_linear.predict(X_test),
"hgbt_quant_reg": m_hgbt.predict(X_test),
"true_quantile": y_quantile_test,
})
decompose(y_obs=y_test, y_pred=df, scoring_function=pbl).sort("score")
model | miscalibration | discrimination | uncertainty | score |
---|---|---|---|---|
str | f64 | f64 | f64 | f64 |
"linear_quant_rā¦ | 0.065729 | 0.24092 | 1.075195 | 0.900004 |
"true_quantile" | 0.082475 | 0.257558 | 1.075195 | 0.900112 |
"hgbt_quant_regā¦ | 0.056078 | 0.206818 | 1.075195 | 0.924455 |
The linear model has a better (lower) out-of-sample score (or loss). While the HGBT has a better miscalibration term (smaller is better), the dominating term is the discrimination (larger is better) where the linear model is clearly superior.
Again, note that there is sample uncertainty. It causes the true quantile to have a slightly worse score than the linear model. A simple t-test will show if the score differences are significant or related to sample uncertainty.
from scipy.stats import ttest_1samp
ttest_1samp(
pbl.score_per_obs(y_obs=y_test, y_pred=m_linear.predict(X_test))
- pbl.score_per_obs(y_obs=y_test, y_pred=y_quantile_test),
popmean=0,
)
TtestResult(statistic=-0.005514228985365129, pvalue=0.9956058279989309, df=199)
The test statistic is negative which means that the score of the linear model is better than the one of the true quantile. With the very large p-value, however, we cannot reject the null hypothesis of equal mean values, i.e. equal scores. It means that we do not have statistical evidence to say that any one of the linear and the true model performs better.
We play the same game and perform a t-test for the score difference of the linear and the HGBT models.
ttest_1samp(
pbl.score_per_obs(y_obs=y_test, y_pred=m_linear.predict(X_test))
- pbl.score_per_obs(y_obs=y_test, y_pred=m_hgbt.predict(X_test)),
popmean=0,
)
TtestResult(statistic=-1.1262034739249922, pvalue=0.26143569923373233, df=199)
This time, the p-value is 26%: Assuming the null hypothesis of equal scores is true, there is a 26% chance of the test statistic to be more extreme as our obtained value of -1.126. This is not splendid, but, given our small sample size of 100, it is small enough to reject the null hypothesis and say that the linear model has a significantly better score than the HGBT model.
All of the above score analysis was based on the pinball loss, but there are other possible choices.
Does the ranking of the models change with the choice of a scoring function?
This can be analysed by means of the Murphy diagram.
The x-axis eta
specifies different scoring/loss functions, all consistent for the 75%-quantile.
plot_murphy_diagram(y_obs=y_test, y_pred=df, functional="quantile", level=quantile_level)
<Axes: title={'center': 'Murphy Diagram'}, xlabel='eta', ylabel='score'>
The linear model seems indeed to be superior over a range of eta
values.