calibration
Tools to assess model calibration.
compute_bias(y_obs, y_pred, feature=None, weights=None, *, functional='mean', level=0.5, n_bins=10, bin_method='sturges')
Compute generalised bias conditional on a feature.
This function computes and aggregates the generalised bias, i.e. the values of the
canonical identification function, versus (grouped by) a feature.
This is a good way to assess whether a model is conditionally calibrated.
Well-calibrated models have bias terms close to zero.
For the mean functional, the generalised bias is the negative residual
y_pred - y_obs.
See Notes for further details.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| y_obs | array-like of shape (n_obs) | Observed values of the response variable. For binary classification, y_obs is expected to be in the interval [0, 1]. | required |
| y_pred | array-like of shape (n_obs) or (n_obs, n_models) | Predicted values, e.g. for the conditional expectation of the response. | required |
| feature | array-like of shape (n_obs) or None | Some feature column. | None |
| weights | array-like of shape (n_obs) or None | Case weights. If given, the bias is calculated as a weighted average of the identification function with these weights. Note that the standard errors and p-values in the output are based on the assumption that the variance of the bias is inversely proportional to the weights. See the Notes section for details. | None |
| functional | str | The functional that is induced by the identification function. | 'mean' |
| level | float | The level of the expectile or quantile (often called \(\alpha\)). | 0.5 |
| n_bins | int | The number of bins, at least 2. For numerical features, if present, null values are always included in the output, accounting for one bin. NaN values are treated as null values. | 10 |
| bin_method | str | The method for finding bin edges (boundaries). For numerical features, the options for automatically selecting the number of bins (using uniform bins) are the same as for numpy.histogram_bin_edges. | 'sturges' |
Returns:
| Name | Type | Description |
|---|---|---|
| df | DataFrame | The result table contains at least the columns bias_mean, bias_count, bias_weights, bias_stderr and p_value. If feature is given, a feature column is added. |
Notes
A model \(m(X)\) is conditionally calibrated iff
\(\mathbb{E}(V(m(X), Y)|X)=0\) almost surely with canonical identification
function \(V\).
The empirical version, given some data, reads
\(\bar{V} = \frac{1}{n}\sum_i \phi(x_i) V(m(x_i), y_i)\) with a test function
\(\phi(x_i)\) that projects on the specified feature.
For a feature with only two distinct values "a" and "b", this becomes
\(\bar{V} = \frac{1}{n_a}\sum_{i \text{ with }x_i=a} V(m(a), y_i)\) with
\(n_a=\sum_{i \text{ with }x_i=a} 1\), the number of observations with \(x_i=a\), and similarly for "b".
With case weights, this reads
\(\bar{V} = \frac{1}{\sum_i w_i}\sum_i w_i \phi(x_i) V(m(x_i), y_i)\).
This generalises the classical residual (up to a minus sign) for target functionals
other than the mean. See [FLM2022].
The standard error for \(\bar{V}\) is calculated in the standard way as \(\mathrm{SE} = \sqrt{\operatorname{Var}(\bar{V})} = \frac{\sigma}{\sqrt{n}}\), with the standard variance estimator for \(\sigma^2 = \operatorname{Var}(\phi(x_i) V(m(x_i), y_i))\) with Bessel correction, i.e. division by \(n-1\) instead of \(n\).
With case weights, the variance estimator becomes \(\operatorname{Var}(\bar{V}) = \frac{1}{n-1} \frac{1}{\sum_i w_i} \sum_i w_i (V(m(x_i), y_i) - \bar{V})^2\) with the implied relation \(\operatorname{Var}(V(m(x_i), y_i)) \sim \frac{1}{w_i} \). If your weights are for repeated observations, so-called frequency weights, then the above estimate is conservative because it uses \(n - 1\) instead of \((\sum_i w_i) - 1\).
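For the mean functional, these formulas can be sketched in plain NumPy. This is an illustration of the Notes only, not the library's implementation; the helper name `bias_mean_and_stderr` is made up:

```python
import numpy as np

def bias_mean_and_stderr(y_obs, y_pred, weights=None):
    """Generalised bias for the mean functional and its standard error."""
    y_obs = np.asarray(y_obs, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    v = y_pred - y_obs  # V(m(x), y) = y_pred - y_obs for the mean functional
    n = v.shape[0]
    if weights is None:
        bias = v.mean()
        # SE = sigma / sqrt(n) with the Bessel-corrected variance estimator.
        stderr = np.sqrt(v.var(ddof=1) / n)
    else:
        w = np.asarray(weights, dtype=float)
        bias = np.average(v, weights=w)
        # Var(bar V) = 1/(n-1) * 1/sum(w) * sum(w * (V - bar V)^2)
        var = np.sum(w * (v - bias) ** 2) / ((n - 1) * w.sum())
        stderr = np.sqrt(var)
    return bias, stderr

bias, stderr = bias_mean_and_stderr([0, 0, 1, 1], [-1, 1, 1, 2])
# bias = 0.25 and stderr ≈ 0.478714, matching the first example below.
```

With unit weights both branches agree, as the weighted variance formula then reduces to the Bessel-corrected estimator divided by \(n\).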
References
[FLM2022] T. Fissler, C. Lorentzen, and M. Mayer. "Model Comparison and Calibration Assessment". (2022) arxiv:2202.12780.
Examples:
>>> compute_bias(y_obs=[0, 0, 1, 1], y_pred=[-1, 1, 1, 2])
shape: (1, 5)
┌───────────┬────────────┬──────────────┬─────────────┬──────────┐
│ bias_mean ┆ bias_count ┆ bias_weights ┆ bias_stderr ┆ p_value │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ f64 ┆ u32 ┆ f64 ┆ f64 ┆ f64 │
╞═══════════╪════════════╪══════════════╪═════════════╪══════════╡
│ 0.25 ┆ 4 ┆ 4.0 ┆ 0.478714 ┆ 0.637618 │
└───────────┴────────────┴──────────────┴─────────────┴──────────┘
>>> compute_bias(y_obs=[0, 0, 1, 1], y_pred=[-1, 1, 1, 2],
... feature=["a", "a", "b", "b"])
shape: (2, 6)
┌─────────┬───────────┬────────────┬──────────────┬─────────────┬─────────┐
│ feature ┆ bias_mean ┆ bias_count ┆ bias_weights ┆ bias_stderr ┆ p_value │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ u32 ┆ f64 ┆ f64 ┆ f64 │
╞═════════╪═══════════╪════════════╪══════════════╪═════════════╪═════════╡
│ a ┆ 0.0 ┆ 2 ┆ 2.0 ┆ 1.0 ┆ 1.0 │
│ b ┆ 0.5 ┆ 2 ┆ 2.0 ┆ 0.5 ┆ 0.5 │
└─────────┴───────────┴────────────┴──────────────┴─────────────┴─────────┘
plot_bias(y_obs, y_pred, feature=None, weights=None, *, functional='mean', level=0.5, n_bins=10, bin_method='sturges', confidence_level=0.9, ax=None)
Plot model bias conditional on a feature.
This plots the generalised bias (residuals), i.e. the values of the canonical identification function, versus a feature. This is a good way to assess whether a model is conditionally calibrated or not. Well calibrated models have bias terms around zero. See Notes for further details.
For numerical features, NaN values are treated as null values. Null values are always plotted as the rightmost value on the x-axis and marked with a diamond instead of a dot.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| y_obs | array-like of shape (n_obs) | Observed values of the response variable. For binary classification, y_obs is expected to be in the interval [0, 1]. | required |
| y_pred | array-like of shape (n_obs) or (n_obs, n_models) | Predicted values, e.g. for the conditional expectation of the response. | required |
| feature | array-like of shape (n_obs) or None | Some feature column. | None |
| weights | array-like of shape (n_obs) or None | Case weights. If given, the bias is calculated as a weighted average of the identification function with these weights. Note that the standard errors and p-values in the output are based on the assumption that the variance of the bias is inversely proportional to the weights. See the Notes section for details. | None |
| functional | str | The functional that is induced by the identification function. | 'mean' |
| level | float | The level of the expectile or quantile (often called \(\alpha\)). | 0.5 |
| n_bins | int | The number of bins, at least 2. For numerical features, if present, null values are always included in the output, accounting for one bin. NaN values are treated as null values. | 10 |
| bin_method | str | The method for finding bin edges (boundaries). For numerical features, the options for automatically selecting the number of bins (using uniform bins) are the same as for numpy.histogram_bin_edges. | 'sturges' |
| confidence_level | float | Confidence level for error bars. If 0, no error bars are plotted. | 0.9 |
| ax | matplotlib.axes.Axes or plotly Figure | Axes object to draw the plot onto, otherwise uses the current Axes. | None |
Returns:
| Name | Type | Description |
|---|---|---|
| ax | matplotlib.axes.Axes or plotly Figure | Either the matplotlib axes or the plotly figure. This is configurable by setting the plot_backend (e.g. via config_context). |
Notes
A model \(m(X)\) is conditionally calibrated iff \(\mathbb{E}(V(m(X), Y)|X)=0\) almost
surely. The empirical version, given some data, reads \(\frac{1}{n}\sum_i V(m(x_i), y_i)\).
See [FLM2022].
References
[FLM2022] T. Fissler, C. Lorentzen, and M. Mayer. "Model Comparison and Calibration Assessment". (2022) arxiv:2202.12780.
compute_marginal(y_obs, y_pred, X=None, feature_name=None, predict_function=None, weights=None, *, n_bins=10, bin_method='sturges', n_max=1000, rng=None)
Compute the marginal expectation conditional on a single feature.
This function computes the (weighted) average of observed response and predictions conditional on a given feature.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| y_obs | array-like of shape (n_obs) | Observed values of the response variable. For binary classification, y_obs is expected to be in the interval [0, 1]. | required |
| y_pred | array-like of shape (n_obs) or (n_obs, n_models) | Predicted values, e.g. for the conditional expectation of the response. | required |
| X | array-like of shape (n_obs, n_features) or None | The dataframe or array of features to be passed to the model predict function. | None |
| feature_name | int, str or None | Column name (str) or index (int) of the feature in X. | None |
| predict_function | callable or None | A callable to get predictions, e.g. the predict method of a fitted model. Used for the partial dependence computation. | None |
| weights | array-like of shape (n_obs) or None | Case weights. If given, the marginal values are calculated as weighted averages with these weights. | None |
| n_bins | int | The number of bins, at least 2. For numerical features, if present, null values are always included in the output, accounting for one bin. NaN values are treated as null values. | 10 |
| bin_method | str | The method for finding bin edges (boundaries). For numerical features, the options for automatically selecting the number of bins (using uniform bins) are the same as for numpy.histogram_bin_edges. | 'sturges' |
| n_max | int or None | Used only for partial dependence computation. The number of rows to subsample from X. This speeds up computation, in particular for slow predict functions. | 1000 |
| rng | Generator, int or None | Used only for partial dependence computation. The random number generator used for subsampling of X. | None |
Returns:
| Name | Type | Description |
|---|---|---|
| df | DataFrame | The result table contains at least the columns y_obs_mean, y_pred_mean, y_obs_stderr, y_pred_stderr, count and weights. If feature_name is given, a feature column and a bin_edges column are added. If predict_function is given, a partial_dependence column is added. |
Notes
The marginal values are computed as an estimation of:
- y_obs: \(\mathbb{E}(Y|\text{feature})\)
- y_pred: \(\mathbb{E}(m(X)|\text{feature})\)

with \(\text{feature}\) the column specified by feature_name.
Computationally that is more or less a group-by-aggregate operation on a dataset.
The standard errors for both are calculated in the standard way as \(\mathrm{SE} = \sqrt{\operatorname{Var}(\bar{Y})} = \frac{\sigma}{\sqrt{n}}\), with the standard variance estimator for \(\sigma^2\) with Bessel correction, i.e. division by \(n-1\) instead of \(n\).
With case weights, the variance estimator becomes \(\operatorname{Var}(\bar{Y}) = \frac{1}{n-1} \frac{1}{\sum_i w_i} \sum_i w_i (y_i - \bar{y})^2\) with the implied relation \(\operatorname{Var}(y_i) \sim \frac{1}{w_i} \). If your weights are for repeated observations, so-called frequency weights, then the above estimate is conservative because it uses \(n - 1\) instead of \((\sum_i w_i) - 1\).
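The group-wise aggregation can be sketched in plain NumPy. This only illustrates the formulas above; the helper name `marginal_mean_and_stderr` is made up, and compute_marginal applies this per feature bin (a group-by-aggregate) to both y_obs and y_pred:

```python
import numpy as np

def marginal_mean_and_stderr(values, weights=None):
    """(Weighted) mean and its standard error, per the formulas above."""
    v = np.asarray(values, dtype=float)
    n = v.shape[0]
    w = np.ones(n) if weights is None else np.asarray(weights, dtype=float)
    mean = np.average(v, weights=w)
    # Var(bar Y) = 1/(n-1) * 1/sum(w) * sum(w * (y - bar y)^2)
    var = np.sum(w * (v - mean) ** 2) / ((n - 1) * w.sum())
    return mean, np.sqrt(var)

# Without a feature, there is a single group containing all observations:
y_obs_mean, y_obs_stderr = marginal_mean_and_stderr([0, 0, 1, 1])     # 0.5, ≈0.288675
y_pred_mean, y_pred_stderr = marginal_mean_and_stderr([-1, 1, 1, 2])  # 0.75, ≈0.629153
```

These values match the first example below.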
Examples:
>>> compute_marginal(y_obs=[0, 0, 1, 1], y_pred=[-1, 1, 1, 2])
shape: (1, 6)
┌────────────┬─────────────┬──────────────┬───────────────┬───────┬─────────┐
│ y_obs_mean ┆ y_pred_mean ┆ y_obs_stderr ┆ y_pred_stderr ┆ count ┆ weights │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ f64 ┆ f64 ┆ f64 ┆ f64 ┆ u32 ┆ f64 │
╞════════════╪═════════════╪══════════════╪═══════════════╪═══════╪═════════╡
│ 0.5 ┆ 0.75 ┆ 0.288675 ┆ 0.629153 ┆ 4 ┆ 4.0 │
└────────────┴─────────────┴──────────────┴───────────────┴───────┴─────────┘
>>> import polars as pl
>>> from sklearn.linear_model import Ridge
>>> pl.Config.set_tbl_width_chars(84)
<class 'polars.config.Config'>
>>> y_obs, X = [0, 0, 1, 1], [[0, 1], [1, 1], [2, 2], [3, 2]]
>>> m = Ridge().fit(X, y_obs)
>>> compute_marginal(y_obs=y_obs, y_pred=m.predict(X), X=X, feature_name=0,
... predict_function=m.predict)
shape: (3, 9)
┌──────────┬─────────┬─────────┬─────────┬───┬───────┬─────────┬─────────┬─────────┐
│ feature ┆ y_obs_m ┆ y_pred_ ┆ y_obs_s ┆ … ┆ count ┆ weights ┆ bin_edg ┆ partial │
│ 0 ┆ ean ┆ mean ┆ tderr ┆ ┆ --- ┆ --- ┆ es ┆ _depend │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ u32 ┆ f64 ┆ --- ┆ ence │
│ f64 ┆ f64 ┆ f64 ┆ f64 ┆ ┆ ┆ ┆ array[f ┆ --- │
│ ┆ ┆ ┆ ┆ ┆ ┆ ┆ 64, 3] ┆ f64 │
╞══════════╪═════════╪═════════╪═════════╪═══╪═══════╪═════════╪═════════╪═════════╡
│ 0.5 ┆ 0.0 ┆ 0.125 ┆ 0.0 ┆ … ┆ 2 ┆ 2.0 ┆ [0.0, ┆ 0.25 │
│ ┆ ┆ ┆ ┆ ┆ ┆ ┆ 0.5, ┆ │
│ ┆ ┆ ┆ ┆ ┆ ┆ ┆ 1.0] ┆ │
│ 2.0 ┆ 1.0 ┆ 0.75 ┆ 0.0 ┆ … ┆ 1 ┆ 1.0 ┆ [1.0, ┆ 0.625 │
│ ┆ ┆ ┆ ┆ ┆ ┆ ┆ 0.0, ┆ │
│ ┆ ┆ ┆ ┆ ┆ ┆ ┆ 2.0] ┆ │
│ 3.0 ┆ 1.0 ┆ 1.0 ┆ 0.0 ┆ … ┆ 1 ┆ 1.0 ┆ [2.0, ┆ 0.875 │
│ ┆ ┆ ┆ ┆ ┆ ┆ ┆ 0.0, ┆ │
│ ┆ ┆ ┆ ┆ ┆ ┆ ┆ 3.0] ┆ │
└──────────┴─────────┴─────────┴─────────┴───┴───────┴─────────┴─────────┴─────────┘
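The partial_dependence column above can be reproduced by pinning the feature to a grid value for all (possibly subsampled) rows, predicting, and averaging. This is a sketch of the standard partial dependence recipe; the helper name `partial_dependence_sketch` is made up, and details such as subsampling may differ from the library:

```python
import numpy as np
from sklearn.linear_model import Ridge

def partial_dependence_sketch(predict_function, X, feature_index, grid,
                              n_max=1000, rng=0):
    """Average prediction with one feature pinned to each grid value."""
    X = np.asarray(X, dtype=float)
    if X.shape[0] > n_max:  # subsample rows for slow predict functions
        idx = np.random.default_rng(rng).choice(X.shape[0], n_max, replace=False)
        X = X[idx]
    result = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, feature_index] = value  # pin the feature to the grid value
        result.append(predict_function(X_mod).mean())
    return np.array(result)

y_obs, X = [0, 0, 1, 1], [[0, 1], [1, 1], [2, 2], [3, 2]]
m = Ridge().fit(X, y_obs)
partial_dependence_sketch(m.predict, X, feature_index=0, grid=[0.5, 2.0, 3.0])
# → array([0.25, 0.625, 0.875]), the partial_dependence values in the table above.
```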
plot_marginal(y_obs, y_pred, X, feature_name, predict_function=None, weights=None, *, n_bins=10, bin_method='sturges', n_max=1000, rng=None, ax=None, show_lines='numerical')
Plot marginal observed and predicted conditional on a feature.
This plot provides a means to inspect a model per feature. The averages of observed and predicted values are plotted, as well as a histogram of the feature.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| y_obs | array-like of shape (n_obs) | Observed values of the response variable. For binary classification, y_obs is expected to be in the interval [0, 1]. | required |
| y_pred | array-like of shape (n_obs) | Predicted values, e.g. for the conditional expectation of the response. | required |
| X | array-like of shape (n_obs, n_features) | The dataframe or array of features to be passed to the model predict function. | required |
| feature_name | str or int | Column name (str) or index (int) of the feature in X. | required |
| predict_function | callable or None | A callable to get predictions, e.g. the predict method of a fitted model. Used for the partial dependence computation. | None |
| weights | array-like of shape (n_obs) or None | Case weights. If given, the marginal values are calculated as weighted averages with these weights. | None |
| n_bins | int | The number of bins, at least 2. For numerical features, if present, null values are always included in the output, accounting for one bin. NaN values are treated as null values. | 10 |
| bin_method | str | The method for finding bin edges (boundaries). For numerical features, the options for automatically selecting the number of bins (using uniform bins) are the same as for numpy.histogram_bin_edges. | 'sturges' |
| n_max | int or None | Used only for partial dependence computation. The number of rows to subsample from X. This speeds up computation, in particular for slow predict functions. | 1000 |
| rng | Generator, int or None | Used only for partial dependence computation. The random number generator used for subsampling of X. | None |
| ax | matplotlib.axes.Axes or plotly Figure | Axes object to draw the plot onto, otherwise uses the current Axes. | None |
| show_lines | str | Option for how to display mean values and partial dependence. | 'numerical' |
Returns:
| Name | Type | Description |
|---|---|---|
| ax | matplotlib.axes.Axes or plotly Figure | Either the matplotlib axes or the plotly figure. This is configurable by setting the plot_backend (e.g. via config_context). |
Examples:
If you wish to plot multiple features at once with subfigures, here is how to do it with matplotlib:
```py
from math import ceil
import matplotlib.pyplot as plt
import numpy as np
from model_diagnostics.calibration import plot_marginal

# Replace by your own data and model.
n_obs = 100
y_obs = np.arange(n_obs)
X = np.ones((n_obs, 2))
X[:, 0] = np.sin(np.arange(n_obs))
X[:, 1] = y_obs ** 2

def model_predict(X):
    s = 0.5 * n_obs * np.sin(X)
    return s.sum(axis=1) + np.sqrt(X[:, 1])

# Now the plotting.
feature_list = [0, 1]
n_rows, n_cols = ceil(len(feature_list) / 2), 2
fig, axs = plt.subplots(nrows=n_rows, ncols=n_cols, sharey=True)
for i, ax in enumerate(axs):
    plot_marginal(
        y_obs=y_obs,
        y_pred=model_predict(X),
        X=X,
        feature_name=feature_list[i],
        predict_function=model_predict,
        ax=ax,
    )
fig.tight_layout()
```
For plotly, use the helper function add_marginal_subplot:
```py
from math import ceil
import numpy as np
from model_diagnostics import config_context
from plotly.subplots import make_subplots
from model_diagnostics.calibration import add_marginal_subplot, plot_marginal

# Replace by your own data and model.
n_obs = 100
y_obs = np.arange(n_obs)
X = np.ones((n_obs, 2))
X[:, 0] = np.sin(np.arange(n_obs))
X[:, 1] = y_obs ** 2

def model_predict(X):
    s = 0.5 * n_obs * np.sin(X)
    return s.sum(axis=1) + np.sqrt(X[:, 1])

# Now the plotting.
feature_list = [0, 1]
n_rows, n_cols = ceil(len(feature_list) / 2), 2
fig = make_subplots(
    rows=n_rows,
    cols=n_cols,
    vertical_spacing=0.3 / n_rows,  # equals default
    # subplot_titles=feature_list,  # maybe
    specs=[[{"secondary_y": True}] * n_cols] * n_rows,  # This is important!
)
for row in range(n_rows):
    for col in range(n_cols):
        i = n_cols * row + col
        with config_context(plot_backend="plotly"):
            subfig = plot_marginal(
                y_obs=y_obs,
                y_pred=model_predict(X),
                X=X,
                feature_name=feature_list[i],
                predict_function=model_predict,
            )
        add_marginal_subplot(subfig, fig, row, col)
fig.show()
```
add_marginal_subplot(subfig, fig, row, col)
Add a plotly subplot from plot_marginal to a multi-plot figure.
This auxiliary function accompanies
plot_marginal in order to ease
plotting with subfigures with the plotly backend.
For it to work, you must call plotly's make_subplots with the specs argument
and set the appropriate number of {"secondary_y": True} in a list of lists.
```py hl_lines="7"
from plotly.subplots import make_subplots

n_rows, n_cols = ...
fig = make_subplots(
    rows=n_rows,
    cols=n_cols,
    specs=[[{"secondary_y": True}] * n_cols] * n_rows,  # This is important!
)
```
The reason is that `plot_marginal` uses a secondary y-axis (with sides swapped with
the primary y-axis).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| subfig | plotly Figure | The subfigure which is added to fig. | required |
| fig | plotly Figure | The multi-plot figure to which subfig is added. | required |
| row | int | The (0-based) row index of the subplot. | required |
| col | int | The (0-based) column index of the subplot. | required |
Returns:
| Type | Description |
|---|---|
| plotly Figure | The plotly figure fig. |
identification_function(y_obs, y_pred, *, functional='mean', level=0.5)
Canonical identification function.
Identification functions act as generalised residuals. See Notes for further details.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| y_obs | array-like of shape (n_obs) | Observed values of the response variable. For binary classification, y_obs is expected to be in the interval [0, 1]. | required |
| y_pred | array-like of shape (n_obs) | Predicted values of the functional. | required |
| functional | str | The functional that is induced by the identification function. | 'mean' |
| level | float | The level of the expectile or quantile (often called \(\alpha\)). | 0.5 |
Returns:
| Name | Type | Description |
|---|---|---|
| V | ndarray of shape (n_obs) | Values of the identification function. |
Notes
The function \(V(y, z)\) for observation \(y=y_{obs}\) and prediction \(z=y_{pred}\) is a strict identification function for the functional \(T\), or induces the functional \(T\), if
\(\mathbb{E}_{Y \sim F}[V(Y, z)] = 0 \quad\Longleftrightarrow\quad z = T(F)\)
for all \(F\) in some class of distributions \(\mathcal{F}\). Implemented examples of the functional \(T\) are mean, median, expectiles and quantiles.
| functional | strict identification function \(V(y, z)\) |
|---|---|
| mean | \(z - y\) |
| median | \(\mathbf{1}\{z \ge y\} - \frac{1}{2}\) |
| expectile | \(2 \mid\mathbf{1}\{z \ge y\} - \alpha\mid (z - y)\) |
| quantile | \(\mathbf{1}\{z \ge y\} - \alpha\) |
Here, \(\alpha\) denotes the level for expectiles and quantiles.
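The table translates directly into code. A plain-NumPy sketch (the name `identification_function_sketch` is illustrative, not the library function):

```python
import numpy as np

def identification_function_sketch(y_obs, y_pred, functional="mean", level=0.5):
    """Identification functions V(y, z) from the table above."""
    y = np.asarray(y_obs, dtype=float)
    z = np.asarray(y_pred, dtype=float)
    ind = (z >= y).astype(float)  # indicator 1{z >= y}
    if functional == "mean":
        return z - y
    if functional == "median":
        return ind - 0.5
    if functional == "expectile":
        return 2 * np.abs(ind - level) * (z - y)
    if functional == "quantile":
        return ind - level
    raise ValueError(f"Unknown functional: {functional}")

identification_function_sketch([0, 0, 1, 1], [-1, 1, 1, 2])
# → array([-1., 1., 0., 1.])
```

Note that for level 0.5 the expectile reduces to the mean case (the factor \(2\mid\mathbf{1}\{z \ge y\} - \alpha\mid\) equals 1) and the quantile reduces to the median case.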
References
[Gneiting2011] T. Gneiting. "Making and Evaluating Point Forecasts". (2011) doi:10.1198/jasa.2011.r10138 arxiv:0912.0902.
Examples:
>>> identification_function(y_obs=[0, 0, 1, 1], y_pred=[-1, 1, 1, 2])
array([-1, 1, 0, 1])
plot_reliability_diagram(y_obs, y_pred, weights=None, *, functional='mean', level=0.5, n_bootstrap=None, confidence_level=0.9, diagram_type='reliability', ax=None)
Plot a reliability diagram.
A reliability diagram or calibration curve assesses auto-calibration. It plots the
conditional expectation given the predictions E(y_obs|y_pred) (y-axis) vs the
predictions y_pred (x-axis).
The conditional expectation is estimated via isotonic regression (PAV algorithm)
of y_obs on y_pred.
See Notes for further details.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| y_obs | array-like of shape (n_obs) | Observed values of the response variable. For binary classification, y_obs is expected to be in the interval [0, 1]. | required |
| y_pred | array-like of shape (n_obs) or (n_obs, n_models) | Predicted values, e.g. for the conditional expectation of the response. | required |
| weights | array-like of shape (n_obs) or None | Case weights. | None |
| functional | str | The functional that is induced by the identification function. | 'mean' |
| level | float | The level of the expectile or quantile (often called \(\alpha\)). | 0.5 |
| n_bootstrap | int or None | If not None, the number of bootstrap samples used for uncertainty regions. | None |
| confidence_level | float | Confidence level for bootstrap uncertainty regions. | 0.9 |
| diagram_type | str | The type of diagram to plot. | 'reliability' |
| ax | matplotlib.axes.Axes or plotly Figure | Axes object to draw the plot onto, otherwise uses the current Axes. | None |
Returns:
| Name | Type | Description |
|---|---|---|
| ax | matplotlib.axes.Axes or plotly Figure | Either the matplotlib axes or the plotly figure. This is configurable by setting the plot_backend (e.g. via config_context). |
Notes
The expectation conditional on the predictions is \(\mathbb{E}(Y|y_{pred})\). This object is estimated by the pool-adjacent-violators (PAV) algorithm, which has very desirable properties:
- It is non-parametric without any tuning parameter. Thus, the results are easily reproducible.
- Optimal selection of bins.
- Statistically consistent estimator.
For details, refer to [Dimitriadis2021].
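The isotonic fit can be sketched with scikit-learn's IsotonicRegression, which implements PAV. This illustrates the idea and is not necessarily how the library computes it:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

y_pred = np.array([0.1, 0.2, 0.3, 0.4])
y_obs = np.array([0.0, 1.0, 0.0, 1.0])

# Isotonic regression of y_obs on y_pred estimates E(Y | y_pred) by a
# non-decreasing step function; violating pairs are pooled to their mean.
calibrated = IsotonicRegression().fit_transform(y_pred, y_obs)
# → [0.0, 0.5, 0.5, 1.0]: the decreasing pair (1, 0) is pooled to 0.5.
```

Plotting `calibrated` against `y_pred` gives the reliability curve; a well auto-calibrated model lies close to the diagonal.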
References
[Dimitriadis2021] T. Dimitriadis, T. Gneiting, and A. I. Jordan. "Stable reliability diagrams for probabilistic classifiers". In: Proceedings of the National Academy of Sciences 118.8 (2021), e2016191118. doi:10.1073/pnas.2016191118.