
calibration

Tools to assess model calibration.

compute_bias(y_obs, y_pred, feature=None, weights=None, *, functional='mean', level=0.5, n_bins=10, bin_method='sturges')

Compute generalised bias conditional on a feature.

This function computes and aggregates the generalised bias, i.e. the values of the canonical identification function, versus (grouped by) a feature. This is a good way to assess whether a model is conditionally calibrated or not. Well-calibrated models have bias terms close to zero. For the mean functional, the generalised bias is the negative residual y_pred - y_obs. See Notes for further details.
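For the mean functional, the reported bias_mean is therefore just the (weighted) average of y_pred - y_obs. A plain-numpy cross-check, using the values from the first example further below:

```py
import numpy as np

y_obs = np.array([0, 0, 1, 1])
y_pred = np.array([-1, 1, 1, 2])
# Generalised bias for functional="mean": the negative residual y_pred - y_obs.
np.mean(y_pred - y_obs)  # 0.25, matching bias_mean in the example below
```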

Parameters:

Name Type Description Default
y_obs array-like of shape (n_obs)

Observed values of the response variable. For binary classification, y_obs is expected to be in the interval [0, 1].

required
y_pred array-like of shape (n_obs) or (n_obs, n_models)

Predicted values, e.g. for the conditional expectation of the response, E(Y|X).

required
feature array-like of shape (n_obs) or None

Some feature column.

None
weights array-like of shape (n_obs) or None

Case weights. If given, the bias is calculated as a weighted average of the identification function with these weights. Note that the standard errors and p-values in the output are based on the assumption that the variance of the bias is inversely proportional to the weights. See the Notes section for details.

None
functional str

The functional that is induced by the identification function V. Options are:

  • "mean". Argument level is neglected.
  • "median". Argument level is neglected.
  • "expectile"
  • "quantile"
'mean'
level float

The level of the expectile or quantile. (Often called \(\alpha\).) It must be 0 < level < 1. level=0.5 and functional="expectile" gives the mean. level=0.5 and functional="quantile" gives the median.

0.5
n_bins int

The number of bins, at least 2. For numerical features, n_bins only applies when bin_method is set to "quantile" or "uniform". For string-like and categorical features, the most frequent values are taken. Ties are dealt with by taking the first value in natural sorting order. The remaining values are merged into "other n" with n indicating the unique count.

If present, null values are always included in the output, accounting for one bin. NaN values are treated as null values.

10
bin_method str

The method for finding bin edges (boundaries). Options using n_bins are:

  • "quantile"
  • "uniform"

Options that automatically select the number of bins for numerical features (always using uniform bins) are the same as for numpy.histogram_bin_edges:

  • "auto" Minimum bin width between the "sturges" and "fd" estimators. Provides good all-around performance.
  • "fd" (Freedman Diaconis Estimator) Robust (resilient to outliers) estimator that takes into account data variability and data size.
  • "doane" An improved version of Sturges' estimator that works better with non-normal datasets.
  • "scott" Less robust estimator that takes into account data variability and data size.
  • "stone" Estimator based on leave-one-out cross-validation estimate of the integrated squared error. Can be regarded as a generalization of Scott's rule.
  • "rice" Estimator does not take variability into account, only data size. Commonly overestimates number of bins required.
  • "sturges" R's default method, only accounts for data size. Only optimal for gaussian data and underestimates number of bins for large non-gaussian datasets.
  • "sqrt" Square root (of data size) estimator, used by Excel and other programs for its speed and simplicity.
'sturges'

Returns:

Name Type Description
df DataFrame

The result table contains at least the columns:

  • bias_mean: Mean of the bias
  • bias_count: Number of data rows
  • bias_weights: Sum of weights
  • bias_stderr: Standard error, i.e. standard deviation of bias_mean
  • p_value: p-value of the 2-sided t-test with null hypothesis: bias_mean = 0

If feature is not None, then there is also the column:

  • feature_name: The actual name of the feature with the (binned) feature values.
Notes

A model \(m(X)\) is conditionally calibrated iff \(\mathbb{E}(V(m(X), Y)|X)=0\) almost surely with canonical identification function \(V\). The empirical version, given some data, reads \(\bar{V} = \frac{1}{n}\sum_i \phi(x_i) V(m(x_i), y_i)\) with a test function \(\phi(x_i)\) that projects on the specified feature. For a feature with only two distinct values "a" and "b", this becomes \(\bar{V} = \frac{1}{n_a}\sum_{i \text{ with } x_i=a} V(m(a), y_i)\) with \(n_a\) the number of observations with \(x_i=a\), and similarly for "b". With case weights, this reads \(\bar{V} = \frac{1}{\sum_i w_i}\sum_i w_i \phi(x_i) V(m(x_i), y_i)\). This generalises the classical residual (up to a minus sign) to target functionals other than the mean. See [FLM2022].

The standard error for \(\bar{V}\) is calculated in the standard way as \(\mathrm{SE} = \sqrt{\operatorname{Var}(\bar{V})} = \frac{\sigma}{\sqrt{n}}\), using the standard variance estimator for \(\sigma^2 = \operatorname{Var}(\phi(x_i) V(m(x_i), y_i))\) with Bessel correction, i.e. division by \(n-1\) instead of \(n\).

With case weights, the variance estimator becomes \(\operatorname{Var}(\bar{V}) = \frac{1}{n-1} \frac{1}{\sum_i w_i} \sum_i w_i (V(m(x_i), y_i) - \bar{V})^2\) with the implied relation \(\operatorname{Var}(V(m(x_i), y_i)) \sim \frac{1}{w_i} \). If your weights are for repeated observations, so-called frequency weights, then the above estimate is conservative because it uses \(n - 1\) instead of \((\sum_i w_i) - 1\).
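As a rough numpy sketch of these weighted formulas (the bias values V and weights w below are made up for illustration):

```py
import numpy as np

# Hypothetical per-observation bias values V_i = V(m(x_i), y_i) and case weights w_i.
V = np.array([0.2, -0.1, 0.4, 0.0])
w = np.array([1.0, 2.0, 1.0, 2.0])
n = len(V)

V_bar = np.sum(w * V) / np.sum(w)                                # weighted bias_mean
var_V_bar = np.sum(w * (V - V_bar) ** 2) / ((n - 1) * np.sum(w))
stderr = np.sqrt(var_V_bar)                                      # bias_stderr
```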

References
[FLM2022]

T. Fissler, C. Lorentzen, and M. Mayer. "Model Comparison and Calibration Assessment". (2022) arxiv:2202.12780.

Examples:

>>> compute_bias(y_obs=[0, 0, 1, 1], y_pred=[-1, 1, 1 , 2])
shape: (1, 5)
┌───────────┬────────────┬──────────────┬─────────────┬──────────┐
│ bias_mean ┆ bias_count ┆ bias_weights ┆ bias_stderr ┆ p_value  │
│ ---       ┆ ---        ┆ ---          ┆ ---         ┆ ---      │
│ f64       ┆ u32        ┆ f64          ┆ f64         ┆ f64      │
╞═══════════╪════════════╪══════════════╪═════════════╪══════════╡
│ 0.25      ┆ 4          ┆ 4.0          ┆ 0.478714    ┆ 0.637618 │
└───────────┴────────────┴──────────────┴─────────────┴──────────┘
>>> compute_bias(y_obs=[0, 0, 1, 1], y_pred=[-1, 1, 1 , 2],
... feature=["a", "a", "b", "b"])
shape: (2, 6)
┌─────────┬───────────┬────────────┬──────────────┬─────────────┬─────────┐
│ feature ┆ bias_mean ┆ bias_count ┆ bias_weights ┆ bias_stderr ┆ p_value │
│ ---     ┆ ---       ┆ ---        ┆ ---          ┆ ---         ┆ ---     │
│ str     ┆ f64       ┆ u32        ┆ f64          ┆ f64         ┆ f64     │
╞═════════╪═══════════╪════════════╪══════════════╪═════════════╪═════════╡
│ a       ┆ 0.0       ┆ 2          ┆ 2.0          ┆ 1.0         ┆ 1.0     │
│ b       ┆ 0.5       ┆ 2          ┆ 2.0          ┆ 0.5         ┆ 0.5     │
└─────────┴───────────┴────────────┴──────────────┴─────────────┴─────────┘

plot_bias(y_obs, y_pred, feature=None, weights=None, *, functional='mean', level=0.5, n_bins=10, bin_method='sturges', confidence_level=0.9, ax=None)

Plot model bias conditional on a feature.

This plots the generalised bias (residuals), i.e. the values of the canonical identification function, versus a feature. This is a good way to assess whether a model is conditionally calibrated or not. Well calibrated models have bias terms around zero. See Notes for further details.

For numerical features, NaN values are treated as null values. Null values are always plotted as the rightmost value on the x-axis and marked with a diamond instead of a dot.

Parameters:

Name Type Description Default
y_obs array-like of shape (n_obs)

Observed values of the response variable. For binary classification, y_obs is expected to be in the interval [0, 1].

required
y_pred array-like of shape (n_obs) or (n_obs, n_models)

Predicted values, e.g. for the conditional expectation of the response, E(Y|X).

required
feature array-like of shape (n_obs) or None

Some feature column.

None
weights array-like of shape (n_obs) or None

Case weights. If given, the bias is calculated as a weighted average of the identification function with these weights. Note that the standard errors and p-values in the output are based on the assumption that the variance of the bias is inversely proportional to the weights. See the Notes section for details.

None
functional str

The functional that is induced by the identification function V. Options are:

  • "mean". Argument level is neglected.
  • "median". Argument level is neglected.
  • "expectile"
  • "quantile"
'mean'
level float

The level of the expectile or quantile. (Often called \(\alpha\).) It must be 0 <= level <= 1. level=0.5 and functional="expectile" gives the mean. level=0.5 and functional="quantile" gives the median.

0.5
n_bins int

The number of bins, at least 2. For numerical features, n_bins only applies when bin_method is set to "quantile" or "uniform". For string-like and categorical features, the most frequent values are taken. Ties are dealt with by taking the first value in natural sorting order. The remaining values are merged into "other n" with n indicating the unique count.

If present, null values are always included in the output, accounting for one bin. NaN values are treated as null values.

10
bin_method str

The method for finding bin edges (boundaries). Options using n_bins are:

  • "quantile"
  • "uniform"

Options that automatically select the number of bins for numerical features (always using uniform bins) are the same as for numpy.histogram_bin_edges:

  • "auto" Minimum bin width between the "sturges" and "fd" estimators. Provides good all-around performance.
  • "fd" (Freedman Diaconis Estimator) Robust (resilient to outliers) estimator that takes into account data variability and data size.
  • "doane" An improved version of Sturges' estimator that works better with non-normal datasets.
  • "scott" Less robust estimator that takes into account data variability and data size.
  • "stone" Estimator based on leave-one-out cross-validation estimate of the integrated squared error. Can be regarded as a generalization of Scott's rule.
  • "rice" Estimator does not take variability into account, only data size. Commonly overestimates number of bins required.
  • "sturges" R's default method, only accounts for data size. Only optimal for gaussian data and underestimates number of bins for large non-gaussian datasets.
  • "sqrt" Square root (of data size) estimator, used by Excel and other programs for its speed and simplicity.
'sturges'
confidence_level float

Confidence level for error bars. If 0, no error bars are plotted. Value must fulfil 0 <= confidence_level < 1.

0.9
ax matplotlib.axes.Axes or plotly Figure

Axes object to draw the plot onto, otherwise uses the current Axes.

None

Returns:

Name Type Description
ax

Either the matplotlib axes or the plotly figure. This is configurable by setting the plot_backend via model_diagnostics.set_config or model_diagnostics.config_context.

Notes

A model \(m(X)\) is conditionally calibrated iff \(\mathbb{E}(V(m(X), Y)|X)=0\) almost surely. The empirical version, given some data, reads \(\frac{1}{n}\sum_i V(m(x_i), y_i)\). See [FLM2022].

References
[FLM2022]

T. Fissler, C. Lorentzen, and M. Mayer. "Model Comparison and Calibration Assessment". (2022) arxiv:2202.12780.
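A minimal usage sketch with made-up data (assuming plot_bias is importable from model_diagnostics.calibration like the other functions documented here):

```py
import numpy as np
from model_diagnostics.calibration import plot_bias

# Hypothetical data: a model that slightly underestimates the effect of the feature.
rng = np.random.default_rng(0)
feature = rng.normal(size=1000)
y_obs = feature + rng.normal(size=1000)
y_pred = 0.9 * feature

ax = plot_bias(y_obs=y_obs, y_pred=y_pred, feature=feature)
```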

compute_marginal(y_obs, y_pred, X=None, feature_name=None, predict_function=None, weights=None, *, n_bins=10, bin_method='sturges', n_max=1000, rng=None)

Compute the marginal expectation conditional on a single feature.

This function computes the (weighted) average of observed response and predictions conditional on a given feature.

Parameters:

Name Type Description Default
y_obs array-like of shape (n_obs)

Observed values of the response variable. For binary classification, y_obs is expected to be in the interval [0, 1].

required
y_pred array-like of shape (n_obs) or (n_obs, n_models)

Predicted values, e.g. for the conditional expectation of the response, E(Y|X).

required
X array-like of shape (n_obs, n_features) or None

The dataframe or array of features to be passed to the model predict function.

None
feature_name (int, str or None)

Column name (str) or index (int) of feature in X. If None, the total marginal is computed.

None
predict_function callable or None

A callable to get predictions, i.e. predict_function(X). Used to compute partial dependence. If None, partial dependence is omitted.

None
weights array-like of shape (n_obs) or None

Case weights. If given, the marginal averages are calculated as weighted averages with these weights.

None
n_bins int

The number of bins, at least 2. For numerical features, n_bins only applies when bin_method is set to "quantile" or "uniform". For string-like and categorical features, the most frequent values are taken. Ties are dealt with by taking the first value in natural sorting order. The remaining values are merged into "other n" with n indicating the unique count.

If present, null values are always included in the output, accounting for one bin. NaN values are treated as null values.

10
bin_method str

The method for finding bin edges (boundaries). Options using n_bins are:

  • "quantile"
  • "uniform"

Options that automatically select the number of bins for numerical features (always using uniform bins) are the same as for numpy.histogram_bin_edges:

  • "auto" Minimum bin width between the "sturges" and "fd" estimators. Provides good all-around performance.
  • "fd" (Freedman Diaconis Estimator) Robust (resilient to outliers) estimator that takes into account data variability and data size.
  • "doane" An improved version of Sturges' estimator that works better with non-normal datasets.
  • "scott" Less robust estimator that takes into account data variability and data size.
  • "stone" Estimator based on leave-one-out cross-validation estimate of the integrated squared error. Can be regarded as a generalization of Scott's rule.
  • "rice" Estimator does not take variability into account, only data size. Commonly overestimates number of bins required.
  • "sturges" R's default method, only accounts for data size. Only optimal for gaussian data and underestimates number of bins for large non-gaussian datasets.
  • "sqrt" Square root (of data size) estimator, used by Excel and other programs for its speed and simplicity.
'sturges'
n_max int or None

Used only for partial dependence computation. The number of rows to subsample from X. This speeds up computation, in particular for slow predict functions.

1000
rng (Generator, int or None)

Used only for partial dependence computation. The random number generator used for subsampling of n_max rows. The input is internally wrapped by np.random.default_rng(rng).

None

Returns:

Name Type Description
df DataFrame

The result table contains at least the columns:

  • y_obs_mean: Mean of y_obs
  • y_pred_mean: Mean of y_pred
  • y_obs_stderr: Standard error, i.e. standard deviation of y_obs_mean
  • y_pred_stderr: Standard error, i.e. standard deviation of y_pred_mean
  • count: Number of data rows
  • weights: Sum of weights

If feature is not None, then there is also the column:

  • feature_name: The actual name of the feature with the (binned) feature values.

If feature is numerical, one also has:

  • bin_edges: The edges and standard deviation of the bins, i.e. (min, std, max).
Notes

The marginal values are computed as an estimation of:

  • y_obs: \(\mathbb{E}(Y|feature)\)
  • y_pred: \(\mathbb{E}(m(X)|feature)\)

with \(feature\) the column specified by feature_name. Computationally that is more or less a group-by-aggregate operation on a dataset.

The standard errors for both are calculated in the standard way as \(\mathrm{SE} = \sqrt{\operatorname{Var}(\bar{Y})} = \frac{\sigma}{\sqrt{n}}\), using the standard variance estimator for \(\sigma^2\) with Bessel correction, i.e. division by \(n-1\) instead of \(n\).

With case weights, the variance estimator becomes \(\operatorname{Var}(\bar{Y}) = \frac{1}{n-1} \frac{1}{\sum_i w_i} \sum_i w_i (y_i - \bar{y})^2\) with the implied relation \(\operatorname{Var}(y_i) \sim \frac{1}{w_i} \). If your weights are for repeated observations, so-called frequency weights, then the above estimate is conservative because it uses \(n - 1\) instead of \((\sum_i w_i) - 1\).
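Conceptually, the unweighted case boils down to a group-by-aggregate, as in the following polars sketch (the toy data and column selection are for illustration only):

```py
import polars as pl

# Group-by-aggregate view of the marginal means over a binned/categorical feature.
df = pl.DataFrame(
    {"feature": ["a", "a", "b", "b"], "y_obs": [0, 0, 1, 1], "y_pred": [-1, 1, 1, 2]}
)
df.group_by("feature").agg(
    pl.col("y_obs").mean().alias("y_obs_mean"),
    pl.col("y_pred").mean().alias("y_pred_mean"),
)
```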

Examples:

>>> compute_marginal(y_obs=[0, 0, 1, 1], y_pred=[-1, 1, 1, 2])
shape: (1, 6)
┌────────────┬─────────────┬──────────────┬───────────────┬───────┬─────────┐
│ y_obs_mean ┆ y_pred_mean ┆ y_obs_stderr ┆ y_pred_stderr ┆ count ┆ weights │
│ ---        ┆ ---         ┆ ---          ┆ ---           ┆ ---   ┆ ---     │
│ f64        ┆ f64         ┆ f64          ┆ f64           ┆ u32   ┆ f64     │
╞════════════╪═════════════╪══════════════╪═══════════════╪═══════╪═════════╡
│ 0.5        ┆ 0.75        ┆ 0.288675     ┆ 0.629153      ┆ 4     ┆ 4.0     │
└────────────┴─────────────┴──────────────┴───────────────┴───────┴─────────┘
>>> import polars as pl
>>> from sklearn.linear_model import Ridge
>>> pl.Config.set_tbl_width_chars(84)
<class 'polars.config.Config'>
>>> y_obs, X =[0, 0, 1, 1], [[0, 1], [1, 1], [2, 2], [3, 2]]
>>> m = Ridge().fit(X, y_obs)
>>> compute_marginal(y_obs=y_obs, y_pred=m.predict(X), X=X, feature_name=0,
... predict_function=m.predict)
shape: (3, 9)
┌──────────┬─────────┬─────────┬─────────┬───┬───────┬─────────┬─────────┬─────────┐
│ feature  ┆ y_obs_m ┆ y_pred_ ┆ y_obs_s ┆ … ┆ count ┆ weights ┆ bin_edg ┆ partial │
│ 0        ┆ ean     ┆ mean    ┆ tderr   ┆   ┆ ---   ┆ ---     ┆ es      ┆ _depend │
│ ---      ┆ ---     ┆ ---     ┆ ---     ┆   ┆ u32   ┆ f64     ┆ ---     ┆ ence    │
│ f64      ┆ f64     ┆ f64     ┆ f64     ┆   ┆       ┆         ┆ array[f ┆ ---     │
│          ┆         ┆         ┆         ┆   ┆       ┆         ┆ 64, 3]  ┆ f64     │
╞══════════╪═════════╪═════════╪═════════╪═══╪═══════╪═════════╪═════════╪═════════╡
│ 0.5      ┆ 0.0     ┆ 0.125   ┆ 0.0     ┆ … ┆ 2     ┆ 2.0     ┆ [0.0,   ┆ 0.25    │
│          ┆         ┆         ┆         ┆   ┆       ┆         ┆ 0.5,    ┆         │
│          ┆         ┆         ┆         ┆   ┆       ┆         ┆ 1.0]    ┆         │
│ 2.0      ┆ 1.0     ┆ 0.75    ┆ 0.0     ┆ … ┆ 1     ┆ 1.0     ┆ [1.0,   ┆ 0.625   │
│          ┆         ┆         ┆         ┆   ┆       ┆         ┆ 0.0,    ┆         │
│          ┆         ┆         ┆         ┆   ┆       ┆         ┆ 2.0]    ┆         │
│ 3.0      ┆ 1.0     ┆ 1.0     ┆ 0.0     ┆ … ┆ 1     ┆ 1.0     ┆ [2.0,   ┆ 0.875   │
│          ┆         ┆         ┆         ┆   ┆       ┆         ┆ 0.0,    ┆         │
│          ┆         ┆         ┆         ┆   ┆       ┆         ┆ 3.0]    ┆         │
└──────────┴─────────┴─────────┴─────────┴───┴───────┴─────────┴─────────┴─────────┘

plot_marginal(y_obs, y_pred, X, feature_name, predict_function=None, weights=None, *, n_bins=10, bin_method='sturges', n_max=1000, rng=None, ax=None, show_lines='numerical')

Plot marginal observed and predicted conditional on a feature.

This plot provides a means to inspect a model per feature. The average of observed and predicted are plotted as well as a histogram of the feature.

Parameters:

Name Type Description Default
y_obs array-like of shape (n_obs)

Observed values of the response variable. For binary classification, y_obs is expected to be in the interval [0, 1].

required
y_pred array-like of shape (n_obs)

Predicted values, e.g. for the conditional expectation of the response, E(Y|X).

required
X array-like of shape (n_obs, n_features)

The dataframe or array of features to be passed to the model predict function.

required
feature_name str or int

Column name (str) or index (int) of feature in X.

required
predict_function callable or None

A callable to get predictions, i.e. predict_function(X). Used to compute partial dependence. If None, partial dependence is omitted.

None
weights array-like of shape (n_obs) or None

Case weights. If given, the marginal averages are calculated as weighted averages with these weights.

None
n_bins int

The number of bins, at least 2. For numerical features, n_bins only applies when bin_method is set to "quantile" or "uniform". For string-like and categorical features, the most frequent values are taken. Ties are dealt with by taking the first value in natural sorting order. The remaining values are merged into "other n" with n indicating the unique count.

If present, null values are always included in the output, accounting for one bin. NaN values are treated as null values.

10
bin_method str

The method for finding bin edges (boundaries). Options using n_bins are:

  • "quantile"
  • "uniform"

Options that automatically select the number of bins for numerical features (always using uniform bins) are the same as for numpy.histogram_bin_edges:

  • "auto" Minimum bin width between the "sturges" and "fd" estimators. Provides good all-around performance.
  • "fd" (Freedman Diaconis Estimator) Robust (resilient to outliers) estimator that takes into account data variability and data size.
  • "doane" An improved version of Sturges' estimator that works better with non-normal datasets.
  • "scott" Less robust estimator that takes into account data variability and data size.
  • "stone" Estimator based on leave-one-out cross-validation estimate of the integrated squared error. Can be regarded as a generalization of Scott's rule.
  • "rice" Estimator does not take variability into account, only data size. Commonly overestimates number of bins required.
  • "sturges" R's default method, only accounts for data size. Only optimal for gaussian data and underestimates number of bins for large non-gaussian datasets.
  • "sqrt" Square root (of data size) estimator, used by Excel and other programs for its speed and simplicity.
'sturges'
n_max int or None

Used only for partial dependence computation. The number of rows to subsample from X. This speeds up computation, in particular for slow predict functions.

1000
rng (Generator, int or None)

Used only for partial dependence computation. The random number generator used for subsampling of n_max rows. The input is internally wrapped by np.random.default_rng(rng).

None
ax matplotlib.axes.Axes or plotly Figure

Axes object to draw the plot onto, otherwise uses the current Axes.

None
show_lines str

Option for how to display mean values and partial dependence:

  • "always": Always draw lines.
  • "numerical": String and categorical features are drawn as points, numerical ones as lines.
'numerical'

Returns:

Name Type Description
ax

Either the matplotlib axes or the plotly figure. This is configurable by setting the plot_backend via model_diagnostics.set_config or model_diagnostics.config_context.

Examples:

If you wish to plot multiple features at once with subfigures, here is how to do it with matplotlib:

```py
from math import ceil
import matplotlib.pyplot as plt
import numpy as np
from model_diagnostics.calibration import plot_marginal

# Replace by your own data and model.
n_obs = 100
y_obs = np.arange(n_obs)
X = np.ones((n_obs, 2))
X[:, 0] = np.sin(np.arange(n_obs))
X[:, 1] = y_obs ** 2

def model_predict(X):
    s = 0.5 * n_obs * np.sin(X)
    return s.sum(axis=1) + np.sqrt(X[:, 1])

# Now the plotting.
feature_list = [0, 1]
n_rows, n_cols = ceil(len(feature_list) / 2), 2
fig, axs = plt.subplots(nrows=n_rows, ncols=n_cols, sharey=True)
for i, ax in enumerate(axs):
    plot_marginal(
        y_obs=y_obs,
        y_pred=model_predict(X),
        X=X,
        feature_name=feature_list[i],
        predict_function=model_predict,
        ax=ax,
    )
fig.tight_layout()
```

For plotly, use the helper function add_marginal_subplot:

```py
from math import ceil
import numpy as np
from model_diagnostics import config_context
from plotly.subplots import make_subplots
from model_diagnostics.calibration import add_marginal_subplot, plot_marginal

# Replace by your own data and model.
n_obs = 100
y_obs = np.arange(n_obs)
X = np.ones((n_obs, 2))
X[:, 0] = np.sin(np.arange(n_obs))
X[:, 1] = y_obs ** 2

def model_predict(X):
    s = 0.5 * n_obs * np.sin(X)
    return s.sum(axis=1) + np.sqrt(X[:, 1])

# Now the plotting.
feature_list = [0, 1]
n_rows, n_cols = ceil(len(feature_list) / 2), 2
fig = make_subplots(
    rows=n_rows,
    cols=n_cols,
    vertical_spacing=0.3 / n_rows,  # equals default
    # subplot_titles=feature_list,  # maybe
    specs=[[{"secondary_y": True}] * n_cols] * n_rows,  # This is important!
)
for row in range(n_rows):
    for col in range(n_cols):
        i = n_cols * row + col
        with config_context(plot_backend="plotly"):
            subfig = plot_marginal(
                y_obs=y_obs,
                y_pred=model_predict(X),
                X=X,
                feature_name=feature_list[i],
                predict_function=model_predict,
            )
        add_marginal_subplot(subfig, fig, row, col)
fig.show()
```

add_marginal_subplot(subfig, fig, row, col)

Add a plotly subplot from plot_marginal to a multi-plot figure.

This auxiliary function accompanies plot_marginal in order to ease plotting with subfigures using the plotly backend.

For it to work, you must call plotly's make_subplots with the specs argument and set the appropriate number of {"secondary_y": True} in a list of lists:

```py
from plotly.subplots import make_subplots

n_rows, n_cols = ...
fig = make_subplots(
    rows=n_rows,
    cols=n_cols,
    specs=[[{"secondary_y": True}] * n_cols] * n_rows,  # This is important!
)
```

The reason is that plot_marginal uses a secondary y-axis (and swaps sides with the primary y-axis).

Parameters:

Name Type Description Default
subfig plotly Figure

The subfigure which is added to fig.

required
fig plotly Figure

The multi-plot figure to which subfig is added at positions row and col.

required
row int

The (0-based) row index of fig at which subfig is added.

required
col int

The (0-based) column index of fig at which subfig is added.

required

Returns:

Type Description
fig

The plotly figure fig.

identification_function(y_obs, y_pred, *, functional='mean', level=0.5)

Canonical identification function.

Identification functions act as generalised residuals. See Notes for further details.

Parameters:

Name Type Description Default
y_obs array-like of shape (n_obs)

Observed values of the response variable. For binary classification, y_obs is expected to be in the interval [0, 1].

required
y_pred array-like of shape (n_obs)

Predicted values of the functional, e.g. the conditional expectation of the response, E(Y|X).

required
functional str

The functional that is induced by the identification function V. Options are:

  • "mean". Argument level is neglected.
  • "median". Argument level is neglected.
  • "expectile"
  • "quantile"
'mean'
level float

The level of the expectile or quantile. (Often called \(\alpha\).) It must be 0 < level < 1. level=0.5 and functional="expectile" gives the mean. level=0.5 and functional="quantile" gives the median.

0.5

Returns:

Name Type Description
V ndarray of shape (n_obs)

Values of the identification function.

Notes

The function \(V(y, z)\) for observation \(y=y_{obs}\) and prediction \(z=y_{pred}\) is a strict identification function for the functional \(T\), or induces the functional \(T\), as:

\[ \mathbb{E}[V(Y, z)] = 0\quad \Leftrightarrow\quad z\in T(F) \quad \forall \text{ distributions } F \in \mathcal{F} \]

for some class of distributions \(\mathcal{F}\). Implemented examples of the functional \(T\) are mean, median, expectiles and quantiles.

| functional | strict identification function \(V(y, z)\) |
| --- | --- |
| mean | \(z - y\) |
| median | \(\mathbf{1}\{z \ge y\} - \frac{1}{2}\) |
| expectile | \(2 \,\lvert\mathbf{1}\{z \ge y\} - \alpha\rvert\, (z - y)\) |
| quantile | \(\mathbf{1}\{z \ge y\} - \alpha\) |

For level \(\alpha\).
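A small numpy sketch of the quantile row of this table (a hypothetical re-implementation for illustration, not the library function itself):

```py
import numpy as np

def quantile_identification(y_obs, y_pred, level):
    # V(y, z) = 1{z >= y} - alpha, evaluated elementwise.
    y_obs, y_pred = np.asarray(y_obs), np.asarray(y_pred)
    return (y_pred >= y_obs).astype(float) - level

quantile_identification([0, 0, 1, 1], [-1, 1, 1, 2], level=0.5)
# array([-0.5,  0.5,  0.5,  0.5])
```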

References
[Gneiting2011]

T. Gneiting. "Making and Evaluating Point Forecasts". (2011) doi:10.1198/jasa.2011.r10138 arxiv:0912.0902

Examples:

>>> identification_function(y_obs=[0, 0, 1, 1], y_pred=[-1, 1, 1 , 2])
array([-1,  1,  0,  1])

plot_reliability_diagram(y_obs, y_pred, weights=None, *, functional='mean', level=0.5, n_bootstrap=None, confidence_level=0.9, diagram_type='reliability', ax=None)

Plot a reliability diagram.

A reliability diagram or calibration curve assesses auto-calibration. It plots the conditional expectation given the predictions E(y_obs|y_pred) (y-axis) vs the predictions y_pred (x-axis). The conditional expectation is estimated via isotonic regression (PAV algorithm) of y_obs on y_pred. See Notes for further details.

Parameters:

Name Type Description Default
y_obs array-like of shape (n_obs)

Observed values of the response variable. For binary classification, y_obs is expected to be in the interval [0, 1].

required
y_pred array-like of shape (n_obs) or (n_obs, n_models)

Predicted values, e.g. for the conditional expectation of the response, E(Y|X).

required
weights array-like of shape (n_obs) or None

Case weights.

None
functional str

The functional that is induced by the identification function V. Options are:

  • "mean". Argument level is neglected.
  • "median". Argument level is neglected.
  • "expectile"
  • "quantile"
'mean'
level float

The level of the expectile or quantile. (Often called \(\alpha\).) It must be 0 <= level <= 1. level=0.5 and functional="expectile" gives the mean. level=0.5 and functional="quantile" gives the median.

0.5
n_bootstrap int or None

If not None, then scipy.stats.bootstrap with n_resamples=n_bootstrap is used to calculate confidence intervals at level confidence_level.

None
confidence_level float

Confidence level for bootstrap uncertainty regions.

0.9
diagram_type str
  • "reliability": Plot a reliability diagram.
  • "bias": Plot roughly a 45 degree rotated reliability diagram. The resulting plot is similar to plot_bias, i.e. y_pred - E(y_obs|y_pred) vs y_pred.
'reliability'
ax matplotlib.axes.Axes or plotly Figure

Axes object to draw the plot onto, otherwise uses the current Axes.

None

Returns:

Name Type Description
ax

Either the matplotlib axes or the plotly figure. This is configurable by setting the plot_backend via model_diagnostics.set_config or model_diagnostics.config_context.

Notes

The expectation conditional on the predictions is \(E(Y|y_{pred})\). This object is estimated by the pool adjacent violators (PAV) algorithm, which has very desirable properties:

- It is non-parametric without any tuning parameter. Thus, the results are easily reproducible.
- Optimal selection of bins.
- Statistically consistent estimator.

For details, refer to [Dimitriadis2021].
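For intuition, the plotted curve is essentially an isotonic regression of y_obs on y_pred. A rough sketch using scikit-learn's IsotonicRegression (an assumption for illustration; the library may use its own PAV implementation):

```py
import numpy as np
from sklearn.isotonic import IsotonicRegression

# Simulated, perfectly calibrated binary predictions.
rng = np.random.default_rng(0)
y_pred = rng.uniform(size=1000)
y_obs = rng.binomial(1, y_pred)

# Isotonic (PAV) fit of y_obs on y_pred estimates E(Y | y_pred).
iso = IsotonicRegression(out_of_bounds="clip").fit(y_pred, y_obs)
calibration_curve = iso.predict(np.sort(y_pred))
```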

References
[Dimitriadis2021]

T. Dimitriadis, T. Gneiting, and A. I. Jordan. "Stable reliability diagrams for probabilistic classifiers". In: Proceedings of the National Academy of Sciences 118.8 (2021), e2016191118. doi:10.1073/pnas.2016191118.
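A minimal usage sketch with simulated data (assuming plot_reliability_diagram is importable from model_diagnostics.calibration like the other functions in this module):

```py
import numpy as np
from model_diagnostics.calibration import plot_reliability_diagram

# Hypothetical binary classification data with slightly miscalibrated predictions.
rng = np.random.default_rng(0)
y_pred = rng.uniform(size=1000)
y_obs = rng.binomial(1, np.clip(1.2 * y_pred - 0.1, 0, 1))

ax = plot_reliability_diagram(y_obs=y_obs, y_pred=y_pred, n_bootstrap=100)
```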