calibration
Tools to assess model calibration.
compute_bias(y_obs, y_pred, feature=None, weights=None, *, functional='mean', level=0.5, n_bins=10, bin_method='sturges')
Compute generalised bias conditional on a feature.
This function computes and aggregates the generalised bias, i.e. the values of the
canonical identification function, versus (grouped by) a feature.
This is a good way to assess whether a model is conditionally calibrated.
Well-calibrated models have bias terms close to zero.
For the mean functional, the generalised bias is the negative residual
y_pred - y_obs.
See Notes for further details.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| y_obs | array-like of shape (n_obs) | Observed values of the response variable. For binary classification, y_obs is expected to be in the interval [0, 1]. | required |
| y_pred | array-like of shape (n_obs) or (n_obs, n_models) | Predicted values, e.g. for the conditional expectation of the response. | required |
| feature | array-like of shape (n_obs) or None | Some feature column. | None |
| weights | array-like of shape (n_obs) or None | Case weights. If given, the bias is calculated as a weighted average of the identification function with these weights. Note that the standard errors and p-values in the output are based on the assumption that the variance of the bias is inversely proportional to the weights. See the Notes section for details. | None |
| functional | str | The functional that is induced by the identification function. | 'mean' |
| level | float | The level of the expectile or quantile (often called \(\alpha\)). | 0.5 |
| n_bins | int | The number of bins, at least 2. For numerical features, if present, null values are always included in the output, accounting for one bin. NaN values are treated as null values. | 10 |
| bin_method | str | The method for finding bin edges (boundaries). For numerical features, the options for automatically selecting the number of bins (using uniform bins) are the same as for numpy.histogram_bin_edges. | 'sturges' |
Returns:
| Name | Type | Description |
|---|---|---|
| df | DataFrame | The result table contains at least the columns bias_mean, bias_count, bias_weights, bias_stderr and p_value. If feature is given, a feature column is added. |
Notes
A model \(m(X)\) is conditionally calibrated iff
\(\mathbb{E}(V(m(X), Y)|X)=0\) almost surely with canonical identification
function \(V\).
The empirical version, given some data, reads
\(\bar{V} = \frac{1}{n}\sum_i \phi(x_i) V(m(x_i), y_i)\) with a test function
\(\phi(x_i)\) that projects on the specified feature.
For a feature with only two distinct values "a" and "b", this becomes
\(\bar{V} = \frac{1}{n_a}\sum_{i \text{ with }x_i=a} V(m(a), y_i)\) with
\(n_a=\sum_{i \text{ with }x_i=a} 1\), the number of observations with \(x_i=a\), and similarly for "b".
With case weights, this reads
\(\bar{V} = \frac{1}{\sum_i w_i}\sum_i w_i \phi(x_i) V(m(x_i), y_i)\).
This generalises the classical residual (up to a minus sign) for target functionals
other than the mean. See [FLM2022].
The standard error for \(\bar{V}\) is calculated in the standard way as \(\mathrm{SE} = \sqrt{\operatorname{Var}(\bar{V})} = \frac{\sigma}{\sqrt{n}}\), with the standard variance estimator for \(\sigma^2 = \operatorname{Var}(\phi(x_i) V(m(x_i), y_i))\) with Bessel correction, i.e. division by \(n-1\) instead of \(n\).
With case weights, the variance estimator becomes \(\operatorname{Var}(\bar{V}) = \frac{1}{n-1} \frac{1}{\sum_i w_i} \sum_i w_i (V(m(x_i), y_i) - \bar{V})^2\) with the implied relation \(\operatorname{Var}(V(m(x_i), y_i)) \sim \frac{1}{w_i} \). If your weights are for repeated observations, so-called frequency weights, then the above estimate is conservative because it uses \(n - 1\) instead of \((\sum_i w_i) - 1\).
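For the mean functional, these formulas can be sketched in plain NumPy. This is an illustration of the Notes only, not the library's implementation; the helper name `bias_mean_and_stderr` is made up:

```python
import numpy as np

def bias_mean_and_stderr(y_obs, y_pred, weights=None):
    """Generalised bias for the mean functional and its standard error."""
    y_obs = np.asarray(y_obs, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    v = y_pred - y_obs  # V(m(x), y) = y_pred - y_obs for the mean functional
    n = v.shape[0]
    if weights is None:
        bias = v.mean()
        # SE = sigma / sqrt(n) with the Bessel-corrected variance estimator.
        stderr = np.sqrt(v.var(ddof=1) / n)
    else:
        w = np.asarray(weights, dtype=float)
        bias = np.average(v, weights=w)
        # Var(bar V) = 1/(n-1) * 1/sum(w) * sum(w * (V - bar V)^2)
        var = np.sum(w * (v - bias) ** 2) / ((n - 1) * w.sum())
        stderr = np.sqrt(var)
    return bias, stderr

bias, stderr = bias_mean_and_stderr([0, 0, 1, 1], [-1, 1, 1, 2])
# bias = 0.25 and stderr ≈ 0.478714, matching the first example below.
```

With unit weights both branches agree, as the weighted variance formula then reduces to the Bessel-corrected estimator divided by \(n\).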
References
[FLM2022] T. Fissler, C. Lorentzen, and M. Mayer. "Model Comparison and Calibration Assessment". (2022) arxiv:2202.12780.
Examples:
>>> compute_bias(y_obs=[0, 0, 1, 1], y_pred=[-1, 1, 1, 2])
shape: (1, 5)
┌───────────┬────────────┬──────────────┬─────────────┬──────────┐
│ bias_mean ┆ bias_count ┆ bias_weights ┆ bias_stderr ┆ p_value │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ f64 ┆ u32 ┆ f64 ┆ f64 ┆ f64 │
╞═══════════╪════════════╪══════════════╪═════════════╪══════════╡
│ 0.25 ┆ 4 ┆ 4.0 ┆ 0.478714 ┆ 0.637618 │
└───────────┴────────────┴──────────────┴─────────────┴──────────┘
>>> compute_bias(y_obs=[0, 0, 1, 1], y_pred=[-1, 1, 1, 2],
... feature=["a", "a", "b", "b"])
shape: (2, 6)
┌─────────┬───────────┬────────────┬──────────────┬─────────────┬─────────┐
│ feature ┆ bias_mean ┆ bias_count ┆ bias_weights ┆ bias_stderr ┆ p_value │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ u32 ┆ f64 ┆ f64 ┆ f64 │
╞═════════╪═══════════╪════════════╪══════════════╪═════════════╪═════════╡
│ a ┆ 0.0 ┆ 2 ┆ 2.0 ┆ 1.0 ┆ 1.0 │
│ b ┆ 0.5 ┆ 2 ┆ 2.0 ┆ 0.5 ┆ 0.5 │
└─────────┴───────────┴────────────┴──────────────┴─────────────┴─────────┘
plot_bias(y_obs, y_pred, feature=None, weights=None, *, functional='mean', level=0.5, n_bins=10, bin_method='sturges', confidence_level=0.9, ax=None)
Plot model bias conditional on a feature.
This plots the generalised bias (residuals), i.e. the values of the canonical identification function, versus a feature. This is a good way to assess whether a model is conditionally calibrated or not. Well calibrated models have bias terms around zero. See Notes for further details.
For numerical features, NaN values are treated as null values. Null values are always plotted as the rightmost value on the x-axis and marked with a diamond instead of a dot.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| y_obs | array-like of shape (n_obs) | Observed values of the response variable. For binary classification, y_obs is expected to be in the interval [0, 1]. | required |
| y_pred | array-like of shape (n_obs) or (n_obs, n_models) | Predicted values, e.g. for the conditional expectation of the response. | required |
| feature | array-like of shape (n_obs) or None | Some feature column. | None |
| weights | array-like of shape (n_obs) or None | Case weights. If given, the bias is calculated as a weighted average of the identification function with these weights. Note that the standard errors and p-values in the output are based on the assumption that the variance of the bias is inversely proportional to the weights. See the Notes section for details. | None |
| functional | str | The functional that is induced by the identification function. | 'mean' |
| level | float | The level of the expectile or quantile (often called \(\alpha\)). | 0.5 |
| n_bins | int | The number of bins, at least 2. For numerical features, if present, null values are always included in the output, accounting for one bin. NaN values are treated as null values. | 10 |
| bin_method | str | The method for finding bin edges (boundaries). For numerical features, the options for automatically selecting the number of bins (using uniform bins) are the same as for numpy.histogram_bin_edges. | 'sturges' |
| confidence_level | float | Confidence level for error bars. If 0, no error bars are plotted. | 0.9 |
| ax | matplotlib.axes.Axes or plotly Figure | Axes object to draw the plot onto, otherwise uses the current Axes. | None |
Returns:
| Name | Type | Description |
|---|---|---|
| ax | matplotlib.axes.Axes or plotly Figure | Either the matplotlib axes or the plotly figure. This is configurable by setting the plot_backend (e.g. via config_context). |
Notes
A model \(m(X)\) is conditionally calibrated iff \(\mathbb{E}(V(m(X), Y)|X)=0\) almost
surely. The empirical version, given some data, reads \(\frac{1}{n}\sum_i V(m(x_i), y_i)\).
See [FLM2022].
References
[FLM2022] T. Fissler, C. Lorentzen, and M. Mayer. "Model Comparison and Calibration Assessment". (2022) arxiv:2202.12780.
compute_marginal(y_obs, y_pred, X=None, feature_name=None, predict_function=None, weights=None, *, n_bins=10, bin_method='sturges', n_max=1000, rng=None)
Compute the marginal expectation conditional on a single feature.
This function computes the (weighted) average of observed response and predictions conditional on a given feature.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| y_obs | array-like of shape (n_obs) | Observed values of the response variable. For binary classification, y_obs is expected to be in the interval [0, 1]. | required |
| y_pred | array-like of shape (n_obs) or (n_obs, n_models) | Predicted values, e.g. for the conditional expectation of the response. | required |
| X | array-like of shape (n_obs, n_features) or None | The dataframe or array of features to be passed to the model predict function. | None |
| feature_name | int, str or None | Column name (str) or index (int) of the feature in X. | None |
| predict_function | callable or None | A callable to get predictions, e.g. the predict method of a fitted model. Used for the partial dependence computation. | None |
| weights | array-like of shape (n_obs) or None | Case weights. If given, the marginal values are calculated as weighted averages with these weights. | None |
| n_bins | int | The number of bins, at least 2. For numerical features, if present, null values are always included in the output, accounting for one bin. NaN values are treated as null values. | 10 |
| bin_method | str | The method for finding bin edges (boundaries). For numerical features, the options for automatically selecting the number of bins (using uniform bins) are the same as for numpy.histogram_bin_edges. | 'sturges' |
| n_max | int or None | Used only for partial dependence computation. The number of rows to subsample from X. This speeds up computation, in particular for slow predict functions. | 1000 |
| rng | Generator, int or None | Used only for partial dependence computation. The random number generator used for subsampling of X. | None |
Returns:
| Name | Type | Description |
|---|---|---|
| df | DataFrame | The result table contains at least the columns y_obs_mean, y_pred_mean, y_obs_stderr, y_pred_stderr, count and weights. If feature_name is given, a feature column and a bin_edges column are added. If predict_function is given, a partial_dependence column is added. |
Notes
The marginal values are computed as an estimation of:
- y_obs: \(\mathbb{E}(Y|\text{feature})\)
- y_pred: \(\mathbb{E}(m(X)|\text{feature})\)

with \(\text{feature}\) the column specified by feature_name.
Computationally that is more or less a group-by-aggregate operation on a dataset.
The standard errors for both are calculated in the standard way as \(\mathrm{SE} = \sqrt{\operatorname{Var}(\bar{Y})} = \frac{\sigma}{\sqrt{n}}\), with the standard variance estimator for \(\sigma^2\) with Bessel correction, i.e. division by \(n-1\) instead of \(n\).
With case weights, the variance estimator becomes \(\operatorname{Var}(\bar{Y}) = \frac{1}{n-1} \frac{1}{\sum_i w_i} \sum_i w_i (y_i - \bar{y})^2\) with the implied relation \(\operatorname{Var}(y_i) \sim \frac{1}{w_i} \). If your weights are for repeated observations, so-called frequency weights, then the above estimate is conservative because it uses \(n - 1\) instead of \((\sum_i w_i) - 1\).
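The group-wise aggregation can be sketched in plain NumPy. This only illustrates the formulas above; the helper name `marginal_mean_and_stderr` is made up, and compute_marginal applies this per feature bin (a group-by-aggregate) to both y_obs and y_pred:

```python
import numpy as np

def marginal_mean_and_stderr(values, weights=None):
    """(Weighted) mean and its standard error, per the formulas above."""
    v = np.asarray(values, dtype=float)
    n = v.shape[0]
    w = np.ones(n) if weights is None else np.asarray(weights, dtype=float)
    mean = np.average(v, weights=w)
    # Var(bar Y) = 1/(n-1) * 1/sum(w) * sum(w * (y - bar y)^2)
    var = np.sum(w * (v - mean) ** 2) / ((n - 1) * w.sum())
    return mean, np.sqrt(var)

# Without a feature, there is a single group containing all observations:
y_obs_mean, y_obs_stderr = marginal_mean_and_stderr([0, 0, 1, 1])     # 0.5, ≈0.288675
y_pred_mean, y_pred_stderr = marginal_mean_and_stderr([-1, 1, 1, 2])  # 0.75, ≈0.629153
```

These values match the first example below.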
Examples:
>>> compute_marginal(y_obs=[0, 0, 1, 1], y_pred=[-1, 1, 1, 2])
shape: (1, 6)
┌────────────┬─────────────┬──────────────┬───────────────┬───────┬─────────┐
│ y_obs_mean ┆ y_pred_mean ┆ y_obs_stderr ┆ y_pred_stderr ┆ count ┆ weights │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ f64 ┆ f64 ┆ f64 ┆ f64 ┆ u32 ┆ f64 │
╞════════════╪═════════════╪══════════════╪═══════════════╪═══════╪═════════╡
│ 0.5 ┆ 0.75 ┆ 0.288675 ┆ 0.629153 ┆ 4 ┆ 4.0 │
└────────────┴─────────────┴──────────────┴───────────────┴───────┴─────────┘
>>> import polars as pl
>>> from sklearn.linear_model import Ridge
>>> pl.Config.set_tbl_width_chars(84)
<class 'polars.config.Config'>
>>> y_obs, X = [0, 0, 1, 1], [[0, 1], [1, 1], [2, 2], [3, 2]]
>>> m = Ridge().fit(X, y_obs)
>>> compute_marginal(y_obs=y_obs, y_pred=m.predict(X), X=X, feature_name=0,
... predict_function=m.predict)
shape: (3, 9)
┌──────────┬─────────┬─────────┬─────────┬───┬───────┬─────────┬─────────┬─────────┐
│ feature ┆ y_obs_m ┆ y_pred_ ┆ y_obs_s ┆ … ┆ count ┆ weights ┆ bin_edg ┆ partial │
│ 0 ┆ ean ┆ mean ┆ tderr ┆ ┆ --- ┆ --- ┆ es ┆ _depend │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ u32 ┆ f64 ┆ --- ┆ ence │
│ f64 ┆ f64 ┆ f64 ┆ f64 ┆ ┆ ┆ ┆ array[f ┆ --- │
│ ┆ ┆ ┆ ┆ ┆ ┆ ┆ 64, 3] ┆ f64 │
╞══════════╪═════════╪═════════╪═════════╪═══╪═══════╪═════════╪═════════╪═════════╡
│ 0.5 ┆ 0.0 ┆ 0.125 ┆ 0.0 ┆ … ┆ 2 ┆ 2.0 ┆ [0.0, ┆ 0.25 │
│ ┆ ┆ ┆ ┆ ┆ ┆ ┆ 0.5, ┆ │
│ ┆ ┆ ┆ ┆ ┆ ┆ ┆ 1.0] ┆ │
│ 2.0 ┆ 1.0 ┆ 0.75 ┆ 0.0 ┆ … ┆ 1 ┆ 1.0 ┆ [1.0, ┆ 0.625 │
│ ┆ ┆ ┆ ┆ ┆ ┆ ┆ 0.0, ┆ │
│ ┆ ┆ ┆ ┆ ┆ ┆ ┆ 2.0] ┆ │
│ 3.0 ┆ 1.0 ┆ 1.0 ┆ 0.0 ┆ … ┆ 1 ┆ 1.0 ┆ [2.0, ┆ 0.875 │
│ ┆ ┆ ┆ ┆ ┆ ┆ ┆ 0.0, ┆ │
│ ┆ ┆ ┆ ┆ ┆ ┆ ┆ 3.0] ┆ │
└──────────┴─────────┴─────────┴─────────┴───┴───────┴─────────┴─────────┴─────────┘
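The partial_dependence column above can be reproduced by pinning the feature to a grid value for all (possibly subsampled) rows, predicting, and averaging. This is a sketch of the standard partial dependence recipe; the helper name `partial_dependence_sketch` is made up, and details such as subsampling may differ from the library:

```python
import numpy as np
from sklearn.linear_model import Ridge

def partial_dependence_sketch(predict_function, X, feature_index, grid,
                              n_max=1000, rng=0):
    """Average prediction with one feature pinned to each grid value."""
    X = np.asarray(X, dtype=float)
    if X.shape[0] > n_max:  # subsample rows for slow predict functions
        idx = np.random.default_rng(rng).choice(X.shape[0], n_max, replace=False)
        X = X[idx]
    result = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, feature_index] = value  # pin the feature to the grid value
        result.append(predict_function(X_mod).mean())
    return np.array(result)

y_obs, X = [0, 0, 1, 1], [[0, 1], [1, 1], [2, 2], [3, 2]]
m = Ridge().fit(X, y_obs)
partial_dependence_sketch(m.predict, X, feature_index=0, grid=[0.5, 2.0, 3.0])
# → array([0.25, 0.625, 0.875]), the partial_dependence values in the table above.
```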
plot_marginal(y_obs, y_pred, X, feature_name, predict_function=None, weights=None, *, n_bins=10, bin_method='sturges', n_max=1000, rng=None, ax=None, show_lines='numerical')
Plot marginal observed and predicted conditional on a feature.
This plot provides a means to inspect a model per feature. The averages of observed and predicted values are plotted, as well as a histogram of the feature.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| y_obs | array-like of shape (n_obs) | Observed values of the response variable. For binary classification, y_obs is expected to be in the interval [0, 1]. | required |
| y_pred | array-like of shape (n_obs) | Predicted values, e.g. for the conditional expectation of the response. | required |
| X | array-like of shape (n_obs, n_features) | The dataframe or array of features to be passed to the model predict function. | required |
| feature_name | str or int | Column name (str) or index (int) of the feature in X. | required |
| predict_function | callable or None | A callable to get predictions, e.g. the predict method of a fitted model. Used for the partial dependence computation. | None |
| weights | array-like of shape (n_obs) or None | Case weights. If given, the marginal values are calculated as weighted averages with these weights. | None |
| n_bins | int | The number of bins, at least 2. For numerical features, if present, null values are always included in the output, accounting for one bin. NaN values are treated as null values. | 10 |
| bin_method | str | The method for finding bin edges (boundaries). For numerical features, the options for automatically selecting the number of bins (using uniform bins) are the same as for numpy.histogram_bin_edges. | 'sturges' |
| n_max | int or None | Used only for partial dependence computation. The number of rows to subsample from X. This speeds up computation, in particular for slow predict functions. | 1000 |
| rng | Generator, int or None | Used only for partial dependence computation. The random number generator used for subsampling of X. | None |
| ax | matplotlib.axes.Axes or plotly Figure | Axes object to draw the plot onto, otherwise uses the current Axes. | None |
| show_lines | str | Option for how to display mean values and partial dependence. | 'numerical' |
Returns:
| Name | Type | Description |
|---|---|---|
| ax | matplotlib.axes.Axes or plotly Figure | Either the matplotlib axes or the plotly figure. This is configurable by setting the plot_backend (e.g. via config_context). |
Examples:
If you wish to plot multiple features at once with subfigures, here is how to do it with matplotlib:
```py
from math import ceil
import matplotlib.pyplot as plt
import numpy as np
from model_diagnostics.calibration import plot_marginal

# Replace by your own data and model.
n_obs = 100
y_obs = np.arange(n_obs)
X = np.ones((n_obs, 2))
X[:, 0] = np.sin(np.arange(n_obs))
X[:, 1] = y_obs ** 2

def model_predict(X):
    s = 0.5 * n_obs * np.sin(X)
    return s.sum(axis=1) + np.sqrt(X[:, 1])

# Now the plotting.
feature_list = [0, 1]
n_rows, n_cols = ceil(len(feature_list) / 2), 2
fig, axs = plt.subplots(nrows=n_rows, ncols=n_cols, sharey=True)
for i, ax in enumerate(axs):
    plot_marginal(
        y_obs=y_obs,
        y_pred=model_predict(X),
        X=X,
        feature_name=feature_list[i],
        predict_function=model_predict,
        ax=ax,
    )
fig.tight_layout()
```
For plotly, use the helper function add_marginal_subplot:
```py
from math import ceil
import numpy as np
from model_diagnostics import config_context
from plotly.subplots import make_subplots
from model_diagnostics.calibration import add_marginal_subplot, plot_marginal

# Replace by your own data and model.
n_obs = 100
y_obs = np.arange(n_obs)
X = np.ones((n_obs, 2))
X[:, 0] = np.sin(np.arange(n_obs))
X[:, 1] = y_obs ** 2

def model_predict(X):
    s = 0.5 * n_obs * np.sin(X)
    return s.sum(axis=1) + np.sqrt(X[:, 1])

# Now the plotting.
feature_list = [0, 1]
n_rows, n_cols = ceil(len(feature_list) / 2), 2
fig = make_subplots(
    rows=n_rows,
    cols=n_cols,
    vertical_spacing=0.3 / n_rows,  # equals default
    # subplot_titles=feature_list,  # maybe
    specs=[[{"secondary_y": True}] * n_cols] * n_rows,  # This is important!
)
for row in range(n_rows):
    for col in range(n_cols):
        i = n_cols * row + col
        with config_context(plot_backend="plotly"):
            subfig = plot_marginal(
                y_obs=y_obs,
                y_pred=model_predict(X),
                X=X,
                feature_name=feature_list[i],
                predict_function=model_predict,
            )
        add_marginal_subplot(subfig, fig, row, col)
fig.show()
```
add_marginal_subplot(subfig, fig, row, col)
Add a plotly subplot from plot_marginal to a multi-plot figure.
This auxiliary function accompanies
plot_marginal in order to ease
plotting with subfigures with the plotly backend.
For it to work, you must call plotly's make_subplots with the specs argument
and set the appropriate number of {"secondary_y": True} in a list of lists.
```py hl_lines="7"
from plotly.subplots import make_subplots

n_rows, n_cols = ...
fig = make_subplots(
    rows=n_rows,
    cols=n_cols,
    specs=[[{"secondary_y": True}] * n_cols] * n_rows,  # This is important!
)
```
The reason is that `plot_marginal` uses a secondary y-axis (with sides swapped with
the primary y-axis).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| subfig | plotly Figure | The subfigure which is added to fig. | required |
| fig | plotly Figure | The multi-plot figure to which subfig is added. | required |
| row | int | The (0-based) row index of the subplot. | required |
| col | int | The (0-based) column index of the subplot. | required |
Returns:
| Type | Description |
|---|---|
| plotly Figure | The plotly figure fig. |
identification_function(y_obs, y_pred, *, functional='mean', level=0.5)
Canonical identification function.
Identification functions act as generalised residuals. See Notes for further details.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| y_obs | array-like of shape (n_obs) | Observed values of the response variable. For binary classification, y_obs is expected to be in the interval [0, 1]. | required |
| y_pred | array-like of shape (n_obs) | Predicted values of the functional. | required |
| functional | str | The functional that is induced by the identification function. | 'mean' |
| level | float | The level of the expectile or quantile (often called \(\alpha\)). | 0.5 |
Returns:
| Name | Type | Description |
|---|---|---|
| V | ndarray of shape (n_obs) | Values of the identification function. |
Notes
The function \(V(y, z)\) for observation \(y=y_{obs}\) and prediction \(z=y_{pred}\) is a strict identification function for the functional \(T\), or induces the functional \(T\), if
\(\mathbb{E}_{Y \sim F}[V(Y, z)] = 0 \quad\Longleftrightarrow\quad z = T(F)\)
for all \(F\) in some class of distributions \(\mathcal{F}\). Implemented examples of the functional \(T\) are mean, median, expectiles and quantiles.
| functional | strict identification function \(V(y, z)\) |
|---|---|
| mean | \(z - y\) |
| median | \(\mathbf{1}\{z \ge y\} - \frac{1}{2}\) |
| expectile | \(2 \mid\mathbf{1}\{z \ge y\} - \alpha\mid (z - y)\) |
| quantile | \(\mathbf{1}\{z \ge y\} - \alpha\) |
Here, \(\alpha\) denotes the level for expectiles and quantiles.
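The table translates directly into code. A plain-NumPy sketch (the name `identification_function_sketch` is illustrative, not the library function):

```python
import numpy as np

def identification_function_sketch(y_obs, y_pred, functional="mean", level=0.5):
    """Identification functions V(y, z) from the table above."""
    y = np.asarray(y_obs, dtype=float)
    z = np.asarray(y_pred, dtype=float)
    ind = (z >= y).astype(float)  # indicator 1{z >= y}
    if functional == "mean":
        return z - y
    if functional == "median":
        return ind - 0.5
    if functional == "expectile":
        return 2 * np.abs(ind - level) * (z - y)
    if functional == "quantile":
        return ind - level
    raise ValueError(f"Unknown functional: {functional}")

identification_function_sketch([0, 0, 1, 1], [-1, 1, 1, 2])
# → array([-1., 1., 0., 1.])
```

Note that for level 0.5 the expectile reduces to the mean case (the factor \(2\mid\mathbf{1}\{z \ge y\} - \alpha\mid\) equals 1) and the quantile reduces to the median case.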
References
[Gneiting2011] T. Gneiting. "Making and Evaluating Point Forecasts". (2011) doi:10.1198/jasa.2011.r10138 arxiv:0912.0902.
Examples:
>>> identification_function(y_obs=[0, 0, 1, 1], y_pred=[-1, 1, 1, 2])
array([-1, 1, 0, 1])
plot_reliability_diagram(y_obs, y_pred, weights=None, *, functional='mean', level=0.5, n_bootstrap=None, confidence_level=0.9, diagram_type='reliability', ax=None)
Plot a reliability diagram.
A reliability diagram or calibration curve assesses auto-calibration. It plots the
conditional expectation given the predictions E(y_obs|y_pred) (y-axis) vs the
predictions y_pred (x-axis).
The conditional expectation is estimated via isotonic regression (PAV algorithm)
of y_obs on y_pred.
See Notes for further details.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| y_obs | array-like of shape (n_obs) | Observed values of the response variable. For binary classification, y_obs is expected to be in the interval [0, 1]. | required |
| y_pred | array-like of shape (n_obs) or (n_obs, n_models) | Predicted values, e.g. for the conditional expectation of the response. | required |
| weights | array-like of shape (n_obs) or None | Case weights. | None |
| functional | str | The functional that is induced by the identification function. | 'mean' |
| level | float | The level of the expectile or quantile (often called \(\alpha\)). | 0.5 |
| n_bootstrap | int or None | If not None, the number of bootstrap samples used for uncertainty regions. | None |
| confidence_level | float | Confidence level for bootstrap uncertainty regions. | 0.9 |
| diagram_type | str | The type of diagram to plot. | 'reliability' |
| ax | matplotlib.axes.Axes or plotly Figure | Axes object to draw the plot onto, otherwise uses the current Axes. | None |
Returns:
| Name | Type | Description |
|---|---|---|
| ax | matplotlib.axes.Axes or plotly Figure | Either the matplotlib axes or the plotly figure. This is configurable by setting the plot_backend (e.g. via config_context). |
Notes
The expectation conditional on the predictions is \(\mathbb{E}(Y|y_{pred})\). This object is estimated by the pool-adjacent-violators (PAV) algorithm, which has very desirable properties:
- It is non-parametric without any tuning parameter. Thus, the results are easily reproducible.
- Optimal selection of bins.
- Statistically consistent estimator.
For details, refer to [Dimitriadis2021].
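The isotonic fit can be sketched with scikit-learn's IsotonicRegression, which implements PAV. This illustrates the idea and is not necessarily how the library computes it:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

y_pred = np.array([0.1, 0.2, 0.3, 0.4])
y_obs = np.array([0.0, 1.0, 0.0, 1.0])

# Isotonic regression of y_obs on y_pred estimates E(Y | y_pred) by a
# non-decreasing step function; violating pairs are pooled to their mean.
calibrated = IsotonicRegression().fit_transform(y_pred, y_obs)
# → [0.0, 0.5, 0.5, 1.0]: the decreasing pair (1, 0) is pooled to 0.5.
```

Plotting `calibrated` against `y_pred` gives the reliability curve; a well auto-calibrated model lies close to the diagonal.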
References
[Dimitriadis2021] T. Dimitriadis, T. Gneiting, and A. I. Jordan. "Stable reliability diagrams for probabilistic classifiers". In: Proceedings of the National Academy of Sciences 118.8 (2021), e2016191118. doi:10.1073/pnas.2016191118.