scoring

Tools to assess predictive model performance.

ElementaryScore

Elementary scoring function.

The smaller the better.

The elementary scoring function is consistent for the specified functional for all values of eta and is the main ingredient for Murphy diagrams. See Notes for further details.

Parameters:

Name Type Description Default
eta float

Free parameter.

required
functional str

The functional that is induced by the identification function V. Options are:

  • "mean". Argument level is neglected.
  • "median". Argument level is neglected.
  • "expectile"
  • "quantile"
'mean'
level float

The level of the expectile or quantile. (Often called \(\alpha\).) It must be 0 < level < 1. level=0.5 and functional="expectile" gives the mean. level=0.5 and functional="quantile" gives the median.

0.5
Notes

The elementary scoring or loss function is given by

\[ S_\eta(y, z) = (\mathbf{1}\{\eta \le z\} - \mathbf{1}\{\eta \le y\}) V(y, \eta) \]

with identification function \(V\) for the given functional \(T\). It allows for the mixture or Choquet representation

\[ S(y, z) = \int S_\eta(y, z) \,dH(\eta) \]

for some locally finite measure \(H\). It follows that the scoring function \(S\) is consistent for \(T\).

References
[Jordan2022]

A.I. Jordan, A. Mühlemann, J.F. Ziegel. "Characterizing the optimal solutions to the isotonic regression problem for identifiable functionals". (2022) doi:10.1007/s10463-021-00808-0

[GneitingResin2022]

T. Gneiting, J. Resin. "Regression Diagnostics meets Forecast Evaluation: Conditional Calibration, Reliability Diagrams, and Coefficient of Determination". arxiv:2108.03210

Examples:

>>> el_score = ElementaryScore(eta=2, functional="mean")
>>> el_score(y_obs=[1, 2, 2, 1], y_pred=[4, 1, 2, 3])
np.float64(0.5)
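
The formula in the Notes can be traced by hand for this example. The following is a minimal pure-Python sketch, not the library's implementation. It assumes the identification function for the mean is V(y, eta) = eta - y; the page does not spell V out, but this choice reproduces the value 0.5 shown above. The helper name `elementary_score_mean` is hypothetical.

```python
def elementary_score_mean(y_obs, y_pred, eta):
    """Average elementary score for the mean, assuming V(y, eta) = eta - y."""
    scores = []
    for y, z in zip(y_obs, y_pred):
        # Indicator difference 1{eta <= z} - 1{eta <= y} from the Notes formula.
        indicator_diff = (eta <= z) - (eta <= y)
        scores.append(indicator_diff * (eta - y))
    return sum(scores) / len(scores)


result = elementary_score_mean([1, 2, 2, 1], [4, 1, 2, 3], eta=2)
print(result)  # 0.5, matching the documented example
```

Only observations for which eta lies between the observation and the prediction contribute a nonzero score, which is why two of the four terms vanish here.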

__call__(y_obs, y_pred, weights=None)

Mean or average score.

Parameters:

Name Type Description Default
y_obs array-like of shape (n_obs)

Observed values of the response variable.

required
y_pred array-like of shape (n_obs)

Predicted values of the functional of interest, e.g. the conditional expectation of the response, E(Y|X).

required
weights array-like of shape (n_obs) or None

Case weights.

None

Returns:

Name Type Description
score float

The average score.

score_per_obs(y_obs, y_pred)

Score per observation.

Parameters:

Name Type Description Default
y_obs array-like of shape (n_obs)

Observed values of the response variable.

required
y_pred array-like of shape (n_obs)

Predicted values of the functional of interest, e.g. the conditional expectation of the response, E(Y|X).

required

Returns:

Name Type Description
score_per_obs ndarray

Values of the scoring function for each observation.

GammaDeviance

Gamma deviance.

The smaller the better, minimum is zero.

The Gamma deviance is strictly consistent for the mean. It has a degree of homogeneity of 0 and is therefore insensitive to a change of units or multiplication of y_obs and y_pred by the same positive constant.

Attributes:

Name Type Description
functional str

"mean"

Notes

\(S(y, z) = 2(\frac{y}{z} -\log\frac{y}{z} - 1)\)

Examples:

>>> gd = GammaDeviance()
>>> gd(y_obs=[3, 2, 1, 1], y_pred=[2, 1, 1 , 2])
np.float64(0.2972674459459178)
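
As a rough check of the formula in the Notes and of the degree-0 homogeneity claim, the sketch below recomputes the example in pure Python and then rescales both arguments by the same constant. The helper names are hypothetical and the code is an illustration, not the library's implementation.

```python
import math


def gamma_deviance(y, z):
    """Per-observation Gamma deviance 2 * (y/z - log(y/z) - 1) for y, z > 0."""
    ratio = y / z
    return 2 * (ratio - math.log(ratio) - 1)


def average(values):
    return sum(values) / len(values)


y_obs = [3, 2, 1, 1]
y_pred = [2, 1, 1, 2]
score = average([gamma_deviance(y, z) for y, z in zip(y_obs, y_pred)])

# Rescale both arguments by the same positive constant (e.g. a change of units).
scaled = average([gamma_deviance(1000 * y, 1000 * z) for y, z in zip(y_obs, y_pred)])
print(score, scaled)  # both values agree
```

The score depends on y and z only through the ratio y/z, so the common factor cancels.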

__call__(y_obs, y_pred, weights=None)

Mean or average score.

Parameters:

Name Type Description Default
y_obs array-like of shape (n_obs)

Observed values of the response variable.

required
y_pred array-like of shape (n_obs)

Predicted values of the functional of interest, e.g. the conditional expectation of the response, E(Y|X).

required
weights array-like of shape (n_obs) or None

Case weights.

None

Returns:

Name Type Description
score float

The average score.

score_per_obs(y_obs, y_pred)

Score per observation.

Parameters:

Name Type Description Default
y_obs array-like of shape (n_obs)

Observed values of the response variable.

required
y_pred array-like of shape (n_obs)

Predicted values of the functional of interest, e.g. the conditional expectation of the response, E(Y|X).

required

Returns:

Name Type Description
score_per_obs ndarray

Values of the scoring function for each observation.

HomogeneousExpectileScore

Homogeneous scoring function of degree h for expectiles.

The smaller the better, minimum is zero.

Up to a multiplicative constant, these are the only scoring functions that are both strictly consistent for expectiles at level alpha and homogeneous. The possible additive constant is chosen such that the minimal function value equals zero.

Note that the 1/2-expectile (level alpha=0.5) equals the mean.

Parameters:

Name Type Description Default
degree float

Degree of homogeneity.

2
level float

The level of the expectile. (Often called \(\alpha\).) It must be 0 < level < 1. level=0.5 gives the mean.

0.5

Attributes:

Name Type Description
functional str

"mean" if level=0.5, else "expectile"

Notes

The homogeneous score of degree \(h\) is given by

\[ S_\alpha^h(y, z) = 2 |\mathbf{1}\{z \ge y\} - \alpha| \frac{2}{h(h-1)} \left(|y|^h - |z|^h - h \operatorname{sign}(z) |z|^{h-1} (y-z)\right) \]

Note that the first factor, \(2 |\mathbf{1}\{z \ge y\} - \alpha|\), equals 1 for \(\alpha=0.5\). There are important domain restrictions and limits:

  • \(h>1\): All real numbers \(y\) and \(z\) are allowed.

    Special case \(h=2, \alpha=\frac{1}{2}\) equals the squared error, aka Normal deviance \(S(y, z) = (y - z)^2\).

  • \(0 < h \leq 1\): Only \(y \geq 0\), \(z>0\) are allowed.

    Special case \(h=1, \alpha=\frac{1}{2}\) (by taking the limit) equals the Poisson deviance \(S(y, z) = 2(y\log\frac{y}{z} - y + z)\).

  • \(h \leq 0\): Only \(y>0\), \(z>0\) are allowed.

    Special case \(h=0, \alpha=\frac{1}{2}\) (by taking the limit) equals the Gamma deviance \(S(y, z) = 2(\frac{y}{z} -\log\frac{y}{z} - 1)\).

For the common domains, \(S_{\frac{1}{2}}^h\) equals the Tweedie deviance with the following relation between the degree of homogeneity \(h\) and the Tweedie power \(p\): \(h = 2-p\).

References
[Gneiting2011]

T. Gneiting. "Making and Evaluating Point Forecasts". (2011) doi:10.1198/jasa.2011.r10138 arxiv:0912.0902

Examples:

>>> hes = HomogeneousExpectileScore(degree=2, level=0.1)
>>> hes(y_obs=[0, 0, 1, 1], y_pred=[-1, 1, 1 , 2])
np.float64(0.95)
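
The general formula in the Notes can be evaluated directly when \(h>1\), where all real values are allowed. The sketch below is a pure-Python illustration restricted to that case (the limits \(h=1\) and \(h=0\) need separate handling). The function name is hypothetical and this is not the library's code.

```python
def homogeneous_expectile_score(y, z, degree=2.0, level=0.5):
    """Per-observation homogeneous expectile score for degree h > 1."""
    h = degree
    if h <= 1:
        raise ValueError("this sketch only covers degree > 1")
    # Asymmetric weight 2 * |1{z >= y} - alpha|.
    weight = 2 * abs((1.0 if z >= y else 0.0) - level)
    sign_z = (z > 0) - (z < 0)
    bracket = abs(y) ** h - abs(z) ** h - h * sign_z * abs(z) ** (h - 1) * (y - z)
    return weight * 2 / (h * (h - 1)) * bracket


y_obs = [0, 0, 1, 1]
y_pred = [-1, 1, 1, 2]
scores = [homogeneous_expectile_score(y, z, degree=2, level=0.1)
          for y, z in zip(y_obs, y_pred)]
avg = sum(scores) / len(scores)
print(avg)  # approximately 0.95, as in the documented example

# With degree=2 and level=0.5 the score reduces to the squared error.
print(homogeneous_expectile_score(3.0, 1.5))  # (3.0 - 1.5) ** 2 = 2.25
```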

__call__(y_obs, y_pred, weights=None)

Mean or average score.

Parameters:

Name Type Description Default
y_obs array-like of shape (n_obs)

Observed values of the response variable.

required
y_pred array-like of shape (n_obs)

Predicted values of the functional of interest, e.g. the conditional expectation of the response, E(Y|X).

required
weights array-like of shape (n_obs) or None

Case weights.

None

Returns:

Name Type Description
score float

The average score.

score_per_obs(y_obs, y_pred)

Score per observation.

Parameters:

Name Type Description Default
y_obs array-like of shape (n_obs)

Observed values of the response variable.

required
y_pred array-like of shape (n_obs)

Predicted values of the functional of interest, e.g. the conditional expectation of the response, E(Y|X).

required

Returns:

Name Type Description
score_per_obs ndarray

Values of the scoring function for each observation.

HomogeneousQuantileScore

Homogeneous scoring function of degree h for quantiles.

The smaller the better, minimum is zero.

Up to a multiplicative constant, these are the only scoring functions that are both strictly consistent for quantiles at level alpha and homogeneous. The possible additive constant is chosen such that the minimal function value equals zero.

Note that the 1/2-quantile (level alpha=0.5) equals the median.

Parameters:

Name Type Description Default
degree float

Degree of homogeneity.

2
level float

The level of the quantile. (Often called \(\alpha\).) It must be 0 < level < 1. level=0.5 gives the median.

0.5

Attributes:

Name Type Description
functional str

"quantile"

Notes

The homogeneous score of degree \(h\) is given by

\[ S_\alpha^h(y, z) = (\mathbf{1}\{z \ge y\} - \alpha) \frac{z^h - y^h}{h} \]

There are important domain restrictions and limits:

  • \(h\) positive odd integer: All real numbers \(y\) and \(z\) are allowed.

    • Special case \(h=1\) equals the pinball loss, \(S(y, z) = (\mathbf{1}\{z \ge y\} - \alpha) (z - y)\).
    • Special case \(h=1, \alpha=\frac{1}{2}\) equals half the absolute error \(S(y, z) = \frac{1}{2}|z - y|\).
  • \(h\) real valued: Only \(y>0\), \(z>0\) are allowed.

    Special case \(h=0\) (by taking the limit) equals \(S(y, z) = (\mathbf{1}\{z \ge y\} - \alpha) \log\frac{z}{y}\).

References
[Gneiting2011]

T. Gneiting. "Making and Evaluating Point Forecasts". (2011) doi:10.1198/jasa.2011.r10138 arxiv:0912.0902

Examples:

>>> hqs = HomogeneousQuantileScore(degree=3, level=0.1)
>>> hqs(y_obs=[0, 0, 1, 1], y_pred=[-1, 1, 1 , 2])
np.float64(0.6083333333333334)
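
The score formula in the Notes is short enough to evaluate by hand. The following pure-Python sketch (hypothetical function name, not the library's implementation) covers nonzero degrees and reproduces the example above; a positive odd integer degree is used so that negative predictions are allowed.

```python
def homogeneous_quantile_score(y, z, degree=1, level=0.5):
    """Per-observation score (1{z >= y} - alpha) * (z**h - y**h) / h for h != 0."""
    indicator = 1.0 if z >= y else 0.0
    return (indicator - level) * (z ** degree - y ** degree) / degree


y_obs = [0, 0, 1, 1]
y_pred = [-1, 1, 1, 2]
scores = [homogeneous_quantile_score(y, z, degree=3, level=0.1)
          for y, z in zip(y_obs, y_pred)]
avg = sum(scores) / len(scores)
print(avg)  # approximately 0.6083, as in the documented example

# degree=1 and level=0.5 give half the absolute error.
print(homogeneous_quantile_score(1, 3))  # 0.5 * |3 - 1| = 1.0
```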

__call__(y_obs, y_pred, weights=None)

Mean or average score.

Parameters:

Name Type Description Default
y_obs array-like of shape (n_obs)

Observed values of the response variable.

required
y_pred array-like of shape (n_obs)

Predicted values of the functional of interest, e.g. the conditional expectation of the response, E(Y|X).

required
weights array-like of shape (n_obs) or None

Case weights.

None

Returns:

Name Type Description
score float

The average score.

score_per_obs(y_obs, y_pred)

Score per observation.

Parameters:

Name Type Description Default
y_obs array-like of shape (n_obs)

Observed values of the response variable.

required
y_pred array-like of shape (n_obs)

Predicted values of the functional of interest, e.g. the conditional expectation of the response, E(Y|X).

required

Returns:

Name Type Description
score_per_obs ndarray

Values of the scoring function for each observation.

LogLoss

Log loss.

The smaller the better, minimum is zero.

The log loss is a strictly consistent scoring function for the mean for observations and predictions in the range 0 to 1. It is also referred to as (half the) Bernoulli deviance, (half the) Binomial log-likelihood, logistic loss and binary cross-entropy. Its minimal function value is zero.

Attributes:

Name Type Description
functional str

"mean"

Notes

The log loss for \(y,z \in [0,1]\) is given by

\[ S(y, z) = - y \log\frac{z}{y} - (1 - y) \log\frac{1-z}{1-y} \]

If one restricts to \(y\in \{0, 1\}\), this simplifies to

\[ S(y, z) = - y \log(z) - (1 - y) \log(1-z) \]

Examples:

>>> ll = LogLoss()
>>> ll(y_obs=[0, 0.5, 1, 1], y_pred=[0.1, 0.2, 0.8 , 0.9], weights=[1, 2, 1, 1])
np.float64(0.17603033705165635)
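
The example above uses both a non-binary observation (0.5) and case weights. The sketch below recomputes it in pure Python under two assumptions that are consistent with the documented output but not stated on this page: the convention 0 * log(0) = 0, and a weighted average of the form sum(w * s) / sum(w). Helper names are hypothetical.

```python
import math


def xlogy_ratio(a, b):
    """a * log(a / b) with the convention 0 * log(0) = 0."""
    return 0.0 if a == 0 else a * math.log(a / b)


def log_loss(y, z):
    """Per-observation log loss for y in [0, 1] and z in (0, 1)."""
    # -y*log(z/y) - (1-y)*log((1-z)/(1-y)) rewritten with positive signs.
    return xlogy_ratio(y, z) + xlogy_ratio(1 - y, 1 - z)


y_obs = [0, 0.5, 1, 1]
y_pred = [0.1, 0.2, 0.8, 0.9]
weights = [1, 2, 1, 1]

weighted_sum = sum(w * log_loss(y, z) for w, y, z in zip(weights, y_obs, y_pred))
score = weighted_sum / sum(weights)
print(score)  # approximately 0.17603
```

Written this way, the loss is zero whenever z equals y, including for non-binary y, which is what the general formula adds over the binary simplification.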

__call__(y_obs, y_pred, weights=None)

Mean or average score.

Parameters:

Name Type Description Default
y_obs array-like of shape (n_obs)

Observed values of the response variable.

required
y_pred array-like of shape (n_obs)

Predicted values of the functional of interest, e.g. the conditional expectation of the response, E(Y|X).

required
weights array-like of shape (n_obs) or None

Case weights.

None

Returns:

Name Type Description
score float

The average score.

score_per_obs(y_obs, y_pred)

Score per observation.

Parameters:

Name Type Description Default
y_obs array-like of shape (n_obs)

Observed values of the response variable.

required
y_pred array-like of shape (n_obs)

Predicted values of the functional of interest, e.g. the conditional expectation of the response, E(Y|X).

required

Returns:

Name Type Description
score_per_obs ndarray

Values of the scoring function for each observation.

PinballLoss

Pinball loss.

The smaller the better, minimum is zero.

The pinball loss is strictly consistent for quantiles.

Parameters:

Name Type Description Default
level float

The level of the quantile. (Often called \(\alpha\).) It must be 0 < level < 1. level=0.5 gives the median.

0.5

Attributes:

Name Type Description
functional str

"quantile"

Notes

The pinball loss has degree of homogeneity 1 and is given by

\[ S_\alpha(y, z) = (\mathbf{1}\{z \ge y\} - \alpha) (z - y) \]

The authors do not know where and when the term pinball loss was coined. It is most famously used in quantile regression.

Examples:

>>> pl = PinballLoss(level=0.9)
>>> pl(y_obs=[0, 0, 1, 1], y_pred=[-1, 1, 1 , 2])
np.float64(0.275)
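
To illustrate what consistency for quantiles means in practice, the sketch below evaluates the formula from the Notes in pure Python and then searches for the constant prediction with the smallest average loss on a small sample. The function names and the sample are hypothetical, chosen only for illustration.

```python
def pinball_loss(y, z, level=0.5):
    """Per-observation pinball loss (1{z >= y} - alpha) * (z - y)."""
    indicator = 1.0 if z >= y else 0.0
    return (indicator - level) * (z - y)


def average_pinball(y_obs, y_pred, level):
    return sum(pinball_loss(y, z, level) for y, z in zip(y_obs, y_pred)) / len(y_obs)


score = average_pinball([0, 0, 1, 1], [-1, 1, 1, 2], level=0.9)
print(score)  # approximately 0.275, as in the documented example

# Among constant predictions, the average loss is smallest at a sample quantile.
sample = list(range(1, 12))  # 1, 2, ..., 11
best = min(sample, key=lambda c: average_pinball(sample, [c] * len(sample), 0.9))
print(best)  # 10
```

With 11 observations and level 0.9, the tenth smallest value is the first point at which at least 90% of the sample lies at or below the prediction.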

__call__(y_obs, y_pred, weights=None)

Mean or average score.

Parameters:

Name Type Description Default
y_obs array-like of shape (n_obs)

Observed values of the response variable.

required
y_pred array-like of shape (n_obs)

Predicted values of the functional of interest, e.g. the conditional expectation of the response, E(Y|X).

required
weights array-like of shape (n_obs) or None

Case weights.

None

Returns:

Name Type Description
score float

The average score.

score_per_obs(y_obs, y_pred)

Score per observation.

Parameters:

Name Type Description Default
y_obs array-like of shape (n_obs)

Observed values of the response variable.

required
y_pred array-like of shape (n_obs)

Predicted values of the functional of interest, e.g. the conditional expectation of the response, E(Y|X).

required

Returns:

Name Type Description
score_per_obs ndarray

Values of the scoring function for each observation.

PoissonDeviance

Poisson deviance.

The smaller the better, minimum is zero.

The Poisson deviance is strictly consistent for the mean. It has a degree of homogeneity of 1.

Attributes:

Name Type Description
functional str

"mean"

Notes

\(S(y, z) = 2(y\log\frac{y}{z} - y + z)\)

Examples:

>>> pd = PoissonDeviance()
>>> pd(y_obs=[0, 0, 1, 1], y_pred=[2, 1, 1 , 2])
np.float64(1.6534264097200273)
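
The example above contains observations equal to zero, for which the term y log(y/z) needs a convention. The pure-Python sketch below assumes 0 * log(0) = 0, which reproduces the documented value, and also checks the degree-1 homogeneity by rescaling both arguments. The function name is hypothetical.

```python
import math


def poisson_deviance(y, z):
    """Per-observation Poisson deviance 2 * (y*log(y/z) - y + z) for y >= 0, z > 0."""
    log_term = 0.0 if y == 0 else y * math.log(y / z)  # convention: 0 * log(0) = 0
    return 2 * (log_term - y + z)


y_obs = [0, 0, 1, 1]
y_pred = [2, 1, 1, 2]
score = sum(poisson_deviance(y, z) for y, z in zip(y_obs, y_pred)) / len(y_obs)
print(score)  # approximately 1.6534

# Homogeneity of degree 1: multiplying y and z by 10 multiplies the score by 10.
scaled = sum(poisson_deviance(10 * y, 10 * z) for y, z in zip(y_obs, y_pred)) / len(y_obs)
print(scaled / score)  # approximately 10
```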

__call__(y_obs, y_pred, weights=None)

Mean or average score.

Parameters:

Name Type Description Default
y_obs array-like of shape (n_obs)

Observed values of the response variable.

required
y_pred array-like of shape (n_obs)

Predicted values of the functional of interest, e.g. the conditional expectation of the response, E(Y|X).

required
weights array-like of shape (n_obs) or None

Case weights.

None

Returns:

Name Type Description
score float

The average score.

score_per_obs(y_obs, y_pred)

Score per observation.

Parameters:

Name Type Description Default
y_obs array-like of shape (n_obs)

Observed values of the response variable.

required
y_pred array-like of shape (n_obs)

Predicted values of the functional of interest, e.g. the conditional expectation of the response, E(Y|X).

required

Returns:

Name Type Description
score_per_obs ndarray

Values of the scoring function for each observation.

SquaredError

Squared error.

The smaller the better, minimum is zero.

The squared error is strictly consistent for the mean. It has a degree of homogeneity of 2. In the context of probabilistic classification, it is also known as Brier score.

Attributes:

Name Type Description
functional str

"mean"

Notes

\(S(y, z) = (y - z)^2\)

Examples:

>>> se = SquaredError()
>>> se(y_obs=[0, 0, 1, 1], y_pred=[-1, 1, 1 , 2])
np.float64(0.75)

__call__(y_obs, y_pred, weights=None)

Mean or average score.

Parameters:

Name Type Description Default
y_obs array-like of shape (n_obs)

Observed values of the response variable.

required
y_pred array-like of shape (n_obs)

Predicted values of the functional of interest, e.g. the conditional expectation of the response, E(Y|X).

required
weights array-like of shape (n_obs) or None

Case weights.

None

Returns:

Name Type Description
score float

The average score.

score_per_obs(y_obs, y_pred)

Score per observation.

Parameters:

Name Type Description Default
y_obs array-like of shape (n_obs)

Observed values of the response variable.

required
y_pred array-like of shape (n_obs)

Predicted values of the functional of interest, e.g. the conditional expectation of the response, E(Y|X).

required

Returns:

Name Type Description
score_per_obs ndarray

Values of the scoring function for each observation.

decompose(y_obs, y_pred, weights=None, *, scoring_function, functional=None, level=None)

Additive decomposition of scores.

The score is decomposed as score = miscalibration - discrimination + uncertainty.

Parameters:

Name Type Description Default
y_obs array-like of shape (n_obs)

Observed values of the response variable.

required
y_pred array-like of shape (n_obs) or (n_obs, n_models)

Predicted values of the functional of interest, e.g. the conditional expectation of the response, E(Y|X).

required
weights array-like of shape (n_obs) or None

Case weights.

None
scoring_function callable

A scoring function with signature roughly fun(y_obs, y_pred, weights) -> float.

required
functional str or None

The target functional which y_pred aims to predict. If None, then it will be inferred from scoring_function.functional. Options are:

  • "mean". Argument level is neglected.
  • "median". Argument level is neglected.
  • "expectile"
  • "quantile"
None
level float or None

Functionals like expectiles and quantiles have a level (often called alpha). If None, then it will be inferred from scoring_function.level.

None

Returns:

Name Type Description
decomposition DataFrame

The resulting score decomposition as a dataframe with columns:

  • miscalibration
  • discrimination
  • uncertainty
  • score: the average score
If y_pred contains several predictions, i.e. it is 2-dimensional with shape (n_obs, n_models) and n_models > 1, then there is the additional column:
  • model
Notes

To be precise, this function returns the decomposition of the score in terms of auto-miscalibration, auto-discrimination (or resolution) and uncertainty (or entropy), see [FLM2022] and references therein. The key element is to estimate the recalibrated predictions, i.e. \(T(Y|m(X))\) for the target functional \(T\) and model predictions \(m(X)\). This is accomplished by isotonic regression, see [Dimitriadis2021] and [Gneiting2021].

References
[FLM2022]

T. Fissler, C. Lorentzen, and M. Mayer. "Model Comparison and Calibration Assessment". (2022) arxiv:2202.12780.

[Dimitriadis2021]

T. Dimitriadis, T. Gneiting, and A. I. Jordan. "Stable reliability diagrams for probabilistic classifiers". (2021) doi:10.1073/pnas.2016191118

[Gneiting2021]

T. Gneiting and J. Resin. "Regression Diagnostics meets Forecast Evaluation: Conditional Calibration, Reliability Diagrams, and Coefficient of Determination". (2021). arXiv:2108.03210.

Examples:

>>> decompose(y_obs=[0, 0, 1, 1], y_pred=[-1, 1, 1, 2],
... scoring_function=SquaredError())
shape: (1, 4)
┌────────────────┬────────────────┬─────────────┬───────┐
│ miscalibration ┆ discrimination ┆ uncertainty ┆ score │
│ ---            ┆ ---            ┆ ---         ┆ ---   │
│ f64            ┆ f64            ┆ f64         ┆ f64   │
╞════════════════╪════════════════╪═════════════╪═══════╡
│ 0.625          ┆ 0.125          ┆ 0.25        ┆ 0.75  │
└────────────────┴────────────────┴─────────────┴───────┘
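
The numbers in this table can be traced by hand for the squared error. The sketch below is a pure-Python illustration under stated assumptions, not the library's implementation: recalibrated predictions come from an isotonic regression of y_obs on y_pred (pool-adjacent-violators, with tied predictions pooled), miscalibration is the score of the predictions minus the score of the recalibrated predictions, and discrimination is the score of the constant marginal mean minus the score of the recalibrated predictions. All helper names are hypothetical.

```python
def isotonic_fit(x, y):
    """Non-decreasing least-squares fit of y on x via pool-adjacent-violators."""
    # Group observations with equal x so tied predictions share one fitted value.
    groups = {}
    for xi, yi in zip(x, y):
        total, count = groups.get(xi, (0.0, 0))
        groups[xi] = (total + yi, count + 1)
    blocks = []  # each block: [sum of y, number of observations, x values]
    for xi in sorted(groups):
        total, count = groups[xi]
        blocks.append([total, count, [xi]])
        # Merge backwards while adjacent block means violate monotonicity.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count, xs = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2].extend(xs)
    fitted = {xv: total / count for total, count, xs in blocks for xv in xs}
    return [fitted[xi] for xi in x]


def mean_squared_error(y, z):
    return sum((a - b) ** 2 for a, b in zip(y, z)) / len(y)


y_obs = [0, 0, 1, 1]
y_pred = [-1, 1, 1, 2]

recalibrated = isotonic_fit(y_pred, y_obs)            # [0.0, 0.5, 0.5, 1.0]
marginal = [sum(y_obs) / len(y_obs)] * len(y_obs)     # constant mean prediction

score = mean_squared_error(y_obs, y_pred)
uncertainty = mean_squared_error(y_obs, marginal)
miscalibration = score - mean_squared_error(y_obs, recalibrated)
discrimination = uncertainty - mean_squared_error(y_obs, recalibrated)
print(miscalibration, discrimination, uncertainty, score)  # 0.625 0.125 0.25 0.75
```

By construction, miscalibration - discrimination + uncertainty adds back up to the score.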

plot_murphy_diagram(y_obs, y_pred, weights=None, *, etas=100, functional='mean', level=0.5, ax=None)

Plot a Murphy diagram.

A Murphy diagram plots the scores of elementary scoring functions ElementaryScore over a range of their free parameter eta. This shows whether a model dominates all others over a wide class of scoring functions or whether the ranking depends strongly on the choice of scoring function. See Notes for further details.

Parameters:

Name Type Description Default
y_obs array-like of shape (n_obs)

Observed values of the response variable. For binary classification, y_obs is expected to be in the interval [0, 1].

required
y_pred array-like of shape (n_obs) or (n_obs, n_models)

Predicted values, e.g. for the conditional expectation of the response, E(Y|X).

required
weights array-like of shape (n_obs) or None

Case weights.

None
etas int or array-like

If an integer is given, equidistant points between min and max y values are generated. If an array-like is given, those points are used.

100
functional str

The functional that is induced by the identification function V. Options are:

  • "mean". Argument level is neglected.
  • "median". Argument level is neglected.
  • "expectile"
  • "quantile"
'mean'
level float

The level of the expectile or quantile. (Often called \(\alpha\).) It must be 0 < level < 1. level=0.5 and functional="expectile" gives the mean. level=0.5 and functional="quantile" gives the median.

0.5
ax Axes

Axes object to draw the plot onto, otherwise uses the current Axes.

None

Returns:

Name Type Description
ax

Either the matplotlib axes or the plotly figure. This is configurable by setting the plot_backend via model_diagnostics.set_config or model_diagnostics.config_context.

Notes

For details, refer to [Ehm2015].
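
The data behind such a diagram can be sketched without any plotting: evaluate the average elementary score on a grid of eta values for each model and compare the resulting curves. The pure-Python sketch below is an illustration only. It assumes the identification function V(y, eta) = eta - y for the mean and a grid spanning the pooled range of observations and predictions; both are assumptions, and the function names are hypothetical.

```python
def elementary_score_mean(y_obs, y_pred, eta):
    """Average elementary score for the mean, assuming V(y, eta) = eta - y."""
    total = 0.0
    for y, z in zip(y_obs, y_pred):
        total += ((eta <= z) - (eta <= y)) * (eta - y)
    return total / len(y_obs)


def murphy_curve(y_obs, y_pred, n_etas=7):
    """Return (etas, scores) on an equidistant grid of eta values."""
    # Assumption: the grid spans the pooled range of observations and predictions.
    lo = min(min(y_obs), min(y_pred))
    hi = max(max(y_obs), max(y_pred))
    etas = [lo + i * (hi - lo) / (n_etas - 1) for i in range(n_etas)]
    return etas, [elementary_score_mean(y_obs, y_pred, eta) for eta in etas]


y_obs = [1, 2, 2, 1]
model_a = [4, 1, 2, 3]
model_b = [1.5, 1.5, 1.5, 1.5]  # constant prediction: the mean of y_obs

etas, curve_a = murphy_curve(y_obs, model_a)
# Re-use the same grid for the second model so the curves are comparable.
curve_b = [elementary_score_mean(y_obs, model_b, eta) for eta in etas]
for eta, a, b in zip(etas, curve_a, curve_b):
    print(f"eta={eta:.1f}  model_a={a:.3f}  model_b={b:.3f}")
```

If one curve lies on or below the other for every eta, that model is preferred under every scoring function that has the mixture representation; crossing curves indicate that the ranking depends on the chosen score.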

References
[Ehm2015]

W. Ehm, T. Gneiting, A. Jordan, F. Krüger. "Of Quantiles and Expectiles: Consistent Scoring Functions, Choquet Representations, and Forecast Rankings". arxiv:1503.08195.