scoring
Tools to assess predictive model performance.
ElementaryScore
Elementary scoring function.
The smaller the better.
The elementary scoring function is consistent for the specified functional for
all values of eta and is the main ingredient for Murphy diagrams.
See Notes for further details.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| eta | float | Free parameter. | required |
| functional | str | The functional that is induced by the identification function \(V\). | 'mean' |
| level | float | The level of the expectile or quantile. (Often called \(\alpha\).) It must be `0 < level < 1`. | 0.5 |
Notes
The elementary scoring or loss function is given by
\[
S_\eta(y, z) = (\mathbf{1}\{\eta \le z\} - \mathbf{1}\{\eta \le y\}) V(y, \eta)
\]
with identification functions \(V\) for the given functional \(T\). It allows for the mixture or Choquet representation
\[
S(y, z) = \int S_\eta(y, z) \,dH(\eta)
\]
for some locally finite measure \(H\). It follows that the scoring function \(S\) is consistent for \(T\).
References
[Jordan2022]-
A.I. Jordan, A. Mühlemann, J.F. Ziegel. "Characterizing the optimal solutions to the isotonic regression problem for identifiable functionals". (2022) doi:10.1007/s10463-021-00808-0
[GneitingResin2022]-
T. Gneiting, J. Resin. "Regression Diagnostics meets Forecast Evaluation: Conditional Calibration, Reliability Diagrams, and Coefficient of Determination". arxiv:2108.03210
Examples:
>>> el_score = ElementaryScore(eta=2, functional="mean")
>>> el_score(y_obs=[1, 2, 2, 1], y_pred=[4, 1, 2, 3])
np.float64(0.5)
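The value in the example above can be checked by hand. The following is a minimal pure-Python sketch, not the library's implementation. It assumes the elementary score has the form \((\mathbf{1}\{\eta \le z\} - \mathbf{1}\{\eta \le y\}) V(y, \eta)\) and that the identification function of the mean is \(V(y, \eta) = \eta - y\); the helper name `elementary_score_mean` is hypothetical.

```python
def elementary_score_mean(y_obs, y_pred, eta):
    """Average elementary score for the mean (illustrative sketch only)."""
    total = 0.0
    for y, z in zip(y_obs, y_pred):
        # Difference of indicators 1{eta <= z} - 1{eta <= y}.
        indicator_diff = int(eta <= z) - int(eta <= y)
        # Assumed identification function of the mean: V(y, eta) = eta - y.
        total += indicator_diff * (eta - y)
    return total / len(y_obs)


result = elementary_score_mean([1, 2, 2, 1], [4, 1, 2, 3], eta=2)
print(result)  # 0.5
```

Only the first and last observations contribute, each with a score of 1, because `eta=2` lies strictly above the observation and at or below the prediction only for those two pairs.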
__call__(y_obs, y_pred, weights=None)
Mean or average score.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| y_obs | array-like of shape (n_obs) | Observed values of the response variable. | required |
| y_pred | array-like of shape (n_obs) | Predicted values of the `functional` of interest. | required |
| weights | array-like of shape (n_obs) or None | Case weights. | None |
Returns:
| Name | Type | Description |
|---|---|---|
| score | float | The average score. |
score_per_obs(y_obs, y_pred)
Score per observation.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| y_obs | array-like of shape (n_obs) | Observed values of the response variable. | required |
| y_pred | array-like of shape (n_obs) | Predicted values of the `functional` of interest. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| score_per_obs | ndarray | Values of the scoring function for each observation. |
GammaDeviance
Gamma deviance.
The smaller the better, minimum is zero.
The Gamma deviance is strictly consistent for the mean.
It has a degree of homogeneity of 0 and is therefore insensitive to a change of
units or multiplication of y_obs and y_pred by the same positive constant.
Attributes:
| Name | Type | Description |
|---|---|---|
| functional | str | "mean" |
Notes
\(S(y, z) = 2(\frac{y}{z} -\log\frac{y}{z} - 1)\)
Examples:
>>> gd = GammaDeviance()
>>> gd(y_obs=[3, 2, 1, 1], y_pred=[2, 1, 1, 2])
np.float64(0.2972674459459178)
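The formula in the Notes and the claimed homogeneity of degree 0 can be checked with a few lines of plain Python. This is an illustrative sketch rather than the library's code; the function name `gamma_deviance` is hypothetical.

```python
import math


def gamma_deviance(y_obs, y_pred):
    """Average Gamma deviance 2 * (y/z - log(y/z) - 1) (illustrative sketch)."""
    scores = [2 * (y / z - math.log(y / z) - 1) for y, z in zip(y_obs, y_pred)]
    return sum(scores) / len(scores)


y_obs, y_pred = [3, 2, 1, 1], [2, 1, 1, 2]
base = gamma_deviance(y_obs, y_pred)
# Rescale both arguments by the same positive constant, e.g. a change of units.
scaled = gamma_deviance([1000 * y for y in y_obs], [1000 * z for z in y_pred])
print(base)
print(math.isclose(base, scaled))  # True
```

The score depends on `y_obs` and `y_pred` only through the ratio `y / z`, which is why the rescaling leaves it unchanged.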
__call__(y_obs, y_pred, weights=None)
Mean or average score.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| y_obs | array-like of shape (n_obs) | Observed values of the response variable. | required |
| y_pred | array-like of shape (n_obs) | Predicted values of the `functional` of interest. | required |
| weights | array-like of shape (n_obs) or None | Case weights. | None |
Returns:
| Name | Type | Description |
|---|---|---|
| score | float | The average score. |
score_per_obs(y_obs, y_pred)
Score per observation.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| y_obs | array-like of shape (n_obs) | Observed values of the response variable. | required |
| y_pred | array-like of shape (n_obs) | Predicted values of the `functional` of interest. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| score_per_obs | ndarray | Values of the scoring function for each observation. |
HomogeneousExpectileScore
Homogeneous scoring function of degree h for expectiles.
The smaller the better, minimum is zero.
Up to a multiplicative constant, these are the only scoring functions that are both homogeneous and strictly consistent for expectiles at level alpha. The possible additive constant is chosen such that the minimal function value equals zero.
Note that the 1/2-expectile (level alpha=0.5) equals the mean.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| degree | float | Degree of homogeneity. | 2 |
| level | float | The level of the expectile. (Often called \(\alpha\).) It must be `0 < level < 1`. | 0.5 |
Attributes:
| Name | Type | Description |
|---|---|---|
| functional | str | "mean" if `level=0.5`, else "expectile" |
Notes
The homogeneous score of degree \(h\) is given by
\[
S_\alpha^h(y, z) = 2 |\mathbf{1}\{z \ge y\} - \alpha| \frac{2}{h(h-1)} \left(|y|^h - |z|^h - h \operatorname{sign}(z) |z|^{h-1} (y-z)\right)
\]
Note that the first term, \(2 |\mathbf{1}\{z \ge y\} - \alpha|\), equals 1 for \(\alpha=0.5\). There are important domain restrictions and limits:
- \(h>1\): All real numbers \(y\) and \(z\) are allowed.
  Special case \(h=2, \alpha=\frac{1}{2}\) equals the squared error, aka Normal deviance \(S(y, z) = (y - z)^2\).
- \(0 < h \leq 1\): Only \(y \geq 0\), \(z>0\) are allowed.
  Special case \(h=1, \alpha=\frac{1}{2}\) (by taking the limit) equals the Poisson deviance \(S(y, z) = 2(y\log\frac{y}{z} - y + z)\).
- \(h \leq 0\): Only \(y>0\), \(z>0\) are allowed.
  Special case \(h=0, \alpha=\frac{1}{2}\) (by taking the limit) equals the Gamma deviance \(S(y, z) = 2(\frac{y}{z} -\log\frac{y}{z} - 1)\).

For the common domains, \(S_{\frac{1}{2}}^h\) equals the Tweedie deviance with the following relation between the degree of homogeneity \(h\) and the Tweedie power \(p\): \(h = 2-p\).
References
[Gneiting2011]-
T. Gneiting. "Making and Evaluating Point Forecasts". (2011) doi:10.1198/jasa.2011.r10138 arxiv:0912.0902
Examples:
>>> hes = HomogeneousExpectileScore(degree=2, level=0.1)
>>> hes(y_obs=[0, 0, 1, 1], y_pred=[-1, 1, 1, 2])
np.float64(0.95)
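The special cases listed in the Notes can be checked numerically. The sketch below is plain Python and not the library's implementation. It assumes the general form \(2 |\mathbf{1}\{z \ge y\} - \alpha| \frac{2}{h(h-1)} (|y|^h - |z|^h - h \operatorname{sign}(z) |z|^{h-1} (y-z))\) for degrees other than 0 and 1, which reproduces the example value and the stated limits; the function names are hypothetical.

```python
import math


def homogeneous_expectile_score(y, z, degree=2.0, level=0.5):
    """Score of one observation y and prediction z (sketch, degree not in {0, 1})."""
    h = degree
    # First factor: 2 * |1{z >= y} - alpha|, which equals 1 for alpha = 0.5.
    asymmetry = 2 * abs(float(z >= y) - level)
    sign_z = math.copysign(1.0, z) if z != 0 else 0.0
    bracket = abs(y) ** h - abs(z) ** h - h * sign_z * abs(z) ** (h - 1) * (y - z)
    return asymmetry * 2 / (h * (h - 1)) * bracket


def mean_score(y_obs, y_pred, **kwargs):
    """Unweighted average of the per-observation scores."""
    scores = [homogeneous_expectile_score(y, z, **kwargs) for y, z in zip(y_obs, y_pred)]
    return sum(scores) / len(scores)


# Reproduces the documented example (degree=2, level=0.1).
example = mean_score([0, 0, 1, 1], [-1, 1, 1, 2], degree=2, level=0.1)
# degree=2 and level=0.5 give the squared error (3 - 5)^2 = 4.
squared = homogeneous_expectile_score(3, 5, degree=2, level=0.5)
# A degree close to 1 approximates the Poisson deviance.
near_one = homogeneous_expectile_score(2, 3, degree=1 + 1e-6, level=0.5)
poisson = 2 * (2 * math.log(2 / 3) - 2 + 3)
print(example, squared, near_one, poisson)
```

The degrees 0 and 1 themselves are excluded because the factor `2 / (h * (h - 1))` is undefined there; those cases are defined by taking the limit.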
__call__(y_obs, y_pred, weights=None)
Mean or average score.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| y_obs | array-like of shape (n_obs) | Observed values of the response variable. | required |
| y_pred | array-like of shape (n_obs) | Predicted values of the `functional` of interest. | required |
| weights | array-like of shape (n_obs) or None | Case weights. | None |
Returns:
| Name | Type | Description |
|---|---|---|
| score | float | The average score. |
score_per_obs(y_obs, y_pred)
Score per observation.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| y_obs | array-like of shape (n_obs) | Observed values of the response variable. | required |
| y_pred | array-like of shape (n_obs) | Predicted values of the `functional` of interest. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| score_per_obs | ndarray | Values of the scoring function for each observation. |
HomogeneousQuantileScore
Homogeneous scoring function of degree h for quantiles.
The smaller the better, minimum is zero.
Up to a multiplicative constant, these are the only scoring functions that are both homogeneous and strictly consistent for quantiles at level alpha. The possible additive constant is chosen such that the minimal function value equals zero.
Note that the 1/2-quantile (level alpha=0.5) equals the median.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| degree | float | Degree of homogeneity. | 2 |
| level | float | The level of the quantile. (Often called \(\alpha\).) It must be `0 < level < 1`. | 0.5 |
Attributes:
| Name | Type | Description |
|---|---|---|
| functional | str | "quantile" |
Notes
The homogeneous score of degree \(h\) is given by
\[
S_\alpha^h(y, z) = (\mathbf{1}\{z \ge y\} - \alpha) \frac{z^h - y^h}{h}
\]
There are important domain restrictions and limits:
- \(h\) positive odd integer: All real numbers \(y\) and \(z\) are allowed.
    - Special case \(h=1\) equals the pinball loss, \(S(y, z) = (\mathbf{1}\{z \ge y\} - \alpha) (z - y)\).
    - Special case \(h=1, \alpha=\frac{1}{2}\) equals half the absolute error \(S(y, z) = \frac{1}{2}|z - y|\).
- \(h\) real valued: Only \(y>0\), \(z>0\) are allowed.
  Special case \(h=0\) (by taking the limit) equals \(S(y, z) = (\mathbf{1}\{z \ge y\} - \alpha) \log\frac{z}{y}\).
References
[Gneiting2011]-
T. Gneiting. "Making and Evaluating Point Forecasts". (2011) doi:10.1198/jasa.2011.r10138 arxiv:0912.0902
Examples:
>>> hqs = HomogeneousQuantileScore(degree=3, level=0.1)
>>> hqs(y_obs=[0, 0, 1, 1], y_pred=[-1, 1, 1, 2])
np.float64(0.6083333333333334)
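The example value and the special case \(h=1\) can be reproduced in plain Python. This sketch is not the library's code. It assumes the general form \((\mathbf{1}\{z \ge y\} - \alpha)(z^h - y^h)/h\) for \(h \neq 0\), which is consistent with the pinball special case stated in the Notes; the function names are hypothetical.

```python
import math


def homogeneous_quantile_score(y, z, degree=1, level=0.5):
    """Score of one observation y and prediction z (sketch, assumes degree != 0)."""
    return (float(z >= y) - level) * (z**degree - y**degree) / degree


def mean_score(y_obs, y_pred, **kwargs):
    """Unweighted average of the per-observation scores."""
    scores = [homogeneous_quantile_score(y, z, **kwargs) for y, z in zip(y_obs, y_pred)]
    return sum(scores) / len(scores)


# Reproduces the documented example (degree=3, level=0.1).
example = mean_score([0, 0, 1, 1], [-1, 1, 1, 2], degree=3, level=0.1)
# degree=1 and level=0.5 give half the absolute error, here 0.5 * |1 - 3| = 1.
half_abs_over = homogeneous_quantile_score(1, 3, degree=1, level=0.5)
half_abs_under = homogeneous_quantile_score(3, 1, degree=1, level=0.5)
print(example, half_abs_over, half_abs_under)
```

Both factors change sign together when the prediction moves from above to below the observation, so the score stays non-negative.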
__call__(y_obs, y_pred, weights=None)
Mean or average score.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| y_obs | array-like of shape (n_obs) | Observed values of the response variable. | required |
| y_pred | array-like of shape (n_obs) | Predicted values of the `functional` of interest. | required |
| weights | array-like of shape (n_obs) or None | Case weights. | None |
Returns:
| Name | Type | Description |
|---|---|---|
| score | float | The average score. |
score_per_obs(y_obs, y_pred)
Score per observation.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| y_obs | array-like of shape (n_obs) | Observed values of the response variable. | required |
| y_pred | array-like of shape (n_obs) | Predicted values of the `functional` of interest. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| score_per_obs | ndarray | Values of the scoring function for each observation. |
LogLoss
Log loss.
The smaller the better, minimum is zero.
The log loss is a strictly consistent scoring function for the mean for observations and predictions in the range 0 to 1. It is also referred to as (half the) Bernoulli deviance, (half the) Binomial log-likelihood, logistic loss and binary cross-entropy. Its minimal function value is zero.
Attributes:
| Name | Type | Description |
|---|---|---|
| functional | str | "mean" |
Notes
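The log loss for \(y,z \in [0,1]\) is given by
\[
S(y, z) = - y \log\frac{z}{y} - (1 - y) \log\frac{1-z}{1-y}
\]
If one restricts to \(y\in \{0, 1\}\), this simplifies to
\[
S(y, z) = - y \log(z) - (1 - y) \log(1-z)
\]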
Examples:
>>> ll = LogLoss()
>>> ll(y_obs=[0, 0.5, 1, 1], y_pred=[0.1, 0.2, 0.8, 0.9], weights=[1, 2, 1, 1])
np.float64(0.17603033705165635)
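The weighted example above can be reproduced with the standard library alone. The sketch below is not the library's implementation. It assumes the log loss is the binary cross-entropy minus the entropy of \(y\) (so that the minimum is zero for non-binary \(y\) as well), that \(0 \log 0\) is treated as 0, and that case weights enter as a weighted average; all function names are hypothetical.

```python
import math


def xlogy(x, y):
    """x * log(y) with the convention 0 * log(0) = 0."""
    return 0.0 if x == 0 else x * math.log(y)


def log_loss(y, z):
    """Log loss of one observation y and prediction z, both in [0, 1]."""
    # Cross-entropy part: - y log(z) - (1 - y) log(1 - z).
    cross_entropy = -xlogy(y, z) - xlogy(1 - y, 1 - z)
    # Entropy part, zero for y in {0, 1}; subtracting it makes the minimum zero.
    entropy = -xlogy(y, y) - xlogy(1 - y, 1 - y)
    return cross_entropy - entropy


def weighted_mean_log_loss(y_obs, y_pred, weights):
    """Weighted average of the per-observation log loss."""
    weighted = sum(w * log_loss(y, z) for y, z, w in zip(y_obs, y_pred, weights))
    return weighted / sum(weights)


result = weighted_mean_log_loss(
    [0, 0.5, 1, 1], [0.1, 0.2, 0.8, 0.9], weights=[1, 2, 1, 1]
)
print(result)
```

For the non-binary observation `y=0.5`, the entropy term is `log(2)`; without it the score of a perfect prediction `z=0.5` would not be zero.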
__call__(y_obs, y_pred, weights=None)
Mean or average score.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| y_obs | array-like of shape (n_obs) | Observed values of the response variable. | required |
| y_pred | array-like of shape (n_obs) | Predicted values of the `functional` of interest. | required |
| weights | array-like of shape (n_obs) or None | Case weights. | None |
Returns:
| Name | Type | Description |
|---|---|---|
| score | float | The average score. |
score_per_obs(y_obs, y_pred)
Score per observation.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| y_obs | array-like of shape (n_obs) | Observed values of the response variable. | required |
| y_pred | array-like of shape (n_obs) | Predicted values of the `functional` of interest. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| score_per_obs | ndarray | Values of the scoring function for each observation. |
PinballLoss
Pinball loss.
The smaller the better, minimum is zero.
The pinball loss is strictly consistent for quantiles.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| level | float | The level of the quantile. (Often called \(\alpha\).) It must be `0 < level < 1`. | 0.5 |
Attributes:
| Name | Type | Description |
|---|---|---|
| functional | str | "quantile" |
Notes
The pinball loss has degree of homogeneity 1 and is given by
\[
S_\alpha(y, z) = (\mathbf{1}\{z \ge y\} - \alpha) (z - y)
\]
The authors do not know where and when the term pinball loss was coined. It is most famously used in quantile regression.
Examples:
>>> pl = PinballLoss(level=0.9)
>>> pl(y_obs=[0, 0, 1, 1], y_pred=[-1, 1, 1, 2])
np.float64(0.275)
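Strict consistency for quantiles means that, on a sample, the average pinball loss of a constant prediction is minimized at the corresponding empirical quantile. A small pure-Python sketch of this, using the pinball form \((\mathbf{1}\{z \ge y\} - \alpha)(z - y)\), follows; the data, the candidate grid and the function name are made up for illustration.

```python
import math


def mean_pinball_loss(y_obs, y_pred, level):
    """Unweighted average pinball loss (illustrative sketch)."""
    scores = [(float(z >= y) - level) * (z - y) for y, z in zip(y_obs, y_pred)]
    return sum(scores) / len(scores)


# Reproduces the documented example (level=0.9).
example = mean_pinball_loss([0, 0, 1, 1], [-1, 1, 1, 2], level=0.9)

# Hypothetical sample 1, 2, ..., 10 and constant candidate predictions 0, 0.5, ..., 11.
sample = list(range(1, 11))
candidates = [c / 2 for c in range(23)]
best = min(
    candidates,
    key=lambda c: mean_pinball_loss(sample, [c] * len(sample), level=0.85),
)
print(example, best)
```

With ten observations and level 0.85, the minimizer is the ninth smallest value, 9, which is the empirical 0.85-quantile of the sample.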
__call__(y_obs, y_pred, weights=None)
Mean or average score.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| y_obs | array-like of shape (n_obs) | Observed values of the response variable. | required |
| y_pred | array-like of shape (n_obs) | Predicted values of the `functional` of interest. | required |
| weights | array-like of shape (n_obs) or None | Case weights. | None |
Returns:
| Name | Type | Description |
|---|---|---|
| score | float | The average score. |
score_per_obs(y_obs, y_pred)
Score per observation.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| y_obs | array-like of shape (n_obs) | Observed values of the response variable. | required |
| y_pred | array-like of shape (n_obs) | Predicted values of the `functional` of interest. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| score_per_obs | ndarray | Values of the scoring function for each observation. |
PoissonDeviance
Poisson deviance.
The smaller the better, minimum is zero.
The Poisson deviance is strictly consistent for the mean. It has a degree of homogeneity of 1.
Attributes:
| Name | Type | Description |
|---|---|---|
| functional | str | "mean" |
Notes
\(S(y, z) = 2(y\log\frac{y}{z} - y + z)\)
Examples:
>>> pd = PoissonDeviance()
>>> pd(y_obs=[0, 0, 1, 1], y_pred=[2, 1, 1, 2])
np.float64(1.6534264097200273)
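A degree of homogeneity of 1 means that multiplying observations and predictions by the same positive constant multiplies the score by that constant. The following plain-Python sketch, which is not the library's code, evaluates the formula from the Notes with the convention \(0 \log 0 = 0\) and checks this property; the function name is hypothetical.

```python
import math


def poisson_deviance(y_obs, y_pred):
    """Average Poisson deviance 2 * (y log(y/z) - y + z) (illustrative sketch)."""

    def score(y, z):
        # Convention: y * log(y / z) is zero for y = 0.
        y_log = 0.0 if y == 0 else y * math.log(y / z)
        return 2 * (y_log - y + z)

    return sum(score(y, z) for y, z in zip(y_obs, y_pred)) / len(y_obs)


y_obs, y_pred = [0, 0, 1, 1], [2, 1, 1, 2]
base = poisson_deviance(y_obs, y_pred)
# Multiply observations and predictions by the same positive constant t = 10.
scaled = poisson_deviance([10 * y for y in y_obs], [10 * z for z in y_pred])
print(base)
print(math.isclose(scaled, 10 * base))  # True
```

Unlike the Gamma deviance, the Poisson deviance therefore depends on the unit in which the response is measured.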
__call__(y_obs, y_pred, weights=None)
Mean or average score.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| y_obs | array-like of shape (n_obs) | Observed values of the response variable. | required |
| y_pred | array-like of shape (n_obs) | Predicted values of the `functional` of interest. | required |
| weights | array-like of shape (n_obs) or None | Case weights. | None |
Returns:
| Name | Type | Description |
|---|---|---|
| score | float | The average score. |
score_per_obs(y_obs, y_pred)
Score per observation.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| y_obs | array-like of shape (n_obs) | Observed values of the response variable. | required |
| y_pred | array-like of shape (n_obs) | Predicted values of the `functional` of interest. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| score_per_obs | ndarray | Values of the scoring function for each observation. |
SquaredError
Squared error.
The smaller the better, minimum is zero.
The squared error is strictly consistent for the mean. It has a degree of homogeneity of 2. In the context of probabilistic classification, it is also known as Brier score.
Attributes:
| Name | Type | Description |
|---|---|---|
| functional | str | "mean" |
Notes
\(S(y, z) = (y - z)^2\)
Examples:
>>> se = SquaredError()
>>> se(y_obs=[0, 0, 1, 1], y_pred=[-1, 1, 1, 2])
np.float64(0.75)
__call__(y_obs, y_pred, weights=None)
Mean or average score.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| y_obs | array-like of shape (n_obs) | Observed values of the response variable. | required |
| y_pred | array-like of shape (n_obs) | Predicted values of the `functional` of interest. | required |
| weights | array-like of shape (n_obs) or None | Case weights. | None |
Returns:
| Name | Type | Description |
|---|---|---|
| score | float | The average score. |
score_per_obs(y_obs, y_pred)
Score per observation.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| y_obs | array-like of shape (n_obs) | Observed values of the response variable. | required |
| y_pred | array-like of shape (n_obs) | Predicted values of the `functional` of interest. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| score_per_obs | ndarray | Values of the scoring function for each observation. |
decompose(y_obs, y_pred, weights=None, *, scoring_function, functional=None, level=None)
Additive decomposition of scores.
The score is decomposed as
score = miscalibration - discrimination + uncertainty.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| y_obs | array-like of shape (n_obs) | Observed values of the response variable. | required |
| y_pred | array-like of shape (n_obs) or (n_obs, n_models) | Predicted values of the `functional` of interest. | required |
| weights | array-like of shape (n_obs) or None | Case weights. | None |
| scoring_function | callable | A scoring function with signature roughly `fun(y_obs, y_pred, weights) -> float`. | required |
| functional | str or None | The target functional which `y_pred` aims to predict. If `None`, then it will be inferred from `scoring_function.functional`. | None |
| level | float or None | Functionals like expectiles and quantiles have a level (often called alpha). If `None`, then it will be inferred from `scoring_function.level`. | None |
Returns:
| Name | Type | Description |
|---|---|---|
| decomposition | DataFrame | The resulting score decomposition as a dataframe with columns `miscalibration`, `discrimination`, `uncertainty` and `score` (the average score). If `y_pred` contains several predictions, i.e. it is 2-dimensional with shape `(n_obs, n_pred)` and `n_pred > 1`, then there is an additional column that identifies the model. |
Notes
To be precise, this function returns the decomposition of the score in terms of
auto-miscalibration, auto-discrimination (or resolution) and uncertainty (or
entropy), see [FLM2022] and references therein.
The key element is to estimate the recalibrated predictions, i.e. \(T(Y|m(X))\) for
the target functional \(T\) and model predictions \(m(X)\).
This is accomplished by isotonic regression, [Dimitriadis2021] and
[Gneiting2021].
References
[FLM2022]-
T. Fissler, C. Lorentzen, and M. Mayer. "Model Comparison and Calibration Assessment". (2022) arxiv:2202.12780.
[Dimitriadis2021]-
T. Dimitriadis, T. Gneiting, and A. I. Jordan. "Stable reliability diagrams for probabilistic classifiers". (2021) doi:10.1073/pnas.2016191118
[Gneiting2021]-
T. Gneiting and J. Resin. "Regression Diagnostics meets Forecast Evaluation: Conditional Calibration, Reliability Diagrams, and Coefficient of Determination". (2021). arXiv:2108.03210.
Examples:
>>> decompose(y_obs=[0, 0, 1, 1], y_pred=[-1, 1, 1, 2],
... scoring_function=SquaredError())
shape: (1, 4)
┌────────────────┬────────────────┬─────────────┬───────┐
│ miscalibration ┆ discrimination ┆ uncertainty ┆ score │
│ --- ┆ --- ┆ --- ┆ --- │
│ f64 ┆ f64 ┆ f64 ┆ f64 │
╞════════════════╪════════════════╪═════════════╪═══════╡
│ 0.625 ┆ 0.125 ┆ 0.25 ┆ 0.75 │
└────────────────┴────────────────┴─────────────┴───────┘
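The numbers in the example can be retraced by hand for the squared error. The sketch below is pure Python and is not the library's implementation. It assumes that miscalibration is the score of the predictions minus the score of the recalibrated predictions, that discrimination is the score of the marginal (constant mean) prediction minus the score of the recalibrated predictions, and that uncertainty is the score of the marginal prediction. Recalibration is done with a small pool-adjacent-violators routine; all function names are hypothetical.

```python
def isotonic_mean(x, y):
    """Non-decreasing least-squares fit of y on x via pool adjacent violators."""
    # Group observations with tied x, so that ties share one fitted value.
    groups = {}
    for xi, yi in zip(x, y):
        total, count = groups.get(xi, (0.0, 0))
        groups[xi] = (total + yi, count + 1)
    # Each block is [sum of y, number of observations, list of x values].
    blocks = []
    for xi in sorted(groups):
        blocks.append([groups[xi][0], groups[xi][1], [xi]])
        # Merge neighbouring blocks while their means violate monotonicity.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            s, n, xs = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += n
            blocks[-1][2] += xs
    fitted = {xi: s / n for s, n, xs in blocks for xi in xs}
    return [fitted[xi] for xi in x]


def squared_error(y_obs, y_pred):
    return sum((y - z) ** 2 for y, z in zip(y_obs, y_pred)) / len(y_obs)


def decompose_squared_error(y_obs, y_pred):
    recalibrated = isotonic_mean(y_pred, y_obs)
    marginal = [sum(y_obs) / len(y_obs)] * len(y_obs)
    score = squared_error(y_obs, y_pred)
    score_recalibrated = squared_error(y_obs, recalibrated)
    score_marginal = squared_error(y_obs, marginal)
    return {
        "miscalibration": score - score_recalibrated,
        "discrimination": score_marginal - score_recalibrated,
        "uncertainty": score_marginal,
        "score": score,
    }


result = decompose_squared_error([0, 0, 1, 1], [-1, 1, 1, 2])
print(result)
```

Here the two tied predictions of 1 are recalibrated to the average of their observations, 0.5, so the recalibrated predictions are `[0, 0.5, 0.5, 1]` with a score of 0.125.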
plot_murphy_diagram(y_obs, y_pred, weights=None, *, etas=100, functional='mean', level=0.5, ax=None)
Plot a Murphy diagram.
A Murphy diagram plots the scores of elementary scoring functions ElementaryScore
over a range of their free parameter eta. This shows whether a model dominates all
others over a wide class of scoring functions or whether the ranking depends strongly
on the choice of scoring function.
See Notes for further details.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| y_obs | array-like of shape (n_obs) | Observed values of the response variable. For binary classification, y_obs is expected to be in the interval [0, 1]. | required |
| y_pred | array-like of shape (n_obs) or (n_obs, n_models) | Predicted values, e.g. for the conditional expectation of the response, \(E(Y \mid X)\). | required |
| weights | array-like of shape (n_obs) or None | Case weights. | None |
| etas | int or array-like | If an integer is given, equidistant points between min and max y values are generated. If an array-like is given, those points are used. | 100 |
| functional | str | The functional that is induced by the identification function. | 'mean' |
| level | float | The level of the expectile or quantile. (Often called \(\alpha\).) It must be `0 < level < 1`. | 0.5 |
| ax | Axes | Axes object to draw the plot onto, otherwise uses the current Axes. | None |
Returns:
| Name | Type | Description |
|---|---|---|
| ax | | Either the matplotlib axes or the plotly figure. This is configurable by setting the plot backend. |
References
[Ehm2015]-
W. Ehm, T. Gneiting, A. Jordan, F. Krüger. "Of Quantiles and Expectiles: Consistent Scoring Functions, Choquet Representations, and Forecast Rankings". arxiv:1503.08195.
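The data behind a Murphy diagram for the mean can be sketched without any plotting. The following pure-Python example is not the library's implementation. It assumes the elementary score \((\mathbf{1}\{\eta \le z\} - \mathbf{1}\{\eta \le y\})(\eta - y)\) for the mean, uses cell midpoints as an equidistant grid (a choice made here so that a Riemann sum is accurate), and compares two hypothetical models. It also checks that integrating the curve against twice the Lebesgue measure recovers the squared error, an instance of the mixture representation.

```python
def elementary_score_mean(y_obs, y_pred, eta):
    """Average elementary score for the mean (illustrative sketch only)."""
    total = 0.0
    for y, z in zip(y_obs, y_pred):
        total += (int(eta <= z) - int(eta <= y)) * (eta - y)
    return total / len(y_obs)


def murphy_curve(y_obs, y_pred, etas):
    """Elementary scores over a grid of eta values, i.e. one line of the diagram."""
    return [elementary_score_mean(y_obs, y_pred, eta) for eta in etas]


y_obs = [1, 2, 2, 1]
model_a = [4, 1, 2, 3]
model_b = [1.5, 1.5, 1.5, 1.5]  # hypothetical second model: constant mean prediction

# Equidistant cell midpoints between the smallest and largest value (assumed grid).
low, high, n = 1.0, 4.0, 3000
width = (high - low) / n
etas = [low + (i + 0.5) * width for i in range(n)]
curve_a = murphy_curve(y_obs, model_a, etas)
curve_b = murphy_curve(y_obs, model_b, etas)

# Mixture representation with H = 2 * Lebesgue measure recovers the squared error.
squared_error_a = 2 * width * sum(curve_a)
print(round(squared_error_a, 6))  # 3.5

# Model b is at least as good as model a for every eta on the grid.
b_dominates = all(b <= a for a, b in zip(curve_a, curve_b))
print(b_dominates)  # True
```

Because the curve of model b lies on or below that of model a for every eta, model b is preferred under every scoring function that is a mixture of these elementary scores, including the squared error.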