Skip to content

scoring

scoring

ElementaryScore

Elementary scoring function.

The smaller the better.

The elementary scoring function is consistent for the specified functional for all values of eta and is the main ingredient for Murphy diagrams. See Notes for further details.

Parameters:

Name Type Description Default
eta float

Free parameter.

required
functional str

The functional that is induced by the identification function V. Options are:

  • "mean". Argument level is neglected.
  • "median". Argument level is neglected.
  • "expectile"
  • "quantile"
'mean'
level float

The level of the expectile of quantile. (Often called \(\alpha\).) It must be 0 < level < 1. level=0.5 and functional="expectile" gives the mean. level=0.5 and functional="quantile" gives the median.

0.5
Notes

The elementary scoring or loss function is given by

\[ S_\eta(y, z) = (\mathbf{1}\{\eta \le z\} - \mathbf{1}\{\eta \le y\}) V(y, \eta) \]

with identification functions \(V\) for the given functional \(T\) . If allows for the mixture or Choquet representation

\[ S(y, z) = \int S_\eta(y, z) \,dH(\eta) \]

for some locally finite measure \(H\). It follows that the scoring function \(S\) is consistent for \(T\).

References
[Jordan2022]

A.I. Jordan, A. Mühlemann, J.F. Ziegel. "Characterizing the optimal solutions to the isotonic regression problem for identifiable functionals". (2022) doi:10.1007/s10463-021-00808-0

[GneitingResin2022]

T. Gneiting, J. Resin. "Regression Diagnostics meets Forecast Evaluation: Conditional Calibration, Reliability Diagrams, and Coefficient of Determination". arxiv:2108.03210

Examples:

>>> el_score = ElementaryScore(eta=2, functional="mean")
>>> el_score(y_obs=[1, 2, 2, 1], y_pred=[4, 1, 2, 3])
0.5
Source code in src/model_diagnostics/scoring/scoring.py
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
class ElementaryScore(_BaseScoringFunction):
    r"""Elementary scoring function.

    The smaller the better.

    The elementary scoring function is consistent for the specified `functional` for
    all values of `eta` and is the main ingredient for Murphy diagrams.
    See [Notes](#notes) for further details.

    Parameters
    ----------
    eta : float
        Free parameter.
    functional : str
        The functional that is induced by the identification function `V`. Options are:

        - `"mean"`. Argument `level` is neglected.
        - `"median"`. Argument `level` is neglected.
        - `"expectile"`
        - `"quantile"`
    level : float
        The level of the expectile of quantile. (Often called \(\alpha\).)
        It must be `0 < level < 1`.
        `level=0.5` and `functional="expectile"` gives the mean.
        `level=0.5` and `functional="quantile"` gives the median.

    Notes
    -----
    The elementary scoring or loss function is given by

    \[
    S_\eta(y, z) = (\mathbf{1}\{\eta \le z\} - \mathbf{1}\{\eta \le y\})
    V(y, \eta)
    \]

    with [identification functions]
    [model_diagnostics.calibration.identification_function]
    \(V\) for the given `functional` \(T\) . If allows for the mixture or Choquet
    representation

    \[
    S(y, z) = \int S_\eta(y, z) \,dH(\eta)
    \]

    for some locally finite measure \(H\). It follows that the scoring function \(S\)
    is consistent for \(T\).

    References
    ----------
    `[Jordan2022]`

    :   A.I. Jordan, A. Mühlemann, J.F. Ziegel.
        "Characterizing the optimal solutions to the isotonic regression problem for
        identifiable functionals". (2022)
        [doi:10.1007/s10463-021-00808-0](https://doi.org/10.1007/s10463-021-00808-0)

    `[GneitingResin2022]`

    :   T. Gneiting, J. Resin.
        "Regression Diagnostics meets Forecast Evaluation: Conditional Calibration,
        Reliability Diagrams, and Coefficient of Determination".
        [arxiv:2108.03210](https://arxiv.org/abs/2108.03210)

    Examples
    --------
    >>> el_score = ElementaryScore(eta=2, functional="mean")
    >>> el_score(y_obs=[1, 2, 2, 1], y_pred=[4, 1, 2, 3])
    0.5
    """

    def __init__(
        self, eta: float, functional: str = "mean", level: float = 0.5
    ) -> None:
        self.eta = eta
        self._functional = functional
        if level <= 0 or level >= 1:
            msg = f"Argument level must fulfil 0 < level < 1, got {level}."
            raise ValueError(msg)
        self.level = level

    @property
    def functional(self):
        return self._functional

    def score_per_obs(
        self,
        y_obs: npt.ArrayLike,
        y_pred: npt.ArrayLike,
    ) -> np.ndarray:
        """Score per observation.

        Parameters
        ----------
        y_obs : array-like of shape (n_obs)
            Observed values of the response variable.
        y_pred : array-like of shape (n_obs)
            Predicted values of the `functional` of interest, e.g. the conditional
            expectation of the response, `E(Y|X)`.

        Returns
        -------
        score_per_obs : ndarray
            Values of the scoring function for each observation.
        """
        y: np.ndarray
        z: np.ndarray
        y, z = validate_2_arrays(y_obs, y_pred)

        eta = self.eta
        eta_term = np.less_equal(eta, z).astype(float) - np.less_equal(eta, y)
        return eta_term * identification_function(
            y_obs=y_obs,
            y_pred=np.full(y.shape, fill_value=float(eta)),
            functional=self.functional,
            level=self.level,
        )

score_per_obs(y_obs, y_pred)

Score per observation.

Parameters:

Name Type Description Default
y_obs array-like of shape (n_obs)

Observed values of the response variable.

required
y_pred array-like of shape (n_obs)

Predicted values of the functional of interest, e.g. the conditional expectation of the response, E(Y|X).

required

Returns:

Name Type Description
score_per_obs ndarray

Values of the scoring function for each observation.

Source code in src/model_diagnostics/scoring/scoring.py
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
def score_per_obs(
    self,
    y_obs: npt.ArrayLike,
    y_pred: npt.ArrayLike,
) -> np.ndarray:
    """Score per observation.

    Parameters
    ----------
    y_obs : array-like of shape (n_obs)
        Observed values of the response variable.
    y_pred : array-like of shape (n_obs)
        Predicted values of the `functional` of interest, e.g. the conditional
        expectation of the response, `E(Y|X)`.

    Returns
    -------
    score_per_obs : ndarray
        Values of the scoring function for each observation.
    """
    y: np.ndarray
    z: np.ndarray
    y, z = validate_2_arrays(y_obs, y_pred)

    eta = self.eta
    eta_term = np.less_equal(eta, z).astype(float) - np.less_equal(eta, y)
    return eta_term * identification_function(
        y_obs=y_obs,
        y_pred=np.full(y.shape, fill_value=float(eta)),
        functional=self.functional,
        level=self.level,
    )

GammaDeviance

Gamma deviance.

The smaller the better, minimum is zero.

The Gamma deviance is strictly consistent for the mean. It has a degree of homogeneity of 0 and is therefore insensitive to a change of units or multiplication of y_obs and y_pred by the same positive constant.

Attributes:

Name Type Description
functional str

"mean"

Notes

\(S(y, z) = 2(\frac{y}{z} -\log\frac{y}{z} - 1)\)

Examples:

>>> gd = GammaDeviance()
>>> gd(y_obs=[3, 2, 1, 1], y_pred=[2, 1, 1 , 2])
0.2972674459459178
Source code in src/model_diagnostics/scoring/scoring.py
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
class GammaDeviance(HomogeneousExpectileScore):
    r"""Gamma deviance.

    The smaller the better, minimum is zero.

    The Gamma deviance is strictly consistent for the mean.
    It has a degree of homogeneity of 0 and is therefore insensitive to a change of
    units or multiplication of `y_obs` and `y_pred` by the same positive constant.

    Attributes
    ----------
    functional: str
        "mean"

    Notes
    -----
    \(S(y, z) = 2(\frac{y}{z} -\log\frac{y}{z} - 1)\)

    Examples
    --------
    >>> gd = GammaDeviance()
    >>> gd(y_obs=[3, 2, 1, 1], y_pred=[2, 1, 1 , 2])
    0.2972674459459178
    """

    def __init__(self) -> None:
        super().__init__(degree=0, level=0.5)

HomogeneousExpectileScore

Homogeneous scoring function of degree h for expectiles.

The smaller the better, minimum is zero.

Up to a multiplicative constant, these are the only scoring functions that are strictly consistent for expectiles at level alpha and homogeneous functions. The possible additive constant is chosen such that the minimal function value equals zero.

Note that the ½-expectile (level alpha=0.5) equals the mean.

Parameters:

Name Type Description Default
degree float

Degree of homogeneity.

2
level float

The level of the expectile. (Often called \(\alpha\).) It must be 0 < level < 1. level=0.5 gives the mean.

0.5

Attributes:

Name Type Description
functional str

"mean" if level=0.5, else "expectile"

Notes

The homogeneous score of degree \(h\) is given by

\[ S_\alpha^h(y, z) = 2 |\mathbf{1}\{z \ge y\} - \alpha| \frac{2}{h(h-1)} \left(|y|^h - |z|^h - h \operatorname{sign}(z) |z|^{h-1} (y-z)\right) \]

Note that the first term, \(2 |\mathbf{1}\{z \ge y\} - \alpha|\) equals 1 for \(\alpha=0.5\). There are important domain restrictions and limits:

  • \(h>1\): All real numbers \(y\) and \(z\) are allowed.

    Special case \(h=2, \alpha=\frac{1}{2}\) equals the squared error, aka Normal deviance \(S(y, z) = (y - z)^2\).

  • \(0 < h \leq 1\): Only \(y \geq 0\), \(z>0\) are allowed.

    Special case \(h=1, \alpha=\frac{1}{2}\) (by taking the limit) equals the Poisson deviance \(S(y, z) = 2(y\log\frac{y}{z} - y + z)\).

  • \(h \leq 0\): Only \(y>0\), \(z>0\) are allowed.

    Special case \(h=0, \alpha=\frac{1}{2}\) (by taking the limit) equals the Gamma deviance \(S(y, z) = 2(\frac{y}{z} -\log\frac{y}{z} - 1)\).

For the common domains, \(S_{\frac{1}{2}}^h\) equals the Tweedie deviance with the following relation between the degree of homogeneity \(h\) and the Tweedie power \(p\): \(h = 2-p\).

References
[Gneiting2011]

T. Gneiting. "Making and Evaluating Point Forecasts". (2011) doi:10.1198/jasa.2011.r10138 arxiv:0912.0902

Examples:

>>> hes = HomogeneousExpectileScore(degree=2, level=0.1)
>>> hes(y_obs=[0, 0, 1, 1], y_pred=[-1, 1, 1 , 2])
0.95
Source code in src/model_diagnostics/scoring/scoring.py
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
class HomogeneousExpectileScore(_BaseScoringFunction):
    r"""Homogeneous scoring function of degree h for expectiles.

    The smaller the better, minimum is zero.

    Up to a multiplicative constant, these are the only scoring functions that are
    strictly consistent for expectiles at level alpha and homogeneous functions.
    The possible additive constant is chosen such that the minimal function value
    equals zero.

    Note that the 1/2-expectile (level alpha=0.5) equals the mean.

    Parameters
    ----------
    degree : float
        Degree of homogeneity.
    level : float
        The level of the expectile. (Often called \(\alpha\).)
        It must be `0 < level < 1`.
        `level=0.5` gives the mean.

    Attributes
    ----------
    functional: str
        "mean" if `level=0.5`, else "expectile"

    Notes
    -----
    The homogeneous score of degree \(h\) is given by

    \[
    S_\alpha^h(y, z) = 2 |\mathbf{1}\{z \ge y\} - \alpha| \frac{2}{h(h-1)}
    \left(|y|^h - |z|^h - h \operatorname{sign}(z) |z|^{h-1} (y-z)\right)
    \]

    Note that the first term, \(2 |\mathbf{1}\{z \ge y\} - \alpha|\) equals 1 for
    \(\alpha=0.5\).
    There are important domain restrictions and limits:

    - \(h>1\): All real numbers \(y\) and \(z\) are allowed.

        Special case \(h=2, \alpha=\frac{1}{2}\) equals the squared error, aka Normal
        deviance \(S(y, z) = (y - z)^2\).

    - \(0 < h \leq 1\): Only \(y \geq 0\), \(z>0\) are allowed.

        Special case \(h=1, \alpha=\frac{1}{2}\) (by taking the limit) equals the
        Poisson deviance \(S(y, z) = 2(y\log\frac{y}{z} - y + z)\).

    - \(h \leq 0\): Only \(y>0\), \(z>0\) are allowed.

        Special case \(h=0, \alpha=\frac{1}{2}\) (by taking the limit) equals the Gamma
        deviance \(S(y, z) = 2(\frac{y}{z} -\log\frac{y}{z} - 1)\).

    For the common domains, \(S_{\frac{1}{2}}^h\) equals the
    [Tweedie deviance](https://en.wikipedia.org/wiki/Tweedie_distribution) with the
    following relation between the degree of homogeneity \(h\) and the Tweedie
    power \(p\): \(h = 2-p\).

    References
    ----------
    `[Gneiting2011]`

    :   T. Gneiting.
        "Making and Evaluating Point Forecasts". (2011)
        [doi:10.1198/jasa.2011.r10138](https://doi.org/10.1198/jasa.2011.r10138)
        [arxiv:0912.0902](https://arxiv.org/abs/0912.0902)

    Examples
    --------
    >>> hes = HomogeneousExpectileScore(degree=2, level=0.1)
    >>> hes(y_obs=[0, 0, 1, 1], y_pred=[-1, 1, 1 , 2])
    0.95
    """

    def __init__(self, degree: float = 2, level: float = 0.5) -> None:
        self.degree = degree
        if level <= 0 or level >= 1:
            msg = f"Argument level must fulfil 0 < level < 1, got {level}."
            raise ValueError(msg)
        self.level = level

    @property
    def functional(self):
        if self.level == 0.5:
            return "mean"
        else:
            return "expectile"

    def score_per_obs(
        self,
        y_obs: npt.ArrayLike,
        y_pred: npt.ArrayLike,
    ) -> np.ndarray:
        """Score per observation.

        Parameters
        ----------
        y_obs : array-like of shape (n_obs)
            Observed values of the response variable.
        y_pred : array-like of shape (n_obs)
            Predicted values of the `functional` of interest, e.g. the conditional
            expectation of the response, `E(Y|X)`.

        Returns
        -------
        score_per_obs : ndarray
            Values of the scoring function for each observation.
        """
        y: np.ndarray
        z: np.ndarray
        y, z = validate_2_arrays(y_obs, y_pred)

        if self.degree == 2:
            # Fast path
            score = np.square(z - y)
        elif self.degree > 1:
            z_abs = np.abs(z)
            score = 2 * (
                (np.power(np.abs(y), self.degree) - np.power(z_abs, self.degree))
                / (self.degree * (self.degree - 1))
                - np.sign(z)
                / (self.degree - 1)
                * np.power(z_abs, self.degree - 1)
                * (y - z)
            )
        elif self.degree == 1:
            # Domain: y >= 0 and z > 0
            if not np.all((y >= 0) & (z > 0)):
                msg = (
                    f"Valid domain for degree={self.degree} is y_obs >= 0 and "
                    "y_pred > 0."
                )
                raise ValueError(msg)
            score = 2 * (special.xlogy(y, y / z) - y + z)
        elif self.degree == 0:
            # Domain: y > 0 and z > 0.
            if not np.all((y > 0) & (z > 0)):
                msg = (
                    f"Valid domain for degree={self.degree} is "
                    "y_obs > 0 and y_pred > 0."
                )
                raise ValueError(msg)
            y_z = y / z
            score = 2 * (y_z - np.log(y_z) - 1)
        else:  # self.degree < 1
            # Domain: y >= 0 and z > 0 for 0 < self.degree < 1
            # Domain: y > 0  and z > 0 else
            if self.degree > 0:
                if not np.all((y >= 0) & (z > 0)):
                    msg = (
                        f"Valid domain for degree={self.degree} is "
                        "y_obs >= 0 and y_pred > 0."
                    )
                    raise ValueError(msg)
            elif not np.all((y > 0) & (z > 0)):
                msg = (
                    f"Valid domain for degree={self.degree} is "
                    "y_obs > 0 and y_pred > 0."
                )
                raise ValueError(msg)
            # Note: We add 0.0 to be sure we have floating points. Integers are not
            # allowerd to be raised to a negative power.
            score = 2 * (
                (np.power(y + 0.0, self.degree) - np.power(z + 0.0, self.degree))
                / (self.degree * (self.degree - 1))
                - 1 / (self.degree - 1) * np.power(z + 0.0, self.degree - 1) * (y - z)
            )

        if self.level == 0.5:
            return score
        else:
            return 2 * np.abs(np.greater_equal(z, y) - self.level) * score

score_per_obs(y_obs, y_pred)

Score per observation.

Parameters:

Name Type Description Default
y_obs array-like of shape (n_obs)

Observed values of the response variable.

required
y_pred array-like of shape (n_obs)

Predicted values of the functional of interest, e.g. the conditional expectation of the response, E(Y|X).

required

Returns:

Name Type Description
score_per_obs ndarray

Values of the scoring function for each observation.

Source code in src/model_diagnostics/scoring/scoring.py
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
def score_per_obs(
    self,
    y_obs: npt.ArrayLike,
    y_pred: npt.ArrayLike,
) -> np.ndarray:
    """Score per observation.

    Parameters
    ----------
    y_obs : array-like of shape (n_obs)
        Observed values of the response variable.
    y_pred : array-like of shape (n_obs)
        Predicted values of the `functional` of interest, e.g. the conditional
        expectation of the response, `E(Y|X)`.

    Returns
    -------
    score_per_obs : ndarray
        Values of the scoring function for each observation.
    """
    y: np.ndarray
    z: np.ndarray
    y, z = validate_2_arrays(y_obs, y_pred)

    if self.degree == 2:
        # Fast path
        score = np.square(z - y)
    elif self.degree > 1:
        z_abs = np.abs(z)
        score = 2 * (
            (np.power(np.abs(y), self.degree) - np.power(z_abs, self.degree))
            / (self.degree * (self.degree - 1))
            - np.sign(z)
            / (self.degree - 1)
            * np.power(z_abs, self.degree - 1)
            * (y - z)
        )
    elif self.degree == 1:
        # Domain: y >= 0 and z > 0
        if not np.all((y >= 0) & (z > 0)):
            msg = (
                f"Valid domain for degree={self.degree} is y_obs >= 0 and "
                "y_pred > 0."
            )
            raise ValueError(msg)
        score = 2 * (special.xlogy(y, y / z) - y + z)
    elif self.degree == 0:
        # Domain: y > 0 and z > 0.
        if not np.all((y > 0) & (z > 0)):
            msg = (
                f"Valid domain for degree={self.degree} is "
                "y_obs > 0 and y_pred > 0."
            )
            raise ValueError(msg)
        y_z = y / z
        score = 2 * (y_z - np.log(y_z) - 1)
    else:  # self.degree < 1
        # Domain: y >= 0 and z > 0 for 0 < self.degree < 1
        # Domain: y > 0  and z > 0 else
        if self.degree > 0:
            if not np.all((y >= 0) & (z > 0)):
                msg = (
                    f"Valid domain for degree={self.degree} is "
                    "y_obs >= 0 and y_pred > 0."
                )
                raise ValueError(msg)
        elif not np.all((y > 0) & (z > 0)):
            msg = (
                f"Valid domain for degree={self.degree} is "
                "y_obs > 0 and y_pred > 0."
            )
            raise ValueError(msg)
        # Note: We add 0.0 to be sure we have floating points. Integers are not
        # allowerd to be raised to a negative power.
        score = 2 * (
            (np.power(y + 0.0, self.degree) - np.power(z + 0.0, self.degree))
            / (self.degree * (self.degree - 1))
            - 1 / (self.degree - 1) * np.power(z + 0.0, self.degree - 1) * (y - z)
        )

    if self.level == 0.5:
        return score
    else:
        return 2 * np.abs(np.greater_equal(z, y) - self.level) * score

HomogeneousQuantileScore

Homogeneous scoring function of degree h for quantiles.

The smaller the better, minimum is zero.

Up to a multiplicative constant, these are the only scoring funtions that are strictly consistent for quantiles at level alpha and homogeneous functions. The possible additive constant is chosen such that the minimal function value equals zero.

Note that the ½-quantile (level alpha=0.5) equals the median.

Parameters:

Name Type Description Default
degree float

Degree of homogeneity.

2
level float

The level of the quantile. (Often called \(\alpha\).) It must be 0 < level < 1. level=0.5 gives the median.

0.5

Attributes:

Name Type Description
functional str

"quantile"

Notes

The homogeneous score of degree \(h\) is given by

\[ S_\alpha^h(y, z) = (\mathbf{1}\{z \ge y\} - \alpha) \frac{z^h - y^h}{h} \]

There are important domain restrictions and limits:

  • \(h\) positive odd integer: All real numbers \(y\) and \(z\) are allowed.

    • Special case \(h=1\) equals the pinball loss, \(S(y, z) = (\mathbf{1}\{z \ge y\} - \alpha) (z - y)\).
    • Special case \(h=1, \alpha=\frac{1}{2}\) equals half the absolute error \(S(y, z) = \frac{1}{2}|z - y|\).
  • \(h\) real valued: Only \(y>0\), \(z>0\) are allowed.

    Special case \(h=0\) (by taking the limit) equals \(S(y, z) = |\mathbf{1}\{z \ge y\} - \alpha| \log\frac{z}{y}\).

References
[Gneiting2011]

T. Gneiting. "Making and Evaluating Point Forecasts". (2011) doi:10.1198/jasa.2011.r10138 arxiv:0912.0902

Examples:

>>> hqs = HomogeneousQuantileScore(degree=3, level=0.1)
>>> hqs(y_obs=[0, 0, 1, 1], y_pred=[-1, 1, 1 , 2])
0.6083333333333334
Source code in src/model_diagnostics/scoring/scoring.py
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
class HomogeneousQuantileScore(_BaseScoringFunction):
    r"""Homogeneous scoring function of degree h for quantiles.

    The smaller the better, minimum is zero.

    Up to a multiplicative constant, these are the only scoring funtions that are
    strictly consistent for quantiles at level alpha and homogeneous functions.
    The possible additive constant is chosen such that the minimal function value
    equals zero.

    Note that the 1/2-quantile (level alpha=0.5) equals the median.

    Parameters
    ----------
    degree : float
        Degree of homogeneity.
    level : float
        The level of the quantile. (Often called \(\alpha\).)
        It must be `0 < level < 1`.
        `level=0.5` gives the median.

    Attributes
    ----------
    functional: str
        "quantile"

    Notes
    -----
    The homogeneous score of degree \(h\) is given by

    \[
    S_\alpha^h(y, z) = (\mathbf{1}\{z \ge y\} - \alpha) \frac{z^h - y^h}{h}
    \]

    There are important domain restrictions and limits:

    - \(h\) positive odd integer: All real numbers \(y\) and \(z\) are allowed.

        - Special case \(h=1\) equals the pinball loss,
          \(S(y, z) = (\mathbf{1}\{z \ge y\} - \alpha) (z - y)\).
        - Special case \(h=1, \alpha=\frac{1}{2}\) equals half the absolute error
          \(S(y, z) = \frac{1}{2}|z - y|\).

    - \(h\) real valued: Only \(y>0\), \(z>0\) are allowed.

        Special case \(h=0\) (by taking the limit) equals
        \(S(y, z) = |\mathbf{1}\{z \ge y\} - \alpha| \log\frac{z}{y}\).

    References
    ----------
    `[Gneiting2011]`

    :   T. Gneiting.
        "Making and Evaluating Point Forecasts". (2011)
        [doi:10.1198/jasa.2011.r10138](https://doi.org/10.1198/jasa.2011.r10138)
        [arxiv:0912.0902](https://arxiv.org/abs/0912.0902)

    Examples
    --------
    >>> hqs = HomogeneousQuantileScore(degree=3, level=0.1)
    >>> hqs(y_obs=[0, 0, 1, 1], y_pred=[-1, 1, 1 , 2])
    0.6083333333333334
    """

    def __init__(self, degree: float = 2, level: float = 0.5) -> None:
        self.degree = degree
        if level <= 0 or level >= 1:
            msg = f"Argument level must fulfil 0 < level < 1, got {level}."
            raise ValueError(msg)
        self.level = level

    @property
    def functional(self):
        return "quantile"

    def score_per_obs(
        self,
        y_obs: npt.ArrayLike,
        y_pred: npt.ArrayLike,
    ) -> np.ndarray:
        """Score per observation.

        Parameters
        ----------
        y_obs : array-like of shape (n_obs)
            Observed values of the response variable.
        y_pred : array-like of shape (n_obs)
            Predicted values of the `functional` of interest, e.g. the conditional
            expectation of the response, `E(Y|X)`.

        Returns
        -------
        score_per_obs : ndarray
            Values of the scoring function for each observation.
        """
        y: np.ndarray
        z: np.ndarray
        y, z = validate_2_arrays(y_obs, y_pred)

        if self.degree == 1:
            # Fast path
            score = z - y
        elif self.degree > 1 and self.degree % 2 == 1:
            # Odd positive degree
            score = (np.power(z, self.degree) - np.power(y, self.degree)) / self.degree
        elif self.degree == 0:
            # Domain: y > 0 and z > 0.
            if not np.all((y > 0) & (z > 0)):
                msg = (
                    f"Valid domain for degree={self.degree} is "
                    "y_obs > 0 and y_pred > 0."
                )
                raise ValueError(msg)
            score = np.log(z / y)
        else:
            # Domain: y > 0 and z > 0.
            if not np.all((y > 0) & (z > 0)):
                msg = (
                    f"Valid domain for degree={self.degree} is "
                    "y_obs > 0 and y_pred > 0."
                )
                raise ValueError(msg)
            score = (np.power(z, self.degree) - np.power(y, self.degree)) / self.degree

        if self.level == 0.5:
            return 0.5 * np.abs(score)
        else:
            return (np.greater_equal(z, y) - self.level) * score

score_per_obs(y_obs, y_pred)

Score per observation.

Parameters:

Name Type Description Default
y_obs array-like of shape (n_obs)

Observed values of the response variable.

required
y_pred array-like of shape (n_obs)

Predicted values of the functional of interest, e.g. the conditional expectation of the response, E(Y|X).

required

Returns:

Name Type Description
score_per_obs ndarray

Values of the scoring function for each observation.

Source code in src/model_diagnostics/scoring/scoring.py
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
def score_per_obs(
    self,
    y_obs: npt.ArrayLike,
    y_pred: npt.ArrayLike,
) -> np.ndarray:
    """Score per observation.

    Parameters
    ----------
    y_obs : array-like of shape (n_obs)
        Observed values of the response variable.
    y_pred : array-like of shape (n_obs)
        Predicted values of the `functional` of interest, e.g. the conditional
        expectation of the response, `E(Y|X)`.

    Returns
    -------
    score_per_obs : ndarray
        Values of the scoring function for each observation.
    """
    y: np.ndarray
    z: np.ndarray
    y, z = validate_2_arrays(y_obs, y_pred)

    if self.degree == 1:
        # Fast path
        score = z - y
    elif self.degree > 1 and self.degree % 2 == 1:
        # Odd positive degree
        score = (np.power(z, self.degree) - np.power(y, self.degree)) / self.degree
    elif self.degree == 0:
        # Domain: y > 0 and z > 0.
        if not np.all((y > 0) & (z > 0)):
            msg = (
                f"Valid domain for degree={self.degree} is "
                "y_obs > 0 and y_pred > 0."
            )
            raise ValueError(msg)
        score = np.log(z / y)
    else:
        # Domain: y > 0 and z > 0.
        if not np.all((y > 0) & (z > 0)):
            msg = (
                f"Valid domain for degree={self.degree} is "
                "y_obs > 0 and y_pred > 0."
            )
            raise ValueError(msg)
        score = (np.power(z, self.degree) - np.power(y, self.degree)) / self.degree

    if self.level == 0.5:
        return 0.5 * np.abs(score)
    else:
        return (np.greater_equal(z, y) - self.level) * score

LogLoss

Log loss.

The smaller the better, minimum is zero.

The log loss is a strictly consistent scoring function for the mean for observations and predictions in the range 0 to 1. It is also referred to as (half the) Bernoulli deviance, (half the) Binomial log-likelihood, logistic loss and binary cross-entropy. Its minimal function value is zero.

Attributes:

Name Type Description
functional str

"mean"

Notes

The log loss for \(y,z \in [0,1]\) is given by

\[ S(y, z) = - y \log\frac{z}{y} - (1 - y) \log\frac{1-z}{1-y} \]

If one restricts to \(y\in \{0, 1\}\), this simplifies to

\[ S(y, z) = - y \log(z) - (1 - y) \log(1-z) \]

Examples:

>>> ll = LogLoss()
>>> ll(y_obs=[0, 0.5, 1, 1], y_pred=[0.1, 0.2, 0.8 , 0.9], weights=[1, 2, 1, 1])
0.17603033705165635
Source code in src/model_diagnostics/scoring/scoring.py
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
class LogLoss(_BaseScoringFunction):
    r"""Log loss.

    The smaller the better, minimum is zero.

    The log loss is a strictly consistent scoring function for the mean for
    observations and predictions in the range 0 to 1.
    It is also referred to as (half the) Bernoulli deviance,
    (half the) Binomial log-likelihood, logistic loss and binary cross-entropy.
    Its minimal function value is zero.

    Attributes
    ----------
    functional: str
        "mean"

    Notes
    -----
    The log loss for \(y,z \in [0,1]\) is given by

    \[
    S(y, z) = - y \log\frac{z}{y} - (1 - y) \log\frac{1-z}{1-y}
    \]

    If one restricts to \(y\in \{0, 1\}\), this simplifies to

    \[
    S(y, z) = - y \log(z) - (1 - y) \log(1-z)
    \]

    Examples
    --------
    >>> ll = LogLoss()
    >>> ll(y_obs=[0, 0.5, 1, 1], y_pred=[0.1, 0.2, 0.8 , 0.9], weights=[1, 2, 1, 1])
    0.17603033705165635
    """

    @property
    def functional(self):
        return "mean"

    def score_per_obs(
        self,
        y_obs: npt.ArrayLike,
        y_pred: npt.ArrayLike,
    ) -> np.ndarray:
        """Score per observation.

        Parameters
        ----------
        y_obs : array-like of shape (n_obs)
            Observed values of the response variable.
        y_pred : array-like of shape (n_obs)
            Predicted values of the `functional` of interest, e.g. the conditional
            expectation of the response, `E(Y|X)`.

        Returns
        -------
        score_per_obs : ndarray
            Values of the scoring function for each observation.
        """
        y: np.ndarray
        z: np.ndarray
        y, z = validate_2_arrays(y_obs, y_pred)

        score = -special.xlogy(y, z) - special.xlogy(1 - y, 1 - z)
        if np.any((y > 0) & (y < 1)):
            score += special.xlogy(y, y) + special.xlogy(1 - y, 1 - y)
        return score

score_per_obs(y_obs, y_pred)

Score per observation.

Parameters:

Name Type Description Default
y_obs array-like of shape (n_obs)

Observed values of the response variable.

required
y_pred array-like of shape (n_obs)

Predicted values of the functional of interest, e.g. the conditional expectation of the response, E(Y|X).

required

Returns:

Name Type Description
score_per_obs ndarray

Values of the scoring function for each observation.

Source code in src/model_diagnostics/scoring/scoring.py
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
def score_per_obs(
    self,
    y_obs: npt.ArrayLike,
    y_pred: npt.ArrayLike,
) -> np.ndarray:
    """Score per observation.

    Parameters
    ----------
    y_obs : array-like of shape (n_obs)
        Observed values of the response variable.
    y_pred : array-like of shape (n_obs)
        Predicted values of the `functional` of interest, e.g. the conditional
        expectation of the response, `E(Y|X)`.

    Returns
    -------
    score_per_obs : ndarray
        Values of the scoring function for each observation.
    """
    y: np.ndarray
    z: np.ndarray
    y, z = validate_2_arrays(y_obs, y_pred)

    score = -special.xlogy(y, z) - special.xlogy(1 - y, 1 - z)
    if np.any((y > 0) & (y < 1)):
        score += special.xlogy(y, y) + special.xlogy(1 - y, 1 - y)
    return score

PinballLoss

Pinball loss.

The smaller the better, minimum is zero.

The pinball loss is strictly consistent for quantiles.

Parameters:

Name Type Description Default
level float

The level of the quantile. (Often called \(\alpha\).) It must be 0 < level < 1. level=0.5 gives the median.

0.5

Attributes:

Name Type Description
functional str

"quantile"

Notes

The pinball loss has degree of homogeneity 1 and is given by

\[ S_\alpha(y, z) = (\mathbf{1}\{z \ge y\} - \alpha) (z - y) \]

The authors do not know where and when the term pinball loss was coined. It is most famously used in quantile regression.

Examples:

>>> pl = PinballLoss(level=0.9)
>>> pl(y_obs=[0, 0, 1, 1], y_pred=[-1, 1, 1 , 2])
0.275
Source code in src/model_diagnostics/scoring/scoring.py
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
class PinballLoss(HomogeneousQuantileScore):
    r"""Pinball loss.

    The smaller the better, minimum is zero.

    The pinball loss is strictly consistent for quantiles.

    Parameters
    ----------
    level : float
        The level of the quantile. (Often called \(\alpha\).)
        It must be `0 < level < 1`.
        `level=0.5` gives the median.

    Attributes
    ----------
    functional: str
        "quantile"

    Notes
    -----
    The pinball loss has degree of homogeneity 1 and is given by

    \[
    S_\alpha(y, z) = (\mathbf{1}\{z \ge y\} - \alpha) (z - y)
    \]

    The authors do not know where and when the term *pinball loss* was coined. It is
    most famously used in quantile regression.

    Examples
    --------
    >>> pl = PinballLoss(level=0.9)
    >>> pl(y_obs=[0, 0, 1, 1], y_pred=[-1, 1, 1 , 2])
    0.275
    """

    def __init__(self, level: float = 0.5) -> None:
        super().__init__(degree=1, level=level)

PoissonDeviance

Poisson deviance.

The smaller the better, minimum is zero.

The Poisson deviance is strictly consistent for the mean. It has a degree of homogeneity of 1.

Attributes:

Name Type Description
functional str

"mean"

Notes

\(S(y, z) = 2(y\log\frac{y}{z} - y + z)\)

Examples:

>>> pd = PoissonDeviance()
>>> pd(y_obs=[0, 0, 1, 1], y_pred=[2, 1, 1 , 2])
1.6534264097200273
Source code in src/model_diagnostics/scoring/scoring.py
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
class PoissonDeviance(HomogeneousExpectileScore):
    r"""Poisson deviance.

    The smaller the better, minimum is zero.

    The Poisson deviance is strictly consistent for the mean.
    It has a degree of homogeneity of 1.

    Attributes
    ----------
    functional: str
        "mean"

    Notes
    -----
    \(S(y, z) = 2(y\log\frac{y}{z} - y + z)\)

    Examples
    --------
    >>> pd = PoissonDeviance()
    >>> pd(y_obs=[0, 0, 1, 1], y_pred=[2, 1, 1 , 2])
    1.6534264097200273
    """

    def __init__(self) -> None:
        super().__init__(degree=1, level=0.5)

SquaredError

Squared error.

The smaller the better, minimum is zero.

The squared error is strictly consistent for the mean. It has a degree of homogeneity of 2. In the context of probabilistic classification, it is also known as Brier score.

Attributes:

Name Type Description
functional str

"mean"

Notes

\(S(y, z) = (y - z)^2\)

Examples:

>>> se = SquaredError()
>>> se(y_obs=[0, 0, 1, 1], y_pred=[-1, 1, 1 , 2])
0.75
Source code in src/model_diagnostics/scoring/scoring.py
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
class SquaredError(HomogeneousExpectileScore):
    r"""Squared error.

    The smaller the better, minimum is zero.

    The squared error is strictly consistent for the mean.
    It has a degree of homogeneity of 2.
    In the context of probabilistic classification, it is also known as Brier score.


    Attributes
    ----------
    functional: str
        "mean"

    Notes
    -----
    \(S(y, z) = (y - z)^2\)

    Examples
    --------
    >>> se = SquaredError()
    >>> se(y_obs=[0, 0, 1, 1], y_pred=[-1, 1, 1 , 2])
    0.75
    """

    def __init__(self) -> None:
        super().__init__(degree=2, level=0.5)

decompose(y_obs, y_pred, weights=None, *, scoring_function, functional=None, level=None)

Additive decomposition of scores.

The score is decomposed as score = miscalibration - discrimination + uncertainty.

Parameters:

Name Type Description Default
y_obs array-like of shape (n_obs)

Observed values of the response variable.

required
y_pred array-like of shape (n_obs) or (n_obs, n_models)

Predicted values of the functional of interest, e.g. the conditional expectation of the response, E(Y|X).

required
weights array-like of shape (n_obs) or None

Case weights.

None
scoring_function callable

A scoring function with signature roughly fun(y_obs, y_pred, weights) -> float.

required
functional str or None

The target functional which y_pred aims to predict. If None, then it will be inferred from scoring_function.functional. Options are:

  • "mean". Argument level is neglected.
  • "median". Argument level is neglected.
  • "expectile"
  • "quantile"
None
level float or None

Functionals like expectiles and quantiles have a level (often called alpha). If None, then it will be inferred from scoring_function.level.

None

Returns:

Name Type Description
decomposition DataFrame

The resulting score decomposition as a dataframe with columns:

  • miscalibration
  • discrimination
  • uncertainty
  • score: the average score
If `y_pred` contains several predictions, i.e. it is 2-dimension with shape
`(n_obs, n_pred)` and `n_pred >1`, then there is the additional column:
  • model
Notes

To be precise, this function returns the decomposition of the score in terms of auto-miscalibration, auto-discrimination (or resolution) and uncertainy (or entropy), see [FLM2022] and references therein. The key element is to estimate the recalibrated predictions, i.e. \(T(Y|m(X))\) for the target functional \(T\) and model predictions \(m(X)\). This is accomplished by isotonic regression, [Dimitriadis2021] and [Gneiting2021].

References
[FLM2022]

T. Fissler, C. Lorentzen, and M. Mayer. "Model Comparison and Calibration Assessment". (2022) arxiv:2202.12780.

[Dimitriadis2021]

T. Dimitriadis, T. Gneiting, and A. I. Jordan. "Stable reliability diagrams for probabilistic classifiers". (2021) doi:10.1073/pnas.2016191118

[Gneiting2021]

T. Gneiting and J. Resin. "Regression Diagnostics meets Forecast Evaluation: Conditional Calibration, Reliability Diagrams, and Coefficient of Determination". (2021). arXiv:2108.03210.

Examples:

>>> decompose(y_obs=[0, 0, 1, 1], y_pred=[-1, 1, 1, 2],
... scoring_function=SquaredError())
shape: (1, 4)
┌────────────────┬────────────────┬─────────────┬───────┐
│ miscalibration ┆ discrimination ┆ uncertainty ┆ score │
│ ---            ┆ ---            ┆ ---         ┆ ---   │
│ f64            ┆ f64            ┆ f64         ┆ f64   │
╞════════════════╪════════════════╪═════════════╪═══════╡
│ 0.625          ┆ 0.125          ┆ 0.25        ┆ 0.75  │
└────────────────┴────────────────┴─────────────┴───────┘
Source code in src/model_diagnostics/scoring/scoring.py
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
def decompose(
    y_obs: npt.ArrayLike,
    y_pred: npt.ArrayLike,
    weights: Optional[npt.ArrayLike] = None,
    *,
    scoring_function: Callable[..., Any],  # TODO: make type hint stricter
    functional: Optional[str] = None,
    level: Optional[float] = None,
) -> pl.DataFrame:
    r"""Additive decomposition of scores.

    The score is decomposed as
    `score = miscalibration - discrimination + uncertainty`.

    Parameters
    ----------
    y_obs : array-like of shape (n_obs)
        Observed values of the response variable.
    y_pred : array-like of shape (n_obs) or (n_obs, n_models)
        Predicted values of the `functional` of interest, e.g. the conditional
        expectation of the response, `E(Y|X)`.
    weights : array-like of shape (n_obs) or None
        Case weights.
    scoring_function : callable
        A scoring function with signature roughly
        `fun(y_obs, y_pred, weights) -> float`.
    functional : str or None
        The target functional which `y_pred` aims to predict.
        If `None`, then it will be inferred from `scoring_function.functional`.
        Options are:

        - `"mean"`. Argument `level` is neglected.
        - `"median"`. Argument `level` is neglected.
        - `"expectile"`
        - `"quantile"`
    level : float or None
        Functionals like expectiles and quantiles have a level (often called alpha).
        If `None`, then it will be inferred from `scoring_function.level`.

    Returns
    -------
    decomposition : polars.DataFrame
        The resulting score decomposition as a dataframe with columns:

        - `miscalibration`
        - `discrimination`
        - `uncertainty`
        - `score`: the average score

    If `y_pred` contains several predictions, i.e. it is 2-dimension with shape
    `(n_obs, n_pred)` and `n_pred >1`, then there is the additional column:

        - `model`

    Notes
    -----
    To be precise, this function returns the decomposition of the score in terms of
    auto-miscalibration, auto-discrimination (or resolution) and uncertainy (or
    entropy), see `[FLM2022]` and references therein.
    The key element is to estimate the recalibrated predictions, i.e. \(T(Y|m(X))\) for
    the target functional \(T\) and model predictions \(m(X)\).
    This is accomplished by isotonic regression, `[Dimitriadis2021]` and
    `[Gneiting2021]`.

    References
    ----------
    `[FLM2022]`

    :   T. Fissler, C. Lorentzen, and M. Mayer.
        "Model Comparison and Calibration Assessment". (2022)
        [arxiv:2202.12780](https://arxiv.org/abs/2202.12780).

    `[Dimitriadis2021]`

    :   T. Dimitriadis, T. Gneiting, and A. I. Jordan.
        "Stable reliability diagrams for probabilistic classifiers". (2021)
        [doi:10.1073/pnas.2016191118](https://doi.org/10.1073/pnas.2016191118)

    `[Gneiting2021]`

    :   T. Gneiting and J. Resin.
        "Regression Diagnostics meets Forecast Evaluation: Conditional Calibration,
        Reliability Diagrams, and Coefficient of Determination". (2021).
        [arXiv:2108.03210](https://arxiv.org/abs/2108.03210).

    Examples
    --------
    >>> decompose(y_obs=[0, 0, 1, 1], y_pred=[-1, 1, 1, 2],
    ... scoring_function=SquaredError())
    shape: (1, 4)
    ┌────────────────┬────────────────┬─────────────┬───────┐
    │ miscalibration ┆ discrimination ┆ uncertainty ┆ score │
    │ ---            ┆ ---            ┆ ---         ┆ ---   │
    │ f64            ┆ f64            ┆ f64         ┆ f64   │
    ╞════════════════╪════════════════╪═════════════╪═══════╡
    │ 0.625          ┆ 0.125          ┆ 0.25        ┆ 0.75  │
    └────────────────┴────────────────┴─────────────┴───────┘
    """
    if functional is None:
        if hasattr(scoring_function, "functional"):
            functional = scoring_function.functional
        else:
            msg = (
                "You set functional=None, but scoring_function has no attribute "
                "functional."
            )
            raise ValueError(msg)
    if level is None:
        level = 0.5
        if functional in ("expectile", "quantile"):
            if hasattr(scoring_function, "level"):
                level = float(scoring_function.level)
            else:
                msg = (
                    "You set level=None, but scoring_function has no attribute "
                    "level."
                )
                raise ValueError(msg)

    allowed_functionals = ("mean", "median", "expectile", "quantile")
    if functional not in allowed_functionals:
        msg = (
            f"The functional must be one of {allowed_functionals}, got "
            f"{functional}."
        )
        raise ValueError(msg)
    if functional in ("expectile", "quantile") and (level <= 0 or level >= 1):
        msg = f"The level must fulfil 0 < level < 1, got {level}."
        raise ValueError(msg)

    validate_same_first_dimension(y_obs, y_pred)
    n_pred = length_of_second_dimension(y_pred)
    pred_names, _ = get_sorted_array_names(y_pred)
    y_o = np.asarray(y_obs)

    if weights is None:
        w = None
    else:
        validate_same_first_dimension(weights, y_o)
        w = np.asarray(weights)  # needed to satisfy mypy
        if w.ndim > 1:
            msg = f"The array weights must be 1-dimensional, got weights.ndim={w.ndim}."
            raise ValueError(msg)

    if functional == "mean":
        iso = IsotonicRegression_skl(y_min=None, y_max=None)
        marginal = np.average(y_o, weights=w)
    else:
        iso = IsotonicRegression(functional=functional, level=level)
        if functional == "expectile":
            marginal = expectile(y_o, alpha=level, weights=w)
        elif functional == "quantile":
            marginal = 0.5 * (
                quantile_lower(y_o, level=level) + quantile_upper(y_o, level=level)
            )

    if y_o[0] == marginal == y_o[-1]:
        # y_o is constant. We need to check if y_o is allowed as argument to y_pred.
        # For instance for the poisson deviance, y_o = 0 is allowed. But 0 is forbidden
        # as a prediction.
        try:
            scoring_function(y_o[0], marginal)
        except ValueError as exc:
            msg = (
                "Your y_obs is constant and lies outside the allowed range of y_pred "
                "of your scoring function. Therefore, the score decomposition cannot "
                "be applied."
            )
            raise ValueError(msg) from exc

    # The recalibrated versions, further down, could contain min(y_obs) and that could
    # be outside of the valid domain, e.g. y_pred = 0 for the Poisson deviance where
    # y_obs=0 is allowed. We detect that here:
    y_min = np.amin(y_o)
    y_min_allowed = True
    try:
        scoring_function(y_o[:1], np.array([y_min]), None if w is None else w[:1])
    except ValueError:
        y_min_allowed = False

    marginal = np.full_like(y_o, fill_value=marginal, dtype=float)
    score_marginal = scoring_function(y_o, marginal, w)

    df_list = []
    for i in range(len(pred_names)):
        # Loop over columns of y_pred.
        x = y_pred if n_pred == 0 else get_second_dimension(y_pred, i)
        iso.fit(x, y_o, sample_weight=w)
        recalibrated = np.squeeze(iso.predict(x))
        if not y_min_allowed and recalibrated[0] <= y_min:
            # Oh dear, this needs quite some extra work:
            # First index of value greater than y_min
            idx1 = np.argmax(recalibrated > y_min)
            val1 = recalibrated[idx1]
            # First index of value greater than the value at idx1.
            idx2 = np.argmax(recalibrated > val1)
            # Note that val1 may already be the largest value of the array => idx2 = 0.
            if idx2 == 0:
                idx2 = recalibrated.shape[0]
            # We merge the first 2 blocks of the isotonic regression as it violates
            # our domain requirements.
            re2 = recalibrated[:idx2]
            w2 = None if w is None else w[:idx2]
            if functional == "mean":
                recalibrated[:idx2] = np.average(re2, weights=w2)
            elif functional == "expectile":
                recalibrated[:idx2] = expectile(re2, alpha=level, weights=w2)
            elif functional == "quantile":
                # Note, no scoring function known that could end up here.
                lower = quantile_lower(re2, level=level)
                upper = quantile_upper(re2, level=level)
                recalibrated[:idx2] = 0.5 * (lower + upper)

        score = scoring_function(y_o, x, w)
        try:
            score_recalibrated = scoring_function(y_o, recalibrated, w)
        except ValueError as exc:
            msg = (
                "The recalibrated predictions obtained from isotonic regression are "
                "very likely outside the allowed range of y_pred of your scoring "
                "function. Therefore, the score decomposition cannot be applied."
            )
            raise ValueError(msg) from exc

        df = pl.DataFrame(
            {
                "model": pred_names[i],
                "miscalibration": score - score_recalibrated,
                "discrimination": score_marginal - score_recalibrated,
                "uncertainty": score_marginal,
                "score": score,
            }
        )
        df_list.append(df)

    df = pl.concat(df_list)

    # Remove column "model" for a single model.
    if n_pred <= 1:
        df = df.drop("model")

    return df

plot_murphy_diagram(y_obs, y_pred, weights=None, *, etas=100, functional='mean', level=0.5, ax=None)

Plot a Murphy diagram.

A Murphy diagram plots the scores of elementary scoring functions ElementaryScore over a range of their free parameter eta. This shows, if a model dominates all others over a wide class of scoring functions or if the ranking is very much dependent on the choice of scoring function. See Notes for further details.

Parameters:

Name Type Description Default
y_obs array-like of shape (n_obs)

Observed values of the response variable. For binary classification, y_obs is expected to be in the interval [0, 1].

required
y_pred array-like of shape (n_obs) or (n_obs, n_models)

Predicted values of the conditional expectation of Y, E(Y|X).

required
weights array-like of shape (n_obs) or None

Case weights.

None
etas int or array - like

If an integer is given, equidistant points between min and max y values are generater. If an array-like is given, those points are used.

100
functional str

The functional that is induced by the identification function V. Options are:

  • "mean". Argument level is neglected.
  • "median". Argument level is neglected.
  • "expectile"
  • "quantile"
'mean'
level float

The level of the expectile of quantile. (Often called \(\alpha\).) It must be 0 < level < 1. level=0.5 and functional="expectile" gives the mean. level=0.5 and functional="quantile" gives the median.

0.5
ax Axes

Axes object to draw the plot onto, otherwise uses the current Axes.

None

Returns:

Name Type Description
ax

Either the matplotlib axes or the plotly figure. This is configurable by setting the plot_backend via model_diagnostics.set_config or model_diagnostics.config_context.

Notes

For details, refer to [Ehm2015].

References
[Ehm2015]

W. Ehm, T. Gneiting, A. Jordan, F. Krüger. "Of Quantiles and Expectiles: Consistent Scoring Functions, Choquet Representations, and Forecast Rankings". arxiv:1503.08195.

Source code in src/model_diagnostics/scoring/plots.py
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
def plot_murphy_diagram(
    y_obs: npt.ArrayLike,
    y_pred: npt.ArrayLike,
    weights: Optional[npt.ArrayLike] = None,
    *,
    etas: Union[int, npt.ArrayLike] = 100,
    functional: str = "mean",
    level: float = 0.5,
    ax: Optional[mpl.axes.Axes] = None,
):
    r"""Plot a Murphy diagram.

    A Murphy diagram plots the scores of elementary scoring functions `ElementaryScore`
    over a range of their free parameter `eta`. This shows, if a model dominates all
    others over a wide class of scoring functions or if the ranking is very much
    dependent on the choice of scoring function.
    See [Notes](#notes) for further details.

    Parameters
    ----------
    y_obs : array-like of shape (n_obs)
        Observed values of the response variable.
        For binary classification, y_obs is expected to be in the interval [0, 1].
    y_pred : array-like of shape (n_obs) or (n_obs, n_models)
        Predicted values of the conditional expectation of Y, `E(Y|X)`.
    weights : array-like of shape (n_obs) or None
        Case weights.
    etas : int or array-like
        If an integer is given, equidistant points between min and max y values are
        generater. If an array-like is given, those points are used.
    functional : str
        The functional that is induced by the identification function `V`. Options are:

        - `"mean"`. Argument `level` is neglected.
        - `"median"`. Argument `level` is neglected.
        - `"expectile"`
        - `"quantile"`
    level : float
        The level of the expectile of quantile. (Often called \(\alpha\).)
        It must be `0 < level < 1`.
        `level=0.5` and `functional="expectile"` gives the mean.
        `level=0.5` and `functional="quantile"` gives the median.
    ax : matplotlib.axes.Axes
        Axes object to draw the plot onto, otherwise uses the current Axes.

    Returns
    -------
    ax :
        Either the matplotlib axes or the plotly figure. This is configurable by
        setting the `plot_backend` via
        [`model_diagnostics.set_config`][model_diagnostics.set_config] or
        [`model_diagnostics.config_context`][model_diagnostics.config_context].

    Notes
    -----
    For details, refer to `[Ehm2015]`.

    References
    ----------
    `[Ehm2015]`

    :   W. Ehm, T. Gneiting, A. Jordan, F. Krüger.
        "Of Quantiles and Expectiles: Consistent Scoring Functions, Choquet
        Representations, and Forecast Rankings".
        [arxiv:1503.08195](https://arxiv.org/abs/1503.08195).
    """
    if ax is None:
        plot_backend = get_config()["plot_backend"]
        if plot_backend == "matplotlib":
            ax = plt.gca()
        else:
            import plotly.graph_objects as go

            fig = ax = go.Figure()
    elif isinstance(ax, mpl.axes.Axes):
        plot_backend = "matplotlib"
    elif is_plotly_figure(ax):
        import plotly.graph_objects as go

        plot_backend = "plotly"
        fig = ax
    else:
        msg = (
            "The ax argument must be None, a matplotlib Axes or a plotly Figure, "
            f"got {type(ax)}."
        )
        raise ValueError(msg)

    if (n_cols := length_of_second_dimension(y_obs)) > 0:
        if n_cols == 1:
            y_obs = get_second_dimension(y_obs, 0)
        else:
            msg = (
                f"Array-like y_obs has more than 2 dimensions, y_obs.shape[1]={n_cols}"
            )
            raise ValueError(msg)

    y_pred_min, y_pred_max = get_array_min_max(y_pred)
    y_obs_min, y_obs_max = get_array_min_max(y_obs)
    y_min, y_max = min(y_pred_min, y_obs_min), max(y_pred_max, y_obs_max)

    if y_min == y_max:
        msg = "All values y_obs and y_pred are one single and same value."
        raise ValueError(msg)
    elif isinstance(etas, numbers.Integral):
        etas = np.linspace(y_min, y_max, num=etas, endpoint=True)
    else:
        etas = np.asarray(etas).astype(float)
        if etas.ndim > 1:
            etas = etas.reshape(max(etas.shape))

    def elementary_score(y_obs, y_pred, weights, eta):
        sf = ElementaryScore(eta, functional=functional, level=level)
        return sf(y_obs=y_obs, y_pred=y_pred, weights=weights)

    n_pred = length_of_second_dimension(y_pred)
    pred_names, _ = get_sorted_array_names(y_pred)

    for i in range(len(pred_names)):
        y_pred_i = y_pred if n_pred == 0 else get_second_dimension(y_pred, i)

        y_plot = [
            elementary_score(y_obs=y_obs, y_pred=y_pred_i, weights=weights, eta=eta)
            for eta in etas
        ]
        label = pred_names[i] if n_pred >= 2 else None
        if plot_backend == "matplotlib":
            ax.plot(etas, y_plot, label=label)
        else:
            fig.add_scatter(
                x=etas,
                y=y_plot,
                mode="lines",
                line={"color": get_plotly_color(i)},
                name=label,
            )

    xlabel = "eta"
    ylabel = "score"
    title = "Murphy Diagram"
    if n_pred <= 1 and len(pred_names[0]) > 0:
        title = title + " " + pred_names[0]

    if plot_backend == "matplotlib":
        if n_pred >= 2:
            ax.legend()
        ax.set_title(title)
        ax.set(xlabel=xlabel, ylabel=ylabel)
    else:
        if n_pred <= 1:
            fig.update_layout(showlegend=False)
        fig.update_layout(xaxis_title=xlabel, yaxis_title=ylabel, title=title)

    return ax