Regression on Workers' Compensation Dataset¶

This notebook demonstrates model diagnostics on the workers' compensation dataset https://www.openml.org/d/42876. For details, see https://arxiv.org/abs/2202.12780.

The modelling goal is to predict the expectation $E(Y|X)$ of $Y=\text{UltimateIncurredClaimCost}$ conditional on the features $X$.

The impatient can directly jump to chapters 3-4 where the capability of model-diagnostics are shown. Chapters 1-2 are needed for data preparation and model training.

1. Data¶

1.1 Data Load and Preprocessing¶

In [1]:

Copied!





import calendar
import numpy as np
import pandas as pd
from sklearn import set_config
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split


set_config(transform_output="pandas")  # enable the set_output API
import calendar
import numpy as np
import pandas as pd
from sklearn import set_config
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split


set_config(transform_output="pandas")  # enable the set_output API

In [2]:

Copied!

df_original = fetch_openml(data_id=42876, parser="auto").frame
df_original = fetch_openml(data_id=42876, parser="auto").frame

In [3]:

Copied!





df = df_original.query("WeeklyPay >= 200 and HoursWorkedPerWeek >= 20")
df = df.assign(
    DateTimeOfAccident=lambda x: pd.to_datetime(x["DateTimeOfAccident"]),
    DateOfAccident=lambda x: x.DateTimeOfAccident.dt.date,
    DateReported=lambda x: pd.to_datetime(x.DateReported).dt.date,
    LogDelay=lambda x: np.log1p(pd.to_timedelta(x.DateReported - x.DateOfAccident).dt.days),
    HourOfAccident=lambda x: x.DateTimeOfAccident.dt.hour,
    WeekDayOfAccident=lambda x: pd.Categorical.from_codes(
        codes=x.DateTimeOfAccident.dt.weekday,
        dtype=pd.CategoricalDtype(list(calendar.day_name), ordered=True),
    ),
    LogWeeklyPay=lambda x: np.log1p(x.WeeklyPay),
    LogInitial=lambda x: np.log(x.InitialCaseEstimate),
    DependentChildren=lambda x: np.fmin(4, x.DependentChildren),
    HoursWorkedPerWeek=lambda x: np.fmin(60, x.HoursWorkedPerWeek),
    Gender=lambda x: x.Gender.astype("category"),
    MaritalStatus=lambda x: x.MaritalStatus.astype("category"),
    PartTimeFullTime=lambda x: x.PartTimeFullTime.astype("category"),
).rename(columns={"HoursWorkedPerWeek": "HoursPerWeek"})

x_continuous = [
    "Age",
    "LogWeeklyPay",
    "LogInitial",
    "HourOfAccident",
    "HoursPerWeek",
    "LogDelay",
]
x_discrete = [
    "Gender",
    "MaritalStatus",
    "PartTimeFullTime",
    "DependentChildren",
    "DaysWorkedPerWeek",
    "WeekDayOfAccident",
]
x_vars = x_continuous + x_discrete
y_var = "UltimateIncurredClaimCost"

print(f"The prepared dataset contains {df.shape[0]} rows.")
df = df_original.query("WeeklyPay >= 200 and HoursWorkedPerWeek >= 20")
df = df.assign(
    DateTimeOfAccident=lambda x: pd.to_datetime(x["DateTimeOfAccident"]),
    DateOfAccident=lambda x: x.DateTimeOfAccident.dt.date,
    DateReported=lambda x: pd.to_datetime(x.DateReported).dt.date,
    LogDelay=lambda x: np.log1p(pd.to_timedelta(x.DateReported - x.DateOfAccident).dt.days),
    HourOfAccident=lambda x: x.DateTimeOfAccident.dt.hour,
    WeekDayOfAccident=lambda x: pd.Categorical.from_codes(
        codes=x.DateTimeOfAccident.dt.weekday,
        dtype=pd.CategoricalDtype(list(calendar.day_name), ordered=True),
    ),
    LogWeeklyPay=lambda x: np.log1p(x.WeeklyPay),
    LogInitial=lambda x: np.log(x.InitialCaseEstimate),
    DependentChildren=lambda x: np.fmin(4, x.DependentChildren),
    HoursWorkedPerWeek=lambda x: np.fmin(60, x.HoursWorkedPerWeek),
    Gender=lambda x: x.Gender.astype("category"),
    MaritalStatus=lambda x: x.MaritalStatus.astype("category"),
    PartTimeFullTime=lambda x: x.PartTimeFullTime.astype("category"),
).rename(columns={"HoursWorkedPerWeek": "HoursPerWeek"})

x_continuous = [
    "Age",
    "LogWeeklyPay",
    "LogInitial",
    "HourOfAccident",
    "HoursPerWeek",
    "LogDelay",
]
x_discrete = [
    "Gender",
    "MaritalStatus",
    "PartTimeFullTime",
    "DependentChildren",
    "DaysWorkedPerWeek",
    "WeekDayOfAccident",
]
x_vars = x_continuous + x_discrete
y_var = "UltimateIncurredClaimCost"

print(f"The prepared dataset contains {df.shape[0]} rows.")

The prepared dataset contains 82017 rows.

In [4]:

Copied!

df.head()
df.head()

Out[4]:

	DateTimeOfAccident	DateReported	Age	Gender	MaritalStatus	WeeklyPay	PartTimeFullTime	HoursPerWeek	DaysWorkedPerWeek	ClaimDescription	InitialCaseEstimate	UltimateIncurredClaimCost	DateOfAccident	LogDelay	HourOfAccident	WeekDayOfAccident	LogWeeklyPay	LogInitial
0	2005-01-13 09:00:00	2005-01-24	45	M	S	500	F	38	5	MOVING DISC STRAINED RIGHT SHOULDER	9500.0	102.39	2005-01-13	2.484907	9	Thursday	6.216606	9.159047
1	1994-09-28 15:00:00	1994-10-17	40	M	M	283	F	38	5	BOILING WATER CAME FROM TRUCK STRAIN RIGHT WRIST	3000.0	1451.00	1994-09-28	2.995732	15	Wednesday	5.648974	8.006368
5	2002-02-28 07:00:00	2002-04-08	50	F	S	517	F	38	5	SPILLED CHEMICAL WELDING IRRITATION LEFT CORNEA	1000.0	320.28	2002-02-28	3.688879	7	Thursday	6.249975	6.907755
7	1995-04-20 14:00:00	1995-05-08	19	M	S	200	F	38	5	ENTERED GRINDER FOREIGN BODY RIGHT EYE	110.0	108.00	1995-04-20	2.944439	14	Thursday	5.303305	4.700480
8	2005-01-07 14:00:00	2005-01-31	19	M	S	767	F	40	5	LIFTING STRAIN LOWER BACK AND	9700.0	7110.90	2005-01-07	3.218876	14	Friday	6.643790	9.179881

1.2 Data Split¶

In [5]:

Copied!





df_train, df_test = train_test_split(df, train_size=0.75, random_state=1234321)
df = (
    pd.concat((df_train, df_test), axis=0, keys=("train", "test"))
    .reset_index(level=0)
    .rename(columns={"level_0": "split"})
)

y_train, y_test = df_train[y_var], df_test[y_var]
X_train, X_test = df_train[x_vars], df_test[x_vars]
df_train, df_test = train_test_split(df, train_size=0.75, random_state=1234321)
df = (
    pd.concat((df_train, df_test), axis=0, keys=("train", "test"))
    .reset_index(level=0)
    .rename(columns={"level_0": "split"})
)

y_train, y_test = df_train[y_var], df_test[y_var]
X_train, X_test = df_train[x_vars], df_test[x_vars]

We check whether the split results in two approximately identically distributed samples.

In [6]:

Copied!





df.groupby("split").agg(
    mean=pd.NamedAgg(column=y_var, aggfunc="mean"),
    q20=pd.NamedAgg(column=y_var, aggfunc=lambda x: np.quantile(x, 0.2)),
    q40=pd.NamedAgg(column=y_var, aggfunc=lambda x: np.quantile(x, 0.4)),
    q50=pd.NamedAgg(column=y_var, aggfunc=lambda x: np.quantile(x, 0.5)),
    q60=pd.NamedAgg(column=y_var, aggfunc=lambda x: np.quantile(x, 0.6)),
    q80=pd.NamedAgg(column=y_var, aggfunc=lambda x: np.quantile(x, 0.8)),
    q90=pd.NamedAgg(column=y_var, aggfunc=lambda x: np.quantile(x, 0.9)),
)
df.groupby("split").agg(
    mean=pd.NamedAgg(column=y_var, aggfunc="mean"),
    q20=pd.NamedAgg(column=y_var, aggfunc=lambda x: np.quantile(x, 0.2)),
    q40=pd.NamedAgg(column=y_var, aggfunc=lambda x: np.quantile(x, 0.4)),
    q50=pd.NamedAgg(column=y_var, aggfunc=lambda x: np.quantile(x, 0.5)),
    q60=pd.NamedAgg(column=y_var, aggfunc=lambda x: np.quantile(x, 0.6)),
    q80=pd.NamedAgg(column=y_var, aggfunc=lambda x: np.quantile(x, 0.8)),
    q90=pd.NamedAgg(column=y_var, aggfunc=lambda x: np.quantile(x, 0.9)),
)

Out[6]:

	mean	q20	q40	q50	q60	q80	q90
split
test	13300.958856	205.0	464.714	707.500	1135.984	4716.426	18979.432
train	12909.500160	203.0	459.000	691.185	1096.230	4596.730	19030.384

2. Models¶

We aim at predicting the conditional expectation $E(Y|X)$ and will finally evaluate the models with the Gamma deviance $S(z, y) = 2 \left(log(\frac{z}{y}) + \frac{y}{z} - 1\right)$.

2.1 The trivial model¶

In [7]:

Copied!

from sklearn.dummy import DummyRegressor

m_trivial = DummyRegressor(strategy="mean").fit(X_train, y_train)
from sklearn.dummy import DummyRegressor

m_trivial = DummyRegressor(strategy="mean").fit(X_train, y_train)

2.2 OLS¶

Here, we train an Ordinary Least Squares (OLS) model, but on the log transformed target. TransformedTargetRegressor takes care of this transformation and also of the back transformation such that predictions are on the original target UltimateIncurredClaimCost.

In [ ]:

Copied!





from sklearn.compose import ColumnTransformer, TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder


# ColumnTransformer for linear models
col_trans_linear = ColumnTransformer(
    [
        ("numeric_features", "passthrough", x_continuous),
        (
            "categorical_features",
            OneHotEncoder(drop="first", sparse_output=False),
            x_discrete
        ),
    ]
)

m_ols = Pipeline(
    [
        ("column_transformer", col_trans_linear),
        (
            "model",
            TransformedTargetRegressor(
                regressor=LinearRegression(),
                func=np.log,
                inverse_func=np.exp,
            )
        ),
    ]
).fit(X_train, y_train)
from sklearn.compose import ColumnTransformer, TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder


# ColumnTransformer for linear models
col_trans_linear = ColumnTransformer(
    [
        ("numeric_features", "passthrough", x_continuous),
        (
            "categorical_features",
            OneHotEncoder(drop="first", sparse_output=False),
            x_discrete
        ),
    ]
)

m_ols = Pipeline(
    [
        ("column_transformer", col_trans_linear),
        (
            "model",
            TransformedTargetRegressor(
                regressor=LinearRegression(),
                func=np.log,
                inverse_func=np.exp,
            )
        ),
    ]
).fit(X_train, y_train)

We expect a high bias due to the log transformation. We correct for this bias by a multiplicative constant such that on the training set: $\sum_i m(x_i) = \sum_i y_i$.

In [9]:

Copied!

ols_corr_factor = np.sum(y_train) / np.sum(m_ols.predict(X_train))
print(f"The correction factor is {ols_corr_factor}")
m_ols[-1].regressor_.intercept_ += np.log(ols_corr_factor)
ols_corr_factor = np.sum(y_train) / np.sum(m_ols.predict(X_train))
print(f"The correction factor is {ols_corr_factor}")
m_ols[-1].regressor_.intercept_ += np.log(ols_corr_factor)

The correction factor is 6.6446580674943165

In [10]:

Copied!

np.sum(y_train) / np.sum(m_ols.predict(X_train))
np.sum(y_train) / np.sum(m_ols.predict(X_train))

Out[10]:

1.0000000000000002

2.3 Gamma and Poisson GLM¶

We use both, a Poisson and a Gamma GLM. The reason for a Gamma GLM is that we measure performance with the Gamma deviance. The reason to try out a Poisson GLM is that is has a canonical loss-link combination that ensures good calibration (given the design matrix is well specified).

In [11]:

Copied!





from sklearn.linear_model import GammaRegressor, PoissonRegressor


m_glm_poisson = Pipeline(
    [
        ("column_transformer", col_trans_linear),
        ("model", PoissonRegressor(alpha=1e-15, solver="newton-cholesky")),
    ]
).fit(X_train, y_train)

m_glm_gamma = Pipeline(
    [
        ("column_transformer", col_trans_linear),
        ("model", GammaRegressor(alpha=1e-15, solver="newton-cholesky")),
    ]
).fit(X_train, y_train)
from sklearn.linear_model import GammaRegressor, PoissonRegressor


m_glm_poisson = Pipeline(
    [
        ("column_transformer", col_trans_linear),
        ("model", PoissonRegressor(alpha=1e-15, solver="newton-cholesky")),
    ]
).fit(X_train, y_train)

m_glm_gamma = Pipeline(
    [
        ("column_transformer", col_trans_linear),
        ("model", GammaRegressor(alpha=1e-15, solver="newton-cholesky")),
    ]
).fit(X_train, y_train)

2.4 Gradient Boosted Decision Trees¶

Note that starting with scikit-learn version 1.3, we could use the Gamma deviance to train boosted trees, e.g. HistGradientBoostingRegressor(loss="gamma") (HGBT)—or use XGBoost or LightGBM for that. Instead, we again use the Poisson deviance for training the HGBT to make this exercise a bit more interesting. Here again, we expect better calibration for the canonical loss-link combination that comes with loss="poisson".

In [12]:

Copied!





%%time
# Note that this cell might take a little while ~ 2 minutes on my laptop.
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.experimental import enable_halving_search_cv
from sklearn.model_selection import HalvingGridSearchCV
from sklearn.preprocessing import OrdinalEncoder


# ColumnTransformer for boosted trees
col_trans_bt = ColumnTransformer(
    [
        ("categorical_features", OrdinalEncoder(), x_discrete),
        ("numeric_features", "passthrough", x_continuous),
    ]
)
m_hgbt_poisson_base = Pipeline(
    [
        ("column_transformer", col_trans_bt),
        (
            "model", HistGradientBoostingRegressor(
                loss="poisson",
                categorical_features=list(range(len(x_discrete))),
                monotonic_cst={"numeric_features__LogWeeklyPay": 1}, # set_output API, YES!
                max_iter=200,
                random_state=33,
            )
        ),
    ]
)

param_grid = {
    "model__learning_rate": [0.02, 0.05],
    "model__min_samples_leaf": [30, 40, 50],
    "model__l2_regularization": [0.1, 1],
    "model__max_depth": [None, 3],
}


# successive halfing grid search (CV) on the training data
shgs = HalvingGridSearchCV(
    m_hgbt_poisson_base,
    param_grid=param_grid,
    cv=5,
    scoring="neg_mean_gamma_deviance",
    random_state=321,
).fit(X_train, y_train)
m_hgbt_poisson = shgs.best_estimator_
%%time
# Note that this cell might take a little while ~ 2 minutes on my laptop.
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.experimental import enable_halving_search_cv
from sklearn.model_selection import HalvingGridSearchCV
from sklearn.preprocessing import OrdinalEncoder


# ColumnTransformer for boosted trees
col_trans_bt = ColumnTransformer(
    [
        ("categorical_features", OrdinalEncoder(), x_discrete),
        ("numeric_features", "passthrough", x_continuous),
    ]
)
m_hgbt_poisson_base = Pipeline(
    [
        ("column_transformer", col_trans_bt),
        (
            "model", HistGradientBoostingRegressor(
                loss="poisson",
                categorical_features=list(range(len(x_discrete))),
                monotonic_cst={"numeric_features__LogWeeklyPay": 1}, # set_output API, YES!
                max_iter=200,
                random_state=33,
            )
        ),
    ]
)

param_grid = {
    "model__learning_rate": [0.02, 0.05],
    "model__min_samples_leaf": [30, 40, 50],
    "model__l2_regularization": [0.1, 1],
    "model__max_depth": [None, 3],
}


# successive halfing grid search (CV) on the training data
shgs = HalvingGridSearchCV(
    m_hgbt_poisson_base,
    param_grid=param_grid,
    cv=5,
    scoring="neg_mean_gamma_deviance",
    random_state=321,
).fit(X_train, y_train)
m_hgbt_poisson = shgs.best_estimator_

CPU times: user 6min 37s, sys: 7.69 s, total: 6min 44s
Wall time: 1min 48s

In [13]:

Copied!

shgs.best_params_, shgs.best_score_
shgs.best_params_, shgs.best_score_

Out[13]:

({'model__l2_regularization': 1,
  'model__learning_rate': 0.05,
  'model__max_depth': 3,
  'model__min_samples_leaf': 40},
 -3.547907637012485)

In [14]:

Copied!

m_hgbt_poisson[-1].n_iter_
m_hgbt_poisson[-1].n_iter_

Out[14]:

3. Calibration Assessment¶

To make the code easier, we put together the predictions of all of our models on the test set in a single dataframe.

In [15]:

Copied!





df_pred_test = pd.DataFrame(
    {
        "Trivial": m_trivial.predict(X_test),
        "OLS": m_ols.predict(X_test),
        "GLM_Gamma": m_glm_gamma.predict(X_test),
        "GLM_Poisson": m_glm_poisson.predict(X_test),
        "HGBT_Poisson": m_hgbt_poisson.predict(X_test),
    }
)
df_pred_test = pd.DataFrame(
    {
        "Trivial": m_trivial.predict(X_test),
        "OLS": m_ols.predict(X_test),
        "GLM_Gamma": m_glm_gamma.predict(X_test),
        "GLM_Poisson": m_glm_poisson.predict(X_test),
        "HGBT_Poisson": m_hgbt_poisson.predict(X_test),
    }
)

3.1. Reliability Diagrams¶

A reliability diagram plots an estimation of $E(Y|m(X))$ versus $m(X)$ and therefore assesses auto-calibration. Good auto-calibration is seen by closeness to the diagonal depicted as dotted line.

A good way to estimate $E(Y|m(X))$ and thereby avoiding manual binning is by isotonic regression. This is implemented in plot_reliability_diagram.

In [16]:

Copied!

import matplotlib.pyplot as plt
from matplotlib.ticker import MultipleLocator
from model_diagnostics.calibration import compute_bias, plot_bias, plot_reliability_diagram
import matplotlib.pyplot as plt
from matplotlib.ticker import MultipleLocator
from model_diagnostics.calibration import compute_bias, plot_bias, plot_reliability_diagram

In [17]:

Copied!





fig, axes = plt.subplots(ncols=2, nrows=2, figsize=(8, 8), sharex=True, sharey=True)

for ax, model in zip(axes.flat, ["OLS", "GLM_Gamma", "GLM_Poisson", "HGBT_Poisson"]):
    plot_reliability_diagram(
        y_obs=y_test,
        y_pred=df_pred_test[model],
        n_bootstrap=100,
        ax=ax,
    )
    ax.xaxis.set_major_locator(MultipleLocator(100_000))

fig.set_layout_engine(layout="tight")
fig, axes = plt.subplots(ncols=2, nrows=2, figsize=(8, 8), sharex=True, sharey=True)

for ax, model in zip(axes.flat, ["OLS", "GLM_Gamma", "GLM_Poisson", "HGBT_Poisson"]):
    plot_reliability_diagram(
        y_obs=y_test,
        y_pred=df_pred_test[model],
        n_bootstrap=100,
        ax=ax,
    )
    ax.xaxis.set_major_locator(MultipleLocator(100_000))

fig.set_layout_engine(layout="tight")

No description has been provided for this image

For a better visual comparison, we use the diagram_type="bias" option to rotate the plots by 45 degrees (and mirror on the horizontal line) and plot all models in one go. This means, closer to y=0 is better. On top, we set the x-axis to a logarithmic scale and zoom in a little.

In [18]:

Copied!





ax = plot_reliability_diagram(
    y_obs=y_test,
    y_pred=df_pred_test,
    n_bootstrap=100,
    diagram_type="bias",
)
ax.set_xlim(left=1e4, right=3e5)
ax.set_ylim(bottom=-1e5, top=1e5)
ax.set_xscale("log")
ax = plot_reliability_diagram(
    y_obs=y_test,
    y_pred=df_pred_test,
    n_bootstrap=100,
    diagram_type="bias",
)
ax.set_xlim(left=1e4, right=3e5)
ax.set_ylim(bottom=-1e5, top=1e5)
ax.set_xscale("log")

3.2 Conditional Bias¶

Unconditional Bias¶

In [19]:

Copied!





compute_bias(
    y_obs=y_test,
    y_pred=df_pred_test,
    feature=None,
)
compute_bias(
    y_obs=y_test,
    y_pred=df_pred_test,
    feature=None,
)

Out[19]:

shape: (5, 6)

model	bias_mean	bias_count	bias_weights	bias_stderr	p_value
str	f64	u32	f64	f64	f64
"Trivial"	-391.458696	20505	20505.0	394.427909	0.320979
"OLS"	-158.226277	20505	20505.0	370.470343	0.669314
"GLM_Gamma"	-1525.26242	20505	20505.0	373.653779	0.000045
"GLM_Poisson"	-142.672362	20505	20505.0	370.458593	0.70015
"HGBT_Poisson"	-94.439905	20505	20505.0	370.823364	0.798976

Bias Conditional on Gender¶

In [20]:

Copied!





compute_bias(
    y_obs=y_test,
    y_pred=df_pred_test,
    feature=X_test["Gender"],
)
compute_bias(
    y_obs=y_test,
    y_pred=df_pred_test,
    feature=X_test["Gender"],
)

Out[20]:

shape: (10, 7)

model	Gender	bias_mean	bias_count	bias_weights	bias_stderr	p_value
str	cat	f64	u32	f64	f64	f64
"Trivial"	"F"	-4693.414762	4441	4441.0	925.511027	4.1151e-7
"Trivial"	"M"	797.845767	16064	16064.0	433.155234	0.065502
"OLS"	"F"	-1212.773117	4441	4441.0	878.188822	0.16735
"OLS"	"M"	133.310235	16064	16064.0	405.795301	0.742526
"GLM_Gamma"	"F"	-1183.005016	4441	4441.0	882.685587	0.180238
"GLM_Gamma"	"M"	-1619.881763	16064	16064.0	409.812589	0.000078
"GLM_Poisson"	"F"	-184.995122	4441	4441.0	878.931176	0.833305
"GLM_Poisson"	"M"	-130.971952	16064	16064.0	405.685045	0.74682
"HGBT_Poisson"	"F"	-303.384575	4441	4441.0	875.609671	0.728996
"HGBT_Poisson"	"M"	-36.675756	16064	16064.0	406.774313	0.928159

In [21]:

Copied!





plot_bias(
    y_obs=y_test,
    y_pred=df_pred_test,
    feature=X_test["Gender"],
    confidence_level=0.8
)
plot_bias(
    y_obs=y_test,
    y_pred=df_pred_test,
    feature=X_test["Gender"],
    confidence_level=0.8
)

Out[21]:

<Axes: title={'center': 'Bias Plot'}, xlabel='Gender', ylabel='bias'>

Bias Conditional on LogWeeklyPay¶

In [22]:

Copied!





plot_bias(
    y_obs=y_test,
    y_pred=df_pred_test,
    feature=X_test["LogWeeklyPay"],
    confidence_level=0.8,
)
plot_bias(
    y_obs=y_test,
    y_pred=df_pred_test,
    feature=X_test["LogWeeklyPay"],
    confidence_level=0.8,
)

Out[22]:

<Axes: title={'center': 'Bias Plot'}, xlabel='binned LogWeeklyPay', ylabel='bias'>

Bias Conditional on Age¶

Age is given as an integer in the data. We can pass n_bins to appropriate number to see all integer ages that occur in the test data.

In [23]:

Copied!





ax = plot_bias(
    y_obs=y_test,
    y_pred=df_pred_test,
    feature=X_test["Age"],
    n_bins=X_test["Age"].nunique(),
    confidence_level=0.8,
)
ax = plot_bias(
    y_obs=y_test,
    y_pred=df_pred_test,
    feature=X_test["Age"],
    n_bins=X_test["Age"].nunique(),
    confidence_level=0.8,
)

3.3 Marginal Plot - a Model overview per feature¶

Here, we focus on a single model, say HGBT_Poisson. We want to get a good overview of calibration, model effects and feature distribution at the same time. This functionality is offered by plot_marginal. It plots:

Mean observed and mean predicted per (binned) feature value
Partial dependence of the feature
Marginal distribution of the feature

To make this even more interesting, we use the plotly backend.

In [24]:

Copied!





from model_diagnostics import config_context
from model_diagnostics.calibration import plot_marginal


with config_context(plot_backend="plotly"):
    fig = plot_marginal(
        y_obs=y_test,
        y_pred=df_pred_test["HGBT_Poisson"],
        X=X_test,
        feature_name="Age",
        predict_function=m_hgbt_poisson.predict
    )
fig.show(renderer="notebook")
from model_diagnostics import config_context
from model_diagnostics.calibration import plot_marginal


with config_context(plot_backend="plotly"):
    fig = plot_marginal(
        y_obs=y_test,
        y_pred=df_pred_test["HGBT_Poisson"],
        X=X_test,
        feature_name="Age",
        predict_function=m_hgbt_poisson.predict
    )
fig.show(renderer="notebook")

And several features in one plotly figure with multiple subplots. Here, we need a lot of extra work, as plot_marginal is already a subplot consisting of 2 y-axis which are side-flipped, 1st yaxis on the right, 2nd yaxis on the left side. This is necessary to plot the histogram in the background.

In [25]:

Copied!





from math import ceil
from plotly.subplots import make_subplots


feature_list = ["Gender", "MaritalStatus", "LogWeeklyPay", "Age"]
n_rows, n_cols = ceil(len(feature_list) / 2), 2
fig = make_subplots(
    rows=n_rows,
    cols=n_cols,
    vertical_spacing=0.4 / n_rows,  # default = 0.3 / n_rows
    # subplot_titles=feature_list,
    specs=[[{"secondary_y": True}] * n_cols] * n_rows,
)

rows, cols = fig._get_subplot_rows_columns()
for row in rows:
    for col in cols:
        i = len(rows) * (row - 1) + col - 1  # 0-based index
        with config_context(plot_backend="plotly"):
            subfig = plot_marginal(
                y_obs=y_test,
                y_pred=df_pred_test["HGBT_Poisson"],
                X=X_test,
                feature_name=feature_list[i],
                predict_function=m_hgbt_poisson.predict,
            )
        # Transfer the x-axis titles of the subfig to fig.
        xaxis = "xaxis" if i == 0 else f"xaxis{i + 1}"
        fig["layout"][xaxis]["title"] = subfig["layout"]["xaxis"]["title"]
        # Change sides of y-axis.
        yaxis = "yaxis" if i == 0 else f"yaxis{2 * i + 1}"
        yaxis2 = f"yaxis{2 * (i + 1)}"
        fig.update_layout(**{
            yaxis: {"side": "right"}, yaxis2: {"side": "left", "title": "y"}
        })
        if not(row == rows[-1] and col == cols[-1]):
            # Suppress legends for all but the last subfig.
            subfig.update_traces(showlegend=False)
        for d in subfig.data:
            fig.add_trace(d, row=row, col=col, secondary_y=d["yaxis"] == "y2")

fig.update_layout(title="Marginal Plot HGBT_Poisson")
fig.update_xaxes(title_standoff=5)
fig.update_yaxes(title_standoff=5)
fig.show(renderer="notebook")
from math import ceil
from plotly.subplots import make_subplots


feature_list = ["Gender", "MaritalStatus", "LogWeeklyPay", "Age"]
n_rows, n_cols = ceil(len(feature_list) / 2), 2
fig = make_subplots(
    rows=n_rows,
    cols=n_cols,
    vertical_spacing=0.4 / n_rows,  # default = 0.3 / n_rows
    # subplot_titles=feature_list,
    specs=[[{"secondary_y": True}] * n_cols] * n_rows,
)

rows, cols = fig._get_subplot_rows_columns()
for row in rows:
    for col in cols:
        i = len(rows) * (row - 1) + col - 1  # 0-based index
        with config_context(plot_backend="plotly"):
            subfig = plot_marginal(
                y_obs=y_test,
                y_pred=df_pred_test["HGBT_Poisson"],
                X=X_test,
                feature_name=feature_list[i],
                predict_function=m_hgbt_poisson.predict,
            )
        # Transfer the x-axis titles of the subfig to fig.
        xaxis = "xaxis" if i == 0 else f"xaxis{i + 1}"
        fig["layout"][xaxis]["title"] = subfig["layout"]["xaxis"]["title"]
        # Change sides of y-axis.
        yaxis = "yaxis" if i == 0 else f"yaxis{2 * i + 1}"
        yaxis2 = f"yaxis{2 * (i + 1)}"
        fig.update_layout(**{
            yaxis: {"side": "right"}, yaxis2: {"side": "left", "title": "y"}
        })
        if not(row == rows[-1] and col == cols[-1]):
            # Suppress legends for all but the last subfig.
            subfig.update_traces(showlegend=False)
        for d in subfig.data:
            fig.add_trace(d, row=row, col=col, secondary_y=d["yaxis"] == "y2")

fig.update_layout(title="Marginal Plot HGBT_Poisson")
fig.update_xaxes(title_standoff=5)
fig.update_yaxes(title_standoff=5)
fig.show(renderer="notebook")

4. Predictive Model Performance¶

4.1 Score Decomposition of the Gamma Deviance¶

We use the Gamma deviance and evaluate on the test set. On top, we additively decompose the scores in terms of miscalibration (smaller is better), discrimination (larger is better) and uncertainty (property of the data, same for all models).

In [26]:

Copied!





import polars as pl
from model_diagnostics.scoring import GammaDeviance, decompose


gamma_deviance = GammaDeviance()

df = decompose(
    y_obs=y_test,
    y_pred=df_pred_test,
    scoring_function=gamma_deviance,
)
df.sort(["score"])
import polars as pl
from model_diagnostics.scoring import GammaDeviance, decompose


gamma_deviance = GammaDeviance()

df = decompose(
    y_obs=y_test,
    y_pred=df_pred_test,
    scoring_function=gamma_deviance,
)
df.sort(["score"])

Out[26]:

shape: (5, 5)

model	miscalibration	discrimination	uncertainty	score
str	f64	f64	f64	f64
"HGBT_Poisson"	0.088114	1.696757	5.086002	3.477359
"GLM_Gamma"	0.19056	1.597707	5.086002	3.678855
"OLS"	0.271994	1.605604	5.086002	3.752392
"GLM_Poisson"	0.470965	1.624191	5.086002	3.932776
"Trivial"	0.000901	8.8818e-16	5.086002	5.086903

First, the HGBT model has the best (smallest) overall Gamma deviance in column score. We also see that the uncertainty term is model independent as it is a property of the data and the scoring function only. Very typical for insurance claims, it is the largest component. This shows that insurance claims happen to a large part by chance.

The deciding difference of the scores then stems from the miscalibration term. Except for the trivial model, the HGBT model has the best (smallest) one. The discrimination term (larger is better) is quite similar among the non-trivial models. Interestingly, the Gamma GLM has a better miscalibration term, but a worse discrimination term compared to the Poisson GLM.

4.2 Murphy Diagram¶

How does the ranking of the model performances change with other scoring functions? Is one model dominating the others for a wide range of scoring functions? Such questions can be handily answered by a Murphy diagram. It plots the mean score for a wide range of elementary scoring functions, parametrized by eta on the x-axis. We explicitly specify the etas to zoom into the range where differences can be spotted.

In [27]:

Copied!





from model_diagnostics.scoring import SquaredError, plot_murphy_diagram


ax = plot_murphy_diagram(
    y_obs=y_test,
    y_pred=df_pred_test.loc[:, ["OLS", "GLM_Gamma", "GLM_Poisson", "HGBT_Poisson"]],
    etas=np.linspace(1e4, 0.2e6, 1000),
)
from model_diagnostics.scoring import SquaredError, plot_murphy_diagram


ax = plot_murphy_diagram(
    y_obs=y_test,
    y_pred=df_pred_test.loc[:, ["OLS", "GLM_Gamma", "GLM_Poisson", "HGBT_Poisson"]],
    etas=np.linspace(1e4, 0.2e6, 1000),
)

We see that there is no clear winner. Indeed, for other scoring functions like the squared error, the HGBT model is not the best performing one.

In [28]:

Copied!





decompose(
    y_obs=y_test,
    y_pred=df_pred_test,
    scoring_function=SquaredError(),
).sort(["score"])
decompose(
    y_obs=y_test,
    y_pred=df_pred_test,
    scoring_function=SquaredError(),
).sort(["score"])

Out[28]:

shape: (5, 5)

model	miscalibration	discrimination	uncertainty	score
str	f64	f64	f64	f64
"GLM_Poisson"	3.8145e7	4.1404e8	3.1899e9	2.8140e9
"OLS"	6.8568e7	4.4428e8	3.1899e9	2.8142e9
"HGBT_Poisson"	2.7010e7	3.9737e8	3.1899e9	2.8195e9
"GLM_Gamma"	4.1506e7	3.6635e8	3.1899e9	2.8650e9
"Trivial"	153239.910611	0.0	3.1899e9	3.1900e9