6 Main Example
We will work with the French motor third-party liability (MTPL) dataset available on OpenML (ID 41214) to model claims frequency.
6.1 The data
The first R script of this notebook performs the following steps:
- The dataset is downloaded from OpenML and saved.
- Rows with identical risk factors (“Area”, “VehPower”, “VehAge”, “DrivAge”, “BonusMalus”, “VehBrand”, “VehGas”, “Density”, “Region”) are assigned to a “group_id”. We assume that rows in such a group represent the same policy, even if their “IDpol” varies. This seems to be a curious property of this dataset. As we will see later, working with grouped data is a tricky business, and ignoring grouping can lead to biased results.
- Certain transformations are applied. For instance, very large feature values are truncated and small categories are collapsed to one category.
- The column names of the features, of the response and of the exposure are stored in variables.
- All relevant columns are described univariately.
Note: We will often use the forward pipe operator |>. It passes the left-hand side as the first argument to the function on the right-hand side. For example, iris |> head() is the same as head(iris). If you need a quick refresher on R, go through the first chapter of my statistical computing lecture.
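To make this concrete, here is a minimal illustration (my addition, not part of the original script):
# The pipe passes the left-hand side as the first argument
iris |> head(3)
# ... which is identical to
head(iris, 3)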
library(tidyverse)   # Data wrangling and plotting
library(OpenML)      # Dataset download
library(farff)       # Reader for the ARFF files served by OpenML
library(patchwork)   # Combining ggplots
library(splitTools)  # Grouped data splits
library(mgcv)        # GAMs
library(lightgbm)    # Gradient boosted trees
main <- "french_motor"  # Folder for saved objects
# Download and save dataset
df0 <- getOMLDataSet(data.id = 41214)$data
# Preprocessing
prep <- df0 |>
# Identify rows that might belong to the same policy
group_by_at(
c("Area", "VehPower", "VehAge", "DrivAge", "BonusMalus", "VehBrand",
"VehGas", "Density", "Region")
) |>
mutate(group_id = cur_group_id(), group_size = n()) |>
ungroup() |>
arrange(group_id) |>
# Usual preprocessing
mutate(
Exposure = pmin(1, Exposure),
Freq = pmin(15, ClaimNb / Exposure),
VehPower = pmin(12, VehPower),
VehAge = pmin(20, VehAge),
VehGas = factor(VehGas),
DrivAge = pmin(85, DrivAge),
logDensity = log(Density),
VehBrand = relevel(fct_lump_n(VehBrand, n = 3), "B12"),
PolicyRegion = relevel(fct_lump_n(Region, n = 5), "R24"),
AreaCode = Area
)
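As a quick, hedged sanity check of the grouping proxy (my addition), we can count how many rows share all risk factors with at least one other row:
# Distribution of rows over single-row vs. multi-row groups
prep |> count(multi_row = group_size > 1)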
# Covariates
xvars <- c(
"VehAge", "DrivAge", "logDensity", "VehPower", "PolicyRegion", "VehBrand", "VehGas"
)
# Description
nrow(prep)
## [1] 678013
head(prep[, c("IDpol", "group_id", "ClaimNb", "Exposure", "Freq", xvars)], 8)
summary(prep[, c("Freq", "Exposure", xvars)])
## Freq Exposure VehAge DrivAge
## Min. : 0.0000 Min. :0.002732 Min. : 0.000 Min. :18.00
## 1st Qu.: 0.0000 1st Qu.:0.180000 1st Qu.: 2.000 1st Qu.:34.00
## Median : 0.0000 Median :0.490000 Median : 6.000 Median :44.00
## Mean : 0.1622 Mean :0.528545 Mean : 6.976 Mean :45.49
## 3rd Qu.: 0.0000 3rd Qu.:0.990000 3rd Qu.:11.000 3rd Qu.:55.00
## Max. :15.0000 Max. :1.000000 Max. :20.000 Max. :85.00
## logDensity VehPower PolicyRegion VehBrand
## Min. : 0.000 Min. : 4.00 R24 :160601 B12 :166024
## 1st Qu.: 4.522 1st Qu.: 5.00 R11 : 69791 B1 :162736
## Median : 5.974 Median : 6.00 R53 : 42122 B2 :159861
## Mean : 5.982 Mean : 6.43 R82 : 84752 Other:189392
## 3rd Qu.: 7.413 3rd Qu.: 7.00 R93 : 79315
## Max. :10.204 Max. :12.00 Other:241432
## VehGas
## Diesel :332136
## Regular:345877
# Histograms
prep |>
select(Freq, Exposure, DrivAge, VehAge, VehPower, logDensity) |>
pivot_longer(everything()) |>
ggplot(aes(x = value)) +
geom_histogram(bins = 19, fill = "darkorange") +
facet_wrap("name", scales = "free") +
ggtitle("Histograms of numeric columns")
# Barplots
one_barplot <- function(v) {
ggplot(prep, aes(.data[[v]])) + geom_bar(fill = "darkorange")
}
wrap_plots(
  lapply(c("ClaimNb", "VehBrand", "VehGas", "PolicyRegion"), one_barplot)
) +
  plot_annotation(title = "Barplots of discrete columns")  # patchwork title
Comments:
- The dataset has about 680k observations.
- The univariate distributions of the variables in the prepared dataset look plausible.
- Rows 3 and 4 share the same “group_id”, as do rows 6, 7, and 8. Not only are all of their risk factors identical, but their “IDpol” are very similar as well. This increases confidence in our proxy “group_id”.
6.2 The models
We will now calculate three models with R:
- An additive Poisson GLM fitted with the function glm() from the {stats} package,
- a Poisson GAM with several penalized regression smooths fitted with the gam() function of the {mgcv} package, and
- a Poisson boosted trees model with implicit log link fitted by LightGBM (Ke et al. 2017) and tuned with grouped five-fold cross-validation.
The models are calculated on about 80% of the observations, while the rest is kept for model validation. Since we suspect that some policies are represented by multiple rows, the data split is done in a grouped way, assigning all rows with the same “group_id” to either the training or the test dataset. That logic was also applied to create the cross-validation folds for selecting or tuning the hyperparameters of the LightGBM model. More about grouped splits will follow in Chapter 2.
set.seed(22) # Seeds the data split as well as the boosted trees model
ind <- partition(prep$group_id, p = c(train = 0.8, test = 0.2), type = "grouped")
train <- prep[ind$train, ]
test <- prep[ind$test, ]
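As a hedged check (my addition), we can verify that no policy group is split across the two datasets:
# No group_id should occur in both train and test
length(intersect(train$group_id, test$group_id))  # expected: 0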
# 1) GLM
fit_glm <- glm(
Freq ~ VehAge + DrivAge + logDensity + VehPower + PolicyRegion + VehBrand + VehGas,
data = train,
family = quasipoisson(),
weights = Exposure
)
summary(fit_glm)
##
## Call:
## glm(formula = Freq ~ VehAge + DrivAge + logDensity + VehPower +
## PolicyRegion + VehBrand + VehGas, family = quasipoisson(),
## data = train, weights = Exposure)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.114e+00 4.809e-02 -43.958 < 2e-16 ***
## VehAge -3.441e-02 1.601e-03 -21.489 < 2e-16 ***
## DrivAge -5.104e-03 5.126e-04 -9.957 < 2e-16 ***
## logDensity 6.453e-02 4.571e-03 14.116 < 2e-16 ***
## VehPower 4.878e-03 3.751e-03 1.300 0.19346
## PolicyRegionR11 -8.731e-02 3.169e-02 -2.755 0.00587 **
## PolicyRegionR53 -1.392e-02 2.984e-02 -0.466 0.64092
## PolicyRegionR82 2.408e-05 2.546e-02 0.001 0.99925
## PolicyRegionR93 -8.260e-02 2.810e-02 -2.939 0.00329 **
## PolicyRegionOther -1.191e-01 2.032e-02 -5.864 4.53e-09 ***
## VehBrandB1 -1.602e-01 2.360e-02 -6.789 1.13e-11 ***
## VehBrandB2 -1.606e-01 2.362e-02 -6.799 1.05e-11 ***
## VehBrandOther -1.473e-01 2.205e-02 -6.680 2.39e-11 ***
## VehGasRegular 6.664e-02 1.484e-02 4.490 7.13e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for quasipoisson family taken to be 1.443783)
##
## Null deviance: 170431 on 542169 degrees of freedom
## Residual deviance: 168532 on 542156 degrees of freedom
## AIC: NA
##
## Number of Fisher Scoring iterations: 6
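Side remark (my addition): for a log-link (quasi-)Poisson model, the weighted frequency regression above is equivalent to modeling claim counts with a log-exposure offset. A minimal sketch to verify this; coefficients agree up to the few rows where Freq was capped at 15:
# Equivalent count formulation with log-exposure offset (sketch)
fit_glm_offset <- glm(
  ClaimNb ~ VehAge + DrivAge + logDensity + VehPower + PolicyRegion + VehBrand + VehGas,
  data = train,
  family = quasipoisson(),
  offset = log(Exposure)
)
# Differences stem only from the capping of Freq at 15
max(abs(coef(fit_glm) - coef(fit_glm_offset)))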
# 2) GAM with penalized regression smooths and maximum k-1 df. Takes long to fit
fit_gam <- gam(
Freq ~ s(VehAge, k = 7) + s(DrivAge, k = 7) + s(logDensity, k = 3) +
s(VehPower, k = 3) + PolicyRegion + VehBrand + VehGas,
data = train,
family = quasipoisson(),
weights = Exposure
)
summary(fit_gam)
##
## Family: quasipoisson
## Link function: log
##
## Formula:
## Freq ~ s(VehAge, k = 7) + s(DrivAge, k = 7) + s(logDensity, k = 3) +
## s(VehPower, k = 3) + PolicyRegion + VehBrand + VehGas
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.21592 0.02103 -105.376 < 2e-16 ***
## PolicyRegionR11 -0.07206 0.02576 -2.798 0.005149 **
## PolicyRegionR53 0.00631 0.02422 0.260 0.794486
## PolicyRegionR82 0.02171 0.02067 1.050 0.293568
## PolicyRegionR93 -0.07566 0.02283 -3.314 0.000921 ***
## PolicyRegionOther -0.10695 0.01650 -6.483 9.00e-11 ***
## VehBrandB1 -0.12878 0.01985 -6.489 8.64e-11 ***
## VehBrandB2 -0.14097 0.01983 -7.109 1.17e-12 ***
## VehBrandOther -0.10611 0.01859 -5.708 1.14e-08 ***
## VehGasRegular 0.05788 0.01224 4.727 2.27e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(VehAge) 5.980 6.000 438.36 < 2e-16 ***
## s(DrivAge) 5.887 5.994 207.61 < 2e-16 ***
## s(logDensity) 1.013 1.027 280.29 < 2e-16 ***
## s(VehPower) 1.983 2.000 14.86 7.17e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.00695 Deviance explained = 2.59%
## GCV = 0.30624 Scale est. = 0.94909 n = 542170
# Visualizing effect of additive component on scale of "linear" predictor
plot(fit_gam, select = 2, main = "Additive effect of 'DrivAge' on log scale")
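Fitting gam() on half a million rows is slow (see also the comments below). A hedged alternative, not used in this notebook, is mgcv::bam(), which implements fitting algorithms designed for large data. A sketch, assuming the same formula and family carry over:
# Sketch: faster large-data fit via bam(); discrete = TRUE additionally
# discretizes covariate values for extra speed (assumption: quasipoisson()
# is accepted here; otherwise poisson() is a close substitute)
fit_bam <- bam(
  Freq ~ s(VehAge, k = 7) + s(DrivAge, k = 7) + s(logDensity, k = 3) +
    s(VehPower, k = 3) + PolicyRegion + VehBrand + VehGas,
  data = train,
  family = quasipoisson(),
  weights = Exposure,
  discrete = TRUE
)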
# 3) Boosted trees - parameters found by CV randomized search, see later
dtrain <- lgb.Dataset(
data.matrix(train[xvars]),
label = train$Freq,
weight = train$Exposure,
params = list(feature_pre_filter = FALSE)
)
params <- list(
objective = "poisson",
learning_rate = 0.05,
num_leaves = 31,
lambda_l2 = 7.5,
lambda_l1 = 4,
colsample_bynode = 0.8,
bagging_fraction = 0.8,
min_data_in_leaf = 100,
min_sum_hessian_in_leaf = 0,
poisson_max_delta_step = 0.7
)
fit_lgb <- lgb.train(params = params, data = dtrain, nrounds = 361)
## [LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.011115 seconds.
## You can set `force_row_wise=true` to remove the overhead.
## And if memory is not enough, you can set `force_col_wise=true`.
## [LightGBM] [Info] Total Bins 370
## [LightGBM] [Info] Number of data points in the train set: 542170, number of used features: 7
## [LightGBM] [Info] Start training from score -2.321650
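Before saving the models, it can be instructive to glance at the boosted trees model's predictions on the test set (my addition, not part of the original workflow):
# Predicted expected claim frequencies on the test data
pred_lgb <- predict(fit_lgb, data.matrix(test[xvars]))
summary(pred_lgb)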
# Dump everything
save(prep, train, test, fit_glm, fit_gam, xvars, file = file.path(main, "intro.RData"))
lgb.save(fit_lgb, file.path(main, "fit_lgb.txt"))
Comments:
- Fitting a GAM on 540k rows of data takes a comparatively long time.
- The code required to fit the boosted trees model is much longer than for the GLM and the GAM. However, the code is very general, so only minor adaptations are needed for other modeling situations.
- The GLM and the GAM are easy to interpret directly from the model output. For example, for the GLM, a one-point increase in “DrivAge” is associated with a change in log expected claims frequency of -0.0051. On the frequency scale, this corresponds to a change of about 100% · (exp(-0.0051) - 1) ≈ -0.5%, as the sketch below reproduces in code. The GAM shows a more complex (and realistic) age effect. For the GLM and the GAM, the test statistics give us an indication of variable importance: Vehicle age and driver age appear to be among the strongest predictors. What about the boosted trees model? It will be one of the goals of the next chapter to fill this gap using the tools of XAI.
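A minimal sketch reproducing this calculation from the fitted GLM:
# Effect of one additional year of driver age on the frequency scale
beta <- coef(fit_glm)["DrivAge"]  # about -0.0051
100 * (exp(beta) - 1)             # about -0.5 (percent change)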