1 Introduction

Welcome to the part “XAI” (eXplainable Artificial Intelligence) of the lecture “Responsible Machine Learning with Insurance Applications”.

The basic question of XAI is:

How to explain and interpret a given model, even if it seems a black box?

Answering this question is a key aspect of responsible machine learning (ML). It not only provides valuable information and knowledge for stakeholders, but also helps to identify problems in the modeling process.

2 Scope and Taxonomy

Thanks to the rise of ML and the desire to explain its models, the field of explainable ML is developing rapidly. We will study the most important methods for structured data. Unstructured data (images, audio sequences, text, …) often has its own specialized explanation tools. Covering them is beyond the scope of this lecture.

The following taxonomy of explainability methods is often helpful, see Molnar (2019) for more background:

Global versus local explainability: Global explainability methods describe the model as a whole, while local methods focus on explaining model behavior around a single observation. We will mainly cover global methods, with one notable exception (SHAP).
Model-specific versus model-agnostic: Each modeling technique typically has its own tailored methods for interpretation. For example, a linear regression can be explained by examining its coefficients. Tree-based models are another example: The variable importance of a feature in such a model can be assessed, for instance, by studying the average loss reduction from splits on that feature. In contrast to such model-specific methods, a model-agnostic method is suitable for all types of models. Our focus is on model-agnostic techniques, although we sometimes show model-specific methods for comparison.
Intrinsic versus post-hoc: The simple structure of a linear model, an additive model, or a short decision tree makes such modeling techniques intrinsically interpretable. In contrast, models with complex structure (e.g., a neural net) are interpretable only by post-hoc analysis of the fitted model. Model-agnostic interpretability methods are always used for post-hoc explainability. However, they can also be applied to an intrinsically interpretable model.

Note: We will use the terms “explainable”, “interpretable”, and “intelligible” interchangeably, although subtle differences are sometimes pointed out in the literature.

3 Outline

This chapter introduces important notations, the necessary background on non-life insurance pricing, and the main example that will guide us on our way through the illuminating world of model explainability.

In the chapter “Explaining Models” we will learn the most important methods of post-hoc interpretability. We will also learn how to use the insights gained from interpreting complex models to improve linear models.

The third chapter, “Improving Explainability”, takes a different path. Here we explore ways to improve the intrinsic explainability of complex models by simplifying their internal structure, in particular by forcing parts of the model to be additive and/or monotonic.

4 Notation

Throughout the XAI notebooks, we consider a typical modeling situation: a distributional property T of a real-valued response variable Y should be approximated by a model m: \boldsymbol x \in \mathbb R^p \mapsto \mathbb R of a p-dimensional feature vector \boldsymbol X = (X^{(1)}, \dots, X^{(p)}) with value \boldsymbol x = (x^{(1)}, \dots, x^{(p)}) \in \mathbb R^p, i.e., T(Y\mid \boldsymbol X = \boldsymbol x) \approx m(\boldsymbol x). For brevity, we write T(Y\mid \boldsymbol X = \boldsymbol x) = T(Y\mid \boldsymbol x).

Examples of T are the expectation T =\mathbb E, or a quantile q_\alpha. The model m is then estimated by \hat m from the training data by minimizing some objective criterion typically of the form \frac{1}{\sum_{i = 1}^n w_i} \sum_{i = 1}^n w_i L(\hat y_i, y_i) + \lambda \Omega(m), where

L is a loss function (sometimes called “scoring function”) ideally strictly consistent for T, e.g., the squared error L(z, y) = (y - z)^2,
1 \le i \le n are the observations in the dataset considered,
\lambda \Omega(m) is an optional penalty to reduce overfitting,
\boldsymbol w = (w_1, \dots, w_n)^T is the vector of (optional) non-negative case weights of the n observations (w_i = 1 if there are no such case weights),
\boldsymbol y = (y_1, \dots, y_n)^T are the n observed values of Y,
\boldsymbol{\hat y} = (\hat y_1, \dots, \hat y_n)^T is the vector of predicted or fitted values, i.e., \hat y_i = \hat m(\boldsymbol x_i),
\boldsymbol x_1, \dots, \boldsymbol x_n are the feature vectors corresponding to the n observations. Consequently, x_i^{(j)} denotes the value of the j-th feature of the i-th observation.

Even if many of the concepts covered in this lecture also work for classification settings with more than two classes, we focus on regression and binary classification.

Some additional notation is introduced in the following examples of important model classes.

4.1 Example: Linear regression

The model equation of a linear regression model is assumed to satisfy \mathbb E(Y \mid \boldsymbol x) = m(\boldsymbol x) = \beta_o + \beta_1 x^{(1)} + \dots + \beta_p x^{(p)}, where (\beta_o, \beta_1, \dots, \beta_p) \in \mathbb R^{p+1} is the parameter vector to be estimated by minimizing the sum of squared errors \sum_{i = 1}^n (y_i - \hat y_i)^2 by linear least-squares.

4.2 Example: Generalized linear model (GLM)

The model equation of a generalized linear model (Nelder and Wedderburn 1972) with link function g and inverse link g^{-1} is assumed to satisfy: \mathbb E(Y \mid \boldsymbol x) = m(\boldsymbol x) = g^{-1}(\eta(\boldsymbol x)) = g^{-1}(\beta_o + \beta_1 x^{(1)} + \dots + \beta_p x^{(p)}), where the linear part \eta of the model is called the linear predictor. The parameters \beta_j are estimated by minimizing the (possibly weighted) average deviance \bar S, a measure depending on the distribution assumed for Y\mid \boldsymbol x. Typical distributions are the Poisson distribution, the Gamma distribution, the Bernoulli distribution, and the normal distribution.

In this lecture, we will often model insurance claim counts by a Poisson GLM with log link g. Its (possibly weighted) average deviance on the training data D_{\text{train}} = \{(y_i, w_i, \boldsymbol x_i), i = 1, \dots, n\} is given by the formula \bar S(\hat m, D_{\text{train}}) = \sum_{i = 1}^n w_i S(\hat y_i, y_i) / \sum_{i = 1}^n w_i, where S is the (unit) Poisson deviance of a single observation, S(\hat y_i, y_i) = 2(y_i \log(y_i / \hat y_i) - (y_i - \hat y_i)).

4.3 Example: Generalized additive model (GAM)

An important extension of the GLM is the generalized additive model (Hastie and Tibshirani 1986, 1990; Wood 2017). Its model equation is assumed to satisfy \mathbb E(Y \mid \boldsymbol x) = m(\boldsymbol x) = g^{-1}(\beta_o + f_1(x^{(1)}) + \dots + f_p(x^{(p)})) for sufficiently nice functions f_j describing the Ceteris Paribus effect of feature X^{(j)} on g(\mathbb E(Y \mid \boldsymbol x)), and \beta_o is the intercept. As functions f_j, often penalized regression smooths are used that are then being estimated to minimize the distribution specific average deviance. The functions f_j are usually zero-centered to uniquely define them. Some of the components f_j can also be fully parametric, just like in a GLM.

GAMs and GLMs can also contain interaction terms, i.e., components that depend on more than one covariate.

4.4 Example: Gradient boosted decision trees (GBDT)

Like a neural net, gradient boosted trees (Friedman 2001) allow to represent extremely flexible model equations. They can automatically pick up complex interactions between covariates, a process that would need to be done in a manual way for linear models. Such models are fitted by applying gradient boosting on the objective function chosen in line with T.

In the statistical software R, different implementation of gradient boosting exist. We will mainly work with “XGBoost” (Chen and Guestrin 2016) and “LightGBM” (Ke et al. 2017). They allow to model different functionals T, and offer a rich set of objective functions and many regularization parameters. We will, as a competitor of the Poisson GLM and the Poisson GAM, also consider Poisson boosted trees.

For details on gradient boosting, check our corresponding lecture slides, and also this section of “Statistical Computing”.

5 Non-Life Insurance Pricing

The models described above are highly relevant in insurance pricing. A brief description of this field is given here.

One of the main tasks of a non-life pricing actuary is to predict the pure premium associated with an insurance policy. It is defined as the financial loss per year or per some other relevant exposure measure. The insurance company uses this information to optimize its tariffs and to estimate expected future profit. Such predictions are usually made by statistical models based on historical data on policies and claims.

Slightly adapting the notation of Wüthrich and Buser (2021), we use the following terms to characterize an insurance policy with exposure w. Throughout these notebooks, w is measured in yearly units.

w > 0: The exposure. All other quantities will refer to this.
N: Number of claims.
C: Total claim amount.
C / w: Pure premium.
Y = N / w: The claims frequency, i.e., the number of claims per exposure.
Z = C / N: The claims severity, i.e., the average cost per claim (undefined if N = 0).
\boldsymbol X: One or more risk characteristics describing the policy.

Observed values for two fictive motor third-part liability (MTPL) policies could look like this:

id	w	N	C	C/w	Y	Z	Driver’s age	Horse power
1	1	0	0	0	0	-	28	80
2	0.5	2	5000	10000	4	2500	20	250
2	0.5	1	1000	2000	2	1000	21	250

Due to the additivity of w, N, and C, these quantities can also be defined for multiple policies together, e.g., for the entire portfolio.

Instead of directly modeling \mathbb E(C / w \mid \boldsymbol x), pricing actuaries often decompose the pure premium into a product of frequency and severity C / w = (C / w) \cdot (N / N) = (N / w) \cdot (C / N) = Y Z and derive two models: One for the claims frequency \mathbb E(Y \mid \boldsymbol x) \approx m_Y(\boldsymbol x), classically a Poisson GLM with log link and case weights w; and another one for the severity \mathbb E(Z \mid \boldsymbol x) \approx m_Z(\boldsymbol x), classically a Gamma GLM with log link and case weights N, using only rows with N>0.

Assuming conditional independence of Y and Z, the combined model for the pure premium is then given by the product of the two models, i.e., by \mathbb E(C / w \mid \boldsymbol x) \approx m_Y(\boldsymbol x) m_Z(\boldsymbol x). The GLMs can be replaced by corresponding GAMs or modern ML techniques like deep learning and boosted trees. These alternatives are ideally fitted using the same loss functions (the corresponding deviance), the same case weights, and also using log links.

For more information on non-life insurance pricing, we recommend the excellent references Wüthrich and Buser (2021) and Ohlsson and Johansson (2015).

5.1 Further remarks

The two models can use different feature sets.
The severity model uses a log link to ensure a multiplicative model structure on the response scale. Since the logarithm is not the natural link function of the Gamma GLM, this leads to a slight bias on the scale of the response, which can be eliminated, e.g., by applying an empirical multiplicative correction factor c = \sum_{i = 1}^n y_i / \sum_{i = 1}^n \hat y_i calculated on the training data.
In R, when modeling claim frequencies with GLMs or GAMs, we will use the distribution family quasipoisson. This (a) suppresses a warning on non-integer responses and (b) allows for overdispersion. It provides the same parameter estimates as the poisson family.
Instead of fitting a Poisson model for claims frequency using case weights w, one can alternatively model the claim counts N by a Poisson model without case weights but using an offset of \log(w). Such a model will predict expected claim counts rather than expected claim frequencies. Nevertheless, the estimated effects will be identical between the two variants.

6 Main Example

We will work with the French motor third-part liability (MTPL) dataset available on openML (ID 41214) to model claims frequency.

6.1 The data

The first R script of this notebook performs the following steps:

The dataset is downloaded from OpenML and saved.
Rows with identical risk factors (“Area”, “VehPower”, “VehAge”, “DrivAge”, “BonusMalus”, “VehBrand”, “VehGas”, “Density”, “Region”) are assigned to a “group_id”. We assume that rows in such a group represent the same policy, even if their “IDpol” varies. This seems to be a curious property of this dataset. As we will see later, working with grouped data is a tricky business, and ignoring grouping can lead to biased results.
Certain transformations are applied. For instance, very large feature values are truncated and small categories are collapsed to one category.
The column names of the features, of the response and of the exposure are stored in variables.
All relevant columns are described univariately.

Note: We will often use the forward pipe operator |>. It moves the left-hand side as first argument in the function on the right-hand side. An example: iris |> head() is the same as head(iris). If you need a quick refresher on R, go through the first chapter of my statistical computing lecture.

library(tidyverse)
library(OpenML)
library(farff)
library(patchwork)
library(splitTools)
library(mgcv)
library(lightgbm)

main <- "french_motor"

# Download and save dataset
df0 <- getOMLDataSet(data.id = 41214)$data

# Preprocessing
prep <- df0 |> 
  # Identify rows that might belong to the same policy
  group_by_at(
    c("Area", "VehPower", "VehAge", "DrivAge", "BonusMalus", "VehBrand", 
      "VehGas", "Density", "Region")
  ) |> 
  mutate(group_id = cur_group_id(), group_size = n()) |> 
  ungroup() |> 
  arrange(group_id) |>
  # Usual preprocessing
  mutate(
    Exposure = pmin(1, Exposure),
    Freq = pmin(15, ClaimNb / Exposure),
    VehPower = pmin(12, VehPower),
    VehAge = pmin(20, VehAge),
    VehGas = factor(VehGas),
    DrivAge = pmin(85, DrivAge),
    logDensity = log(Density),
    VehBrand = relevel(fct_lump_n(VehBrand, n = 3), "B12"),
    PolicyRegion = relevel(fct_lump_n(Region, n = 5), "R24"),
    AreaCode = Area
  )

# Covariates
xvars <- c(
  "VehAge", "DrivAge", "logDensity", "VehPower", "PolicyRegion", "VehBrand", "VehGas"
)

# Description
nrow(prep)

## [1] 678013

head(prep[, c("IDpol", "group_id", "ClaimNb", "Exposure", "Freq", xvars)], 8)

summary(prep[, c("Freq", "Exposure", xvars)])

##       Freq            Exposure            VehAge          DrivAge     
##  Min.   : 0.0000   Min.   :0.002732   Min.   : 0.000   Min.   :18.00  
##  1st Qu.: 0.0000   1st Qu.:0.180000   1st Qu.: 2.000   1st Qu.:34.00  
##  Median : 0.0000   Median :0.490000   Median : 6.000   Median :44.00  
##  Mean   : 0.1622   Mean   :0.528545   Mean   : 6.976   Mean   :45.49  
##  3rd Qu.: 0.0000   3rd Qu.:0.990000   3rd Qu.:11.000   3rd Qu.:55.00  
##  Max.   :15.0000   Max.   :1.000000   Max.   :20.000   Max.   :85.00  
##    logDensity        VehPower     PolicyRegion    VehBrand     
##  Min.   : 0.000   Min.   : 4.00   R24  :160601   B12  :166024  
##  1st Qu.: 4.522   1st Qu.: 5.00   R11  : 69791   B1   :162736  
##  Median : 5.974   Median : 6.00   R53  : 42122   B2   :159861  
##  Mean   : 5.982   Mean   : 6.43   R82  : 84752   Other:189392  
##  3rd Qu.: 7.413   3rd Qu.: 7.00   R93  : 79315                 
##  Max.   :10.204   Max.   :12.00   Other:241432                 
##      VehGas      
##  Diesel :332136  
##  Regular:345877  
##                  
##                  
##                  
##

# Histograms
prep |>
  select(Freq, Exposure, DrivAge, VehAge, VehPower, logDensity) |> 
  pivot_longer(everything()) |> 
ggplot(aes(x = value)) +
  geom_histogram(bins = 19, fill = "darkorange") +
  facet_wrap("name", scales = "free") +
  ggtitle("Histograms of numeric columns")

# Barplots
one_barplot <- function(v) {
  ggplot(prep, aes(.data[[v]])) + geom_bar(fill = "darkorange")
}
wrap_plots(
  lapply(c("ClaimNb", "VehBrand", "VehGas", "PolicyRegion"), one_barplot), 
  top = "Barplots of discrete columns"
)

Comments:

The dataset has about 680k observations.
The univariate distributions of the variables in the prepared dataset look plausible.
Rows 3 and 4 share the same “group_id”, as do rows 6, 7, and 8. Not only are all of their risk factors identical, but their “IDpol” are very similar as well. This increases confidence in our proxy “group_id”.

6.2 The models

We will now calculate three models with R:

An additive Poisson GLM fitted with the function glm() from the {stats} package, a
Poisson GAM with several penalized regression smooths with the gam() function of the {mgcv} package, and a
Poisson boosted trees model with implicit log link fitted by LightGBM (Ke et al. 2017) and tuned with grouped five-fold cross-validation.

The models are calculated on about 80% of the observations, while the rest is kept for model validation. Since we suspect that some policies are represented by multiple rows, the data split is done in a grouped way, assigning all rows with the same “group_id” to either the training or the test dataset. That logic was also applied to create the cross-validation folds for selecting or tuning the hyperparameters of the LightGBM model. More about grouped splits will follow in Chapter 2.

set.seed(22)  # Seeds the data split as well as the boosted trees model

ind <- partition(prep$group_id, p = c(train = 0.8, test = 0.2), type = "grouped")
train <- prep[ind$train, ]
test <- prep[ind$test, ]

# 1) GLM
fit_glm <- glm(
  Freq ~ VehAge + DrivAge + logDensity + VehPower + PolicyRegion + VehBrand + VehGas,
  data = train, 
  family = quasipoisson(), 
  weights = Exposure
)
summary(fit_glm)

## 
## Call:
## glm(formula = Freq ~ VehAge + DrivAge + logDensity + VehPower + 
##     PolicyRegion + VehBrand + VehGas, family = quasipoisson(), 
##     data = train, weights = Exposure)
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -2.114e+00  4.809e-02 -43.958  < 2e-16 ***
## VehAge            -3.441e-02  1.601e-03 -21.489  < 2e-16 ***
## DrivAge           -5.104e-03  5.126e-04  -9.957  < 2e-16 ***
## logDensity         6.453e-02  4.571e-03  14.116  < 2e-16 ***
## VehPower           4.878e-03  3.751e-03   1.300  0.19346    
## PolicyRegionR11   -8.731e-02  3.169e-02  -2.755  0.00587 ** 
## PolicyRegionR53   -1.392e-02  2.984e-02  -0.466  0.64092    
## PolicyRegionR82    2.408e-05  2.546e-02   0.001  0.99925    
## PolicyRegionR93   -8.260e-02  2.810e-02  -2.939  0.00329 ** 
## PolicyRegionOther -1.191e-01  2.032e-02  -5.864 4.53e-09 ***
## VehBrandB1        -1.602e-01  2.360e-02  -6.789 1.13e-11 ***
## VehBrandB2        -1.606e-01  2.362e-02  -6.799 1.05e-11 ***
## VehBrandOther     -1.473e-01  2.205e-02  -6.680 2.39e-11 ***
## VehGasRegular      6.664e-02  1.484e-02   4.490 7.13e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasipoisson family taken to be 1.443783)
## 
##     Null deviance: 170431  on 542169  degrees of freedom
## Residual deviance: 168532  on 542156  degrees of freedom
## AIC: NA
## 
## Number of Fisher Scoring iterations: 6

# 2) GAM with penalized regression smooths and maximum k-1 df. Takes long to fit
fit_gam <- gam(
  Freq ~ s(VehAge, k = 7) + s(DrivAge, k = 7) + s(logDensity, k = 3) + 
    s(VehPower, k = 3) + PolicyRegion + VehBrand + VehGas, 
  data = train, 
  family = quasipoisson(), 
  weights = Exposure
)
summary(fit_gam)

## 
## Family: quasipoisson 
## Link function: log 
## 
## Formula:
## Freq ~ s(VehAge, k = 7) + s(DrivAge, k = 7) + s(logDensity, k = 3) + 
##     s(VehPower, k = 3) + PolicyRegion + VehBrand + VehGas
## 
## Parametric coefficients:
##                   Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)       -2.21592    0.02103 -105.376  < 2e-16 ***
## PolicyRegionR11   -0.07206    0.02576   -2.798 0.005149 ** 
## PolicyRegionR53    0.00631    0.02422    0.260 0.794486    
## PolicyRegionR82    0.02171    0.02067    1.050 0.293568    
## PolicyRegionR93   -0.07566    0.02283   -3.314 0.000921 ***
## PolicyRegionOther -0.10695    0.01650   -6.483 9.00e-11 ***
## VehBrandB1        -0.12878    0.01985   -6.489 8.64e-11 ***
## VehBrandB2        -0.14097    0.01983   -7.109 1.17e-12 ***
## VehBrandOther     -0.10611    0.01859   -5.708 1.14e-08 ***
## VehGasRegular      0.05788    0.01224    4.727 2.27e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                 edf Ref.df      F  p-value    
## s(VehAge)     5.980  6.000 438.36  < 2e-16 ***
## s(DrivAge)    5.887  5.994 207.61  < 2e-16 ***
## s(logDensity) 1.013  1.027 280.29  < 2e-16 ***
## s(VehPower)   1.983  2.000  14.86 7.17e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.00695   Deviance explained = 2.59%
## GCV = 0.30624  Scale est. = 0.94909   n = 542170

# Visualizing effect of additive component on scale of "linear" predictor
plot(fit_gam, select = 2, main = "Additive effect of 'DrivAge' on log scale")

# 3) Boosted trees - parameters found by CV randomized search, see later
dtrain <- lgb.Dataset(
  data.matrix(train[xvars]), 
  label = train$Freq, 
  weight = train$Exposure, 
  params = list(feature_pre_filter = FALSE)
)
params <- list(
  objective = "poisson",
  learning_rate = 0.05,
  num_leaves = 31,
  lambda_l2 = 7.5,
  lambda_l1 = 4,
  colsample_bynode = 0.8,
  bagging_fraction = 0.8,
  min_data_in_leaf = 100,
  min_sum_hessian_in_leaf = 0,
  poisson_max_delta_step = 0.7
)
fit_lgb <- lgb.train(params = params, data = dtrain, nrounds = 361)

## [LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.011115 seconds.
## You can set `force_row_wise=true` to remove the overhead.
## And if memory is not enough, you can set `force_col_wise=true`.
## [LightGBM] [Info] Total Bins 370
## [LightGBM] [Info] Number of data points in the train set: 542170, number of used features: 7
## [LightGBM] [Info] Start training from score -2.321650

# Dump everything
save(prep, train, test, fit_glm, fit_gam, xvars, file = file.path(main, "intro.RData"))
lgb.save(fit_lgb, file.path(main, "fit_lgb.txt"))

Comments:

Fitting a GAM on 540k rows of data takes comparably long.
The code required to fit the boosted trees model is much longer than for the GLM and the GAM. However, the code is very general, so that only minor adaptions need to be made for other modeling situations.
Interpreting the GLM and the GAM can be easily made from the model output. For example, for the GLM, a one-point increase in “DrivAge” is associated with a change in log expected claims frequency of -0.0051. On the frequency scale, this corresponds to a change of 100\% \cdot (e^{-0.0051} - 1) \approx -0.5\%. The GAM shows a more complex (and realistic) age effect. For the GLM and the GAM, the test statistics give us an indication of the variable importance: Vehicle age and driver age appear to be among the strongest predictors. What about the boosted trees model? It will be one of the goals of the next chapter to fill this gap using the tools of XAI.

7 Exercises

Here, we will work with an Australian car claims dataset “dataCar” available in the R package {insuranceData}.

library(tidyverse)
library(insuranceData)

data(dataCar)
head(dataCar)

# Data preparation
prep <- dataCar |>  
  mutate(
    Freq = numclaims / exposure,
    Sev = claimcst0 / numclaims,
    veh_age = factor(veh_age),
    agecat = factor(agecat),
    veh_body = fct_lump_prop(veh_body, 0.1),
    log_value = log(pmax(0.3, pmin(15, veh_value)))
  )

xvars <- c("log_value", "agecat", "veh_age", "area", "veh_body", "gender")

# XGBoost parameters for the last exercise
params <- list(
  objective = "count:poisson",
  learning_rate = 0.05,
  max_depth = 2,
  reg_lambda = 1,
  reg_alpha = 2,
  colsample_bynode = 0.8,
  subsample = 1,
  min_split_loss = 0,
  max_delta_step = 0.1  # with Poisson
)
nrounds <- 304

Descriptive Analysis
1. Study the data description via ?dataCar.
2. Calculate relevant statistics at portfolio level, i.e., on the full dataset: Total and average exposure, total claim amount, number of claims, pure premium, frequency, and severity.
3. Describe the distribution of each model feature in x_var in the prepared dataset. Use tables and/or plots and/or text.
Modeling: Next, we develop different claims frequency models using the same set of features (x_vars).
1. Split the dataset into 80% for training and 20% for testing. Don’t touch the test data during modeling.
2. GLM: Fit an exposure-weighted Poisson GLM with log link to the claim frequencies. Interpret the estimated coefficients of the features “log_value” and “gender”.
3. Random forest: Model “Freq” by an exposure-weighted random forest with the R package {ranger} using the quadratic loss. Select the optimal tree depth by minimizing exposure-weighted Poisson deviance calculated from the out-of-bag (OOB) predictions on the training data. How are OOB predictions being calculated? You can use the function deviance_poisson() in the {MetricsWeighted} package to calculate deviance.
4. XGBoost: Study the documentation of XGBoost. The original publication is Chen and Guestrin (2016), and a fantastic slide deck (pdf) of the author can be found on this Link. Use the {xgboost} package to fit an exposure-weighted Poisson XGBoost model with nrounds boosting rounds and the hyperparameters stated in the code above. We will later see how to find such parameters. Note that XGBoost automatically uses a log link when fitting a Poisson model.

References

Chen, Tianqi, and Carlos Guestrin. 2016. “XGBoost: A Scalable Tree Boosting System.” In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 785–94. KDD ’16. New York, NY, USA: Association for Computing Machinery. https://doi.org/10.1145/2939672.2939785.

Friedman, Jerome H. 2001. “Greedy Function Approximation: A Gradient Boosting Machine.” Ann. Stat. 29 (5): 1189–1232. https://doi.org/10.1214/aos/1013203451.

Hastie, Trevor, and Robert Tibshirani. 1986. “Generalized Additive Models.” Stat. Sci. 1 (3): 297–310. https://doi.org/10.1214/ss/1177013604.

———. 1990. Generalized Additive Models. Wiley Online Library.

Ke, Guolin, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. 2017. “LightGBM: A Highly Efficient Gradient Boosting Decision Tree.” In Advances in Neural Information Processing Systems, 30:3149–57. Curran Associates, Inc.

Molnar, Christoph. 2019. Interpretable Machine Learning: A Guide for Making Black Box Models Explainable. https://christophm.github.io/interpretable-ml-book.

Nelder, J. A., and R. W. M. Wedderburn. 1972. “Generalized Linear Models.” J. R. Stat. Soc. Series A (General) 135 (3): 370–84. https://doi.org/10.2307/2344614.

Ohlsson, E., and B. Johansson. 2015. Non-Life Insurance Pricing with Generalized Linear Models. EAA Series. Springer Berlin Heidelberg.

Wood, Simon N. 2017. Generalized Additive Models: An Introduction with R. 2nd ed. United States: CRC Press. https://doi.org/10.1201/9781315370279.

Wüthrich, Mario V., and Christoph Buser. 2021. “Data Analytics for Non-Life Insurance Pricing.” https://doi.org/10.2139/ssrn.2870308.

XAI: Introduction

Responsible Machine Learning with Insurance Applications

Michael Mayer

2024-09-03