1 Introduction

“The best explanation of a simple model is the model itself”

(Lundberg and Lee 2017)

Welcome to the last chapter of the XAI part of the lecture “Responsible Machine Learning with Insurance Applications”. It is dedicated to improving the intrinsic explainability of ML models.

Regarding intrinsic explainability, there is the following basic hierarchy:

  1. Linear additive models (like GLMs): Their model structure is so simple that the effects can be read off directly from the numerical model output.
  2. Additive models (like GAMs): It is still possible to give a complete description of the effects, but they must be visualized for interpretation.
  3. Black box models (like boosted trees or neural nets): These models contain many complex interactions, so that the effects can only be described approximately. Without interaction effects, such models would be additive and as transparent as a GAM.

The boundaries are blurred: GLMs can also include nonlinear effects and interaction terms. If one adds many interaction terms to a GLM or a GAM, it becomes almost as complex as a neural network. On the other hand, the complexity of boosted trees models and neural networks can be reduced by enforcing additivity for all or some features.

In the next two sections, we will learn how to create such additive and partly additive boosted trees and neural nets. By partly additive, we mean models that are additive in some features but have interactions between other variables. It is just another term for “additive with (possibly complex) interactions”. Examples of such partly additive models in actuarial science are as follows:

  • Keep the “time” effect additive while letting the other covariates interact freely. In this way, the ML model can be manually calibrated to obtain unbiased predictions out-of-time.
  • Use an additive gender effect while allowing other features to interact. This makes it possible to retrospectively remove a potentially discriminatory gender effect from the ML model. (Why not simply drop gender from the model?)
  • The location can be represented by a complex interaction effect between latitude and longitude, leaving all other features additive for interpretability.
  • Same as the last example, but representing location by a large set of (often highly-correlated) features like “distance to next bus stop” or “distance to next restaurant”, see Mayer et al. (2022) for two case studies with real-estate data. (How can such models be used for supervised dimension reduction?)
  • In a car insurance pricing model, one could use additive features for the driver’s attributes, while allowing the vehicle characteristics (engine size, power, speed, …) to interact freely.

Furthermore, we will see how monotonic constraints can additionally improve intrinsic interpretability of boosted trees.

Remarks:

  • Single decision trees do not fit into the above hierarchy. They are neither linear nor additive, but still intrinsically interpretable.
  • For models with link functions, additivity is meant at the link scale, since the back-transformation to the response scale introduces interaction effects. In most cases, we interpret such a model on the link scale.
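
A tiny numeric illustration of the last remark (with hypothetical coefficients): a model that is additive on the log (link) scale is no longer additive on the response scale.

# Additive on the link scale ...
m_link <- function(x1, x2) 0.5 * x1 + 0.25 * x2
# ... but not on the response scale after applying the inverse link exp()
m_resp <- function(x1, x2) exp(m_link(x1, x2))
m_link(1, 1) - m_link(0, 1)  # 0.5, same as m_link(1, 0) - m_link(0, 0)
m_resp(1, 1) - m_resp(0, 1)  # differs from m_resp(1, 0) - m_resp(0, 0)
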
library(tidyverse)
library(withr)
library(mgcv)
library(lightgbm)
library(MetricsWeighted)
library(hstats)
library(shapviz)
library(patchwork)
library(caret)  # for scaling
library(keras)
# install_keras() # if no Python with TensorFlow is installed yet
# devtools::install_github("andrie/deepviz")
library(deepviz)

source("functions.R")

# Reload data and model-related objects
main <- "french_motor"
load(file.path(main, "intro.RData"))
fit_lgb <- lgb.load(file.path(main, "fit_lgb.txt"))

# Shuffle training data to be sure that batch creation of neural nets is random
with_seed(938, 
  train <- train[sample(1:nrow(train)), ]
)

# Calculates performance, plots c-ICE plots of selected features, and optionally
# SHAP dependence plots.
# Note that this function is not written in a neat way. We would need to pass
# many more arguments to make it self-contained.
# Note: We try to select the evaluation points of the ICE curves so that the
# SHAP dependence plots agree up to a vertical shift.
# Note: The pred_fun() needs to return predictions on original scale, not log.
summary_model <- function(model, 
                          pred_fun = predict, 
                          v = c("DrivAge", "logDensity", "VehAge", "VehBrand"),
                          shap_dependence = FALSE,
                          ...) {
  with_seed(9,
    X_explain <- sample_n(train, 1000)          
  )
  
  score <- poisson_scorer(
    model, 
    X = test, 
    y = test$Freq, 
    w = test$Exposure,
    reference_mean = weighted.mean(train$Freq, train$Exposure),
    pred_fun = pred_fun,
    ...
  )$Pseudo_R2
  
  pred_log <- function(m, X, ...) {
    log(pred_fun(m, X, ...))
  }
  p <- list()
  for (z in v) {
    g <- unique(X_explain[[z]])
    p[[z]] <- ice(model, z, X = X_explain, pred_fun = pred_log, grid = g, ...) |> 
      plot(center = TRUE) +
      ggtitle(sprintf("c-ICE plot of '%s'", z)) +
      ylab("Centered log frequency")
  }
      
  the_title <- sprintf(
    "%s (deviance gain %s)", 
    deparse1(substitute(model)),
    scales::percent(score, accuracy = 0.01)
  )
  
  
  print(wrap_plots(p) + plot_annotation(the_title))
  
  if (shap_dependence) {
    shp <- shapviz(model, X_pred = data.matrix(X_explain[xvars]), X = X_explain)
    print(sv_dependence(shp, v = v) + plot_annotation("SHAP dependence plots"))
  }
}

2 Structuring Boosted Trees

While the prediction function of a single decision tree is simple to interpret, the same is not true for boosted trees models. Their prediction function m(\boldsymbol x) = m_1(\boldsymbol x) + \dots + m_K(\boldsymbol x) equals the sum of K \ge 1 decision trees m_k, and one has to rely on the post-hoc explainability methods from Chapter 2 to understand m approximately. Here, we will learn two ways to simplify the internal structure of a boosted trees model to improve its (intrinsic) interpretability. The first approach produces additive models. The second approach is more general and can also be used to fit partly additive models.

2.1 Additive boosted trees

A tree stump is a decision tree with a single split. Its prediction function has the form v_1 + (v_2 - v_1)\mathbb I(x^{(j)} > s), where \mathbb I is the indicator function, s the split position of the selected variable X^{(j)}, and v_1, v_2 are the values of the left and right leaf. A boosted trees model with tree stumps m_1, \dots, m_K as base learners (boosted tree stumps) is thus an additive model m(\boldsymbol x) = f_1(x^{(1)}) + \dots + f_p(x^{(p)}) with piecewise constant functions f_j, 1 \le j \le p. The function f_j equals the sum of all m_k that split on X^{(j)}.
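
To make this concrete, here is a tiny sketch of such a stump as an R function; the split point and leaf values are made up.

# Hypothetical stump splitting "VehAge" at s = 10 with left leaf value v1 and
# right leaf value v2
stump <- function(veh_age, s = 10, v1 = 0.05, v2 = -0.02) {
  v1 + (v2 - v1) * (veh_age > s)
}
stump(c(3, 25))  # 0.05 -0.02
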

A schematic view of such a model for our French MTPL dataset could look as follows:

Scheme 1: Additive boosted trees

Unlike the classic GAM (Hastie and Tibshirani 1986, 1990; Wood 2017), the functions f_j are not smooth, which is a disadvantage in some applications. An advantage of using boosted tree stumps to fit GAMs is that modern boosting implementations can handle large datasets consisting of millions of rows and many covariates, where classic approaches would be too slow. The resulting model is as easy to interpret as a classic GAM, i.e., each component f_j can be visualized, for instance, by an ICE or PDP plot that provides a complete description of the effect of that feature.

Remark: The SHAP dependence plot of an additive feature in a boosted trees model is equivalent to the PDP/ICE plot up to a vertical shift. Thus, like the PDP, it can be interpreted Ceteris Paribus (Mayer 2022).

Additive boosted trees models are discussed in the literature, e.g., in Lou, Caruana, and Gehrke (2012) and Nori et al. (2019).

2.1.1 Example

First, we fit an additive Poisson boosted trees model with log-link to the French MTPL data. To force LightGBM to use tree stumps as base learners, we can set either num_leaves = 2 or max_depth = 1. The other parameters have been found by random parameter search using (grouped) five-fold cross-validation. How does the model compare with the classic GAM?

Besides showing relative deviance gains on the test dataset, throughout this section, we will report c-ICE plots of selected variables. For boosted trees models, we will additionally show corresponding SHAP dependence plots. c-ICE plots are evaluated at all unique feature values.

First, we analyze the classic GAM, then the additive LightGBM model.

X_train <- data.matrix(train[xvars])

dtrain <- lgb.Dataset(
  X_train, 
  label = train$Freq, 
  weight = train$Exposure,
  params = list(feature_pre_filter = FALSE)
)

# Found by cross-validation (-> french_motor/tune_additive_lgb.R)
# Of key importance: "num_leaves = 2" or "max_depth = 1"
params <- list(
  objective = "poisson",
  learning_rate = 0.5, 
  num_leaves = 2,
  lambda_l2 = 7.5,
  lambda_l1 = 0, 
  colsample_bynode = 1, 
  bagging_fraction = 1,
  poisson_max_delta_step = 0.7
)

with_seed(4438,
  fit_lgb_add <- lgb.train(params = params, data = dtrain, nrounds = 513)
)
## [LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.003620 seconds.
## You can set `force_row_wise=true` to remove the overhead.
## And if memory is not enough, you can set `force_col_wise=true`.
## [LightGBM] [Info] Total Bins 370
## [LightGBM] [Info] Number of data points in the train set: 542170, number of used features: 7
## [LightGBM] [Info] Start training from score -2.321650
pred_lgb <- function(object, newdata, ...) {
  predict(object, data.matrix(newdata[xvars]), ...)
}

summary_model(fit_gam, type = "response")

summary_model(fit_lgb_add, pred_fun = pred_lgb, shap_dependence = TRUE)

Comments:

  • The c-ICE curves are parallel, confirming additivity on the log scale for each covariate and both models.
  • The additive LightGBM model performs much better than the (classic) GAM model, probably a result of its non-smooth effects.
  • Thanks to additivity, the SHAP dependence plots of the LightGBM model indeed provide the same effects as the c-ICE curves.

2.2 Partly additive boosted trees

In many situations, assuming additivity for all features leads to an unrealistically simple model. One way to relax the assumption for boosted trees is to grow trees of depth \ell > 1 (not counting the root node). For instance, if \ell = 2, the model can learn all pairwise interactions (why?). Similarly, larger tree depths allow higher order interactions to be represented.

What if we want to model some features additively while using interactions for other features? Such partly additive models can be created by specifying interaction constraints IC = \{F_1, \dots, F_K\}, where each F_k \subseteq \mathcal M (the full feature set) defines a subset of features allowed to interact with each other (Lee, Lin, and Antonio 2015).

How do interaction constraints work technically? During tree growth, a split considers features only from those F_k that contain all previous split variables of the branch. Consequently, each branch in each tree will only use features from one F_k \in IC. This implies the desired structure for the tree and also for a tree ensemble.

Interaction constraints can be used to create partly additive boosted trees models. For instance, to have a model additive in feature X^{(j)}, we can set F_1 = \{X^{(j)}\}, and make sure that X^{(j)} \notin F_k, \ k > 1.

The following scheme shows a partly additive model with interaction constraints IC = \{\{\text{DrivAge}\}, \{\text{logDensity}\}, \{\text{PolicyRegion}\}, \{\text{VehAge}, \text{VehBrand}, \text{VehGas}, \text{VehPower}\}\}. The resulting model will be additive in driver-related features while allowing vehicle features to interact.

Scheme 2: Partly additive boosted trees

Remarks:

  • If all elements in IC are disjoint, each tree uses features from only one F_k. (The first split variable determines which of the F_k.)
  • As an alternative to boosted tree stumps, we can specify IC = \{\{X^{(1)}\}, \dots, \{X^{(p)}\}\} to create an additive boosted trees model, see the figure below.
Additive boosted trees via interaction constraints
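
A minimal sketch of this alternative, reusing the xvars vector of feature names defined earlier (hyperparameters omitted):

# One constraint group per feature -> fully additive boosted trees,
# even with num_leaves > 2
ic_additive <- as.list(xvars)
# pass as params$interaction_constraints to lgb.train()
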

2.2.1 Example

Let us now create a model with the above structure. We will compare it with the unconstrained LightGBM model.

# Build interaction constraint vector
ic <- list(
  c("VehAge", "VehPower", "VehBrand", "VehGas"), 
  "DrivAge",
  "logDensity",
  "PolicyRegion"
)

# Found by cross-validation (-> french_motor/tune_partly_additive_lgb.R)
# Of key importance: "interaction_constraints"
params <- list(
  objective = "poisson",
  learning_rate = 0.05, 
  num_leaves = 15,
  interaction_constraints = ic,
  lambda_l2 = 5,
  lambda_l1 = 0, 
  colsample_bynode = 1,
  bagging_fraction = 1,
  min_data_in_leaf = 20, 
  min_sum_hessian_in_leaf = 0, 
  poisson_max_delta_step = 0.1
)

with_seed(4458,
  fit_lgb_part_add <- lgb.train(params, data = dtrain, nrounds = 333, verbose = -1)
)

summary_model(fit_lgb, pred_fun = pred_lgb, shap_dependence = TRUE)

summary_model(fit_lgb_part_add, pred_fun = pred_lgb, shap_dependence = TRUE)

Comments:

  • The c-ICE curves of the partly additive model indeed show additive effects for driver-related features and interaction effects for vehicle characteristics. The SHAP analysis confirms this. The original model shows interaction effects for all features.
  • The performance of the partly additive model is not much lower than that of the original, fully flexible LightGBM model. This means that any suppressed interaction effects cannot be too strong.
  • The additive effect of “logDensity” is extremely shaky, which is unsatisfactory. Interestingly, the fully additive model was more stable in this respect. It is possible that the model tries to compensate for an omitted interaction. In the next subsection we will see how this problem can be at least partially solved.

2.3 Monotonic constraints

So far, we have improved the intrinsic interpretability of a black box by removing interactions. For tree-based models, another aspect can be exploited: Monotonicity. By monotonicity in a feature, we mean that m is a monotonically increasing or decreasing function in that feature, everything else being fixed.

In practice, the usefulness and trustworthiness of a model can drop dramatically if it fails to satisfy certain monotonic constraints. Here are two examples:

  • A claims severity model for car collisions is expected to be monotonically increasing in the car value, everything else being fixed.
  • A claims frequency model for car collisions is expected to be monotonically decreasing in the deductible, Ceteris Paribus.

How do monotonic constraints work? To enforce monotonicity for the jth feature in a decision tree, the rules are as follows:

  1. A candidate split that violates the condition on its child nodes receives an infinite penalty (or negative split gain) and is therefore discarded (Scheme 3a). This enforces monotonicity at this split.
  2. To enforce monotonicity also for splits further down the tree, values of child (and grand-child etc.) nodes are capped at the average node values of the constraining split, see Scheme 3b. Capping may introduce a certain amount of bias.

In Scheme 3a, Tree A represents a prediction function (on log claims frequency scale) that is monotonically decreasing in the vehicle age. The split on the lower left of Tree B violates this constraint and is discarded (and replaced by some other split).

Scheme 3a: Monotonic constraints - basic logic

Scheme 3b explains how node values of subsequent splits are capped to enforce monotonicity downstream:

Scheme 3b: Monotonic constraints - capping explained

Remarks:

  • Monotonicity also translates to tree ensembles such as boosted trees or random forests.
  • Use a monotonic constraint only if it makes sense conditional on all possible combinations of the other features. Take a house price model, for example: At first glance, predictions should increase monotonically with the number of rooms. At second glance, this is not necessarily a good idea, because for a fixed living area, more rooms are not necessarily better.
  • A monotonicity constraint helps to reduce wiggliness of an effect.
  • Furthermore, it can be combined with interaction constraints.
  • There are also implementations of GAMs, GLMs, and neural nets that allow to take monotonicity into account.
  • For flexibly modeled effects, violations of desired monotonicity often happen as a consequence of overfitting on outlying feature values, a problem that can already be dealt with during feature preprocessing by truncation.

2.3.1 Example

In our French MTPL example, there is no feature for which we should really impose monotonicity. Nevertheless, to illustrate the effect, we will force a negative effect for “VehAge” and a positive effect for “logDensity” to reduce its extreme wiggliness. Note that “logDensity” is modeled additively, while “VehAge” can freely interact with other vehicle features. For simplicity, we use the same LightGBM parameters as before.

# Using params from last example
params$monotone_constraints <- (xvars == "logDensity") - (xvars == "VehAge")

with_seed(4358,
  fit_lgb_monotone <- lgb.train(
    params, 
    data = dtrain, 
    nrounds = 333, 
    verbose = -1
  )
)  

summary_model(fit_lgb_monotone, pred_fun = pred_lgb, shap_dependence = TRUE)

Comments:

  • Indeed: Ceteris Paribus, the effect of “logDensity” is now monotonically increasing while the effect of “VehAge” decreases for each individual ICE curve.
  • Both effects are less wiggly.
  • The model performs only slightly worse than without monotonicity constraints, even without re-tuning of the parameters.
  • Monotonic constraints combine well with interaction constraints.
  • In SHAP dependence plots, monotonicity can only be seen when the feature is modeled additively. Otherwise, the vertical scatter of interaction effects covers it.
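
As a quick numeric check of the first comment (a sketch reusing X_train and the model fitted above; the chosen row is arbitrary), predictions should not increase when only “VehAge” is varied:

# Vary "VehAge" on a copy of a single observation, holding everything else fixed
veh_age_grid <- sort(unique(X_train[, "VehAge"]))
X_grid <- X_train[rep(1, length(veh_age_grid)), , drop = FALSE]
X_grid[, "VehAge"] <- veh_age_grid
p <- predict(fit_lgb_monotone, X_grid)
all(diff(p) <= 0)  # TRUE if monotonically decreasing in "VehAge"
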

3 Structuring Neural Nets

In this lecture, we have not yet worked with neural nets. While they are typically not as performant as boosted trees models for tabular data, they are extremely versatile. Here is just a small list of their capabilities: Neural nets can

  • mimic models like GLMs or GAMs,
  • automatically learn interactions and non-linear effects (like tree-based methods),
  • fit data larger than RAM (e.g. images),
  • learn “online” (update the model with additional data),
  • fit multiple response variables at the same time,
  • model input and output of dimension higher than two (images, videos, text, audio),
  • combine inputs and outputs of different types or modalities (e.g., text, images, and numbers),
  • fit data with sequential structure in both in- and output (e.g., a machine translator),
  • fit models with millions of parameters,
  • perform non-linear dimension reduction.

We assume that the reader has some background knowledge on neural nets and refer to James et al. (2021) or Chollet (2018) otherwise. In the examples, we will work with TensorFlow, using the handy Keras front-end.

3.1 Software

Since Keras and a part of TensorFlow are written in Python, running the examples in this chapter will require Python with TensorFlow >= 2.11. You can obtain it like this:

  1. Install the R package {keras}.
  2. Run keras::install_keras() to get a minimal Python installation with TensorFlow.
  3. If the following code works, you are all set. (Some red start-up messages/warnings are okay.)
library(tensorflow)
tf$constant("Hello Tensorflow!")

Within Keras, we will work with the functional API. It allows us to represent very general model structures (also called architectures in neural network jargon). Actually, any model that can be characterized as a directed acyclic graph (DAG) could be implemented using this interface. Non-trivial examples include additive and partly additive models.

3.2 A simple neural net

To familiarize ourselves with Keras and get used to neural network terminology, we will first create a very simple neural network - a GLM. It consists of two layers only: The input layer represents the features, and the output layer represents the response variable. The values of the output layer are transformed by the exponential “activation” function, corresponding to the inverse link function of the Poisson GLM. The output layer is fully connected or dense, which means that the value of its node is a weighted sum of all node values of the previous layer (in this case, of the input layer), see the left hand side of Scheme 4:

Scheme 4: (a) Neural net without hidden layer, (b) with hidden layer

The weights are the parameters estimated by gradient descent. Normally, the gradient descent update steps are computed by backpropagation, which involves applying the chain rule to find the derivatives of the loss function with respect to all parameters. For a simple model like a GLM, this is straightforward. But imagine a deep learning model with 50 hidden layers and a non-linear transformation after each hidden layer! Computing millions or even billions of derivatives (without seeing the data) is one of the greatest achievements of modern deep learning implementations like TensorFlow and PyTorch.

(Mini-batch) gradient descent with backpropagation works as follows:

Let m_\beta denote a neural net with parameters \beta, and Q(m_\beta, D) = \sum_{(y_i, \boldsymbol x_i) \in D} L(m_\beta(\boldsymbol x_i), y_i) its total loss on a data set D with respect to the loss function L.

  1. Initialize the parameter vector \beta with random values \hat \beta.
  2. Forward step: Calculate Q(m_{\hat\beta}, D_\text{batch}) on a batch D_\text{batch} of observations. This is a small subset of the training data.
  3. Backpropagation step: Modify \hat \beta to improve Q(m_{\hat\beta}, D_\text{batch}) by gradient descent: Calculate the vector of partial derivatives \nabla \hat \beta = \frac{\partial Q(m_\beta, D_\text{batch})}{\partial \beta}\large\mid_{\beta = \hat \beta} at the current estimates \hat \beta. Use it to update \hat \beta \leftarrow \hat \beta - \lambda \nabla \hat \beta, where \lambda > 0 is a sufficiently small learning rate. In neural nets, parameters are organized in multiple layers, which makes it difficult to calculate \nabla \hat\beta. This is where backpropagation enters the game: It calculates the partial derivatives layer per layer using the chain rule, starting from the output layer.
  4. Repeat Steps 2 and 3 until each observation appeared in a batch. This is called an epoch.
  5. Repeat Step 4 for multiple epochs until some stopping criterion triggers.
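
To make these steps concrete, here is a minimal sketch (not part of the lecture code) of mini-batch gradient descent for a Poisson GLM with log link on toy data; the gradient is worked out by hand instead of via backpropagation, and all values are illustrative.

# Toy data: intercept plus one feature, Poisson response
set.seed(1)
n <- 1000
X <- cbind(intercept = 1, x = runif(n))
beta_true <- c(-2, 1)
y <- rpois(n, exp(X %*% beta_true))

beta_hat <- c(0, 0)                                      # Step 1: initialize
lambda <- 0.1                                            # learning rate
batches <- split(sample(n), rep(1:10, length.out = n))   # ten mini-batches

for (epoch in 1:50) {                                    # Step 5: multiple epochs
  for (idx in batches) {                                 # Step 4: loop over batches
    mu <- exp(X[idx, ] %*% beta_hat)                     # Step 2: forward step
    # Step 3: gradient of the Poisson negative log-likelihood is X'(mu - y)
    grad <- t(X[idx, ]) %*% (mu - y[idx]) / length(idx)
    beta_hat <- beta_hat - lambda * as.vector(grad)      # gradient descent update
  }
}
beta_hat  # should be close to beta_true
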

Remarks:

  • Step 5 can be specified by an early-stopping callback. A callback is an action executed during training. Another callback would be to save the parameters after each epoch.
  • To improve convergence, input variables should be on a comparable scale. We will standardize them accordingly by subtracting their (training) mean and then dividing by the (training) standard deviation. Alternatively, we could scale the values of each feature so that they range from 0 to 1 (min-max scaling) or from -1 to 1.
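
A minimal sketch of this standardization, assuming the feature matrix X_train from above and its test counterpart (the later examples use caret::preProcess() for the same purpose):

# Always use *training* means and standard deviations, also for the test data
mu <- colMeans(X_train)
sigma <- apply(X_train, 2, sd)
X_train_scaled <- scale(X_train, center = mu, scale = sigma)
X_test_scaled <- scale(data.matrix(test[xvars]), center = mu, scale = sigma)
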

3.2.1 Example: a GLM neural net

Let us now fit a claims frequency GLM with Keras/TensorFlow, i.e., the model in Scheme 4 (a). For simplicity, we encode all factor variables as integers, even if this is not always ideal (see later examples for how to create better models). We use a learning rate of 0.02. This means that each parameter update takes a step of size 0.02 in the direction of the negative gradient.

# Standardization - required to improve init step
scaler <- preProcess(X_train)  # predict(scaler, head(X_train))

# Callbacks
cb <- list(
  callback_early_stopping(patience = 20),
  callback_reduce_lr_on_plateau(patience = 5)
)

# Mimics a GLM
make_glm <- function() {
  k_clear_session()
  tensorflow::set_random_seed(469)

  input <- layer_input(ncol(X_train))
  output <- input |> 
    layer_dense(units = 1, activation = "exponential")
  
  # Connect inputs and output
  keras_model(input, output)
}

# Create, compile and fit model
nn_glm <- make_glm() |> 
  compile(
    optimizer = optimizer_adam(0.02), loss = loss_poisson, weighted_metrics = list()
  )

# summary(nn_glm) # 8 parameters
plot_model(nn_glm)
history <- nn_glm |> 
  fit(
    x = predict(scaler, X_train), 
    y = train$Freq,
    sample_weight = train$Exposure,
    epochs = 100, 
    batch_size = 1e4,
    validation_split = 0.2,
    callbacks = cb,
    verbose = FALSE
  )
plot(history)

# Inspect
pred_nn <- function(m, X) 
  predict(
    m,
    predict(scaler, data.matrix(X[xvars])),
    batch_size = 1e4, 
    verbose = FALSE
  )

summary_model(fit_glm, type = "response")

summary_model(nn_glm, pred_fun = pred_nn)

Comments:

  • It worked: The resulting model indeed seems to be a generalized linear model.
  • The performance on the test data is slightly lower compared to the original GLM. This may be due to several reasons: First, for simplicity, we represented each feature by a linear function, including some factor variables. Second, the parameters were fitted to only 80% of the training observations. Third, we used gradient descent to fit the parameters, which is a very simple optimization algorithm.

3.3 More complex models

While it is great that a neural network can be used to fit a GLM, its strength lies in its hidden layers. The nodes on these hidden layers are latent variables that represent the original input features in a way that optimally predicts the response. The closer a hidden layer is to the output, the more optimal this representation is. This is referred to as representational learning: The neural network automatically learns relevant nonlinearities and interaction effects that would have to be added manually in a GLM. For this to work, you need a sufficiently flexible architecture (the layers and their nodes) and enough observations.

An important component is the activation function applied after each hidden layer. It transforms the weighted sum of each hidden node non-linearly, thus introducing interaction effects. Typical activation functions are the rectified linear unit ReLU(x) = \text{max}(0, x), the standard logistic function (“sigmoid”), and the closely related hyperbolic tangent. The activation function applied to the output node has a different purpose: it maps the network output to the scale of the response, just like the inverse link function in a GLM. A neural network without activation functions could only represent a linear function of the inputs, regardless of how many hidden layers the model has (why?).

3.3.1 Example: A “complex” neural net

We will now modify our GLM neural network by adding three hidden layers to the model, resulting in a model with 561 parameters. Typically, the first hidden layer has more nodes than input features (“feature explosion”), after which the number of nodes shrinks again.

On the one hand, the model needs a large number of parameters to be flexible enough. On the other hand, too many parameters lead to overfitting. For tabular data, a healthy ratio of rows per parameter is somewhere between 50 and 200. When working with unstructured data, different rules apply.

In Scheme 4b, we use only one hidden layer to not overload the image.

# Architecture of neural network with hidden layers
make_full_nn <- function() {
  k_clear_session()
  tensorflow::set_random_seed(469)

  input <- layer_input(7, name = "All_features")
  output <- input |> 
    layer_dense(20, activation = "tanh", name = "Hidden_1") |>
    layer_dense(15, activation = "tanh", name = "Hidden_2") |>
    layer_dense(5, activation = "tanh", name = "Hidden_3") |> 
    layer_dense(1, activation = "exponential", name = "Output")
  
  # Connect inputs and output
  keras_model(input, output)
}

# Create, compile and fit model
nn_full <- make_full_nn() |> 
  compile(
    optimizer = optimizer_adam(0.02),
    loss = loss_poisson,
    weighted_metrics = list()
  )

# summary(nn_full) # 561 parameters
plot_model(nn_full)
history <- nn_full |>
  fit(
    x = predict(scaler, X_train), 
    y = train$Freq,
    sample_weight = train$Exposure,
    epochs = 100, 
    batch_size = 1e4, 
    validation_split = 0.2,
    callbacks = cb,
    verbose = FALSE
  )
plot(history)

summary_model(nn_full, pred_fun = pred_nn)

Comments:

  • The ICE plots show very wild and non-linear effects with strong interactions. Thus, the hidden layers take care of feature representation in a fully automatic way. This is in contrast to classic GLMs, where such transformations have to be added manually.
  • The performance is considerably better than that of the GLM, but still lower than that of the boosted trees model.

3.4 Additive neural nets

In the first two examples on neural nets we have seen that we can use them to fit linear additive models as well as a typical black box. Thanks to the flexibility of neural networks, we can also use them to build GAMs. In such a “neural additive model” (Agarwal et al. 2020), each feature X^{(j)} is sent through a small single-output neural network that represents X^{(j)} as a non-linear smooth function. The outputs of these sub-networks are then directly connected to the output node to avoid interaction effects, see Scheme 5 (a):

Scheme 5: (a) Neural additive model, (b) partly additive neural net, source: Figure 1 in https://www.mdpi.com/1911-8074/15/5/193

Note that some components can also be represented linearly, i.e., without a sub-network. Furthermore, note that unordered factors in neural networks are generally represented either by one-hot-encoded binary variables or by a so-called embedding layer. An embedding layer transforms an integer-encoded factor into one (or more) numeric features.
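
Conceptually, a one-dimensional embedding is just a lookup table with one learned weight per factor level. A tiny sketch with made-up weights:

# Hypothetical learned weights for a factor with three levels (integer encoded as 1, 2, 3)
emb_weights <- c(0.10, -0.30, 0.05)
x_int <- c(1, 3, 2, 1)   # integer-encoded factor values of four observations
emb_weights[x_int]       # their one-dimensional embedding: 0.10 0.05 -0.30 0.10
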

3.4.1 Example

We will now build a neural additive model with three different types of inputs:

  • “VehBrand” and “PolicyRegion”: Integer encoded and represented by a 1-dimensional embedding layer each.
  • “VehGas” and “logDensity”: Scaled and represented by a linear function each. (“VehGas” is binary, so scaling would not be necessary.)
  • “VehAge”, “DrivAge”, and “VehPower”: Scaled and represented by a small neural network each.

The implied structure is similar to our classic GAM.

Normally, a statistical model requires a single data matrix for training. A neural network, like the one described here, is more flexible. In the following examples, we will pass each feature as a single-column matrix. In the same way, we could build a neural network using images, text, and numbers as inputs.

# Standardizes input and provides list of separate input matrices
prep_nn <- function(X, embed = c("VehBrand", "PolicyRegion")) {
  to_scale <- setdiff(colnames(X), embed)
  X_scaled <- predict(scaler, X)[, to_scale]
  X_scaled_with_emb <- cbind(X_scaled, X[, embed])
  lapply(asplit(X_scaled_with_emb, 2), as.matrix)
}

# Input structure of two observations
(example_rows <- prep_nn(X_train[1:2, ]))
## $VehAge
##             [,1]
## [1,] 0.745849390
## [2,] 0.005072458
## 
## $DrivAge
##            [,1]
## [1,] -1.1693169
## [2,] -0.1063998
## 
## $logDensity
##            [,1]
## [1,] -0.5020083
## [2,] -0.8232644
## 
## $VehPower
##            [,1]
## [1,] -0.2203684
## [2,] -0.2203684
## 
## $VehGas
##           [,1]
## [1,] -1.019561
## [2,] -1.019561
## 
## $VehBrand
##      [,1]
## [1,]    2
## [2,]    2
## 
## $PolicyRegion
##      [,1]
## [1,]    1
## [2,]    6
# Architecture of neural additive model
make_additive <- function() {
  k_clear_session()
  tensorflow::set_random_seed(469)

  # Linear logDensity
  logDensity_input <- layer_input(1, name = "logDensity")
  
  # Smooth DrivAge
  DrivAge_input <- layer_input(1, name = "DrivAge")
  DrivAge_smooth <- DrivAge_input |> 
    layer_dense(7, activation = "tanh", name = "Hidden_1") |> 
    layer_dense(1, name = "Smooth_1")
  
  # Categorical embedding of dimension 1
  PolicyRegion_input <- layer_input(1, name = "PolicyRegion")
  PolicyRegion_emb <- PolicyRegion_input |> 
    layer_embedding(6 + 1, 1, name = "Embed_1") |> 
    layer_flatten(name = "Flat_1")
  
  # Linear VehGas
  VehGas_input <- layer_input(1, name = "VehGas")

  # Smooth VehPower and VehAge
  VehPower_input <- layer_input(1, name = "VehPower")
  VehPower_smooth <- VehPower_input |> 
    layer_dense(3, activation = "tanh", name = "Hidden_2") |> 
    layer_dense(1, name = "Smooth_2")
  
  VehAge_input <- layer_input(1, name = "VehAge")
  VehAge_smooth <- VehAge_input |> 
    layer_dense(8, activation = "tanh", name = "Hidden_3") |> 
    layer_dense(1, name = "Smooth_3")

  # VehBrand as 1D embedding
  VehBrand_input <- layer_input(1, name = "VehBrand")
  VehBrand_emb <- VehBrand_input |> 
    layer_embedding(4 + 1, 1, name = "Embed_2") |> 
    layer_flatten(name = "Flat_2")

  # Names of list match names of input in any order. If unnamed, must be in right order
  inputs <- list(
    logDensity = logDensity_input,
    DrivAge = DrivAge_input,
    PolicyRegion = PolicyRegion_input,
    VehGas = VehGas_input,
    VehPower = VehPower_input,
    VehAge = VehAge_input,
    VehBrand = VehBrand_input
  )
  
  output <- list(VehGas_input, logDensity_input, VehAge_smooth, DrivAge_smooth, 
                 VehPower_smooth, VehBrand_emb, PolicyRegion_emb) |> 
    layer_concatenate() |> 
    layer_dense(1, activation = "exponential", name = "Output")
  
  keras_model(inputs, output)
}

# Create, compile and fit model
nn_additive <- make_additive() |> 
  compile(
    optimizer = optimizer_adam(0.1), loss = loss_poisson, weighted_metrics = list()
  )

# summary(nn_additive) # 77 parameters
plot_model(nn_additive)
history <- nn_additive |>
  fit(
    x = prep_nn(X_train), 
    y = train$Freq,
    sample_weight = train$Exposure,
    epochs = 100, 
    batch_size = 1e4, 
    validation_split = 0.2,
    callbacks = cb,
    verbose = FALSE
  )
plot(history)

pred_nn2 <- function(m, X) 
  predict(m, prep_nn(data.matrix(X[xvars])), batch_size = 1e4, verbose = FALSE)

summary_model(nn_additive, pred_fun = pred_nn2)

Comments:

  • The resulting model performs better than the classic GAM, but slightly worse than the additive boosted trees model.
  • According to the c-ICE plots, the model structure seems to be as desired.
  • To create a structured neural network, one usually needs a lot of program code.

3.5 Partly additive neural nets

Our last neural network was additive in all features. If we wanted to add pairwise interactions, we could simply use a neural network with a single hidden layer. But what if we wanted to create a partly additive model that kept driver-related effects additive while allowing vehicle variables to interact? Several architectures would provide such a model, see Scheme 5 (b) for a sketch of the idea and Mayer et al. (2022) for a use case in geographic modeling.

3.5.1 Example

In order to create a model as outlined above, we will directly connect the following components to the output layer, producing an additive model in each of the components:

  • “logDensity”: Scaled and represented as a linear function.
  • “DrivAge”: Scaled and represented by a small sub-network.
  • “PolicyRegion”: Integer encoded and represented by a one-dimensional embedding.
  • Vehicle features are represented by a sub-network with five output nodes. “VehBrand” enters the sub-network via an embedding layer.
# Architecture of partly additive model
make_partly_additive <- function() {
  k_clear_session()
  tensorflow::set_random_seed(469)

  # Additive non-vehicle features
  logDensity_input <- layer_input(1, name = "logDensity")
  
  DrivAge_input <- layer_input(1, name = "DrivAge")
  DrivAge_smooth <- DrivAge_input |> 
    layer_dense(7, activation = "tanh", name = "Hidden_DrivAge") |> 
    layer_dense(1, name = "Smooth_DrivAge")
  
  PolicyRegion_input <- layer_input(1, name = "PolicyRegion")
  PolicyRegion_emb <- PolicyRegion_input |> 
    layer_embedding(6 + 1, 1, name = "Embed_1") |> 
    layer_flatten(name = "Flat_1")

  # Vehicle features with interactions
  VehGas_input <- layer_input(1, name = "VehGas")
  VehPower_input <- layer_input(1, name = "VehPower")
  VehAge_input <- layer_input(1, name = "VehAge")
  
  VehBrand_input <- layer_input(1, name = "VehBrand")
  VehBrand_emb <- VehBrand_input |> 
    layer_embedding(4 + 1, 1, name = "Embed_2") |> 
    layer_flatten(name = "Flat_2")

  inputs <- list(
    logDensity = logDensity_input,
    DrivAge = DrivAge_input,
    PolicyRegion = PolicyRegion_input,
    VehGas = VehGas_input,
    VehPower = VehPower_input,
    VehAge = VehAge_input,
    VehBrand = VehBrand_input
  )
  
  veh_subnet <- list(VehGas_input, VehPower_input, VehAge_input, VehBrand_emb) |> 
    layer_concatenate() |> 
    layer_dense(15, activation = "tanh", name = "Hidden_Veh_1") |> 
    layer_dense(10, activation = "tanh", name = "Hidden_Veh_2") |> 
    layer_dense(5, name = "Hidden_Veh_3")
  
  output <- list(logDensity_input, DrivAge_smooth, PolicyRegion_emb, veh_subnet) |> 
    layer_concatenate() |> 
    layer_dense(1, activation = "exponential", name = "Output")
  
  keras_model(inputs, output)
}

# Create, compile and fit model
nn_part_add <- make_partly_additive() |> 
  compile(
    optimizer = optimizer_adam(0.1), loss = loss_poisson, weighted_metrics = list()
  )

# summary(nn_part_add) # 333 parameters
plot_model(nn_part_add)
history <- nn_part_add |>
  fit(
    x = prep_nn(X_train), 
    y = train$Freq,
    sample_weight = train$Exposure,
    epochs = 100, 
    batch_size = 1e4, 
    validation_split = 0.2,
    callbacks = cb,
    verbose = FALSE
  )
plot(history)

summary_model(nn_part_add, pred_fun = pred_nn2)

Comments:

  • Compared to the neural additive model, “VehAge” (and also the other vehicle-related variables) incorporates interaction effects with other vehicle characteristics. The driver characteristics remain additive.
  • Since some of the interaction effects are strong, the performance of the model has greatly improved compared to the neural additive model. Its performance is quite close to the unconstrained model.
  • The performance is worse than that of the corresponding boosted trees model with interaction constraints. Part of the difference can be explained by the fact that the neural network effects are smooth and “logDensity” is represented linearly.

4 Excursion: Tuning Boosted Trees

Hyperparameters of ML models (including GLMs) are usually selected or tuned with the help of train/test splits and (cross-)validation. However, each model class (GLMs, GAMs, random forests, boosted trees, neural nets, …) has its own peculiarities and subtleties that should be respected. In this section, we will learn how to tune the many hyperparameters of a boosted trees model, such as the objective, the learning rate, the number of trees (= boosting rounds), and further regularization parameters. We will closely follow the suggestions in the chapter “Trees” of Mayer (2021). It would be naive and irresponsible to use a completely untuned boosted trees model.

Our focus is on boosted trees, as they are usually among the most powerful modeling techniques for tabular data. In addition, we have learned the elegant workflow to build a strong GLM by analyzing such a model with SHAP.

4.1 Objective and metric

The first decision is to select a meaningful and ideally strictly consistent scoring/loss function for the distributional property T. On this basis, the evaluation metric and the objective are determined. The objective is used to fit the model, while the evaluation metric is used for model evaluation and parameter selection.

Examples:

  • For pricing models of expected claims frequency and claims severity, one typically uses the corresponding deviance as (strictly consistent) scoring function. The average deviance then serves as both the objective and the evaluation metric.
  • In many other regression situations for T = \mathbb E, one uses the squared error as strictly consistent loss function. The corresponding objective is the mean squared error (MSE), and the evaluation metric is either the MSE or its root (RMSE).
  • When T is the median, the absolute error L(z, y) = |y-z| is a strictly consistent scoring function. Its mean then serves as natural evaluation metric and objective.
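
As a small illustration of the first example, the average Poisson deviance can be computed with the {MetricsWeighted} package loaded at the beginning of the chapter; the vectors below are made up.

library(MetricsWeighted)

# Toy claim counts, predicted frequencies, and exposure weights
y <- c(0, 1, 0, 2, 0)
pred <- c(0.2, 0.8, 0.3, 1.5, 0.1)
w <- c(1, 0.5, 1, 1, 0.8)
deviance_poisson(actual = y, predicted = pred, w = w)  # average Poisson deviance
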

4.2 Number of boosting rounds

Selecting an appropriate number of boosting rounds/trees is important: too few rounds lead to underfitting, while too many often cause overfitting. A convenient strategy, called early stopping, is to calculate the evaluation metric after each boosting round by (cross-)validation. Once performance stops improving over a certain number of rounds, the boosting process is stopped. This approach is particularly useful because a good number of boosting rounds depends strongly on the choice of the other parameters, e.g., the learning rate. Thus, randomly selecting the number of boosting rounds along with the other parameters would often lead to a suboptimal solution.

4.3 Learning rate

The learning rate equals the weight of each tree in the final model. Its value (often between 0.005 and 1) can be chosen manually to achieve a reasonable number of boosting rounds by early stopping, aiming for somewhere between 100 and 1000 trees. Fewer trees result in too simple a model, while more trees take long to fit. To quickly find a good learning rate, take advantage of the rule that dividing the learning rate by two roughly doubles the number of boosting rounds required for comparable performance. For instance, if you start with a learning rate of 0.2 and the model already stops early after only 50 rounds, you might try a learning rate of 0.05 and expect a model that stops at about 200 trees (and hopefully performs slightly better).

4.4 Regularization parameters

Implementations of gradient boosted trees typically offer many parameters to control the size of the trees and other aspects of boosting, e.g.,

  • the number of leaf nodes,
  • the tree depth,
  • loss penalties,
  • row and column subsampling rates,
  • minimal leaf size,
  • minimal loss improvement to make another split etc.

These parameters are either selected one by one using (cross-)validation or jointly by randomized grid search. In the latter, many random parameter combinations are evaluated, and one typically selects the combination with the best (cross-)validation performance.

Note: The number of boosting rounds picked by early stopping often compensates for a suboptimal choice of other parameters. In fact, most parameters are highly interdependent in the sense that different parameter combinations can result in similar model performance.

4.5 Strategy

To sum up, a simple strategy for selecting good parameters for a boosted trees model is as follows:

  1. Start by selecting a meaningful and ideally strictly consistent loss function for the functional T of the response. This determines the objective and the evaluation metric.
  2. Choose the learning rate to produce a model with about 100 to 1000 trees, with early stopping by (cross-)validation.
  3. Select the other parameters manually or by random search using (cross-)validation.

Often the performance improvements from Step 3 are modest, so this part can sometimes be skipped, especially when working with large data.

4.6 Example: French MTPL (LightGBM)

How did we select the hyperparameters of the LightGBM model fitted on claims frequencies of the French MTPL data?

The initial step was to choose the Poisson deviance as loss function (strictly consistent for the expectation), determining both the objective and evaluation metric. The other steps are summarized in the code below. Compared to a standard LightGBM tuning workflow, we passed our own (grouped) cross-validation folds to lgb.cv().

# Rerunning takes long; results might depend on seed
library(tidyverse)
library(splitTools)
library(lightgbm)

# Reload data and model-related objects
main <- "french_motor"
load(file.path(main, "intro.RData"))

# Data interface of LightGBM
dtrain <- lgb.Dataset(
  data.matrix(train[xvars]), 
  label = train$Freq, 
  weight = train$Exposure, 
  params = list(feature_pre_filter = FALSE)
)

# STEP 1: Model for expected claims frequency -> Poisson deviance

# STEP 2: Select learning rate so that optimal number of rounds by early stopping is
# somewhere between 100 and 1000
params <- list(
  objective = "poisson",
  learning_rate = 0.05  # play with this value
)

# Grouped cross-validation folds (hold-out indices)
folds <- create_folds(train$group_id, k = 5, type = "grouped", invert = TRUE)

# k-fold grouped cross-validation to see how many trees are required by early stopping
cvm <- lgb.cv(
  params = params,
  data = dtrain,
  nrounds = 5000,
  folds = folds,
  # nfold = 5,  # for the ungrouped case
  early_stopping_rounds = 20,
  eval_freq = 50
)
cvm

# STEP 3: Random search for other parameters (some parameters are fixed)
grid <- expand.grid(
  iteration = NA,
  score = NA,
  objective = "poisson",
  learning_rate = 0.05,
  num_leaves = c(15, 31, 63),
  lambda_l2 = c(0, 2.5, 5, 7.5),
  lambda_l1 = c(0, 4),
  colsample_bynode = c(0.8, 1),
  bagging_fraction = c(0.8, 1),
  min_data_in_leaf = c(20, 50, 100),
  min_sum_hessian_in_leaf = c(0, 0.001, 0.1),
  poisson_max_delta_step = c(0.1, 0.7),  # for Poisson objective
  stringsAsFactors = FALSE
)

# Full grid search or randomized search?
max_size <- 32
grid_size <- nrow(grid)
if (grid_size > max_size) {
  grid <- grid[sample(grid_size, max_size), ]
  grid_size <- max_size
}

# Loop over grid and fit LGB with grouped five-fold CV and early stopping
grid_file <- file.path(main, "grid_base_lgb.rds")

pb <- txtProgressBar(0, grid_size, style = 3)
for (i in seq_len(grid_size)) {
  cvm <- lgb.cv(
    params = as.list(grid[i, -(1:2)]),
    data = dtrain,
    nrounds = 5000,
    folds = folds,
    # nfold = 5,  # for the ungrouped case
    early_stopping_rounds = 20,
    verbose = -1
  )

  # Store optimal number of boosting rounds and the cross-validation score
  grid[i, 1:2] <- as.list(cvm)[c("best_iter", "best_score")]
  setTxtProgressBar(pb, i)

  # Save grid to survive hard crashes
  saveRDS(grid, file = grid_file)
}

# Load grid
grid <- readRDS(grid_file) |> 
  arrange(score)

# Show some strong parameter combinations
grid |> 
  select_if(function(x) n_distinct(x) >= 2L) |> 
  head()

# Fit model on best parameter combination
fit_lgb <- lgb.train(
  params = as.list(grid[1, -(1:2)]), 
  data = dtrain, 
  nrounds = grid[1, "iteration"]
)

4.7 Australian car claims (XGBoost)

In the exercises, we have worked with XGBoost to model expected Australian car claims frequencies. Here is the full code that was originally used to find the proposed parameter combination of the model used, e.g., in the exercises of Chapter 1.

# Rerunning takes long; results might depend on seed
library(withr)
library(tidyverse)
library(insuranceData)
library(xgboost)

data(dataCar)

# Data preparation
prep <- dataCar |> 
  mutate(
    Freq = numclaims / exposure,
    Sev = claimcst0 / numclaims,
    veh_age = factor(veh_age),
    agecat = factor(agecat),
    veh_body = fct_lump_prop(veh_body, 0.1),
    log_value = log(pmax(0.3, pmin(15, veh_value)))
  )

xvars <- c("log_value", "agecat", "veh_age", "area", "veh_body", "gender")

# Data split
with_seed(304, 
  ix <- sample(nrow(prep), 0.8 * nrow(prep))
)
train <- prep[ix, ]
test <- prep[-ix, ]

# Data interface of XGB
dtrain <- xgb.DMatrix(
  data.matrix(train[xvars]), 
  label = train$Freq, 
  weight = train$exposure
)

# STEP 1: Model for the expected claims frequency 
# -> Poisson deviance is both strictly consistent and meaningful

# STEP 2: Select learning rate to get reasonable number of trees by early stopping
params <- list(
  objective = "count:poisson",
  learning_rate = 0.05  # play with this value
)

# k-fold cross-validation to see if number of trees is reasonable
cvm <- xgb.cv(
  params = params,
  data = dtrain,
  nrounds = 5000,
  nfold = 5,
  early_stopping_rounds = 20,
  showsd = FALSE, 
  print_every_n = 50
)
cvm

# STEP 3: Random search for other parameters (some parameters are fixed)
grid <- expand.grid(
  iteration = NA,
  cv_score = NA,
  train_score = NA,
  objective = "count:poisson",
  learning_rate = 0.05,
  max_depth = 2:4,
  reg_lambda = 0:3,
  reg_alpha = 0:3,
  colsample_bynode = c(0.8, 1),
  subsample = c(0.8, 1),
  min_split_loss = c(0, 0.001),
  max_delta_step = c(0.1, 0.7),  # with Poisson objective
  stringsAsFactors = FALSE
)

# Full grid search or randomized search?
max_size <- 32
grid_size <- nrow(grid)
if (grid_size > max_size) {
  grid <- grid[sample(grid_size, max_size), ]
  grid_size <- max_size
}

# Loop over grid and fit XGB with five-fold CV and early stopping
grid_file <- file.path("australian_car", "grid_xgb.rds")

pb <- txtProgressBar(0, grid_size, style = 3)
for (i in seq_len(grid_size)) {  # i <- 1
  cvm <- xgb.cv(
    params = as.list(grid[i, -(1:3)]),
    data = dtrain,
    nrounds = 5000,
    nfold = 5,
    early_stopping_rounds = 20,
    verbose = 0
  )
  
  # Store optimal number of boosting rounds and both the training and the CV score
  grid[i, 1] <- cvm$best_iteration
  grid[i, 2:3] <- cvm$evaluation_log[, c(4, 2)][cvm$best_iteration]
  setTxtProgressBar(pb, i)
  
  # Save grid to survive hard crashes
  saveRDS(grid, file = grid_file)
}

# Load and sort grid
grid <- readRDS(grid_file) |>
  arrange(cv_score)
head(grid)

# Fit final model with best parameter combination
with_seed(345, 
  fit_xgb <- xgb.train(
    params = as.list(grid[1, -(1:3)]), 
    data = dtrain, 
    nrounds = grid[1, "iteration"]
  )
)

5 Exercises

We use an insurance dataset available from openML (ID 42876). It was synthetically generated by Colin Priest. The raw data describes 100,000 workers compensation claims regarding ultimate loss (the model response), the initial case estimate (the main predictor), the dates of accident and reporting, demographic information on the worker and finally also a short text describing the accident.

The main aim will be to craft intelligible models predicting the ultimate claim amount per claim. All models should be Gamma regressions with log link.

You can freely do the programming yourself, or use the following code. We will focus on rows with WeeklyPay >= 200 and HoursWorkedPerWeek >= 20.

library(OpenML)
library(farff)
library(tidyverse)
library(lubridate)
library(withr)
library(xgboost)
library(MetricsWeighted)
library(hstats)
library(shapviz)
library(caret)
library(keras)

#===============================
# Download and preprocess
#===============================

raw <- tibble(getOMLDataSet(data.id = 42876)$data)

# Clip small and/or large values
clip <- function(x, low = -Inf, high = Inf) {
  pmax(pmin(x, high), low)  
}

prep <- raw |> 
  filter(WeeklyPay >= 200, HoursWorkedPerWeek >= 20) |> 
  mutate(
    Ultimate = clip(UltimateIncurredClaimCost, high = 1e6),
    LogInitial = log(clip(InitialCaseEstimate, 1e2, 1e5)),
    LogWeeklyPay = log(clip(WeeklyPay, 100, 2000)),
    LogAge = log(clip(Age, 17, 70)),
    Female = (Gender == "F") * 1L,
    Married = (MaritalStatus == "M") * 1L,
    PartTime = (PartTimeFullTime == "P") * 1L,
    
    DateTimeOfAccident = as_datetime(DateTimeOfAccident),
    
    LogDelay = log1p(as.numeric(ymd(DateReported) - as_date(DateTimeOfAccident))),
    DateNum = decimal_date(DateTimeOfAccident),
    WeekDay = wday(DateTimeOfAccident, week_start = 1),   # 1 = Monday
    Hour = hour(DateTimeOfAccident)
  )

# Lost claim amount -> in practice, use as correction or "smearing" factor
(smearing_factor <- 1 - with(prep, sum(Ultimate) / sum(UltimateIncurredClaimCost)))

#===============================
# Variable groups
#===============================

x_continuous <- c("LogInitial", "LogWeeklyPay", "LogDelay", "LogAge", "DateNum")
x_discrete <- c("PartTime", "Female", "Married", "WeekDay", "Hour")
xvars <- c(x_continuous, x_discrete)

#===============================
# Univariate analysis
#===============================

# Some rows
prep |>
  select(all_of(c("Ultimate", xvars))) |>
  head()

# Histogram of the response on log scale
ggplot(prep, aes(Ultimate)) +
  geom_histogram(bins = 19, fill = "darkred") +
  scale_x_log10()

# Histograms of continuous predictors
prep |> 
  select(all_of(x_continuous)) |> 
  pivot_longer(everything()) |> 
  ggplot(aes(value)) +
  geom_histogram(bins = 19, fill = "darkred") +
  facet_wrap("name", scales = "free", ncol = 2)

# Barplots of discrete predictors (using the fact that the values are all integers)
prep |> 
  select(all_of(x_discrete)) |> 
  pivot_longer(everything()) |> 
  ggplot(aes(factor(value))) +
  geom_bar(fill = "darkred") +
  facet_wrap("name", scales = "free", ncol = 2)

#===============================
# Data splits for models
#===============================

with_seed(656, 
  .in <- sample(nrow(prep), 0.8 * nrow(prep), replace = FALSE)
)

train <- prep[.in, ]
test <- prep[-.in, ]
y_train <- train$Ultimate
y_test <- test$Ultimate
X_train <- train[xvars]
X_test <- test[xvars]

#===============================
# XGBoost: Additive model 
# (Exercise 2)
#===============================

nrounds <- 484

params <- list(
  objective = "reg:gamma",
  learning_rate = 0.1, 
  max_depth = 1, 
  reg_lambda = 1,
  reg_alpha = 3, 
  colsample_bynode = 1, 
  subsample = 1, 
  min_split_loss = 0.001
)

#===============================
# XGBoost: Partly additive model
# (Exercise 4)
#===============================

# Build interaction constraint vector
additive <- c("Female", "DateNum")
ic <- c(
  list(which(!(xvars %in% additive)) - 1),
  as.list(which(xvars %in% additive) - 1)
)
ic

nrounds2 <- 202

params2 <- list(
  objective = "reg:gamma",
  learning_rate = 0.1, 
  max_depth = 2,
  interaction_constraints = ic,
  reg_lambda = 2,
  reg_alpha = 0,
  colsample_bynode = 0.8, 
  subsample = 1, 
  min_split_loss = 0
)

For all models, we will use the features specified in xvars.

  1. Classic model
    1. Create an additive Gamma GLM with log link.
    2. Interpret the effect of “LogInitial”.
    3. How large is the relative deviance reduction compared to an intercept-only model on the training data?
  2. Additive XGBoost model
    1. Fit an additive XGBoost Gamma regression using the parameters provided above. Note that XGBoost automatically uses a log link when fitting a Gamma model.
    2. How much better is its relative deviance gain on the test data compared to the (training) performance of the GLM?
    3. Interpret the effects with a SHAP analysis.
    4. Refit the model by forcing “LogWeeklyPay” to have a monotonically non-decreasing effect. What is the impact on the test performance? On the SHAP dependence plot of this variable?
  3. Interactions
    1. Describe in words how to construct a boosted trees model with pairwise interactions.
    2. Same question for a neural net.
    3. How to construct a boosted trees model that is additive in two variables and has pairwise interactions for the other features?
  4. Partly additive XGBoost model
    1. Fit an XGBoost Gamma regression that is additive in “Female” and “DateNum” and has pairwise interactions for the other variables. Use the parameter combination provided above.
    2. Why could a model like this make sense?
    3. Compare its relative deviance gain on the test data with the additive model of Exercise 2ii.
    4. Study SHAP dependence plots. Does the model have the desired structure?
  5. Neural net
    1. Study the code below that creates an additive Gamma regression with log link. Play with its parameters and the architecture. Can you improve the test performance?
    2. Optional. Modify the model so that it is additive in “Female” and “DateNum” but contains interactions for the other variables. Inspect the model using c-ICE plots.
#==========================================
# Neural net of Exercise 5i
# The code above needs to be run first
#==========================================

# Callbacks
cb <- list(
  callback_early_stopping(patience = 20),
  callback_reduce_lr_on_plateau(patience = 5)
)

# Constructing Gamma deviance loss
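# (up to a factor of 2 and an additive constant, this equals the unit Gamma deviance)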
loss_gamma <- function(y_true, y_pred) {
  -k_log(y_true / y_pred) + y_true / y_pred
}

# Scaling and splitting into feature lists
scaler <- preProcess(X_train)
prep_nn <- function(X) {
  X_scaled <- predict(scaler, X)
  lapply(asplit(X_scaled, 2), as.matrix)
}
(example_rows <- prep_nn(X_train[1:2, ]))

# Building the architecture
make_additive <- function() {
  k_clear_session()
  tensorflow::set_random_seed(469)
  
  v <- colnames(X_train)
  outputs <- inputs <- lapply(v, function(z) layer_input(1, name = z))
  names(inputs) <- v
  
  # Smooth only non-binary features
  for (i in 1:length(v)) {
    if (!(v[i] %in% c("PartTime", "Female", "Married")))
      outputs[[i]] <- outputs[[i]] |>     
        layer_dense(8, activation = "tanh") |>
        layer_dense(4, activation = "tanh") |> 
        layer_dense(1)
  }

  output <- outputs |> 
    layer_concatenate() |> 
    layer_dense(units = 1, activation = "exponential")
  
  keras_model(inputs, output)
}

# Create, compile, and fit model
nn_additive <- make_additive()
# summary(nn_additive) # 410 parameters

history <- nn_additive |> 
  compile(optimizer = optimizer_adam(0.01), loss = loss_gamma) |> 
  fit(
    x = prep_nn(X_train), 
    y = y_train, 
    epochs = 100, 
    batch_size = 400, 
    validation_split = 0.2,
    callbacks = cb
  )

# Inspect model
pred_nn <- function(m, X, log_scale = FALSE) {
  out <- predict(m, prep_nn(X[xvars]), batch_size = 1000, verbose = FALSE)
  if (log_scale) log(out) else out
} 

pred <- pred_nn(nn_additive, X_test)
r_squared_gamma(y_test, pred, reference_mean = mean(y_train))  # 0.330

for (v in c("LogWeeklyPay", "DateNum", "Female")) {
  p <- ice(nn_additive, v = v, X = X_train, pred_fun = pred_nn, log_scale = TRUE) |> 
    plot(center = TRUE)
  print(p)
}

References

Agarwal, Rishabh, Nicholas Frosst, Xuezhou Zhang, Rich Caruana, and Geoffrey E. Hinton. 2020. “Neural Additive Models: Interpretable Machine Learning with Neural Nets.” CoRR abs/2004.13912. https://doi.org/10.48550/arXiv.2004.13912.
Chollet, François. 2018. Deep Learning with R. Shelter Island, New York: Manning Publications.
Hastie, Trevor, and Robert Tibshirani. 1986. “Generalized Additive Models.” Stat. Sci. 1 (3): 297–310. https://doi.org/10.1214/ss/1177013604.
———. 1990. Generalized Additive Models. Wiley Online Library.
James, G., D. Witten, T. Hastie, and R. Tibshirani. 2021. An Introduction to Statistical Learning: With Applications in R. Springer Texts in Statistics. Springer US.
Lee, Simon C. K., Sheldon Lin, and Katrien Antonio. 2015. “Delta Boosting Machine and Its Application in Actuarial Modeling.” In. Institute of Actuaries of Australia.
Lou, Yin, Rich Caruana, and Johannes Gehrke. 2012. “Intelligible Models for Classification and Regression.” In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 150–58. KDD ’12. New York, NY, USA: Association for Computing Machinery. https://doi.org/10.1145/2339530.2339556.
Lundberg, Scott M., and Su-In Lee. 2017. “A Unified Approach to Interpreting Model Predictions.” In Advances in Neural Information Processing Systems 30, edited by I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, 4765–74. Curran Associates, Inc. https://papers.nips.cc/paper/7062-a-unified-approach-to-interpreting-model-predictions.pdf.
Mayer, Michael. 2021. “Introduction to Machine Learning.” https://github.com/mayer79/ml_lecture.
———. 2022. “SHAP for Additively Modeled Features in a Boosted Trees Model.” arXiv. https://doi.org/10.48550/ARXIV.2207.14490.
Mayer, Michael, Steven C. Bourassa, Martin Hoesli, and Donato Scognamiglio. 2022. “Machine Learning Applications to Land and Structure Valuation.” Journal of Risk and Financial Management 15 (5). https://doi.org/10.3390/jrfm15050193.
Nori, Harsha, Samuel Jenkins, Paul Koch, and Rich Caruana. 2019. “InterpretML: A Unified Framework for Machine Learning Interpretability.” CoRR abs/1909.09223. https://doi.org/10.48550/arXiv.1909.09223.
Wood, Simon N. 2017. Generalized Additive Models: An Introduction with R. 2nd ed. United States: CRC Press. https://doi.org/10.1201/9781315370279.