“The best explanation of a simple model is the model itself”
Welcome to the last chapter of the XAI part of the lecture “Responsible Machine Learning with Insurance Applications”. It is dedicated to improving the intrinsic explainability of ML models.
Regarding intrinsic explainability, there is the following basic hierarchy:
The boundaries are blurred: GLMs can also include nonlinear effects and interaction terms. If one adds many interaction terms to a GLM or a GAM, it becomes almost as complex as a neural network. On the other hand, the complexity of boosted trees models and neural networks can be reduced by enforcing additivity for all or some features.
In the next two sections, we will learn how to create such additive and partly additive boosted trees and neural nets. By partly additive, we mean models that are additive in some features but have interactions between other variables. It is just another term for “additive with (possibly complex) interactions”. Examples of such partly additive models in actuarial science are as follows:
Furthermore, we will see how monotonic constraints can further improve the intrinsic interpretability of boosted trees.
Remarks:
library(tidyverse)
library(withr)
library(mgcv)
library(lightgbm)
library(MetricsWeighted)
library(hstats)
library(shapviz)
library(patchwork)
library(caret) # for scaling
library(keras)
# install_keras() # if no Python with TensorFlow is installed yet
# devtools::install_github("andrie/deepviz")
library(deepviz)
source("functions.R")
# Reload data and model-related objects
main <- "french_motor"
load(file.path(main, "intro.RData"))
fit_lgb <- lgb.load(file.path(main, "fit_lgb.txt"))
# Shuffle training data to be sure that batch creation of neural nets is random
with_seed(938,
train <- train[sample(1:nrow(train)), ]
)
# Calculates performance, plots c-ICE plots of selected features, and (optionally) SHAP.
# Note that this function is not written in a neat way. We would need to pass
# many more arguments to make it self-contained.
# Note: We try to select the evaluation points of the ICE curves so that the
# SHAP dependence plots agree up to a vertical shift.
# Note: The pred_fun() needs to return predictions on original scale, not log.
summary_model <- function(model,
pred_fun = predict,
v = c("DrivAge", "logDensity", "VehAge", "VehBrand"),
shap_dependence = FALSE,
...) {
with_seed(9,
X_explain <- sample_n(train, 1000)
)
score <- poisson_scorer(
model,
X = test,
y = test$Freq,
w = test$Exposure,
reference_mean = weighted.mean(train$Freq, train$Exposure),
pred_fun = pred_fun,
...
)$Pseudo_R2
pred_log <- function(m, X, ...) {
log(pred_fun(m, X, ...))
}
p <- list()
for (z in v) {
g <- unique(X_explain[[z]])
p[[z]] <- ice(model, z, X = X_explain, pred_fun = pred_log, grid = g, ...) |>
plot(center = TRUE) +
ggtitle(sprintf("c-ICE plot of '%s'", z)) +
ylab("Centered log frequency")
}
the_title <- sprintf(
"%s (deviance gain %s)",
deparse1(substitute(model)),
scales::percent(score, accuracy = 0.01)
)
print(wrap_plots(p) + plot_annotation(the_title))
if (shap_dependence) {
shp <- shapviz(model, X_pred = data.matrix(X_explain[xvars]), X = X_explain)
print(sv_dependence(shp, v = v) + plot_annotation("SHAP dependence plots"))
}
}
While the prediction function of a single decision tree is simple to interpret, the same is not true for boosted trees models. Its prediction function m(\boldsymbol x) = m_1(\boldsymbol x) + \dots + m_K(\boldsymbol x) equals the sum of K \ge 1 decision trees m_k, and one has to rely on post-hoc explainability methods from Chapter 2 to approximately understand m. Here, we will learn two ways to simplify the internal structure of a boosted trees model to improve its (intrinsic) interpretability. The first approach produces additive models. The second approach is more general and can also be used to fit partly additive models.
A tree stump is a decision tree with a single split. Its prediction function has the form v_1 + (v_2 - v_1)\mathbb I(x^{(j)} \le s), where \mathbb I is the indicator function, s the split position of the selected variable X^{(j)}, and v_1, v_2 are the predictions for x^{(j)} > s and x^{(j)} \le s, respectively. A boosted trees model with tree stumps m_1, \dots, m_K as base learners (boosted tree stumps) is thus an additive model m(\boldsymbol x) = f_1(x^{(1)}) + \dots + f_p(x^{(p)}) with piecewise constant functions f_j, 1 \le j \le p. The function f_j equals the sum of all m_k that split on X^{(j)}.
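To see why such a model is additive, here is a minimal R sketch (illustration only, not part of the lecture code): two stumps splitting on the same feature add up to a single piecewise constant component.
# A stump returns v2 if x <= s and v1 otherwise
stump <- function(x, s, v1, v2) v1 + (v2 - v1) * (x <= s)
# Sum of two stumps on the same feature: still a piecewise constant function of x alone
f <- function(x) stump(x, s = 5, v1 = 0.1, v2 = -0.3) + stump(x, s = 10, v1 = -0.2, v2 = 0.05)
f(c(3, 7, 12))  # three plateaus: -0.25, 0.15, -0.1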
A schematic view of such a model for our French MTPL dataset could look as follows:
Unlike the classic GAM (Hastie and Tibshirani 1986, 1990; Wood 2017), the functions f_j are not smooth, which is a disadvantage in some applications. An advantage of using boosted tree stumps to fit GAMs is that modern boosting implementations can handle large datasets consisting of millions of rows and many covariates, where classic approaches would be too slow. The resulting model is as easy to interpret as a classic GAM, i.e., each component f_j can be visualized, for instance, by an ICE or PDP plot that provides a complete description of the effect of that feature.
Remark: The SHAP dependence plot of an additive feature in a boosted trees model is equivalent to the PDP/ICE plot up to a vertical shift. Thus, like the PDP, it can be interpreted Ceteris Paribus (Mayer 2022).
Additive boosted trees models are discussed in the literature, e.g., in Lou, Caruana, and Gehrke (2012) and Nori et al. (2019).
First, we fit an additive Poisson boosted trees model with log-link to the French MTPL data. To force LightGBM to use tree stumps as base learners, we can set either num_leaves = 2 or max_depth = 1. The other parameters have been found by random parameter search using (grouped) five-fold cross-validation. How does the model compare with the classic GAM?
Besides showing relative deviance gains on the test dataset, throughout this section, we will report c-ICE plots of selected variables. For boosted trees models, we will additionally show corresponding SHAP dependence plots. c-ICE plots are evaluated at all unique feature values.
First, we analyze the classic GAM, then the additive LightGBM model.
X_train <- data.matrix(train[xvars])
dtrain <- lgb.Dataset(
X_train,
label = train$Freq,
weight = train$Exposure,
params = list(feature_pre_filter = FALSE)
)
# Found by cross-validation (-> french_motor/tune_additive_lgb.R)
# Of key importance: "num_leaves = 2" or "max_depth = 1"
params <- list(
objective = "poisson",
learning_rate = 0.5,
num_leaves = 2,
lambda_l2 = 7.5,
lambda_l1 = 0,
colsample_bynode = 1,
bagging_fraction = 1,
poisson_max_delta_step = 0.7
)
with_seed(4438,
fit_lgb_add <- lgb.train(params = params, data = dtrain, nrounds = 513)
)
## [LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.003620 seconds.
## You can set `force_row_wise=true` to remove the overhead.
## And if memory is not enough, you can set `force_col_wise=true`.
## [LightGBM] [Info] Total Bins 370
## [LightGBM] [Info] Number of data points in the train set: 542170, number of used features: 7
## [LightGBM] [Info] Start training from score -2.321650
pred_lgb <- function(object, newdata, ...) {
predict(object, data.matrix(newdata[xvars]), ...)
}
summary_model(fit_gam, type = "response")
summary_model(fit_lgb_add, pred_fun = pred_lgb, shap_dependence = TRUE)
Comments:
In many situations, assuming additivity for all features leads to an unrealistically simple model. One way to relax the assumption for boosted trees is to grow trees of depth \ell > 1 (not counting the root node). For instance, if \ell = 2, the model can learn all pairwise interactions (why?). Similarly, larger tree depths allow higher order interactions to be represented.
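As a minimal sketch (reusing the dtrain object from above; the parameter values are illustrative and untuned), a LightGBM model restricted to at most pairwise interactions could be requested via max_depth = 2:
# Illustrative, untuned parameters: depth 2 allows at most two split variables per branch
params_pairwise <- list(objective = "poisson", learning_rate = 0.05, max_depth = 2)
fit_lgb_pairwise <- lgb.train(params_pairwise, data = dtrain, nrounds = 300)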
What if we want to model some features additively while using interactions for other features? Such partly additive models can be created by specifying interaction constraints IC = \{F_1, \dots, F_K\}, where each F_k \subseteq \mathcal M defines a feature subset allowed to interact with each other (Lee, Lin, and Antonio 2015).
How do interaction constraints work technically? During tree growth, a split considers features only from those F_k that contain all previous split variables of the branch. Consequently, each branch in each tree will only use features from one F_k \in IC. This implies the desired structure for the tree and also for a tree ensemble.
Interaction constraints can be used to create partly additive boosted trees models. For instance, to have a model additive in feature X^{(j)}, we can set F_1 = \{X^{(j)}\}, and make sure that X^{(j)} \notin F_k, \ k > 1.
The following scheme shows a partly additive model with interaction constraints IC = \{\{\text{DrivAge}\}, \{\text{logDensity}\}, \{\text{PolicyRegion}\}, \{\text{VehAge}, \text{VehBrand}, \text{VehGas}, \text{VehPower}\}\}. The resulting model will be additive in driver-related features while allowing vehicle features to interact.
Remarks:
Let us now create a model with the above structure. We will compare it with the unconstrained LightGBM model.
# Build interaction constraint vector
ic <- list(
c("VehAge", "VehPower", "VehBrand", "VehGas"),
"DrivAge",
"logDensity",
"PolicyRegion"
)
# Found by cross-validation (-> french_motor/tune_partly_additive_lgb.R)
# Of key importance: "interaction_constraints"
params <- list(
objective = "poisson",
learning_rate = 0.05,
num_leaves = 15,
interaction_constraints = ic,
lambda_l2 = 5,
lambda_l1 = 0,
colsample_bynode = 1,
bagging_fraction = 1,
min_data_in_leaf = 20,
min_sum_hessian_in_leaf = 0,
poisson_max_delta_step = 0.1
)
with_seed(4458,
fit_lgb_part_add <- lgb.train(params, data = dtrain, nrounds = 333, verbose = -1)
)
summary_model(fit_lgb, pred_fun = pred_lgb, shap_dependence = TRUE)
summary_model(fit_lgb_part_add, pred_fun = pred_lgb, shap_dependence = TRUE)
Comments:
So far, we have improved the intrinsic interpretability of a black box by removing interactions. For tree-based models, another aspect can be exploited: Monotonicity. By monotonicity in a feature, we mean that m is a monotonically increasing or decreasing function in that feature, everything else being fixed.
In practice, the usefulness and trustworthiness of a model can drop dramatically if it fails to satisfy certain monotonic constraints. Here are two examples:
How do monotonic constraints work? To enforce monotonicity for the jth feature in a decision tree, the rules are as follows:
In Scheme 3a, Tree A represents a prediction function (on log claims frequency scale) that is monotonically decreasing in the vehicle age. The split on the lower left of Tree B violates this constraint and is discarded (and replaced by some other split).
Scheme 3b explains how node values of subsequent splits are capped to enforce monotonicity downstream:
Remarks:
In our French MTPL example, there is no feature for which we should really impose monotonicity. Nevertheless, to illustrate the effect, we will force a monotonically decreasing effect for “VehAge” and a monotonically increasing effect for “logDensity”, the latter also to reduce its extreme wiggliness. Note that “logDensity” is modeled additively, while “VehAge” can freely interact with other vehicle features. For simplicity, we use the same LightGBM parameters as before.
# Using params from last example
params$monotone_constraints <- (xvars == "logDensity") - (xvars == "VehAge")
with_seed(4358,
fit_lgb_monotone <- lgb.train(
params,
data = dtrain,
nrounds = 333,
verbose = -1
)
)
summary_model(fit_lgb_monotone, pred_fun = pred_lgb, shap_dependence = TRUE)
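As a quick sanity check (a sketch, not part of the original code), we can verify the “VehAge” constraint on a few artificial rows that differ only in that feature:
X_check <- X_train[rep(1, 5), ]             # copy one training row five times
X_check[, "VehAge"] <- c(0, 5, 10, 15, 20)  # vary only VehAge
predict(fit_lgb_monotone, X_check)          # predictions should be non-increasing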
Comments:
In this lecture, we have not yet worked with neural nets. While they are typically not as performant as boosted trees models for tabular data, they are extremely versatile. Here is just a small list of their capabilities: Neural nets can
We assume that the reader has some background knowledge on neural nets and refer to James et al. (2021) or Chollet (2018) otherwise. In the examples, we will work with TensorFlow, using the handy Keras front-end.
Since Keras and a part of TensorFlow are written in Python, running the examples in this chapter will require Python with TensorFlow >= 2.11. You can obtain a minimal Python installation with TensorFlow by running keras::install_keras(). To check the setup:
library(tensorflow)
tf$constant("Hello Tensorflow!")
Within Keras, we will work with the functional API. It allows us to represent very general model structures (also called architectures in neural network jargon). In fact, any model that can be characterized as a directed acyclic graph (DAG) can be implemented using this interface. Non-trivial examples include additive and partly additive models.
To familiarize ourselves with Keras and get used to neural network terminology, we will first create a very simple neural network - a GLM. It consists of two layers only: The input layer represents the features, and the output layer represents the response variable. The values of the output layer are transformed by the exponential “activation” function, corresponding to the inverse link function of the Poisson GLM. The output layer is fully connected or dense, which means that the value of its node is a weighted sum of all node values of the previous layer (in this case, of the input layer), see the left hand side of Scheme 4:
The weights are the parameters estimated by gradient descent. Normally, the gradient descent update steps are computed by backpropagation, which involves applying the chain rule to find the derivatives of the loss function with respect to all parameters. For a simple model like a GLM, this is straightforward. But imagine a deep learning model with 50 hidden layers and a non-linear transformation after each hidden layer! Computing millions or even billions of derivatives (without seeing the data) is one of the greatest achievements of modern deep learning implementations like TensorFlow and PyTorch.
(Mini-batch) gradient descent with backpropagation works as follows:
Let m_\beta denote a neural net with parameters \beta, and Q(m_\beta, D) = \sum_{(y_i, \boldsymbol x_i) \in D} L(m_\beta(\boldsymbol x_i), y_i) its total loss on a data set D with respect to the loss function L.
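As a numeric illustration (a minimal sketch with toy data, not part of the lecture code), a single mini-batch update beta <- beta - learning_rate * gradient for a Poisson model with log-link could look like this:
set.seed(1)
X <- cbind(1, matrix(rnorm(200), ncol = 2))              # toy design matrix with intercept
y <- rpois(100, lambda = exp(X %*% c(0.1, 0.5, -0.3)))   # toy Poisson response
beta <- rep(0, 3)                                        # initial parameters
lr <- 0.02                                               # learning rate
batch <- sample(nrow(X), 32)                             # random mini-batch
mu <- exp(X[batch, ] %*% beta)                           # current predictions
grad <- t(X[batch, ]) %*% (mu - y[batch]) / length(batch)  # gradient of the Poisson loss
(beta <- beta - lr * grad)                               # one gradient descent step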
Remarks:
Let us now fit a claims frequency GLM with Keras/TensorFlow, i.e., the model in Scheme 4 (a). For simplicity, we encode all factor variables as integers, even if this is not always sensible (see later examples for how to create better models). We use a learning rate of 0.02, which means that each parameter update takes a step of size 0.02 in the direction of the negative gradient.
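In formulas, this network computes m(\boldsymbol x) = \exp(w_0 + w_1 x^{(1)} + \dots + w_p x^{(p)}): the weights and the bias of the dense output layer play the role of the GLM coefficients, and the exponential activation acts as the inverse log-link.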
# Standardization - required to improve init step
scaler <- preProcess(X_train) # predict(scaler, head(X_train))
# Callbacks
cb <- list(
callback_early_stopping(patience = 20),
callback_reduce_lr_on_plateau(patience = 5)
)
# Mimics a GLM
make_glm <- function() {
k_clear_session()
tensorflow::set_random_seed(469)
input <- layer_input(ncol(X_train))
output <- input |>
layer_dense(units = 1, activation = "exponential")
# Connect inputs and output
keras_model(input, output)
}
# Create, compile and fit model
nn_glm <- make_glm() |>
compile(
optimizer = optimizer_adam(0.02), loss = loss_poisson, weighted_metrics = list()
)
# summary(nn_glm) # 8 parameters
plot_model(nn_glm)
history <- nn_glm |>
fit(
x = predict(scaler, X_train),
y = train$Freq,
sample_weight = train$Exposure,
epochs = 100,
batch_size = 1e4,
validation_split = 0.2,
callbacks = cb,
verbose = FALSE
)
plot(history)
# Inspect
pred_nn <- function(m, X)
predict(
m,
predict(scaler, data.matrix(X[xvars])),
batch_size = 1e4,
verbose = FALSE
)
summary_model(fit_glm, type = "response")
summary_model(nn_glm, pred_fun = pred_nn)
Comments:
While it is great that a neural network can be used to fit a GLM, its strength lies in its hidden layers. The nodes on these hidden layers are latent variables that represent the original input features in a way that optimally predicts the response. The closer a hidden layer is to the output, the more optimal this representation is. This is referred to as representational learning: The neural network automatically learns relevant nonlinearities and interaction effects that would have to be added manually in a GLM. For this to work, you need a sufficiently flexible architecture (the layers and their nodes) and enough observations.
Important components are the activation functions placed after each hidden layer. They transform the weighted sum of each hidden node non-linearly, thus introducing non-linearities and interaction effects. Typical activation functions are the rectified linear unit \text{ReLU}(x) = \text{max}(0, x), the standard logistic function (“sigmoid”), and the closely related hyperbolic tangent. The activation function applied to the output node has a different purpose: it maps the network output to the scale of the response, just like the inverse link function in a GLM. A neural network without activation functions could only represent a linear function of the inputs, regardless of how many hidden layers the model has (why?).
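Hint for the last question (a sketch of the argument): stacking two dense layers without activations gives m(\boldsymbol x) = W_2(W_1 \boldsymbol x + \boldsymbol b_1) + \boldsymbol b_2 = (W_2 W_1)\boldsymbol x + (W_2 \boldsymbol b_1 + \boldsymbol b_2), again an affine function of \boldsymbol x; the same collapse applies to any number of layers.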
We will now modify our GLM neural network by adding three hidden layers to the model, resulting in a model with 561 parameters. Typically, the first hidden layer has more nodes than input features (“feature explosion”), after which the number of nodes shrinks again.
On the one hand, the model needs a large number of parameters to be flexible enough. On the other hand, too many parameters lead to overfitting. For tabular data, a healthy ratio of rows per parameter is somewhere between 50 and 200. When working with unstructured data, different rules apply.
In Scheme 4b, we use only one hidden layer to not overload the image.
# Architecture of neural network with hidden layers
make_full_nn <- function() {
k_clear_session()
tensorflow::set_random_seed(469)
input <- layer_input(7, name = "All_features")
output <- input |>
layer_dense(20, activation = "tanh", name = "Hidden_1") |>
layer_dense(15, activation = "tanh", name = "Hidden_2") |>
layer_dense(5, activation = "tanh", name = "Hidden_3") |>
layer_dense(1, activation = "exponential", name = "Output")
# Connect inputs and output
keras_model(input, output)
}
# Create, compile and fit model
nn_full <- make_full_nn() |>
compile(
optimizer = optimizer_adam(0.02),
loss = loss_poisson,
weighted_metrics = list()
)
# summary(nn_full) # 561 parameters
plot_model(nn_full)
history <- nn_full |>
fit(
x = predict(scaler, X_train),
y = train$Freq,
sample_weight = train$Exposure,
epochs = 100,
batch_size = 1e4,
validation_split = 0.2,
callbacks = cb,
verbose = FALSE
)
plot(history)
summary_model(nn_full, pred_fun = pred_nn)
Comments:
In the first two examples on neural nets, we have seen that we can use them to fit linear additive models as well as a typical black box. Thanks to the flexibility of neural networks, we can also use them to build GAMs. In such a “neural additive model” (Agarwal et al. 2020), each feature X^{(j)} is sent through a small single-output neural network that represents its effect as a non-linear smooth function. The outputs of these sub-networks are then directly connected to the output node to avoid interaction effects, see Scheme 5 (a):
Note that some components can also be represented linearly, i.e., without a sub-network. Furthermore, note that unordered factors in neural networks are generally represented either by one-hot-encoded binary variables or by a so-called embedding layer. An embedding layer transforms an integer-encoded factor into one (or more) numeric features.
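As a standalone sketch (illustration only, with hypothetical object names), an embedding layer mapping integer codes 0, ..., 6 to one learned numeric value each could be built like this:
inp <- layer_input(1)
out <- inp |>
  layer_embedding(input_dim = 7, output_dim = 1) |>
  layer_flatten()
emb <- keras_model(inp, out)
predict(emb, matrix(c(0, 3, 6)), verbose = FALSE)  # one (still untrained) value per level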
We will now build a neural additive model with three different types of inputs:
The implied structure is similar to our classic GAM.
Normally, a statistical model requires a single data matrix for training. A neural network, like the one described here, is more flexible. In the following examples, we will pass each feature as a single-column matrix. In the same way, we could build a neural network using images, text, and numbers as inputs.
# Standardizes input and provides list of separate input matrices
prep_nn <- function(X, embed = c("VehBrand", "PolicyRegion")) {
to_scale <- setdiff(colnames(X), embed)
X_scaled <- predict(scaler, X)[, to_scale]
X_scaled_with_emb <- cbind(X_scaled, X[, embed])
lapply(asplit(X_scaled_with_emb, 2), as.matrix)
}
# Input structure of two observations
(example_rows <- prep_nn(X_train[1:2, ]))
## $VehAge
## [,1]
## [1,] 0.745849390
## [2,] 0.005072458
##
## $DrivAge
## [,1]
## [1,] -1.1693169
## [2,] -0.1063998
##
## $logDensity
## [,1]
## [1,] -0.5020083
## [2,] -0.8232644
##
## $VehPower
## [,1]
## [1,] -0.2203684
## [2,] -0.2203684
##
## $VehGas
## [,1]
## [1,] -1.019561
## [2,] -1.019561
##
## $VehBrand
## [,1]
## [1,] 2
## [2,] 2
##
## $PolicyRegion
## [,1]
## [1,] 1
## [2,] 6
# Architecture of neural additive model
make_additive <- function() {
k_clear_session()
tensorflow::set_random_seed(469)
# Linear logDensity
logDensity_input <- layer_input(1, name = "logDensity")
# Smooth DrivAge
DrivAge_input <- layer_input(1, name = "DrivAge")
DrivAge_smooth <- DrivAge_input |>
layer_dense(7, activation = "tanh", name = "Hidden_1") |>
layer_dense(1, name = "Smooth_1")
# Categorical embedding of dimension 1
PolicyRegion_input <- layer_input(1, name = "PolicyRegion")
PolicyRegion_emb <- PolicyRegion_input |>
layer_embedding(6 + 1, 1, name = "Embed_1") |>
layer_flatten(name = "Flat_1")
# Linear VehGas
VehGas_input <- layer_input(1, name = "VehGas")
# Smooth VehPower and VehAge
VehPower_input <- layer_input(1, name = "VehPower")
VehPower_smooth <- VehPower_input |>
layer_dense(3, activation = "tanh", name = "Hidden_2") |>
layer_dense(1, name = "Smooth_2")
VehAge_input <- layer_input(1, name = "VehAge")
VehAge_smooth <- VehAge_input |>
layer_dense(8, activation = "tanh", name = "Hidden_3") |>
layer_dense(1, name = "Smooth_3")
# VehBrand as 1D embedding
VehBrand_input <- layer_input(1, name = "VehBrand")
VehBrand_emb <- VehBrand_input |>
layer_embedding(4 + 1, 1, name = "Embed_2") |>
layer_flatten(name = "Flat_2")
# Names of list match names of input in any order. If unnamed, must be in right order
inputs <- list(
logDensity = logDensity_input,
DrivAge = DrivAge_input,
PolicyRegion = PolicyRegion_input,
VehGas = VehGas_input,
VehPower = VehPower_input,
VehAge = VehAge_input,
VehBrand = VehBrand_input
)
output <- list(VehGas_input, logDensity_input, VehAge_smooth, DrivAge_smooth,
VehPower_smooth, VehBrand_emb, PolicyRegion_emb) |>
layer_concatenate() |>
layer_dense(1, activation = "exponential", name = "Output")
keras_model(inputs, output)
}
# Create, compile and fit model
nn_additive <- make_additive() |>
compile(
optimizer = optimizer_adam(0.1), loss = loss_poisson, weighted_metrics = list()
)
# summary(nn_additive) # 77 parameters
plot_model(nn_additive)
history <- nn_additive |>
fit(
x = prep_nn(X_train),
y = train$Freq,
sample_weight = train$Exposure,
epochs = 100,
batch_size = 1e4,
validation_split = 0.2,
callbacks = cb,
verbose = FALSE
)
plot(history)
pred_nn2 <- function(m, X)
predict(m, prep_nn(data.matrix(X[xvars])), batch_size = 1e4, verbose = FALSE)
summary_model(nn_additive, pred_fun = pred_nn2)
Comments:
Our last neural network was additive in all features. If we wanted to add pairwise interactions, we could simply use a neural network with a single hidden layer. But what if we wanted to create a partly additive model that kept driver-related effects additive while allowing vehicle variables to interact? Several architectures would provide such a model, see Scheme 5 (b) for a sketch of the idea and Mayer et al. (2022) for a use case in geographic modeling.
In order to create a model as outlined above, we will directly connect the following components to the output layer, producing an additive model in each of the components:
# Architecture of partly additive model
make_partly_additive <- function() {
k_clear_session()
tensorflow::set_random_seed(469)
# Additive non-vehicle features
logDensity_input <- layer_input(1, name = "logDensity")
DrivAge_input <- layer_input(1, name = "DrivAge")
DrivAge_smooth <- DrivAge_input |>
layer_dense(7, activation = "tanh", name = "Hidden_DrivAge") |>
layer_dense(1, name = "Smooth_DrivAge")
PolicyRegion_input <- layer_input(1, name = "PolicyRegion")
PolicyRegion_emb <- PolicyRegion_input |>
layer_embedding(6 + 1, 1, name = "Embed_1") |>
layer_flatten(name = "Flat_1")
# Vehicle features with interactions
VehGas_input <- layer_input(1, name = "VehGas")
VehPower_input <- layer_input(1, name = "VehPower")
VehAge_input <- layer_input(1, name = "VehAge")
VehBrand_input <- layer_input(1, name = "VehBrand")
VehBrand_emb <- VehBrand_input |>
layer_embedding(4 + 1, 1, name = "Embed_2") |>
layer_flatten(name = "Flat_2")
inputs <- list(
logDensity = logDensity_input,
DrivAge = DrivAge_input,
PolicyRegion = PolicyRegion_input,
VehGas = VehGas_input,
VehPower = VehPower_input,
VehAge = VehAge_input,
VehBrand = VehBrand_input
)
veh_subnet <- list(VehGas_input, VehPower_input, VehAge_input, VehBrand_emb) |>
layer_concatenate() |>
layer_dense(15, activation = "tanh", name = "Hidden_Veh_1") |>
layer_dense(10, activation = "tanh", name = "Hidden_Veh_2") |>
layer_dense(5, name = "Hidden_Veh_3")
output <- list(logDensity_input, DrivAge_smooth, PolicyRegion_emb, veh_subnet) |>
layer_concatenate() |>
layer_dense(1, activation = "exponential", name = "Output")
keras_model(inputs, output)
}
# Create, compile and fit model
nn_part_add <- make_partly_additive() |>
compile(
optimizer = optimizer_adam(0.1), loss = loss_poisson, weighted_metrics = list()
)
# summary(nn_part_add) # 333 parameters
plot_model(nn_part_add)
history <- nn_part_add |>
fit(
x = prep_nn(X_train),
y = train$Freq,
sample_weight = train$Exposure,
epochs = 100,
batch_size = 1e4,
validation_split = 0.2,
callbacks = cb,
verbose = FALSE
)
plot(history)
summary_model(nn_part_add, pred_fun = pred_nn2)
Comments:
Hyperparameters of ML models (including GLMs) are usually selected or tuned with the help of train/test splits and (cross-)validation. However, each model class (GLMs, GAMs, random forests, boosted trees, neural nets, …) has its own peculiarities and subtleties that should be respected. In this section, we will learn how to tune the many hyperparameters of a boosted trees model, such as the objective, the learning rate, the number of trees (= boosting rounds), and further regularization parameters. We will closely follow the suggestions in the chapter “Trees” in Mayer (2021). It would be naive and irresponsible to use a completely untuned boosted trees model.
Our focus is on boosted trees, as they are usually among the most powerful modeling techniques for tabular data. In addition, we have already seen the elegant workflow of building a strong GLM by analyzing such a model with SHAP.
The first decision is to select a meaningful and ideally strictly consistent scoring/loss function for the distributional property T. On this basis, the evaluation metric and the objective are determined. The objective is used to fit the model, while the evaluation metric is used for model evaluation and parameter selection.
Examples:
Selecting an appropriate number of boosting rounds/trees is important: too few rounds lead to underfitting, while too many often cause overfitting. A convenient strategy, called early stopping, is to calculate the evaluation metric after each boosting round by (cross-)validation. Once performance stops improving over a certain number of rounds, the boosting process is stopped. This approach is particularly useful because a good number of boosting rounds depends strongly on the choice of the other parameters, e.g., the learning rate. Thus, randomly selecting the number of boosting rounds along with the other parameters would often lead to a suboptimal solution.
The learning rate equals the weight of each tree in the final model. Its value (often between 0.005 and 1) can be chosen manually to achieve a reasonable number of boosting rounds by early stopping, aiming for something between 100 and 1000 trees. Fewer trees result in too simple a model, while more trees take long to fit. To quickly find a good learning rate, take advantage of the rule of thumb that dividing the learning rate by two roughly doubles the number of boosting rounds required for comparable performance. For instance, if you start with a learning rate of 0.2 and the model already stops early after only 50 rounds, you might try a learning rate of 0.05 and expect a model that stops at about 200 trees (and hopefully performs slightly better).
Implementations of gradient boosted trees typically offer many parameters to control the size of the trees and other aspects of boosting, e.g.,
These parameters are selected either one by one using (cross-)validation or by randomized grid search, where many random parameter combinations are evaluated and the combination with the best (cross-)validation performance is typically selected.
Note: The number of boosting rounds picked by early stopping often compensates for a suboptimal choice of other parameters. In fact, most parameters are highly interdependent in the sense that different parameter combinations can result in similar model performance.
To sum up, a simple strategy for selecting good parameters for a boosted trees model is as follows:
Often the performance improvements from Step 3 are modest, so this part can sometimes be skipped, especially when working with large data.
How did we select the hyperparameters of the LightGBM model fitted on claims frequencies of the French MTPL data? The initial step was to choose the Poisson deviance as loss function (strictly consistent for the expectation), determining both the objective and the evaluation metric. The other steps are summarized in the code below. Compared to a standard LightGBM tuning workflow, we passed our own (grouped) cross-validation folds to lgb.cv().
# Rerunning takes long; results might depend on seed
library(tidyverse)
library(splitTools)
library(lightgbm)
# Reload data and model-related objects
main <- "french_motor"
load(file.path(main, "intro.RData"))
# Data interface of LightGBM
dtrain <- lgb.Dataset(
data.matrix(train[xvars]),
label = train$Freq,
weight = train$Exposure,
params = list(feature_pre_filter = FALSE)
)
# STEP 1: Model for expected claims frequency -> Poisson deviance
# STEP 2: Select learning rate so that optimal number of rounds by early stopping is
# somewhere between 100 and 1000
params <- list(
objective = "poisson",
learning_rate = 0.05 # play with this value
)
# Grouped cross-validation folds (hold-out indices)
folds <- create_folds(train$group_id, k = 5, type = "grouped", invert = TRUE)
# k-fold grouped cross-validation to see how many trees are required by early stopping
cvm <- lgb.cv(
params = params,
data = dtrain,
nrounds = 5000,
folds = folds,
# nfold = 5, # for the ungrouped case
early_stopping_rounds = 20,
eval_freq = 50
)
cvm
# STEP 3: Random search for other parameters (some parameters are fixed)
grid <- expand.grid(
iteration = NA,
score = NA,
objective = "poisson",
learning_rate = 0.05,
num_leaves = c(15, 31, 63),
lambda_l2 = c(0, 2.5, 5, 7.5),
lambda_l1 = c(0, 4),
colsample_bynode = c(0.8, 1),
bagging_fraction = c(0.8, 1),
min_data_in_leaf = c(20, 50, 100),
min_sum_hessian_in_leaf = c(0, 0.001, 0.1),
poisson_max_delta_step = c(0.1, 0.7), # for Poisson objective
stringsAsFactors = FALSE
)
# Full grid search or randomized search?
max_size <- 32
grid_size <- nrow(grid)
if (grid_size > max_size) {
grid <- grid[sample(grid_size, max_size), ]
grid_size <- max_size
}
# Loop over grid and fit LGB with grouped five-fold CV and early stopping
grid_file <- file.path(main, "grid_base_lgb.rds")
pb <- txtProgressBar(0, grid_size, style = 3)
for (i in seq_len(grid_size)) {
cvm <- lgb.cv(
params = as.list(grid[i, -(1:2)]),
data = dtrain,
nrounds = 5000,
folds = folds,
# nfold = 5, # for the ungrouped case
early_stopping_rounds = 20,
verbose = -1
)
# Store optimal number of boosting rounds and the cross-validation score
grid[i, 1:2] <- as.list(cvm)[c("best_iter", "best_score")]
setTxtProgressBar(pb, i)
# Save grid to survive hard crashes
saveRDS(grid, file = grid_file)
}
# Load grid
grid <- readRDS(grid_file) |>
arrange(score)
# Show some strong parameter combinations
grid |>
select_if(function(x) n_distinct(x) >= 2L) |>
head()
# Fit model on best parameter combination
fit_lgb <- lgb.train(
params = as.list(grid[1, -(1:2)]),
data = dtrain,
nrounds = grid[1, "iteration"]
)
In the exercises, we have worked with XGBoost to model expected Australian car claims frequencies. Here is the full code that was originally used to find the proposed parameter combination of the model used, e.g., in the exercises of Chapter 1.
# Rerunning takes long; results might depend on seed
library(withr)
library(tidyverse)
library(insuranceData)
library(xgboost)
data(dataCar)
# Data preparation
prep <- dataCar |>
mutate(
Freq = numclaims / exposure,
Sev = claimcst0 / numclaims,
veh_age = factor(veh_age),
agecat = factor(agecat),
veh_body = fct_lump_prop(veh_body, 0.1),
log_value = log(pmax(0.3, pmin(15, veh_value)))
)
xvars <- c("log_value", "agecat", "veh_age", "area", "veh_body", "gender")
# Data split
with_seed(304,
ix <- sample(nrow(prep), 0.8 * nrow(prep))
)
train <- prep[ix, ]
test <- prep[-ix, ]
# Data interface of XGB
dtrain <- xgb.DMatrix(
data.matrix(train[xvars]),
label = train$Freq,
weight = train$exposure
)
# STEP 1: Model for the expected claims frequency
# -> Poisson deviance is both strictly consistent and meaningful
# STEP 2: Select learning rate to get reasonable number of trees by early stopping
params <- list(
objective = "count:poisson",
learning_rate = 0.05 # play with this value
)
# k-fold cross-validation to see if number of trees is reasonable
cvm <- xgb.cv(
params = params,
data = dtrain,
nrounds = 5000,
nfold = 5,
early_stopping_rounds = 20,
showsd = FALSE,
print_every_n = 50
)
cvm
# STEP 3: Random search for other parameters (some parameters are fixed)
grid <- expand.grid(
iteration = NA,
cv_score = NA,
train_score = NA,
objective = "count:poisson",
learning_rate = 0.05,
max_depth = 2:4,
reg_lambda = 0:3,
reg_alpha = 0:3,
colsample_bynode = c(0.8, 1),
subsample = c(0.8, 1),
min_split_loss = c(0, 0.001),
max_delta_step = c(0.1, 0.7), # with Poisson objective
stringsAsFactors = FALSE
)
# Full grid search or randomized search?
max_size <- 32
grid_size <- nrow(grid)
if (grid_size > max_size) {
grid <- grid[sample(grid_size, max_size), ]
grid_size <- max_size
}
# Loop over grid and fit XGB with five-fold CV and early stopping
grid_file <- file.path("australian_car", "grid_xgb.rds")
pb <- txtProgressBar(0, grid_size, style = 3)
for (i in seq_len(grid_size)) { # i <- 1
cvm <- xgb.cv(
params = as.list(grid[i, -(1:3)]),
data = dtrain,
nrounds = 5000,
nfold = 5,
early_stopping_rounds = 20,
verbose = 0
)
# Store optimal number of boosting rounds and both the training and the CV score
grid[i, 1] <- cvm$best_iteration
grid[i, 2:3] <- cvm$evaluation_log[, c(4, 2)][cvm$best_iteration]
setTxtProgressBar(pb, i)
# Save grid to survive hard crashes
saveRDS(grid, file = grid_file)
}
# Load and sort grid
grid <- readRDS(grid_file) |>
arrange(cv_score)
head(grid)
# Fit final model with best parameter combination
with_seed(345,
fit_xgb <- xgb.train(
params = as.list(grid[1, -(1:3)]),
data = dtrain,
nrounds = grid[1, "iteration"]
)
)
We use an insurance dataset available from OpenML (ID 42876). It was synthetically generated by Colin Priest. The raw data describes 100,000 workers' compensation claims in terms of the ultimate loss (the model response), the initial case estimate (the main predictor), the dates of accident and reporting, demographic information on the worker, and a short text describing the accident.
The main aim will be to craft intelligible models predicting the ultimate claim amount per claim. All models should be Gamma regressions with log link.
You can freely do the programming yourself, or use the following code. We will focus on rows with WeeklyPay >= 200 and HoursWorkedPerWeek >= 20.
library(OpenML)
library(farff)
library(tidyverse)
library(lubridate)
library(withr)
library(xgboost)
library(MetricsWeighted)
library(hstats)
library(shapviz)
library(caret)
library(keras)
#===============================
# Download and preprocess
#===============================
raw <- tibble(getOMLDataSet(data.id = 42876)$data)
# Clip small and/or large values
clip <- function(x, low = -Inf, high = Inf) {
pmax(pmin(x, high), low)
}
prep <- raw |>
filter(WeeklyPay >= 200, HoursWorkedPerWeek >= 20) |>
mutate(
Ultimate = clip(UltimateIncurredClaimCost, high = 1e6),
LogInitial = log(clip(InitialCaseEstimate, 1e2, 1e5)),
LogWeeklyPay = log(clip(WeeklyPay, 100, 2000)),
LogAge = log(clip(Age, 17, 70)),
Female = (Gender == "F") * 1L,
Married = (MaritalStatus == "M") * 1L,
PartTime = (PartTimeFullTime == "P") * 1L,
DateTimeOfAccident = as_datetime(DateTimeOfAccident),
LogDelay = log1p(as.numeric(ymd(DateReported) - as_date(DateTimeOfAccident))),
DateNum = decimal_date(DateTimeOfAccident),
WeekDay = wday(DateTimeOfAccident, week_start = 1), # 1 = Monday
Hour = hour(DateTimeOfAccident)
)
# Lost claim amount -> in practice, use as correction or "smearing" factor
(smearing_factor <- 1 - with(prep, sum(Ultimate) / sum(UltimateIncurredClaimCost)))
#===============================
# Variable groups
#===============================
x_continuous <- c("LogInitial", "LogWeeklyPay", "LogDelay", "LogAge", "DateNum")
x_discrete <- c("PartTime", "Female", "Married", "WeekDay", "Hour")
xvars <- c(x_continuous, x_discrete)
#===============================
# Univariate analysis
#===============================
# Some rows
prep |>
select(all_of(c("Ultimate", xvars))) |>
head()
# Histogram of the response on log scale
ggplot(prep, aes(Ultimate)) +
geom_histogram(bins = 19, fill = "darkred") +
scale_x_log10()
# Histograms of continuous predictors
prep |>
select(all_of(x_continuous)) |>
pivot_longer(everything()) |>
ggplot(aes(value)) +
geom_histogram(bins = 19, fill = "darkred") +
facet_wrap("name", scales = "free", ncol = 2)
# Barplots of discrete predictors (using the fact that the values are all integers)
prep |>
select(all_of(x_discrete)) |>
pivot_longer(everything()) |>
ggplot(aes(factor(value))) +
geom_bar(fill = "darkred") +
facet_wrap("name", scales = "free", ncol = 2)
#===============================
# Data splits for models
#===============================
with_seed(656,
.in <- sample(nrow(prep), 0.8 * nrow(prep), replace = FALSE)
)
train <- prep[.in, ]
test <- prep[-.in, ]
y_train <- train$Ultimate
y_test <- test$Ultimate
X_train <- train[xvars]
X_test <- test[xvars]
#===============================
# XGBoost: Additive model
# (Exercise 2)
#===============================
nrounds <- 484
params <- list(
objective = "reg:gamma",
learning_rate = 0.1,
max_depth = 1,
reg_lambda = 1,
reg_alpha = 3,
colsample_bynode = 1,
subsample = 1,
min_split_loss = 0.001
)
#===============================
# XGBoost: Partly additive model
# # (Exercise 4)
#===============================
# Build interaction constraint vector
additive <- c("Female", "DateNum")
ic <- c(
list(which(!(xvars %in% additive)) - 1),
as.list(which(xvars %in% additive) - 1)
)
ic
nrounds2 <- 202
params2 <- list(
objective = "reg:gamma",
learning_rate = 0.1,
max_depth = 2,
interaction_constraints = ic,
reg_lambda = 2,
reg_alpha = 0,
colsample_bynode = 0.8,
subsample = 1,
min_split_loss = 0
)
For all models, we will use the features specified in xvars.
#==========================================
# Neural net of Exercise 5i
# The code above needs to be run first
#==========================================
# Callbacks
cb <- list(
callback_early_stopping(patience = 20),
callback_reduce_lr_on_plateau(patience = 5)
)
# Gamma deviance loss (up to an additive constant and a factor of 2)
loss_gamma <- function(y_true, y_pred) {
-k_log(y_true / y_pred) + y_true / y_pred
}
# Scaling and splitting into feature lists
scaler <- preProcess(X_train)
prep_nn <- function(X) {
X_scaled <- predict(scaler, X)
lapply(asplit(X_scaled, 2), as.matrix)
}
(example_rows <- prep_nn(X_train[1:2, ]))
# Building the architecture
make_additive <- function() {
k_clear_session()
tensorflow::set_random_seed(469)
v <- colnames(X_train)
outputs <- inputs <- lapply(v, function(z) layer_input(1, name = z))
names(inputs) <- v
# Smooth only non-binary features
for (i in 1:length(v)) {
if (!(v[i] %in% c("PartTime", "Female", "Married")))
outputs[[i]] <- outputs[[i]] |>
layer_dense(8, activation = "tanh") |>
layer_dense(4, activation = "tanh") |>
layer_dense(1)
}
output <- outputs |>
layer_concatenate() |>
layer_dense(units = 1, activation = "exponential")
keras_model(inputs, output)
}
# Create, compile, and fit model
nn_additive <- make_additive()
# summary(nn_additive) # 410 parameters
history <- nn_additive |>
compile(optimizer = optimizer_adam(0.01), loss = loss_gamma) |>
fit(
x = prep_nn(X_train),
y = y_train,
epochs = 100,
batch_size = 400,
validation_split = 0.2,
callbacks = cb
)
# Inspect model
pred_nn <- function(m, X, log_scale = FALSE) {
out <- predict(m, prep_nn(X[xvars]), batch_size = 1000, verbose = FALSE)
if (log_scale) log(out) else out
}
pred <- pred_nn(nn_additive, X_test)
r_squared_gamma(y_test, pred, reference_mean = mean(y_train)) # 0.330
for (v in c("LogWeeklyPay", "DateNum", "Female")) {
p <- ice(nn_additive, v = v, X = X_train, pred_fun = pred_nn, log_scale = TRUE) |>
plot(center = TRUE)
print(p)
}