This is the second chapter of the XAI part of the lecture “Responsible Machine Learning with Insurance Applications”.
It is dedicated to inspecting and explaining supervised learning models. The basic workflow usually follows these steps:
In addition to being a fruitful source of information, each of these steps may reveal problems in the modeling process that should be addressed. The fewer problems remain, the higher the confidence in the model and the modeler.
In this chapter, we will explore these steps in detail. The main references are Molnar (2019) and our tutorial on XAI in insurance (Mayer and Lorentzen 2020). The focus is on global, model-agnostic, post-hoc explainability.
In the R programming language, the most popular package for model-agnostic interpretation is “DALEX” (Biecek 2018). However, we will mainly work with our “hstats” package (Mayer 2023) since all of its methods are capable of handling case weights. This is particularly important in insurance applications, where many models use some form of weighting, such as exposure, sum insured, or premiums paid. Furthermore, {hstats} can calculate Friedman’s H-statistics.
In the section on SHAP, we will additionally use our package “shapviz” (Mayer 2022).
We start by loading the data and models from the Chapter “Introduction”, and define some helper functions and objects.
library(tidyverse)
library(patchwork)
library(withr)
library(mgcv)
library(lightgbm)
library(splines)
library(MetricsWeighted)
library(hstats)
library(rpart)
library(rpart.plot)
library(shapviz)
source("functions.R")
# Reload data and model-related objects
main <- "french_motor"
load(file.path(main, "intro.RData"))
fit_lgb <- lgb.load(file.path(main, "fit_lgb.txt"))
# Get predictions for all three models
models <- list(GLM = fit_glm, GAM = fit_gam, LGB = fit_lgb)
predict.list <- function(object, newdata, log_scale = FALSE) {
raw_preds <- cbind(
GLM = predict(object$GLM, newdata),
GAM = predict(object$GAM, newdata),
LGB = predict(object$LGB, data.matrix(newdata[xvars]), type = "raw")
)
if (log_scale) raw_preds else exp(raw_preds)
}
# Do we get predictions from all models?
predict(models, tail(test))
## GLM GAM LGB
## 1 0.12031704 0.11174467 0.09953763
## 2 0.14161424 0.15328422 0.10600385
## 3 0.11124378 0.10372200 0.09968847
## 4 0.10748088 0.09425247 0.09929751
## 5 0.09931363 0.09418269 0.09234605
## 6 0.12536508 0.11090939 0.12821975
We start interpreting the models by studying model performance using one or more relevant performance measures (often the average loss or a derived measure) based on a clean model validation strategy. Performance information is almost always of interest. Studying it is also helpful in finding gross errors in the modeling process:
Performance is typically evaluated on a hold-out dataset not used for model tuning or model fitting. Comparison with the performance on the training data gives a good indication of how much the model tends to overfit (or more precisely: of the model optimism).
We now evaluate all three models on the hold-out dataset by calculating the exposure-weighted average Poisson deviance \bar S(\hat m, D_\text{test}), see Chapter “Introduction”. In addition, we consider the relative deviance improvement (or “pseudo R-squared”) 1 - \frac{\bar S(\hat m, D_\text{test})}{\bar S(\hat m_\text{trivial}, D_\text{test})} compared to the intercept-only model with constant predictions \hat m_\text{trivial}(\boldsymbol x) = \frac{\sum_{D_\text{train}} w_i y_i}{\sum_{D_\text{train}} w_i}. Note that \hat m_\text{trivial} is ideally calculated on the training data, but sometimes it is calculated on the test data for simplicity.
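The helper poisson_scorer() is defined in “functions.R”, which is not shown here. To make its logic transparent, here is a hypothetical minimal sketch of such a scorer based on “MetricsWeighted” (names and details are our assumptions; the actual helper may differ):
# Hypothetical sketch of a scorer like poisson_scorer() from "functions.R":
# exposure-weighted average Poisson deviance plus the relative deviance
# improvement over a constant reference prediction (the real helper may differ)
score_poisson <- function(pred, y, w, reference_mean = weighted.mean(y, w)) {
  dev <- MetricsWeighted::deviance_poisson(y, pred, w = w)
  dev_trivial <- MetricsWeighted::deviance_poisson(y, rep(reference_mean, length(y)), w = w)
  c(mean_deviance = dev, rel_improvement = 1 - dev / dev_trivial)
}
# Example: score the GLM on the test data
score_poisson(
  pred = predict(fit_glm, test, type = "response"),
  y = test$Freq,
  w = test$Exposure,
  reference_mean = weighted.mean(train$Freq, train$Exposure)
)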
poisson_scorer(
models,
X = test,
y = test$Freq,
w = test$Exposure,
reference_mean = weighted.mean(train$Freq, train$Exposure)
)
Comments:
Next, let us compare the hold-out performance scores with their in-sample counterparts:
poisson_scorer(models, X = train, y = train$Freq, w = train$Exposure)
Comment: There is a certain optimism - the training performance is clearly higher than the test performance, especially for the LightGBM model.
In our example on French claims frequencies, we have split the dataset into train/test by grouped splitting on the column “group_id”. The same logic was also applied in cross-validation to search for good parameter combinations of the boosted trees model.
Grouped splitting works as follows: Unlike standard random splitting, where each row is randomly assigned to a data partition, all rows in a group are randomly assigned to the same partition. Thus, a group appears in exactly one partition. This ensures that no within-group information is leaked across partitions. Failure to do so may lead to biased performance evaluations and consequently to biased model selection. Typically, very flexible modeling techniques benefit from such bias, which unfortunately means that overfitting is rewarded.
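As an illustration, a minimal sketch of such a grouped 80%/20% split on the column “group_id” of our prepared data prep could look as follows (the object names are made up):
# Minimal sketch of a grouped 80%/20% split on "group_id": we sample *groups*,
# not rows, so that all rows of a group end up in the same partition
all_groups <- unique(prep$group_id)
with_seed(22,
  train_groups <- sample(all_groups, round(0.8 * length(all_groups)))
)
train_g <- prep[prep$group_id %in% train_groups, ]
test_g <- prep[!prep$group_id %in% train_groups, ]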
Grouped data occur quite frequently. Here are some examples:
Ignoring data structure during data partitioning is a common flaw in data science. Avoiding it is important.
How would the model performances of our GLM and the boosted trees model change if grouping were ignored?
First, we show the data rows belonging to the largest “group_id”.
# The largest group_id
prep |>
filter(group_size == max(prep$group_size)) |>
select_at(c("IDpol", "group_id", "Freq", "Exposure", xvars))
# Random split
n <- nrow(prep)
with_seed(1,
ix <- sample(1:n, 0.8 * n)
)
train_u <- prep[ix, ]
test_u <- prep[-ix, ]
# GLM
fit_glm_u <- glm(
reformulate(xvars, "Freq"),
data = train_u,
family = quasipoisson(),
weights = Exposure
)
# Boosted trees
dtrain_u <- lgb.Dataset(
data.matrix(train_u[xvars]),
label = train_u$Freq,
weight = train_u$Exposure,
params = list(feature_pre_filter = FALSE)
)
# Parameter set was determined by CV on *randomly split* folds,
# see "french_motor/tune_nongrouped_lgb.R"
params_u <- list(
objective = "poisson",
learning_rate = 0.05,
num_leaves = 63,
lambda_l2 = 2.5,
lambda_l1 = 0,
colsample_bynode = 0.8,
bagging_fraction = 1,
min_data_in_leaf = 20,
min_sum_hessian_in_leaf = 0,
poisson_max_delta_step = 0.1
)
with_seed(837,
fit_lgb_u <- lgb.train(params = params_u, data = dtrain_u, nrounds = 362)
)
## [LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.008490 seconds.
## You can set `force_row_wise=true` to remove the overhead.
## And if memory is not enough, you can set `force_col_wise=true`.
## [LightGBM] [Info] Total Bins 370
## [LightGBM] [Info] Number of data points in the train set: 542410, number of used features: 7
## [LightGBM] [Info] Start training from score -2.323655
# Relative deviance gains
poisson_scorer(
models,
X = test_u,
y = test_u$Freq,
w = test_u$Exposure,
reference_mean = weighted.mean(train_u$Freq, train_u$Exposure)
)
Comments:
To further dig into the topic of grouped data, let us create an artificial regression situation with ten features. Feature values and the response are randomly generated with no correlation between columns. We then use an 80%/20% split to examine how well a linear regression and a random forest perform in-sample and out-of-sample with an increasing proportion of duplicated rows. In an ideal validation scheme, the out-of-sample R-squared should be close to 0. What do we get with random splitting? What if we perform grouped splitting on groups of identical rows?
library(tidyverse)
library(ranger)
# 1) Generate regression dataset without any true associations
n <- 1000
p <- 10
set.seed(1)
df <- rnorm(n * p) |>
matrix(ncol = p) |>
as.data.frame() |>
mutate(y = rnorm(n), group = 1:n) |>
select(group, y, everything())
df[1:5, 1:5]
# 2) Initialize results array
dims <- list(
prop = seq(0, 1, by = 0.1),
splittype = c("random", "grouped"),
model = c("linear", "random forest"),
datapart = c("train", "valid")
)
res <- array(dim = lapply(dims, length), dimnames = dims)
# 3) Define helper functions
# - Same prediction function for lm() and ranger()
pred <- function(fit, df) {
if (inherits(fit, "lm")) {
return(predict(fit, df))
}
predict(fit, df)$predictions
}
# - R-squared
r2 <- function(y, pred, train_y = y) {
mse <- function(x, z) mean((x - z)^2)
1 - mse(y, pred) / mse(y, mean(train_y))
}
# - Train and (clean) validation R-squared
both_r2 <- function(fit, train, valid) {
c(r2(train$y, pred(fit, train)),
r2(valid$y, pred(fit, valid), train_y = train$y))
}
# 4) Simulation
for (prop in dims$prop) {
# Add some duplicates to the original data
df_ <- rbind(df, df[sample(1:n, n * prop), ])
n_ <- nrow(df_)
for (splittype in dims$splittype) {
if (splittype == "random") {
ix <- (1:n_) %in% sample(1:n_, n_ * 0.8)
} else {
ix <- df_$group %in% sample(1:n, n * 0.8)
}
df_train <- df_[ix, ]
df_valid <- df_[!ix, ]
for (model in dims$model) {
model_function <- switch(model, linear = lm, ranger)
fit <- model_function(y ~ . - group, data = df_train)
res[as.character(prop), splittype, model, ] <- both_r2(fit, df_train, df_valid)
}
}
}
# 5) Visualization
as.data.frame.table(res) |>
ggplot(aes(x = as.numeric(as.character(prop)), y = Freq, color = splittype)) +
geom_line(linewidth = 1.2) +
facet_grid(model ~ datapart) +
theme(legend.position = c(0.2, 0.8), legend.title = element_blank()) +
scale_color_viridis_d(begin = 0.2, end = 0.8, option = "inferno") +
labs(y = "R-squared", x = "Proportion of duplicates added", title = "Simulation")
Comments:
In our main modeling example, we use the grouping structure only to create clean data splits. However, in other situations, one may want to explicitly use within-group information in the model: An example is an insurance product associated with a high claim frequency, such as insurance for large fleets of vehicles. Here, one could apply a policy-specific residual correction according to credibility theory. As with time series data, such panel data require special attention not only in the choice of the validation scheme, but also in feature construction, selection of the modeling technique, and calculation of predictions.
Although we cannot cover these aspects here, we cannot emphasize enough the importance of paying attention to data structure.
After having studied model performance, we now go a step further and consider variable importance. It is often very valuable to see which features have the strongest impact on the predictions and, on the other hand, which features are hardly of any importance. Besides the pure information gain, variable importance also helps to challenge the correctness of the model. Are the results as expected or not? If a seemingly unimportant variable is one of the top predictors, then maybe some sort of information leakage from the response to that feature has happened (see also the discussion in the section “Performance”). On the other hand, if a key variable does not show any relevance in the model, this could mean that some mistake happened in data preparation or that the modeler does not yet sufficiently understand the modeling situation or the data.
There are different model-specific ways to assess variable importance; here are some examples:
The most frequently applied model-agnostic technique is permutation importance, introduced in the famous random forest paper (Breiman 2001) and later generalized by Fisher, Rudin, and Dominici (2018) to other models. It measures by how much a relevant performance measure \hat S (e.g., the average deviance or the RMSE) of a fitted model \hat m, evaluated on a dataset D, worsens after randomly shuffling the values of the j-th feature. The larger the performance change, the more important the feature.
Formally, permutation importance \text{PVI}(j, D) of the j-th feature X^{(j)} and data D can be defined as follows: \text{PVI}(j, D) = \hat S(\hat m, D^{(j)}) - \hat S(\hat m, D), where D^{(j)} is a version of D with randomly permuted values in the column representing the j-th feature X^{(j)}. For simplicity, we assume that the performance measure \hat S is defined such that smaller values mean better performance.
In pseudocode, copied from Mayer and Lorentzen (2020), the algorithm to calculate the permutation importance for all features is as follows:
If there are p features and n is the sample size of D, then n(p+1) predictions have to be calculated. Compared to other methods from XAI, this is a relatively cheap operation. Note that the process can be repeated multiple times to increase (and assess) robustness of the results.
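To make the pseudocode concrete, here is a minimal sketch of permutation importance for a single model and a single repetition, measured as the increase in exposure-weighted Poisson deviance (the function name and the prediction function are our assumptions; perm_importance() from “hstats” used below is the efficient, general implementation):
# Minimal sketch of permutation importance (one repetition) for a single model,
# measured as increase in exposure-weighted Poisson deviance after shuffling
perm_importance_sketch <- function(pred_fun, X, y, w, v) {
  dev_orig <- MetricsWeighted::deviance_poisson(y, pred_fun(X), w = w)
  sapply(v, function(var) {
    X_perm <- X
    X_perm[[var]] <- sample(X_perm[[var]])  # randomly permute column "var"
    MetricsWeighted::deviance_poisson(y, pred_fun(X_perm), w = w) - dev_orig
  })
}
# Example with the GLM on the hold-out data
with_seed(1,
  perm_importance_sketch(
    pred_fun = function(X) predict(fit_glm, X, type = "response"),
    X = test, y = test$Freq, w = test$Exposure, v = xvars
  )
)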
Let us now calculate permutation importance for our three models on the hold-out dataset and with respect to the exposure-weighted average Poisson deviance:
with_seed(40,
imp <- perm_importance(
models,
v = xvars,
X = test,
y = "Freq",
w = "Exposure",
loss = "poisson",
n_max = 1e5,
m_rep = 2,
verbose = FALSE
)
)
plot(imp) +
ggtitle("Permutation importance regarding Poisson deviance")
Comments: The results make intuitive sense, even if they differ across models.
To compare the results of permutation importance with those from a model-specific variable importance measure, we also calculate split gain importance values of our boosted trees model. Note that split gain importance is calculated on the training data.
# Split gain importance regarding Poisson deviance
fit_lgb |>
lgb.importance() |>
lgb.plot.importance()
Comments: The results are different from permutation importance. Nevertheless, the two methods identify the same feature as the most important one.
One of the advantages of an additive GLM over a complex ML model is its simple structure: the model equation directly tells how the predictions (at the scale of the link function) will react to changes in a covariate X^{(j)} of interest, holding all other covariates fixed (“Ceteris Paribus”). Due to additivity, the effect of X^{(j)} is the same for all observations.
In this section, we will explore methods that describe the effects of covariates for general ML models. Due to possibly complex interaction effects, such descriptions are usually only approximations.
In a linear regression \mathbb E (Y \mid \boldsymbol{x}) \approx m(\boldsymbol{x}) = \beta_o +\beta_1 x^{(1)} + \dots + \beta_p x^{(p)}, the effect of a covariate X^{(j)} is described by its estimated coefficient \hat \beta_j. Similarly, in an additive model \mathbb E (Y \mid \boldsymbol{x}) \approx m(\boldsymbol{x}) = \beta_o +f_1(x^{(1)}) + \dots + f_p(x^{(p)}), the estimated effect \hat{f_j} of X^{(j)} can be visualized by plotting the component \hat{f_j}(v) against v. This curve will be parallel to the corresponding individual conditional expectation (ICE) curve (Goldstein et al. 2015) of any observation.
The ICE function for covariate X^{(j)} of model m and an observation with covariate vector \boldsymbol x \in {\mathbb R}^p is given by \text{ICE}_j: v \in {\mathbb R} \mapsto m(v, \boldsymbol x_{\setminus j}), where \boldsymbol x_{\setminus j} denotes all but the j-th component of \boldsymbol x, which is being replaced by the value v.
The corresponding ICE curve represents the graph (v, \text{ICE}_j(v)) for a grid of values v \in \mathbb R. In pseudocode, copied from Mayer and Lorentzen (2020), the algorithm to calculate \text{ICE}_j(v) works as follows:
An ICE plot finally shows the ICE curves of multiple observations.
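To make the definition concrete, here is a minimal sketch that evaluates \text{ICE}_j(v) by hand for a single observation and the GLM on a manually chosen grid; the ice() function of “hstats” used below automates this for many observations:
# Minimal sketch: ICE curve of "DrivAge" for one observation and the GLM,
# evaluated on a hand-made grid (log scale); ice() from "hstats" automates this
obs <- train[1, ]
grid <- seq(18, 85, by = 1)
ice_manual <- sapply(grid, function(v) {
  obs_mod <- obs
  obs_mod$DrivAge <- v       # replace the feature value by the grid value v
  predict(fit_glm, obs_mod)  # prediction on the link (log) scale
})
plot(grid, ice_manual, type = "l", xlab = "DrivAge", ylab = "log frequency")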
How do ICE and c-ICE curves for the driver age of 200 randomly selected observations look on both the frequency and log-frequency scale?
# Same observations for all (sub-)plots
with_seed(3,
X_ice <- sample_n(train, 200)
)
# Original scale
ice_freq <- ice(models, v = "DrivAge", X = X_ice)
plot(ice_freq) +
labs(y = "Frequency", title = "ICE plot of 'DrivAge'")
plot(ice_freq, center = TRUE) +
labs(y = "Frequency (centered)", title = "c-ICE plot of 'DrivAge'")
# Link scale
ice_log <- ice(models, v = "DrivAge", X = X_ice, log_scale = TRUE)
plot(ice_log) +
labs(y = "log frequency", title = "ICE plot of 'DrivAge' (log scale)")
plot(ice_log, center = TRUE) +
labs(y = "log frequency (centered)", title = "c-ICE plot of 'DrivAge' (log scale)")
Comments:
The pointwise average of many ICE curves is called a partial dependence plot (PDP), originally introduced by Friedman (2001) in his seminal work on gradient boosting. It describes the effect of the variable X^{(j)}, averaged over all interaction effects and holding all other variables fixed. The empirical partial dependence function for model \hat m and feature X^{(j)} is given by \text{PD}_j(v) = \frac{1}{n} \sum_{i = 1}^n \hat m(v, \boldsymbol x_{i,\setminus j}), where \boldsymbol x_{i,\setminus j} denotes the feature vector of observation i without its j-th component, taken from some reference dataset, often the training or the test data. In modeling situations involving case weights, the arithmetic average is replaced by a corresponding weighted average. \text{PD}_j(v) serves as an estimate of \mathbb E_{\boldsymbol X_{\setminus j}} m(v, \boldsymbol X_{\setminus j}). The expectation runs over the joint marginal distribution of \boldsymbol X_{\setminus j}, i.e., of all features except the j-th component.
The corresponding PDP represents the graph (v, \text{PD}_j(v)) for a grid of values v \in \mathbb R.
The algorithm (in pseudocode) for evaluating the partial dependence function for a grid of values for v is copied from Mayer and Lorentzen (2020):
Sometimes two-dimensional partial dependence plots of the features X^{(j)} and X^{(k)} are to be studied. Then, the corresponding empirical partial dependence function is given by \text{PD}_{jk}(v_j, v_k) = \frac{1}{n} \sum_{i = 1}^n \hat m(v_j, v_k, \boldsymbol x_{i,\setminus \{j, k\}}), where \boldsymbol x_{i,\setminus \{j, k\}} is the feature vector of observation i without the j-th and k-th components. We will need this construction later in the section on interaction strength.
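For illustration, here is a minimal (deliberately naive) sketch that evaluates the one-dimensional \text{PD}_j(v) of “DrivAge” for the GLM at single grid values, using an exposure-weighted average over a subsample; partial_dep() from “hstats” used below does this efficiently over a whole grid:
# Naive sketch of the exposure-weighted partial dependence of "DrivAge" for the
# GLM at single grid values (log scale), using a subsample as reference data
with_seed(1,
  X_ref <- sample_n(train, 1000)
)
pd_at <- function(v) {
  X_mod <- X_ref
  X_mod$DrivAge <- v                                      # set the feature to v for all rows
  weighted.mean(predict(fit_glm, X_mod), X_ref$Exposure)  # average over the other features
}
pd_at(30)
pd_at(60)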
Let us now study the (exposure-weighted) partial dependence plots for all features in our claims frequency models on the link scale.
# Each partial dependence plot corresponds to the average of 1000 ICE curves
p <- list()
for (v in xvars) {
p[[v]] <- plot(partial_dep(models, v = v, X = train, w = "Exposure", log_scale = TRUE)) +
ggtitle(v)
}
wrap_plots(p, ncol = 2, guides = "collect") &
ylim(-3, -1.2) &
ylab("log scale")
Comments:
An alternative to the partial dependence plot, the accumulated local effects (ALE) plot (Apley and Zhu 2016), attempts to overcome the first two (related) negative points. However, ALE plots are more difficult to calculate and to interpret. We will not consider them here.
In addition to ICE plots and PDPs, classic diagnostic regression plots (adapted for larger data) can be considered.
Let us now consider these three types of plots for the covariate “DrivAge” evaluated on the hold-out data:
# Note: this code would greatly simplify without case weights
temp <- test |>
transform(DrivAge_c = cut(DrivAge, breaks = seq(15, 85, by = 5)))
# Average response
temp |>
group_by(DrivAge_c) |>
summarize(Average_response = weighted.mean(Freq, Exposure)) |>
ggplot(aes(DrivAge_c, Average_response)) +
geom_point(color = "darkblue") +
ggtitle("Response vs. covariate plot of 'DrivAge'")
# Average predictions
preds <- predict(models, test) |>
data.frame() |>
cbind(temp[c("Freq", "Exposure", "DrivAge_c")]) |>
pivot_longer(cols = c("GLM", "GAM", "LGB"), values_to = "Pred", names_to = "Model") |>
transform(Resid = Freq - Pred)
preds |>
group_by(DrivAge_c, Model) |>
summarize(Average_predicted = weighted.mean(Pred, Exposure), .groups = "drop") |>
ggplot(aes(DrivAge_c, Average_predicted, group = Model, color = Model)) +
geom_point() +
geom_line() +
ggtitle("Predicted vs. covariate plot of 'DrivAge'")
preds |>
group_by(DrivAge_c, Model) |>
summarize(Average_residual = weighted.mean(Resid, Exposure), .groups = "drop") |>
ggplot(aes(DrivAge_c, Average_residual, group = Model, color = Model)) +
geom_point() +
geom_line() +
geom_hline(yintercept = 0) +
ggtitle("Residuals vs. covariate plot of 'DrivAge'")
Comments:
In linear models, one has to identify and add meaningful interaction effects by hand. This is in contrast to modern ML methods, which typically have sufficient flexibility to add complex interaction effects fully automatically. We have seen that ICE plots give a visual impression of the interaction strength associated with a single covariate X^{(j)}. Another interesting question is: between which variable pairs do strong interactions occur?
This can be done by looking at Friedman’s H statistic (Friedman and Popescu 2008) for pairwise interactions between covariates X^{(j)} and X^{(k)}. It is defined as follows: H^2_{jk}=\frac{\sum_{i=1}^n \Big[\text{PD}_{jk}(x_i^{(j)}, x_i^{(k)}) - \text{PD}_{j}(x_i^{(j)}) - \text{PD}_{k}(x_i^{(k)})\Big]^2}{\sum_{i=1}^n \Big[\text{PD}_{jk}(x_i^{(j)}, x_i^{(k)})\Big]^2}, where the sums run over a reference dataset, e.g., a subset of the training or test data. Here, the partial dependence functions are all zero-mean centered. The expression in brackets in the numerator measures the difference between the two-dimensional effect surface and the two univariate effects. Thus, H^2 can be interpreted as the proportion of variability in the joint effect of X^{(j)} and X^{(k)} unexplained by the main effects. The smaller the value, the weaker the pairwise interaction. A value close to 1 indicates that almost all of the effect of X^{(j)} and X^{(k)} comes from the pairwise interaction. According to this interpretation, it is often easier to work with H^2 than with H.
Since H^2 is a relative measure, it can be high even if the overall effects of the considered variables are weak. Thus, we propose to consider also the following measure of absolute interaction strength: \tilde H_{jk} = \sqrt{\frac{1}{n}\sum_{i=1}^n\Big[\text{PD}_{jk}(x_i^{(j)}, x_i^{(k)}) - \text{PD}_{j}(x_i^{(j)}) - \text{PD}_{k}(x_i^{(k)})\Big]^2}.
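For illustration, the following brute-force sketch computes H^2_{jk} for a single feature pair directly from the formulas above (with optional case weights; the default of 1 reproduces the unweighted formula). It requires O(n^2) predictions, so the reference data must be kept small; the “hstats” package used below implements this much more efficiently:
# Brute-force sketch of Friedman's H^2 for a single feature pair (j, k),
# following the formula above. pf: prediction function, X_ref: small reference
# dataset, w: optional case weights (all 1 reproduces the unweighted formula)
h2_jk_sketch <- function(pf, X_ref, j, k, w = rep(1, nrow(X_ref))) {
  n <- nrow(X_ref)
  pd <- function(S) {  # PD of feature subset S, evaluated at the observed values of S
    sapply(seq_len(n), function(i) {
      X_mod <- X_ref
      X_mod[, S] <- X_ref[rep(i, n), S]  # fix the features in S at observation i
      weighted.mean(pf(X_mod), w)        # average over the remaining features
    })
  }
  center <- function(z) z - weighted.mean(z, w)
  f_jk <- center(pd(c(j, k)))
  f_j <- center(pd(j))
  f_k <- center(pd(k))
  sum(w * (f_jk - f_j - f_k)^2) / sum(w * f_jk^2)
}
# Example for the LightGBM model on a small subsample (O(n^2) predictions!)
with_seed(1,
  X_small <- data.matrix(sample_n(train[xvars], 300))
)
h2_jk_sketch(
  pf = function(X) predict(fit_lgb, X, type = "raw"),
  X_ref = X_small, j = "VehAge", k = "VehBrand"
)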
We now calculate both Friedman’s H^2 and \tilde H for all variable pairs and the LightGBM model. Furthermore, we visualize a strong interaction by a stratified PDP.
with_seed(4,
H <- hstats(
fit_lgb,
X = data.matrix(train[xvars]),
w = train$Exposure,
pairwise_m = length(xvars),
type = "raw",
verbose = FALSE
)
)
plot(h2_pairwise(H)) +
labs(title = "Relative interaction strength", x = "Friedman's H-squared")
plot(h2_pairwise(H, normalize = FALSE, squared = FALSE)) +
labs(title = "Absolute interaction strength", x = "H tilde")
# Visualization of strong interaction by studying a stratified PDP
partial_dep(
models, v = "VehAge", X = train, w = "Exposure", BY = "VehBrand", log_scale = TRUE
) |>
plot() +
labs(y = "log frequency", title = "PDP of 'VehAge' stratified by 'VehBrand'")
Comments:
Another model-agnostic technique for examining the internals of a complex model \hat m goes back to Craven and Shavlik (1995), see Molnar (2019) for a modern take. It works as follows: An intrinsically interpretable model \hat m_I, usually a small decision tree, is fitted to the predictions of \hat m, using the same covariates as \hat m and minimizing the mean squared error. \hat m_I is called a (global) surrogate model for \hat m. The higher the R-squared of \hat m_I, the better \hat m can be explained by studying the surrogate \hat m_I.
For our boosted trees model, a surrogate tree model looks like this:
preds <- predict(fit_lgb, data.matrix(train[xvars]))
fit_tree <- rpart(
reformulate(xvars, "preds"),
data = train,
xval = 0,
weights = Exposure,
maxdepth = 4
)
rpart.plot(fit_tree, roundint = FALSE, type = 5)
r_squared(preds, predict(fit_tree, train), w = train$Exposure)
## [1] 0.698304
Comments:
Sometimes there are good reasons (e.g., intrinsic explainability) for using a (generalized) linear model as the final model. However, building a strong linear model is quite complex: the modeler must consider how to represent the features and where to add nonlinear terms and interactions. A modern way to navigate these steps is to build a black box model and inspect it with XAI to learn where and how to improve the linear model. Following Mayer and Lorentzen (2020), the basic logic is as follows:
After going through these steps, the modified linear model can be re-examined. Hopefully, the results will now be closer to the optimized black box model while retaining intrinsic explainability.
Let us now try to optimize our GLM based on the insights gained so far:
fit_glm_imp <- glm(
Freq ~ VehPower + ns(VehAge, 5) * VehBrand + VehGas + ns(DrivAge, 5) +
logDensity + PolicyRegion,
data = train,
family = quasipoisson(),
weights = Exposure
)
# Performance
poisson_scorer(
fit_glm_imp,
X = test,
y = test$Freq,
w = test$Exposure,
reference_mean = weighted.mean(train$Freq, train$Exposure),
type = "response"
)
# Effects: some partial dependence plots on link scale
p1 <- partial_dep(fit_glm_imp, v = "DrivAge", X = train) |>
plot() +
labs(y = "log frequency", title = "PDP of 'DrivAge'")
p2 <- partial_dep(
fit_glm_imp, v = "VehAge", X = train, w = "Exposure", BY = "VehBrand"
) |>
plot() +
labs(y = "log frequency", title = "PDP of 'VehAge' stratified by 'VehBrand'")
p1 + p2
Comments:
SHAP (SHapley Additive exPlanations) is the key player in explaining an ML model locally, i.e., around a single observation. It was introduced in Lundberg and Lee (2017) and has since received much attention, not only when working with tabular data but also with unstructured data like text or images. The basic idea of SHAP is to decompose a model prediction into additive contributions of the features in a fair way. Repeating this process for many observations provides a powerful method for explaining the model as a whole. The roots of the method go back to a classic result in cooperative game theory (Shapley 1953). Thus, and in contrast to many other local explainability methods, SHAP rests on a solid theoretical foundation.
Another important local technique is LIME (Local Interpretable Model-Agnostic Explanations, Ribeiro, Singh, and Guestrin (2016)), primarily used for models with unstructured input like text or image data. It works as follows: An artificial dataset is created with feature values slightly different from those of the observation to be explained. Its predictions are then regressed on a simplified feature space by a white-box surrogate model, such as linear regression, assigning higher weights to more similar observations. This surrogate model provides a local interpretation of the original model in the neighborhood of the selected observation.
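Since we will not use LIME later, here is only a purely conceptual sketch of the idea for numeric tabular features, skipping the mapping to a simplified (interpretable) feature representation. All names are made up; the R package “lime” offers a proper implementation:
# Conceptual sketch of LIME for purely numeric tabular features:
# 1) perturb the observation, 2) predict with the black box,
# 3) weight perturbations by proximity, 4) fit a local weighted linear surrogate
lime_sketch <- function(pred_fun, X_ref, x0, n_sim = 1000, kernel_width = 0.75) {
  p <- ncol(X_ref)
  sds <- apply(X_ref, 2, sd)
  # 1) Gaussian perturbations around the observation x0
  Z <- matrix(rnorm(n_sim * p), ncol = p) %*% diag(sds) +
    matrix(x0, nrow = n_sim, ncol = p, byrow = TRUE)
  colnames(Z) <- colnames(X_ref)
  # 2) Black-box predictions for the perturbed data
  y_hat <- pred_fun(Z)
  # 3) Proximity weights: perturbations close to x0 get higher weight
  d <- sqrt(rowSums(scale(Z, center = x0, scale = sds)^2))
  w <- exp(-(d / (kernel_width * sqrt(p)))^2)
  # 4) Local white-box surrogate: its coefficients explain the prediction at x0
  lm(y_hat ~ ., data = as.data.frame(Z), weights = w)
}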
For more details on SHAP, see Mayer, Meier, and Wüthrich (2023).
So what is SHAP? To answer this question, we first introduce Shapley values from cooperative game theory.
Consider the following situation:
How can the total payoff be distributed fairly among the individual players? The answer was given in Shapley (1953): The amount that player j should receive is called the Shapley value and equals \phi_j(v) = \phi_j = \sum_{\mathcal L \subseteq \mathcal M \setminus\{j\}} \underbrace{\frac{|\mathcal L|!(p - |\mathcal L| - 1)!}{p!}}_\text{Shapley weight} \big(\underbrace{v(\mathcal L \cup \{j\}) - v(\mathcal L)}_{\text{Contribution of player } j}\big). Thus, a player’s Shapley value equals the weighted average of his or her contributions to all possible coalitions \mathcal L of the other players.
This can be read as the (unweighted) average of the contribution of player j over all p! permutations of the players. To see this, consider a permutation of the players. The |\mathcal L| players appearing before player j form the coalition \mathcal L, while the remaining p - |\mathcal L| - 1 players after position j form a set \mathcal Q that does not contribute. We can permute the players within \mathcal L and within \mathcal Q without affecting the contribution of player j, which leads to the numerator of the Shapley weight.
To characterize “fairness”, Shapley postulated axioms that may be described by these four desirable properties:
Efficiency: v(\mathcal M) = \phi_o + \sum_{j = 1}^p \phi_j, where \phi_o = v(\emptyset) denotes the non-distributed payoff (often set to 0 in cooperative games).
Symmetry: If v(\mathcal L \cup \{i\}) = v(\mathcal L \cup \{j\}) for every \mathcal L \subseteq \mathcal M \setminus\{i, j\}, then \phi_i = \phi_j.
Dummy player: If v(\mathcal L \cup \{j\}) = v(\mathcal L) for all coalitions \mathcal L \subseteq \mathcal M \setminus\{j\}, then \phi_j = 0.
Linearity: Consider two cooperative games with gain functions v and w. Then, \phi_j(v + w) = \phi_j(v) + \phi_j(w) and \phi_j(\alpha v) = \alpha \phi_j(v) for all 1 \le j \le p and \alpha \in \mathbb R.
Shapley values are the only way to distribute total winnings among players in a way that satisfies these four properties.
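As a small numerical illustration of the formula above, the following sketch computes exact Shapley values of a made-up three-player game by enumerating all coalitions. By symmetry, all players should receive the same value, and by efficiency, the values should add up to v(\mathcal M):
# Exact Shapley values of a toy three-player game by enumerating all coalitions;
# the value function v (payoff depends only on the coalition size) is made up
shapley_exact <- function(v, players) {
  p <- length(players)
  sapply(players, function(j) {
    others <- setdiff(players, j)
    # All coalitions L of the other players, including the empty set
    masks <- expand.grid(rep(list(c(FALSE, TRUE)), length(others)))
    total <- 0
    for (i in seq_len(nrow(masks))) {
      L <- others[unlist(masks[i, ])]
      weight <- factorial(length(L)) * factorial(p - length(L) - 1) / factorial(p)
      total <- total + weight * (v(c(L, j)) - v(L))  # weighted contribution of j
    }
    total
  })
}
v <- function(L) c(0, 10, 30, 60)[length(L) + 1]
shapley_exact(v, players = c("A", "B", "C"))
# Each player receives 20, and the values add up to v({A, B, C}) = 60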
In statistics, Shapley values were proposed by Lipovetsky and Conklin (2001) as a strategy to decompose the R-squared of a linear regression into additive contributions of the covariates. Their approach was based on retraining the model for every feature subset. Other early work on Shapley values in statistics and ML includes Štrumbelj and Kononenko (2010), Štrumbelj and Kononenko (2014), and Lundberg and Lee (2017). These works propose to use Shapley values to decompose predictions into feature contributions. Thus, the outcome of the cooperative game is the prediction, and the features are the players.
Formally, the prediction m(\boldsymbol x) of a given feature vector \boldsymbol x is to be decomposed into contributions \phi_j \in \mathbb R, 1 \le j \le p, such that m(\boldsymbol x) = \phi_o + \sum_{j = 1}^p \phi_j, where typically \phi_o = \mathbb E(m(\boldsymbol X)). Only if the \phi_j are Shapley values will the decomposition be fair. In order to apply Shapley’s formula, a contribution function v must be selected. A natural candidate is v(\mathcal L) = m(\mathbf x_\mathcal L), where \mathbf x_\mathcal L represents the components in the feature subset \mathcal L \subseteq \mathcal M selected from the full feature set \mathcal M. Since models cannot simply “turn off” some features during prediction, the estimation of m(\mathbf x_\mathcal L) is highly non-trivial, and there is some controversy (see e.g., Sundararajan and Najmi (2019) or Janzing, Minorics, and Bloebaum (2020)) whether to approximate it using marginal expectations \mathbb E(m(\mathbf x_{\mathcal L}, \mathbf X_{\mathcal M \setminus \mathcal L})) or conditional expectations \mathbb E(m(\mathbf x_{\mathcal L}, \mathbf X_{\mathcal M \setminus \mathcal L}) \mid \mathbf X_{\mathcal L} = \mathbf x_\mathcal L). The core question is: should the statistical dependence between features in \mathcal L and features in \mathcal M \setminus \mathcal L be broken (marginal) or not (conditional)? The rules of causal inference require the former, while statistically, the latter would be more natural.
There exist several algorithms to calculate SHAP values for an observation \mathbf x. Three particularly important ones are the following:
We won’t go into more detail about how the above methods work but rather note:
SHAP has a solid theoretical foundation. In practice, some of it is lost because statistics is not mathematics.
Remark: In the following two situations, exact Shapley values can be calculated:
Before we dive into an example, here are the main reasons for SHAP’s success:
Throughout this section, we will use the claims frequency model fitted with boosted trees. SHAP values are calculated with the TreeSHAP algorithm included in LightGBM. It provides SHAP values on the link scale.
What is the SHAP decomposition for the first observation?
# Calculate SHAP decomposition for first observation
train_1 <- train[1, xvars]
shp <- shapviz(fit_lgb, X_pred = data.matrix(train_1[xvars]), X = train_1)
summary(shp)
## 'shapviz' object representing
## - SHAP matrix of dimension 1 x 7
## - feature data.frame of dimension 1 x 7
## - baseline value of -2.371182
##
## SHAP values of first 1 observations:
## VehAge DrivAge logDensity VehPower PolicyRegion VehBrand
## [1,] 0.8615413 0.6498939 -0.03593755 -0.1330453 -0.03682505 0.1681178
## VehGas
## [1,] 0.3326465
##
## Corresponding feature values:
## VehAge DrivAge logDensity VehPower PolicyRegion VehBrand VehGas
## 1 0 18 3.871201 4 Other B12 Regular
# Two typical visualizations
sv_waterfall(shp) +
ggtitle("Waterfall plot of one observation")
# For slides
# sv_waterfall(shp, annotation_size = 5, size = 18) +
# theme(axis.text = element_text(size = 15), axis.title = element_text(size = 18))
sv_force(shp) +
ggtitle("Force plot of one observation")
# Does the sum of SHAP values equal the prediction on the log (link) scale?
get_baseline(shp) + sum(get_shap_values(shp))
## [1] -0.5647904
predict(fit_lgb, data.matrix(train_1[xvars]), type = "raw")
## [1] -0.5647904
Comments:
Interpreting the model locally around a particular observation, while interesting, provides little information about the model as a whole, i.e., at the global level. However, we can compute SHAP decompositions for a sufficiently large set of n observations and then use descriptive statistics to derive global properties of the model. Thanks to the very fast TreeSHAP algorithm, this works especially well for tree-based models.
Let X be the (n \times p) feature matrix with elements x_{ij} and \Phi the corresponding (n \times p) matrix of SHAP values with elements \phi_{ij}, 1 \le i \le n, 1 \le j \le p. Furthermore, let \phi_o = \frac{1}{n} \sum_{i = 1}^n \hat m(\boldsymbol x_i) be the average prediction over the n observations with feature vectors \boldsymbol x_i. By construction, the SHAP values satisfy \hat m(\boldsymbol x_i) = \phi_o + \sum_{j = 1}^p \phi_{ij} for all 1 \le i \le n. Based on X and \Phi, we can investigate feature importance and feature effects using descriptive statistics.
The absolute SHAP value |\phi_{ij}| quantifies the importance of feature j in predicting observation i. Taking the average over the n observations results in a variable importance measure called SHAP feature importance, defined as I_j=\frac{1}{n}\sum_{i=1}^n|\phi_{ij}|. The interpretation is simple: I_j indicates the average absolute impact of feature j on the prediction.
The values of I_j can be visualized as a bar plot. An alternative is the so-called SHAP summary plot. It shows for each feature j a horizontal scatter plot of the SHAP values \phi_{ij}, where the vertical scatter is proportional to their density. Such a beeswarm plot is usually enhanced by highlighting the (min-max scaled) feature values on the color axis. In this way, the direction of the effect can be seen, i.e., whether high feature values would increase or decrease predictions.
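Once the shapviz object shp of the next code chunk (2000 decomposed observations) is available, I_j can be reproduced directly from the SHAP matrix; the sorted values should match the bars of sv_importance() shown below:
# Manual check of SHAP feature importance: column means of |phi_ij|, sorted;
# these values should match the bars of sv_importance() below
sort(colMeans(abs(get_shap_values(shp))), decreasing = TRUE)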
Let us now decompose the predictions of n = 2000 randomly picked observations from the training data and examine their SHAP importance.
with_seed(536,
X_explain <- sample_n(train[xvars], 2000)
)
system.time(
shp <- shapviz(fit_lgb, X_pred = data.matrix(X_explain), X = X_explain)
)
## user system elapsed
## 4.09 0.33 0.52
summary(shp)
## 'shapviz' object representing
## - SHAP matrix of dimension 2000 x 7
## - feature data.frame of dimension 2000 x 7
## - baseline value of -2.371182
##
## SHAP values of first 2 observations:
## VehAge DrivAge logDensity VehPower PolicyRegion
## [1,] -0.193252860 0.007193131 0.186410124 -0.07370722 -0.08530982
## [2,] -0.001457604 -0.119818310 -0.001393612 0.12044796 -0.06117225
## VehBrand VehGas
## [1,] -0.006689112 -0.015635201
## [2,] -0.026296833 0.005503366
##
## Corresponding feature values:
## VehAge DrivAge logDensity VehPower PolicyRegion VehBrand VehGas
## 1 13 53 7.721792 4 Other B1 Regular
## 2 2 55 6.198479 11 Other B12 Diesel
sv_importance(shp, show_numbers = TRUE) +
ggtitle("SHAP importance plot")
sv_importance(shp, kind = "beeswarm") +
ggtitle("SHAP summary plot")
Comments:
To see how a feature affects the prediction, SHAP dependence plots are useful. The dependence plot of a feature j plots the SHAP values \phi_{ij} against the corresponding feature values x_{ij}. Thus, it represents the graph \{(x_{ij}, \phi_{ij}), 1 \le i \le n\}.
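Such a graph can be produced directly from the SHAP matrix and the explanation data; sv_dependence() used next is the convenient way to do this:
# Manual SHAP dependence plot of "VehAge", built directly from the SHAP matrix
# and the explanation data X_explain
data.frame(VehAge = X_explain$VehAge, shap = get_shap_values(shp)[, "VehAge"]) |>
  ggplot(aes(VehAge, shap)) +
  geom_point(alpha = 0.3) +
  labs(y = "SHAP value", title = "Manual dependence plot of 'VehAge'")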
What does the dependence plot of feature “VehAge” look like?
sv_dependence(shp, "VehAge", color_var = NULL) +
ggtitle("SHAP dependence plot of 'VehAge'")
Comments:
At least two approaches are common to study interaction effects using SHAP:
SHAP interactions: SHAP decomposes a prediction into additive contributions of each feature. In the same vein, Lundberg et al. (2020) introduced SHAP interactions, which decompose a prediction into additive contributions of each feature pair. Compared to ordinary SHAP, the computational complexity increases by a factor p, which is why SHAP interaction values are mainly used with the very fast TreeSHAP algorithm, i.e., for tree-based models.
Heuristics: Several heuristics are available that try to identify the feature that explains most of the vertical scatter in the dependence plot conditional on the value on the horizontal axis. The selected feature is then visualized on the color axis.
Using Approach 2, we now highlight a strongly interacting feature on the color axis.
sv_dependence(shp, "VehAge", alpha = 0.4) +
ggtitle("SHAP dependence plot of 'VehAge' with color axis")
Comments:
A typical workflow with tree-based models is to decompose many predictions (1000-2000), and then study SHAP importance and the dependence plots of all features. What are we waiting for?
with_seed(536,
X_explain <- sample_n(train[xvars], 2000)
)
shp <- shapviz(fit_lgb, X_pred = data.matrix(X_explain), X = X_explain)
sv_importance(shp, show_numbers = TRUE) +
ggtitle("SHAP importance plot")
sv_dependence(shp, v = xvars, alpha = 0.5) +
plot_layout(ncol = 2) &
ylim(-0.7, 1.7)
Comments:
It is amazing how much information can be generated with so few lines of code. If the ultimate goal of the modeling task is to obtain a strong (generalized) linear model, we can develop an ML model, evaluate its performance, perform a compact SHAP analysis, and then use these findings to craft a strong GLM. This adopts the logic from the section “Improving Linear Models through XAI” in an extremely efficient manner.
From the plots above, we could say:
Importance: “VehAge”, “logDensity” and “DrivAge” are the three most important features. The GLM will hopefully benefit if their effects are well represented. There is no variable that we should omit.
Effects:
Let us modify the GLM accordingly and focus on the resulting test performance:
fit_glm_imp2 <- glm(
Freq ~ (VehAge == 0) + ns(VehAge, 4) + I(VehAge == 0):I(VehBrand == "B12") +
VehBrand + logDensity + ns(DrivAge, 5) + VehPower + VehGas + VehGas:I(VehAge <= 5),
data = train,
family = quasipoisson(),
weights = Exposure
)
poisson_scorer(
fit_glm_imp2,
X = test,
y = test$Freq,
w = test$Exposure,
reference_mean = weighted.mean(train$Freq, train$Exposure),
type = "response"
)
Comment: The relative deviance gain is clearly higher than for the original GLM (1%) and also better than for the first improved version fit_glm_imp (3.3%).
The following example revisits the dataCar data. To make it self-contained, we repeat the data preprocessing and modeling steps.
library(withr)
library(tidyverse)
library(insuranceData)
library(ranger)
library(xgboost)
library(MetricsWeighted)
library(hstats)
library(shapviz)
library(patchwork)
library(splines)
source("functions.R")
data(dataCar)
# Data preparation
prep <- dataCar |>
mutate(
Freq = numclaims / exposure,
Sev = claimcst0 / numclaims,
veh_age = factor(veh_age),
agecat = factor(agecat),
veh_body = fct_lump_prop(veh_body, 0.1),
log_value = log(pmax(0.3, pmin(15, veh_value)))
)
xvars <- c("log_value", "agecat", "veh_age", "area", "veh_body", "gender")
with_seed(304,
ix <- sample(nrow(prep), 0.8 * nrow(prep))
)
train <- prep[ix, ]
test <- prep[-ix, ]
# Models
form <- Freq ~ log_value + agecat + veh_age + area + veh_body + gender
fit_glm <- glm(form, data = train, weights = exposure, family = quasipoisson())
fit_rf <- ranger(
form, data = train, case.weights = train$exposure, max.depth = 4, seed = 400
)
params <- list(
objective = "count:poisson",
learning_rate = 0.05,
max_depth = 2,
reg_lambda = 1,
reg_alpha = 2,
colsample_bynode = 0.8,
subsample = 1,
min_split_loss = 0,
max_delta_step = 0.1 # with Poisson
)
nrounds <- 304
dtrain <- xgb.DMatrix(
data.matrix(train[xvars]),
label = train$Freq,
weight = train$exposure
)
with_seed(345,
fit_xgb <- xgb.train(params = params, data = dtrain, nrounds = nrounds)
)
# Adapted from lecture notes
models <- list(GLM = fit_glm, RF = fit_rf, XGB = fit_xgb)
predict.list <- function(object, newdata) {
cbind(
GLM = predict(object$GLM, newdata = newdata, type = "response"),
RF = predict(object$RF, data = newdata)$predictions,
XGB = predict(object$XGB, newdata = data.matrix(newdata[xvars]))
)
}