Exercise Set 6: Explainability

In this exercise set, we will be looking at explainability.

Data

The dataset we will be looking at this time is a dataset regarding the determinants of wages from the 1985 Current Population Survey (Berndt, ER. The Practice of Econometrics. 1991. NY: Addison-Wesley), which is open source and available at OpenML. As last time, you’re welcome to use a dataset of your own.

Load data

Here we load our input data into a DataFrame called X and our target data into a Series called y.

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Get wage data
from sklearn.datasets import fetch_openml

survey = fetch_openml(data_id=534, as_frame=True)

X = survey.data[survey.feature_names]

y = survey.target

Here we describe the data using both the documentation which came with the data, but also by computing summary statistics for the input data and target data.

print(survey.DESCR)

**Author**:   
**Source**: Unknown - Date unknown  
**Please cite**:   

Determinants of Wages from the 1985 Current Population Survey

Summary:
The Current Population Survey (CPS) is used to supplement census information between census years. These data consist of a random sample of 534 persons from the CPS, with information on wages and other characteristics of the workers, including sex, number of years of education, years of work experience, occupational status, region of residence and union membership. We wish to determine (i) whether wages are related to these characteristics and (ii) whether there is a gender gap in wages.
Based on residual plots, wages were log-transformed to stabilize the variance. Age and work experience were almost perfectly correlated (r=.98). Multiple regression of log wages against sex, age, years of education, work experience, union membership, southern residence, and occupational status showed that these covariates were related to wages (pooled F test, p < .0001). The effect of age was not significant after controlling for experience. Standardized residual plots showed no patterns, except for one large outlier with lower wages than expected. This was a male, with 22 years of experience and 12 years of education, in a management position, who lived in the north and was not a union member. Removing this person from the analysis did not substantially change the results, so that the final model included the entire sample.
Adjusting for all other variables in the model, females earned 81% (75%, 88%) the wages of males (p < .0001). Wages increased 41% (28%, 56%) for every 5 additional years of education (p < .0001). They increased by 11% (7%, 14%) for every additional 10 years of experience (p < .0001). Union members were paid 23% (12%, 36%) more than non-union members (p < .0001). Northerns were paid 11% (2%, 20%) more than southerns (p =.016). Management and professional positions were paid most, and service and clerical positions were paid least (pooled F-test, p < .0001). Overall variance explained was R2 = .35.
In summary, many factors describe the variations in wages: occupational status, years of experience, years of education, sex, union membership and region of residence. However, despite adjustment for all factors that were available, there still appeared to be a gender gap in wages. There is no readily available explanation for this gender gap.

Authorization: Public Domain

Reference: Berndt, ER. The Practice of Econometrics. 1991. NY: Addison-Wesley.

Description:  The datafile contains 534 observations on 11 variables sampled from the Current Population Survey of 1985.  This data set demonstrates multiple regression, confounding, transformations, multicollinearity, categorical variables, ANOVA, pooled tests of significance, interactions and model building strategies.

Variable names in order from left to right:
EDUCATION: Number of years of education.
SOUTH: Indicator variable for Southern Region (1=Person lives in        South, 0=Person lives elsewhere).
SEX: Indicator variable for sex (1=Female, 0=Male).
EXPERIENCE: Number of years of work experience.
UNION: Indicator variable for union membership (1=Union member,         0=Not union member).
WAGE: Wage (dollars per hour).
AGE: Age (years).
RACE: Race (1=Other, 2=Hispanic, 3=White).
OCCUPATION: Occupational category (1=Management,        2=Sales, 3=Clerical, 4=Service, 5=Professional, 6=Other).
SECTOR: Sector (0=Other, 1=Manufacturing, 2=Construction).
MARR: Marital Status (0=Unmarried,  1=Married)


Therese Stukel
Dartmouth Hitchcock Medical Center
One Medical Center Dr.
Lebanon, NH 03756
e-mail: stukel@dartmouth.edu


Information about the dataset
CLASSTYPE: numeric
CLASSINDEX: none specific

Downloaded from openml.org.

X.describe(include="all")

	EDUCATION	SOUTH	SEX	EXPERIENCE	UNION	AGE	RACE	OCCUPATION	SECTOR	MARR
count	534.000000	534	534	534.000000	534	534.000000	534	534	534	534
unique	NaN	2	2	NaN	2	NaN	3	6	3	2
top	NaN	no	male	NaN	not_member	NaN	White	Other	Other	Married
freq	NaN	378	289	NaN	438	NaN	440	156	411	350
mean	13.018727	NaN	NaN	17.822097	NaN	36.833333	NaN	NaN	NaN	NaN
std	2.615373	NaN	NaN	12.379710	NaN	11.726573	NaN	NaN	NaN	NaN
min	2.000000	NaN	NaN	0.000000	NaN	18.000000	NaN	NaN	NaN	NaN
25%	12.000000	NaN	NaN	8.000000	NaN	28.000000	NaN	NaN	NaN	NaN
50%	12.000000	NaN	NaN	15.000000	NaN	35.000000	NaN	NaN	NaN	NaN
75%	15.000000	NaN	NaN	26.000000	NaN	44.000000	NaN	NaN	NaN	NaN
max	18.000000	NaN	NaN	55.000000	NaN	64.000000	NaN	NaN	NaN	NaN

y.describe()

count    534.000000
mean       9.024064
std        5.139097
min        1.000000
25%        5.250000
50%        7.780000
75%       11.250000
max       44.500000
Name: WAGE, dtype: float64

In this exercise set, we will not perform hyperparameteroptimization, as this simplifies the code.

However, some methods can only explain the training data, whereas some can explain both the training and test data. To examine this, we split up the sample.

Exercise 1.1

Fill in the missing code such that we split the data into 80% train and 20% test.

Hints:

We have previously looked at datasplitting. Try looking at last sessions exercises.

from sklearn.model_selection import FILL IN

X_train, X_test, y_train, y_test = FILL IN(FILL IN, random_state=73)

Linear regression

In this section of the exercises, we will be looking at (regularized) linear regression.

We will be looking at different types of plots, and how regularizing linear regression with LASSO can create sparse models.

First we will use some of sklearn’s built-in functionality for easily handling categorical variables.

Exercise 2.1

What does this code do? Fill in the missing comments.

Hints:

We have not used this functionality before.

If you’re in doubt, try looking up the documentation.

Think about how numerical and categorical variables should be encoded.

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

#
categorical_columns = ["RACE", "OCCUPATION", "SECTOR", "MARR", "UNION", "SEX", "SOUTH"]
numerical_columns = ["EDUCATION", "EXPERIENCE", "AGE"]

#
categorical_encoder = OneHotEncoder()

#
col_transformer = ColumnTransformer(
    [
        ("cat", categorical_encoder, categorical_columns)
    ],
    verbose_feature_names_out=False,
    remainder='passthrough'
)

To start out with, we only introduce a slight amount of regularization.

Exercise 2.2

Fill in the missing code and comments to implement a LASSO regression with \(\alpha=0.0001\).

Hints:

We created a transformer for categorical columns in the previous exercise.

We have looked at this in previous sessions. Try looking at those.

from sklearn.pipeline import FILL IN 
from sklearn.linear_model import FILL IN 
from sklearn.preprocessing import FILL IN 

#
lasso_model = Pipeline([
    ('preprocessor',FILL IN),
    ('scaler', FILL IN),
    ('regr', FILL IN)
    ]
)

#
lasso_model.fit(X_train, y_train)

#
feature_names = lasso_model[:-1].get_feature_names_out()

#
lasso_coefs = pd.DataFrame(
    lasso_model[-1].coef_,
    columns=["Coefficients"],
    index=feature_names,
)

lasso_coefs

As we know from session 4, LASSO creates sparse models due to the corner solutions stemming from L1 normalization.

Exercise 2.3

Looking at the coefficients from exercise 2.2, which coefficients are equal to zero? Why is the second line of code necessary to get the answer we intuitively agree with?

Hints:

Think about numerical accuracy and computers.

print('Equal to zero')
print(lasso_coefs.loc[lasso_coefs.Coefficients == 0])
print()
print('Close to zero')
print(lasso_coefs.loc[np.isclose(lasso_coefs.Coefficients, 0)])

Exercise 2.4

The following code creates a weight plot for the coefficients. How should these be interpreted? Fill in the missing comments.

Hints:

On what scale are the features measured?

fig, ax = plt.subplots(1,1, figsize=(10,10))
#
sns.pointplot(data=lasso_coefs.reset_index(), x='Coefficients', y='index', join=False, ax=ax)
#
plt.title("Lasso model, low regularization")
#
plt.axvline(x=0, color="grey", linestyle='--')
#
plt.xlabel("Coefficient values")
plt.ylabel('')
plt.show()

Exercise 2.5

The following code creates an effect plot for the model. How should these be interpreted? Fill in the missing comments.

Hints:

On what scale are the effects measured?

What does [:-1] and [-1] select from the pipeline?

#
lasso_contributions = lasso_model[:-1].transform(X_train) * lasso_model[-1].coef_
df_lasso_contributions = pd.DataFrame(lasso_contributions, columns=feature_names)

#
fig, ax = plt.subplots(1, 1, figsize=(10,10))
sns.boxplot(data=df_lasso_contributions, orient='h', ax=ax)
plt.axvline(x=0, color="grey", linestyle='--')
plt.xlabel('Prediction contribution')
plt.show()

Exercise 2.6

The following code creates an effect plot for the model with an individual plotted as well. In what situations would a plot like this be useful? Fill in the missing comments.

Hints:

Who’s the receiver of the explanation?

fig, ax = plt.subplots(1, 1, figsize=(10,10))
sns.boxplot(data=df_lasso_contributions, ax=ax)
# 
sns.scatterplot(data=df_lasso_contributions.iloc[0].T, s=200, color="red", marker="x")
plt.xticks(rotation=90)
plt.axhline(y=0, color="grey", linestyle='--')
plt.ylabel('Prediction contribution')
plt.show()

Exercise 2.7

Try running the code with other values of \(\alpha\). Does the model become easier to interpret? Are the same coefficients always high/low?

Hints:

Multicollinearity.

Decision tree

In this section, we will be looking at how to plot decision trees as flowcharts.

Exercise 3.1

Fill in the missing code and comments to create a plot for a decision tree with max depth 1.

Hints:

We’ve covered this in previous sessions. Try looking at our previous materials.

from sklearn.tree import plot_tree, FILL IN 

#
decision_model = Pipeline([
    ('preprocessor', col_transformer),
    ('regr', FILL IN)
    ]
)

#
decision_feature_names = decision_model[0].get_feature_names_out()

#
decision_model.fit(X_train, y_train)

#
plt.figure(figsize=(10,10)) 
plot_tree(decision_model[-1], feature_names = decision_feature_names, impurity=False)
plt.show()

Exercise 3.2

Repeat the following exercise with max depth 2 and 5. Is the model still explainable?

Hints:

Are you explaining a single prediction or the full model?

To increase legibility, try changing the figsize parameter, e.g. (50,10) for the larger model.

# YOUR CODE

Model agnostic methods

In this section of the exercises, we will cover model agnostic methods, such as permutation feature importance, partial dependence plots and SHAP values.

We first create a random forest model, which is not intrinsically interpretable. This is the model we want to understand.

from sklearn.ensemble import RandomForestRegressor

forest_model = Pipeline([
    ('preprocessor', col_transformer),
    ('regr', RandomForestRegressor())
    ]
)
forest_model.fit(X_train, y_train)

feature_names = forest_model[:-1].get_feature_names_out()

Permutation feature importance

Exercise 4.1

Fill in the missing comments to create a feature importance plot. How does one interpret these plots? What’s the difference between the two plots?

Hints:

How are permutation feature importances calculated, and what is measured?

from sklearn.inspection import permutation_importance

#
result = permutation_importance(
    forest_model, X_train, y_train, n_repeats=10, random_state=73, n_jobs=2, scoring='neg_mean_squared_error'
)

#
sorted_importances_idx = result.importances_mean.argsort()
importances = pd.DataFrame(
    result.importances[sorted_importances_idx].T,
    columns=X_test.columns[sorted_importances_idx],
)

#
ax = importances.plot.box(vert=False, whis=10)
ax.set_title("Permutation Importances (train set)")
ax.axvline(x=0, color="k", linestyle="--")
ax.set_xlabel("Decrease in score")
ax.figure.tight_layout()

#
result = permutation_importance(
    forest_model, X_test, y_test, n_repeats=10, random_state=73, n_jobs=2, scoring='neg_mean_squared_error'
)

#
sorted_importances_idx = result.importances_mean.argsort()
importances = pd.DataFrame(
    result.importances[sorted_importances_idx].T,
    columns=X_test.columns[sorted_importances_idx],
)

#
ax = importances.plot.box(vert=False, whis=10)
ax.set_title("Permutation Importances (test set)")
ax.axvline(x=0, color="k", linestyle="--")
ax.set_xlabel("Decrease in score")
ax.figure.tight_layout()

Partial dependencies

sklearn.inspection has built-in functionality for creating plots from partial dependencies. An example including both partial dependence plots and individual conditional expectations can be seen on this page.

Exercise 4.2

Create two one-way partial dependence plots for EDUCATION and AGE, as well as a two-way partial dependence plot in a single figure. Use the training data as background data.

Hints:

This corresponds to part one of the first plot on the webpage.

The code uses feature column indices to note what features to include.

from sklearn.inspection import PartialDependenceDisplay

# Index to feature
for idx, col in enumerate(X_train.columns):
    print(idx, col)

# Alternatively, use dictionary
mapper = {name:idx for idx, name in enumerate(X.columns)}

# Input model, data and columns
PartialDependenceDisplay.from_estimator(FILL IN)
plt.show()

Instead of masking the heterogeneity, we can also limit ourselves to one-way plots and display the heterogeneity in the partial dependencies, creating so-called individual expectation plots.

Exercise 4.3

Create an individual conditional expectation plot for both EDUCATION and AGE in a single plot. It should first be uncentered and then centered, and always include the average. Which do you prefer?

Hints:

Change the kind of the plot

# YOUR CODE

SHAP

Finally, we have the game theoretically founded SHapley Additive exPlanations, which are calculated using the SHAP package. They have a lot of different examples on their page, so look to there for code!

Note that all of the following plots can be created using a single line of code, which perhaps is also one of the causes for it’s popularity.

Exercise 4.4

Fill in the missing comments, and explain to yourself what each of the inputs to the shap.TreeExplainer does.

Does the fact that we use training data as our background data influence what observations we can calculate SHAP values for?

Hints:

The shap.TreeExplainer only accepts arrays as data input.

What is the background data used for?

import shap

#
forest_feature_names = forest_model[:-1].get_feature_names_out()

#
X_train_as_array = forest_model[:-1].transform(X_train)

#
explainer = shap.TreeExplainer(model = forest_model[-1], data = X_train_as_array, feature_names=forest_feature_names, seed=73, feature_perturbation = 'interventional')

#
shap_values = explainer(X_train_as_array,)

Exercise 4.5

Create a force plot of the first observation

Hints:

If the package complains regarding initialization of js, try adding matplotlib=True

# YOUR CODE

Exercise 4.5

Sometimes the force plots can get a bit messy with long column names.

Create a waterfall plot of the first observation.

Did it help with the issue? What are possible downsides?

# YOUR CODE

Exercise 4.6

Although SHAP values are local explanations, they can still be aggregated for groups to gain global insights.

Create a bar plot of the mean absolute SHAP value for each feature

# YOUR CODE

Exercise 4.7

We can also create plots akin to partial dependence plots, examining heterogeneity for different values of a feature

Create a scatter plot of the SHAP values for EDUCATION

# YOUR CODE

Exercise 4.8

To examine interactions between features, dependence plots can be coloured by another feature.

Create a scatter plot of the SHAP values for EDUCATION where the colour is decided by the value of AGE

# YOUR CODE

Finally, we can examine heterogeneity for multiple features at once using beeswarm plots.

Exercise 4.9

Create a beeswarm plot of the SHAP values.

# YOUR CODE