import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
# Get wage data
from sklearn.datasets import fetch_openml
survey = fetch_openml(data_id=534, as_frame=True)
X = survey.data[survey.feature_names]
y = survey.target
Exercise Set 6: Explainability
In this exercise set, we will be looking at explainability.
Data
The dataset we will be looking at this time is a dataset regarding the determinants of wages from the 1985 Current Population Survey (Berndt, ER. The Practice of Econometrics. 1991. NY: Addison-Wesley), which is open source and available at OpenML. As last time, you’re welcome to use a dataset of your own.
Load data
Here we load our input data into a DataFrame called X and our target data into a Series called y.
Here we describe the data using both the documentation that came with it and summary statistics computed for the input data and target data.
print(survey.DESCR)
**Author**:
**Source**: Unknown - Date unknown
**Please cite**:
Determinants of Wages from the 1985 Current Population Survey
Summary:
The Current Population Survey (CPS) is used to supplement census information between census years. These data consist of a random sample of 534 persons from the CPS, with information on wages and other characteristics of the workers, including sex, number of years of education, years of work experience, occupational status, region of residence and union membership. We wish to determine (i) whether wages are related to these characteristics and (ii) whether there is a gender gap in wages.
Based on residual plots, wages were log-transformed to stabilize the variance. Age and work experience were almost perfectly correlated (r=.98). Multiple regression of log wages against sex, age, years of education, work experience, union membership, southern residence, and occupational status showed that these covariates were related to wages (pooled F test, p < .0001). The effect of age was not significant after controlling for experience. Standardized residual plots showed no patterns, except for one large outlier with lower wages than expected. This was a male, with 22 years of experience and 12 years of education, in a management position, who lived in the north and was not a union member. Removing this person from the analysis did not substantially change the results, so that the final model included the entire sample.
Adjusting for all other variables in the model, females earned 81% (75%, 88%) the wages of males (p < .0001). Wages increased 41% (28%, 56%) for every 5 additional years of education (p < .0001). They increased by 11% (7%, 14%) for every additional 10 years of experience (p < .0001). Union members were paid 23% (12%, 36%) more than non-union members (p < .0001). Northerns were paid 11% (2%, 20%) more than southerns (p =.016). Management and professional positions were paid most, and service and clerical positions were paid least (pooled F-test, p < .0001). Overall variance explained was R2 = .35.
In summary, many factors describe the variations in wages: occupational status, years of experience, years of education, sex, union membership and region of residence. However, despite adjustment for all factors that were available, there still appeared to be a gender gap in wages. There is no readily available explanation for this gender gap.
Authorization: Public Domain
Reference: Berndt, ER. The Practice of Econometrics. 1991. NY: Addison-Wesley.
Description: The datafile contains 534 observations on 11 variables sampled from the Current Population Survey of 1985. This data set demonstrates multiple regression, confounding, transformations, multicollinearity, categorical variables, ANOVA, pooled tests of significance, interactions and model building strategies.
Variable names in order from left to right:
EDUCATION: Number of years of education.
SOUTH: Indicator variable for Southern Region (1=Person lives in South, 0=Person lives elsewhere).
SEX: Indicator variable for sex (1=Female, 0=Male).
EXPERIENCE: Number of years of work experience.
UNION: Indicator variable for union membership (1=Union member, 0=Not union member).
WAGE: Wage (dollars per hour).
AGE: Age (years).
RACE: Race (1=Other, 2=Hispanic, 3=White).
OCCUPATION: Occupational category (1=Management, 2=Sales, 3=Clerical, 4=Service, 5=Professional, 6=Other).
SECTOR: Sector (0=Other, 1=Manufacturing, 2=Construction).
MARR: Marital Status (0=Unmarried, 1=Married)
Therese Stukel
Dartmouth Hitchcock Medical Center
One Medical Center Dr.
Lebanon, NH 03756
e-mail: stukel@dartmouth.edu
Information about the dataset
CLASSTYPE: numeric
CLASSINDEX: none specific
Downloaded from openml.org.
="all") X.describe(include
|        | EDUCATION  | SOUTH | SEX  | EXPERIENCE | UNION      | AGE        | RACE  | OCCUPATION | SECTOR | MARR    |
|--------|------------|-------|------|------------|------------|------------|-------|------------|--------|---------|
| count  | 534.000000 | 534   | 534  | 534.000000 | 534        | 534.000000 | 534   | 534        | 534    | 534     |
| unique | NaN        | 2     | 2    | NaN        | 2          | NaN        | 3     | 6          | 3      | 2       |
| top    | NaN        | no    | male | NaN        | not_member | NaN        | White | Other      | Other  | Married |
| freq   | NaN        | 378   | 289  | NaN        | 438        | NaN        | 440   | 156        | 411    | 350     |
| mean   | 13.018727  | NaN   | NaN  | 17.822097  | NaN        | 36.833333  | NaN   | NaN        | NaN    | NaN     |
| std    | 2.615373   | NaN   | NaN  | 12.379710  | NaN        | 11.726573  | NaN   | NaN        | NaN    | NaN     |
| min    | 2.000000   | NaN   | NaN  | 0.000000   | NaN        | 18.000000  | NaN   | NaN        | NaN    | NaN     |
| 25%    | 12.000000  | NaN   | NaN  | 8.000000   | NaN        | 28.000000  | NaN   | NaN        | NaN    | NaN     |
| 50%    | 12.000000  | NaN   | NaN  | 15.000000  | NaN        | 35.000000  | NaN   | NaN        | NaN    | NaN     |
| 75%    | 15.000000  | NaN   | NaN  | 26.000000  | NaN        | 44.000000  | NaN   | NaN        | NaN    | NaN     |
| max    | 18.000000  | NaN   | NaN  | 55.000000  | NaN        | 64.000000  | NaN   | NaN        | NaN    | NaN     |
y.describe()
count 534.000000
mean 9.024064
std 5.139097
min 1.000000
25% 5.250000
50% 7.780000
75% 11.250000
max 44.500000
Name: WAGE, dtype: float64
In this exercise set, we will not perform hyperparameter optimization, as this simplifies the code.
However, some methods can only explain the training data, whereas others can explain both the training and test data. To examine this, we split up the sample.
Exercise 1.1
Fill in the missing code such that we split the data into 80% train and 20% test.
Hints:
We have previously looked at data splitting. Try looking at last session's exercises.
from sklearn.model_selection import FILL IN
X_train, X_test, y_train, y_test = FILL IN(FILL IN, random_state=73)
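One possible way to fill in the blanks is a sketch using train_test_split with a 20% test share (the random_state is kept from the template):

from sklearn.model_selection import train_test_split

# Split into 80% train and 20% test; random_state fixed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=73
)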
Linear regression
In this section of the exercises, we will be looking at (regularized) linear regression.
We will be looking at different types of plots, and how regularizing linear regression with LASSO can create sparse models.
First we will use some of sklearn’s built-in functionality for easily handling categorical variables.
Exercise 2.1
What does this code do? Fill in the missing comments.
Hints:
We have not used this functionality before.
If you’re in doubt, try looking up the documentation.
Think about how numerical and categorical variables should be encoded.
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
#
categorical_columns = ["RACE", "OCCUPATION", "SECTOR", "MARR", "UNION", "SEX", "SOUTH"]
numerical_columns = ["EDUCATION", "EXPERIENCE", "AGE"]

#
categorical_encoder = OneHotEncoder()

#
col_transformer = ColumnTransformer(
    [
        ("cat", categorical_encoder, categorical_columns)
    ],
    verbose_feature_names_out=False,
    remainder='passthrough'
)
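As a quick sanity check of what the transformer produces, one could fit it and inspect the resulting feature names (a sketch, assuming X_train from the split above):

# Fit the transformer and print the encoded feature names
col_transformer.fit(X_train)
print(col_transformer.get_feature_names_out())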
To start out with, we only introduce a slight amount of regularization.
Exercise 2.2
Fill in the missing code and comments to implement a LASSO regression with \(\alpha=0.0001\).
Hints:
We created a transformer for categorical columns in the previous exercise.
We have looked at this in previous sessions. Try looking at those.
from sklearn.pipeline import FILL IN
from sklearn.linear_model import FILL IN
from sklearn.preprocessing import FILL IN
#
lasso_model = Pipeline([
    ('preprocessor', FILL IN),
    ('scaler', FILL IN),
    ('regr', FILL IN)
])

#
lasso_model.fit(X_train, y_train)

#
feature_names = lasso_model[:-1].get_feature_names_out()

#
lasso_coefs = pd.DataFrame(
    lasso_model[-1].coef_,
    columns=["Coefficients"],
    index=feature_names,
)
lasso_coefs
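One possible way to fill in the template (a sketch; scaling before the LASSO step makes the coefficients comparable across features, and if the transformer output happens to be sparse, StandardScaler(with_mean=False) may be needed instead):

from sklearn.pipeline import Pipeline
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Preprocess, scale, then fit a weakly regularized LASSO
lasso_model = Pipeline([
    ('preprocessor', col_transformer),
    ('scaler', StandardScaler()),
    ('regr', Lasso(alpha=0.0001))
])
lasso_model.fit(X_train, y_train)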
As we know from session 4, LASSO creates sparse models due to the corner solutions stemming from the L1 penalty.
Exercise 2.3
Looking at the coefficients from exercise 2.2, which coefficients are equal to zero? Why is the second check necessary to get the answer we intuitively expect?
Hints:
Think about numerical accuracy and computers.
print('Equal to zero')
print(lasso_coefs.loc[lasso_coefs.Coefficients == 0])
print()
print('Close to zero')
print(lasso_coefs.loc[np.isclose(lasso_coefs.Coefficients, 0)])
Exercise 2.4
The following code creates a weight plot for the coefficients. How should these be interpreted? Fill in the missing comments.
Hints:
On what scale are the features measured?
fig, ax = plt.subplots(1, 1, figsize=(10, 10))

#
sns.pointplot(data=lasso_coefs.reset_index(), x='Coefficients', y='index', join=False, ax=ax)

#
plt.title("Lasso model, low regularization")

#
plt.axvline(x=0, color="grey", linestyle='--')

#
plt.xlabel("Coefficient values")
plt.ylabel('')
plt.show()
Exercise 2.5
The following code creates an effect plot for the model. How should these be interpreted? Fill in the missing comments.
Hints:
On what scale are the effects measured?
What does [:-1] and [-1] select from the pipeline?
#
lasso_contributions = lasso_model[:-1].transform(X_train) * lasso_model[-1].coef_
df_lasso_contributions = pd.DataFrame(lasso_contributions, columns=feature_names)

#
fig, ax = plt.subplots(1, 1, figsize=(10, 10))
sns.boxplot(data=df_lasso_contributions, orient='h', ax=ax)
plt.axvline(x=0, color="grey", linestyle='--')
plt.xlabel('Prediction contribution')
plt.show()
Exercise 2.6
The following code creates an effect plot for the model, with an individual observation plotted as well. In what situations would a plot like this be useful? Fill in the missing comments.
Hints:
Who’s the receiver of the explanation?
fig, ax = plt.subplots(1, 1, figsize=(10, 10))
sns.boxplot(data=df_lasso_contributions, ax=ax)

#
sns.scatterplot(data=df_lasso_contributions.iloc[0].T, s=200, color="red", marker="x")
plt.xticks(rotation=90)
plt.axhline(y=0, color="grey", linestyle='--')
plt.ylabel('Prediction contribution')
plt.show()
Exercise 2.7
Try running the code with other values of \(\alpha\). Does the model become easier to interpret? Are the same coefficients always high/low?
Hints:
Multicollinearity.
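A sketch of how one might re-run the model for several values of \(\alpha\) and count the surviving coefficients (assumes the lasso_model pipeline from exercise 2.2, with its LASSO step named 'regr'):

# Refit the pipeline for increasing regularization strengths
for alpha in [0.0001, 0.01, 0.1, 1.0]:
    lasso_model.set_params(regr__alpha=alpha)
    lasso_model.fit(X_train, y_train)
    n_nonzero = np.sum(~np.isclose(lasso_model[-1].coef_, 0))
    print(f"alpha={alpha}: {n_nonzero} non-zero coefficients")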
Decision tree
In this section, we will be looking at how to plot decision trees as flowcharts.
Exercise 3.1
Fill in the missing code and comments to create a plot for a decision tree with max depth 1.
Hints:
We’ve covered this in previous sessions. Try looking at our previous materials.
from sklearn.tree import plot_tree, FILL IN

#
decision_model = Pipeline([
    ('preprocessor', col_transformer),
    ('regr', FILL IN)
])

#
decision_model.fit(X_train, y_train)

#
decision_feature_names = decision_model[0].get_feature_names_out()

#
plt.figure(figsize=(10, 10))
plot_tree(decision_model[-1], feature_names=decision_feature_names, impurity=False)
plt.show()
Exercise 3.2
Repeat the previous exercise with max depth 2 and 5. Is the model still explainable?
Hints:
Are you explaining a single prediction or the full model?
To increase legibility, try changing the figsize parameter, e.g. (50,10) for the larger model.
# YOUR CODE
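A possible solution sketch, assuming the FILL IN in exercise 3.1 was a DecisionTreeRegressor (note the larger figsize for the deeper tree, following the hint):

from sklearn.tree import DecisionTreeRegressor

# Refit and plot the tree at increasing depths
for depth, size in [(2, (20, 10)), (5, (50, 10))]:
    decision_model.set_params(regr=DecisionTreeRegressor(max_depth=depth))
    decision_model.fit(X_train, y_train)
    plt.figure(figsize=size)
    plot_tree(decision_model[-1], feature_names=decision_feature_names, impurity=False)
    plt.show()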
Model-agnostic methods
In this section of the exercises, we will cover model-agnostic methods, such as permutation feature importance, partial dependence plots and SHAP values.
We first create a random forest model, which is not intrinsically interpretable. This is the model we want to understand.
from sklearn.ensemble import RandomForestRegressor
forest_model = Pipeline([
    ('preprocessor', col_transformer),
    ('regr', RandomForestRegressor())
])
forest_model.fit(X_train, y_train)
feature_names = forest_model[:-1].get_feature_names_out()
Permutation feature importance
Exercise 4.1
Fill in the missing comments to create a feature importance plot. How does one interpret these plots? What’s the difference between the two plots?
Hints:
How are permutation feature importances calculated, and what is measured?
from sklearn.inspection import permutation_importance
#
result = permutation_importance(
    forest_model, X_train, y_train,
    n_repeats=10, random_state=73, n_jobs=2, scoring='neg_mean_squared_error'
)

#
sorted_importances_idx = result.importances_mean.argsort()
importances = pd.DataFrame(
    result.importances[sorted_importances_idx].T,
    columns=X_test.columns[sorted_importances_idx],
)

#
ax = importances.plot.box(vert=False, whis=10)
ax.set_title("Permutation Importances (train set)")
ax.axvline(x=0, color="k", linestyle="--")
ax.set_xlabel("Decrease in score")
ax.figure.tight_layout()

#
result = permutation_importance(
    forest_model, X_test, y_test,
    n_repeats=10, random_state=73, n_jobs=2, scoring='neg_mean_squared_error'
)

#
sorted_importances_idx = result.importances_mean.argsort()
importances = pd.DataFrame(
    result.importances[sorted_importances_idx].T,
    columns=X_test.columns[sorted_importances_idx],
)

#
ax = importances.plot.box(vert=False, whis=10)
ax.set_title("Permutation Importances (test set)")
ax.axvline(x=0, color="k", linestyle="--")
ax.set_xlabel("Decrease in score")
ax.figure.tight_layout()
Partial dependencies
sklearn.inspection has built-in functionality for creating plots from partial dependencies. An example including both partial dependence plots and individual conditional expectations can be seen on this page.
Exercise 4.2
Create two one-way partial dependence plots for EDUCATION and AGE, as well as a two-way partial dependence plot, in a single figure. Use the training data as background data.
Hints:
This corresponds to part one of the first plot on the webpage.
The code uses feature column indices to note what features to include.
from sklearn.inspection import PartialDependenceDisplay
# Index to feature
for idx, col in enumerate(X_train.columns):
print(idx, col)
# Alternatively, use dictionary
mapper = {name: idx for idx, name in enumerate(X.columns)}

# Input model, data and columns
PartialDependenceDisplay.from_estimator(FILL IN)
plt.show()
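One possible way to fill in the call, using the mapper dictionary from the cell above to translate column names into indices; a pair of indices yields the two-way plot:

# Two one-way plots and one two-way plot in a single figure
PartialDependenceDisplay.from_estimator(
    forest_model,
    X_train,
    features=[mapper["EDUCATION"], mapper["AGE"], (mapper["EDUCATION"], mapper["AGE"])],
)
plt.show()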
Instead of masking the heterogeneity, we can also limit ourselves to one-way plots and display the heterogeneity in the partial dependencies, creating so-called individual conditional expectation plots.
Exercise 4.3
Create an individual conditional expectation plot for both EDUCATION and AGE in a single plot. It should first be uncentered and then centered, and always include the average. Which do you prefer?
Hints:
Change the kind of the plot.
# YOUR CODE
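A possible solution sketch; kind="both" draws the individual curves together with their average, and the centered argument (available in recent scikit-learn versions) anchors every curve at the first grid point:

# Uncentered ICE curves with the average (PDP) overlaid
PartialDependenceDisplay.from_estimator(
    forest_model, X_train, features=[mapper["EDUCATION"], mapper["AGE"]], kind="both"
)
plt.show()

# Centered version of the same plot
PartialDependenceDisplay.from_estimator(
    forest_model, X_train, features=[mapper["EDUCATION"], mapper["AGE"]], kind="both", centered=True
)
plt.show()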
SHAP
Finally, we have the game-theoretically founded SHapley Additive exPlanations, which are calculated using the SHAP package. The package has many examples on its page, so look there for code!
Note that all of the following plots can be created using a single line of code, which is perhaps one of the causes of its popularity.
Exercise 4.4
Fill in the missing comments, and explain to yourself what each of the inputs to the shap.TreeExplainer does. Does the fact that we use training data as our background data influence what observations we can calculate SHAP values for?
Hints:
The shap.TreeExplainer only accepts arrays as data input.
What is the background data used for?
import shap
#
forest_feature_names = forest_model[:-1].get_feature_names_out()

#
X_train_as_array = forest_model[:-1].transform(X_train)

#
explainer = shap.TreeExplainer(
    model=forest_model[-1],
    data=X_train_as_array,
    feature_names=forest_feature_names,
    seed=73,
    feature_perturbation='interventional',
)

#
shap_values = explainer(X_train_as_array)
Exercise 4.5
Create a force plot of the first observation.
Hints:
If the package complains regarding initialization of js, try adding matplotlib=True.
# YOUR CODE
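A possible one-liner, following the hint:

# Force plot of the first observation; matplotlib=True avoids the js initialization issue
shap.plots.force(shap_values[0], matplotlib=True)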
Exercise 4.6
Sometimes the force plots can get a bit messy with long column names.
Create a waterfall plot of the first observation.
Did it help with the issue? What are possible downsides?
# YOUR CODE
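A possible one-liner:

# Waterfall plot of the first observation
shap.plots.waterfall(shap_values[0])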
Exercise 4.7
Although SHAP values are local explanations, they can still be aggregated for groups to gain global insights.
Create a bar plot of the mean absolute SHAP value for each feature.
# YOUR CODE
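A possible one-liner; shap.plots.bar aggregates an Explanation into mean absolute SHAP values per feature:

# Bar plot of mean(|SHAP value|) per feature
shap.plots.bar(shap_values)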
Exercise 4.8
We can also create plots akin to partial dependence plots, examining heterogeneity across different values of a feature.
Create a scatter plot of the SHAP values for EDUCATION.
# YOUR CODE
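A possible one-liner, slicing the Explanation object by feature name:

# SHAP values for EDUCATION against its feature values
shap.plots.scatter(shap_values[:, "EDUCATION"])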
Exercise 4.9
To examine interactions between features, dependence plots can be coloured by another feature.
Create a scatter plot of the SHAP values for EDUCATION where the colour is decided by the value of AGE.
# YOUR CODE
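A possible one-liner; the color argument accepts another Explanation slice:

# Colour the EDUCATION scatter by the value of AGE to surface interactions
shap.plots.scatter(shap_values[:, "EDUCATION"], color=shap_values[:, "AGE"])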
Finally, we can examine heterogeneity for multiple features at once using beeswarm plots.
Exercise 4.10
Create a beeswarm plot of the SHAP values.
# YOUR CODE
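A possible one-liner:

# Beeswarm plot across all features
shap.plots.beeswarm(shap_values)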