In this exercise set we will be working with estimation of conditional average treatment effects assuming selection on observables.
First we will use the econml package in Python to estimate a double machine learning causal forest, and in the second part we will use the grf package in R to estimate a causal forest.
In this exercise we will be using data from LaLonde, R. J. (1986), "Evaluating the econometric evaluations of training programs with experimental data", The American Economic Review, 604-620, regarding a job training field experiment, where we will examine possible treatment effect heterogeneity. The data were downloaded from NYU and are supplied to you in csv format, slightly cleaned. The outcome of interest is real earnings in 1978, and we assume selection on observables and overlap.
Python
In this first part of the exercise, we will be utilizing Python and econml.
First we load some packages and the data.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

np.random.seed(73)
plt.style.use('seaborn-whitegrid')
%matplotlib inline

df = pd.read_csv('nsw.csv', index_col=0)
df.describe()
            treat         age   education       black    hispanic     married    nodegree          re75          re78
count  705.000000  705.000000  705.000000  705.000000  705.000000  705.000000  705.000000    705.000000    705.000000
mean     0.415603   24.607092   10.272340    0.795745    0.107801    0.164539    0.774468   3116.271386   5586.166074
std      0.493176    6.666376    1.720412    0.403443    0.310350    0.371027    0.418229   5104.566478   6269.582709
min      0.000000   17.000000    3.000000    0.000000    0.000000    0.000000    0.000000      0.000000      0.000000
25%      0.000000   19.000000    9.000000    1.000000    0.000000    0.000000    1.000000      0.000000      0.000000
50%      0.000000   23.000000   10.000000    1.000000    0.000000    0.000000    1.000000   1122.621000   4159.919000
75%      1.000000   27.000000   11.000000    1.000000    0.000000    0.000000    1.000000   4118.681000   8881.665000
max      1.000000   55.000000   16.000000    1.000000    1.000000    1.000000    1.000000  37431.660000  60307.930000
Exercise 1.1
Subset the treatment, outcome and covariates from the DataFrame as T, y and X, respectively
Hints:
A DataFrame supports method .drop(), if one wishes to drop multiple columns at once.
# Your answer here
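One possible approach is sketched below, using a small synthetic DataFrame as a stand-in for the `nsw.csv` data (only a few of the real columns are mimicked here):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df = pd.read_csv('nsw.csv', index_col=0)
rng = np.random.default_rng(73)
df = pd.DataFrame({
    'treat': rng.integers(0, 2, 100),
    'age': rng.integers(17, 55, 100),
    'education': rng.integers(3, 16, 100),
    're78': rng.normal(5000.0, 2000.0, 100),
})

# Treatment and outcome as Series; covariates as everything else
T = df['treat']
y = df['re78']
X = df.drop(columns=['treat', 're78'])
```

With the real data, the same `drop` call would list `'treat'` and `'re78'`, leaving the remaining columns as covariates.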
Exercise 1.2
Estimate a CausalForest using the econml package. Make sure that you use the splitting criterion as in the generalized random forest, but otherwise use default parameters.
Hints:
The documentation for the CausalForest can be found here
The splitting criterion is handled by the criterion parameter
# Your answer here
Exercise 1.3
Predict out of bag estimates of the CATE, and create a histogram of these. Do you observe heterogeneity?
Hints:
The documentation for the CausalForest can be found here
Out of bag is sometimes shortened oob
You can create a histogram using sns.histplot
# Your answer here
Exercise 1.4
Estimate a CausalForestDML using the econml package. Make sure that you 1) use the splitting criterion as in the generalized random forest, 2) interpret the treatment as a discrete treatment and 3) estimate a thousand trees, but otherwise use default parameters.
Hints:
The documentation for the CausalForestDML can be found here
# Your code here
Exercise 1.5
Report the doubly robust average treatment effect
Hints:
The documentation for the CausalForestDML can be found here
A function which summarizes the model findings is available
# Your code here
Exercise 1.6
Examine what variables drive the heterogeneity using the split based feature importance method.
Hints:
The documentation for the CausalForestDML can be found here
The feature importances and feature names are available through a method
The feature value colourbar might disappear, in which case the following two lines of code might help:
plt.gcf().axes[-1].set_aspect(100)
plt.gcf().axes[-1].set_box_aspect(100)
# Your code here
Exercise 1.10
Create a shallow decision tree of depth 2 to explain the CATE as a function of the inputs using SingleTreeCateInterpreter. Does the tree split on the variables you expected it to?
In the second part of the exercise, we will be utilizing R and grf. I will supply R code that has been commented out (# for code, ## for comments); it will not run in Python if you uncomment it. One can run R code in Python using rpy2, but I generally recommend using R directly rather than rpy2, due to the difficulty of troubleshooting when rpy2 fails.
The many different functions available in grf can be seen in their reference and they have a lot of great tutorials, accessible at the top of their page.