import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import os
import urllib.request
## Ignore warnings
import warnings
warnings.filterwarnings("ignore")
# Set seed and plot style
np.random.seed(73)
plt.style.use('seaborn-whitegrid')
%matplotlib inlineExercise set 9: Double machine learning
In this exercise set we will once again be working with estimation of conditional average treatment effects assuming selection on observables. However, this time the focus will be more broadly on double machine learning in econml, learning how to utilize these methods to get estimates, both assuming a partially linear model (LinearDML) and non-parametrically using causal forests (CausalForestDML).
As an example we will examine the age-old question of orange juice price-elasticity, which, as we all know, have haunted economists for millenia. To answer this question we will use a subset of Dominick’s dataset from the James M. Kilts Center, University of Chicago Booth School of Business. The data is a repeated cross sectional from stores (which we pool), where our main variables of interest are the amount of units sold (outcome) and the price of orange juice (treatment) and median income in the neighborhood (treatment effect heterogeneity). A description of the dataset can be seen here. Throughout, we assume selection on observables. This exercise was in part inspired by the econml notebooks on causal model selection and double machine learning.
Getting started
In this first part of the exercise we will be digging straight into estimating treatment effects using double machine learning. As such, we will be postponing training of models for predicting Y and T for just a moment, although this is an essential part of double machine learning and should not be neglected.
Exercise 1.1
Load the data using the following code and verify that you have correctly loaded the
DataFrameby printing the first 5 rows.NOTE: The following code will download the file which might take a few seconds dependent on your internet.
Hints:
DataFrame’s have a method called .head()
# Import the data
file_name = "oj_large.csv"
if not os.path.isfile(file_name):
print("Download file")
urllib.request.urlretrieve("https://msalicedatapublic.blob.core.windows.net/datasets/OrangeJuice/oj_large.csv", file_name)
oj_data = pd.read_csv(file_name)# Your codeExercise 1.2
The following code subsets the data into different parts corresponding to
X,Y,WandT, but have been named temporary names. Which is which, and why?Hints:
What’s just confounders and what drives heterogeneity?
# Prepare data
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
scaler = StandardScaler()
non_categorical_cols = ['feat', 'AGE60', 'EDUC', 'ETHNIC', 'HHLARGE', 'WORKWOM', 'HVAL150', 'SSTRDIST', 'SSTRVOL', 'CPDIST5', 'CPWVOL5']
categorical_cols = ['brand']
temp_1 = scaler.fit_transform(oj_data[non_categorical_cols].values)
temp_2 = scaler.fit_transform(pd.get_dummies(oj_data[categorical_cols]).values)
# Stacks categorical and non categorical variables together
temp_3 = np.hstack([temp_1, temp_2])
temp_4 = np.log(oj_data["price"]).values
temp_5 = oj_data['logmove'].values
temp_6 = scaler.fit_transform(oj_data[['INCOME']].values)
X = # FILL IN
W = # FILL IN
Y = # FILL IN
T = # FILL IN
XW = np.hstack([X, W])
Y_train, Y_val, T_train, T_val, X_train, X_val, W_train, W_val, XW_train, XW_val = train_test_split(Y, T, X, W, XW, test_size=.2)Exercise 1.3
Create an instance of a
LinearDMLand fit it to the training data using default input parameters.Hints:
There’s an example on this page
# Your codeExercise 1.4
Look at the documentation for
LinearDML, which can be found here.How are the models for
YandTcreated? Does this explain why the data was scaled?
# Your answerExercise 1.5
Get an estimate of the treatment effect heterogeneity using the
summarymethod. Is the sign as expected?
# Your codeExercise 1.6
Estimate and plot the conditional average treatment effect and the the 95% confidence interval with the
LinearDMLmodel on the followingX_testdata, which generates counterfactual income levels ranging from -1 to 1.Hints:
There documentation for
LinearDMLcan be found on this page.Try looking for methods that start with
effect
## Generate test data
min_income = -1
max_income = 1
delta = (1 - (-1)) / 100
X_test = np.arange(min_income, max_income + delta - 0.001, delta).reshape(-1,1)
# Calculate treatment effect and interval
te_pred_linear = # FILL IN
te_pred_interval_linear = # FILL IN
# Plot Orange Juice elasticity as a function of income
plt.figure(figsize=(10,6))
plt.plot(X_test, te_pred_linear, label="Linear model")
plt.xlabel(r'Scale(Income)')
plt.ylabel('Orange Juice Elasticity')
plt.legend()
plt.title("Orange Juice Elasticity vs Income")
plt.show()Exercise 1.7
Create an instance of a
CausalForestDMLand fit it to the training data using default input parameters.Hints:
It follows exactly the same recipe as
LinearDML.
# Your codeExercise 1.8
Estimate and plot the conditional average treatment effect and the the 95% confidence interval with the
CausalForestDMLmodelX_test.Hints:
It follows exactly the same recipe as
LinearDML.
# Your codeAs discussed during the lecture, we can use the R-loss to evaluate the fit of the conditional average treatment effect and use this to perform causal model selection.
This could be done manually, but much like grf in R, econml also offers automatic tuning with the tune method, which we will utilize to tune the causal forest in CausalForestDML.
Exercise 1.9
Create an instance of a
CausalForestDML, tune it on the training data and then fit it to the training data, all using default input parameters.Hints:
The call to
tunethe model looks exactly like the call tofitthe model.
# Your codeExercise 1.8
Estimate and plot the conditional average treatment effect and the the 95% confidence interval with the tuned
CausalForestDMLmodel onX_test.Hints:
It follows exactly the same recipe as
LinearDMLand an untunedCausalForestDML.
# Your codeExercise 1.9
Plot the conditional average treatment effect and the the 95% confidence interval for all three models on
X_test. Which do your prefer?Hints:
You can call
plotandfill_betweenrepeatedly before callingxlabel
# Your codeExercise 1.10
econmlimplements a scoring function using the R-loss, called theRscorer. Fit theRscorerto the appropriate data sample.NOTE: The
Rscorerneeds a model to create residuals. Here we input aLassoCV, which is also the default in the double machine learning models. As such we obtain similar residuals.Hints:
Should we use training data or held out data for causal model selection?
# Import model
from sklearn.linear_model import LassoCV
from econml.score import RScorer
# Create scorer
scorer = RScorer(model_y=LassoCV(), model_t=LassoCV())
# FILL INExercise 1.11
Score the models using the
Rscorer’sbest_modelmethod. Which model is the preferred one? Is it preferred over a constant average treatment effect?Hints:
If you’re in doubt as to which model the method has selected, you can return all the scores by setting
return_scores = Trueand compare the best score to the listThe
best_modelmethod accepts a list of fitted estimators, and the documentation can be seen here
# Your codePredicting treatment and outcome
Having now looked closer at how to code up estimation of heterogeneous treatment effects using double machine learning estimators, we will start with the task that is predicting both Y and T, from which we can learn the optimal hyperparameters to pass on to our double machine learning estimators.
In practice, this is probably where you will be spending most of the time, optimizing features and models to accurately predict treatment and outcome, thus achieving better converge rates.
We have covered this in both session 3 and 4, where we covered model selection and supervised learning, respectively.
Exercise 2.1
What covariates should we use to predict
YandT? Is this part of the train test split we made in exercise 1.1?Hints:
Look at the struqtural equations in the lecture slides. What enters in the nuisance functions?
# Your answerTo make this go by slightly faster, I have pre-selected three models and their respective hyperparametergrids for which to search over. Furthermore, we utilize 2 fold cross validation (matching the default amount of folds in econml) and 10 random hyperparameter combinations using RandomizedSearchCV, which we covered in session 3.
Each model should be using negative_mean_squared_error, and you should make note of the best performing hyper mean squared error, such that you can compare performance across estimators.
Exercise 2.2
Create a
Lassoto predict the outcome,Y.Save the best hyperparameter combination.
Hints:
The best score and best hyperparameter can be found using the methods
best_score_andbest_param_respectively.
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn # FILL IN
pipe_lasso = Pipeline([
# FILL IN
]
)
param_grid = {'lasso__alpha':np.logspace(-5, 3, 10)}
rs = RandomizedSearchCV(
# FILL IN
)
# Fit
# FILL IN
# Score
# FILL INExercise 2.3
Create a
RandomForestRegressorto predict the outcome,Y.Save the best hyperparameter combination.
Hints:
The best score and best hyperparameter can be found using the methods
best_score_andbest_param_respectively.
from sklearn # FILL IN
pipe_forest = Pipeline([
# FILL IN
]
)
param_grid= [{
'forest__n_estimators': np.unique(np.logspace(0, 3, 25).astype(int)),
'forest__max_features': np.arange(0.1, 1.01, 0.1),
'forest__min_samples_split': np.arange(0.01, 0.2, 0.01),
'forest__min_samples_leaf': np.arange(2, 50, 2),
'forest__max_depth': np.unique(np.logspace(1, 4, 20).astype(int))
}]
rs = RandomizedSearchCV(
# FILL IN
)
# Fit
# FILL IN
# Score
# FILL INExercise 2.4
Create a
HistGradientBoostingRegressorto predict the outcome,Y.Save the best hyperparameter combination.
NOTE: The
HistGradientBoostingRegressoris an efficient model which does gradient boosting.Hints:
The best score and best hyperparameter can be found using the methods
best_score_andbest_param_respectively.
# Your code
from sklearn # FILL IN
pipe_booster = Pipeline([
# FILL IN
]
)
param_grid= [{
'booster__min_samples_leaf': np.arange(2, 50, 2),
'booster__max_depth': np.unique(np.logspace(1, 4, 20).astype(int)),
'booster__learning_rate':np.arange(0,1.001,0.1)
}]
rs = RandomizedSearchCV(
# FILL IN
)
# Fit
# FILL IN
# Score
# FILL INExercise 2.5
Which model best predicts
Y? Create a model of that type with the best hyperparameter combination
# Your codeExercise 2.6
You have now found the model which best predicts
Y, and now we repeat the same process forTTo find the best model to predict
T, repeat exercise 2.2 through 2.5 but predictingT
# Your codeExercise 2.7
Estimate a
LinearDMLas well as an untuned and tunedCausalForestDMLusing the new models forTandY.Plot the conditional average treatment effect and the the 95% confidence interval for all three models on
X_test. Which do your prefer?
# Your codeExercise 2.8
Score the three new models using the
Rscorer, now predicting the residuals using the models forTand `Y. Which model is the preferred one? Is it preferred over a constant average treatment effect?
# Your codeFinishing thoughts
Today we have examined double machine learning and how to implement these models.
In practice, you should spend more time creating the models for T and Y, as they are quintessential in double machine learning. You should also increase the amount of folds in cross validation, as this increases the amount of data to train on when creating residuals. Furthermore, both the models and Rscorer support monte carlo iterations, mc_iters, which allows you to repeat the same process again with different splits, thus creating less noisy estimates of the residuals. We have not done this today as the running time of these models quite quickly becomes too long for an exercise set.
If you wish to examine linear treatment heterogeneity, you should also look into the input parameters featurizer and treatment_featurizer, which enables quick and easy interactions of covariates and treatments, possibly in combination with a SparseLinearDML. See e.g. this notebook
All of the econml functionality that we went through last week can also still be used with our models, e.g. explainability, CATE interpreters and policy learning.
Additionally, you should be aware that econml has a folder with notebook examples on their GitHub, where you can see the many different possibilities.