Exercise set 9: Double machine learning

In this exercise set we will once again be working with estimation of conditional average treatment effects assuming selection on observables. However, this time the focus will be more broadly on double machine learning in econml, learning how to utilize these methods to get estimates, both assuming a partially linear model (LinearDML) and non-parametrically using causal forests (CausalForestDML).

As an example we will examine the age-old question of orange juice price-elasticity, which, as we all know, have haunted economists for millenia. To answer this question we will use a subset of Dominick’s dataset from the James M. Kilts Center, University of Chicago Booth School of Business. The data is a repeated cross sectional from stores (which we pool), where our main variables of interest are the amount of units sold (outcome) and the price of orange juice (treatment) and median income in the neighborhood (treatment effect heterogeneity). A description of the dataset can be seen here. Throughout, we assume selection on observables. This exercise was in part inspired by the econml notebooks on causal model selection and double machine learning.

import matplotlib.pyplot as plt
import numpy as np 
import pandas as pd 
import os
import urllib.request

## Ignore warnings
import warnings
warnings.filterwarnings("ignore")

# Set seed and plot style
np.random.seed(73)
plt.style.use('seaborn-whitegrid')

%matplotlib inline

Getting started

In this first part of the exercise we will be digging straight into estimating treatment effects using double machine learning. As such, we will be postponing training of models for predicting Y and T for just a moment, although this is an essential part of double machine learning and should not be neglected.

Exercise 1.1

Load the data using the following code and verify that you have correctly loaded the DataFrame by printing the first 5 rows.

NOTE: The following code will download the file which might take a few seconds dependent on your internet.

Hints:

DataFrame’s have a method called .head()

# Import the data
file_name = "oj_large.csv"

if not os.path.isfile(file_name):
    print("Download file")
    urllib.request.urlretrieve("https://msalicedatapublic.blob.core.windows.net/datasets/OrangeJuice/oj_large.csv", file_name)
    
oj_data = pd.read_csv(file_name)

# Your code

Exercise 1.2

The following code subsets the data into different parts corresponding to X, Y, W and T, but have been named temporary names. Which is which, and why?

Hints:

What’s just confounders and what drives heterogeneity?

# Prepare data

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

scaler = StandardScaler()

non_categorical_cols = ['feat', 'AGE60', 'EDUC', 'ETHNIC', 'HHLARGE', 'WORKWOM', 'HVAL150', 'SSTRDIST', 'SSTRVOL', 'CPDIST5', 'CPWVOL5']
categorical_cols = ['brand']

temp_1 = scaler.fit_transform(oj_data[non_categorical_cols].values)
temp_2 = scaler.fit_transform(pd.get_dummies(oj_data[categorical_cols]).values)
# Stacks categorical and non categorical variables together
temp_3 = np.hstack([temp_1, temp_2]) 
temp_4 = np.log(oj_data["price"]).values
temp_5 = oj_data['logmove'].values
temp_6 = scaler.fit_transform(oj_data[['INCOME']].values)

X = # FILL IN
W = # FILL IN
Y = # FILL IN
T = # FILL IN

XW = np.hstack([X, W])

Y_train, Y_val, T_train, T_val, X_train, X_val, W_train, W_val, XW_train, XW_val = train_test_split(Y, T, X, W, XW, test_size=.2)

Exercise 1.3

Create an instance of a LinearDML and fit it to the training data using default input parameters.

Hints:

There’s an example on this page

# Your code

Exercise 1.4

Look at the documentation for LinearDML, which can be found here.

How are the models for Y and T created? Does this explain why the data was scaled?

# Your answer

Exercise 1.5

Get an estimate of the treatment effect heterogeneity using the summary method. Is the sign as expected?

# Your code

Exercise 1.6

Estimate and plot the conditional average treatment effect and the the 95% confidence interval with the LinearDML model on the following X_test data, which generates counterfactual income levels ranging from -1 to 1.

Hints:

There documentation for LinearDML can be found on this page.

Try looking for methods that start with effect

## Generate test data
min_income = -1
max_income = 1
delta = (1 - (-1)) / 100
X_test = np.arange(min_income, max_income + delta - 0.001, delta).reshape(-1,1)

# Calculate treatment effect and interval

te_pred_linear = # FILL IN 
te_pred_interval_linear = # FILL IN 

# Plot Orange Juice elasticity as a function of income

plt.figure(figsize=(10,6))
plt.plot(X_test, te_pred_linear, label="Linear model")
plt.xlabel(r'Scale(Income)')
plt.ylabel('Orange Juice Elasticity')
plt.legend()
plt.title("Orange Juice Elasticity vs Income")
plt.show()

Exercise 1.7

Create an instance of a CausalForestDML and fit it to the training data using default input parameters.

Hints:

It follows exactly the same recipe as LinearDML.

# Your code

Exercise 1.8

Estimate and plot the conditional average treatment effect and the the 95% confidence interval with the CausalForestDML model X_test.

Hints:

It follows exactly the same recipe as LinearDML.

# Your code

As discussed during the lecture, we can use the R-loss to evaluate the fit of the conditional average treatment effect and use this to perform causal model selection.

This could be done manually, but much like grf in R, econml also offers automatic tuning with the tune method, which we will utilize to tune the causal forest in CausalForestDML.

Exercise 1.9

Create an instance of a CausalForestDML, tune it on the training data and then fit it to the training data, all using default input parameters.

Hints:

The call to tune the model looks exactly like the call to fit the model.

# Your code

Exercise 1.8

Estimate and plot the conditional average treatment effect and the the 95% confidence interval with the tuned CausalForestDML model on X_test.

Hints:

It follows exactly the same recipe as LinearDML and an untuned CausalForestDML.

# Your code

Exercise 1.9

Plot the conditional average treatment effect and the the 95% confidence interval for all three models on X_test. Which do your prefer?

Hints:

You can call plot and fill_between repeatedly before calling xlabel

# Your code

Exercise 1.10

econml implements a scoring function using the R-loss, called the Rscorer. Fit the Rscorer to the appropriate data sample.

NOTE: The Rscorer needs a model to create residuals. Here we input a LassoCV, which is also the default in the double machine learning models. As such we obtain similar residuals.

Hints:

Should we use training data or held out data for causal model selection?

# Import model
from sklearn.linear_model import LassoCV
from econml.score import RScorer

# Create scorer
scorer = RScorer(model_y=LassoCV(), model_t=LassoCV())

# FILL IN

Exercise 1.11

Score the models using the Rscorer’s best_model method. Which model is the preferred one? Is it preferred over a constant average treatment effect?

Hints:

If you’re in doubt as to which model the method has selected, you can return all the scores by setting return_scores = True and compare the best score to the list

The best_model method accepts a list of fitted estimators, and the documentation can be seen here

# Your code

Predicting treatment and outcome

Having now looked closer at how to code up estimation of heterogeneous treatment effects using double machine learning estimators, we will start with the task that is predicting both Y and T, from which we can learn the optimal hyperparameters to pass on to our double machine learning estimators.

In practice, this is probably where you will be spending most of the time, optimizing features and models to accurately predict treatment and outcome, thus achieving better converge rates.

We have covered this in both session 3 and 4, where we covered model selection and supervised learning, respectively.

Exercise 2.1

What covariates should we use to predict Y and T? Is this part of the train test split we made in exercise 1.1?

Hints:

Look at the struqtural equations in the lecture slides. What enters in the nuisance functions?

# Your answer

To make this go by slightly faster, I have pre-selected three models and their respective hyperparametergrids for which to search over. Furthermore, we utilize 2 fold cross validation (matching the default amount of folds in econml) and 10 random hyperparameter combinations using RandomizedSearchCV, which we covered in session 3.

Each model should be using negative_mean_squared_error, and you should make note of the best performing hyper mean squared error, such that you can compare performance across estimators.

Exercise 2.2

Create a Lasso to predict the outcome, Y.

Save the best hyperparameter combination.

Hints:

The best score and best hyperparameter can be found using the methods best_score_ and best_param_ respectively.

from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn # FILL IN 

pipe_lasso = Pipeline([ 
    # FILL IN 
    ]
    )

param_grid = {'lasso__alpha':np.logspace(-5, 3, 10)}


rs = RandomizedSearchCV(
    # FILL IN
)

# Fit
# FILL IN

# Score
# FILL IN

Exercise 2.3

Create a RandomForestRegressor to predict the outcome, Y.

Save the best hyperparameter combination.

Hints:

The best score and best hyperparameter can be found using the methods best_score_ and best_param_ respectively.

from sklearn # FILL IN

pipe_forest = Pipeline([                     
    # FILL IN
    ]
    )


param_grid= [{
    'forest__n_estimators': np.unique(np.logspace(0, 3, 25).astype(int)),
    'forest__max_features': np.arange(0.1, 1.01, 0.1),
    'forest__min_samples_split': np.arange(0.01, 0.2, 0.01),
    'forest__min_samples_leaf':  np.arange(2, 50, 2),
    'forest__max_depth': np.unique(np.logspace(1, 4, 20).astype(int))
    }]

rs = RandomizedSearchCV(
    # FILL IN
)

# Fit
# FILL IN

# Score
# FILL IN

Exercise 2.4

Create a HistGradientBoostingRegressor to predict the outcome, Y.

Save the best hyperparameter combination.

NOTE: The HistGradientBoostingRegressor is an efficient model which does gradient boosting.

Hints:

The best score and best hyperparameter can be found using the methods best_score_ and best_param_ respectively.

# Your code

from sklearn # FILL IN

pipe_booster = Pipeline([                     
   # FILL IN
    ]
    )


param_grid= [{
    'booster__min_samples_leaf':  np.arange(2, 50, 2),
    'booster__max_depth': np.unique(np.logspace(1, 4, 20).astype(int)),
    'booster__learning_rate':np.arange(0,1.001,0.1)
    }]

rs = RandomizedSearchCV(
    # FILL IN
)

# Fit
# FILL IN

# Score
# FILL IN

Exercise 2.5

Which model best predicts Y? Create a model of that type with the best hyperparameter combination

# Your code

Exercise 2.6

You have now found the model which best predicts Y, and now we repeat the same process for T

To find the best model to predict T, repeat exercise 2.2 through 2.5 but predicting T

# Your code

Exercise 2.7

Estimate a LinearDML as well as an untuned and tuned CausalForestDML using the new models for T and Y.

Plot the conditional average treatment effect and the the 95% confidence interval for all three models on X_test. Which do your prefer?

# Your code

Exercise 2.8

Score the three new models using the Rscorer, now predicting the residuals using the models for T and `Y. Which model is the preferred one? Is it preferred over a constant average treatment effect?

# Your code

Finishing thoughts

Today we have examined double machine learning and how to implement these models.

In practice, you should spend more time creating the models for T and Y, as they are quintessential in double machine learning. You should also increase the amount of folds in cross validation, as this increases the amount of data to train on when creating residuals. Furthermore, both the models and Rscorer support monte carlo iterations, mc_iters, which allows you to repeat the same process again with different splits, thus creating less noisy estimates of the residuals. We have not done this today as the running time of these models quite quickly becomes too long for an exercise set.

If you wish to examine linear treatment heterogeneity, you should also look into the input parameters featurizer and treatment_featurizer, which enables quick and easy interactions of covariates and treatments, possibly in combination with a SparseLinearDML. See e.g. this notebook

All of the econml functionality that we went through last week can also still be used with our models, e.g. explainability, CATE interpreters and policy learning.

Additionally, you should be aware that econml has a folder with notebook examples on their GitHub, where you can see the many different possibilities.