{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# Exercise set 8: Heterogeneous treatment effectts\n", "\n", "In this exercise set we will be working with estimation of conditional average treatment effects assuming selection on observables. \n", "\n", "First we will use the `econml` package in Python to estimate a double machine learning causal forest, and in the second part we will use the `grf` package in R to estimate a causal forest.\n", "\n", "In this exercise we will be using data from LaLonde, R. J. (1986). Evaluating the econometric evaluations of training programs with experimental data. The American economic review, 604-620, regarding a job training field experiment, where we will examine possible treatment effect heterogeneity treatment effects, downloaded from [NYU](https://users.nber.org/~rdehejia/nswdata.html) but supplied to you in `csv` format in a sligthly cleaned format. The object of interest is real earnings in 1978 and we assume selection on observables and overlap.\n", "\n", "## Python\n", "\n", "In this first part of the exercise, we will be utilizing Python and `econml`.\n", "\n", "First we load some packages and the data." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
treatageeducationblackhispanicmarriednodegreere75re78
count705.000000705.000000705.000000705.000000705.000000705.000000705.000000705.000000705.000000
mean0.41560324.60709210.2723400.7957450.1078010.1645390.7744683116.2713865586.166074
std0.4931766.6663761.7204120.4034430.3103500.3710270.4182295104.5664786269.582709
min0.00000017.0000003.0000000.0000000.0000000.0000000.0000000.0000000.000000
25%0.00000019.0000009.0000001.0000000.0000000.0000001.0000000.0000000.000000
50%0.00000023.00000010.0000001.0000000.0000000.0000001.0000001122.6210004159.919000
75%1.00000027.00000011.0000001.0000000.0000000.0000001.0000004118.6810008881.665000
max1.00000055.00000016.0000001.0000001.0000001.0000001.00000037431.66000060307.930000
\n", "
" ], "text/plain": [ " treat age education black hispanic married \\\n", "count 705.000000 705.000000 705.000000 705.000000 705.000000 705.000000 \n", "mean 0.415603 24.607092 10.272340 0.795745 0.107801 0.164539 \n", "std 0.493176 6.666376 1.720412 0.403443 0.310350 0.371027 \n", "min 0.000000 17.000000 3.000000 0.000000 0.000000 0.000000 \n", "25% 0.000000 19.000000 9.000000 1.000000 0.000000 0.000000 \n", "50% 0.000000 23.000000 10.000000 1.000000 0.000000 0.000000 \n", "75% 1.000000 27.000000 11.000000 1.000000 0.000000 0.000000 \n", "max 1.000000 55.000000 16.000000 1.000000 1.000000 1.000000 \n", "\n", " nodegree re75 re78 \n", "count 705.000000 705.000000 705.000000 \n", "mean 0.774468 3116.271386 5586.166074 \n", "std 0.418229 5104.566478 6269.582709 \n", "min 0.000000 0.000000 0.000000 \n", "25% 1.000000 0.000000 0.000000 \n", "50% 1.000000 1122.621000 4159.919000 \n", "75% 1.000000 4118.681000 8881.665000 \n", "max 1.000000 37431.660000 60307.930000 " ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import matplotlib.pyplot as plt\n", "import numpy as np \n", "import pandas as pd \n", "import seaborn as sns\n", "\n", "np.random.seed(73)\n", "\n", "plt.style.use('seaborn-whitegrid')\n", "\n", "%matplotlib inline\n", "\n", "df = pd.read_csv('nsw.csv', index_col=0)\n", "\n", "df.describe()" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "> **Exercise 1.1**\n", ">\n", "> Subset the treatment, outcome and covariates from the `DataFrame` as `T`, `y` and `X`, respectively\n", "> \n", ">>*Hints:*\n", ">> \n", ">> A `DataFrame` supports method `.drop()`, if one wishes to drop multiple columns at once. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Your answer here" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "> **Exercise 1.2**\n", ">\n", "> Estimate a `CausalForest` using the `econml` package. Make sure that you use the splitting criterion as in the generalized random forest, but otherwise use default parameters.\n", "> \n", ">>*Hints:*\n", ">> \n", ">> The documentation for the `CausalForest` can be found [here](https://econml.azurewebsites.net/_autosummary/econml.grf.CausalForest.html)\n", ">>\n", ">> The splitting criterion is handled by the `criterion` parameter" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Your answer here" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "> **Exercise 1.3**\n", ">\n", "> Predict out of bag estimates of the CATE, and create a histogram of these. Do you observe heterogeneity?\n", "> \n", ">>*Hints:*\n", ">> \n", ">> The documentation for the `CausalForest` can be found [here](https://econml.azurewebsites.net/_autosummary/econml.grf.CausalForest.html)\n", ">>\n", ">> Out of bag is sometimes shortened oob\n", ">>\n", ">> You can create a histogram using `sns.histplot`" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Your answer here" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "> **Exercise 1.4**\n", ">\n", "> Estimate a `CausalForestDML` using the `econml` package. Make sure that you 1) use the splitting criterion as in the generalized random forest, 2) interpret the treatment as a discrete treatment and 3) estimate a thousand trees, but otherwise use default parameters. \n", ">\n", ">>*Hints:*\n", ">> \n", ">> The documentation for the `CausalForestDML` can be found [here](https://econml.azurewebsites.net/_autosummary/econml.dml.CausalForestDML.html)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Your code here" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "> **Exercise 1.5**\n", ">\n", "> Report the doubly robust average treatment effect\n", ">\n", ">>*Hints:*\n", ">> \n", ">> The documentation for the `CausalForestDML` can be found [here](https://econml.azurewebsites.net/_autosummary/econml.dml.CausalForestDML.html)\n", ">>\n", ">> A function which summarizes the model findings is available" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Your code here" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "> **Exercise 1.6**\n", ">\n", "> Examine what variables drive the heterogeneity using the split based feature importance method.\n", ">\n", ">>*Hints:*\n", ">> \n", ">> The documentation for the `CausalForestDML` can be found [here](https://econml.azurewebsites.net/_autosummary/econml.dml.CausalForestDML.html)\n", ">>\n", ">> The feature importances and feature names are available through a method" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Your code here" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "> **Exercise 1.7**\n", ">\n", "> Calculate the SHAP values for the model.\n", ">\n", ">>*Hints:*\n", ">> \n", ">> There's an example on the following [page](https://econml.azurewebsites.net/spec/interpretability.html)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Your code here" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "> **Exercise 1.8**\n", ">\n", "> Create a bar plot of the feature importance using SHAP values. Is the top variable the same as in the split based one?\n", ">\n", ">>*Hints:*\n", ">> \n", ">> There's an example on the following [page](https://econml.azurewebsites.net/spec/interpretability.html)\n", ">>\n", ">> The summary plot has a `plot_type` parameter" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Your code here" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "> **Exercise 1.9**\n", ">\n", "> Create a summary plot of the feature importance using SHAP values. In what way does the top variable moderate the CATE?\n", ">\n", ">>*Hints:*\n", ">> \n", ">> There's an example on the following [page](https://econml.azurewebsites.net/spec/interpretability.html)\n", ">>\n", ">> The feature value colourbar might disappear, in which the following to lines of code might help:\n", ">>\n", ">> `plt.gcf().axes[-1].set_aspect(100)`\n", ">>\n", ">> `plt.gcf().axes[-1].set_box_aspect(100)`" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Your code here" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "> **Exercise 1.10**\n", ">\n", "> Create a shallow decision tree of depth 2 to explain the CATE as a function of the inputs using `SingleTreeCateInterpreter`. Does the tree split on the variables you expected it to?\n", ">\n", ">>*Hints:*\n", ">> \n", ">> There's an example on the following [page](https://econml.azurewebsites.net/spec/interpretability.html)\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Your code here" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "> **Exercise 1.11**\n", ">\n", "> Create a shallow policy tree of depth 2 to explain the CATE as a function of the inputs using `SingleTreePolicyInterpreter`. Who does the model target?\n", ">\n", ">>*Hints:*\n", ">> \n", ">> There's an example on the following [page](https://econml.azurewebsites.net/spec/interpretability.html)\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Your code here" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## R\n", "\n", "In the second part of the exercise, we will be utilizing Python and `grf`. I will supply R code in code that has been commented out (# for code, ## for comments), and they will not be able to run in Python if you comment them in again. One can run R code in Python using `rpy2`, but I generally recommend using R and not `rpy2` due to the complexity of problem solving if `rpy2` fails.\n", "\n", "The many different functions available in `grf` can be seen in their [reference](https://grf-labs.github.io/grf/reference/index.html) and they have a lot of great tutorials, accessible at the top of their page.\n", "\n", "First we load the data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#library(ggplot2)\n", "#library(grf)\n", "\n", "## Load data\n", "# df = read.csv(\"nsw.csv\")\n", "\n", "## Subset target, treatment and covariates\n", "#Y = df$re78\n", "#W = df$treat\n", "#X = subset(df, select= c(-treat, -re78, -X))" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "> **Exercise 2.1**\n", ">\n", "> Estimate a `causal_forest` using `grf`.\n", ">\n", ">>*Hints:*\n", ">> \n", ">> There's documentation for `causal_forest` can be found [here](https://grf-labs.github.io/grf/reference/causal_forest.html)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "> **Exercise 2.2**\n", ">\n", "> Assess the overlap assumption by creating a histogram of the estimated treatment propensities\n", "> \n", ">>*Hints:*\n", ">> \n", ">> An example can be seen [here](https://grf-labs.github.io/grf/articles/diagnostics.html)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "![](overlap_test.png)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "> **Exercise 2.3**\n", ">\n", "> Estimate the doubly robust average treatment effect. Do we find the same as previously?\n", "> \n", ">>*Hints:*\n", ">> \n", ">> Can you find a function for average treatment effect estimation in the reference?" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "> **Exercise 2.4**\n", ">\n", "> Estimate out of bag CATE's\n", ">\n", ">>*Hints:*\n", ">> \n", ">> How to predict with a forest can be seen [here](https://grf-labs.github.io/grf/reference/predict.causal_forest.html)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "> **Exercise 2.5**\n", ">\n", "> Test whether heterogeneity exists in the CATE's using the median split based test.\n", ">\n", ">>*Hints:*\n", ">> \n", ">> An example can be seen [here](https://grf-labs.github.io/grf/articles/diagnostics.html)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "> **Exercise 2.6**\n", ">\n", "> Test whether heterogeneity exists in the CATE's using the RATE. NOTE: Here we should do out of sample predictions, which I ignore in this exercise!\n", ">\n", ">>*Hints:*\n", ">> \n", ">> How to estimate the RATE can be seen [here](https://grf-labs.github.io/grf/reference/rank_average_treatment_effect.html)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "> **Exercise 2.8**\n", ">\n", "> Examine what variables drive the heterogeneity using the split based feature importance method.\n", ">\n", ">>*Hints:*\n", ">> \n", ">> A function to calculate the variable importance can be found in the [grf reference](https://grf-labs.github.io/grf/reference/index.html)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "> **Exercise 2.9**\n", ">\n", "> Examine how the variables affect heterogeneity using the best linear projection.\n", ">\n", ">>*Hints:*\n", ">> \n", ">> A function to estimate the best linear projection can be found in the [grf reference](https://grf-labs.github.io/grf/reference/index.html)" ] } ], "metadata": { "kernelspec": { "display_name": "vive_env", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.4" }, "vscode": { "interpreter": { "hash": "50f1b660881291c5d31ad4d40c48915b3ef9364cc881d538585b11a7eb841304" } } }, "nbformat": 4, "nbformat_minor": 4 }