{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Exercise set 8: Heterogeneous treatment effectts\n",
    "\n",
    "In this exercise set we will be working with estimation of conditional average treatment effects assuming selection on observables. \n",
    "\n",
    "First we will use the `econml` package in Python to estimate a double machine learning causal forest, and in the second part we will use the `grf` package in R to estimate a causal forest.\n",
    "\n",
    "In this exercise we will be using data from LaLonde, R. J. (1986). Evaluating the econometric evaluations of training programs with experimental data. The American economic review, 604-620, regarding a job training field experiment, where we will examine possible treatment effect heterogeneity treatment effects, downloaded from [NYU](https://users.nber.org/~rdehejia/nswdata.html) but supplied to you in `csv` format in a sligthly cleaned format. The object of interest is real earnings in 1978 and we assume selection on observables and overlap.\n",
    "\n",
    "## Python\n",
    "\n",
    "In this first part of the exercise, we will be utilizing Python and `econml`.\n",
    "\n",
    "First we load some packages and the data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>treat</th>\n",
       "      <th>age</th>\n",
       "      <th>education</th>\n",
       "      <th>black</th>\n",
       "      <th>hispanic</th>\n",
       "      <th>married</th>\n",
       "      <th>nodegree</th>\n",
       "      <th>re75</th>\n",
       "      <th>re78</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>count</th>\n",
       "      <td>705.000000</td>\n",
       "      <td>705.000000</td>\n",
       "      <td>705.000000</td>\n",
       "      <td>705.000000</td>\n",
       "      <td>705.000000</td>\n",
       "      <td>705.000000</td>\n",
       "      <td>705.000000</td>\n",
       "      <td>705.000000</td>\n",
       "      <td>705.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>mean</th>\n",
       "      <td>0.415603</td>\n",
       "      <td>24.607092</td>\n",
       "      <td>10.272340</td>\n",
       "      <td>0.795745</td>\n",
       "      <td>0.107801</td>\n",
       "      <td>0.164539</td>\n",
       "      <td>0.774468</td>\n",
       "      <td>3116.271386</td>\n",
       "      <td>5586.166074</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>std</th>\n",
       "      <td>0.493176</td>\n",
       "      <td>6.666376</td>\n",
       "      <td>1.720412</td>\n",
       "      <td>0.403443</td>\n",
       "      <td>0.310350</td>\n",
       "      <td>0.371027</td>\n",
       "      <td>0.418229</td>\n",
       "      <td>5104.566478</td>\n",
       "      <td>6269.582709</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>min</th>\n",
       "      <td>0.000000</td>\n",
       "      <td>17.000000</td>\n",
       "      <td>3.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25%</th>\n",
       "      <td>0.000000</td>\n",
       "      <td>19.000000</td>\n",
       "      <td>9.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>50%</th>\n",
       "      <td>0.000000</td>\n",
       "      <td>23.000000</td>\n",
       "      <td>10.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>1122.621000</td>\n",
       "      <td>4159.919000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>75%</th>\n",
       "      <td>1.000000</td>\n",
       "      <td>27.000000</td>\n",
       "      <td>11.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>4118.681000</td>\n",
       "      <td>8881.665000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>max</th>\n",
       "      <td>1.000000</td>\n",
       "      <td>55.000000</td>\n",
       "      <td>16.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>37431.660000</td>\n",
       "      <td>60307.930000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "            treat         age   education       black    hispanic     married  \\\n",
       "count  705.000000  705.000000  705.000000  705.000000  705.000000  705.000000   \n",
       "mean     0.415603   24.607092   10.272340    0.795745    0.107801    0.164539   \n",
       "std      0.493176    6.666376    1.720412    0.403443    0.310350    0.371027   \n",
       "min      0.000000   17.000000    3.000000    0.000000    0.000000    0.000000   \n",
       "25%      0.000000   19.000000    9.000000    1.000000    0.000000    0.000000   \n",
       "50%      0.000000   23.000000   10.000000    1.000000    0.000000    0.000000   \n",
       "75%      1.000000   27.000000   11.000000    1.000000    0.000000    0.000000   \n",
       "max      1.000000   55.000000   16.000000    1.000000    1.000000    1.000000   \n",
       "\n",
       "         nodegree          re75          re78  \n",
       "count  705.000000    705.000000    705.000000  \n",
       "mean     0.774468   3116.271386   5586.166074  \n",
       "std      0.418229   5104.566478   6269.582709  \n",
       "min      0.000000      0.000000      0.000000  \n",
       "25%      1.000000      0.000000      0.000000  \n",
       "50%      1.000000   1122.621000   4159.919000  \n",
       "75%      1.000000   4118.681000   8881.665000  \n",
       "max      1.000000  37431.660000  60307.930000  "
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "import numpy as np \n",
    "import pandas as pd \n",
    "import seaborn as sns\n",
    "\n",
    "np.random.seed(73)\n",
    "\n",
    "plt.style.use('seaborn-whitegrid')\n",
    "\n",
    "%matplotlib inline\n",
    "\n",
    "df = pd.read_csv('nsw.csv', index_col=0)\n",
    "\n",
    "df.describe()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> **Exercise 1.1**\n",
    ">\n",
    "> Subset the treatment, outcome and covariates from the `DataFrame` as `T`, `y` and `X`, respectively\n",
    "> \n",
    ">>*Hints:*\n",
    ">> \n",
    ">> A `DataFrame` supports method `.drop()`, if one wishes to drop multiple columns at once. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Your answer here"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> **Exercise 1.2**\n",
    ">\n",
    "> Estimate a `CausalForest` using the `econml` package. Make sure that you use the splitting criterion as in the generalized random forest, but otherwise use default parameters.\n",
    "> \n",
    ">>*Hints:*\n",
    ">> \n",
    ">> The documentation for the `CausalForest` can be found [here](https://econml.azurewebsites.net/_autosummary/econml.grf.CausalForest.html)\n",
    ">>\n",
    ">> The splitting criterion is handled by the `criterion` parameter"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Your answer here"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> **Exercise 1.3**\n",
    ">\n",
    "> Predict out of bag estimates of the CATE, and create a histogram of these. Do you observe heterogeneity?\n",
    "> \n",
    ">>*Hints:*\n",
    ">> \n",
    ">> The documentation for the `CausalForest` can be found [here](https://econml.azurewebsites.net/_autosummary/econml.grf.CausalForest.html)\n",
    ">>\n",
    ">> Out of bag is sometimes shortened oob\n",
    ">>\n",
    ">> You can create a histogram using `sns.histplot`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Your answer here"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> **Exercise 1.4**\n",
    ">\n",
    "> Estimate a `CausalForestDML` using the `econml` package. Make sure that you 1) use the splitting criterion as in the generalized random forest, 2) interpret the treatment as a discrete treatment and 3) estimate a thousand trees, but otherwise use default parameters. \n",
    ">\n",
    ">>*Hints:*\n",
    ">> \n",
    ">> The documentation for the `CausalForestDML` can be found [here](https://econml.azurewebsites.net/_autosummary/econml.dml.CausalForestDML.html)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Your code here"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> **Exercise 1.5**\n",
    ">\n",
    "> Report the doubly robust average treatment effect\n",
    ">\n",
    ">>*Hints:*\n",
    ">> \n",
    ">> The documentation for the `CausalForestDML` can be found [here](https://econml.azurewebsites.net/_autosummary/econml.dml.CausalForestDML.html)\n",
    ">>\n",
    ">> A function which summarizes the model findings is available"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Your code here"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> **Exercise 1.6**\n",
    ">\n",
    "> Examine what variables drive the heterogeneity using the split based feature importance method.\n",
    ">\n",
    ">>*Hints:*\n",
    ">> \n",
    ">> The documentation for the `CausalForestDML` can be found [here](https://econml.azurewebsites.net/_autosummary/econml.dml.CausalForestDML.html)\n",
    ">>\n",
    ">> The feature importances and feature names are available through a method"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Your code here"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> **Exercise 1.7**\n",
    ">\n",
    "> Calculate the SHAP values for the model.\n",
    ">\n",
    ">>*Hints:*\n",
    ">> \n",
    ">> There's an example on the following [page](https://econml.azurewebsites.net/spec/interpretability.html)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Your code here"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> **Exercise 1.8**\n",
    ">\n",
    "> Create a bar plot of the feature importance using SHAP values. Is the top variable the same as in the split based one?\n",
    ">\n",
    ">>*Hints:*\n",
    ">> \n",
    ">> There's an example on the following [page](https://econml.azurewebsites.net/spec/interpretability.html)\n",
    ">>\n",
    ">> The summary plot has a `plot_type` parameter"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Your code here"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> **Exercise 1.9**\n",
    ">\n",
    "> Create a summary plot of the feature importance using SHAP values. In what way does the top variable moderate the CATE?\n",
    ">\n",
    ">>*Hints:*\n",
    ">> \n",
    ">> There's an example on the following [page](https://econml.azurewebsites.net/spec/interpretability.html)\n",
    ">>\n",
    ">> The feature value colourbar might disappear, in which the following to lines of code might help:\n",
    ">>\n",
    ">> `plt.gcf().axes[-1].set_aspect(100)`\n",
    ">>\n",
    ">> `plt.gcf().axes[-1].set_box_aspect(100)`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Your code here"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> **Exercise 1.10**\n",
    ">\n",
    "> Create a shallow decision tree of depth 2 to explain the CATE as a function of the inputs using `SingleTreeCateInterpreter`. Does the tree split on the variables you expected it to?\n",
    ">\n",
    ">>*Hints:*\n",
    ">> \n",
    ">> There's an example on the following [page](https://econml.azurewebsites.net/spec/interpretability.html)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Your code here"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> **Exercise 1.11**\n",
    ">\n",
    "> Create a shallow policy tree of depth 2 to explain the CATE as a function of the inputs using `SingleTreePolicyInterpreter`. Who does the model target?\n",
    ">\n",
    ">>*Hints:*\n",
    ">> \n",
    ">> There's an example on the following [page](https://econml.azurewebsites.net/spec/interpretability.html)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Your code here"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## R\n",
    "\n",
    "In the second part of the exercise, we will be utilizing Python and `grf`. I will supply R code in code that has been commented out (# for code, ## for comments), and they will not be able to run in Python if you comment them in again. One can run R code in Python using `rpy2`, but I generally recommend using R and not `rpy2` due to the complexity of problem solving if `rpy2` fails.\n",
    "\n",
    "The many different functions available in `grf` can be seen in their [reference](https://grf-labs.github.io/grf/reference/index.html) and they have a lot of great tutorials, accessible at the top of their page.\n",
    "\n",
    "First we load the data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#library(ggplot2)\n",
    "#library(grf)\n",
    "\n",
    "## Load data\n",
    "# df = read.csv(\"nsw.csv\")\n",
    "\n",
    "## Subset target, treatment and covariates\n",
    "#Y = df$re78\n",
    "#W = df$treat\n",
    "#X = subset(df, select= c(-treat, -re78, -X))"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> **Exercise 2.1**\n",
    ">\n",
    "> Estimate a `causal_forest` using `grf`.\n",
    ">\n",
    ">>*Hints:*\n",
    ">> \n",
    ">> There's documentation for `causal_forest` can be found [here](https://grf-labs.github.io/grf/reference/causal_forest.html)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> **Exercise 2.2**\n",
    ">\n",
    "> Assess the overlap assumption by creating a histogram of the estimated treatment propensities\n",
    "> \n",
    ">>*Hints:*\n",
    ">> \n",
    ">> An example can be seen [here](https://grf-labs.github.io/grf/articles/diagnostics.html)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![](overlap_test.png)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> **Exercise 2.3**\n",
    ">\n",
    "> Estimate the doubly robust average treatment effect. Do we find the same as previously?\n",
    "> \n",
    ">>*Hints:*\n",
    ">> \n",
    ">> Can you find a function for average treatment effect estimation in the reference?"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> **Exercise 2.4**\n",
    ">\n",
    "> Estimate out of bag CATE's\n",
    ">\n",
    ">>*Hints:*\n",
    ">> \n",
    ">> How to predict with a forest can be seen [here](https://grf-labs.github.io/grf/reference/predict.causal_forest.html)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> **Exercise 2.5**\n",
    ">\n",
    "> Test whether heterogeneity exists in the CATE's using the median split based test.\n",
    ">\n",
    ">>*Hints:*\n",
    ">> \n",
    ">> An example can be seen [here](https://grf-labs.github.io/grf/articles/diagnostics.html)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> **Exercise 2.6**\n",
    ">\n",
    "> Test whether heterogeneity exists in the CATE's using the RATE. NOTE: Here we should do out of sample predictions, which I ignore in this exercise!\n",
    ">\n",
    ">>*Hints:*\n",
    ">> \n",
    ">> How to estimate the RATE can be seen [here](https://grf-labs.github.io/grf/reference/rank_average_treatment_effect.html)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> **Exercise 2.8**\n",
    ">\n",
    "> Examine what variables drive the heterogeneity using the split based feature importance method.\n",
    ">\n",
    ">>*Hints:*\n",
    ">> \n",
    ">> A function to calculate the variable importance can be found in the [grf reference](https://grf-labs.github.io/grf/reference/index.html)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> **Exercise 2.9**\n",
    ">\n",
    "> Examine how the variables affect heterogeneity using the best linear projection.\n",
    ">\n",
    ">>*Hints:*\n",
    ">> \n",
    ">> A function to estimate the best linear projection can be found in the [grf reference](https://grf-labs.github.io/grf/reference/index.html)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "vive_env",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.4"
  },
  "vscode": {
   "interpreter": {
    "hash": "50f1b660881291c5d31ad4d40c48915b3ef9364cc881d538585b11a7eb841304"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}