Magnus Nielsen, SODAS, UCPH
Estimation of conditional average treatment effects
How to use conditional average treatment effects
Try to get an intuitive understanding of what the methods do
Main contribution of each paper in the ‘generalized random forest’ series
Denote \(T_i\) as the treatment variable
Define the potential outcomes
\[Y_i=\begin{cases} Y_i(1), & T_i=1\\ Y_i(0), & T_i=0 \end{cases}\]
The observed outcome \(Y_i\) can be written in terms of potential outcomes: \[ Y_i = Y_{i}(0) + [Y_{i}(1)-Y_{i}(0)]\cdot T_i\]
\(Y_{i}(1)-Y_{i}(0)\) is the causal effect of \(T_i\) on \(Y_i\)
We never observe the same individual \(i\) in both states
We need some way of estimating the state we do not observe (the counterfactual)
Perhaps we can do a naive comparison by treatment status?
\[\tau = E[Y_i|T_i = 1] - E[Y_i|T_i = 0]\]
Utilizing that
\[Y_i = Y_{i}(0) + [Y_{i}(1)-Y_{i}(0)] \cdot T_i\]
We get the following
\[ \begin{align} \nonumber E[Y_i|T_i = 1] - E[Y_i|T_i = 0] = &E[Y_i(1)|T_i = 1] - E[Y_i(0)|T_i = 1] + \\ \nonumber &E[Y_i(0)|T_i = 1] - E[Y_i(0)|T_i = 0] \end{align} \]
The average causal effect of \(T_i\) on \(Y_i\) for the treated (the ATT)
\[E[Y_i(1)|T_i = 1] - E[Y_i(0)|T_i = 1] = E[Y_i(1) - Y_i(0)|T_i = 1]\]
Difference in average \(Y_i(0)\) between the two groups
\[E[Y_i(0)|T_i = 1] - E[Y_i(0)|T_i = 0]\]
Often referred to as selection bias
Random assignment implies \(T_i\) is independent of potential outcomes
\[E[Y_{i}(0)|T_i = 1] = E[Y_{i}(0)|T_i = 0]\]
Intuition: non-treated individuals can be used as counterfactuals for treated
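As a toy illustration of the decomposition above, a small deterministic example (all numbers are made up) shows the naive comparison splitting exactly into the ATT plus selection bias:

```python
# Tiny deterministic example: the naive comparison equals the ATT plus
# selection bias. Potential outcomes (Y0, Y1) and treatment T are invented
# for illustration; treated units have higher Y(0), creating selection bias.
units = [
    # (Y0, Y1, T)
    (5, 7, 1),
    (5, 8, 1),
    (1, 3, 0),
    (1, 2, 0),
]

treated = [u for u in units if u[2] == 1]
control = [u for u in units if u[2] == 0]

# E[Y|T=1] - E[Y|T=0]: only observed outcomes enter here
naive = sum(y1 for _, y1, _ in treated) / len(treated) \
      - sum(y0 for y0, _, _ in control) / len(control)
# E[Y(1) - Y(0)|T=1]: the ATT (uses the unobservable counterfactual)
att = sum(y1 - y0 for y0, y1, _ in treated) / len(treated)
# E[Y(0)|T=1] - E[Y(0)|T=0]: selection bias
selection_bias = sum(y0 for y0, _, _ in treated) / len(treated) \
               - sum(y0 for y0, _, _ in control) / len(control)

print(naive, att, selection_bias)  # naive = att + selection_bias
```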
If randomization by us is not feasible, we must rely on nature:
Today we will consider
Construct counterfactual potential treated and control units
Why: Matching controls for the covariates used
For a given characteristic \(x\), find \(k\) nearest treated (\(S_1\)) and untreated (\(S_0\)) observations
We can then estimate the conditional average treatment effect (CATE) using the following estimator \[\tau(x) = \frac{1}{k}\sum_{i\in S_1(x)} Y_i - \frac{1}{k}\sum_{i\in S_0(x)} Y_i\]
‘Nearest’ could be defined by distance in covariates or in propensity scores
When performing matching, it isn’t necessary to aggregate up to an average treatment effect
We can instead just stop when we have estimated the CATE
\[\tau(x) = E[Y_i(1)-Y_i(0)|X=x]\]
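The matching estimator above can be sketched in a few lines of Python; the one-dimensional covariate, distance metric, and toy data are illustrative choices, not part of any package:

```python
# Sketch of the k-nearest-neighbour matching CATE estimator:
# tau(x) = mean Y over the k nearest treated minus mean Y over the
# k nearest untreated observations.
def cate_knn(x, data, k=2):
    """data: list of (x_i, y_i, t_i); distance is |x - x_i| (1-D covariate)."""
    def k_nearest_mean_y(group):
        nearest = sorted(group, key=lambda d: abs(d[0] - x))[:k]
        return sum(y for _, y, _ in nearest) / k

    treated = [d for d in data if d[2] == 1]   # S_1
    control = [d for d in data if d[2] == 0]   # S_0
    return k_nearest_mean_y(treated) - k_nearest_mean_y(control)

# Made-up data: (x, y, t)
data = [(0.1, 1.0, 0), (0.2, 3.0, 1), (0.3, 1.2, 0), (0.4, 3.4, 1)]
print(cate_knn(0.25, data, k=2))
```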
Do any of the previously studied supervised models create ‘neighborhoods’? If yes, which?
Trees inherently create partitions
One big problem: We’re matching on outcomes
Spurious extreme values of \(Y_i\) are going to be matched with other spurious extreme values
What does this mean?
We utilize sample splitting
An observation can be used for either determining the splits or estimating the treatment effects, not both
This is called being an honest tree (versus adaptive), and is proposed by Athey & Imbens (2016), Recursive partitioning for heterogeneous causal effects
What is the main drawback of honest estimation?
Split to identify heterogeneous treatment effects
Modify criterion in anticipation of this
How can one increase performance of trees for a fixed sample?
Wager & Athey (2018), Estimation and inference of heterogeneous treatment effects using random forests, propose the causal forest, which is an ensemble of causal trees
Reduces variance and creates less sharp boundaries
For each tree \(b\), calculate the CATE of the observation as in the causal tree (eq. 5 in paper), denoted \(\hat \tau_b(x)\)
For ensemble of \(B\) trees, CATE estimator is then
\[\hat \tau(x) = B^{-1} \sum_{b=1}^{B} \hat \tau_{b}(x)\]
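A minimal sketch of this averaging step; the per-tree estimates are made-up numbers standing in for the output of eq. 5:

```python
# Forest CATE estimator: average the per-tree CATE estimates tau_b(x).
tree_cates = [1.8, 2.4, 2.1, 1.9, 2.3]        # hat{tau}_b(x) for B = 5 trees
cate_hat = sum(tree_cates) / len(tree_cates)  # hat{tau}(x) = B^-1 sum_b hat{tau}_b(x)
print(cate_hat)
```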
As long as trees are honest, we can perform asymptotic inference
Two ways of achieving honesty, double-sampling (as in causal tree) or propensity trees
Reconciled in generalized random forest
Coverage holds until \(d=10\); performance degrades after
Trees create neighborhoods with CATEs
Athey et al. (2019), Generalized random forests, reframe it as creating a weighting function usable in maximum likelihood estimation
Similarity-based weights have been used before, but suffered strongly from the curse of dimensionality
Generalized random forests use data-driven heterogeneity to lessen this
If really high dimensional, consider double machine learning (next session)
By rephrasing into moment conditions, multiple possibilities arise
Note that causal forests can refer to both causal forests in Wager & Athey (2018) and in Athey et al. (2019) (implemented in econml and grf)
Compared to the causal forest in Wager & Athey (2018), a couple of other things are changed:
Most critical assumptions are the “regular” assumptions:
Test these as you usually would (if possible)
There are some additional technical assumptions (sec. 3)
Two approaches
econml in Python: grf.CausalForest, grf.CausalForestIV and dml.CausalForestDML
grf in R: causal_forest, instrumental_forest, quantile_forest
Exercises will cover both R and Python
I do not expect you to learn R for this one thing, but wanted to supply some code
When performing causal inference, we need to retain honesty
Either split the data or use out-of-bag predictions
Under either scenario, causal inference is valid
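A stylized sketch of honest estimation via sample splitting, with a single split point standing in for a full causal tree; the toy data and the median-based split rule are illustrative, not the actual causal-tree criterion:

```python
# Honest estimation sketch: one half of the data picks the tree structure
# (here: a single split point), the other half estimates the CATE in each leaf.
def make_unit(i):
    x = i / 19                         # covariate in [0, 1]
    t = (i % 4) // 2                   # alternating treatment indicator
    tau = 1.0 if x < 0.5 else 3.0      # true CATE jumps at x = 0.5
    y = 2.0 + tau * t                  # baseline 2, no noise (deterministic demo)
    return (x, y, t)

data = [make_unit(i) for i in range(20)]
structure, estimate = data[::2], data[1::2]    # honest split: disjoint halves

# Structure half: choose the split threshold (upper median of x, a stand-in
# for the modified splitting criterion)
xs = sorted(x for x, _, _ in structure)
threshold = xs[len(xs) // 2]

# Estimation half: within each leaf, CATE = mean Y (treated) - mean Y (control)
def leaf_cate(leaf):
    yt = [y for _, y, t in leaf if t == 1]
    yc = [y for _, y, t in leaf if t == 0]
    return sum(yt) / len(yt) - sum(yc) / len(yc)

left = [d for d in estimate if d[0] < threshold]
right = [d for d in estimate if d[0] >= threshold]
print(leaf_cate(left), leaf_cate(right))
```

Because the estimation half never influenced the split, extreme outcomes in it cannot select their own neighborhood.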
Causal forests perform best for relatively low-dimensional problems
Consider using a double machine learning variant if you have many covariates
The grf algorithm reference has some recommendations, amongst others honesty.fraction
grf implements an option which tunes hyperparameters, called tune.parameters
The packages really try to make causal inference more accessible
grf tutorials and the econml user guide
Athey and Wager have multiple examples where they implement models and describe their considerations
Many different things to do after estimating CATEs, broadly categorized:
A simple naive test: Split data based on median CATE
See evaluating a causal forest fit for an example
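A rough sketch of the median-split test, assuming we already have predicted CATEs; all numbers are made up, not output of a fitted forest:

```python
# Naive heterogeneity check: split units at the median predicted CATE and
# compare the simple difference-in-means ATE in each half.
units = [  # (predicted CATE, Y, T) -- illustrative values
    (0.2, 2.1, 1), (0.1, 2.0, 0), (0.3, 2.3, 1), (0.2, 2.0, 0),
    (1.9, 4.1, 1), (2.1, 2.0, 0), (2.0, 4.0, 1), (1.8, 2.1, 0),
]

cates = sorted(c for c, _, _ in units)
median = (cates[3] + cates[4]) / 2          # median of 8 values

def diff_in_means(group):
    yt = [y for _, y, t in group if t == 1]
    yc = [y for _, y, t in group if t == 0]
    return sum(yt) / len(yt) - sum(yc) / len(yc)

low  = [u for u in units if u[0] <= median]
high = [u for u in units if u[0] > median]
print(diff_in_means(low), diff_in_means(high))  # heterogeneity if these differ
```

In practice, prefer the packages' doubly robust ATE estimators within each group over the raw difference in means used here.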
Note: When calculating ATEs, use built-in functionality to calculate doubly robust ATEs
average_treatment_effect function in R; ate method in Python
Alternatively, consider the Rank-Weighted Average Treatment Effect (RATE) introduced by Yadlowsky et al. (2021)
Procedure is as follows: rank units by a prioritization rule (e.g. estimated CATEs) and, for each fraction \(q\), compare the ATE among the top \(q\) fraction to the overall ATE
This creates the Targeting Operator Characteristic (TOC) curve
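A hedged sketch of the TOC computation using oracle unit-level effects for clarity; real applications use doubly robust estimates, and every number here is illustrative:

```python
# TOC sketch: rank units by a prioritization score (e.g. an estimated CATE),
# then TOC(q) = ATE among the top q fraction minus the overall ATE.
taus = [3.0, 2.0, 1.0, 0.0]    # true unit-level effects (oracle, for illustration)
scores = [3.0, 2.0, 1.0, 0.0]  # prioritization scores: here a perfect ranking

ranked = [tau for _, tau in sorted(zip(scores, taus), reverse=True)]
ate = sum(taus) / len(taus)    # overall ATE

def toc(q):
    top = ranked[: max(1, round(q * len(ranked)))]
    return sum(top) / len(top) - ate

print([toc(q) for q in (0.25, 0.5, 0.75, 1.0)])
# AUTOC is a (weighted) area under this curve; it is zero everywhere if the
# scores carry no information about treatment effect heterogeneity.
```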
Usage:
If AUTOC is not different from zero, there are two possible explanations:
All implemented in R in rank_average_treatment, see here
How does one interpret a fully non-parametric CATE estimation?
Sadly, the packages are not very consistent in what is offered
A simple way to assess what drives heterogeneity is to look at splits in the trees
variable_importances function in R; feature_importances method in Python
What do we know about bias and explainability methods that use splits to calculate feature importance?
A neat way to explain heterogeneity is to use SHAP values!
To my knowledge only available in econml, via the shap_values() method
Another possibility is to predict the CATE with an intrinsically interpretable model
In econml.interpretation as SingleTreeCateInterpreter
In grf as best_linear_projection
One could easily train one's own models
On the basis of what one has found, or priors, one can estimate counterfactual CATEs
One can create policies based on the CATEs
policytree in R, Sverdrup et al. (2020)
SingleTreePolicyInterpreter in econml.cate_interpreter in Python
econml.policy has trees and forests that are both doubly robust and not, see documentation
See e.g. Athey & Wager (2021)
Angrist, J. D., & Pischke, J. S. (2009). Mostly harmless econometrics: An empiricist’s companion. Princeton university press.
Athey, S., & Imbens, G. (2016). Recursive partitioning for heterogeneous causal effects. Proceedings of the National Academy of Sciences, 113(27), 7353-7360.
Athey, S., Tibshirani, J., & Wager, S. (2019). Generalized random forests. The Annals of Statistics, 47(2), 1148-1178.
Athey, S., & Wager, S. (2019). Estimating treatment effects with causal forests: An application. Observational Studies, 5(2), 37-51.
Athey, S., & Wager, S. (2021). Policy learning with observational data. Econometrica, 89(1), 133-161.
Breiman, L. (2001). Random forests. Machine learning, 45, 5-32.
Sverdrup, E., Kanodia, A., Zhou, Z., Athey, S., & Wager, S. (2020). policytree: Policy learning via doubly robust empirical welfare maximization over trees. Journal of Open Source Software, 5(50), 2232.
Wager, S., & Athey, S. (2018). Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association, 113(523), 1228-1242.
Yadlowsky, S., Fleming, S., Shah, N., Brunskill, E., & Wager, S. (2021). Evaluating treatment prioritization rules via rank-weighted average treatment effects. arXiv preprint arXiv:2111.07966.