Magnus Nielsen, SODAS, UCPH
How to use machine learning methods for causality with unknown nuisance functions
How to perform causal model selection
Focus on intuitive understanding of methods and workflow
There are many different papers building on the same idea
Victor Chernozhukov is part of many of these papers
We consider the following partially linear model
\[\begin{align} Y=&\ T\theta_0+g_0(X)+U\\ T=&\ m_0(X) + V\\ \end{align}\]
With \(E[U|X,T]=0\) and \(E[V|X]=0\)
Basic model properties:
How do you perform functional form/variable selection when using OLS?
Classic econometrics:
Machine learning:
Short for least absolute shrinkage and selection operator
We add a term to the minimization problem which penalizes model complexity \[\hat w = \underset{w}{\text{argmin}} \left\{\frac{1}{N}|| Y - Xw||_2^2 + \lambda ||w||_1\right\}, \lambda \geq 0\]
where \(||\cdot||_1\) is the L1 or Taxicab norm, corresponding to \(\sum_{i=1}^k |w_i|\)
Source: Raschka & Mirjalili, 2019, ch. 4
Due to the regularization, all estimates are biased towards zero
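A minimal illustration with simulated data and sklearn's `Lasso` (whose penalty parameter `alpha` is scaled slightly differently from \(\lambda\) above): most irrelevant coefficients are set exactly to zero while the relevant ones are shrunk towards zero

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
N, K = 200, 20
X = rng.normal(size=(N, K))
beta = np.zeros(K)
beta[:3] = [2.0, -1.5, 1.0]               # only the first three covariates matter
y = X @ beta + rng.normal(size=N)

lasso = Lasso(alpha=0.1).fit(X, y)        # larger alpha => more coefficients set to zero
print(np.flatnonzero(lasso.coef_))        # indices of the selected (nonzero) covariates
print(lasso.coef_[:3])                    # note the shrinkage towards zero
```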
However, some problems remain
What kind of covariates should be included in our regression?
Given this, what’s the problem with using the modified LASSO for variable selection?
Still problematic
A simple solution suggested by Belloni et al. (2014a) is to use a post-double-selection method to correct for bias:
Also sometimes known as double-LASSO
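A minimal sketch of the procedure (simulated data; sklearn's `LassoCV` for the two selection steps and statsmodels for the final OLS; in practice you would use a dedicated package such as `hdm` or `pdslasso`, see below):

```python
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LassoCV

def post_double_selection(y, t, X):
    # Step 1: LASSO of the outcome on the controls
    sel_y = np.flatnonzero(LassoCV(cv=5).fit(X, y).coef_)
    # Step 2: LASSO of the treatment on the controls
    sel_t = np.flatnonzero(LassoCV(cv=5).fit(X, t).coef_)
    # Step 3: OLS of y on t and the union of the selected controls
    keep = np.union1d(sel_y, sel_t)
    Z = sm.add_constant(np.column_stack([t, X[:, keep]]))
    return sm.OLS(y, Z).fit()             # the coefficient on t is the estimate of interest

# Example with simulated confounding through X[:, 0]
rng = np.random.default_rng(1)
N, K = 500, 50
X = rng.normal(size=(N, K))
t = X[:, 0] + rng.normal(size=N)
y = 1.0 * t + 2.0 * X[:, 0] + rng.normal(size=N)
print(post_double_selection(y, t, X).params[1])   # should be close to 1
```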
This can be motivated in roughly two ways:
Usually done with cross validation
Some issues
There also exist theoretically justified hyperparameters
One issue:
Luckily, we’re just interested in covariate selection
\[\lambda = 2.2\sigma_r \sqrt{N} \Phi^{-1}\left(1-\frac{\alpha}{2K\cdot \text{ln}(N)}\right)\]
where \(\sigma_r\) is the standard deviation of the residuals, \(N\) the number of observations, \(K\) the number of covariates, \(\Phi^{-1}\) the standard normal quantile function, and \(\alpha\) the significance level
Taken from Urminsky et al. (2016) appendix
We must estimate the standard deviation of the residuals
This is done in an iterative way
In Urminsky et al. (2016) the number of iterations is \(x=100\)
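A sketch of the plug-in penalty and the iterative update of \(\sigma_r\). How \(\lambda\) maps onto a given solver's penalty parameter depends on how that solver scales its objective; the `alpha = lam / (2 * N)` line below is one such convention (an assumption) and should be checked against a reference implementation

```python
import numpy as np
from scipy.stats import norm
from sklearn.linear_model import Lasso

def plugin_lambda(sigma_r, N, K, alpha=0.05):
    # Penalty from the formula above (Urminsky et al., 2016, appendix)
    return 2.2 * sigma_r * np.sqrt(N) * norm.ppf(1 - alpha / (2 * K * np.log(N)))

def iterative_lasso(y, X, n_iter=100):
    N, K = X.shape
    sigma_r = y.std()                              # crude initial guess of the residual sd
    for _ in range(n_iter):
        lam = plugin_lambda(sigma_r, N, K)
        # Assumed mapping to sklearn, which minimizes (1/2N)*RSS + alpha*||w||_1
        fit = Lasso(alpha=lam / (2 * N)).fit(X, y)
        sigma_r = (y - fit.predict(X)).std()       # update the residual sd and repeat
    return fit
```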
Formally, we need to assume sparsity to make valid inference (Belloni et al., 2014b)
However, LASSO can handle \(K>N\)
You might find high-dimensional problems irrelevant
Method can be used for datasets with few observations
Alternatively, you just don’t know the functional form
Consider the second, third or higher order polynomial expansion of a moderate number of variables
Could also be other transformations of covariates
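For example, sklearn's `PolynomialFeatures` turns a moderate number of raw covariates into a high-dimensional design very quickly:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.random.normal(size=(100, 10))                        # 10 raw covariates
X_poly = PolynomialFeatures(degree=3, include_bias=False).fit_transform(X)
print(X_poly.shape)                                         # (100, 285)
```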
This method only does data driven selection of covariates
The formal theory in Belloni et al. (2014b) allows for inclusion of these
The same idea can be applied to instrumental variables
A contender for how to approach the many weak instruments problem
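A rough sketch of the idea with many candidate instruments (LASSO selects instruments in the first stage, then 2SLS uses the selected ones; this is a simplification of Belloni et al. (2012), which uses plug-in penalties and post-LASSO with valid standard errors):

```python
import numpy as np
from sklearn.linear_model import LassoCV, LinearRegression

rng = np.random.default_rng(2)
N, L = 500, 100
Z = rng.normal(size=(N, L))                     # many candidate instruments
u = rng.normal(size=N)                          # unobserved confounder
t = Z[:, 0] + 0.5 * Z[:, 1] + u + rng.normal(size=N)
y = 1.0 * t + u + rng.normal(size=N)            # OLS of y on t is biased upwards

sel = np.flatnonzero(LassoCV(cv=5).fit(Z, t).coef_)        # instruments picked by LASSO
t_hat = LinearRegression().fit(Z[:, sel], t).predict(Z[:, sel])
theta = LinearRegression().fit(t_hat.reshape(-1, 1), y).coef_[0]
print(theta)   # second-stage coefficient, close to 1 (standard errors not valid here)
```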
There are a couple of packages
hdm in R
pdslasso in Stata
Hyperparameter selection that is robust to non-Gaussian and heteroskedastic errors exists (Belloni et al., 2012)
We’re returning to the partially linear model
\[\begin{align} Y=& T\theta_0+g_0(X)+U\\ T=& m_0(X) + V\\ \end{align}\]
With \(E[U|X,T]=0\) and \(E[V|X]=0\)
We will be going through the main ideas of Chernozhukov et al. (2018)
We use data splitting to get two samples
In the auxiliary sample, \(I^a\), we estimate \(g(\cdot)\)
In the main sample we estimate the parameter of interest \[\hat{\theta}_0 = \frac{\frac{1}{n}\sum_{i\in I^m}T_i(Y_i-\hat{g}_0(X_i))}{\frac{1}{n}\sum_{i\in I^m}T_i^2}\]
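In code the split-sample plug-in estimator could look as follows (simulated data; a random forest fit of \(Y\) on \(X\) serves as a deliberately crude stand-in for \(\hat g_0\)):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
N = 2000
X = rng.normal(size=(N, 5))
T = X[:, 0] + rng.normal(size=N)                      # m_0(X) = X[:, 0]
Y = 0.5 * T + np.sin(X[:, 0]) + rng.normal(size=N)    # theta_0 = 0.5

# Auxiliary sample I^a for the nuisance, main sample I^m for theta
Xa, Xm, Ta, Tm, Ya, Ym = train_test_split(X, T, Y, test_size=0.5, random_state=0)
g_hat = RandomForestRegressor().fit(Xa, Ya)           # stand-in for g_0, fit on I^a

theta_naive = np.sum(Tm * (Ym - g_hat.predict(Xm))) / np.sum(Tm**2)
print(theta_naive)                                    # plug-in estimate of theta_0
```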
However, this estimator generally does not converge to the true value
We can decompose the error into two parts
\[ \begin{align} \sqrt{n}(\hat{\theta}_0-\theta_0) = & \frac{\frac{1}{\sqrt{n}}\sum_{i\in I^m}T_iU_i}{\frac{1}{n}\sum_{i\in I^m}T_i^2} \\ & + \frac{\frac{1}{\sqrt{n}}\sum_{i\in I^m}T_i(g_0(X_i)-\hat{g}_0(X_i))}{\frac{1}{n}\sum_{i\in I^m}T_i^2} \end{align} \]
The first part converges under mild conditions, the second does not
It can be shown that this is due to regularization bias
Suppose we also estimate \(\hat{m}_0(\cdot)\) on the auxiliary sample \(I^a\)
We can then utilize the following estimator
\[\check{\theta}_0=\frac{\frac{1}{n}\sum_{i\in I^m}\hat{V}_i(Y_i-\hat{g}_0(X_i))}{\frac{1}{n}\sum_{i\in I^m}\hat{V}_i T_i}\]
This can be decomposed into three terms, \(a\), \(b\) and \(c\)
The term \(b\) now depends on the product of the estimation errors in \(\hat{m}_0\) and \(\hat{g}_0\) \[ \begin{align} b & = \frac{\frac{1}{\sqrt{n}} \sum_{i \in I^m}[\hat m_0(X_i) - m_0(X_i)][\hat g_0 (X_i) - g_0 (X_i)]}{E[V^2]} \end{align} \]
The moment is Neyman orthogonal
The term \(c\) relates to overfitting
By utilizing sample splitting, the noise in the main sample and the estimation errors from the auxiliary sample are unrelated
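A self-contained sketch of the orthogonalized split-sample estimator \(\check{\theta}_0\) (both nuisances fit on the auxiliary sample; the outcome nuisance here is a fit of \(Y\) on \(X\), in the spirit of the partialling-out approach introduced below):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
N = 2000
X = rng.normal(size=(N, 5))
T = X[:, 0] + rng.normal(size=N)
Y = 0.5 * T + np.sin(X[:, 0]) + rng.normal(size=N)    # theta_0 = 0.5
Xa, Xm, Ta, Tm, Ya, Ym = train_test_split(X, T, Y, test_size=0.5, random_state=0)

g_hat = RandomForestRegressor().fit(Xa, Ya)           # outcome nuisance on I^a
m_hat = RandomForestRegressor().fit(Xa, Ta)           # treatment nuisance on I^a

V_hat = Tm - m_hat.predict(Xm)                        # residualized treatment on I^m
theta_check = np.sum(V_hat * (Ym - g_hat.predict(Xm))) / np.sum(V_hat * Tm)
print(theta_check)                                    # close to 0.5
```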
Like in the last session, we don’t like estimating treatment effects in just one part of the sample
There exist estimators that take this into account (cross-fitting)
We’re going to switch to a slightly different estimator based on Robinson (1988)
Consider the DGP
\[ \begin{align} Y & = \theta(X)\cdot T + g(X,W) + \epsilon \\ T & = f(X, W) + \eta \\ \end{align} \] where
\[ \begin{align} E[\epsilon|X, W] & = 0 \\ E[\eta|X, W] & = 0 \\ E[\epsilon \cdot \eta|X, W] & = 0 \end{align} \]
Question: What’s the difference between \(X\) and \(W\)?
We can subtract \(E[Y| X,W]\) and get:
\[ \begin{align} Y - E[Y|X,W] & = \theta(X)\cdot (T - E[T|X,W]) + \epsilon \\ \end{align} \] where we use that \[ \begin{align} E[Y|X,W] = \theta(X) \cdot E[T|X,W] + g(X,W) \end{align} \]
We can estimate the nuisance functions \(E[Y|X,W]\) and \(E[T|X,W]\)
We estimate the nuisance functions (using data splitting) and calculate residuals \[ \begin{align} \tilde Y & = Y - E[Y|X,W] \\ \tilde T & = T - E[T|X,W] \end{align} \]
The residuals are related by the equation \[ \begin{align} \tilde Y & = \theta(X)\cdot \tilde T + \epsilon \end{align} \]
The estimator based on Robinson’s (1988) partialling out approach is then
\[ \begin{align} \hat \theta & = \underset{\theta \in \Theta}{\text{argmin}} E_{n}\left[(\tilde Y - \theta(X)\cdot \tilde T)^2\right] \end{align} \]
For some model class \(\Theta\), e.g. a constant average treatment effect as in Chernozhukov et al. (2018)
We perform two steps of machine learning and regress residuals
where the predictions of \(T\) and \(Y\) utilize data splitting
When coding this up, this is what actually happens:
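A minimal sketch of that workflow using only sklearn and statsmodels; `cross_val_predict` supplies out-of-fold predictions and thereby plays the role of the data splitting:

```python
import numpy as np
import statsmodels.api as sm
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(4)
N = 2000
X = rng.normal(size=(N, 5))
T = X[:, 0] + rng.normal(size=N)
Y = 0.5 * T + np.sin(X[:, 0]) + rng.normal(size=N)    # constant effect theta_0 = 0.5

# Step 1 and 2: out-of-fold predictions of Y and T given X (cross-fitting)
Y_hat = cross_val_predict(RandomForestRegressor(), X, Y, cv=5)
T_hat = cross_val_predict(RandomForestRegressor(), X, T, cv=5)

# Step 3: residual-on-residual regression; the slope estimates theta_0
Y_res, T_res = Y - Y_hat, T - T_hat
print(sm.OLS(Y_res, T_res).fit().params)              # no constant: residuals are ~mean zero
```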
You need to worry about the quality of the nuisance predictions
And whether there’s only selection on observables
The predictive performance in the two predictive models can be assessed as usual
Any model you can think of
Or an ensemble of all these
You can perform hyperparameter selection using all the data
Better hyperparameter selection results in better nuisance estimates
Not a problem as long as relatively few hyperparameters are tuned, see references here
You can also increase precision by repeating the sample splitting multiple times and aggregating the estimates
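One way to do this is to rerun the cross-fitting with different random splits and aggregate; the helper below is illustrative and expects arrays \(Y\), \(T\), \(X\) like those in the sketch above:

```python
import numpy as np
import statsmodels.api as sm
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_predict

def dml_once(Y, T, X, seed):
    """One round of cross-fitted partialling out, with the split determined by `seed`."""
    cv = KFold(n_splits=5, shuffle=True, random_state=seed)
    Y_res = Y - cross_val_predict(RandomForestRegressor(random_state=0), X, Y, cv=cv)
    T_res = T - cross_val_predict(RandomForestRegressor(random_state=0), X, T, cv=cv)
    return sm.OLS(Y_res, T_res).fit().params[0]

# thetas = [dml_once(Y, T, X, seed) for seed in range(10)]
# theta_agg = np.median(thetas)          # e.g. the median over splits
```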
ATE in Chernozhukov et al. (2018)
Linear and high-dimensional linear in Semenova et al. (2017)
Non-parametric in Athey et al. (2019)
Non-parametric in Oprescu et al. (2019)
Both use the bootstrap of little bags for inference
Implemented in econml
LinearDML
SparseLinearDML
CausalForestDML
DMLOrthoForest
Also has other implementations, but LinearDML, SparseLinearDML and CausalForestDML cover the most cited papers
In R: doubleml
In Stata: ddml & pystacked
pystacked uses stacked sklearn models
pdslasso and other LASSO implementations in Stata
For categorical treatments, one can also use doubly robust methods
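For reference, a minimal LinearDML call (API as in recent econml versions; check the documentation for the version you have installed). Here \(X\) are effect modifiers and \(W\) are controls, mirroring the notation above:

```python
import numpy as np
from econml.dml import LinearDML
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(5)
N = 2000
X = rng.normal(size=(N, 2))                  # effect modifiers
W = rng.normal(size=(N, 3))                  # controls
T = W[:, 0] + rng.normal(size=N)
Y = (1 + X[:, 0]) * T + W[:, 0] + rng.normal(size=N)

est = LinearDML(model_y=RandomForestRegressor(), model_t=RandomForestRegressor(),
                random_state=0)
est.fit(Y, T, X=X, W=W)                      # cross-fitting happens internally
print(est.effect(X[:5]))                     # CATE estimates for the first five rows
print(est.ate(X))                            # average effect over the sample
```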
Pros
Cons
Flexible methods
Tempting to join on everything available for each observation
What’s the problem with the aforementioned idea?
As always, we should not include 'bad controls' (Angrist & Pischke, 2009)
Perform variable selection and double machine learning only with good and neutral controls
The notions of good and bad controls are relatively vague
Graphs offer a way of organizing thoughts about DGPs
There’s a crash course in Cinelli et al. (2020) and a discussion of the use of graph approaches versus potential outcome approaches in Imbens (2020)
We have many different CATE estimators
Sometimes we might prefer one method a priori
Sometimes we are just interested in personalized estimates
Usually we utilize that we observe the ground truth \(Y\) for model selection
The counterpart in causality would be \(E[l(\hat \tau (x), \tau)]\)
How could one evaluate CATE’s?
Hint: Remember the partialling out regression \[ \begin{align} \tilde Y & = \theta(X)\cdot \tilde T + \epsilon \end{align} \]
Nie & Wager (2021) propose an alternative method to evaluate CATE estimates, \(\hat \tau\)
Rewriting the partialling out regression
\[ \begin{align} \tilde Y & = \theta(X)\cdot \tilde T + \epsilon \\ \tilde Y & = \hat \tau(X) \cdot \tilde T + \epsilon \end{align} \]
Our treatment effects should explain the residual \(\tilde Y\)
With a slight reformulation of eq. 4 in the paper, we get
\[ \begin{align} \hat L_n [\hat \tau(\cdot)] & = \frac{1}{n} \sum_{i=1}^n \left[\tilde Y_i - \hat\tau(X_i)\tilde {T}_i\right]^2 \end{align} \]
Should use out of sample residuals and CATE estimates
Alternatively, use held-out data
Can evaluate estimates from any model
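A sketch of the comparison for two candidate CATE estimates; the array names are illustrative, and the residuals and \(\hat\tau\) predictions should come from held-out data:

```python
import numpy as np

def r_loss(Y_res, T_res, tau_hat):
    # \hat L_n: squared error of the residualized outcome against tau_hat * residualized T
    return np.mean((Y_res - tau_hat * T_res) ** 2)

# Y_res, T_res: held-out residuals of Y and T; tau_a, tau_b: CATE predictions on that data
# pick_a = r_loss(Y_res, T_res, tau_a) < r_loss(Y_res, T_res, tau_b)   # lower loss wins
```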
grf tunes model hyperparameters with tune
econml has a score function
Nie & Wager (2021) also motivate a two step modelling process
Called the R-learner due to its close link to Robinson (1988) and the focus on residuals
KernelDML in econml
One could also evaluate against a constant average treatment effect
Implemented in the RScorer in econml, see here
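A hedged sketch of RScorer usage on simulated data (constructor and method signatures may differ across econml versions; consult the linked documentation; the scorer refits its own nuisances on the validation data):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from econml.dml import LinearDML, CausalForestDML
from econml.score import RScorer

rng = np.random.default_rng(6)
N = 2000
X = rng.normal(size=(N, 2))
T = X[:, 0] + rng.normal(size=N)
Y = (1 + X[:, 1]) * T + X[:, 0] + rng.normal(size=N)       # CATE varies with X[:, 1]

X_tr, X_val, T_tr, T_val, Y_tr, Y_val = train_test_split(X, T, Y, test_size=0.3,
                                                         random_state=0)

candidates = {"linear": LinearDML(random_state=0),
              "forest": CausalForestDML(random_state=0)}
for est in candidates.values():
    est.fit(Y_tr, T_tr, X=X_tr)

scorer = RScorer(model_y=RandomForestRegressor(), model_t=RandomForestRegressor())
scorer.fit(Y_val, T_val, X=X_val)                          # nuisances on held-out data
print({name: scorer.score(est) for name, est in candidates.items()})
```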
The very short version: use the score function in econml
econml has a table of different estimators
econml also has a GitHub with example notebooks
Meta-learners also estimate CATEs (Künzel et al., 2019)
Bayesian additive regression trees (BART) also estimate CATEs (Chipman et al., 2010)
Belloni, A., Chen, D., Chernozhukov, V., & Hansen, C. (2012). Sparse models and methods for optimal instruments with an application to eminent domain. Econometrica, 80(6), 2369-2429.
Belloni, A., Chernozhukov, V., & Hansen, C. (2014a). High-dimensional methods and inference on structural and treatment effects. Journal of Economic Perspectives, 28(2), 29-50.
Belloni, A., Chernozhukov, V., & Hansen, C. (2014b). Inference on treatment effects after selection among high-dimensional controls. The Review of Economic Studies, 81(2), 608-650.
Urminsky, O., Hansen, C., & Chernozhukov, V. (2016). Using double-lasso regression for principled variable selection. Available at SSRN 2733374.
Athey, S., Tibshirani, J., & Wager, S. (2019). Generalized random forests. The Annals of Statistics, 47(2), 1148-1178.
Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., & Robins, J. (2018). Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1), C1-C68.
Chernozhukov, V., Escanciano, J. C., Ichimura, H., Newey, W. K., & Robins, J. M. (2022). Locally robust semiparametric estimation. Econometrica, 90(4), 1501-1535.
Chiang, H. D., Kato, K., Ma, Y., & Sasaki, Y. (2022). Multiway cluster robust double/debiased machine learning. Journal of Business & Economic Statistics, 40(3), 1046-1056.
Nie, X., & Wager, S. (2021). Quasi-oracle estimation of heterogeneous treatment effects. Biometrika, 108(2), 299-319.
Oprescu, M., Syrgkanis, V., & Wu, Z. S. (2019, May). Orthogonal random forest for causal inference. In International Conference on Machine Learning (pp. 4932-4941). PMLR.
Robinson, P. M. (1988). Root-N-consistent semiparametric regression. Econometrica: Journal of the Econometric Society, 931-954.
Schuler, A., Baiocchi, M., Tibshirani, R., & Shah, N. (2018). A comparison of methods for model selection when estimating individual treatment effects. arXiv preprint arXiv:1804.05146.
Semenova, V., Goldman, M., Chernozhukov, V., & Taddy, M. (2017). Estimation and inference on heterogeneous treatment effects in high-dimensional dynamic panels. arXiv preprint arXiv:1712.09988.
Angrist, J. D., & Pischke, J. S. (2009). Mostly harmless econometrics: An empiricist’s companion. Princeton university press.
Cinelli, C., Forney, A., & Pearl, J. (2020). A crash course in good and bad controls. Sociological Methods & Research, 00491241221099552.
Chipman, H. A., George, E. I., & McCulloch, R. E. (2010). BART: Bayesian additive regression trees. The Annals of Applied Statistics, 4(1), 266-298.
Hünermund, P., & Bareinboim, E. (2019). Causal inference and data fusion in econometrics. arXiv preprint arXiv:1912.09104.
Hünermund, P., Louw, B., & Caspi, I. (2021). Double Machine Learning and Automated Confounder Selection–A Cautionary Tale. arXiv preprint arXiv:2108.11294.
Imbens, G. W. (2020). Potential outcome and directed acyclic graph approaches to causality: Relevance for empirical practice in economics. Journal of Economic Literature, 58(4), 1129-1179.
Künzel, S. R., Sekhon, J. S., Bickel, P. J., & Yu, B. (2019). Metalearners for estimating heterogeneous treatment effects using machine learning. Proceedings of the national academy of sciences, 116(10), 4156-4165.
Raschka, S., & Mirjalili, V. (2019). Python machine learning: Machine learning and deep learning with Python, scikit-learn, and TensorFlow 2. Packt Publishing Ltd.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1), 267-288.