Double machine learning

Magnus Nielsen, SODAS, UCPH

Agenda

  • Variable selection
  • Post-double-selection
  • Double machine learning
  • Generalizing DML
  • Decisions
  • Kitchen sink causality
  • Causal model selection

Goal

How to use machine learning methods for causality with unknown nuisance functions

  • Variable selection
  • Double machine learning

How to perform causal model selection

  • When estimating heterogeneous treatment effects

Focus on intuitive understanding of methods and workflow

  • Will have to use some math to create a scaffold

A citation mess

There are many different papers building on the same idea

  • Many have multiple working papers and sometimes also final publications
    • I try to use the final publication if possible
    • This can cause confusion when they cite each other

Victor Chernozhukov is part of many of these papers

  • Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., & Robins, J. (2018), Double/debiased machine learning for treatment and structural parameters, is probably the most often cited

What to cite?

 

Source: Chernozhukov et al., 2022

Variable selection

Partially linear model

We consider the following partially linear model

\[\begin{align} Y=&\ T\theta_0+g_0(X)+U\\ T=&\ m_0(X) + V\\ \end{align}\]

With \(E[U|X,T]=0\) and \(E[V|X]=0\)

Basic model properties:

  • Outcome is confounded by (unknown) nuisance function, \(g_0(\cdot)\)
  • People select into treatment based on observables, with (unknown) propensity function \(m_0(\cdot)\)

Question

How do you perform functional form/variable selection when using OLS?

Model selection

Classic econometrics:

  • Use OLS and include covariates based on theory or inference
    • Canonical functional forms, backward selection, forward selection etc.
  • Problems:
    • How to delete covariates systematically?
    • Adjust for multiple hypothesis testing?
    • Does data support it?

Guided selection

Machine learning:

  • Use LASSO to perform covariate selection

A quick recap of LASSO

Short for least absolute shrinkage and selection operator

  • Introduced by Tibshirani (1996)

We add a term to the minimization problem which penalizes model complexity \[\hat w = \underset{w}{\text{argmin}} \left\{\frac{1}{N}|| Y - Xw||_2^2 + \lambda ||w||_1\right\}, \lambda \geq 0\]

where \(||\cdot||_1\) is the L1 or Taxicab norm, corresponding to \(\sum_{i=1}^k |w_i|\)
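As a small, hedged illustration of the induced sparsity (using sklearn's Lasso, whose penalty parameter is scaled somewhat differently from the \(\lambda\) above):

```python
# Minimal sketch: the L1 penalty shrinks many coefficients exactly to zero.
# Note: sklearn's Lasso minimizes (1/2N)||y - Xw||_2^2 + alpha*||w||_1, so its
# alpha is not on exactly the same scale as the lambda in the formula above.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(size=200)  # only two relevant covariates

coef = Lasso(alpha=0.2).fit(X, y).coef_
print(np.round(coef, 2))  # most entries are exactly zero
```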

A geometric interpretation

Figure 1: Two-dimensional plots of cost minimization: (a) OLS, (b) Lasso

Source: Raschka & Mirjalili, 2019, ch. 4

A first step

Due to the regularization, all estimates are biased towards zero

  • We could exclude the treatment from the regularization

However, some problems remain

Question

What kind of covariates should be included in our regression?

Given this, what’s the problem with using the modified LASSO for variable selection?

Confounders

Still problematic

  • We may omit potentially relevant variables
    • LASSO excludes variables with little predictive power w.r.t. \(Y\)
    • Excluded variables may still have predictive power w.r.t. \(T\)
      • Confounders with little (but non-zero) predictive power w.r.t. \(Y\) may be excluded

Post-double-selection

Fixing the LASSO

A simple solution suggested by Belloni et al. (2014a) is to use a post-double-selection method to correct for bias:

  • Step 1: estimate two LASSO models
      1. Regress \(Y\) on \(X\)
      2. Regress \(T\) on \(X\)
  • Step 2: run OLS using only the variables that were kept in either LASSO (see the sketch below)

Also sometimes known as double-LASSO
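A minimal sketch of the procedure on simulated data, using sklearn's LassoCV for the two selection steps (cross-validated penalties purely for brevity; the theoretical penalty is discussed below) and statsmodels for the final OLS:

```python
# Post-double-selection sketch: union of variables selected by two LASSOs,
# then OLS of Y on T plus the selected controls (simulated data for illustration).
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
n, k = 1000, 50
X = rng.normal(size=(n, k))
T = 0.5 * X[:, 0] + rng.normal(size=n)              # selection into treatment on X
Y = 1.0 * T + 0.8 * X[:, 0] + rng.normal(size=n)    # true effect of T is 1.0

keep_y = np.flatnonzero(LassoCV(cv=5).fit(X, Y).coef_)   # step 1a: Y on X
keep_t = np.flatnonzero(LassoCV(cv=5).fit(X, T).coef_)   # step 1b: T on X
selected = np.union1d(keep_y, keep_t)

Z = sm.add_constant(np.column_stack([T, X[:, selected]]))  # step 2: OLS
print(sm.OLS(Y, Z).fit().summary().tables[1])
```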

Why?

This can be motivated in roughly two ways:

  • Researcher covariate selection is funky
    • P-hacking, subjectivity etc.
  • We are assuming sparsity

Hyperparameter selection

Usually done with cross validation

  • Great out of sample predictive performance

Some issues

  • Tends to include too many variables
  • No formal theory

Theoretical hyperparameters

There also exist theoretically justified hyperparameters

  • Good at variable selection
  • Formal theory

One issue:

  • Not great out of sample predictive performance
    • Hence never used in predictive modelling

Luckily, we’re just interested in covariate selection

The formula

\[\lambda = 2C\sigma_r \sqrt{N}\, \Phi^{-1}\left(1-\frac{\alpha}{2K\cdot \ln(N)}\right)\]

Where

  • \(\sigma_r\) is the standard deviation of the residuals (homoskedastic)
  • \(\Phi^{-1}\) is the inverse CDF of the normal distribution
  • \(K\) and \(N\) are amount of covariates and sample size
  • \(C\) and \(\alpha\) are constants
    • Usually 1.1 and 0.1

Taken from Urminsky et al. (2016) appendix

  • A non-technical introduction to post-double-selection

Residuals?

We must estimate the standard deviation of the residuals

  • As such, we need residuals

This is done in an iterative way

  • Get starter residuals by subtracting mean
  • Estimate standard deviation
  • Use formula
    • Repeat \(x\) times or until convergence

In Urminsky et al. (2016), \(x=100\); a minimal sketch of the procedure follows below
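A hedged sketch of the iterative procedure under homoskedasticity; the rescaling of the theoretical \(\lambda\) to sklearn's penalty parameterization is an assumption for illustration, not a definitive mapping:

```python
# Iterative plug-in penalty: start from demeaned residuals, estimate their
# standard deviation, compute lambda, refit, and repeat a fixed number of times.
import numpy as np
from scipy.stats import norm
from sklearn.linear_model import Lasso

def plugin_lasso(X, y, C=1.1, alpha=0.1, n_iter=100):
    n, k = X.shape
    resid = y - y.mean()                      # starter residuals: subtract the mean
    for _ in range(n_iter):
        sigma = resid.std()                   # homoskedastic residual std. dev.
        lam = 2 * C * sigma * np.sqrt(n) * norm.ppf(1 - alpha / (2 * k * np.log(n)))
        # sklearn's Lasso uses (1/2n)||y - Xw||^2 + alpha*||w||_1, so the
        # theoretical lambda must be rescaled; the factor below is an assumption
        model = Lasso(alpha=lam / (2 * n)).fit(X, y)
        resid = y - model.predict(X)
    return model, lam
```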

Inference

Formally, we need to assume sparsity to make valid inference (Belloni et al., 2014b)

  • This is also what we implicitly do when we make low-dimensional functional forms
  • As an example, the data could have \(K>N\), which won’t work with OLS

However, LASSO can handle \(K>N\)

  • We can perform variable selection even if there are more variables than observations!

More variables than observations?

You might find high-dimensional problems irrelevant

  • Not so fast

Method can be used for datasets with few observations

  • We generally have many controls through Statistics Denmark

Alternatively, you just don’t know the functional form

  • But willing to assume linear functional form
  • Create polynomial features

Technical regressors

Consider the second-, third- or higher-order polynomial expansion of a moderate number of variables

  • The number of interactions alone explodes quickly
    • \(\binom{n}{k} = \frac{n!}{k!(n-k)!}\) interactions at each level \(k\)
  • Sparsity probably justified
  • Sometimes called technical regressors

Could also be other transformations of covariates
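A small illustration of how quickly the number of technical regressors grows, using sklearn's PolynomialFeatures on 20 original covariates:

```python
# Number of polynomial/interaction features for 20 original covariates.
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.zeros((1, 20))  # 20 original covariates
for degree in (1, 2, 3):
    n_features = PolynomialFeatures(degree=degree, include_bias=False).fit(X).n_output_features_
    print(degree, n_features)  # 20, 230, 1770
```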

Data driven selection

This method only does data driven selection of covariates

  • As such, it may leave out variables which you want to include a priori

The formal theory in Belloni et al. (2014b) allows for inclusion of these

  • If you want an additional covariate in, just add it to the set

Instrumental variables

The same idea can be applied to instrumental variables

  • Introduced in Belloni et al. (2012)

A contender for how to approach the many weak instruments problem

  • We need to assume sparsity

Implementations

There are a couple of packages

  • hdm in R
  • pdslasso in Stata

Hyperparameter selection that is robust to non-Gaussian and heteroskedastic errors exists (Belloni et al., 2012)

Double machine learning

Back to the partially linear model

We’re returning to the partially linear model

\[\begin{align} Y=& T\theta_0+g_0(X)+U\\ T=& m_0(X) + V\\ \end{align}\]

With \(E[U|X,T]=0\) and \(E[V|X]=0\)

We will be going through the main ideas of Chernozhukov et al. (2018)

  • The intuitive version

First idea: Sample splitting

We use data splitting to get two samples

  • Main part \(I^m\), auxiliary part \(I^a\), each with size \(n\)

In the auxiliary sample, \(I^a\), we estimate \(g_0(\cdot)\)

  • We can use any arbitrary machine learning method
  • Learn the confounding function

Estimator

In the main sample we estimate the parameter of interest \[\hat{\theta}_0 = \frac{\frac{1}{n}\sum_{i\in I^m}T_i(Y_i-\hat{g}_0(X_i))}{\frac{1}{n}\sum_{i\in I^m}T_i^2}\]

However, this estimator generally does not converge to the true value

  • Not usable

Decomposing the error

We can decompose the error into two parts

\[ \begin{align} \sqrt{n}(\hat{\theta}_0-\theta_0) = & \frac{\frac{1}{\sqrt{n}}\sum_{i\in I^m}T_iU_i}{\frac{1}{n}\sum_{i\in I^m}T_i^2} \\ & + \frac{\frac{1}{\sqrt{n}}\sum_{i\in I^m}T_i(g_0(X_i)-\hat{g}_0(X_i))}{\frac{1}{n}\sum_{i\in I^m}T_i^2} \end{align} \]

The first part converges under mild conditions, the second does not

  • What could be the cause of this?
    • Think of what we do when learning \(\hat{g}_0\)?

Regularization bias

It can be shown that this is due to regularization bias

  • We curb overfitting when learning \(\hat{g}\)
    • This is done to control the bias-variance trade-off
    • Necessary for informative learning in complex and high-dimensional settings
  • Estimator will have bias term that asymptotically diverges and is not centered
    • Converges too slowly

Orthogonalization

Suppose we also estimate \(\hat{m}_0(\cdot)\) on the auxiliary sample \(I^a\)

  • Calculate residuals \(\hat V_i = T_i - \hat{m}_0(X_i)\) for \(i \in I^m\)

We can then utilize the following estimator

\[\check{\theta}_0=\frac{\frac{1}{n}\sum_{i\in I^m}\hat{V}_i(Y_i-\hat{g}_0(X_i))}{\frac{1}{n}\sum_{i\in I^m}\hat{V}_i T_i}\]
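In code, this is just a ratio of two sample averages; a minimal numpy sketch, assuming arrays T_main and Y_main for the main sample with nuisance predictions m_hat and g_hat fitted on the auxiliary sample (names are illustrative):

```python
# Orthogonalized point estimate on the main sample (T_main, Y_main, m_hat,
# g_hat are assumed to exist; see the lead-in above).
import numpy as np

V_hat = T_main - m_hat                                            # treatment residuals
theta_check = np.sum(V_hat * (Y_main - g_hat)) / np.sum(V_hat * T_main)
```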

Decomposing

This can be decomposed into three terms, \(a\), \(b\) and \(c\)

  • \(a\) is once again well behaved
  • \(b\) concerns regularization bias
  • \(c\) concerns overfitting bias

The term \(b\)

The term \(b\) now depends on the product of the estimation errors in \(\hat{m}_0\) and \(\hat{g}_0\) \[ \begin{align} b & = \frac{\frac{1}{\sqrt{n}} \sum_{i \in I^m}[\hat m_0(X_i) - m_0(X_i)][\hat g_0 (X_i) - g_0 (X_i)]}{E[V^2]} \end{align} \]

The moment is Neyman orthogonal

  • Locally insensitive to the value of the nuisance parameters
    • We can use noisy estimates
    • Hence why it is sometimes called orthogonal machine learning
  • Even though both estimates converge slowly, the product still converges

The term \(c\)

The term \(c\) relates to overfitting

  • Errors in DGP and estimation errors are related
    • Model overfits to the noise
  • Traditionally handled in semi-parametric analysis by assuming limited complexity
    • This rules out settings with machine learning methods, as they are too complex

By utilizing sample splitting, errors in DGP and estimation errors are unrelated

  • Weak assumptions needed, dependent on application

Sample splitting

Like in the last session, we don’t like estimating treatment effects in just one part of the sample

  • We rotate the data and repeat the process
    • Also supports K-fold data splitting
      • More data for estimating \(m_0\) and \(g_0\)
      • Should be done in practice

Partialling out

Estimators that take this into account exist

  • Also in the setup we’ve described

We’re going to switch to a slightly different estimator based on Robinson (1988)

  • The difference between equations 4.3 and 4.4 in Chernozhukov et al. (2018)

Generalizing

Consider the DGP

\[ \begin{align} Y & = \theta(X)\cdot T + g(X,W) + \epsilon \\ T & = f(X, W) + \eta \\ \end{align} \] where

\[ \begin{align} E[\epsilon|X, W] & = 0 \\ E[\eta|X, W] & = 0 \\ E[\epsilon \cdot \eta|X, W] & = 0 \end{align} \]

Question: What’s the difference between \(X\) and \(W\)?

Rewriting

We can subtract \(E[Y| X,W]\) and get:

\[ \begin{align} Y - E[Y|X,W] & = \theta(X)\cdot (T - E[T|X,W]) + \epsilon \\ \end{align} \] Where we use that \[ \begin{align} E[Y|X,W] = \theta(X) \cdot E[T|X,W] + g(X,W) \end{align} \]

We can estimate the nuisance functions \(E[Y|X,W]\) and \(E[T|X,W]\)

  • Non-parametric prediction problem
  • Double machine learning

Compute residuals

We estimate the nuisance functions (using data splitting) and calculate residuals \[ \begin{align} \tilde Y & = Y - E[Y|X,W] \\ \tilde T & = T - E[T|X,W] \end{align} \]

The residuals are related by the equation \[ \begin{align} \tilde Y & = \theta(X)\cdot \tilde T + \epsilon \end{align} \]

Partialling out estimator

The estimator based on the Robinson’s (1988) partialling out approach is then

\[ \begin{align} \hat \theta & = \underset{\theta \in \Theta}{\text{argmin}} E_{n}\left[(\tilde Y - \theta(X)\cdot \tilde T)^2\right] \end{align} \]

For some model class \(\Theta\), e.g. a constant average treatment effect in Chernozhukov et al. (2018)

  • Equation 4.4 in Chernozhukov et al. (2018)

Summing up

We perform two steps of machine learning and regress residuals

  • Predict \(T\) based on \(X, W\)
  • Predict \(Y\) based on \(X, W\)
  • Calculate residuals
  • Regress them on each other, subject to some functional form

Where the predictions of \(T\) and \(Y\) utilize data splitting
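A minimal end-to-end sketch of this workflow for a constant treatment effect, with cross-fitted nuisance predictions from sklearn on simulated data (any well-tuned learners could replace the random forests):

```python
# Partialling-out DML: cross-fit E[Y|X,W] and E[T|X,W], then regress residuals.
import numpy as np
import statsmodels.api as sm
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
n = 2000
XW = rng.normal(size=(n, 10))
T = np.sin(XW[:, 0]) + rng.normal(size=n)
Y = 1.0 * T + np.cos(XW[:, 0]) + rng.normal(size=n)   # true effect is 1.0

y_hat = cross_val_predict(RandomForestRegressor(), XW, Y, cv=5)   # E[Y|X,W]
t_hat = cross_val_predict(RandomForestRegressor(), XW, T, cv=5)   # E[T|X,W]

Y_res, T_res = Y - y_hat, T - t_hat
print(sm.OLS(Y_res, T_res).fit().summary().tables[1])   # constant theta on residuals
```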

In practice

When coding this up, this is what actually happens (see the hedged sketch after the list):

  • Find best model and hyperparameters for predicting \(Y\) and \(T\)
  • Give these to an estimator with functional form assumption you want
  • (Possibly evaluate model if heterogeneous treatment effects)
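A hedged sketch of these steps with econml's LinearDML; argument names can differ across econml versions, so treat this as illustrative rather than the definitive interface:

```python
# LinearDML sketch: nuisance models for Y and T, cross-fitting, linear CATE.
# Assumes arrays Y, T, X (effect modifiers) and W (other controls) already exist.
from econml.dml import LinearDML
from sklearn.ensemble import GradientBoostingRegressor

est = LinearDML(
    model_y=GradientBoostingRegressor(),   # first-stage model for E[Y|X,W]
    model_t=GradientBoostingRegressor(),   # first-stage model for E[T|X,W]
    cv=5,                                  # folds used for cross-fitting
)
est.fit(Y, T, X=X, W=W)
print(est.summary())                       # linear CATE coefficients with inference
```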

Decisions

Worries

You need to worry about

  • Predictive performance in first double machine learning stage
  • What functional form to assume

And whether there’s only selection on observables

  • Some methods support IV

Predictive performance

The predictive performance in the two predictive models can be assessed as usual

  • Held-out data (same number of folds as the estimator)

Any model you can think of

  • Linear/logistic models, including regularized regression
  • Tree based models, including random forests & boosted trees
  • Neural networks

Or an ensemble of all these

  • Good that you learned a lot about prediction!

Tuning beforehand?

You can perform hyperparameter selection using all the data

  • Alternatively, perform hyperparameter selection just within auxiliary sample

Better hyperparameter selection results in better nuisance estimates

  • More precise causal inference
    • As such, this should be done

Not a problem as long as relatively few hyperparameters are tuned, see references here

Repeated nuisance estimation

You can also increase precision by repeating the nuisance estimation across multiple sample splits (a sketch follows below)

  • Advocated by Duflo in this presentation, slide 65
  • Choose median or mean for final regression
    • Median more robust to outliers
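A rough sketch of the idea, assuming a hypothetical helper run_dml that wraps the whole cross-fitting workflow for a given random split:

```python
# Repeat the full cross-fitting procedure over several random splits and
# aggregate the point estimates; the median is more robust to unlucky splits.
import numpy as np

estimates = [run_dml(Y, T, XW, random_state=s) for s in range(10)]  # run_dml is a hypothetical wrapper
theta_median, theta_mean = np.median(estimates), np.mean(estimates)
```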

Linear functional forms

ATE in Chernozhukov et al. (2018)

  • Also considers instrumental variables

Linear and high-dimensional linear in Semenova et al. (2017)

  • High-dimensional linear needs to assume sparsity
    • Utilizes a debiased LASSO from Van de Geer et al. (2014)
    • Could also be used with many technical regressors

Non-parametric functional form

Non-parametric in Athey et al. (2019)

  • Also considers instrumental variables
  • Relatively low-dimensional data

Non-parametric in Oprescu et al. (2019)

  • Can provably handle sparse high-dimensional data

Both use the bootstrap of little bags for inference

Implementations

Implemented in econml

  • LinearDML
  • SparseLinearDML
  • CausalForestDML
  • DMLOrthoForest

Also has other implementations, but LinearDML, SparseLinearDML and CausalForestDML build on the most cited papers

  • Proxy for usability?

Some alternative packages

In R: doubleml

  • Supports clusters
    • See Chiang et al. (2022) for theory
  • Chernozhukov on author list
  • Also a Python package
    • EconML has \(\approx\) 60 times more downloads in a week
      • More estimators & functionality

In Stata: ddml & pystacked

  • pystacked uses stacked sklearn models
  • Same guys behind pdslasso and other LASSO implementations in Stata

Doubly robust variants

For categorical treatments, one can also use doubly robust methods

  • Predict \(Y\) based on both \(X,W\) and \(T\), not just \(X, W\)
  • Otherwise procedure basically the same
    • Last step uses an augmented inverse probability weighted (AIPW) estimator, sketched below
    • See here for more information
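For a binary treatment, the last step is roughly the familiar AIPW moment; the notation here is illustrative, with \(\hat g(t,x)\) the cross-fitted outcome model and \(\hat e(x)\) the cross-fitted propensity score:

\[ \hat\tau_{DR} = \frac{1}{n}\sum_{i=1}^n\left[\hat g(1, X_i) - \hat g(0, X_i) + \frac{T_i\,(Y_i - \hat g(1, X_i))}{\hat e(X_i)} - \frac{(1-T_i)\,(Y_i - \hat g(0, X_i))}{1-\hat e(X_i)}\right] \]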

Doubly robust or not?

Pros

  • If wrong functional form, get best linear projection
  • Slightly stronger robustness guarantees

Cons

  • Requires categorical treatment
  • Generally higher variance
    • Especially if weak overlap

Kitchen sink causality

What controls to include

Flexible methods

  • Both post-double-selection and double machine learning

Tempting to join on everything available for each observation

  • Let the model select important variables
  • Keeps all covariates with predictive power

Question

What’s the problem with the aforementioned idea?

Good and bad controls

As always, we should not include ‘bad controls’ (Angrist & Pischke, 2009)

  • Double machine learning is still sensitive to them, see e.g. Hünermund et al. (2021)

Perform variable selection and double machine learning only with good and neutral controls

Causal graphs

Good and bad controls are relatively vague

Graphs provide a way of organizing thoughts about DGPs

  • Can be combined with a structural approach e.g. Hünermund & Bareinboim (2019)
  • Different graphs imply different conditional distributions
    • Can create graph selection procedures, but hidden confounders still a problem

There’s a crash course in Cinelli et al. (2020) and a discussion of the use of graph approaches versus potential outcome approaches in Imbens (2020)

Causal model selection

Many CATE estimators

We have many different CATE estimators

  • All based on same ‘important’ assumptions
    • Unconfoundedness

Sometimes we might prefer one method a priori

  • Linear models are more interpretable

Sometimes we are just interested in personalized estimates

  • We want the most accurate model
    • Hence causal model selection

Supervised model selection

Usually we utilize that we observe the ground truth \(Y\) for model selection

  • Maximize out of sample performance
    • Minimize expected loss, \(E[l(\hat Y, Y)]\)
    • \(l(\cdot)\) could be mean squared error, accuracy etc.

The counterpart in causality would be \(E[l(\hat \tau (x), \tau)]\)

  • Sadly, we do not observe the ground truth of causality

Question

How could one evaluate CATEs?

Hint: Remember the partialling out regression \[ \begin{align} \tilde Y & = \theta(X)\cdot \tilde T + \epsilon \end{align} \]

An alternative

Nie & Wager (2021) propose an alternative method to evaluate CATE estimates, \(\hat \tau\)

Rewriting the partialling out regression

\[ \begin{align} \tilde Y & = \theta(X)\cdot \tilde T + \epsilon \\ \tilde Y & = \hat \tau(X) \cdot \tilde T + \epsilon \end{align} \]

Our treatment effects should explain the residual \(\tilde Y\)

  • \(E[\epsilon | X, W] = 0\)

The loss function

With a slight reformulation of eq. 4 in the paper, we get

\[ \begin{align} \hat L_n [\hat \tau(\cdot)] & = \frac{1}{n} \sum_{i=1}^n \left[\tilde Y_i - \hat\tau(X_i)\,\tilde {T}_i\right]^2 \end{align} \]

Should use out of sample residuals and CATE estimates

  • Split data into folds
  • Estimate residuals \(\tilde Y, \tilde T\) in each fold, training on all the others
    • Just like double machine learning
  • Then evaluate loss function

Alternatively, use held-out data
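A minimal sketch of the resulting evaluation, assuming out-of-fold residuals Y_res and T_res (as in the earlier DML sketch) and a fitted CATE model exposing per-observation effect estimates (est and theta_hat below are hypothetical placeholders):

```python
# R-loss: how well do CATE estimates explain the residualized outcome?
import numpy as np

def r_loss(tau_hat, Y_res, T_res):
    """Mean squared error of Y_res against tau_hat * T_res."""
    return np.mean((Y_res - tau_hat * T_res) ** 2)

# e.g. compare a heterogeneous model against a constant-effect baseline
# loss_het = r_loss(est.effect(X).ravel(), Y_res, T_res)
# loss_ate = r_loss(np.full_like(Y_res, theta_hat), Y_res, T_res)
```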

Causal model selection

Can evaluate estimates from any model

  • We now have causal model evaluation
    • This method most consistently selects a high-performing model (Schuler et al., 2018)
  • Implementations exist
    • How grf tunes model hyperparameters with tune
    • econml has a score function

Can also create models

Nie & Wager (2021) also motivate a two step modelling process

  • Estimate nuisance functions
  • Select the parameter estimates that minimize the loss function
    • Estimate \(\hat \tau\) using a model, e.g. kernel regression

Called the R-learner due to its close link to Robinson (1988) and the focus on residuals

  • KernelDML in econml
  • No confidence intervals

Heterogeneity or not?

One could also evaluate against a constant average treatment effect

  • How much better are we at explaining residuals compared to a constant average treatment effect?

Implemented in the RScorer in econml, see here

  • If score is negative, heterogeneous treatment effects are worse than constant average treatment effect
    • Possibly overfitting during training

A quick summation

The very short version

  • Decide whether to use doubly robust estimator or not
    • Treatment type and how good the overlap is
  • Create models for \(T\) and \(Y\) based on \(X, W\) (and \(T\) if doubly robust)
    • Choose best hyperparameters before DML using cross-validation
  • Choose a functional form (an econml function)
    • Alternatively, perform causal model selection

Different estimators

econml has a table of different estimators

  • Concise summary of estimators with different assumptions & treatment types
  • Find the papers behind the estimators and read those!

econml also has a GitHub with example notebooks

Some additional possibilities

Meta-learners also estimate CATEs

  • Cannot deliver confidence intervals
  • See Künzel et al. (2019)

Bayesian additive regression trees (BART) also estimate CATEs

  • It’s Bayesian
  • See Chipman et al. (2010)

References

Post-double-selection papers

Belloni, A., Chen, D., Chernozhukov, V., & Hansen, C. (2012). Sparse models and methods for optimal instruments with an application to eminent domain. Econometrica, 80(6), 2369-2429.

Belloni, A., Chernozhukov, V., & Hansen, C. (2014a). High-dimensional methods and inference on structural and treatment effects. Journal of Economic Perspectives, 28(2), 29-50.

Belloni, A., Chernozhukov, V., & Hansen, C. (2014b). Inference on treatment effects after selection among high-dimensional controls. The Review of Economic Studies, 81(2), 608-650.

Urminsky, O., Hansen, C., & Chernozhukov, V. (2016). Using double-lasso regression for principled variable selection. Available at SSRN 2733374.

Double machine learning I

Athey, S., Tibshirani, J., & Wager, S. (2019). Generalized random forests.

Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., & Robins, J. (2018). Double/debiased machine learning for treatment and structural parameters.

Chernozhukov, V., Escanciano, J. C., Ichimura, H., Newey, W. K., & Robins, J. M. (2022). Locally robust semiparametric estimation. Econometrica, 90(4), 1501-1535.

Chiang, H. D., Kato, K., Ma, Y., & Sasaki, Y. (2022). Multiway cluster robust double/debiased machine learning. Journal of Business & Economic Statistics, 40(3), 1046-1056.

Nie, X., & Wager, S. (2021). Quasi-oracle estimation of heterogeneous treatment effects. Biometrika, 108(2), 299-319.

Oprescu, M., Syrgkanis, V., & Wu, Z. S. (2019, May). Orthogonal random forest for causal inference. In International Conference on Machine Learning (pp. 4932-4941). PMLR.

Robinson, P. M. (1988). Root-N-consistent semiparametric regression. Econometrica: Journal of the Econometric Society, 931-954.

Double machine learning II

Schuler, A., Baiocchi, M., Tibshirani, R., & Shah, N. (2018). A comparison of methods for model selection when estimating individual treatment effects. arXiv preprint arXiv:1804.05146.

Semenova, V., Goldman, M., Chernozhukov, V., & Taddy, M. (2017). Estimation and inference on heterogeneous treatment effects in high-dimensional dynamic panels. arXiv preprint arXiv:1712.09988.

Miscellaneous I

Angrist, J. D., & Pischke, J. S. (2009). Mostly harmless econometrics: An empiricist’s companion. Princeton university press.

Cinelli, C., Forney, A., & Pearl, J. (2020). A crash course in good and bad controls. Sociological Methods & Research, 00491241221099552.

Chipman, H. A., George, E. I., & McCulloch, R. E. (2010). BART: Bayesian additive regression trees.

Hünermund, P., & Bareinboim, E. (2019). Causal inference and data fusion in econometrics. arXiv preprint arXiv:1912.09104.

Hünermund, P., Louw, B., & Caspi, I. (2021). Double Machine Learning and Automated Confounder Selection–A Cautionary Tale. arXiv preprint arXiv:2108.11294.

Imbens, G. W. (2020). Potential outcome and directed acyclic graph approaches to causality: Relevance for empirical practice in economics. Journal of Economic Literature, 58(4), 1129-1179.

Miscellaneous II

Künzel, S. R., Sekhon, J. S., Bickel, P. J., & Yu, B. (2019). Metalearners for estimating heterogeneous treatment effects using machine learning. Proceedings of the national academy of sciences, 116(10), 4156-4165.

Raschka, S., & Mirjalili, V. (2019). Python machine learning: Machine learning and deep learning with Python, scikit-learn, and TensorFlow 2. Packt Publishing Ltd.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1), 267-288.

Thanks for attending the course

To the exercises!