Tae Hyun Kim (Lowell)

Double/Debiased Machine Learning (DML)

Definition

A methodology for performing valid statistical inference on a low-dimensional parameter of interest $\theta_0$ in the presence of a high-dimensional nuisance parameter $\eta_0$.

Two Key Ingredients:

  1. Use of a Neyman-Orthogonal Score
  2. Application of Cross-fitting (sample splitting)

$$\sqrt{N}(\hat{\theta}_{DML} - \theta_0) \xrightarrow{d} N(0, V)$$

Intuitive Understanding

The problem: If you estimate the nuisance parameter with a traditional ML method and plug it in directly:

  • Regularization bias arises
  • Bias arises from overfitting
  • The $\theta$ estimator fails to converge at the $N^{-1/2}$ rate (it is not $\sqrt{N}$-consistent)

DML’s solution:

  1. Neyman-orthogonal score: construct a moment condition that is less sensitive to nuisance-parameter estimation error
  2. Cross-fitting: split the data to remove overfitting bias

Traditional:  η̂ (ML) → plug-in → θ̂ (biased, not √N-consistent)

DML:  Orthogonal score + Cross-fitting → θ̂ (√N-consistent, asymptotically normal)

Key Properties

  • $N^{-1/2}$ convergence rate: the estimator attains the parametric rate even though the nuisance estimates converge more slowly
  • Asymptotic normality: $\sqrt{N}(\hat{\theta}_{DML} - \theta_0)$ converges in distribution to $N(0, V)$
  • Valid inference: standard t-tests and confidence intervals can be used
  • Method agnostic: a variety of ML methods such as Lasso, Random Forest, and Neural Networks can be used
  • High-dimensional: cross-fitting removes the need for traditional complexity constraints (the Donsker property) on the nuisance estimators
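
To make the "valid inference" property concrete, here is a minimal sketch of a Wald test and confidence interval built from the normal limit $N(0, V)$. The function name `wald_inference`, its arguments, and the input numbers are hypothetical and chosen only for illustration; the sketch assumes an estimate of $V$ is already available.

```python
from statistics import NormalDist

def wald_inference(theta_hat, v_hat, n, null_value=0.0, level=0.95):
    """Return (t_stat, p_value, ci) from the approximation
    sqrt(n) * (theta_hat - theta_0) ~ N(0, v_hat)."""
    se = (v_hat / n) ** 0.5                       # standard error of theta_hat
    z = NormalDist()                              # standard normal
    t_stat = (theta_hat - null_value) / se
    p_value = 2.0 * (1.0 - z.cdf(abs(t_stat)))    # two-sided p-value
    crit = z.inv_cdf(0.5 + level / 2.0)           # about 1.96 for level = 0.95
    ci = (theta_hat - crit * se, theta_hat + crit * se)
    return t_stat, p_value, ci

# Made-up inputs: estimate 0.5, variance estimate 1.0, sample size 400.
t_stat, p_value, ci = wald_inference(theta_hat=0.5, v_hat=1.0, n=400)
print(round(t_stat, 2), tuple(round(c, 3) for c in ci))
```

With these inputs the standard error is 0.05, so the 95% interval is roughly 0.5 ± 0.098.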

Algorithm

DML1 (Averaging)

Estimate $\theta$ separately on each fold, then average:

$$\tilde{\theta}_0 = \frac{1}{K}\sum_{k=1}^K \check{\theta}_{0,k}$$

DML2 (Pooling)

Solve the aggregated estimating equation:

$$\frac{1}{K}\sum_{k=1}^K E_{n,k}[\psi(W; \tilde{\theta}_0, \hat{\eta}_{0,k})] = 0$$

DML2 tends to perform better in small samples
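
The difference between the two aggregation schemes can be sketched in a few lines. This example assumes a score that is linear in $\theta$, of the residual form $(\tilde{Y} - \theta\tilde{D})\tilde{D}$, and takes the held-out residuals for each fold as given. The helper names and the toy numbers are made up for illustration.

```python
def fold_moments(y_res, d_res):
    """Fold-level sample means of d*y and d*d for the score (y - theta*d) * d."""
    n = len(y_res)
    s_dy = sum(di * yi for di, yi in zip(d_res, y_res)) / n
    s_dd = sum(di * di for di in d_res) / n
    return s_dy, s_dd

def dml1(folds):
    """DML1: solve the score equation on each fold, then average the K solutions."""
    estimates = []
    for y_res, d_res in folds:
        s_dy, s_dd = fold_moments(y_res, d_res)
        estimates.append(s_dy / s_dd)
    return sum(estimates) / len(estimates)

def dml2(folds):
    """DML2: average the fold-level score equations first, then solve once."""
    moments = [fold_moments(y_res, d_res) for y_res, d_res in folds]
    total_dy = sum(m[0] for m in moments)
    total_dd = sum(m[1] for m in moments)
    return total_dy / total_dd

# Toy held-out residuals for K = 2 folds, each given as (y_res, d_res).
folds = [
    ([1.0, 2.0], [1.0, 1.0]),
    ([3.0, 0.0], [1.0, 3.0]),
]
print(round(dml1(folds), 3))  # 0.9
print(round(dml2(folds), 3))  # 0.5
```

In this toy case the fold-level estimates are 1.5 and 0.3. DML1 weights them equally, while DML2 implicitly gives more weight to the fold with more variation in the treatment residual, which is one way to see why the two can differ in small samples.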

Example: Partially Linear Regression

Model:

$$Y = D\theta_0 + g_0(X) + U, \quad E[U|X,D] = 0$$

$$D = m_0(X) + V, \quad E[V|X] = 0$$

Orthogonal Score:

$$\psi(W; \theta, \eta) = (Y - D\theta - g(X))(D - m(X))$$
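
As a quick check of why this score is Neyman-orthogonal, write $\eta = (g, m)$ and perturb the true nuisance $\eta_0 = (g_0, m_0)$ in an arbitrary direction $(\delta_g, \delta_m)$. The direction functions $\delta_g, \delta_m$ and the scalar $r$ are notation introduced here for the sketch only. Differentiating the expected score at $r = 0$ gives

$$\frac{\partial}{\partial r} E\big[\psi(W; \theta_0, \eta_0 + r(\delta_g, \delta_m))\big]\Big|_{r=0} = -E[\delta_g(X)(D - m_0(X))] - E[(Y - D\theta_0 - g_0(X))\,\delta_m(X)]$$

$$= -E[\delta_g(X)\,V] - E[U\,\delta_m(X)] = 0,$$

where the last equality uses $E[V|X] = 0$ and $E[U|X,D] = 0$ from the model. First-order errors in $\hat{g}$ and $\hat{m}$ therefore do not move the moment condition, which is the sense in which the score is insensitive to nuisance estimation error.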

Algorithm:

  1. Split data into K folds
  2. For each fold k:
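    • Estimate $\hat{g}(X)$ and $\hat{m}(X)$ on the other folds using ML
    • Compute residuals on fold k: $\tilde{Y} = Y - \hat{g}(X)$, $\tilde{D} = D - \hat{m}(X)$
  3. Estimate $\theta$ by regressing $\tilde{Y}$ on $\tilde{D}$

Note: the residual-on-residual regression in step 3 corresponds to the partialling-out variant, in which $\hat{g}$ estimates $E[Y|X]$. If $\hat{g}$ instead estimates $g_0$ as in the score above, setting the sample score to zero gives $\hat{\theta} = \sum \tilde{D}(Y - \hat{g}(X)) / \sum \tilde{D} D$.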
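
The steps above can be sketched end to end on simulated data. This is a minimal illustration under several assumptions that are not in the note: the outcome nuisance is taken to be $E[Y|X]$ (the partialling-out variant), a least-squares polynomial in a scalar covariate stands in for the ML learner, the data-generating process is invented with $\theta_0 = 0.5$, and the function names `poly_fit_predict` and `dml_plr` are hypothetical.

```python
import numpy as np

def poly_fit_predict(x_train, y_train, x_test, degree=5):
    """Stand-in nuisance learner: least-squares polynomial in a scalar covariate."""
    coefs = np.polyfit(x_train, y_train, degree)
    return np.polyval(coefs, x_test)

def dml_plr(y, d, x, n_folds=5, seed=0):
    """Cross-fitted partialling-out estimate of theta with a plug-in standard error."""
    n = len(y)
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), n_folds)
    y_res = np.empty(n)
    d_res = np.empty(n)
    for test_idx in folds:
        train_mask = np.ones(n, dtype=bool)
        train_mask[test_idx] = False
        # Fit nuisances on the other folds, predict on the held-out fold only.
        y_hat = poly_fit_predict(x[train_mask], y[train_mask], x[test_idx])
        d_hat = poly_fit_predict(x[train_mask], d[train_mask], x[test_idx])
        y_res[test_idx] = y[test_idx] - y_hat
        d_res[test_idx] = d[test_idx] - d_hat
    # Pooled residual-on-residual regression without intercept (DML2-style).
    theta = np.sum(d_res * y_res) / np.sum(d_res ** 2)
    # Plug-in variance for the linear score psi = (y_res - theta * d_res) * d_res.
    psi = (y_res - theta * d_res) * d_res
    j = np.mean(d_res ** 2)
    se = np.sqrt(np.mean(psi ** 2) / j ** 2 / n)
    return theta, se

# Simulated partially linear model (invented for this sketch).
rng = np.random.default_rng(42)
n = 2000
x = rng.uniform(-1.0, 1.0, n)
d = np.sin(2.0 * x) + rng.normal(size=n)             # m_0(x) = sin(2x)
y = 0.5 * d + np.cos(3.0 * x) + rng.normal(size=n)   # theta_0 = 0.5, g_0(x) = cos(3x)

theta_hat, se = dml_plr(y, d, x)
ci = (theta_hat - 1.96 * se, theta_hat + 1.96 * se)
print(f"theta_hat = {theta_hat:.3f}, se = {se:.3f}, 95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
```

Swapping `poly_fit_predict` for any other regression routine with the same fit-then-predict shape leaves the rest of the procedure unchanged, which is the "method agnostic" property in practice.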

Rate Conditions

For DML to work, the nuisance estimators must converge fast enough:

$$\|\hat{g} - g_0\| \cdot \|\hat{m} - m_0\| = o_P(N^{-1/2})$$

e.g., it suffices that each estimation error is $o_P(N^{-1/4})$, i.e. each estimator converges faster than $N^{-1/4}$
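
A short worked check of the arithmetic, taking the product condition above at face value and assuming (for illustration only) that both errors decay polynomially with exponents $a$ and $b$:

$$\|\hat{g} - g_0\| = O_P(N^{-a}), \quad \|\hat{m} - m_0\| = O_P(N^{-b}) \;\Rightarrow\; \|\hat{g} - g_0\| \cdot \|\hat{m} - m_0\| = O_P(N^{-(a+b)})$$

so the condition holds whenever $a + b > 1/2$. Equal rates slightly faster than $N^{-1/4}$ satisfy it, and so does an unbalanced pair such as $a = 0.4$, $b = 0.15$: under the product form, a more accurate estimate of one nuisance function can compensate for a less accurate estimate of the other.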

Related Concepts

  • Neyman-Orthogonal Score - the core theoretical tool of DML
  • Cross-fitting - sample splitting to remove overfitting bias
  • Partially Linear Model - the canonical application of DML
  • CATE (conditional average treatment effect) - a heterogeneous treatment effect that can be estimated with DML
  • Doubly Robust Estimator - similar robustness to nuisance estimation error

Applications

  • Treatment effect estimation with high-dimensional controls
  • Instrumental variables with many instruments
  • Structural parameter estimation in complex models
  • Policy evaluation with rich covariate sets
  • Personalized pricing with customer features

Advantages vs Limitations

| Advantages | Limitations |
|---|---|
| $\sqrt{N}$-consistent | Requires a rate condition |
| Valid inference | Computationally intensive (multiple splits) |
| Any ML method can be used | Lack of guidance on choosing the ML method |
| Allows high-dimensional nuisance | Variable finite-sample performance |

References

  • chernozhukovDoubleDebiasedMachine2018 - Original DML paper
  • kennedyOptimalDoublyRobust2023 - Related doubly robust methods
