Compressed Double Machine Learning

DuckDML implements a compressed leave-one-out estimator for a partially linear model with discrete controls:

\[ Y_i = W_i'\beta + g(X_i) + \varepsilon_i. \]

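Here \(W_i\) is a vector of one or more treatment variables and \(X_i\) is a vector of discrete controls. Each distinct value of \(X_i\) defines a group (cell) \(g \in \mathcal{G}\); this group index is unrelated to the nuisance function \(g(\cdot)\) in the model above.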

Constructor

DuckDML(
    db_name: str,
    table_name: str,
    outcome_var: str,
    treatment_var: str | list[str],
    discrete_covars: list[str],
    seed: int,
    n_bootstraps: int = 200,
)

treatment_var accepts either a single treatment name or a list of treatment names. Either way, the names are stored internally as a list in self.treatment_vars.

Leave-One-Out Residualization

For a variable \(V_i\) and an observation \(i\) in group \(g\), which has \(N_g\) members, define the leave-one-out group mean

\[ \hat{m}_{v,-i}(X_i) = \frac{1}{N_g - 1} \sum_{\substack{j \in g\\j \ne i}} V_j. \]

The residual is

\[ \tilde{V}_i = V_i - \hat{m}_{v,-i}(X_i) = \frac{N_g V_i - S_V^{(g)}}{N_g - 1}, \]

where \(S_V^{(g)} = \sum_{j \in g} V_j\).
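The closed form above can be checked against the explicit leave-one-out average on a toy dataset. The following is a minimal sketch in plain Python, not DuckDML's code; the function names `loo_residuals` and `loo_residuals_direct` and the sample data are assumptions made for illustration.

```python
from collections import defaultdict


def loo_residuals(values, groups):
    """Leave-one-out residuals via the closed form (N_g * V_i - S_V) / (N_g - 1)."""
    count = defaultdict(int)
    total = defaultdict(float)
    for v, g in zip(values, groups):
        count[g] += 1
        total[g] += v
    # Singleton groups have no leave-one-out mean, so they yield None here.
    return [
        (count[g] * v - total[g]) / (count[g] - 1) if count[g] > 1 else None
        for v, g in zip(values, groups)
    ]


def loo_residuals_direct(values, groups):
    """Same residuals, computed by explicitly averaging the other group members."""
    out = []
    for i, (v, g) in enumerate(zip(values, groups)):
        others = [u for j, (u, h) in enumerate(zip(values, groups)) if h == g and j != i]
        out.append(v - sum(others) / len(others) if others else None)
    return out


# Hypothetical toy data: groups of size 3, 2, and 1.
values = [1.0, 2.0, 6.0, 4.0, 8.0, 5.0]
groups = ["a", "a", "a", "b", "b", "c"]

closed = loo_residuals(values, groups)
direct = loo_residuals_direct(values, groups)
print(closed)  # [-3.0, -1.5, 4.5, -4.0, 4.0, None]
```

The closed form needs only each group's count and sum, which is what makes the compressed representation possible.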

The target estimator is the OLS regression of leave-one-out residualized \(Y\) on leave-one-out residualized \(W\):

\[ \hat{\beta} = \left(\sum_i \tilde{W}_i\tilde{W}_i'\right)^{-1} \left(\sum_i \tilde{W}_i\tilde{Y}_i\right). \]

Compressed Algebra

For group \(g\), define

\[ S_W^{(g)} = \sum_{i \in g} W_i,\qquad S_Y^{(g)} = \sum_{i \in g} Y_i, \]

\[ S_{WW}^{(g)} = \sum_{i \in g} W_i W_i', \qquad S_{WY}^{(g)} = \sum_{i \in g} W_i Y_i. \]

Then

\[ \sum_{i \in g} \tilde{W}_i\tilde{W}_i' = \frac{N_g}{(N_g - 1)^2} \left[ N_g S_{WW}^{(g)} - S_W^{(g)}S_W^{(g)'} \right], \]

and

\[ \sum_{i \in g} \tilde{W}_i\tilde{Y}_i = \frac{N_g}{(N_g - 1)^2} \left[ N_g S_{WY}^{(g)} - S_W^{(g)}S_Y^{(g)} \right]. \]

So DuckDML.compress_data() only needs, for each group, the count \(N_g\), the sums, and the cross-products. The final estimator adds these per-group matrices over all groups and solves one linear system whose dimension equals the number of treatment variables.
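To make the identities above concrete, the sketch below computes \(\hat{\beta}\) for a single scalar treatment twice: once from grouped sufficient statistics and once row by row from leave-one-out residuals. It is an illustration under simplifying assumptions (scalar \(W\), plain Python, made-up data); the helper names `compress`, `beta_from_compressed`, and `beta_direct` are hypothetical and are not part of DuckDML.

```python
from collections import defaultdict
import math


def compress(w, y, groups):
    """Per-group sufficient statistics: [n, sum_w, sum_y, sum_ww, sum_wy]."""
    stats = defaultdict(lambda: [0, 0.0, 0.0, 0.0, 0.0])
    for wi, yi, g in zip(w, y, groups):
        s = stats[g]
        s[0] += 1
        s[1] += wi
        s[2] += yi
        s[3] += wi * wi
        s[4] += wi * yi
    # Drop singleton groups, mirroring HAVING COUNT(*) > 1.
    return {g: s for g, s in stats.items() if s[0] > 1}


def beta_from_compressed(stats):
    """Scalar-treatment estimator built only from the grouped statistics."""
    num = den = 0.0
    for n, sw, sy, sww, swy in stats.values():
        scale = n / (n - 1) ** 2
        den += scale * (n * sww - sw * sw)
        num += scale * (n * swy - sw * sy)
    return num / den


def beta_direct(w, y, groups):
    """Row-level reference: OLS of leave-one-out residualized y on residualized w."""
    num = den = 0.0
    for i, g in enumerate(groups):
        idx = [j for j, h in enumerate(groups) if h == g and j != i]
        if not idx:
            continue  # singleton group: residual undefined
        wt = w[i] - sum(w[j] for j in idx) / len(idx)
        yt = y[i] - sum(y[j] for j in idx) / len(idx)
        num += wt * yt
        den += wt * wt
    return num / den


# Hypothetical toy data.
w = [1.0, 2.0, 4.0, 0.0, 3.0, 7.0]
y = [2.0, 5.0, 8.0, 1.0, 6.0, 9.0]
groups = ["a", "a", "a", "b", "b", "c"]

b_compressed = beta_from_compressed(compress(w, y, groups))
b_direct = beta_direct(w, y, groups)
```

The two values agree up to floating-point error. With several treatments, the scalar products become outer products and the final division becomes a linear solve.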

SQL Sufficient Statistics

For treatment variables X1, X2 and outcome Y, grouped by town_id and day_id, the aggregation includes:

SELECT
  town_id,
  day_id,
  COUNT(*) AS n_g,
  SUM(Y) AS sum_y,
  SUM(POW(Y, 2)) AS sum_y_sq,
  SUM(X1) AS sum_X1,
  SUM(Y * X1) AS sum_Y_X1,
  SUM(X2) AS sum_X2,
  SUM(Y * X2) AS sum_Y_X2,
  SUM(X1 * X1) AS sum_X1_X1,
  SUM(X1 * X2) AS sum_X1_X2,
  SUM(X2 * X2) AS sum_X2_X2
FROM data
GROUP BY town_id, day_id
HAVING COUNT(*) > 1

Singleton groups are dropped because leave-one-out residualization is undefined when \(N_g = 1\).
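The column pattern in the query above extends to any number of treatments: one sum and one outcome cross-product per treatment, plus one term for each unordered pair of treatments. The sketch below assembles such a query as a string. The helper `build_compression_sql` is hypothetical and only mirrors the column naming shown above; it is not claimed to be how DuckDML generates its SQL. Identifiers are interpolated without quoting, so it is only suitable for trusted column names.

```python
from itertools import combinations_with_replacement


def build_compression_sql(table, outcome, treatments, covars):
    """Assemble a grouped sufficient-statistics query (illustrative helper)."""
    cols = list(covars)
    cols += [
        "COUNT(*) AS n_g",
        f"SUM({outcome}) AS sum_y",
        f"SUM(POW({outcome}, 2)) AS sum_y_sq",
    ]
    # First moments of each treatment and its cross-product with the outcome.
    for t in treatments:
        cols.append(f"SUM({t}) AS sum_{t}")
        cols.append(f"SUM({outcome} * {t}) AS sum_{outcome}_{t}")
    # Upper triangle of the treatment cross-product matrix.
    for a, b in combinations_with_replacement(treatments, 2):
        cols.append(f"SUM({a} * {b}) AS sum_{a}_{b}")
    select = ",\n  ".join(cols)
    by = ", ".join(covars)
    return f"SELECT\n  {select}\nFROM {table}\nGROUP BY {by}\nHAVING COUNT(*) > 1"


sql = build_compression_sql("data", "Y", ["X1", "X2"], ["town_id", "day_id"])
print(sql)
```

For \(k\) treatments, this yields \(2 + 2k + k(k+1)/2\) aggregated sums per group, so the compressed table stays narrow even when the raw table has many rows.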

Usage

Example from notebooks/duckdml.ipynb:

from duckreg.estimators import DuckDML

dml_model = DuckDML(
    db_name="dml_example.db",
    table_name="data",
    outcome_var="Y",
    treatment_var="X",
    discrete_covars=["town_id", "day_id"],
    seed=42,
    n_bootstraps=500,
)
dml_model.fit()
results = dml_model.summary()

Multivariate treatment example from the tests:

from duckreg.estimators import DuckDML

dml = DuckDML(
    db_name="test_dml_pytest.db",
    table_name="data",
    outcome_var="Y",
    treatment_var=["X1", "X2"],
    discrete_covars=["town_id", "day_id"],
    seed=42,
    n_bootstraps=10,
)
dml.fit()
dml.summary()

Inference

fit_vcov() computes an analytic HC1-style sandwich using compressed residual cross-products. The default bootstrap() resamples compressed groups, which treats the discrete-control cells as the resampling clusters.

Use the bootstrap when the discrete-control cells are the natural level at which observations are sampled independently. Use the analytic covariance when a variance built from the compressed moments is an acceptable approximation to the robust variance you want.
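Resampling compressed groups is cheap because each group contributes a fixed numerator and denominator term to \(\hat{\beta}\). The sketch below illustrates the idea for a scalar treatment in plain Python. It is not DuckDML's bootstrap() implementation; the function names, the tuple layout `(n, sum_w, sum_y, sum_ww, sum_wy)`, and the data are assumptions for illustration.

```python
import random
import statistics


def group_terms(stats):
    """Per-group (numerator, denominator) contributions for a scalar treatment."""
    terms = []
    for n, sw, sy, sww, swy in stats:
        scale = n / (n - 1) ** 2
        terms.append((scale * (n * swy - sw * sy), scale * (n * sww - sw * sw)))
    return terms


def bootstrap_betas(stats, n_bootstraps, seed):
    """Resample whole groups with replacement and recompute beta each time."""
    rng = random.Random(seed)
    terms = group_terms(stats)
    draws = []
    for _ in range(n_bootstraps):
        sample = rng.choices(terms, k=len(terms))
        num = sum(t[0] for t in sample)
        den = sum(t[1] for t in sample)
        draws.append(num / den)
    return draws


# Hypothetical compressed rows: (n, sum_w, sum_y, sum_ww, sum_wy), singletons already dropped.
stats = [
    (3, 7.0, 15.0, 21.0, 44.0),
    (2, 3.0, 7.0, 9.0, 18.0),
    (4, 10.0, 20.0, 30.0, 57.0),
]

draws = bootstrap_betas(stats, n_bootstraps=200, seed=42)
boot_se = statistics.stdev(draws)  # bootstrap standard error of beta
```

No row-level data is touched inside the loop, so the cost per replicate scales with the number of groups rather than the number of observations. With only a handful of groups, as in this toy input, the resulting standard error is not meaningful; the example only shows the mechanics.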