Compressed Double Machine Learning
DuckDML implements a compressed leave-one-out estimator for a partially linear model with discrete controls:
\[ Y_i = W_i'\beta + g(X_i) + \varepsilon_i. \]
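Here \(W_i\) is a vector of one or more treatment variables and \(X_i\) is a vector of discrete controls. Each distinct value of \(X_i\) defines a group \(g \in \mathcal{G}\) containing \(N_g\) observations; the group index \(g\) is separate from the nuisance function \(g(\cdot)\) in the model equation.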
Constructor
DuckDML(
db_name: str,
table_name: str,
outcome_var: str,
treatment_var: str | list[str],
discrete_covars: list[str],
seed: int,
n_bootstraps: int = 200,
)
The implementation accepts either a single treatment name or a list of treatment names. Internally it stores self.treatment_vars as a list.
Leave-One-Out Residualization
For a variable \(V_i\), define the leave-one-out group mean
\[ \hat{m}_{v,-i}(X_i) = \frac{1}{N_g - 1} \sum_{\substack{j \in g\\j \ne i}} V_j. \]
The residual is
\[ \tilde{V}_i = V_i - \hat{m}_{v,-i}(X_i) = \frac{N_g V_i - S_V^{(g)}}{N_g - 1}, \]
where \(S_V^{(g)} = \sum_{j \in g} V_j\).
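As a quick numerical check of the identity above, the following minimal sketch computes the residuals for one small group in two ways: directly from the mean of the other members, and from the closed form that needs only \(N_g\) and \(S_V^{(g)}\). The function names and the example values are illustrative assumptions, not part of DuckDML.

```python
def loo_residuals_direct(values):
    """Residualize each value against the mean of the *other* group members."""
    n = len(values)
    if n < 2:
        raise ValueError("leave-one-out mean is undefined for a singleton group")
    residuals = []
    for i, v in enumerate(values):
        others = [values[j] for j in range(n) if j != i]
        residuals.append(v - sum(others) / (n - 1))
    return residuals


def loo_residuals_closed_form(values):
    """Same residuals, using only the group size and the group sum."""
    n = len(values)
    if n < 2:
        raise ValueError("leave-one-out mean is undefined for a singleton group")
    total = sum(values)
    return [(n * v - total) / (n - 1) for v in values]


# Hypothetical group of three observations.
group = [2.0, 4.0, 9.0]
direct = loo_residuals_direct(group)
closed = loo_residuals_closed_form(group)
print(direct)
print(closed)
```

Both calls return the same list, which is why the group size and group sum are sufficient for residualization.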
The target estimator is the OLS regression of leave-one-out residualized \(Y\) on leave-one-out residualized \(W\):
\[ \hat{\beta} = \left(\sum_i \tilde{W}_i\tilde{W}_i'\right)^{-1} \left(\sum_i \tilde{W}_i\tilde{Y}_i\right). \]
Compressed Algebra
For group \(g\), define
\[ S_W^{(g)} = \sum_{i \in g} W_i,\qquad S_Y^{(g)} = \sum_{i \in g} Y_i, \]
\[ S_{WW}^{(g)} = \sum_{i \in g} W_i W_i', \qquad S_{WY}^{(g)} = \sum_{i \in g} W_i Y_i. \]
Then
\[ \sum_{i \in g} \tilde{W}_i\tilde{W}_i' = \frac{N_g}{(N_g - 1)^2} \left[ N_g S_{WW}^{(g)} - S_W^{(g)}S_W^{(g)'} \right], \]
and
\[ \sum_{i \in g} \tilde{W}_i\tilde{Y}_i = \frac{N_g}{(N_g - 1)^2} \left[ N_g S_{WY}^{(g)} - S_W^{(g)}S_Y^{(g)} \right]. \]
So DuckDML.compress_data() only needs grouped counts, sums, and cross-products. The final estimator sums these matrices over groups and solves one small linear system.
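The equivalence between the row-level and compressed estimators can be checked numerically. The sketch below covers the scalar-treatment case with the standard library only. The simulated data, the dictionary keys, and all function names are assumptions made for illustration; this is not the code inside DuckDML.compress_data().

```python
import random


def simulate_groups(n_groups=50, seed=0):
    """Hypothetical data: Y = 2*W + group effect + noise, group sizes 2 to 6."""
    rng = random.Random(seed)
    groups = []
    for _ in range(n_groups):
        effect = rng.gauss(0.0, 1.0)
        rows = []
        for _ in range(rng.randint(2, 6)):
            w = effect + rng.gauss(0.0, 1.0)
            y = 2.0 * w + effect + rng.gauss(0.0, 1.0)
            rows.append((w, y))
        groups.append(rows)
    return groups


def beta_row_level(groups):
    """OLS of leave-one-out residualized Y on residualized W, row by row."""
    num = den = 0.0
    for rows in groups:
        n = len(rows)
        for i, (w, y) in enumerate(rows):
            w_bar = sum(r[0] for j, r in enumerate(rows) if j != i) / (n - 1)
            y_bar = sum(r[1] for j, r in enumerate(rows) if j != i) / (n - 1)
            num += (w - w_bar) * (y - y_bar)
            den += (w - w_bar) ** 2
    return num / den


def compress(groups):
    """Reduce each group to its count, sums, and cross-products."""
    return [
        {
            "n": len(rows),
            "s_w": sum(w for w, _ in rows),
            "s_y": sum(y for _, y in rows),
            "s_ww": sum(w * w for w, _ in rows),
            "s_wy": sum(w * y for w, y in rows),
        }
        for rows in groups
    ]


def beta_compressed(stats):
    """Same estimator, computed from the grouped sufficient statistics."""
    num = den = 0.0
    for s in stats:
        n = s["n"]
        scale = n / (n - 1) ** 2
        den += scale * (n * s["s_ww"] - s["s_w"] ** 2)
        num += scale * (n * s["s_wy"] - s["s_w"] * s["s_y"])
    return num / den


groups = simulate_groups()
beta_rows = beta_row_level(groups)
beta_comp = beta_compressed(compress(groups))
print(beta_rows, beta_comp)
```

The two estimates agree up to floating-point error. With several treatments the scalars s_ww and s_w ** 2 become matrices and the final division becomes a linear solve.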
SQL Sufficient Statistics
For treatment variables X1, X2 and outcome Y, grouped by town_id and day_id, the aggregation includes:
SELECT
town_id,
day_id,
COUNT(*) AS n_g,
SUM(Y) AS sum_y,
SUM(POW(Y, 2)) AS sum_y_sq,
SUM(X1) AS sum_X1,
SUM(Y * X1) AS sum_Y_X1,
SUM(X2) AS sum_X2,
SUM(Y * X2) AS sum_Y_X2,
SUM(X1 * X1) AS sum_X1_X1,
SUM(X1 * X2) AS sum_X1_X2,
SUM(X2 * X2) AS sum_X2_X2
FROM data
GROUP BY town_id, day_id
HAVING COUNT(*) > 1
Singleton groups are dropped because leave-one-out residualization is undefined when \(N_g = 1\).
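The aggregation follows a regular pattern, so it can be generated for any list of treatments. The helper below is a hypothetical sketch that reproduces the query shape shown above. Its name and signature are assumptions and it is not part of the DuckDML API. Column names are interpolated unquoted, so it assumes trusted identifiers.

```python
def build_compression_sql(outcome, treatments, group_cols, table):
    """Hypothetical helper: build the grouped sufficient-statistics query."""
    select = list(group_cols)
    select += [
        "COUNT(*) AS n_g",
        f"SUM({outcome}) AS sum_y",
        f"SUM(POW({outcome}, 2)) AS sum_y_sq",
    ]
    # First moments and outcome cross-products, one pair per treatment.
    for w in treatments:
        select.append(f"SUM({w}) AS sum_{w}")
        select.append(f"SUM({outcome} * {w}) AS sum_{outcome}_{w}")
    # Upper triangle of the treatment cross-product matrix.
    for i, a in enumerate(treatments):
        for b in treatments[i:]:
            select.append(f"SUM({a} * {b}) AS sum_{a}_{b}")
    return (
        "SELECT\n  " + ",\n  ".join(select)
        + f"\nFROM {table}"
        + f"\nGROUP BY {', '.join(group_cols)}"
        + "\nHAVING COUNT(*) > 1"
    )


sql = build_compression_sql("Y", ["X1", "X2"], ["town_id", "day_id"], "data")
print(sql)
```

With k treatments the query carries 2k first-moment columns plus k(k+1)/2 cross-product columns, so the compressed table stays narrow for small k.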
Usage
Example from notebooks/duckdml.ipynb:
from duckreg.estimators import DuckDML
dml_model = DuckDML(
db_name="dml_example.db",
table_name="data",
outcome_var="Y",
treatment_var="X",
discrete_covars=["town_id", "day_id"],
seed=42,
n_bootstraps=500,
)
dml_model.fit()
results = dml_model.summary()
Multivariate treatment example from the tests:
dml = DuckDML(
db_name="test_dml_pytest.db",
table_name="data",
outcome_var="Y",
treatment_var=["X1", "X2"],
discrete_covars=["town_id", "day_id"],
seed=42,
n_bootstraps=10,
)
dml.fit()
dml.summary()
Inference
fit_vcov() computes an analytic HC1-style sandwich using compressed residual cross-products. The default bootstrap() resamples compressed groups, which treats the discrete-control cells as the resampling clusters.
Use the bootstrap when the grouped structure is the natural independent sampling level. Use the analytic covariance when the compressed moments are an acceptable approximation to the desired robust variance.
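To make "resampling compressed groups" concrete, here is a minimal sketch for a single treatment: whole groups are drawn with replacement from the compressed table and the estimator is recomputed on each draw. This only illustrates the idea and is not DuckDML's bootstrap() implementation. The tuple layout, function names, and toy data are assumptions.

```python
import random
import statistics


def compress(groups):
    """Reduce each group of (w, y) rows to (n, sum_w, sum_y, sum_ww, sum_wy)."""
    return [
        (
            len(rows),
            sum(w for w, _ in rows),
            sum(y for _, y in rows),
            sum(w * w for w, _ in rows),
            sum(w * y for w, y in rows),
        )
        for rows in groups
    ]


def beta_from_stats(stats):
    """Leave-one-out estimator computed from compressed group statistics."""
    num = den = 0.0
    for n, s_w, s_y, s_ww, s_wy in stats:
        scale = n / (n - 1) ** 2
        den += scale * (n * s_ww - s_w ** 2)
        num += scale * (n * s_wy - s_w * s_y)
    return num / den


def group_bootstrap_se(stats, n_bootstraps=200, seed=42):
    """Resample whole groups with replacement; return the std. dev. of beta."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_bootstraps):
        sample = rng.choices(stats, k=len(stats))
        draws.append(beta_from_stats(sample))
    return statistics.stdev(draws)


# Hypothetical toy data: five groups, each with at least two rows.
groups = [
    [(0.0, 1.0), (1.0, 3.0), (2.0, 4.0)],
    [(1.0, 0.0), (3.0, 5.0)],
    [(0.0, 2.0), (2.0, 5.0), (1.0, 4.0), (3.0, 9.0)],
    [(2.0, 3.0), (5.0, 8.0)],
    [(1.0, 1.0), (1.0, 2.0), (4.0, 9.0)],
]
stats = compress(groups)
beta_hat = beta_from_stats(stats)
se_hat = group_bootstrap_se(stats)
print(beta_hat, se_hat)
```

Because each draw touches only the compressed rows, the cost per replication scales with the number of groups rather than the number of observations.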