Inference and Variance Estimation

Inference in duckreg depends on whether the compressed table preserves the score contributions needed for the target covariance estimator.

Analytic HC1 For Linear Models

For DuckRegression.fit_vcov(), the covariance has sandwich form:

\[ \hat{V} = \frac{N}{N-k} (X'WX)^{-1} \left(\sum_g \operatorname{RSS}_g x_gx_g'\right) (X'WX)^{-1}. \]

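Here \(x_g\) is the covariate vector shared by every row in compressed cell \(g\), \(n_g\) is the cell's row count, \(W = \operatorname{diag}(n_g)\), \(N = \sum_g n_g\), and \(k\) is the number of coefficients. With \(\hat{y}_g = x_g'\hat{\beta}\), the grouped residual sum of squares is computed exactly from the compressed sufficient statistics: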

\[ \operatorname{RSS}_g = n_g \hat{y}_g^2 - 2\hat{y}_g \sum_{i \in g} y_i + \sum_{i \in g} y_i^2. \]

The last term requires sum_y_sq, the per-cell sum of squared outcomes, which the compression query stores.
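As a check on the algebra above, the following NumPy sketch compresses a simulated data set, builds the HC1 sandwich from the compressed rows alone, and compares it with the same estimator computed from the raw rows. It does not call duckreg; the simulated data and all variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated raw data (illustrative): intercept plus two discrete regressors,
# with heteroskedastic noise so the robust covariance is non-trivial.
N = 500
x1 = rng.integers(0, 3, size=N)
x2 = rng.integers(0, 2, size=N)
y = 1.0 + 0.5 * x1 - 0.3 * x2 + rng.normal(size=N) * (1 + x1)
X = np.column_stack([np.ones(N), x1, x2]).astype(float)
k = X.shape[1]

# Compress: one row per distinct covariate cell with n_g, sum_y, sum_y_sq.
cells, inverse = np.unique(X, axis=0, return_inverse=True)
inverse = inverse.reshape(-1)  # some NumPy versions return a 2-D inverse
G = cells.shape[0]
n_g = np.bincount(inverse, minlength=G).astype(float)
sum_y = np.bincount(inverse, weights=y, minlength=G)
sum_y_sq = np.bincount(inverse, weights=y**2, minlength=G)

# Weighted least squares on the compressed rows reproduces raw-data OLS.
XtWX = cells.T @ (cells * n_g[:, None])
beta = np.linalg.solve(XtWX, cells.T @ sum_y)

# Grouped RSS from sufficient statistics only.
yhat_g = cells @ beta
rss_g = n_g * yhat_g**2 - 2 * yhat_g * sum_y + sum_y_sq

# HC1 sandwich built from compressed rows.
bread = np.linalg.inv(XtWX)
meat = cells.T @ (cells * rss_g[:, None])
V_compressed = N / (N - k) * bread @ meat @ bread

# Reference: the same estimator computed row by row on the raw data.
resid = y - X @ beta
bread_raw = np.linalg.inv(X.T @ X)
meat_raw = X.T @ (X * (resid**2)[:, None])
V_raw = N / (N - k) * bread_raw @ meat_raw @ bread_raw
```

The two matrices agree up to floating-point error because every row in a cell shares the same \(x_g\), so the row-level sum of squared residuals in the meat collapses to \(\operatorname{RSS}_g\).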

Cluster Bootstrap

The implemented cluster bootstrap resamples whole clusters with replacement, recompresses, and refits. This path is the natural choice when the compression used for point estimation does not preserve all cluster-level score contributions.

For a cluster column \(c\), the bootstrap turns the list of drawn cluster ids into per-cluster multiplicities:

WITH resampled AS (
  SELECT cluster_id, COUNT(*) AS mult
  FROM (SELECT unnest(?) AS cluster_id)
  GROUP BY cluster_id
)

and then joins those multiplicities against grouped data.
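The same two steps can be sketched in plain Python. This is not duckreg's implementation: the helper names `draw_cluster_multiplicities` and `reweight`, the row layout, and the choice to scale the additive statistics by the multiplicity are all assumptions made for illustration.

```python
import random
from collections import Counter


def draw_cluster_multiplicities(cluster_ids, seed=None):
    """Draw as many clusters as there are distinct ids, with replacement,
    and count how often each id was drawn (mirrors the GROUP BY / COUNT(*))."""
    rng = random.Random(seed)
    unique_ids = sorted(set(cluster_ids))
    draws = rng.choices(unique_ids, k=len(unique_ids))
    return Counter(draws)


def reweight(compressed_rows, mult):
    """Inner-join multiplicities onto grouped rows (hypothetical layout).

    Assumes each row carries a cluster_id plus additive statistics, so a
    cluster drawn m times contributes m copies of its count and sum_y.
    """
    out = []
    for row in compressed_rows:
        m = mult.get(row["cluster_id"], 0)
        if m == 0:
            continue  # cluster was not drawn in this replicate
        out.append({**row, "count": row["count"] * m, "sum_y": row["sum_y"] * m})
    return out


# Illustrative grouped rows keyed by covariate cell and cluster.
rows = [
    {"x1": 0, "cluster_id": "a", "count": 3, "sum_y": 1.5},
    {"x1": 1, "cluster_id": "a", "count": 2, "sum_y": 4.0},
    {"x1": 0, "cluster_id": "b", "count": 5, "sum_y": 2.5},
]

mult = draw_cluster_multiplicities([r["cluster_id"] for r in rows], seed=0)
replicate = reweight(rows, mult)
```

Each bootstrap replicate would then be recompressed over the covariate cells and refit, as described above.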

Analytic Cluster-Robust Scores

If the compressed rows preserve cluster identity, analytic cluster-robust covariance can be computed from compressed scores. The standard cluster-robust variance is

\[ \hat{V}_{CR} = (X'X)^{-1} \left( \sum_{c=1}^C u_cu_c' \right) (X'X)^{-1}, \]

where

\[ u_c = \sum_{i \in c} x_i \hat{\varepsilon}_i. \]

If each compressed stratum \(g\) belongs to exactly one cluster \(c\), then

\[ u_c = \sum_{g \in c} x_g \left( \sum_{i \in g} y_i - n_gx_g'\hat{\beta} \right). \]

So exact analytic cluster scores require grouping by both covariate cell and cluster:

SELECT
  x1,
  x2,
  cluster,
  COUNT(*) AS count,
  SUM(y) AS sum_y
FROM data
GROUP BY x1, x2, cluster

This needs only one fit, so it is faster than the bootstrap when cluster cardinality is modest and adding cluster to the group-by does not destroy compression.
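To illustrate the stratum-level score formula, the NumPy sketch below groups simulated data by covariate cell and cluster, accumulates \(u_c\) from the compressed rows, and checks the result against the row-level definition. It is independent of duckreg; the simulated data and variable names are assumptions, and no small-sample correction is applied, matching the formula for \(\hat{V}_{CR}\) as written.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated clustered data (illustrative): a shared shock within each cluster.
N, C = 600, 12
cluster = rng.integers(0, C, size=N)
x1 = rng.integers(0, 3, size=N)
x2 = rng.integers(0, 2, size=N)
y = 1.0 + 0.5 * x1 - 0.3 * x2 + rng.normal(size=C)[cluster] + rng.normal(size=N)
X = np.column_stack([np.ones(N), x1, x2]).astype(float)
k = X.shape[1]

# Compress by (covariate cell, cluster), as in GROUP BY x1, x2, cluster.
keys = np.column_stack([X, cluster])
strata, inverse = np.unique(keys, axis=0, return_inverse=True)
inverse = inverse.reshape(-1)  # some NumPy versions return a 2-D inverse
G = strata.shape[0]
x_g = strata[:, :k]
c_g = strata[:, k].astype(int)
n_g = np.bincount(inverse, minlength=G).astype(float)
sum_y = np.bincount(inverse, weights=y, minlength=G)

# Point estimate from compressed rows (equals raw-data OLS).
XtX = x_g.T @ (x_g * n_g[:, None])
beta = np.linalg.solve(XtX, x_g.T @ sum_y)

# Stratum scores x_g * (sum_y_g - n_g * x_g' beta), summed within cluster.
s_g = x_g * (sum_y - n_g * (x_g @ beta))[:, None]
u = np.zeros((C, k))
np.add.at(u, c_g, s_g)

bread = np.linalg.inv(XtX)
V_compressed = bread @ (u.T @ u) @ bread

# Reference: cluster scores accumulated row by row from the raw data.
resid = y - X @ beta
u_raw = np.zeros((C, k))
np.add.at(u_raw, cluster, X * resid[:, None])
bread_raw = np.linalg.inv(X.T @ X)
V_raw = bread_raw @ (u_raw.T @ u_raw) @ bread_raw
```

Note that only the count and sum_y are needed here; sum_y_sq plays no role because the cluster score is linear in the residuals.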

Tradeoff

Scenario	Strategy	Reason
IID or HC1-style robust errors	`fit_vcov()`	Sufficient statistics are already stored.
Few or moderate clusters	analytic cluster scores, if implemented	Grouping by cluster preserves scores and remains compact.
Many high-cardinality clusters	cluster bootstrap	Grouping by cluster may approach the raw data size.
GLMs	`fit_vcov()`	Model-based Fisher covariance is implemented.
Ridge	no built-in SEs	Regularization-aware inference is separate.

Current API Details

DuckRegression.fit_vcov() computes analytic HC1-style covariance for a single outcome.

DuckDML.fit_vcov() computes an analytic covariance from compressed leave-one-out residual cross-products.

DuckMundlak, DuckDoubleDemeaning, and DuckMundlakEventStudy use bootstrap covariance when n_bootstraps > 0.

GLM classes have fit_vcov() methods but currently stub out bootstrap. Use n_bootstraps=0 for those classes until a real bootstrap implementation is added.