duckreg

Inference and Variance Estimation

Inference in duckreg depends on whether the compressed table preserves the score contributions needed for the target covariance estimator.

Analytic HC1 For Linear Models

For DuckRegression.fit_vcov(), the covariance has sandwich form:

\[ \hat{V} = \frac{N}{N-k} (X'WX)^{-1} \left(\sum_g \operatorname{RSS}_g x_gx_g'\right) (X'WX)^{-1}. \]

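Here \(N\) is the number of raw observations, \(k\) is the number of regressors, and \(g\) indexes compressed groups, each with covariate vector \(x_g\), fitted value \(\hat{y}_g\), and \(n_g\) observations. The grouped residual sum of squares is exact from compressed sufficient statistics: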

\[ \operatorname{RSS}_g = n_g \hat{y}_g^2 - 2\hat{y}_g \sum_{i \in g} y_i + \sum_{i \in g} y_i^2. \]

This requires sum_y_sq, which the compression query stores.
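The identity above can be checked numerically. The following is a minimal sketch in plain Python, not duckreg's implementation: the helper names `compress` and `grouped_rss`, the toy rows, and the fitted values are all assumptions made for illustration.

```python
from collections import defaultdict


def compress(rows):
    """Group (x, y) rows by covariate cell, keeping count, sum_y and sum_y_sq."""
    stats = defaultdict(lambda: {"count": 0, "sum_y": 0.0, "sum_y_sq": 0.0})
    for x, y in rows:
        s = stats[x]
        s["count"] += 1
        s["sum_y"] += y
        s["sum_y_sq"] += y * y
    return dict(stats)


def grouped_rss(stat, y_hat):
    """RSS_g = n_g * y_hat^2 - 2 * y_hat * sum_y + sum_y_sq."""
    return stat["count"] * y_hat**2 - 2.0 * y_hat * stat["sum_y"] + stat["sum_y_sq"]


# Toy data: one discrete covariate x and an outcome y (illustrative values).
rows = [(0, 1.0), (0, 2.0), (0, 4.0), (1, 3.0), (1, 5.0)]
fitted = {0: 2.0, 1: 4.5}  # hypothetical fitted value per covariate cell

compressed = compress(rows)
rss_compressed = {x: grouped_rss(s, fitted[x]) for x, s in compressed.items()}

# Row-level reference: sum of squared residuals within each cell.
rss_raw = {
    x: sum((y - fitted[x]) ** 2 for xx, y in rows if xx == x) for x in fitted
}
```

Both dictionaries agree cell by cell, which is why `sum_y_sq` is the only extra statistic the compressed table needs for this covariance.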

Cluster Bootstrap

The implemented cluster bootstrap resamples clusters with replacement, recompresses, and refits. This path is the natural choice when the compression used for point estimation does not preserve all cluster-level score contributions.

For a cluster column \(c\), the bootstrap creates multiplicities:

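-- The ? parameter is bound to the list of cluster ids drawn with replacement.
-- COUNT(*) gives the number of times each cluster was drawn.
WITH resampled AS (
  SELECT cluster_id, COUNT(*) AS mult
  FROM (SELECT unnest(?) AS cluster_id)
  GROUP BY cluster_id
)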

and then joins those multiplicities against grouped data.
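The resampling step can be sketched in plain Python. This is an illustration under stated assumptions, not duckreg's code: the functions `draw_multiplicities` and `reweight` are hypothetical, the grouped rows are assumed to carry a `cluster` key, and scaling the sufficient statistics by the multiplicity is one plausible reading of the join.

```python
import random
from collections import Counter


def draw_multiplicities(cluster_ids, rng):
    """Sample clusters with replacement and count how often each was drawn."""
    unique = sorted(set(cluster_ids))
    draws = rng.choices(unique, k=len(unique))
    return Counter(draws)


def reweight(grouped_rows, mult):
    """Scale each grouped row's statistics by its cluster's multiplicity.

    Rows from clusters that were not drawn are dropped, mirroring an inner join.
    """
    out = []
    for row in grouped_rows:
        m = mult.get(row["cluster"], 0)
        if m == 0:
            continue
        out.append(
            {
                **row,
                "count": row["count"] * m,
                "sum_y": row["sum_y"] * m,
                "sum_y_sq": row["sum_y_sq"] * m,
            }
        )
    return out


# Hypothetical grouped data: one row per cluster for brevity.
grouped = [
    {"cluster": "a", "count": 3, "sum_y": 6.0, "sum_y_sq": 14.0},
    {"cluster": "b", "count": 2, "sum_y": 5.0, "sum_y_sq": 13.0},
    {"cluster": "c", "count": 4, "sum_y": 8.0, "sum_y_sq": 20.0},
]
rng = random.Random(0)  # seeded for reproducibility
mult = draw_multiplicities([r["cluster"] for r in grouped], rng)
boot = reweight(grouped, mult)
```

Each bootstrap replicate would repeat this draw and refit on the reweighted rows. The multiplicities always sum to the number of clusters.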

Analytic Cluster-Robust Scores

If the compressed rows preserve cluster identity, analytic cluster-robust covariance can be computed from compressed scores. The standard cluster-robust variance is

\[ \hat{V}_{CR} = (X'X)^{-1} \left( \sum_{c=1}^C u_cu_c' \right) (X'X)^{-1}, \]

where

\[ u_c = \sum_{i \in c} x_i \hat{\varepsilon}_i. \]

If each compressed stratum \(g\) belongs to exactly one cluster \(c\), then

\[ u_c = \sum_{g \in c} x_g \left( \sum_{i \in g} y_i - n_gx_g'\hat{\beta} \right). \]

So exact analytic cluster scores require grouping by both covariate cell and cluster:

SELECT
  x1,
  x2,
  cluster,
  COUNT(*) AS count,
  SUM(y) AS sum_y
FROM data
GROUP BY x1, x2, cluster

This is faster than the bootstrap when the number of clusters is modest and adding the cluster column to the GROUP BY does not destroy compression.
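The equivalence between row-level and compressed cluster scores can be checked on a toy data set. The sketch below is plain Python under stated assumptions: the function names, the rows, and the coefficient vector `beta` are hypothetical and stand in for a fitted model.

```python
from collections import defaultdict


def raw_scores(rows, beta):
    """u_c = sum over i in c of x_i * (y_i - x_i'beta), from row-level data."""
    scores = defaultdict(lambda: [0.0] * len(beta))
    for x, cluster, y in rows:
        resid = y - sum(xj * bj for xj, bj in zip(x, beta))
        for j, xj in enumerate(x):
            scores[cluster][j] += xj * resid
    return dict(scores)


def compress_by_cell_and_cluster(rows):
    """Mirror GROUP BY x1, x2, cluster: keep count and sum_y per (x, cluster)."""
    cells = defaultdict(lambda: [0, 0.0])
    for x, cluster, y in rows:
        cells[(x, cluster)][0] += 1
        cells[(x, cluster)][1] += y
    return dict(cells)


def compressed_scores(cells, beta):
    """u_c = sum over g in c of x_g * (sum_y_g - n_g * x_g'beta)."""
    scores = defaultdict(lambda: [0.0] * len(beta))
    for (x, cluster), (n, sum_y) in cells.items():
        resid_sum = sum_y - n * sum(xj * bj for xj, bj in zip(x, beta))
        for j, xj in enumerate(x):
            scores[cluster][j] += xj * resid_sum
    return dict(scores)


# Each row is (covariates, cluster, outcome); values are illustrative.
rows = [
    ((1.0, 0.0), "a", 1.0),
    ((1.0, 0.0), "a", 2.0),
    ((1.0, 1.0), "a", 3.5),
    ((1.0, 0.0), "b", 0.5),
    ((1.0, 1.0), "b", 4.0),
    ((1.0, 1.0), "b", 2.5),
]
beta = (1.0, 2.0)  # hypothetical coefficient vector

cells = compress_by_cell_and_cluster(rows)
u_raw = raw_scores(rows, beta)
u_comp = compressed_scores(cells, beta)
```

Six rows compress to four (cell, cluster) groups here, and the scores match component by component. With many clusters the number of groups grows toward the raw row count, which is the cost noted above.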

Tradeoff

| Scenario | Strategy | Reason |
| --- | --- | --- |
| IID or HC1-style robust errors | fit_vcov() | Sufficient statistics are already stored. |
| Few or moderate clusters | analytic cluster scores, if implemented | Grouping by cluster preserves scores and remains compact. |
| Many high-cardinality clusters | cluster bootstrap | Grouping by cluster may approach the raw data size. |
| GLMs | fit_vcov() | Model-based Fisher covariance is implemented. |
| Ridge | no built-in SEs | Regularization-aware inference is separate. |

Current API Details

DuckRegression.fit_vcov() computes analytic HC1-style covariance for a single outcome.

DuckDML.fit_vcov() computes an analytic covariance from compressed leave-one-out residual cross-products.

DuckMundlak, DuckDoubleDemeaning, and DuckMundlakEventStudy use bootstrap covariance when n_bootstraps > 0.

GLM classes have fit_vcov() methods, but their bootstrap is currently a stub. Use n_bootstraps=0 for those classes until a real bootstrap implementation is added.