Inference and Variance Estimation
Inference in duckreg depends on whether the compressed table preserves the score contributions needed for the target covariance estimator.
Analytic HC1 For Linear Models
For DuckRegression.fit_vcov(), the covariance has sandwich form:
\[ \hat{V} = \frac{N}{N-k} (X'WX)^{-1} \left(\sum_g \operatorname{RSS}_g x_gx_g'\right) (X'WX)^{-1}. \]
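The grouped residual sum of squares can be computed exactly from the compressed sufficient statistics. Writing \(\hat{y}_g = x_g'\hat{\beta}\) for the fitted value shared by every observation in stratum \(g\) and \(n_g\) for its row count: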
\[ \operatorname{RSS}_g = n_g \hat{y}_g^2 - 2\hat{y}_g \sum_{i \in g} y_i + \sum_{i \in g} y_i^2. \]
This requires sum_y_sq, which the compression query stores.
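As a quick sanity check of the identity above, the following sketch compresses a toy data set with one scalar regressor and compares the "meat" term \(\sum_g \operatorname{RSS}_g x_g x_g'\) computed from sufficient statistics against the same quantity computed from raw rows. The helper `grouped_rss`, the toy data, and the placeholder coefficient `beta_hat` are assumptions made for illustration; none of this is duckreg code.

```python
import random

def grouped_rss(n_g, sum_y, sum_y_sq, y_hat):
    """RSS of one stratum from its count, sum of y, and sum of y squared."""
    return n_g * y_hat ** 2 - 2.0 * y_hat * sum_y + sum_y_sq

rng = random.Random(0)
beta_hat = 0.7  # placeholder coefficient (assumption), not an actual fit

# Raw rows (x, y): a scalar regressor taking three distinct values.
raw = [(x, rng.gauss(x, 1.0)) for x in (1.0, 2.0, 3.0) for _ in range(20)]

# Compress to one row per distinct x: (count, sum_y, sum_y_sq).
stats = {}
for x, y in raw:
    n, s, ss = stats.get(x, (0, 0.0, 0.0))
    stats[x] = (n + 1, s + y, ss + y * y)

# Meat of the sandwich from compressed rows: sum_g RSS_g * x_g^2.
meat_compressed = sum(
    grouped_rss(n, s, ss, x * beta_hat) * x * x
    for x, (n, s, ss) in stats.items()
)

# Same quantity from raw rows: sum_i e_i^2 * x_i^2.
meat_raw = sum((y - x * beta_hat) ** 2 * x * x for x, y in raw)
```

The two values agree up to floating-point rounding for any coefficient, because the identity is algebraic and does not depend on how \(\hat{\beta}\) was obtained.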
Cluster Bootstrap
The implemented cluster bootstrap resamples clusters, recompresses, and refits. Bootstrapping is the natural choice when the compression used for point estimation does not preserve all cluster-level score contributions.
For a cluster column \(c\), the bootstrap creates multiplicities:
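```sql
-- ? is bound to a list of cluster ids drawn with replacement;
-- repeated ids collapse into a multiplicity per cluster.
WITH resampled AS (
    SELECT cluster_id, COUNT(*) AS mult
    FROM (SELECT unnest(?) AS cluster_id)
    GROUP BY cluster_id
)
```

It then joins those multiplicities against the grouped data.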
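The resample-and-join step can be sketched in plain Python. This is an illustration under assumptions, not the library's implementation: the function names, the dictionary row layout, and the choice to scale `count` and `sum_y` by the multiplicity are all hypothetical.

```python
import random
from collections import Counter

def resample_multiplicities(cluster_ids, rng):
    """Draw len(cluster_ids) clusters with replacement; return {cluster: mult}."""
    draws = rng.choices(cluster_ids, k=len(cluster_ids))
    return Counter(draws)

def reweight(grouped_rows, mult):
    """Join multiplicities onto grouped rows (assumed layout) and scale the
    sufficient statistics; clusters that were not drawn are dropped."""
    out = []
    for row in grouped_rows:
        m = mult.get(row["cluster"], 0)
        if m == 0:
            continue
        out.append({**row, "count": row["count"] * m, "sum_y": row["sum_y"] * m})
    return out

# Toy grouped data: one row per (covariate cell, cluster).
rows = [
    {"x1": 0, "cluster": "a", "count": 3, "sum_y": 1.5},
    {"x1": 1, "cluster": "a", "count": 2, "sum_y": 4.0},
    {"x1": 0, "cluster": "b", "count": 5, "sum_y": 2.5},
    {"x1": 1, "cluster": "c", "count": 1, "sum_y": 0.5},
]
clusters = ["a", "b", "c"]

rng = random.Random(42)
mult = resample_multiplicities(clusters, rng)
boot_rows = reweight(rows, mult)
```

Each bootstrap replicate would then refit the model on `boot_rows`; the spread of the refitted coefficients across replicates gives the covariance estimate.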
Analytic Cluster-Robust Scores
If the compressed rows preserve cluster identity, analytic cluster-robust covariance can be computed from compressed scores. The standard cluster-robust variance is
\[ \hat{V}_{CR} = (X'X)^{-1} \left( \sum_{c=1}^C u_cu_c' \right) (X'X)^{-1}, \]
where
\[ u_c = \sum_{i \in c} x_i \hat{\varepsilon}_i. \]
If each compressed stratum \(g\) belongs to exactly one cluster \(c\), then
\[ u_c = \sum_{g \in c} x_g \left( \sum_{i \in g} y_i - n_gx_g'\hat{\beta} \right). \]
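The step from the observation-level sum to the stratum-level sum uses only the fact that every observation in stratum \(g\) shares the covariate vector \(x_g\), with \(\hat{\varepsilon}_i = y_i - x_i'\hat{\beta}\):

\[ u_c = \sum_{g \in c} \sum_{i \in g} x_i \left( y_i - x_i'\hat{\beta} \right) = \sum_{g \in c} x_g \sum_{i \in g} \left( y_i - x_g'\hat{\beta} \right) = \sum_{g \in c} x_g \left( \sum_{i \in g} y_i - n_gx_g'\hat{\beta} \right). \]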
So exact analytic cluster scores require grouping by both covariate cell and cluster:
```sql
-- Compress by covariate cell and cluster so each row belongs to exactly one cluster.
SELECT
    x1,
    x2,
    cluster,
    COUNT(*) AS count,
    SUM(y) AS sum_y
FROM data
GROUP BY x1, x2, cluster
```

This is faster than the bootstrap when cluster cardinality is modest and adding cluster to the GROUP BY does not destroy compression.
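To illustrate that grouping by cell and cluster preserves the scores, the sketch below runs the same compression query on a toy table and checks that the cluster scores built from compressed rows match those built from raw rows. It uses Python's built-in `sqlite3` as a stand-in for DuckDB so that it runs without extra dependencies; the toy data, the placeholder `beta`, and both helper functions are assumptions for illustration only.

```python
import sqlite3
from collections import defaultdict

# Toy raw rows: (x1, x2, cluster, y).
rows = [
    (1, 0, "a", 2.0), (1, 0, "a", 3.0), (1, 1, "a", 4.5),
    (1, 0, "b", 1.0), (1, 1, "b", 5.0), (1, 1, "b", 6.0),
]
beta = (1.5, 2.0)  # placeholder coefficients (assumption), not an actual fit

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE data (x1 REAL, x2 REAL, cluster TEXT, y REAL)")
con.executemany("INSERT INTO data VALUES (?, ?, ?, ?)", rows)
compressed = con.execute(
    "SELECT x1, x2, cluster, COUNT(*) AS count, SUM(y) AS sum_y "
    "FROM data GROUP BY x1, x2, cluster"
).fetchall()

def scores_from_compressed(compressed, beta):
    """u_c = sum over strata g in c of x_g * (sum_y_g - n_g * x_g'beta)."""
    u = defaultdict(lambda: [0.0, 0.0])
    for x1, x2, c, n, sum_y in compressed:
        resid_sum = sum_y - n * (x1 * beta[0] + x2 * beta[1])
        u[c][0] += x1 * resid_sum
        u[c][1] += x2 * resid_sum
    return dict(u)

def scores_from_raw(rows, beta):
    """u_c = sum over observations i in c of x_i * residual_i."""
    u = defaultdict(lambda: [0.0, 0.0])
    for x1, x2, c, y in rows:
        resid = y - (x1 * beta[0] + x2 * beta[1])
        u[c][0] += x1 * resid
        u[c][1] += x2 * resid
    return dict(u)

u_compressed = scores_from_compressed(compressed, beta)
u_raw = scores_from_raw(rows, beta)
```

Here six raw rows compress to four, one per (cell, cluster) pair. With many clusters per covariate cell the compressed table grows toward the raw row count, which is the tradeoff summarized in the table that follows.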
Tradeoff
| Scenario | Strategy | Reason |
|---|---|---|
| IID or HC1-style robust errors | fit_vcov() | Sufficient statistics are already stored. |
| Few or moderate clusters | analytic cluster scores, if implemented | Grouping by cluster preserves scores and remains compact. |
| Many high-cardinality clusters | cluster bootstrap | Grouping by cluster may approach the raw data size. |
| GLMs | fit_vcov() | Model-based Fisher covariance is implemented. |
| Ridge | no built-in SEs | Regularization-aware inference is separate. |
Current API Details
DuckRegression.fit_vcov() computes analytic HC1-style covariance for a single outcome.
DuckDML.fit_vcov() computes an analytic covariance from compressed leave-one-out residual cross-products.
DuckMundlak, DuckDoubleDemeaning, and DuckMundlakEventStudy use bootstrap covariance when n_bootstraps > 0.
GLM classes have fit_vcov() methods but currently stub out bootstrap. Use n_bootstraps=0 for those classes until a real bootstrap implementation is added.