Linear Regression API
DuckRegression is the original estimator in the package. It handles compressed ordinary least squares with optional fixed-effect demeaning, analytic HC1-style standard errors for a single outcome, and bootstrap covariance when n_bootstraps > 0.
Constructor
Show code
DuckRegression(
db_name: str,
table_name: str,
formula: str,
cluster_col: str,
seed: int,
n_bootstraps: int = 100,
rowid_col: str = "rowid",
fitter: str = "numpy",
)The formula has the form
Show code
Y ~ X1 + X2 + X3
or, with fixed effects,
Show code
Y ~ X1 + X2 | unit + time
Multiple outcomes are allowed:
Show code
Y1 + Y2 ~ D + f1 + f2
Point Estimation
Without fixed effects, DuckRegression adds an intercept and solves weighted least squares on the compressed rows:
\[ \hat{\beta} = \left(X_c' W X_c\right)^{-1} X_c' W \bar{y}_c, \]
where \(W=\operatorname{diag}(n_g)\). With fixed effects, the compressed \(y\) and \(X\) are weighted-demeaned first.
Show code
from duckreg.estimators import DuckRegression
m = DuckRegression(
db_name="large_dataset.db",
table_name="data",
formula="Y ~ D + f1 + f2",
cluster_col="",
n_bootstraps=0,
seed=42,
)
m.fit()
m.fit_vcov()
m.summary()The test suite checks that compressed estimates match direct NumPy OLS coefficients on the uncompressed table for formulas such as Y ~ D, Y ~ D + f1, and Y ~ D + f1 + f2.
Analytic HC1 Covariance
fit_vcov() computes a compressed version of HC1 for a single outcome. It uses
\[ \hat{V} = \frac{N}{N-k} (X'WX)^{-1} \left(\sum_g \operatorname{RSS}_g x_g x_g'\right) (X'WX)^{-1}, \]
where the within-cell residual sum of squares is reconstructed from sufficient statistics:
\[ \operatorname{RSS}_g = n_g \hat{y}_g^2 - 2\hat{y}_g \sum_{i \in g} y_i + \sum_{i \in g} y_i^2. \]
This is why the compression step stores sum_y_sq.
Bootstrap Covariance
If n_bootstraps > 0, DuckReg.fit() calls bootstrap(). For DuckRegression, the bootstrap has two paths:
| Setting | Resampling unit | Data path |
|---|---|---|
cluster_col="" or false-like |
row ids | recompress selected rows |
cluster_col="cluster" |
clusters | join cluster multiplicities to grouped data |
Cluster bootstrap logic uses a resampled cluster table:
Show code
WITH resampled AS (
SELECT cluster_id, COUNT(*) AS mult
FROM (SELECT unnest(?) AS cluster_id)
GROUP BY cluster_id
)and then multiplies grouped counts and sums by mult.
Multiple Outcomes
The parser allows
Show code
m = DuckRegression(
db_name="large_dataset.db",
table_name="data",
formula="Y + Y2 ~ D + f1 + f2",
cluster_col="f1",
n_bootstraps=100,
seed=232,
)
m.fit()
m.summary()Compression stores sum_Y, sum_Y_sq, mean_Y, sum_Y2, sum_Y2_sq, and mean_Y2. The output concatenates the coefficient vectors.
What To Check In Practice
For a linear fit, check:
Show code
m.df_compressed["count"].sum()
len(m.df_compressed)
m.summary()The first should equal the raw observation count. The second is the compression payoff. The third reports standard_error only if covariance was computed either by fit_vcov() or by bootstrap.