Linear Regression API

DuckRegression is the original estimator in the package. It handles compressed ordinary least squares with optional fixed-effect demeaning, analytic HC1-style standard errors for a single outcome, and bootstrap covariance when n_bootstraps > 0.

Constructor

DuckRegression(
    db_name: str,
    table_name: str,
    formula: str,
    cluster_col: str,
    seed: int,
    n_bootstraps: int = 100,
    rowid_col: str = "rowid",
    fitter: str = "numpy",
)

The formula has the form

Y ~ X1 + X2 + X3

or, with fixed effects,

Y ~ X1 + X2 | unit + time

Multiple outcomes are allowed:

Y1 + Y2 ~ D + f1 + f2
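
To make the three formula parts concrete, here is a minimal sketch of how such a string can be split into outcomes, covariates, and fixed effects. The helper `split_formula` is hypothetical and written only for illustration; it is not the package's parser and handles nothing beyond the `+`, `~`, and `|` syntax shown above.

```python
def split_formula(formula: str):
    """Split 'Y1 + Y2 ~ X1 + X2 | fe1 + fe2' into (outcomes, covariates, fixed effects).

    Hypothetical helper for illustration only, not duckreg's own parser.
    """
    lhs, rhs = formula.split("~", 1)
    # Everything after an optional '|' names the fixed effects.
    covars, _, fes = rhs.partition("|")

    def names(part: str):
        # Terms are separated by '+'; surrounding whitespace is ignored.
        return [term.strip() for term in part.split("+") if term.strip()]

    return names(lhs), names(covars), names(fes)


outcomes, covariates, fixed_effects = split_formula("Y ~ X1 + X2 | unit + time")
```

A formula without `|` simply yields an empty fixed-effects list.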

Point Estimation

Without fixed effects, DuckRegression adds an intercept and solves weighted least squares on the compressed rows:

\[ \hat{\beta} = \left(X_c' W X_c\right)^{-1} X_c' W \bar{y}_c, \]

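where \(X_c\) has one row per compressed cell \(g\), \(\bar{y}_c\) holds the cell means of the outcome, and \(W=\operatorname{diag}(n_g)\) weights each cell by its observation count \(n_g\). With fixed effects, the compressed \(y\) and \(X\) are weighted-demeaned first.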
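
The equivalence between weighted least squares on compressed rows and OLS on the raw rows can be checked directly in NumPy. The sketch below uses simulated data and does its own compression with `np.unique`; the variable names are assumptions made for this example, and it does not call the package.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000
D = rng.integers(0, 2, n)    # binary treatment
f1 = rng.integers(0, 3, n)   # discrete covariate
y = 1.0 + 2.0 * D - 0.5 * f1 + rng.normal(size=n)

# Reference: OLS with an intercept on the uncompressed rows.
X = np.column_stack([np.ones(n), D, f1])
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]

# Compress: one row per distinct (D, f1) cell, keeping the count and mean of y.
cells, inverse = np.unique(np.column_stack([D, f1]), axis=0, return_inverse=True)
inverse = inverse.ravel()    # guard against shape differences across NumPy versions
n_g = np.bincount(inverse)
ybar = np.bincount(inverse, weights=y) / n_g

# Weighted least squares on the compressed rows: (Xc' W Xc)^{-1} Xc' W ybar.
Xc = np.column_stack([np.ones(len(cells)), cells])
XtW = Xc.T * n_g             # same as Xc.T @ diag(n_g)
beta_c = np.linalg.solve(XtW @ Xc, XtW @ ybar)

print(np.allclose(beta_ols, beta_c))
```

Here six compressed rows reproduce the coefficients from 1,000 raw rows, because the normal equations depend on the data only through cell counts and cell sums.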

from duckreg.estimators import DuckRegression

m = DuckRegression(
    db_name="large_dataset.db",
    table_name="data",
    formula="Y ~ D + f1 + f2",
    cluster_col="",
    n_bootstraps=0,
    seed=42,
)
m.fit()
m.fit_vcov()
m.summary()

The test suite checks that compressed estimates match direct NumPy OLS coefficients on the uncompressed table for formulas such as Y ~ D, Y ~ D + f1, and Y ~ D + f1 + f2.

Analytic HC1 Covariance

fit_vcov() computes a compressed version of HC1 for a single outcome. It uses

\[ \hat{V} = \frac{N}{N-k} (X'WX)^{-1} \left(\sum_g \operatorname{RSS}_g x_g x_g'\right) (X'WX)^{-1}, \]

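where \(N=\sum_g n_g\) is the raw observation count, \(k\) is the number of columns of \(X\), \(x_g\) is the regressor row for cell \(g\), and \(\hat{y}_g = x_g'\hat{\beta}\) is its fitted value. The within-cell residual sum of squares is reconstructed from sufficient statistics: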

\[ \operatorname{RSS}_g = n_g \hat{y}_g^2 - 2\hat{y}_g \sum_{i \in g} y_i + \sum_{i \in g} y_i^2. \]

This is why the compression step stores sum_y_sq.
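
As a check on the formula, the following sketch computes HC1 twice on simulated heteroskedastic data: once from raw residuals, and once from the per-cell count, sum, and sum of squares only. It is a standalone NumPy illustration under assumed variable names, not the package's `fit_vcov()` code.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2_000
D = rng.integers(0, 2, n)
f1 = rng.integers(0, 4, n)
y = 0.5 + 1.5 * D + 0.3 * f1 + rng.normal(size=n) * (1 + D)

# Reference: textbook HC1 on the raw rows.
X = np.column_stack([np.ones(n), D, f1])
k = X.shape[1]
bread_raw = np.linalg.inv(X.T @ X)
beta = bread_raw @ X.T @ y
resid = y - X @ beta
V_raw = n / (n - k) * bread_raw @ (X.T * resid**2) @ X @ bread_raw

# Sufficient statistics per (D, f1) cell: count, sum of y, sum of y squared.
cells, inverse = np.unique(np.column_stack([D, f1]), axis=0, return_inverse=True)
inverse = inverse.ravel()
n_g = np.bincount(inverse)
sum_y = np.bincount(inverse, weights=y)
sum_y_sq = np.bincount(inverse, weights=y**2)

# Compressed fit: X'W ybar equals X' sum_y because ybar = sum_y / n_g.
Xc = np.column_stack([np.ones(len(cells)), cells])
bread = np.linalg.inv((Xc.T * n_g) @ Xc)
beta_c = bread @ (Xc.T @ sum_y)

# Within-cell RSS from sufficient statistics, then the sandwich.
yhat = Xc @ beta_c
rss_g = n_g * yhat**2 - 2 * yhat * sum_y + sum_y_sq
meat = (Xc.T * rss_g) @ Xc
N = n_g.sum()
V_c = N / (N - k) * bread @ meat @ bread

print(np.allclose(V_raw, V_c))
```

The two matrices agree because every raw row in a cell shares the same \(x_g\), so the sum of squared residuals times \(x_g x_g'\) collapses to \(\operatorname{RSS}_g x_g x_g'\).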

Bootstrap Covariance

If n_bootstraps > 0, DuckReg.fit() calls bootstrap(). For DuckRegression, the bootstrap has two paths:

| Setting | Resampling unit | Data path |
| --- | --- | --- |
| `cluster_col=""` or false-like | row ids | recompress selected rows |
| `cluster_col="cluster"` | clusters | join cluster multiplicities to grouped data |

Cluster bootstrap logic uses a resampled cluster table:

WITH resampled AS (
  SELECT cluster_id, COUNT(*) AS mult
  FROM (SELECT unnest(?) AS cluster_id)
  GROUP BY cluster_id
)

and then multiplies grouped counts and sums by mult.
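
The same multiplicity idea can be sketched in NumPy. The grouped arrays below are made-up illustration data, and the sketch assumes cluster ids are small non-negative integers so that `np.bincount` can stand in for the `GROUP BY`; the package itself performs this step in SQL.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical grouped data: one row per (cluster, covariate cell).
cluster = np.array([0, 0, 1, 1, 2, 2])
count = np.array([5, 3, 4, 6, 2, 7])
sum_y = np.array([10.0, 4.5, 8.0, 9.0, 1.0, 14.0])

# Draw clusters with replacement and count how often each one was drawn.
cluster_ids = np.unique(cluster)
draw = rng.choice(cluster_ids, size=len(cluster_ids), replace=True)
mult = np.bincount(draw, minlength=cluster_ids.max() + 1)

# Join multiplicities to the grouped rows; clusters never drawn drop out.
row_mult = mult[cluster]
keep = row_mult > 0
boot_count = count[keep] * row_mult[keep]
boot_sum_y = sum_y[keep] * row_mult[keep]
```

Scaling counts and sums by `mult` is equivalent to stacking `mult` copies of each drawn cluster, without ever materializing the duplicated rows.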

Multiple Outcomes

The parser accepts several outcomes on the left-hand side of the formula, separated by +:

m = DuckRegression(
    db_name="large_dataset.db",
    table_name="data",
    formula="Y + Y2 ~ D + f1 + f2",
    cluster_col="f1",
    n_bootstraps=100,
    seed=232,
)
m.fit()
m.summary()

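For each outcome, compression stores its sum, sum of squares, and mean: here sum_Y, sum_Y_sq, mean_Y, sum_Y2, sum_Y2_sq, and mean_Y2. The output concatenates the coefficient vectors. Because fit_vcov() covers only a single outcome, standard errors in this example come from the bootstrap.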
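
Since every outcome shares the same compressed design matrix, all coefficient vectors can be obtained from one linear solve. The sketch below illustrates this on simulated data in plain NumPy. The final stacking order (all coefficients for the first outcome, then all for the second) is an assumption made for the example; check the package's output for its actual ordering.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 1_500
D = rng.integers(0, 2, n)
f1 = rng.integers(0, 3, n)
Y = 1.0 + 2.0 * D + rng.normal(size=n)
Y2 = -1.0 + 0.5 * f1 + rng.normal(size=n)

# Compress once; keep a cell mean for each outcome.
cells, inverse = np.unique(np.column_stack([D, f1]), axis=0, return_inverse=True)
inverse = inverse.ravel()
n_g = np.bincount(inverse)
mean_Y = np.bincount(inverse, weights=Y) / n_g
mean_Y2 = np.bincount(inverse, weights=Y2) / n_g

# One solve with a two-column right-hand side gives one coefficient column per outcome.
Xc = np.column_stack([np.ones(len(cells)), cells])
XtW = Xc.T * n_g
B = np.linalg.solve(XtW @ Xc, XtW @ np.column_stack([mean_Y, mean_Y2]))

# Assumed ordering: coefficients for Y first, then coefficients for Y2.
stacked = np.concatenate([B[:, 0], B[:, 1]])
```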

What To Check In Practice

For a linear fit, check:

m.df_compressed["count"].sum()
len(m.df_compressed)
m.summary()

The first should equal the raw observation count. The second is the number of compressed rows; the smaller it is relative to the first, the larger the compression payoff. The third reports standard_error only if a covariance was computed, either by fit_vcov() or by the bootstrap.