Compression and Estimator Lifecycle
The package is built around a simple observation: if an estimator only needs sums over repeated covariate patterns, the raw data can stay in DuckDB. Python only sees a compressed table whose number of rows is the number of unique strata.
The DuckReg Contract
Every estimator subclasses DuckReg and implements four methods:
Show code
class DuckReg:
def fit(self):
self.prepare_data()
self.compress_data()
self.point_estimate = self.estimate()
if self.n_bootstraps > 0:
self.vcov = self.bootstrap()
self.conn.close()This makes the public workflow uniform:
Show code
model.fit()
results = model.summary()Classes that support analytic covariance add a method such as:
Show code
model.fit_vcov()
results = model.summary()The important object created during fitting is usually model.df_compressed. It is the in-memory table passed to NumPy for the final solve.
Weighted Least Squares After Compression
Suppose the raw linear model is
\[ y_i = x_i'\beta + \varepsilon_i, \]
and rows are grouped by identical covariates. For group \(g\):
\[ n_g = \#\{i:x_i=x_g\}, \qquad \bar{y}_g = \frac{1}{n_g}\sum_{i:x_i=x_g} y_i. \]
The full-sample least-squares objective can be written as
\[ \sum_i (y_i - x_i'\beta)^2 = \sum_g \sum_{i \in g} (y_i - x_g'\beta)^2. \]
The first-order condition depends on the raw data only through \(n_g\) and \(\bar{y}_g\):
\[ \sum_i x_i(y_i - x_i'\beta) = \sum_g n_g x_g(\bar{y}_g - x_g'\beta). \]
So the compressed estimator is weighted least squares:
\[ \hat{\beta} = \left(\sum_g n_g x_g x_g'\right)^{-1} \left(\sum_g n_g x_g \bar{y}_g\right). \]
The implementation uses frequency weights:
Show code
def wls(X, y, n):
N = sqrt(n)
Xn = X * N
yn = y * N
return np.linalg.lstsq(Xn, yn, rcond=None)[0]What The SQL Aggregation Stores
For DuckRegression, compress_data() groups by the right-hand-side covariates and fixed-effect columns:
Show code
SELECT
x1, x2, fe1,
COUNT(*) AS count,
SUM(y) AS sum_y,
SUM(POW(y, 2)) AS sum_y_sq
FROM data
GROUP BY x1, x2, fe1The compressed table is then augmented with
\[ \bar{y}_g = \frac{\text{sum\_y}_g}{\text{count}_g}. \]
For multiple outcomes, the same pattern is repeated for every outcome variable.
Fixed Effects
The formula parser supports a fixed-effect separator:
Show code
model = DuckRegression(
db_name="large_dataset.db",
table_name="data",
formula="Y ~ D + f1 | unit + time",
cluster_col="unit",
seed=42,
)If fixed effects are present, collect_data() converts the fixed-effect columns to integer ids, then calls demean() on both \(y\) and \(X\) using the compressed-cell frequency weights. The demeaning routine uses alternating projections:
\[ z \leftarrow z - E_w[z \mid \text{fixed effect}] \]
cycling over each fixed-effect dimension until convergence.
Compression Is A Contract
Compression is exact when the grouped statistics preserve the quantities that enter the estimator. For ordinary least squares with discrete covariates, that contract is exact. For cluster-robust inference, the grouping must also preserve cluster membership if one wants exact analytic cluster scores. Otherwise, bootstrap resampling has to return to the raw table or to a cluster-preserving compressed table.