Compression and Estimator Lifecycle

The package is built around a simple observation: if an estimator only needs sums over repeated covariate patterns, the raw data can stay in DuckDB. Python only sees a compressed table whose number of rows is the number of unique strata.

The `DuckReg` Contract

Every estimator subclasses DuckReg and implements four methods:

Show code

class DuckReg:
    def fit(self):
        self.prepare_data()
        self.compress_data()
        self.point_estimate = self.estimate()
        if self.n_bootstraps > 0:
            self.vcov = self.bootstrap()
        self.conn.close()

This makes the public workflow uniform:

Show code

model.fit()
results = model.summary()

Classes that support analytic covariance add a method such as:

Show code

model.fit_vcov()
results = model.summary()

The important object created during fitting is usually model.df_compressed. It is the in-memory table passed to NumPy for the final solve.

Weighted Least Squares After Compression

Suppose the raw linear model is

\[ y_i = x_i'\beta + \varepsilon_i, \]

and rows are grouped by identical covariates. For group \(g\):

\[ n_g = \#\{i:x_i=x_g\}, \qquad \bar{y}_g = \frac{1}{n_g}\sum_{i:x_i=x_g} y_i. \]

The full-sample least-squares objective can be written as

\[ \sum_i (y_i - x_i'\beta)^2 = \sum_g \sum_{i \in g} (y_i - x_g'\beta)^2. \]

The first-order condition depends on the raw data only through \(n_g\) and \(\bar{y}_g\):

\[ \sum_i x_i(y_i - x_i'\beta) = \sum_g n_g x_g(\bar{y}_g - x_g'\beta). \]

So the compressed estimator is weighted least squares:

\[ \hat{\beta} = \left(\sum_g n_g x_g x_g'\right)^{-1} \left(\sum_g n_g x_g \bar{y}_g\right). \]

The implementation uses frequency weights:

Show code

def wls(X, y, n):
    N = sqrt(n)
    Xn = X * N
    yn = y * N
    return np.linalg.lstsq(Xn, yn, rcond=None)[0]

What The SQL Aggregation Stores

For DuckRegression, compress_data() groups by the right-hand-side covariates and fixed-effect columns:

Show code

SELECT
  x1, x2, fe1,
  COUNT(*) AS count,
  SUM(y) AS sum_y,
  SUM(POW(y, 2)) AS sum_y_sq
FROM data
GROUP BY x1, x2, fe1

The compressed table is then augmented with

\[ \bar{y}_g = \frac{\text{sum\_y}_g}{\text{count}_g}. \]

For multiple outcomes, the same pattern is repeated for every outcome variable.

Fixed Effects

The formula parser supports a fixed-effect separator:

Show code

model = DuckRegression(
    db_name="large_dataset.db",
    table_name="data",
    formula="Y ~ D + f1 | unit + time",
    cluster_col="unit",
    seed=42,
)

If fixed effects are present, collect_data() converts the fixed-effect columns to integer ids, then calls demean() on both \(y\) and \(X\) using the compressed-cell frequency weights. The demeaning routine uses alternating projections:

\[ z \leftarrow z - E_w[z \mid \text{fixed effect}] \]

cycling over each fixed-effect dimension until convergence.

Compression Is A Contract

Compression is exact when the grouped statistics preserve the quantities that enter the estimator. For ordinary least squares with discrete covariates, that contract is exact. For cluster-robust inference, the grouping must also preserve cluster membership if one wants exact analytic cluster scores. Otherwise, bootstrap resampling has to return to the raw table or to a cluster-preserving compressed table.

The DuckReg Contract

Weighted Least Squares After Compression

What The SQL Aggregation Stores

Fixed Effects

Compression Is A Contract

The `DuckReg` Contract