duckreg
  • Home
  • Compression
  • Linear Models
  • Panel
  • DML
  • GLMs
  • Ridge
  • Inference
  • Examples
  1. Compression and Estimator Lifecycle
  • duckreg Documentation
  • Compression and Estimator Lifecycle
  • Linear Regression API
  • Panel Estimators
  • Compressed Double Machine Learning
  • Generalized Linear Models
  • Compressed Ridge Regression
  • Inference and Variance Estimation
  • Executed Examples

On this page

  • The DuckReg Contract
  • Weighted Least Squares After Compression
  • What The SQL Aggregation Stores
  • Fixed Effects
  • Compression Is A Contract

Compression and Estimator Lifecycle

The package is built around a simple observation: if an estimator only needs sums over repeated covariate patterns, the raw data can stay in DuckDB. Python only sees a compressed table whose number of rows is the number of unique strata.

The DuckReg Contract

Every estimator subclasses DuckReg and implements four methods:

Show code
class DuckReg:
    def fit(self):
        self.prepare_data()
        self.compress_data()
        self.point_estimate = self.estimate()
        if self.n_bootstraps > 0:
            self.vcov = self.bootstrap()
        self.conn.close()

This makes the public workflow uniform:

Show code
model.fit()
results = model.summary()

Classes that support analytic covariance add a method such as:

Show code
model.fit_vcov()
results = model.summary()

The important object created during fitting is usually model.df_compressed. It is the in-memory table passed to NumPy for the final solve.

Weighted Least Squares After Compression

Suppose the raw linear model is

\[ y_i = x_i'\beta + \varepsilon_i, \]

and rows are grouped by identical covariates. For group \(g\):

\[ n_g = \#\{i:x_i=x_g\}, \qquad \bar{y}_g = \frac{1}{n_g}\sum_{i:x_i=x_g} y_i. \]

The full-sample least-squares objective can be written as

\[ \sum_i (y_i - x_i'\beta)^2 = \sum_g \sum_{i \in g} (y_i - x_g'\beta)^2. \]

The first-order condition depends on the raw data only through \(n_g\) and \(\bar{y}_g\):

\[ \sum_i x_i(y_i - x_i'\beta) = \sum_g n_g x_g(\bar{y}_g - x_g'\beta). \]

So the compressed estimator is weighted least squares:

\[ \hat{\beta} = \left(\sum_g n_g x_g x_g'\right)^{-1} \left(\sum_g n_g x_g \bar{y}_g\right). \]

The implementation uses frequency weights:

Show code
def wls(X, y, n):
    N = sqrt(n)
    Xn = X * N
    yn = y * N
    return np.linalg.lstsq(Xn, yn, rcond=None)[0]

What The SQL Aggregation Stores

For DuckRegression, compress_data() groups by the right-hand-side covariates and fixed-effect columns:

Show code
SELECT
  x1, x2, fe1,
  COUNT(*) AS count,
  SUM(y) AS sum_y,
  SUM(POW(y, 2)) AS sum_y_sq
FROM data
GROUP BY x1, x2, fe1

The compressed table is then augmented with

\[ \bar{y}_g = \frac{\text{sum\_y}_g}{\text{count}_g}. \]

For multiple outcomes, the same pattern is repeated for every outcome variable.

Fixed Effects

The formula parser supports a fixed-effect separator:

Show code
model = DuckRegression(
    db_name="large_dataset.db",
    table_name="data",
    formula="Y ~ D + f1 | unit + time",
    cluster_col="unit",
    seed=42,
)

If fixed effects are present, collect_data() converts the fixed-effect columns to integer ids, then calls demean() on both \(y\) and \(X\) using the compressed-cell frequency weights. The demeaning routine uses alternating projections:

\[ z \leftarrow z - E_w[z \mid \text{fixed effect}] \]

cycling over each fixed-effect dimension until convergence.

Compression Is A Contract

Compression is exact when the grouped statistics preserve the quantities that enter the estimator. For ordinary least squares with discrete covariates, that contract is exact. For cluster-robust inference, the grouping must also preserve cluster membership if one wants exact analytic cluster scores. Otherwise, bootstrap resampling has to return to the raw table or to a cluster-preserving compressed table.