duckreg

Linear Regression API

DuckRegression is the original estimator in the package. It handles compressed ordinary least squares with optional fixed-effect demeaning, analytic HC1-style standard errors for a single outcome, and bootstrap covariance when n_bootstraps > 0.

Constructor

DuckRegression(
    db_name: str,
    table_name: str,
    formula: str,
    cluster_col: str,
    seed: int,
    n_bootstraps: int = 100,
    rowid_col: str = "rowid",
    fitter: str = "numpy",
)

The formula has the form

Y ~ X1 + X2 + X3

or, with fixed effects,

Y ~ X1 + X2 | unit + time

Multiple outcomes are allowed:

Y1 + Y2 ~ D + f1 + f2

Point Estimation

Without fixed effects, DuckRegression adds an intercept and solves weighted least squares on the compressed rows:

\[ \hat{\beta} = \left(X_c' W X_c\right)^{-1} X_c' W \bar{y}_c, \]

where \(W=\operatorname{diag}(n_g)\). With fixed effects, the compressed \(y\) and \(X\) are weighted-demeaned first.

from duckreg.estimators import DuckRegression

m = DuckRegression(
    db_name="large_dataset.db",
    table_name="data",
    formula="Y ~ D + f1 + f2",
    cluster_col="",
    n_bootstraps=0,
    seed=42,
)
m.fit()
m.fit_vcov()
m.summary()
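The weighted demeaning mentioned above for the fixed-effects case can be illustrated with a minimal NumPy sketch. It assumes a single fixed effect and a single regressor; the helper `weighted_demean` and all variable names are hypothetical and are not taken from the package. It compresses by (unit, D), demeans the cell means within each unit using the cell counts as weights, and compares the resulting slope with dummy-variable OLS on the raw rows.

```python
import numpy as np

rng = np.random.default_rng(4)
n, n_units = 3_000, 5
unit = rng.integers(0, n_units, size=n)
D = rng.integers(0, 2, size=n).astype(float)
y = 2.0 * D + 0.7 * unit + rng.normal(size=n)

# Compress: one row per distinct (unit, D) cell.
keys = np.column_stack([unit, D])
cells, inv = np.unique(keys, axis=0, return_inverse=True)
inv = inv.ravel()  # guard against NumPy versions returning a 2-D inverse
n_g = np.bincount(inv).astype(float)
ybar = np.bincount(inv, weights=y) / n_g
fe = cells[:, 0].astype(int)  # fixed-effect id of each compressed row
x = cells[:, 1]               # regressor value of each compressed row


def weighted_demean(v, groups, w):
    """Subtract the w-weighted group mean of v from each element (hypothetical helper)."""
    total_w = np.bincount(groups, weights=w)
    mean = np.bincount(groups, weights=w * v) / total_w
    return v - mean[groups]


y_t = weighted_demean(ybar, fe, n_g)
x_t = weighted_demean(x, fe, n_g)

# Weighted least squares slope on the demeaned compressed rows.
beta_fe = np.sum(n_g * x_t * y_t) / np.sum(n_g * x_t**2)

# Reference: OLS on the raw rows with one dummy per unit.
dummies = (unit[:, None] == np.arange(n_units)).astype(float)
beta_ref = np.linalg.lstsq(np.column_stack([D, dummies]), y, rcond=None)[0][0]

assert np.isclose(beta_fe, beta_ref)
```

The sketch covers only the one-way case. With two or more fixed effects, a single demeaning pass is in general not sufficient, so it should not be read as a description of how the package handles `unit + time`.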

The test suite checks that compressed estimates match direct NumPy OLS coefficients on the uncompressed table for formulas such as Y ~ D, Y ~ D + f1, and Y ~ D + f1 + f2.
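The equivalence behind that check can be reproduced without the package. The following is a minimal NumPy sketch on simulated data, not the package's test code; the data-generating process and variable names are assumptions chosen for illustration. It groups rows by their distinct covariate values, keeps the cell counts and cell means of the outcome, and solves the weighted normal equations from the formula above.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5_000
# Discrete regressors, so many rows share the same covariate values.
D = rng.integers(0, 2, size=n)
f1 = rng.integers(0, 3, size=n)
y = 1.0 + 2.0 * D - 0.5 * f1 + rng.normal(size=n)

X = np.column_stack([np.ones(n), D, f1]).astype(float)

# Direct OLS on the uncompressed rows.
beta_raw, *_ = np.linalg.lstsq(X, y, rcond=None)

# Compress: one row per distinct covariate combination.
Xc, inv = np.unique(X, axis=0, return_inverse=True)
inv = inv.ravel()  # guard against NumPy versions returning a 2-D inverse
n_g = np.bincount(inv)              # cell counts, the diagonal of W
sum_y = np.bincount(inv, weights=y)
ybar = sum_y / n_g                  # cell means of the outcome

# Weighted least squares on the compressed rows.
XtWX = Xc.T @ (n_g[:, None] * Xc)
XtWy = Xc.T @ (n_g * ybar)
beta_c = np.linalg.solve(XtWX, XtWy)

assert np.allclose(beta_raw, beta_c)
```

Here the 5,000 raw rows collapse to at most six compressed rows, and the two coefficient vectors agree up to floating-point error.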

Analytic HC1 Covariance

fit_vcov() computes a compressed version of HC1 for a single outcome. It uses

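\[ \hat{V} = \frac{N}{N-k} \left(X_c' W X_c\right)^{-1} \left(\sum_g \operatorname{RSS}_g x_g x_g'\right) \left(X_c' W X_c\right)^{-1}, \]

where \(N=\sum_g n_g\) is the raw observation count, \(k\) is the number of estimated coefficients, \(x_g\) is the compressed regressor row for cell \(g\), and the within-cell residual sum of squares is reconstructed from sufficient statistics: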

\[ \operatorname{RSS}_g = n_g \hat{y}_g^2 - 2\hat{y}_g \sum_{i \in g} y_i + \sum_{i \in g} y_i^2. \]

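Here \(\hat{y}_g = x_g'\hat{\beta}\) is the fitted value shared by every row in cell \(g\). This is why the compression step stores sum_y_sq in addition to the cell sums and counts.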
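To see that the reconstruction loses nothing, the compressed formula can be compared with the textbook HC1 sandwich on the raw rows. The sketch below uses simulated heteroskedastic data and plain NumPy; it is an illustration under those assumptions and not the body of `fit_vcov()`.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4_000
D = rng.integers(0, 2, size=n)
f1 = rng.integers(0, 4, size=n)
# Noise variance depends on D, so robust errors matter.
y = 0.5 + 1.5 * D + 0.3 * f1 + rng.normal(scale=1.0 + D, size=n)
X = np.column_stack([np.ones(n), D, f1]).astype(float)
N, k = X.shape

# Compress to sufficient statistics per covariate cell.
Xc, inv = np.unique(X, axis=0, return_inverse=True)
inv = inv.ravel()
n_g = np.bincount(inv)
sum_y = np.bincount(inv, weights=y)
sum_y_sq = np.bincount(inv, weights=y**2)

# Point estimate from compressed data; X_c' W ybar equals X_c' sum_y.
bread = np.linalg.inv(Xc.T @ (n_g[:, None] * Xc))
beta = bread @ (Xc.T @ sum_y)

# Within-cell residual sum of squares from the stored sums.
yhat_g = Xc @ beta
rss_g = n_g * yhat_g**2 - 2 * yhat_g * sum_y + sum_y_sq

meat = Xc.T @ (rss_g[:, None] * Xc)
V_compressed = N / (N - k) * bread @ meat @ bread

# Reference: HC1 computed row by row on the raw data.
resid = y - X @ beta
bread_raw = np.linalg.inv(X.T @ X)
meat_raw = X.T @ (resid[:, None] ** 2 * X)
V_raw = N / (N - k) * bread_raw @ meat_raw @ bread_raw

assert np.allclose(V_compressed, V_raw)
```

The two matrices agree because every row in a cell shares the same regressors and fitted value, so the sum of squared residuals within a cell is exactly \(\operatorname{RSS}_g\).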

Bootstrap Covariance

If n_bootstraps > 0, DuckReg.fit() calls bootstrap(). For DuckRegression, the bootstrap has two paths:

Setting                        Resampling unit   Data path
cluster_col="" or false-like   row ids           recompress selected rows
cluster_col="cluster"          clusters          join cluster multiplicities to grouped data

Cluster bootstrap logic uses a resampled cluster table:

WITH resampled AS (
  SELECT cluster_id, COUNT(*) AS mult
  FROM (SELECT unnest(?) AS cluster_id)
  GROUP BY cluster_id
)

and then multiplies grouped counts and sums by mult.
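The same multiplicity idea can be sketched in NumPy, outside DuckDB. The example below is an illustration under stated assumptions: the data are simulated, the helper `wls` is hypothetical, and taking the sample covariance of the bootstrap draws as the final covariance estimate is an assumption rather than a documented detail of the package. Rows are grouped once by (cluster, D); each replicate then only redraws cluster ids, counts how often each was drawn, and rescales the stored counts and sums.

```python
import numpy as np

rng = np.random.default_rng(2)
n, n_clusters = 3_000, 30
cluster = rng.integers(0, n_clusters, size=n)
D = rng.integers(0, 2, size=n).astype(float)
y = 1.0 + 2.0 * D + rng.normal(size=n)

# Group once by (cluster, D), keeping count and sum_y per cell.
keys = np.column_stack([cluster, D])
cells, inv = np.unique(keys, axis=0, return_inverse=True)
inv = inv.ravel()
count = np.bincount(inv).astype(float)
sum_y = np.bincount(inv, weights=y)
cell_cluster = cells[:, 0].astype(int)
Xc = np.column_stack([np.ones(len(cells)), cells[:, 1]])


def wls(counts, sums):
    """Solve the weighted normal equations from cell counts and cell sums."""
    XtWX = Xc.T @ (counts[:, None] * Xc)
    return np.linalg.solve(XtWX, Xc.T @ sums)


n_bootstraps = 200
draws = np.empty((n_bootstraps, Xc.shape[1]))
for b in range(n_bootstraps):
    # Resample cluster ids with replacement and count multiplicities,
    # mirroring SELECT cluster_id, COUNT(*) AS mult ... GROUP BY cluster_id.
    sampled = rng.integers(0, n_clusters, size=n_clusters)
    mult = np.bincount(sampled, minlength=n_clusters)
    m = mult[cell_cluster]  # the "join" of multiplicities onto grouped data
    draws[b] = wls(count * m, sum_y * m)

# Assumed final step: covariance of the bootstrap coefficient draws.
vcov = np.cov(draws, rowvar=False)
```

Clusters that are not drawn get multiplicity zero and drop out of that replicate, so no raw rows need to be touched after the initial grouping.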

Multiple Outcomes

The parser allows several outcomes on the left-hand side of the formula:

m = DuckRegression(
    db_name="large_dataset.db",
    table_name="data",
    formula="Y + Y2 ~ D + f1 + f2",
    cluster_col="f1",
    n_bootstraps=100,
    seed=232,
)
m.fit()
m.summary()

Compression stores sum_Y, sum_Y_sq, mean_Y, sum_Y2, sum_Y2_sq, and mean_Y2. The output concatenates the coefficient vectors.
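Because all outcomes share the same compressed design, one factorization of \(X_c' W X_c\) serves every outcome. The NumPy sketch below illustrates this on simulated data. The outcome-by-outcome ordering of the concatenated vector is an assumption made for the example, not something the text above specifies.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 2_000
D = rng.integers(0, 2, size=n).astype(float)
Y = 1.0 + 2.0 * D + rng.normal(size=n)
Y2 = -1.0 + 0.5 * D + rng.normal(size=n)
X = np.column_stack([np.ones(n), D])

# Compress once; keep one sum column per outcome (sum_Y, sum_Y2).
Xc, inv = np.unique(X, axis=0, return_inverse=True)
inv = inv.ravel()
n_g = np.bincount(inv)
sums = np.column_stack([np.bincount(inv, weights=v) for v in (Y, Y2)])

# One solve with a two-column right-hand side: B has shape (k, 2).
XtWX = Xc.T @ (n_g[:, None] * Xc)
B = np.linalg.solve(XtWX, Xc.T @ sums)

# Assumed layout: all coefficients for Y, then all coefficients for Y2.
beta_concat = B.T.ravel()

# Each column matches a separate OLS fit on the raw rows.
beta_Y = np.linalg.lstsq(X, Y, rcond=None)[0]
beta_Y2 = np.linalg.lstsq(X, Y2, rcond=None)[0]
assert np.allclose(beta_concat, np.concatenate([beta_Y, beta_Y2]))
```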

What To Check In Practice

For a linear fit, check:

m.df_compressed["count"].sum()
len(m.df_compressed)
m.summary()

The first should equal the raw observation count. The second is the number of compressed rows; the smaller it is relative to the raw count, the larger the compression payoff. The third reports standard_error only if a covariance was computed, either by fit_vcov() or by the bootstrap.