duckreg Documentation

Compressed out-of-memory regressions with DuckDB.

duckreg estimates regressions by pushing the large part of the problem into DuckDB. The library compresses raw rows into sufficient statistics, loads the much smaller grouped table into memory, and then solves weighted linear, generalized linear, panel, DML, or ridge problems.

The shared estimator lifecycle is:

  1. prepare_data() creates any derived design tables.
  2. compress_data() runs SQL aggregation.
  3. estimate() solves the estimator on the compressed table.
  4. bootstrap() runs when n_bootstraps > 0.
  5. summary() returns point estimates and, when available, standard errors.
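from duckreg.estimators import DuckRegression

# Configure a compressed OLS fit against a table in a DuckDB database file.
model = DuckRegression(
    db_name="large_dataset.db",
    table_name="data",
    formula="Y ~ D + f1 + f2",
    cluster_col="",
    n_bootstraps=0,  # 0 skips the bootstrap() step
    seed=42,
)
model.fit()       # fit on the compressed table
model.fit_vcov()  # analytic covariance, no resampling
model.summary()   # point estimates and standard errors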
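
The compression in step 2 amounts to a GROUP BY over the covariate columns. The sketch below is an illustration only, not duckreg's actual query: the toy rows and the output column names n, sum_y, and sum_y_sq are assumptions. It uses the duckdb Python package when available and falls back to the standard-library sqlite3 module otherwise, since this particular aggregation is written the same way in both.

```python
# Use DuckDB if installed; otherwise fall back to sqlite3 as a stand-in.
try:
    import duckdb
    con = duckdb.connect()  # in-memory database
except ImportError:
    import sqlite3
    con = sqlite3.connect(":memory:")

# Toy raw table (hypothetical data): outcome Y and two discrete covariates.
con.execute("CREATE TABLE data (Y DOUBLE, D INTEGER, f1 INTEGER)")
con.execute(
    "INSERT INTO data VALUES "
    "(1.0, 0, 0), (2.0, 0, 0), (3.0, 1, 0), (5.0, 1, 0), "
    "(4.0, 1, 1), (6.0, 1, 1), (8.0, 1, 1), (0.5, 0, 1)"
)

# One output row per covariate cell: count, sum of Y, sum of Y squared.
query = """
    SELECT D, f1,
           COUNT(*)   AS n,
           SUM(Y)     AS sum_y,
           SUM(Y * Y) AS sum_y_sq
    FROM data
    GROUP BY D, f1
    ORDER BY D, f1
"""
rows = [tuple(r) for r in con.execute(query).fetchall()]

for row in rows:
    print(row)
```

Eight raw rows collapse to four cells here. With discrete covariates the number of cells is bounded by the number of distinct covariate combinations, not by the number of rows, which is what keeps the in-memory table small.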

API Map

| Object | Module | Use |
|---|---|---|
| DuckReg | duckreg.duckreg | Abstract lifecycle and common database connection handling. |
| DuckRegression | duckreg.estimators | Compressed OLS over discrete covariate cells, with optional fixed-effect demeaning. |
| DuckMundlak | duckreg.estimators | One-way or two-way Mundlak panel regression using generated unit and time means. |
| DuckMundlakEventStudy | duckreg.estimators | Cohort-by-time event-study design compressed before weighted least squares. |
| DuckDoubleDemeaning | duckreg.estimators | Two-way double-demeaned treatment estimator. |
| DuckDML | duckreg.estimators | Compressed leave-one-out partial linear estimator for discrete controls. |
| DuckLogisticRegression | duckreg.estimators | Compressed canonical logit via grouped Fisher scoring. |
| DuckPoissonRegression | duckreg.estimators | Compressed canonical Poisson via grouped Fisher scoring. |
| DuckMultinomialLogisticRegression | duckreg.estimators | Exact baseline-category multinomial logit for moderate label counts. |
| DuckPoissonMultinomialRegression | duckreg.estimators | Label-wise Poisson decomposition for many count labels. |
| DuckRidge | duckreg.regularized | Compressed ridge regression, lambda paths, and cross-validation. |
| demean | duckreg.demean | Numba alternating-projection demeaning for fixed effects. |

When To Use Which Estimator

| Problem | Preferred class |
|---|---|
| Saturated linear regression with discrete regressors | DuckRegression |
| Linear regression with fixed effects in the formula | DuckRegression with Y ~ X \| fe1 + fe2 |
| Panel treatment effects with unit and time means | DuckMundlak |
| Event-study coefficients by cohort and calendar time | DuckMundlakEventStudy |
| Two-way fixed-effect style residualized treatment | DuckDoubleDemeaning |
| Partial linear model with high-dimensional discrete controls | DuckDML |
| Binary outcome | DuckLogisticRegression |
| Count outcome | DuckPoissonRegression |
| Moderate number of unordered labels | DuckMultinomialLogisticRegression |
| Many count labels, such as token counts | DuckPoissonMultinomialRegression |
| Regularized linear model over compressed covariate cells | DuckRidge |

Core Constraint

Compression is exact when the variables needed by the estimator are constant or aggregable inside the grouping cells. The canonical example is OLS with discrete covariates. If all rows in cell \(g\) have covariates \(x_g\), then the estimator only needs

\[ n_g,\qquad \sum_{i \in g} y_i,\qquad \sum_{i \in g} y_i^2 \]

for point estimates and analytic HC1-style inference.
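
The claim above can be checked numerically. The following is a minimal NumPy sketch, not duckreg's implementation: the simulated data, the variable names, and the use of np.unique for the grouping step are all assumptions made for illustration. It compares OLS on the raw rows with weighted least squares on the compressed cells, and rebuilds the HC1 covariance from the three per-cell statistics.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated raw data: two binary covariates, so at most four cells.
n = 1000
d = rng.integers(0, 2, size=n)
f = rng.integers(0, 2, size=n)
y = 1.0 + 2.0 * d - 0.5 * f + rng.normal(size=n)
X = np.column_stack([np.ones(n), d, f])
k = X.shape[1]

# Reference: OLS and HC1 covariance on all raw rows.
XtX_inv = np.linalg.inv(X.T @ X)
beta_raw = XtX_inv @ (X.T @ y)
e = y - X @ beta_raw
meat_raw = X.T @ ((e**2)[:, None] * X)
vcov_raw = n / (n - k) * XtX_inv @ meat_raw @ XtX_inv

# Compression: one record per distinct covariate row x_g.
cells, inverse = np.unique(X, axis=0, return_inverse=True)
inverse = inverse.ravel()  # guard against shape differences across NumPy versions
n_g = np.bincount(inverse)                   # n_g
sum_y = np.bincount(inverse, weights=y)      # sum of y_i in cell g
sum_y2 = np.bincount(inverse, weights=y**2)  # sum of y_i^2 in cell g

# Point estimates: normal equations written in terms of cell statistics.
bread = np.linalg.inv(cells.T @ (n_g[:, None] * cells))
beta_comp = bread @ (cells.T @ sum_y)

# Per-cell residual sum of squares needs only n_g, sum_y, and sum_y2.
fitted = cells @ beta_comp
rss_g = sum_y2 - 2.0 * fitted * sum_y + n_g * fitted**2

# HC1 covariance rebuilt from the compressed table.
meat_comp = cells.T @ (rss_g[:, None] * cells)
vcov_comp = n / (n - k) * bread @ meat_comp @ bread

print(cells.shape[0], "cells instead of", n, "rows")
print(np.allclose(beta_raw, beta_comp), np.allclose(vcov_raw, vcov_comp))
```

The sum of squared outcomes is what makes the inference part work: expanding \(\sum_{i \in g} (y_i - x_g'\beta)^2\) gives exactly the rss_g expression, so the heteroskedasticity-robust "meat" never requires row-level residuals.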

For GLMs, the same idea works because grouped binomial, Poisson, and multinomial likelihoods are functions of counts and sums. For panel and DML estimators, prepare_data() creates generated regressors or cross-products before compression.
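
For the GLM case, a small NumPy sketch of grouped Fisher scoring for a binomial logit illustrates the point. It is written under assumptions and is not duckreg's code: the function name fisher_scoring_logit, the simulated data, and the fixed iteration count are all hypothetical. Row-level data is treated as the special case where every cell has one trial.

```python
import numpy as np

def fisher_scoring_logit(X, successes, trials, n_iter=25):
    """Fisher scoring for a logit model on grouped binomial data.

    Each row of X is one cell with `trials` observations, of which
    `successes` have outcome 1.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        score = X.T @ (successes - trials * p)
        info = X.T @ ((trials * p * (1.0 - p))[:, None] * X)
        beta = beta + np.linalg.solve(info, score)
    return beta

rng = np.random.default_rng(1)

# Simulated binary outcome with two binary covariates.
n = 2000
d = rng.integers(0, 2, size=n)
f = rng.integers(0, 2, size=n)
X = np.column_stack([np.ones(n), d, f])
p_true = 1.0 / (1.0 + np.exp(-(-0.5 + 1.0 * d + 0.7 * f)))
y = (rng.random(n) < p_true).astype(float)

# Row-level fit: every observation is its own cell with one trial.
beta_raw = fisher_scoring_logit(X, y, np.ones(n))

# Compressed fit: per-cell counts and sums of the outcome.
cells, inverse = np.unique(X, axis=0, return_inverse=True)
inverse = inverse.ravel()
n_g = np.bincount(inverse).astype(float)
sum_y = np.bincount(inverse, weights=y)
beta_comp = fisher_scoring_logit(cells, sum_y, n_g)

print(np.allclose(beta_raw, beta_comp))
```

The score and information matrix depend on the data only through the per-cell counts and outcome sums, so both fits follow the same sequence of iterates up to floating-point error.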

Repository Examples

The examples page condenses the existing notebooks:

| Notebook | Covered material |
|---|---|
| notebooks/introduction.ipynb | Basic DuckRegression, HC1 standard errors, cluster bootstrap, multiple outcomes, Mundlak, double demeaning. |
| notebooks/event_study.ipynb | Static Mundlak specifications and dynamic event studies. |
| notebooks/duckdml.ipynb | Compressed leave-one-out DML. |
| notebooks/regularized.ipynb | Ridge compression, lambda paths, and cross-validation. |
| notebooks/glm.ipynb | Logistic, Poisson, multinomial logit, and many-label Poisson decomposition. |

Current Caveats

Some APIs are newer than the original linear-regression core. In the current branch, GLM bootstrap methods return empty arrays rather than implemented resampling estimators, so use fit_vcov() for GLM model-based covariance and keep n_bootstraps=0 unless the bootstrap path is updated.

DuckRidge intentionally leaves bootstrap standard errors unimplemented. Ridge inference needs a separate treatment that accounts for the bias introduced by regularization.