duckreg Documentation

Compressed out-of-memory regressions with DuckDB.

duckreg estimates regressions by pushing the data-heavy part of the problem into DuckDB. The library compresses raw rows into sufficient statistics with SQL aggregation, loads the much smaller grouped table into memory, and then solves weighted linear, generalized linear, panel, double machine learning (DML), or ridge problems on that table.

The shared estimator lifecycle is:

prepare_data() creates any derived design tables.
compress_data() runs SQL aggregation.
estimate() solves the estimator on the compressed table.
bootstrap() runs when n_bootstraps > 0.
summary() returns point estimates and, when available, standard errors.

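from duckreg.estimators import DuckRegression

# Compressed OLS of Y on D, f1, and f2, read from table "data" in large_dataset.db.
model = DuckRegression(
    db_name="large_dataset.db",
    table_name="data",
    formula="Y ~ D + f1 + f2",
    cluster_col="",
    n_bootstraps=0,  # bootstrap() only runs when n_bootstraps > 0
    seed=42,
)
model.fit()
model.fit_vcov()
model.summary()  # point estimates and, when available, standard errors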
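
To make the compress_data() step in the lifecycle concrete, here is a minimal sketch of the kind of GROUP BY aggregation it implies. It is not duckreg's actual query: the table contents are made up, the column names only mirror the formula above, and Python's built-in sqlite3 stands in for DuckDB so the snippet runs without extra dependencies.

```python
import sqlite3

# Stand-in for DuckDB (assumption for illustration): the GROUP BY is plain SQL.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE data (Y REAL, D INTEGER, f1 INTEGER)")
rows = [
    (1.0, 0, 0), (2.0, 0, 0), (1.5, 0, 1),
    (3.0, 1, 0), (3.5, 1, 0), (4.0, 1, 1), (5.0, 1, 1),
]
con.executemany("INSERT INTO data VALUES (?, ?, ?)", rows)

# One output row per distinct covariate cell (D, f1), carrying the
# count, the sum of Y, and the sum of squared Y for that cell.
compressed = con.execute(
    """
    SELECT D, f1,
           COUNT(*)   AS n,
           SUM(Y)     AS sum_y,
           SUM(Y * Y) AS sum_y_sq
    FROM data
    GROUP BY D, f1
    ORDER BY D, f1
    """
).fetchall()

for cell in compressed:
    print(cell)
```

Seven raw rows collapse to four cells here. On real data the saving depends on how many distinct covariate combinations exist relative to the row count.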

API Map

Object	Module	Use
`DuckReg`	`duckreg.duckreg`	Abstract lifecycle and common database connection handling.
`DuckRegression`	`duckreg.estimators`	Compressed OLS over discrete covariate cells, with optional fixed-effect demeaning.
`DuckMundlak`	`duckreg.estimators`	One-way or two-way Mundlak panel regression using generated unit and time means.
`DuckMundlakEventStudy`	`duckreg.estimators`	Cohort-by-time event-study design compressed before weighted least squares.
`DuckDoubleDemeaning`	`duckreg.estimators`	Two-way double-demeaned treatment estimator.
`DuckDML`	`duckreg.estimators`	Compressed leave-one-out partial linear estimator for discrete controls.
`DuckLogisticRegression`	`duckreg.estimators`	Compressed canonical logit via grouped Fisher scoring.
`DuckPoissonRegression`	`duckreg.estimators`	Compressed canonical Poisson via grouped Fisher scoring.
`DuckMultinomialLogisticRegression`	`duckreg.estimators`	Exact baseline-category multinomial logit for moderate label counts.
`DuckPoissonMultinomialRegression`	`duckreg.estimators`	Label-wise Poisson decomposition for many count labels.
`DuckRidge`	`duckreg.regularized`	Compressed ridge regression, lambda paths, and cross-validation.
`demean`	`duckreg.demean`	Numba alternating-projection demeaning for fixed effects.
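
For readers unfamiliar with alternating-projection demeaning, the idea can be sketched in a few lines of plain Python. This is an illustration of the general technique only, not the Numba implementation in duckreg.demean; the function names, tolerance, and toy panel are all assumptions made for the example.

```python
def group_means(values, ids):
    """Mean of `values` within each group label in `ids`."""
    totals, counts = {}, {}
    for v, g in zip(values, ids):
        totals[g] = totals.get(g, 0.0) + v
        counts[g] = counts.get(g, 0) + 1
    return {g: totals[g] / counts[g] for g in totals}


def demean_two_way(values, unit_ids, time_ids, tol=1e-12, max_iter=10_000):
    """Alternately subtract unit means and time means until both are ~0."""
    resid = [float(v) for v in values]
    for _ in range(max_iter):
        largest = 0.0
        for ids in (unit_ids, time_ids):
            means = group_means(resid, ids)
            resid = [r - means[g] for r, g in zip(resid, ids)]
            largest = max(largest, max(abs(m) for m in means.values()))
        if largest < tol:  # nothing left to remove in either dimension
            break
    return resid


# Unbalanced toy panel: unit "b" is not observed in period 3.
units = ["a", "a", "a", "b", "b", "c", "c", "c"]
times = [1, 2, 3, 1, 2, 1, 2, 3]
y = [1.0, 2.5, 2.0, 4.0, 3.0, 0.5, 1.5, 3.5]

resid = demean_two_way(y, units, times)
print(group_means(resid, units))
print(group_means(resid, times))
```

In a balanced panel one pass over each dimension is enough. The iteration matters for unbalanced panels like the one above, where removing time means reintroduces small unit means.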

When To Use Which Estimator

Problem	Preferred class
Saturated linear regression with discrete regressors	`DuckRegression`
Linear regression with fixed effects in the formula	`DuckRegression` with `Y ~ X | fe1 + fe2`
Panel treatment effects with unit and time means	`DuckMundlak`
Event-study coefficients by cohort and calendar time	`DuckMundlakEventStudy`
Two-way fixed-effect style residualized treatment	`DuckDoubleDemeaning`
Partial linear model with high-dimensional discrete controls	`DuckDML`
Binary outcome	`DuckLogisticRegression`
Count outcome	`DuckPoissonRegression`
Moderate number of unordered labels	`DuckMultinomialLogisticRegression`
Many count labels, such as token counts	`DuckPoissonMultinomialRegression`
Regularized linear model over compressed covariate cells	`DuckRidge`

Core Constraint

Compression is exact when the variables needed by the estimator are constant or aggregable inside the grouping cells. The canonical example is OLS with discrete covariates. If all rows in cell \(g\) have covariates \(x_g\), then the estimator only needs

\[ n_g,\qquad \sum_{i \in g} y_i,\qquad \sum_{i \in g} y_i^2 \]

for point estimates and analytic HC1-style inference.
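
As a quick check of this claim, the sketch below fits a one-regressor OLS model twice: once on raw rows, and once on cell means weighted by \(n_g\). It then recovers the residual sum of squares from the three cell statistics alone. The data and helper names are made up for illustration, and this is not duckreg's code path.

```python
# Raw rows (x, y) with a discrete regressor x taking three values.
rows = [(0, 1.0), (0, 2.0), (0, 1.5), (1, 3.0), (1, 3.5),
        (2, 4.0), (2, 5.0), (2, 6.0)]


def wls_line(xs, ys, ws):
    """Weighted least squares for y = a + b * x; returns (a, b)."""
    sw = sum(ws)
    xbar = sum(w * x for w, x in zip(ws, xs)) / sw
    ybar = sum(w * y for w, y in zip(ws, ys)) / sw
    sxy = sum(w * (x - xbar) * (y - ybar) for w, x, y in zip(ws, xs, ys))
    sxx = sum(w * (x - xbar) ** 2 for w, x in zip(ws, xs))
    b = sxy / sxx
    return ybar - b * xbar, b


# 1) Row-level fit with unit weights.
a_raw, b_raw = wls_line([x for x, _ in rows], [y for _, y in rows],
                        [1.0] * len(rows))

# 2) Compress to (n_g, sum of y, sum of y^2) per cell.
cells = {}
for x, y in rows:
    n, s, ss = cells.get(x, (0, 0.0, 0.0))
    cells[x] = (n + 1, s + y, ss + y * y)

# 3) Fit on cell means, weighting each cell by its row count.
xs = list(cells)
a_cmp, b_cmp = wls_line(xs,
                        [cells[x][1] / cells[x][0] for x in xs],
                        [float(cells[x][0]) for x in xs])

# 4) Residual sum of squares: row-level versus compressed.
#    Per cell, sum_i (y_i - f)^2 = ss - 2 * f * s + n * f^2.
rss_raw = sum((y - (a_raw + b_raw * x)) ** 2 for x, y in rows)
rss_cmp = 0.0
for x, (n, s, ss) in cells.items():
    fit = a_cmp + b_cmp * x
    rss_cmp += ss - 2.0 * fit * s + n * fit * fit

print(a_raw, b_raw, rss_raw)
print(a_cmp, b_cmp, rss_cmp)
```

The two printed lines agree up to floating-point error. The per-cell squared-residual sum in step 4 is the quantity that the \(\sum y_i^2\) statistic makes available without revisiting the raw rows.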

For GLMs, the same idea works because grouped binomial, Poisson, and multinomial likelihoods are functions of counts and sums. For panel and DML estimators, prepare_data() creates generated regressors or cross-products before compression.
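
The GLM case can be illustrated with a logit log-likelihood evaluated two ways: row by row, and from per-cell counts and outcome sums. The sketch below uses made-up data and arbitrary parameter values; it shows only the likelihood identity, not the grouped Fisher scoring that the estimators use.

```python
import math

# Raw rows (x, y) with a discrete regressor x and a binary outcome y.
rows = [(0, 1), (0, 0), (0, 0), (1, 1), (1, 1), (1, 0), (2, 1), (2, 1)]


def prob(a, b, x):
    """Logit success probability at covariate value x."""
    return 1.0 / (1.0 + math.exp(-(a + b * x)))


def loglik_rows(a, b):
    """Bernoulli log-likelihood summed over individual rows."""
    total = 0.0
    for x, y in rows:
        p = prob(a, b, x)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total


# Compress to (n_g, number of successes) per cell.
cells = {}
for x, y in rows:
    n, s = cells.get(x, (0, 0))
    cells[x] = (n + 1, s + y)


def loglik_cells(a, b):
    """Same log-likelihood computed from cell counts and sums only."""
    total = 0.0
    for x, (n, s) in cells.items():
        p = prob(a, b, x)
        total += s * math.log(p) + (n - s) * math.log(1.0 - p)
    return total


a, b = -0.5, 0.8  # arbitrary evaluation point
print(loglik_rows(a, b), loglik_cells(a, b))
```

Because the two functions agree at every parameter value, any optimizer that maximizes one also maximizes the other, so the fit can run on the compressed table.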

Repository Examples

The examples page condenses the existing notebooks:

Notebook	Covered material
`notebooks/introduction.ipynb`	Basic `DuckRegression`, HC1 standard errors, cluster bootstrap, multiple outcomes, Mundlak, double demeaning.
`notebooks/event_study.ipynb`	Static Mundlak specifications and dynamic event studies.
`notebooks/duckdml.ipynb`	Compressed leave-one-out DML.
`notebooks/regularized.ipynb`	Ridge compression, lambda paths, and cross-validation.
`notebooks/glm.ipynb`	Logistic, Poisson, multinomial logit, and many-label Poisson decomposition.

Current Caveats

Some APIs are newer than the original linear-regression core. In the current branch, the GLM bootstrap methods return empty arrays instead of resampled estimates. For GLMs, use fit_vcov() to get the model-based covariance, and keep n_bootstraps=0 until the bootstrap path is implemented.

DuckRidge intentionally leaves bootstrap standard errors unimplemented. Ridge inference needs a separate bias- and regularization-aware treatment.