duckreg ibis-backend review demo

Backend-neutral compressed regression via Ibis expressions

The implemented spec keeps the existing Duck* API intact and adds a new DB* family that writes compression and design-matrix logic once in Ibis, then lets Ibis compile/execute it on DuckDB or other supported backends.

Branch: ibis-backend
New module: duckreg/dbreg.py
Test gate: 19 passed

Problem
The branch had Ibis connection plumbing, but core estimators still assembled DuckDB-flavored SQL strings, including DuckDB-specific unnest(?) bootstrap queries.
Change
Add new DB* estimators whose compression/design matrices are Ibis expression trees, not SQL templates.
Compatibility
No existing Duck* class is removed or changed. Users can opt into DBRegression, DBDML, etc.

Main implemented spec

Spec item | Implementation
Backwards-compatible API | Existing DuckRegression, DuckDML, DuckMundlak, DuckDoubleDemeaning remain available. New classes are exported alongside them.
New DBReg family | DBReg, DBRegression, DBDML, DBMundlak, DBDoubleDemeaning, plus Db* aliases.
Ibis-native compression | Compression queries use table.group_by(...).aggregate(...), expression arithmetic, joins, cross joins, and filters.
Design matrix materialization once | Panel transformations like Mundlak means and double-demeaning build a reusable Ibis design_matrix expression before compression.
No DuckDB-only bootstrap path in DB* | Cluster bootstrap resampling is done from backend-neutral grouped sufficient stats in pandas after one Ibis grouped materialization, avoiding unnest(?).
Parity tests | New tests compare DB* estimates to existing Duck* estimates on DuckDB-backed Ibis connections.
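
The cluster bootstrap row above can be sketched in a few lines. This is a minimal illustration, not the code in duckreg/dbreg.py: it assumes a single regressor x with an intercept, keeps count and sum_y per (cluster, x) cell as one grouped materialization would, and resamples whole clusters with replacement in plain Python rather than pandas. The function names and the toy data are hypothetical.

```python
import random
import statistics
from collections import defaultdict


def compress_by_cluster(rows):
    """Group (cluster, x, y) rows into {cluster: {x: [count, sum_y]}}."""
    out = defaultdict(lambda: defaultdict(lambda: [0, 0.0]))
    for cluster, x, y in rows:
        cell = out[cluster][x]
        cell[0] += 1
        cell[1] += y
    return out


def slope_from_clusters(stats, clusters):
    """OLS slope of y ~ 1 + x using only the grouped sums of the drawn clusters."""
    n = sx = sxx = sy = sxy = 0.0
    for c in clusters:  # a cluster drawn twice contributes twice
        for x, (cnt, sum_y) in stats[c].items():
            n += cnt
            sx += x * cnt
            sxx += x * x * cnt
            sy += sum_y
            sxy += x * sum_y
    denom = n * sxx - sx * sx
    return (n * sxy - sx * sy) / denom if denom else float("nan")


def cluster_bootstrap(stats, n_boot, seed):
    """Resample cluster ids with replacement; no raw rows are touched."""
    rng = random.Random(seed)
    ids = sorted(stats)
    return [
        slope_from_clusters(stats, rng.choices(ids, k=len(ids)))
        for _ in range(n_boot)
    ]


# Toy data: four clusters, each observed at x = 0 and x = 1.
rows = [
    ("a", 0, 1.0), ("a", 1, 3.0),
    ("b", 0, 2.0), ("b", 1, 3.5),
    ("c", 0, 0.5), ("c", 1, 3.0),
    ("d", 0, 1.5), ("d", 1, 4.5),
]
stats = compress_by_cluster(rows)
full_slope = slope_from_clusters(stats, sorted(stats))
draws = cluster_bootstrap(stats, n_boot=200, seed=42)
boot_se = statistics.stdev(draws)  # bootstrap standard error of the slope
```

Because each draw only re-sums per-cluster aggregates, the database is queried once and nothing backend-specific such as unnest is needed.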

Intended user-facing API

Old path still works

from duckreg import DuckRegression

model = DuckRegression(
    db_name="analysis.duckdb",
    table_name="trips",
    formula="fare ~ treatment + hour",
    cluster_col="driver_id",
    seed=42,
)
model.fit()

New backend-neutral path

import ibis
from duckreg import DBRegression

con = ibis.duckdb.connect("analysis.duckdb")
# Later: ibis.postgres.connect(...), ibis.bigquery.connect(...), etc.

model = DBRegression(
    db_name=None,
    connection=con,
    table_name="trips",
    formula="fare ~ treatment + hour",
    cluster_col="driver_id",
    seed=42,
)
model.fit()

Core compression query: written once as Ibis

The linear-regression compression now has a backend-neutral expression constructor. This is the important replacement for string SQL:

def compression_expr(self, table=None, include_cluster: bool = False):
    # Default to the estimator's own table expression.
    table = self.table_expr() if table is None else table
    # Group by the strata columns, optionally also by the cluster column.
    group_cols = list(self.strata_cols)
    if include_cluster and self.cluster_col:
        group_cols.append(self.cluster_col)

    # Per-group sufficient statistics: row count, plus the sum and
    # sum of squares of every outcome variable.
    metrics = {"count": table.count()}
    for var in self.outcome_vars:
        metrics[f"sum_{var}"] = table[var].sum()
        metrics[f"sum_{var}_sq"] = (table[var] * table[var]).sum()

    return table.group_by(group_cols).aggregate(**metrics)

Materialization stays explicit:

self.compression = self.compression_expr()
self.df_compressed = self.execute_expr(self.compression)
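
To show why count, sum_y and sum_y_sq are enough, here is a small sketch that fits y ~ 1 + x from the compressed groups alone and recovers the residual sum of squares without revisiting the raw rows. It is an illustration under simplifying assumptions (one discrete regressor, plain Python instead of Ibis and pandas, hypothetical function names), not the estimator code itself.

```python
from collections import defaultdict


def compress(rows):
    """Group (x, y) rows by x and keep count, sum_y, sum_y_sq per group."""
    stats = defaultdict(lambda: [0, 0.0, 0.0])
    for x, y in rows:
        s = stats[x]
        s[0] += 1
        s[1] += y
        s[2] += y * y
    return [(x, n, sy, syy) for x, (n, sy, syy) in sorted(stats.items())]


def fit_compressed(groups):
    """Solve the normal equations for y ~ 1 + x from grouped sums only."""
    n = sum(cnt for _, cnt, _, _ in groups)
    sx = sum(x * cnt for x, cnt, _, _ in groups)
    sxx = sum(x * x * cnt for x, cnt, _, _ in groups)
    sy = sum(sum_y for _, _, sum_y, _ in groups)
    sxy = sum(x * sum_y for x, _, sum_y, _ in groups)
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return intercept, slope


def rss_compressed(groups, intercept, slope):
    """Residual sum of squares recovered from sum_y and sum_y_sq.

    Within a group the fitted value f is constant, so
    sum((y - f)^2) = sum_y_sq - 2 * f * sum_y + count * f^2.
    """
    total = 0.0
    for x, cnt, sum_y, sum_y_sq in groups:
        fitted = intercept + slope * x
        total += sum_y_sq - 2.0 * fitted * sum_y + cnt * fitted * fitted
    return total


# Seven raw rows collapse to three groups (x = 0, 1, 2).
rows = [(0, 1.0), (0, 2.0), (1, 2.5), (1, 3.5), (2, 5.0), (2, 6.0), (2, 7.0)]
groups = compress(rows)
intercept, slope = fit_compressed(groups)
rss = rss_compressed(groups, intercept, slope)
```

The sum-of-squares column is what makes residual-based quantities available after compression; without it only the point estimates could be computed from the grouped frame.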

Panel design matrix example

DBMundlak computes unit/time averages using Ibis aggregations and joins, then compresses the resulting expression. No temp-table SQL is needed.

t = self.table_expr()
# Per-unit covariate means, joined back onto every row.
unit_avgs = t.group_by(self.unit_col).aggregate(
    **{f"avg_{cov}_unit": t[cov].mean() for cov in self.covariates}
)
design = t.join(unit_avgs, self.unit_col)

# Optional per-period covariate means, joined the same way.
if self.time_col is not None:
    time_avgs = t.group_by(self.time_col).aggregate(
        **{f"avg_{cov}_time": t[cov].mean() for cov in self.covariates}
    )
    design = design.join(time_avgs, self.time_col)

self.design_matrix = design.select([...])
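
To make the shape of the resulting design matrix concrete, the same group-mean-and-join-back logic can be mimicked on an in-memory list of rows. This sketch uses no Ibis; the helper names and the columns unit, time and x are illustrative assumptions, chosen only to mirror the avg_{cov}_unit and avg_{cov}_time naming above.

```python
from collections import defaultdict


def group_means(rows, key, col):
    """Mean of `col` within each value of `key` (the group_by/aggregate step)."""
    totals = defaultdict(lambda: [0.0, 0])
    for row in rows:
        t = totals[row[key]]
        t[0] += row[col]
        t[1] += 1
    return {k: s / n for k, (s, n) in totals.items()}


def mundlak_design(rows, unit_col, time_col, covariates):
    """Attach avg_<cov>_unit (and avg_<cov>_time, if requested) to every row."""
    design = [dict(row) for row in rows]  # copy so the input stays untouched
    for cov in covariates:
        unit_avg = group_means(rows, unit_col, cov)
        time_avg = group_means(rows, time_col, cov) if time_col is not None else None
        for row in design:
            row[f"avg_{cov}_unit"] = unit_avg[row[unit_col]]
            if time_avg is not None:
                row[f"avg_{cov}_time"] = time_avg[row[time_col]]
    return design


# Balanced toy panel: two units observed in two periods.
panel = [
    {"unit": "u1", "time": 1, "x": 1.0},
    {"unit": "u1", "time": 2, "x": 3.0},
    {"unit": "u2", "time": 1, "x": 5.0},
    {"unit": "u2", "time": 2, "x": 7.0},
]
design = mundlak_design(panel, "unit", "time", ["x"])
unit_only = mundlak_design(panel, "unit", None, ["x"])
```

The row count is unchanged by the joins; only columns are added, which is why the result can be fed straight into the same compression step as an untransformed table.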

Implemented estimators

Class | Status | Notes
DBRegression | Implemented | Compressed WLS; HC1 vcov; backend-neutral grouped cluster bootstrap materialization.
DBDML | Implemented | Leave-one-out sufficient-stat compression for discrete covariates.
DBMundlak | Implemented | Ibis design matrix for unit/time averages; cluster bootstrap path implemented.
DBDoubleDemeaning | Implemented | Ibis mean joins + cross join for overall mean; point estimate implemented.
DBMundlakEventStudy | Not yet ported | Existing event-study SQL CTE path is more involved; should be next if this direction looks right.
DB* GLMs | Not yet ported | Straightforward follow-up: same grouped sufficient-stat pattern, but current implementation leaves existing Duck* GLMs alone.
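
For the DBDoubleDemeaning row, the "mean joins + cross join for overall mean" description corresponds to the standard two-way within transformation: subtract the unit mean and the time mean, then add back the overall mean. The sketch below shows that arithmetic on a toy panel in plain Python. It assumes this standard form and uses hypothetical helper and column names; it does not reproduce the Ibis expression in the branch.

```python
from collections import defaultdict


def means_by(rows, key, col):
    """Mean of `col` within each value of `key`."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        sums[row[key]] += row[col]
        counts[row[key]] += 1
    return {k: sums[k] / counts[k] for k in sums}


def double_demean(rows, unit_col, time_col, col):
    """Return col - unit mean - time mean + overall mean for every row."""
    unit_avg = means_by(rows, unit_col, col)   # joined on the unit key
    time_avg = means_by(rows, time_col, col)   # joined on the time key
    overall = sum(r[col] for r in rows) / len(rows)  # single value, cross-joined
    return [
        r[col] - unit_avg[r[unit_col]] - time_avg[r[time_col]] + overall
        for r in rows
    ]


# Balanced toy panel: two units, two periods.
panel = [
    {"unit": "u1", "time": 1, "y": 1.0},
    {"unit": "u1", "time": 2, "y": 2.0},
    {"unit": "u2", "time": 1, "y": 4.0},
    {"unit": "u2", "time": 2, "y": 7.0},
]
y_dd = double_demean(panel, "unit", "time", "y")
```

The overall mean is a one-row aggregate, which is why a cross join is the natural way to broadcast it onto every row of the panel.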

Review checklist

Local validation

$ uv run pytest -q
19 passed, 48 warnings in 19.02s