duckreg ibis-backend review demo
Backend-neutral compressed regression via Ibis expressions
The implemented spec keeps the existing Duck* API intact and adds a new DB* family that writes compression and design-matrix logic once in Ibis, then lets Ibis compile/execute it on DuckDB or other supported backends.
Branch: ibis-backend
New module: duckreg/dbreg.py
Test gate: 19 passed
Problem
The branch had Ibis connection plumbing, but core estimators still assembled DuckDB-flavored SQL strings, including DuckDB-specific unnest(?) bootstrap queries.
Change
Add new DB* estimators whose compression/design matrices are Ibis expression trees, not SQL templates.
Compatibility
No existing Duck* class is removed or changed. Users can opt into DBRegression, DBDML, etc.
Main implemented spec
| Spec item | Implementation |
|---|---|
| Backwards-compatible API | Existing DuckRegression, DuckDML, DuckMundlak, DuckDoubleDemeaning remain available. New classes are exported alongside them. |
| New DBReg family | DBReg, DBRegression, DBDML, DBMundlak, DBDoubleDemeaning, plus Db* aliases. |
| Ibis-native compression | Compression queries use table.group_by(...).aggregate(...), expression arithmetic, joins, cross joins, and filters. |
| Design matrix materialization once | Panel transformations like Mundlak means and double-demeaning build a reusable Ibis design_matrix expression before compression. |
| No DuckDB-only bootstrap path in DB* | Cluster bootstrap resampling is done from backend-neutral grouped sufficient stats in pandas after one Ibis grouped materialization, avoiding unnest(?). |
| Parity tests | New tests compare DB* estimates to existing Duck* estimates on DuckDB-backed Ibis connections. |
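To make the bootstrap row concrete, here is a minimal sketch of cluster resampling done entirely on grouped sufficient statistics. It uses plain Python dicts rather than pandas to stay self-contained, and the names `slope_from_stats`, `pool`, and `cluster_bootstrap` are hypothetical, not functions from duckreg/dbreg.py. It assumes a single regressor `x` and per-(cluster, stratum) stats of `(count, sum_y)`.

```python
import random
from collections import defaultdict


def slope_from_stats(stats):
    """Count-weighted slope of y on x from {x: (count, sum_y)}."""
    n = sum(c for c, _ in stats.values())
    mx = sum(c * x for x, (c, _) in stats.items()) / n
    my = sum(sy for _, sy in stats.values()) / n
    sxx = sum(c * (x - mx) ** 2 for x, (c, _) in stats.items())
    sxy = sum((x - mx) * (sy - c * my) for x, (c, sy) in stats.items())
    return sxy / sxx


def pool(cluster_stats, cluster_ids):
    """Add up the per-stratum stats of the chosen clusters (with repeats)."""
    pooled = defaultdict(lambda: [0, 0.0])
    for cid in cluster_ids:
        for x, (c, sy) in cluster_stats[cid].items():
            pooled[x][0] += c
            pooled[x][1] += sy
    return pooled


def cluster_bootstrap(cluster_stats, n_boot, seed):
    """Resample whole clusters with replacement; refit on pooled stats."""
    rng = random.Random(seed)
    ids = sorted(cluster_stats)
    return [
        slope_from_stats(pool(cluster_stats, rng.choices(ids, k=len(ids))))
        for _ in range(n_boot)
    ]


# Illustrative data: every cluster covers both x values, so each resample
# has variation in x and the slope is always defined.
cluster_stats = {
    "a": {0: (3, 3.0), 1: (2, 5.0)},
    "b": {0: (1, 0.5), 1: (4, 9.0)},
    "c": {0: (2, 2.5), 1: (2, 3.5)},
}
point = slope_from_stats(pool(cluster_stats, sorted(cluster_stats)))
draws = cluster_bootstrap(cluster_stats, n_boot=200, seed=42)
```

Because only the grouped table is touched after materialization, nothing in the resampling loop depends on the backend that produced it.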
Intended user-facing API
Old path still works
from duckreg import DuckRegression

model = DuckRegression(
    db_name="analysis.duckdb",
    table_name="trips",
    formula="fare ~ treatment + hour",
    cluster_col="driver_id",
    seed=42,
)
model.fit()
New backend-neutral path
import ibis
from duckreg import DBRegression

con = ibis.duckdb.connect("analysis.duckdb")
# Later: ibis.postgres.connect(...), ibis.bigquery.connect(...), etc.
model = DBRegression(
    db_name=None,
    connection=con,
    table_name="trips",
    formula="fare ~ treatment + hour",
    cluster_col="driver_id",
    seed=42,
)
model.fit()
Core compression query: written once as Ibis
The linear-regression compression now has a backend-neutral expression constructor. This is the important replacement for string SQL:
def compression_expr(self, table=None, include_cluster: bool = False):
    table = self.table_expr() if table is None else table
    group_cols = list(self.strata_cols)
    if include_cluster and self.cluster_col:
        group_cols.append(self.cluster_col)
    metrics = {"count": table.count()}
    for var in self.outcome_vars:
        metrics[f"sum_{var}"] = table[var].sum()
        metrics[f"sum_{var}_sq"] = (table[var] * table[var]).sum()
    return table.group_by(group_cols).aggregate(**metrics)
Materialization stays explicit:
self.compression = self.compression_expr()
self.df_compressed = self.execute_expr(self.compression)
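As a sanity check on why `count`, `sum_{var}`, and `sum_{var}_sq` are enough, the sketch below compresses a toy dataset and recovers the row-level OLS fit and residual sum of squares from the grouped stats alone. It is plain Python with one regressor, and every function name here is hypothetical, not part of duckreg.

```python
from collections import defaultdict


def compress(rows):
    """Group (x, y) rows by x, keeping (count, sum_y, sum_y_sq) per group."""
    stats = defaultdict(lambda: [0, 0.0, 0.0])
    for x, y in rows:
        s = stats[x]
        s[0] += 1
        s[1] += y
        s[2] += y * y
    return {x: tuple(s) for x, s in stats.items()}


def ols(rows):
    """Intercept and slope of y on x from row-level data."""
    n = len(rows)
    mx = sum(x for x, _ in rows) / n
    my = sum(y for _, y in rows) / n
    sxx = sum((x - mx) ** 2 for x, _ in rows)
    sxy = sum((x - mx) * (y - my) for x, y in rows)
    slope = sxy / sxx
    return my - slope * mx, slope


def wls_from_compressed(stats):
    """Same fit from group sums, weighting each group by its count."""
    n = sum(c for c, _, _ in stats.values())
    mx = sum(c * x for x, (c, _, _) in stats.items()) / n
    my = sum(sy for _, sy, _ in stats.values()) / n
    sxx = sum(c * (x - mx) ** 2 for x, (c, _, _) in stats.items())
    sxy = sum((x - mx) * (sy - c * my) for x, (c, sy, _) in stats.items())
    slope = sxy / sxx
    return my - slope * mx, slope


def rss_from_compressed(stats, intercept, slope):
    """Residual sum of squares; this is where sum_y_sq is needed."""
    total = 0.0
    for x, (c, sy, syy) in stats.items():
        fit = intercept + slope * x
        total += syy - 2 * fit * sy + c * fit * fit
    return total


rows = [(0, 1.0), (0, 2.0), (1, 2.5), (1, 3.5), (1, 3.0), (2, 5.0), (2, 4.0)]
stats = compress(rows)
a_full, b_full = ols(rows)
a_comp, b_comp = wls_from_compressed(stats)
rss_full = sum((y - (a_full + b_full * x)) ** 2 for x, y in rows)
rss_comp = rss_from_compressed(stats, a_comp, b_comp)
```

The seven rows collapse to three groups, yet both the coefficients and the residual sum of squares match the row-level fit, which is what lets variance estimates be computed after compression.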
Panel design matrix example
DBMundlak computes unit/time averages using Ibis aggregations and joins, then compresses the resulting expression. No temp-table SQL is needed.
t = self.table_expr()
unit_avgs = t.group_by(self.unit_col).aggregate(
    **{f"avg_{cov}_unit": t[cov].mean() for cov in self.covariates}
)
design = t.join(unit_avgs, self.unit_col)
if self.time_col is not None:
    time_avgs = t.group_by(self.time_col).aggregate(
        **{f"avg_{cov}_time": t[cov].mean() for cov in self.covariates}
    )
    design = design.join(time_avgs, self.time_col)
self.design_matrix = design.select([...])
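For reviewers less familiar with the Mundlak device, the following plain-Python sketch shows what the joined design rows are expected to contain: each original row plus its unit-level and time-level covariate means. It mirrors the group-by and join steps above on a toy list of dicts; `group_means` and `mundlak_design` are hypothetical helpers, and the column names only follow the `avg_{cov}_unit` / `avg_{cov}_time` pattern shown in the snippet.

```python
from collections import defaultdict


def group_means(rows, key, col):
    """Mean of `col` within each value of `key`."""
    totals = defaultdict(lambda: [0.0, 0])
    for r in rows:
        t = totals[r[key]]
        t[0] += r[col]
        t[1] += 1
    return {k: s / n for k, (s, n) in totals.items()}


def mundlak_design(rows, unit_col, time_col, covariates):
    """Attach unit (and optionally time) averages to every row."""
    design = [dict(r) for r in rows]
    for cov in covariates:
        unit_avg = group_means(rows, unit_col, cov)
        time_avg = group_means(rows, time_col, cov) if time_col else None
        for r in design:
            r[f"avg_{cov}_unit"] = unit_avg[r[unit_col]]
            if time_avg is not None:
                r[f"avg_{cov}_time"] = time_avg[r[time_col]]
    return design


panel = [
    {"unit": "a", "t": 1, "x": 1.0},
    {"unit": "a", "t": 2, "x": 3.0},
    {"unit": "b", "t": 1, "x": 5.0},
    {"unit": "b", "t": 2, "x": 7.0},
]
design = mundlak_design(panel, "unit", "t", ["x"])
```

The averages take few distinct values, so the augmented rows remain compressible with the same grouped sufficient-stat pattern.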
Implemented estimators
| Class | Status | Notes |
|---|---|---|
| DBRegression | Implemented | Compressed WLS; HC1 vcov; backend-neutral grouped cluster bootstrap materialization. |
| DBDML | Implemented | Leave-one-out sufficient-stat compression for discrete covariates. |
| DBMundlak | Implemented | Ibis design matrix for unit/time averages; cluster bootstrap path implemented. |
| DBDoubleDemeaning | Implemented | Ibis mean joins + cross join for overall mean; point estimate implemented. |
| DBMundlakEventStudy | Not yet ported | Existing event-study SQL CTE path is more involved; should be next if this direction looks right. |
| DB* GLMs | Not yet ported | Straightforward follow-up: same grouped sufficient-stat pattern, but current implementation leaves existing Duck* GLMs alone. |
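The "mean joins + cross join for overall mean" note for DBDoubleDemeaning corresponds to the usual two-way within transformation: subtract the unit mean and the time mean, then add back the overall mean. The sketch below is a plain-Python illustration under that assumption; it does not reproduce the Ibis expression in the module, and `double_demean` is a hypothetical name.

```python
from collections import defaultdict


def group_means(rows, key, col):
    """Mean of `col` within each value of `key`."""
    totals = defaultdict(lambda: [0.0, 0])
    for r in rows:
        t = totals[r[key]]
        t[0] += r[col]
        t[1] += 1
    return {k: s / n for k, (s, n) in totals.items()}


def double_demean(rows, unit_col, time_col, col):
    """Return x - unit_mean - time_mean + overall_mean for each row."""
    unit_avg = group_means(rows, unit_col, col)
    time_avg = group_means(rows, time_col, col)
    # The overall mean is a single value attached to every row,
    # which is the role a cross join plays in the query version.
    overall = sum(r[col] for r in rows) / len(rows)
    return [
        r[col] - unit_avg[r[unit_col]] - time_avg[r[time_col]] + overall
        for r in rows
    ]


panel = [
    {"unit": "a", "t": 1, "x": 1.0},
    {"unit": "a", "t": 2, "x": 3.0},
    {"unit": "b", "t": 1, "x": 5.0},
    {"unit": "b", "t": 2, "x": 9.0},
]
x_dd = double_demean(panel, "unit", "t", "x")
```

On this balanced toy panel the transformed values are 0.5, -0.5, -0.5, 0.5, so they sum to zero within every unit and every period.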
Review checklist
- Does the naming feel right: DBRegression/DBDML, or should it be IbisRegression?
- Should bootstrap stay pandas-after-compression for portability, or should we add optional backend-specific resampling hooks?
- Should Duck* eventually subclass/alias DB* when the new path is mature, or stay separate permanently?
- Next likely ports: event study and GLMs.
Local validation
$ uv run pytest -q
19 passed, 48 warnings in 19.02s