duckreg ibis-backend review demo

Backend-neutral compressed regression via Ibis expressions

The implemented spec keeps the existing Duck* API intact and adds a new DB* family that writes compression and design-matrix logic once in Ibis, then lets Ibis compile/execute it on DuckDB or other supported backends.

Branch: ibis-backend
New module: duckreg/dbreg.py
Test gate: 19 passed

Problem
The branch had Ibis connection plumbing, but core estimators still assembled DuckDB-flavored SQL strings, including DuckDB-specific unnest(?) bootstrap queries.

Change
Add new DB* estimators whose compression/design matrices are Ibis expression trees, not SQL templates.

Compatibility
No existing Duck* class is removed or changed. Users can opt into DBRegression, DBDML, etc.

Main implemented spec

Spec item	Implementation
Backwards-compatible API	Existing `DuckRegression`, `DuckDML`, `DuckMundlak`, `DuckDoubleDemeaning` remain available. New classes are exported alongside them.
New DBReg family	`DBReg`, `DBRegression`, `DBDML`, `DBMundlak`, `DBDoubleDemeaning`, plus `Db*` aliases.
Ibis-native compression	Compression queries use `table.group_by(...).aggregate(...)`, expression arithmetic, joins, cross joins, and filters.
Design matrix materialization once	Panel transformations like Mundlak means and double-demeaning build a reusable Ibis `design_matrix` expression before compression.
No DuckDB-only bootstrap path in DB*	Cluster bootstrap resampling is done from backend-neutral grouped sufficient stats in pandas after one Ibis grouped materialization, avoiding `unnest(?)`.
Parity tests	New tests compare `DB*` estimates to existing `Duck*` estimates on DuckDB-backed Ibis connections.
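For reviewers who want to see why the bootstrap needs no backend-specific SQL, here is a minimal sketch of resampling clusters from grouped sufficient statistics. It is not the code in duckreg/dbreg.py: the helper names (`cluster_stats`, `slope_from_stats`, `bootstrap_slopes`) are hypothetical, the model is simplified to one regressor plus an intercept, and the grouping is done in plain Python rather than by an Ibis materialization.

```python
import random
import statistics


def cluster_stats(rows):
    """Collapse (cluster, x, y) rows into per-cluster sufficient statistics."""
    stats = {}
    for cluster, x, y in rows:
        s = stats.setdefault(cluster, [0.0, 0.0, 0.0, 0.0, 0.0])
        s[0] += 1.0    # n
        s[1] += x      # sum of x
        s[2] += y      # sum of y
        s[3] += x * x  # sum of x^2
        s[4] += x * y  # sum of x*y
    return stats


def slope_from_stats(stat_list):
    """OLS slope of y on x (with intercept) from summed sufficient statistics."""
    n, sx, sy, sxx, sxy = (sum(col) for col in zip(*stat_list))
    return (n * sxy - sx * sy) / (n * sxx - sx * sx)


def bootstrap_slopes(stats, n_boot=200, seed=42):
    """Resample whole clusters with replacement; no row-level data is touched."""
    rng = random.Random(seed)
    clusters = list(stats)
    draws = []
    for _ in range(n_boot):
        sample = [stats[rng.choice(clusters)] for _ in clusters]
        draws.append(slope_from_stats(sample))
    return draws


# Synthetic data (illustrative only): 10 clusters, true slope 2.0.
data_rng = random.Random(0)
rows = []
for c in range(10):
    for i in range(20):
        x = float(i % 2)
        rows.append((c, x, 1.0 + 2.0 * x + data_rng.gauss(0, 1)))

stats = cluster_stats(rows)                    # one grouped pass over the data
full = slope_from_stats(list(stats.values()))  # point estimate
draws = bootstrap_slopes(stats)                # bootstrap distribution
se = statistics.stdev(draws)                   # bootstrap standard error
print(f"slope={full:.3f} se={se:.3f}")
```

Once the per-cluster rows are materialized, every bootstrap replicate is a sum over resampled clusters, so the backend only ever sees a single grouped query.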

Intended user-facing API

Old path still works

from duckreg import DuckRegression

model = DuckRegression(
    db_name="analysis.duckdb",
    table_name="trips",
    formula="fare ~ treatment + hour",
    cluster_col="driver_id",
    seed=42,
)
model.fit()

New backend-neutral path

import ibis
from duckreg import DBRegression

con = ibis.duckdb.connect("analysis.duckdb")
# Later: ibis.postgres.connect(...), ibis.bigquery.connect(...), etc.

model = DBRegression(
    db_name=None,
    connection=con,
    table_name="trips",
    formula="fare ~ treatment + hour",
    cluster_col="driver_id",
    seed=42,
)
model.fit()

Core compression query: written once as Ibis

The linear-regression compression now has a backend-neutral expression constructor, which replaces the string-SQL templates:

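def compression_expr(self, table=None, include_cluster: bool = False):
    """Build the grouped sufficient-statistics expression (lazy; executed separately)."""
    table = self.table_expr() if table is None else table
    # Group by the strata, plus the cluster column when cluster-level stats are needed.
    group_cols = list(self.strata_cols)
    if include_cluster and self.cluster_col:
        group_cols.append(self.cluster_col)

    # Per group: row count, plus the sum and sum of squares of each outcome.
    metrics = {"count": table.count()}
    for var in self.outcome_vars:
        metrics[f"sum_{var}"] = table[var].sum()
        metrics[f"sum_{var}_sq"] = (table[var] * table[var]).sum()

    return table.group_by(group_cols).aggregate(**metrics)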

Materialization stays explicit:

self.compression = self.compression_expr()
self.df_compressed = self.execute_expr(self.compression)
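To make the role of `count`, `sum_{var}`, and `sum_{var}_sq` concrete, here is a small sketch of fitting from compressed rows. It assumes a single discrete regressor that doubles as the stratum, and the helper names `compress` and `fit_compressed` are hypothetical, not taken from duckreg/dbreg.py. The point is that weighted least squares on group means reproduces the row-level OLS fit, and that the sum of squares is what lets the residual sum of squares be recovered without the raw rows.

```python
from collections import defaultdict


def compress(rows):
    """Group (x, y) rows by x into count / sum_y / sum_y_sq."""
    groups = defaultdict(lambda: {"count": 0, "sum_y": 0.0, "sum_y_sq": 0.0})
    for x, y in rows:
        g = groups[x]
        g["count"] += 1
        g["sum_y"] += y
        g["sum_y_sq"] += y * y
    return dict(groups)


def fit_compressed(groups):
    """WLS of group means on x with weights equal to the group counts."""
    n = sum(g["count"] for g in groups.values())
    sx = sum(x * g["count"] for x, g in groups.items())
    sy = sum(g["sum_y"] for g in groups.values())
    sxx = sum(x * x * g["count"] for x, g in groups.items())
    sxy = sum(x * g["sum_y"] for x, g in groups.items())
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n

    # Residual sum of squares per group, expanded so only sums are needed:
    # sum_i (y_i - yhat)^2 = sum_y_sq - 2 * yhat * sum_y + count * yhat^2
    rss = 0.0
    for x, g in groups.items():
        yhat = intercept + slope * x
        rss += g["sum_y_sq"] - 2 * yhat * g["sum_y"] + g["count"] * yhat * yhat
    return intercept, slope, rss


# Six raw rows collapse to three compressed rows.
rows = [(0, 1.0), (0, 3.0), (1, 2.0), (1, 6.0), (2, 7.0), (2, 5.0)]
groups = compress(rows)
intercept, slope, rss = fit_compressed(groups)
print(intercept, slope, rss)  # 2.0 2.0 12.0
```

The row-level residuals for the same fit are -1, 1, -2, 2, 1, -1, whose squares also sum to 12.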

Panel design matrix example

DBMundlak computes unit/time averages using Ibis aggregations and joins, then compresses the resulting expression. No temp-table SQL is needed.

t = self.table_expr()
unit_avgs = t.group_by(self.unit_col).aggregate(
    **{f"avg_{cov}_unit": t[cov].mean() for cov in self.covariates}
)
design = t.join(unit_avgs, self.unit_col)

if self.time_col is not None:
    time_avgs = t.group_by(self.time_col).aggregate(
        **{f"avg_{cov}_time": t[cov].mean() for cov in self.covariates}
    )
    design = design.join(time_avgs, self.time_col)

self.design_matrix = design.select([...])
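As a reference for what the joined expression should contain, the following plain-Python sketch computes the same unit and time averages on a tiny in-memory panel. It is only an illustration of the expected rows: `group_means` and `mundlak_design` are hypothetical helpers, and the real path builds these columns lazily in Ibis instead of looping over records. The column names follow the `avg_{cov}_unit` / `avg_{cov}_time` pattern used above.

```python
from collections import defaultdict


def group_means(rows, key, cov):
    """Mean of column `cov` within each value of column `key`."""
    totals = defaultdict(lambda: [0.0, 0])
    for r in rows:
        totals[r[key]][0] += r[cov]
        totals[r[key]][1] += 1
    return {k: s / n for k, (s, n) in totals.items()}


def mundlak_design(rows, unit_col, time_col, covariates):
    """Attach unit-level and time-level covariate averages to every row."""
    design = [dict(r) for r in rows]  # copy so the input is left untouched
    for cov in covariates:
        unit_avg = group_means(rows, unit_col, cov)
        time_avg = group_means(rows, time_col, cov)
        for r in design:
            r[f"avg_{cov}_unit"] = unit_avg[r[unit_col]]
            r[f"avg_{cov}_time"] = time_avg[r[time_col]]
    return design


panel = [
    {"unit": "a", "time": 1, "x": 1.0},
    {"unit": "a", "time": 2, "x": 3.0},
    {"unit": "b", "time": 1, "x": 5.0},
    {"unit": "b", "time": 2, "x": 7.0},
]
design = mundlak_design(panel, "unit", "time", ["x"])
print(design[0])
```

A parity test could compare rows like these against the executed `design_matrix` expression on a small fixture table.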

Implemented estimators

Class	Status	Notes
`DBRegression`	Implemented	Compressed WLS; HC1 vcov; cluster bootstrap from a backend-neutral grouped materialization.
`DBDML`	Implemented	Leave-one-out sufficient-stat compression for discrete covariates.
`DBMundlak`	Implemented	Ibis design matrix for unit/time averages; cluster bootstrap path implemented.
`DBDoubleDemeaning`	Implemented	Ibis mean joins + cross join for overall mean; point estimate implemented.
`DBMundlakEventStudy`	Not yet ported	Existing event-study SQL CTE path is more involved; should be next if this direction looks right.
`DB* GLMs`	Not yet ported	Straightforward follow-up: same grouped sufficient-stat pattern, but current implementation leaves existing `Duck*` GLMs alone.
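For the `DBDoubleDemeaning` row, the sketch below shows the standard two-way within transformation that "mean joins + cross join for overall mean" suggests: subtract the unit mean and the time mean, then add back the overall mean. This is an assumption about the intended transformation, not a copy of the implementation, and `double_demean` is a hypothetical helper operating on an in-memory balanced panel.

```python
from collections import defaultdict


def group_means(rows, key, cov):
    """Mean of column `cov` within each value of column `key`."""
    totals = defaultdict(lambda: [0.0, 0])
    for r in rows:
        totals[r[key]][0] += r[cov]
        totals[r[key]][1] += 1
    return {k: s / n for k, (s, n) in totals.items()}


def double_demean(rows, unit_col, time_col, cov):
    """x - unit mean - time mean + overall mean, one value per row."""
    unit_avg = group_means(rows, unit_col, cov)
    time_avg = group_means(rows, time_col, cov)
    # The overall mean is a single scalar, which is why a cross join
    # is enough to attach it to every row in the Ibis version.
    overall = sum(r[cov] for r in rows) / len(rows)
    return [
        r[cov] - unit_avg[r[unit_col]] - time_avg[r[time_col]] + overall
        for r in rows
    ]


panel = [
    {"unit": "a", "time": 1, "x": 1.0},
    {"unit": "a", "time": 2, "x": 3.0},
    {"unit": "b", "time": 1, "x": 5.0},
    {"unit": "b", "time": 2, "x": 9.0},
]
x_dd = double_demean(panel, "unit", "time", "x")
print(x_dd)  # [0.5, -0.5, -0.5, 0.5]
```

On a balanced panel the transformed values sum to zero within every unit and every time period, which makes a convenient sanity check for the Ibis expression.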

Review checklist

Does the naming feel right: DBRegression / DBDML, or should it be IbisRegression?
Should bootstrap stay pandas-after-compression for portability, or should we add optional backend-specific resampling hooks?
Should Duck* eventually subclass/alias DB* when the new path is mature, or stay separate permanently?
Next likely ports: event study and GLMs.

Local validation

$ uv run pytest -q
19 passed, 48 warnings in 19.02s

Generated for review from local branch ibis-backend. Existing untracked notebooks/nytaxi.ipynb was left untouched.