Reference-category coding answers questions relative to an arbitrary omitted level. That is often fine for prediction, but it is awkward when the goal is to read lower-order coefficients as population or sample-average effects in the presence of categorical modifiers.
The abundance-based constraints (ABC) construction of Kowal, "Facilitating heterogeneous effect estimation via statistically efficient categorical modifiers," JASA, DOI 10.1080/01621459.2026.2635078, replaces reference-level interpretations with empirical-abundance-weighted interpretations. In OLS, this is a pure reparameterization: the fitted values are identical to those from a full-rank dummy-coded regression, but the displayed coefficients answer different questions.
This page documents the crabbymetrics OLS implementation, cm.ABCOLS().
2 Setup
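Let \(y_i \in \mathbb R\) be the response, \(x_i \in \mathbb R^p\) the continuous covariates, and \(c_i = (c_{i1}, \ldots, c_{iK})\) the categorical covariates for observations \(i = 1, \ldots, n\). For a categorical variable \(C_k\) with levels \(\ell = 0, \ldots, L_k - 1\), define empirical level abundances
\[
\hat\pi_{k\ell} = \frac{1}{n}\sum_{i=1}^{n} 1\{c_{ik}=\ell\}.
\]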
ABCOLS expects categorical variables to be supplied as zero-based integer codes in a dense matrix. This keeps the runtime dependency footprint NumPy-only; callers with pandas categoricals should convert levels to stable codes before fitting.
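As a minimal sketch of that conversion, the snippet below turns raw labels into zero-based contiguous codes and computes the empirical abundances with NumPy alone. The label values and variable names are hypothetical, and using `np.unique` is one reasonable choice rather than something the library prescribes.

```python
import numpy as np

# Hypothetical raw labels for one categorical variable.
labels = np.array(["b", "a", "c", "a", "b", "a"])

# np.unique sorts the distinct labels and, with return_inverse=True, also
# returns each observation's index into that sorted list: a zero-based,
# contiguous integer code that is stable for a fixed set of labels.
levels, inverse = np.unique(labels, return_inverse=True)

# Empirical abundances: the share of observations at each level.
abundances = np.bincount(inverse, minlength=len(levels)) / len(inverse)

# Cast to uint32 and stack as one column of a dense 2D code matrix.
codes = inverse.astype(np.uint32)
categories = codes.reshape(-1, 1)

print(levels)
print(codes)
print(abundances)
```

With several categorical variables, the same steps would be repeated per variable and the code columns stacked side by side.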
3 Main-effect constraints
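For categorical main effects, the overcomplete model includes one coefficient for every level:
\[
\sum_{\ell=0}^{L_k-1} 1\{c_{ik}=\ell\}\beta_{k\ell}.
\]
The ABC constraint requires the abundance-weighted level effects to sum to zero, with \(\hat\pi_{k\ell}\) the sample share of observations at level \(\ell\) of \(C_k\):
\[
\sum_{\ell=0}^{L_k-1} \hat\pi_{k\ell}\beta_{k\ell}=0.
\]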
So \(\beta_{k\ell}\) is the deviation of level \(\ell\) from the sample-abundance-weighted average, not the deviation from a reference level. With centered continuous covariates, the intercept is interpretable as the sample-weighted grand mean at the sample mean of \(x\).
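A small numerical illustration may help. The sketch below uses a hypothetical one-way layout (an intercept plus a single categorical variable, with made-up data) and computes the ABC coefficients directly from group means rather than through `cm.ABCOLS()`.

```python
import numpy as np

group = np.array([0, 0, 0, 1, 1, 2])            # zero-based codes, unbalanced
y = np.array([1.0, 2.0, 3.0, 5.0, 7.0, 10.0])   # made-up responses

counts = np.bincount(group)
pi_hat = counts / len(group)                     # empirical abundances
group_means = np.bincount(group, weights=y) / counts

# In an intercept-plus-group model, the ABC intercept is the
# abundance-weighted average of the group means, i.e. the sample mean of y.
intercept = pi_hat @ group_means

# ABC level effects are deviations from that weighted average.
beta = group_means - intercept

print("intercept:", intercept, "sample mean:", y.mean())
print("level effects:", beta)
print("weighted sum of level effects:", pi_hat @ beta)
```

The weighted sum is zero up to floating-point error, and no level plays the role of a reference.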
4 Continuous-by-categorical interactions
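For a continuous variable \(x_j\) and categorical modifier \(C_k\), the overcomplete model adds a slope deviation for every level, written here as \(\gamma_{jk\ell}\):
\[
\sum_{\ell=0}^{L_k-1} x_{ij}1\{c_{ik}=\ell\}\gamma_{jk\ell},
\]
subject to the abundance-weighted constraint
\[
\sum_{\ell=0}^{L_k-1} \hat\pi_{k\ell}\gamma_{jk\ell}=0.
\]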
Thus the continuous main effect remains the empirical-abundance-weighted average of group-specific slopes.
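The step behind this claim can be spelled out. The symbols \(\alpha_j\) (main slope on \(x_j\)) and \(\gamma_{jk\ell}\) (slope deviation for level \(\ell\) of \(C_k\)) are notation assumed here for illustration. The slope within level \(\ell\) is \(\alpha_j + \gamma_{jk\ell}\), the abundances \(\hat\pi_{k\ell}\) sum to one, and the weighted deviations sum to zero, so
\[
\sum_{\ell} \hat\pi_{k\ell}\,(\alpha_j + \gamma_{jk\ell})
= \alpha_j \sum_{\ell} \hat\pi_{k\ell} + \sum_{\ell} \hat\pi_{k\ell}\gamma_{jk\ell}
= \alpha_j .
\]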
5 Categorical-by-categorical interactions
For two categorical variables \(A\) and \(B\), the overcomplete model includes a cell coefficient \(\delta_{ab}\) for every observed cell:
\[
\sum_a\sum_b 1\{A_i=a,B_i=b\}\delta_{ab}.
\]
The ABC constraints impose weighted zero margins:
\[
\sum_b \hat\pi_{ab}\delta_{ab}=0 \quad \text{for each } a,
\]
and
\[
\sum_a \hat\pi_{ab}\delta_{ab}=0 \quad \text{for each } b.
\]
One of these constraints is redundant. ABCOLS uses a deterministic rule: it includes all \(A\)-margin constraints and all but the first \(B\)-margin constraint.
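The redundancy can be checked numerically. The sketch below builds both sets of margin rows for a hypothetical 2-by-3 table of cell counts and compares matrix ranks; the helper `margin_rows` and the counts are illustrative and are not part of the library.

```python
import numpy as np

# Hypothetical cell counts for A (2 levels) by B (3 levels); all cells observed.
counts = np.array([[12, 30, 18],
                   [25, 5, 10]])
pi_hat = counts / counts.sum()


def margin_rows(pi_hat):
    """Return A-margin and B-margin constraint rows over row-major vec(delta)."""
    n_a, n_b = pi_hat.shape
    rows_a = np.zeros((n_a, n_a * n_b))
    rows_b = np.zeros((n_b, n_a * n_b))
    for a in range(n_a):
        for b in range(n_b):
            col = a * n_b + b               # position of delta_{ab}
            rows_a[a, col] = pi_hat[a, b]   # sum over b for fixed a
            rows_b[b, col] = pi_hat[a, b]   # sum over a for fixed b
    return rows_a, rows_b


rows_a, rows_b = margin_rows(pi_hat)
all_rows = np.vstack([rows_a, rows_b])
kept_rows = np.vstack([rows_a, rows_b[1:]])  # drop the first B-margin row

print(all_rows.shape[0], np.linalg.matrix_rank(all_rows))    # 5 4
print(kept_rows.shape[0], np.linalg.matrix_rank(kept_rows))  # 4 4
```

The five rows have rank four because the \(A\)-margin rows and the \(B\)-margin rows both sum to the same vector of cell abundances; dropping any single row removes the dependency.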
6 Null-space implementation
Rather than hand-building weighted contrast columns, ABCOLS uses the constraint/null-space route.
First build an overcomplete design matrix \(X_{\rm full}\) containing:
an intercept;
centered continuous variables;
one indicator for every categorical level;
one column \(x_j1\{C_k=\ell\}\) for every requested continuous-categorical interaction level;
one column \(1\{A=a,B=b\}\) for every requested categorical-categorical interaction cell.
Let \(\theta\) collect all overcomplete coefficients. The ABC restrictions are linear:
\[
A\theta = 0.
\]
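Let \(Q\) be a basis for the null space of \(A\), so \(AQ=0\). Write
\[
\theta = Q\eta,
\]
so that every choice of \(\eta\) yields coefficients that satisfy the constraints. With \(Z = X_{\rm full}Q\), ordinary least squares on the reduced design gives
\[
\hat\eta = (Z^\top Z)^{-1}Z^\top y, \qquad \hat\theta = Q\hat\eta,
\]
and the homoskedastic covariance of the constrained coefficients is
\[
\widehat{\operatorname{Var}}(\hat\theta) = \hat\sigma^2\, Q (Z^\top Z)^{-1} Q^\top .
\]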
The implementation reports coefficient standard errors from the diagonal of this matrix and records the maximum constraint violation \(\|A\hat\theta\|_\infty\) as a numerical check.
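The null-space route can be sketched in plain NumPy for a small model with an intercept, one centered continuous covariate, and one three-level categorical variable. This is an illustration under assumptions, not the library's code: the simulated data are made up, and taking \(Q\) from the SVD of \(A\) is one standard way to get a null-space basis (how `ABCOLS` computes it internally is not stated here).

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300
group = rng.choice(3, size=n, p=[0.5, 0.3, 0.2])
x = rng.normal(size=n)
y = 1.0 + 0.5 * x + np.array([-1.0, 0.0, 2.0])[group] + rng.normal(size=n)

pi_hat = np.bincount(group, minlength=3) / n
x_c = x - x.mean()

# Overcomplete design: intercept, centered x, one indicator per level.
X_full = np.column_stack([np.ones(n), x_c, np.eye(3)[group]])

# Single ABC restriction: sum_l pi_hat[l] * beta_l = 0.
A = np.concatenate([[0.0, 0.0], pi_hat])[None, :]

# Null-space basis: right singular vectors beyond the rank of A.
_, s, vt = np.linalg.svd(A)
rank = int((s > 1e-12).sum())
Q = vt[rank:].T                                  # shape (5, 4), A @ Q == 0

# OLS in the reduced coordinates, then map back to theta.
Z = X_full @ Q
ZtZ_inv = np.linalg.inv(Z.T @ Z)
eta_hat = ZtZ_inv @ Z.T @ y
theta_hat = Q @ eta_hat

# Homoskedastic covariance of theta_hat and its standard errors.
resid = y - Z @ eta_hat
sigma2 = resid @ resid / (n - Z.shape[1])
cov_theta = sigma2 * (Q @ ZtZ_inv @ Q.T)
std_err = np.sqrt(np.diag(cov_theta))

print("theta_hat:", theta_hat.round(4))
print("std_err:", std_err.round(4))
print("max constraint violation:", np.abs(A @ theta_hat).max())
```

Because the continuous covariate is centered and the level effects have weighted mean zero, the first entry of `theta_hat` equals the sample mean of `y` in this sketch.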
7 Example: ABC versus base-category coding
The example below simulates an unbalanced sample with a three-level categorical modifier, group, and a two-level categorical variable, sex. The true data-generating process has group-specific intercept and slope deviations.
The fitted values agree because the two designs span the same column space. The coefficients differ because the parameterizations answer different questions.
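The equivalence can be reproduced without the library. The sketch below is a plain-NumPy stand-in, not the page's original example: the seed, sample size, and true parameter values are hypothetical, and the ABC fit goes through an SVD null-space basis instead of `cm.ABCOLS()`.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 400
group = rng.choice(3, size=n, p=[0.3, 0.45, 0.25])  # unbalanced three-level modifier
sex = rng.integers(0, 2, size=n)                     # two-level categorical
x = rng.normal(size=n)
slopes = np.array([1.5, 0.9, 1.2])                   # hypothetical group slopes
shifts = np.array([-0.8, 0.1, 0.9])                  # hypothetical group shifts
y = 2.0 + shifts[group] + 0.3 * sex + slopes[group] * x + rng.normal(scale=0.5, size=n)

xc = x - x.mean()
G = np.eye(3)[group]                                 # one indicator per group level
S = np.eye(2)[sex]                                   # one indicator per sex level
pi_g, pi_s = G.mean(axis=0), S.mean(axis=0)          # empirical abundances

# Overcomplete ABC design: intercept, x, group, sex, x-by-group (10 columns).
X_full = np.column_stack([np.ones(n), xc, G, S, xc[:, None] * G])

# One abundance-weighted restriction per categorical block.
A = np.zeros((3, 10))
A[0, 2:5] = pi_g       # group main effects
A[1, 5:7] = pi_s       # sex main effects
A[2, 7:10] = pi_g      # x-by-group slope deviations

_, s, vt = np.linalg.svd(A)
Q = vt[int((s > 1e-12).sum()):].T
eta, *_ = np.linalg.lstsq(X_full @ Q, y, rcond=None)
theta_abc = Q @ eta
fitted_abc = X_full @ theta_abc

# Base-category design: drop level 0 of every categorical block (7 columns).
X_base = np.column_stack([np.ones(n), xc, G[:, 1:], S[:, 1:], xc[:, None] * G[:, 1:]])
theta_base, *_ = np.linalg.lstsq(X_base, y, rcond=None)
fitted_base = X_base @ theta_base

print("max fitted-value gap:", np.abs(fitted_abc - fitted_base).max())
print("base-category x slope (group 0):", theta_base[1])
print("ABC x slope:", theta_abc[1])
print("ABC group-0 slope:", theta_abc[1] + theta_abc[7])
```

The base-category slope matches the ABC slope for group 0 (main slope plus the group-0 deviation), while the ABC main slope is the abundance-weighted average across all three groups.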
8 Interpreting the ABC slope
Under ABC coding, the coefficient on x0 is the abundance-weighted average of the group-specific slopes. We can verify this directly.
coef = dict(zip(abc_summary["column_names"], abc_summary["coef"]))
weights = np.bincount(group) / len(group)
group_slopes = np.array([coef["x0"] + coef[f"x0:c0[{level}]"] for level in range(3)])
print("weights:", weights.round(3))
print("group slopes:", group_slopes.round(4))
print("weighted average slope:", float(weights @ group_slopes))
print("ABC main slope:", coef["x0"])
weights: [0.3 0.45 0.25]
group slopes: [1.5582 0.9077 1.1971]
weighted average slope: 1.1751733997826892
ABC main slope: 1.175173399782689
In the base-category fit, the coefficient on x0 is the slope for the omitted group. In the ABC fit, the coefficient on x0 is the empirical-abundance-weighted average slope.
9 Interpreting the ABC categorical effects
The same logic applies to categorical main effects. The ABC group effects have sample-weighted mean zero:
group_effects = np.array([coef[f"c0[{level}]"] for level in range(3)])
print("group effects:", group_effects.round(4))
print("weighted mean group effect:", float(weights @ group_effects))
group effects: [-0.8346 0.0782 0.8608]
weighted mean group effect: -2.7755575615628914e-17
By contrast, base-category coefficients are deviations from group=0. They change if a different base group is selected; the ABC coefficients do not have that arbitrary reference-level dependence.
10 Current implementation contract
cm.ABCOLS() is intentionally low-level and NumPy-first:
model = cm.ABCOLS()
model.fit(
    y,                         # 1D float array
    x,                         # 2D float array of continuous covariates
    categories,                # 2D uint32 array of zero-based categorical codes
    cont_cat_interactions=[],  # list of (continuous_col, categorical_col)
    cat_cat_interactions=[],   # list of (categorical_col_a, categorical_col_b)
    center_continuous=True,
)
Important restrictions in the first implementation:
categorical levels must be contiguous observed integer codes starting at zero;
empty categorical levels raise an error;
requested categorical-categorical interactions must have all cells observed;
inference is homoskedastic OLS in the constrained parameterization;
robust covariance, formula parsing, pandas categorical handling, and sparse empty-cell handling are natural follow-ups.
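Callers may want to check the first three restrictions before fitting. The helper below is a hypothetical pre-flight check written for this page; `check_categories` is not part of crabbymetrics, and the error messages it raises are not the library's.

```python
import numpy as np


def check_categories(categories, cat_cat_interactions=()):
    """Hypothetical check mirroring the documented restrictions.

    Returns the number of levels per categorical column, or raises ValueError.
    """
    categories = np.asarray(categories)
    if categories.ndim != 2:
        raise ValueError("categories must be a 2D array")
    if not np.issubdtype(categories.dtype, np.integer) or (categories < 0).any():
        raise ValueError("categories must hold non-negative integer codes")
    categories = categories.astype(np.int64)

    n_levels = []
    for k in range(categories.shape[1]):
        counts = np.bincount(categories[:, k])
        # A zero count below the maximum code means a gap or an empty level.
        if (counts == 0).any():
            raise ValueError(f"column {k}: codes are not contiguous from zero")
        n_levels.append(len(counts))

    for a, b in cat_cat_interactions:
        cells = np.zeros((n_levels[a], n_levels[b]), dtype=int)
        np.add.at(cells, (categories[:, a], categories[:, b]), 1)
        if (cells == 0).any():
            raise ValueError(f"interaction ({a}, {b}) has unobserved cells")
    return n_levels


cats = np.array([[0, 0], [1, 1], [2, 0], [0, 1]], dtype=np.uint32)
print(check_categories(cats))  # [3, 2]
```

Passing `cat_cat_interactions=[(0, 1)]` with the same `cats` would raise, because the cells (1, 0) and (2, 1) are never observed.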
11 Takeaway
ABC OLS is not a new fitted-value model for unpenalized least squares. It is a disciplined coefficient coordinate system. The constraints make lower-order coefficients stable and interpretable as sample-abundance-weighted averages when categorical modifiers and interactions enter the model.