Reference-category coding answers questions relative to an arbitrary omitted level. That is often fine for prediction, but it is awkward when the goal is to read lower-order coefficients as population or sample-average effects in the presence of categorical modifiers.
The upshot is simple. Standard one-hot encoding sets the baseline to an arbitrary omitted category, so coefficients are deviations from that category. Abundance-based constraints (ABC) set the baseline to the full-sample weighted mean, so coefficients are deviations from common parameters: the grand mean, the average slope, or the average treatment effect, depending on the model. Lin (2013) already used this idea in regression-adjusted experiments by centering the covariate vector before interacting it with treatment. The coefficient on treatment is then the sample-average treatment effect, while the treatment-by-covariate interactions are deviations from that shared effect rather than deviations from an arbitrary covariate origin.
The abundance-based constraints (ABC) construction of Kowal, "Facilitating heterogeneous effect estimation via statistically efficient categorical modifiers," JASA, DOI 10.1080/01621459.2026.2635078, generalizes that centering logic to categorical modifiers and their interactions. It replaces reference-level interpretations with empirical-abundance-weighted interpretations. In OLS, this is a pure reparameterization: fitted values are the same as those from a full-rank ordinary dummy-coded regression, but the displayed coefficients answer different questions.
This page documents the crabbymetrics OLS implementation, cm.ABCOLS().
2 Setup
Let \(y_i \in \mathbb R\) be the outcome, \(x_i \in \mathbb R^p\) the continuous covariates, and \(c_i = (c_{i1}, \ldots, c_{iK})\) the categorical covariates. For a categorical variable \(C_k\) with levels \(\ell = 0, \ldots, L_k - 1\), define the empirical level abundances
\[
\hat\pi_{k\ell} = \frac{1}{n}\sum_{i=1}^n 1\{c_{ik}=\ell\}.
\]
ABCOLS expects categorical variables to be supplied as zero-based integer codes in a dense matrix. This keeps the runtime dependency footprint NumPy-only; callers with pandas categoricals should convert levels to stable codes before fitting.
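As a small illustration of the encoding described above, the sketch below turns arbitrary labels into zero-based integer codes and computes the empirical abundances. The labels and variable names are made up for illustration, and nothing here is part of crabbymetrics itself.

```python
import numpy as np

# Hypothetical raw labels for one categorical variable.
labels = np.array(["b", "a", "c", "a", "b", "a", "c", "a"])

# np.unique sorts the distinct labels; return_inverse gives each
# observation's position in that sorted list, i.e. a zero-based code.
levels, codes = np.unique(labels, return_inverse=True)
codes = codes.astype(np.uint32)

# Empirical abundance of each level: count divided by sample size.
pi_hat = np.bincount(codes, minlength=len(levels)) / len(codes)

# One column per categorical variable, as a dense 2D matrix of codes.
categories = codes.reshape(-1, 1)

print(levels)   # ['a' 'b' 'c']
print(pi_hat)   # [0.5  0.25 0.25]
```

Sorting the labels makes the mapping from level to code reproducible across runs, which is one way to get the "stable codes" the text asks for.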
3 Main-effect constraints
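For categorical main effects, the overcomplete model includes one coefficient for every level:
\[
\sum_{\ell=0}^{L_k-1} 1\{c_{ik}=\ell\}\beta_{k\ell}.
\]
The ABC constraint requires these coefficients to average to zero under the empirical abundances \(\hat\pi_{k\ell}\) of the levels:
\[
\sum_{\ell=0}^{L_k-1} \hat\pi_{k\ell}\beta_{k\ell}=0.
\]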
So \(\beta_{k\ell}\) is the deviation of level \(\ell\) from the sample-abundance-weighted average, not the deviation from a reference level. With centered continuous covariates, the intercept is interpretable as the sample-weighted grand mean at the sample mean of \(x\).
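To make the interpretation concrete, here is a minimal sketch for the simplest case: a single categorical variable and no continuous covariates. In that case least squares fits the group means, so the ABC coefficients can be read off directly. The simulated data and all variable names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
codes = rng.choice(3, size=200, p=[0.30, 0.45, 0.25])
y = np.array([0.0, 1.0, 2.0])[codes] + rng.normal(size=200)

pi_hat = np.bincount(codes, minlength=3) / len(codes)
group_means = np.array([y[codes == level].mean() for level in range(3)])

# The ABC constraint sum_l pi_l * beta_l = 0 pins the intercept at the
# abundance-weighted average of the group means, which is the overall mean.
alpha = pi_hat @ group_means
beta = group_means - alpha

# Reference coding instead measures every level against level 0.
beta_ref = group_means - group_means[0]

print("ABC intercept vs overall mean:", alpha, y.mean())
print("weighted mean of ABC effects:", pi_hat @ beta)
print("reference-coded effects:", beta_ref)
```

The ABC effects would be unchanged if the levels were relabeled, whereas `beta_ref` depends on which level happens to be coded as zero.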
4 Continuous-by-categorical interactions
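For a continuous variable \(x_j\) and categorical modifier \(C_k\), the overcomplete model adds one slope deviation \(\xi_{jk\ell}\) per level,
\[
\sum_{\ell} x_{ij}1\{c_{ik}=\ell\}\xi_{jk\ell},
\]
subject to the abundance-weighted constraint
\[
\sum_{\ell} \hat\pi_{k\ell}\xi_{jk\ell}=0.
\]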
Thus the continuous main effect remains the empirical-abundance-weighted average of group-specific slopes.
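The sketch below illustrates this for a model with group-specific intercepts and slopes, where least squares reduces to a separate simple regression within each group. The true slopes and all names are hypothetical; the point is only how the ABC main slope and deviations relate to the per-group slopes.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300
codes = rng.choice(3, size=n, p=[0.30, 0.45, 0.25])
x = rng.normal(size=n)
true_slopes = np.array([1.5, 0.9, 1.2])   # hypothetical values
y = true_slopes[codes] * x + rng.normal(scale=0.5, size=n)

pi_hat = np.bincount(codes, minlength=3) / n

# np.polyfit(..., 1) returns [slope, intercept]; keep the slope per group.
slopes = np.array(
    [np.polyfit(x[codes == level], y[codes == level], 1)[0] for level in range(3)]
)

abc_main_slope = pi_hat @ slopes           # abundance-weighted average slope
abc_deviations = slopes - abc_main_slope   # satisfy sum_l pi_l * dev_l = 0
ref_main_slope = slopes[0]                 # reference coding: slope of level 0

print("ABC main slope:", abc_main_slope)
print("weighted mean of deviations:", pi_hat @ abc_deviations)
print("reference-coded main slope:", ref_main_slope)
```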
5 Categorical-by-categorical interactions
For two categorical variables \(A\) and \(B\), the overcomplete model includes a cell coefficient \(\delta_{ab}\) for every observed cell:
\[
\sum_a\sum_b 1\{A_i=a,B_i=b\}\delta_{ab}.
\]
The ABC constraints impose weighted zero margins:
\[
\sum_b \hat\pi_{ab}\delta_{ab}=0 \quad \text{for each } a,
\]
and
\[
\sum_a \hat\pi_{ab}\delta_{ab}=0 \quad \text{for each } b.
\]
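One of these constraints is redundant, because the \(A\)-margin constraints and the \(B\)-margin constraints each sum to the same total \(\sum_a\sum_b \hat\pi_{ab}\delta_{ab}\). ABCOLS uses a deterministic rule: it includes all \(A\)-margin constraints and all but the first \(B\)-margin constraint.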
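The redundancy can be checked numerically. The sketch below builds the margin constraint rows for a hypothetical 3-by-2 table of cell abundances and compares the rank of the full set with the rank of the reduced set. The helper `margin_rows` is illustrative and not a crabbymetrics function.

```python
import numpy as np

# Hypothetical cell abundances pi_ab for a 3 x 2 table.
pi = np.outer([0.30, 0.45, 0.25], [0.50, 0.50])

def margin_rows(pi):
    """Constraint rows acting on delta flattened in row-major (a, b) order."""
    La, Lb = pi.shape
    a_rows = np.zeros((La, La * Lb))
    b_rows = np.zeros((Lb, La * Lb))
    for a in range(La):
        for b in range(Lb):
            a_rows[a, a * Lb + b] = pi[a, b]   # sum over b, for fixed a
            b_rows[b, a * Lb + b] = pi[a, b]   # sum over a, for fixed b
    return a_rows, b_rows

a_rows, b_rows = margin_rows(pi)
all_rows = np.vstack([a_rows, b_rows])        # 5 rows
kept_rows = np.vstack([a_rows, b_rows[1:]])   # drop the first B-margin row

print(np.linalg.matrix_rank(all_rows), np.linalg.matrix_rank(kept_rows))  # 4 4
```

Both sets have rank 4, so dropping the first \(B\)-margin row loses no information.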
6 Null-space implementation
Rather than hand-building weighted contrast columns, ABCOLS uses the constraint/null-space route.
First build an overcomplete design matrix \(X_{\rm full}\) containing:
- an intercept;
- centered continuous variables;
- one indicator for every categorical level;
- one column \(x_j1\{C_k=\ell\}\) for every requested continuous-categorical interaction level;
- one column \(1\{A=a,B=b\}\) for every requested categorical-categorical interaction cell.
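The list above can be sketched in NumPy for one continuous variable, a three-level factor, and a two-level factor. The function names `one_hot` and `build_x_full`, the hard-coded level counts, and the simulated inputs are all assumptions for illustration; the package may lay out its columns differently.

```python
import numpy as np

def one_hot(codes, n_levels):
    """Indicator matrix with one column per level (no level dropped)."""
    return (codes[:, None] == np.arange(n_levels)).astype(float)

def build_x_full(x0, group, sex):
    """Overcomplete design: intercept, centered x0, group and sex indicators,
    x0-by-group columns, and group-by-sex cell indicators."""
    n = len(x0)
    xc = x0 - x0.mean()
    G = one_hot(group, 3)
    S = one_hot(sex, 2)
    cells = one_hot(group * 2 + sex, 6)   # cell (a, b) -> column a * 2 + b
    return np.column_stack([np.ones(n), xc, G, S, xc[:, None] * G, cells])

rng = np.random.default_rng(0)
group = rng.choice(3, size=400, p=[0.30, 0.45, 0.25])
sex = rng.integers(0, 2, size=400)
x0 = rng.normal(size=400)

X_full = build_x_full(x0, group, sex)
print(X_full.shape, np.linalg.matrix_rank(X_full))
```

The matrix has 16 columns but only rank 9, which is why the coefficients are not identified until the linear restrictions are imposed.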
Let \(\theta\) collect all overcomplete coefficients. The ABC restrictions are linear:
\[
A\theta = 0.
\]
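Let \(Q\) be a basis for the null space of \(A\), so \(AQ=0\). Write \(\theta = Q\phi\) and \(Z = X_{\rm full}Q\), so the constrained problem becomes an unconstrained regression of \(y\) on \(Z\):
\[
\hat\phi = (Z^\top Z)^{-1}Z^\top y, \qquad \hat\theta = Q\hat\phi.
\]
Under homoskedastic errors with variance estimate \(\hat\sigma^2\), the covariance matrix of the overcomplete coefficients is
\[
\widehat{\operatorname{Var}}(\hat\theta) = \hat\sigma^2\, Q(Z^\top Z)^{-1}Q^\top.
\]
The implementation reports coefficient standard errors from the diagonal of this matrix and records the maximum constraint violation \(\|A\hat\theta\|_\infty\) as a numerical check.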
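A minimal sketch of the null-space route follows. It takes the null-space basis from the SVD of \(A\) and uses \(n - \dim(\phi)\) residual degrees of freedom. Both choices are assumptions made for this illustration and need not match the package internals, and the function names are hypothetical.

```python
import numpy as np

def null_space_basis(A, tol=1e-10):
    """Orthonormal basis for null(A), taken from the SVD of A."""
    _, s, vt = np.linalg.svd(A, full_matrices=True)
    rank = int((s > tol * s.max()).sum())
    return vt[rank:].T

def constrained_ols(X_full, A, y):
    """Least squares over theta = Q phi, so A theta = 0 holds by construction."""
    Q = null_space_basis(A)
    Z = X_full @ Q
    ZtZ_inv = np.linalg.inv(Z.T @ Z)
    phi_hat = ZtZ_inv @ Z.T @ y
    theta_hat = Q @ phi_hat
    resid = y - Z @ phi_hat
    sigma2 = resid @ resid / (len(y) - Z.shape[1])   # assumed df correction
    cov_theta = sigma2 * Q @ ZtZ_inv @ Q.T
    se = np.sqrt(np.clip(np.diag(cov_theta), 0.0, None))
    violation = np.abs(A @ theta_hat).max()
    return theta_hat, se, violation

# Usage on a one-way layout: intercept plus three level indicators.
rng = np.random.default_rng(0)
codes = rng.choice(3, size=200, p=[0.30, 0.45, 0.25])
y = np.array([0.0, 1.0, 2.0])[codes] + rng.normal(size=200)
indicators = (codes[:, None] == np.arange(3)).astype(float)
X_full = np.column_stack([np.ones(200), indicators])
A = np.concatenate([[0.0], indicators.mean(axis=0)])[None, :]

theta_hat, se, violation = constrained_ols(X_full, A, y)
print("intercept vs overall mean:", theta_hat[0], y.mean())
print("max constraint violation:", violation)
```

In this one-way case the constrained intercept coincides with the overall mean of `y`, which gives a quick sanity check on the solver.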
7 Example: ABC versus base-category coding
The example below simulates an unbalanced sample with a three-level categorical modifier, group, and a two-level categorical variable, sex. The true data-generating process has group-specific intercept and slope deviations.
Here \(\gamma_g\) are the raw group intercept deviations, \(\eta_g\) are the raw x0:group slope deviations, and \(\delta_s\) is the raw sex main effect. Since ABC reports abundance-centered effects, the population coefficients in the ABC parameterization subtract the sample-abundance average from each raw effect:
\[
g_a = \gamma_a - \sum_g \hat\pi_g\gamma_g, \qquad h_a = \eta_a - \sum_g \hat\pi_g\eta_g, \qquad s_b = \delta_b - \sum_s \hat\pi_s\delta_s,
\]
with \(\hat\pi_g=(0.30,0.45,0.25)\) and \(\hat\pi_s=(0.50,0.50)\). The overcomplete coefficient vector is
\[
\theta = (\alpha, \beta, g_0, g_1, g_2, s_0, s_1, h_0, h_1, h_2, d_{00}, d_{01}, d_{10}, d_{11}, d_{20}, d_{21})^\top.
\]
Here \(\alpha\) is the intercept, \(\beta\) is the main slope on centered x0, \(g_a\) is the group=a main effect, \(s_b\) is the sex=b main effect, \(h_a\) is the x0:group=a slope deviation, and \(d_{ab}\) is the group=a:sex=b interaction cell coefficient.
Thus the concrete constraint system \(A\theta=0\) is
\[
0.30g_0 + 0.45g_1 + 0.25g_2 = 0,
\]
\[
0.50s_0 + 0.50s_1 = 0,
\]
\[
0.30h_0 + 0.45h_1 + 0.25h_2 = 0,
\]
and, for the group:sex interaction,
\[
0.150d_{00}+0.150d_{01}=0,
\]
\[
0.225d_{10}+0.225d_{11}=0,
\]
\[
0.125d_{20}+0.125d_{21}=0,
\]
\[
0.150d_{01}+0.225d_{11}+0.125d_{21}=0.
\]
There are 16 overparameterized coefficients and 7 independent constraints, so the constrained coefficient space is 9-dimensional. One especially readable null-space parameterization writes \(\theta = Q_{\rm hand}\phi\) with a hand-built \(16 \times 9\) basis \(Q_{\rm hand}\).
Equivalently, the columns of \(Q_{\rm hand}\) are: the intercept direction, the main-slope direction, two weighted group-effect contrast directions, one sex-effect contrast direction, two weighted group-slope contrast directions, and two weighted interaction contrast directions. The package uses a numerical orthonormal basis for the same span; the coordinate system for \(\phi\) can rotate, but the implied constrained \(\hat\theta\) and fitted values are invariant.
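The dimension count can be verified by writing the seven constraints as rows of a matrix. The sketch below assumes the coefficient ordering intercept, main slope, group effects, sex effects, slope deviations, interaction cells. The short names are illustrative labels, not package column names.

```python
import numpy as np

names = ["alpha", "beta", "g0", "g1", "g2", "s0", "s1",
         "h0", "h1", "h2", "d00", "d01", "d10", "d11", "d20", "d21"]
idx = {name: i for i, name in enumerate(names)}

# One dictionary per constraint row, mapping coefficient name to weight.
constraints = [
    {"g0": 0.30, "g1": 0.45, "g2": 0.25},          # group main effects
    {"s0": 0.50, "s1": 0.50},                      # sex main effects
    {"h0": 0.30, "h1": 0.45, "h2": 0.25},          # x0:group slope deviations
    {"d00": 0.150, "d01": 0.150},                  # group=0 margin
    {"d10": 0.225, "d11": 0.225},                  # group=1 margin
    {"d20": 0.125, "d21": 0.125},                  # group=2 margin
    {"d01": 0.150, "d11": 0.225, "d21": 0.125},    # sex=1 margin
]

A = np.zeros((len(constraints), len(names)))
for row, weights in enumerate(constraints):
    for name, w in weights.items():
        A[row, idx[name]] = w

rank = np.linalg.matrix_rank(A)
print(A.shape, rank, len(names) - rank)   # (7, 16) 7 9
```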
7.2 Raw NumPy constrained solve
Before using crabbymetrics, we can solve the overparameterized constrained least-squares problem directly in NumPy. First construct \(X_{\rm full}\), \(A\), and the hand-written null-space basis \(Q_{\rm hand}\).
This is the ABC solution in raw NumPy. Notice that \(Q_{\rm hand}\) is not orthonormal. It does not need to be: any full-rank basis for \(\operatorname{null}(A)\) gives the same constrained fitted values.
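The invariance claim can be checked with a short sketch. The data below are simulated with hypothetical coefficients, not the ones used on this page, and the hand-written basis is one possible choice rather than the page's \(Q_{\rm hand}\). The product form of the interaction contrasts relies on the cell shares factoring exactly, which holds in this deterministic layout.

```python
import numpy as np

# Deterministic layout: group shares 0.30/0.45/0.25, sex shares 0.50/0.50.
counts = np.array([60, 60, 90, 90, 50, 50])
group = np.repeat([0, 0, 1, 1, 2, 2], counts)
sex = np.repeat([0, 1, 0, 1, 0, 1], counts)
n = len(group)
rng = np.random.default_rng(0)
x0 = rng.normal(size=n)
y = (1.0 + np.array([-0.8, 0.1, 0.9])[group] + 0.3 * sex
     + np.array([1.5, 0.9, 1.2])[group] * x0 + rng.normal(size=n))

xc = x0 - x0.mean()
G = (group[:, None] == np.arange(3)).astype(float)
S = (sex[:, None] == np.arange(2)).astype(float)
C = ((group * 2 + sex)[:, None] == np.arange(6)).astype(float)
X_full = np.column_stack([np.ones(n), xc, G, S, xc[:, None] * G, C])

pg, ps, pc = G.mean(axis=0), S.mean(axis=0), C.mean(axis=0)
A = np.zeros((7, 16))
A[0, 2:5], A[1, 5:7], A[2, 7:10] = pg, ps, pg
for a in range(3):
    A[3 + a, 10 + 2 * a: 12 + 2 * a] = pc[2 * a: 2 * a + 2]   # group=a margin
A[6, [11, 13, 15]] = pc[[1, 3, 5]]                            # sex=1 margin

# Hand-written, non-orthonormal basis built from weighted contrasts.
u1 = np.array([1.0, 0.0, -pg[0] / pg[2]])
u2 = np.array([0.0, 1.0, -pg[1] / pg[2]])
v = np.array([1.0, -ps[0] / ps[1]])
Q_hand = np.zeros((16, 9))
Q_hand[0, 0] = Q_hand[1, 1] = 1.0
Q_hand[2:5, 2], Q_hand[2:5, 3] = u1, u2
Q_hand[5:7, 4] = v
Q_hand[7:10, 5], Q_hand[7:10, 6] = u1, u2
Q_hand[10:16, 7] = np.outer(u1, v).ravel()
Q_hand[10:16, 8] = np.outer(u2, v).ravel()

# Orthonormal basis from the SVD: the last 9 right singular vectors of A.
Q_svd = np.linalg.svd(A)[2][7:].T

def solve(Q):
    phi = np.linalg.lstsq(X_full @ Q, y, rcond=None)[0]
    return Q @ phi

theta_hand, theta_svd = solve(Q_hand), solve(Q_svd)
print("max |A Q_hand|:", np.abs(A @ Q_hand).max())
print("max |theta_hand - theta_svd|:", np.abs(theta_hand - theta_svd).max())
```

The coordinates \(\phi\) differ between the two bases, but the implied \(\hat\theta\) agrees up to floating-point error.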
7.3 The crabbymetrics solution
Now fit ABCOLS with the same continuous-by-categorical and categorical-by-categorical interactions:
max |theta_np - theta_crabby|: 9.2148511043888e-15
max |yhat_np - yhat_crabby|: 1.2156942119645464e-14
max constraint violation: 6.245004513516506e-17
The package is doing the same constrained least-squares operation. The only implementation difference is that ABCOLS computes a numerical orthonormal null-space basis internally rather than using the hand-written \(Q_{\rm hand}\) above.
Now fit the same linear span using ordinary base-category coding. This drops group=0, sex=0, and the corresponding interaction baseline columns.
max fitted-value difference: 3.9968028886505635e-15
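A sketch of the same comparison on simulated data: the reference-coded design is obtained by dropping baseline columns from the 16-column overcomplete design, and both designs give the same fitted values. The data, the outcome model, and the column ordering are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 400
group = rng.choice(3, size=n, p=[0.30, 0.45, 0.25])
sex = rng.integers(0, 2, size=n)
x0 = rng.normal(size=n)
y = rng.normal(size=n) + x0 + group   # hypothetical outcome

xc = x0 - x0.mean()
G = (group[:, None] == np.arange(3)).astype(float)
S = (sex[:, None] == np.arange(2)).astype(float)
C = ((group * 2 + sex)[:, None] == np.arange(6)).astype(float)
X_full = np.column_stack([np.ones(n), xc, G, S, xc[:, None] * G, C])

# Reference coding: drop group=0, sex=0, x0:group=0, and every interaction
# cell that involves a baseline level. Nine columns remain.
keep = [0, 1, 3, 4, 6, 8, 9, 13, 15]
X_ref = X_full[:, keep]

# lstsq returns a minimum-norm solution for the rank-deficient X_full;
# its fitted values are still the projection of y onto the column space.
fit_full = X_full @ np.linalg.lstsq(X_full, y, rcond=None)[0]
fit_ref = X_ref @ np.linalg.lstsq(X_ref, y, rcond=None)[0]

print("rank of X_ref:", np.linalg.matrix_rank(X_ref))
print("max fitted-value difference:", np.abs(fit_full - fit_ref).max())
```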
| ABC overcomplete coefficient | ABCOLS estimate |
|---|---|
| Intercept | 1.5120 |
| x0 | 1.1752 |
| g0 | -0.8346 |
| g1 | 0.0782 |
| g2 | 0.8608 |
| s0 | -0.1750 |
| s1 | 0.1750 |
| x0:g0 | 0.3830 |
| x0:g1 | -0.2675 |
| x0:g2 | 0.0219 |
| g0:s0 | 0.0055 |
| g0:s1 | -0.0055 |
| g1:s0 | -0.0021 |
| g1:s1 | 0.0021 |
| g2:s0 | -0.0027 |
| g2:s1 | 0.0027 |
| Vanilla one-hot / reference-coded coefficient | Vanilla OLS | Same coefficient implied by ABC | Difference |
|---|---|---|---|
| Intercept | 0.5078 | 0.5078 | -1.11e-15 |
| x0 | 1.5582 | 1.5582 | -4.44e-16 |
| group[1] | 0.9052 | 0.9052 | 2.22e-16 |
| group[2] | 1.6873 | 1.6873 | 8.88e-16 |
| sex[1] | 0.3391 | 0.3391 | -1.22e-15 |
| x0:group[1] | -0.6505 | -0.6505 | 8.88e-16 |
| x0:group[2] | -0.3611 | -0.3611 | -1.11e-16 |
| group[1]:sex[1] | 0.0152 | 0.0152 | -8.69e-16 |
| group[2]:sex[1] | 0.0163 | 0.0163 | -1.22e-15 |
The first table is the ABCOLS coefficient table in the overcomplete ABC coordinates. The second table translates those ABC coefficients back into the ordinary one-hot/reference-coded coordinates and compares them to a vanilla OLS fit. The fitted values agree because the two designs span the same column space. The coefficients differ because the parameterizations answer different questions, but the translated coefficients agree up to numerical precision.
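The translation can be reproduced by hand from the rounded ABC estimates reported above. The sketch assumes x0 is centered in both fits, so that each reference-coded coefficient is a difference of ABC cell predictions at centered x0 equal to zero. Because the inputs are rounded to four decimals, agreement with the reference-coded column is only to about three decimals.

```python
# Rounded ABCOLS estimates copied from the first table.
abc = {
    "Intercept": 1.5120, "x0": 1.1752,
    "g0": -0.8346, "g1": 0.0782, "g2": 0.8608,
    "s0": -0.1750, "s1": 0.1750,
    "x0:g0": 0.3830, "x0:g1": -0.2675, "x0:g2": 0.0219,
    "g0:s0": 0.0055, "g0:s1": -0.0055,
    "g1:s0": -0.0021, "g1:s1": 0.0021,
    "g2:s0": -0.0027, "g2:s1": 0.0027,
}

def cell(a, b):
    """Predicted level for group a, sex b at centered x0 = 0."""
    return abc["Intercept"] + abc[f"g{a}"] + abc[f"s{b}"] + abc[f"g{a}:s{b}"]

ref = {
    "Intercept": cell(0, 0),                    # baseline cell
    "x0": abc["x0"] + abc["x0:g0"],             # slope in the omitted group
    "sex[1]": cell(0, 1) - cell(0, 0),
}
for a in (1, 2):
    ref[f"group[{a}]"] = cell(a, 0) - cell(0, 0)
    ref[f"x0:group[{a}]"] = abc[f"x0:g{a}"] - abc["x0:g0"]
    # Difference-in-differences of cell predictions.
    ref[f"group[{a}]:sex[1]"] = cell(a, 1) - cell(a, 0) - cell(0, 1) + cell(0, 0)

for name, value in ref.items():
    print(f"{name:>16s} {value: .4f}")
```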
8 Interpreting the ABC slope
Under ABC coding, the coefficient on x0 is the abundance-weighted average of group-specific slopes. We can verify this directly.
```python
# Uses abc_summary, group, and np from the ABCOLS fit above.
coef = dict(zip(abc_summary["column_names"], abc_summary["coef"]))
weights = np.bincount(group) / len(group)
group_slopes = np.array([coef["x0"] + coef[f"x0:c0[{level}]"] for level in range(3)])
print("weights:", weights.round(3))
print("group slopes:", group_slopes.round(4))
print("weighted average slope:", float(weights @ group_slopes))
print("ABC main slope:", coef["x0"])
```
weights: [0.3 0.45 0.25]
group slopes: [1.5582 0.9077 1.1971]
weighted average slope: 1.1751733997826892
ABC main slope: 1.175173399782689
In the base-category fit, the coefficient on x0 is the slope for the omitted group. In the ABC fit, the coefficient on x0 is the empirical-abundance-weighted average slope.
9 Interpreting the ABC categorical effects
The same logic applies to categorical main effects. The ABC group effects have sample-weighted mean zero:
```python
# Uses coef and weights from the previous code cell.
group_effects = np.array([coef[f"c0[{level}]"] for level in range(3)])
print("group effects:", group_effects.round(4))
print("weighted mean group effect:", float(weights @ group_effects))
```
group effects: [-0.8346 0.0782 0.8608]
weighted mean group effect: -2.7755575615628914e-17
By contrast, base-category coefficients are deviations from group=0. They change if a different base group is selected; the ABC coefficients do not have that arbitrary reference-level dependence.
10 Current implementation contract
cm.ABCOLS() is intentionally low-level and NumPy-first:
```python
model = cm.ABCOLS()
model.fit(
    y,            # 1D float array
    x,            # 2D float array of continuous covariates
    categories,   # 2D uint32 array of zero-based categorical codes
    cont_cat_interactions=[],  # list of (continuous_col, categorical_col)
    cat_cat_interactions=[],   # list of (categorical_col_a, categorical_col_b)
    center_continuous=True,
)
```
Important restrictions in the first implementation:
- categorical levels must be contiguous observed integer codes starting at zero;
- empty categorical levels raise an error;
- requested categorical-categorical interactions must have all cells observed;
- inference is homoskedastic OLS in the constrained parameterization;
- robust covariance, formula parsing, pandas categorical handling, and sparse empty-cell handling are natural follow-ups.
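Callers may want to check the first three restrictions before fitting. The helper below is a hypothetical pre-flight check written for this page; `check_codes` is not part of crabbymetrics, and the package's own error messages may differ.

```python
import numpy as np

def check_codes(categories, cat_cat_interactions=()):
    """Validate zero-based codes; return the number of levels per column."""
    categories = np.asarray(categories)
    if categories.ndim != 2:
        raise ValueError("categories must be a 2D array")
    n_levels = []
    for k in range(categories.shape[1]):
        # bincount covers 0..max, so any zero count is a gap or empty level.
        counts = np.bincount(categories[:, k].astype(np.int64))
        if (counts == 0).any():
            raise ValueError(f"column {k}: codes are not contiguous from zero")
        n_levels.append(len(counts))
    for a, b in cat_cat_interactions:
        cell = categories[:, a].astype(np.int64) * n_levels[b] + categories[:, b]
        cell_counts = np.bincount(cell, minlength=n_levels[a] * n_levels[b])
        if (cell_counts == 0).any():
            raise ValueError(f"interaction ({a}, {b}) has unobserved cells")
    return n_levels

# Usage: a 3-level and a 2-level factor with every cell observed.
categories = np.array(
    [[0, 0], [1, 1], [2, 0], [0, 1], [1, 0], [2, 1]], dtype=np.uint32
)
print(check_codes(categories, cat_cat_interactions=[(0, 1)]))   # [3, 2]
```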
11 Takeaway
ABC OLS is not a new fitted-value model for unpenalized least squares. It is a disciplined coefficient coordinate system. The constraints make lower-order coefficients stable and interpretable as sample-abundance-weighted averages when categorical modifiers and interactions enter the model.