Generalized Linear Models
The GLM API extends the compression idea to canonical-link likelihoods. Logistic, Poisson, and multinomial models are fitted from grouped sufficient statistics using Fisher scoring.
Fisher Scoring
All canonical GLM paths use the update
\[ \beta^{(t+1)} = \beta^{(t)} + I(\beta^{(t)})^{-1} U(\beta^{(t)}), \]
where \(U(\beta)\) is the score and \(I(\beta)\) is expected Fisher information.
The implementation solves
Show code
step = np.linalg.solve(info, score)
beta_new = beta + stepwith a small ridge fallback if the solve fails.
Binary Logit
For each compressed covariate cell \(g\), let \(s_g\) be successes and \(n_g\) be total rows. With
\[ p_g = \Lambda(x_g'\beta), \]
the grouped log likelihood is
\[ \ell(\beta) = \sum_g \left[ s_g x_g'\beta - n_g\log\{1+\exp(x_g'\beta)\} \right]. \]
The score and information are
\[ U(\beta) = \sum_g x_g(s_g - n_g p_g), \]
\[ I(\beta) = \sum_g n_g p_g(1-p_g)x_gx_g'. \]
Usage from notebooks/glm.ipynb:
Show code
from duckreg.estimators import DuckLogisticRegression
logit = DuckLogisticRegression(
db_path,
"binary_data",
"y ~ x1 + x2",
seed=1,
method="irls",
n_bootstraps=0,
)
logit.fit()
logit.fit_vcov()
logit.summary()Poisson Regression
For Poisson,
\[ \mu_g = \exp(x_g'\beta). \]
The grouped log likelihood is
\[ \ell(\beta) = \sum_g \left[ y_g^+x_g'\beta - n_g\exp(x_g'\beta) \right]. \]
The score and information are
\[ U(\beta) = \sum_g x_g(y_g^+ - n_g\mu_g), \]
\[ I(\beta) = \sum_g n_g\mu_g x_gx_g'. \]
Show code
from duckreg.estimators import DuckPoissonRegression
pois = DuckPoissonRegression(
db_path,
"count_data",
"y ~ x1 + x2",
seed=1,
method="irls",
n_bootstraps=0,
)
pois.fit()
pois.fit_vcov()
pois.summary()Exact IRLS vs One-Step
The canonical binary and Poisson classes accept:
Show code
method="irls"
method="one_step"method="irls" iterates Fisher scoring on the compressed full data until the maximum absolute step is below tolerance.
method="one_step" fits a pilot model on a reservoir sample, then takes one full-data Fisher step:
\[ \hat{\beta}_{\mathrm{one}} = \hat{\beta}_0 + I_N(\hat{\beta}_0)^{-1}U_N(\hat{\beta}_0). \]
This is faster for very large tables, but it is approximate relative to the full compressed IRLS solution.
Multinomial Logit
DuckMultinomialLogisticRegression fits a baseline-category softmax for moderate numbers of labels. If label \(K\) is the baseline, the estimated labels are \(1,\ldots,K-1\) and
\[ \pi_{gk} = \frac{\exp(x_g'\beta_k)} {1+\sum_{j=1}^{K-1}\exp(x_g'\beta_j)}. \]
The score block for label \(k\) is
\[ U_k(\beta) = \sum_g x_g(c_{gk} - n_g\pi_{gk}), \]
and the Fisher information block for labels \(k,l\) is
\[ I_{kl}(\beta) = \sum_g n_g\pi_{gk} \{1[k=l]-\pi_{gl}\} x_gx_g'. \]
The computational cost is the block information matrix of size \(((K-1)p)\times((K-1)p)\).
Show code
from duckreg.estimators import DuckMultinomialLogisticRegression
mn = DuckMultinomialLogisticRegression(
db_path,
"class_data",
"y ~ x1 + x2",
seed=1,
labels=["a", "b", "c"],
baseline="c",
n_bootstraps=0,
)
mn.fit()
mn.fit_vcov()
mn.summary()Many-Label Poisson Decomposition
DuckPoissonMultinomialRegression is for many count labels. It fits one Poisson regression per label:
\[ \log \mu_{ij} = x_i'\gamma_j. \]
This replaces one coupled softmax Hessian with \(K\) independent \(p \times p\) Poisson systems.
Show code
from duckreg.estimators import DuckPoissonMultinomialRegression
pm = DuckPoissonMultinomialRegression(
db_path,
"token_counts",
count_col="count",
label_col="label",
covars=["x1", "x2"],
seed=1,
n_bootstraps=0,
)
pm.fit()
pm.summary()The Poisson decomposition is scalable and naturally parallelizable across labels. In the current code it is a separable count model, not an exact joint multinomial MLE with unit fixed effects or conditioning on document totals.
Current GLM Inference Caveat
fit_vcov() is implemented for canonical binary, Poisson, and moderate-\(K\) multinomial models. The GLM bootstrap methods currently return empty arrays, so do not rely on n_bootstraps > 0 for GLM standard errors without first implementing real resampling.