Generalized Linear Models

The GLM API extends the compression idea to canonical-link likelihoods. Logistic, Poisson, and multinomial models are fitted from grouped sufficient statistics using Fisher scoring.

Fisher Scoring

All canonical GLM paths use the update

\[ \beta^{(t+1)} = \beta^{(t)} + I(\beta^{(t)})^{-1} U(\beta^{(t)}), \]

where \(U(\beta)\) is the score and \(I(\beta)\) is expected Fisher information.

The implementation solves

Show code

step = np.linalg.solve(info, score)
beta_new = beta + step

with a small ridge fallback if the solve fails.

Binary Logit

For each compressed covariate cell \(g\), let \(s_g\) be successes and \(n_g\) be total rows. With

\[ p_g = \Lambda(x_g'\beta), \]

the grouped log likelihood is

\[ \ell(\beta) = \sum_g \left[ s_g x_g'\beta - n_g\log\{1+\exp(x_g'\beta)\} \right]. \]

The score and information are

\[ U(\beta) = \sum_g x_g(s_g - n_g p_g), \]

\[ I(\beta) = \sum_g n_g p_g(1-p_g)x_gx_g'. \]

Usage from notebooks/glm.ipynb:

Show code

from duckreg.estimators import DuckLogisticRegression

logit = DuckLogisticRegression(
    db_path,
    "binary_data",
    "y ~ x1 + x2",
    seed=1,
    method="irls",
    n_bootstraps=0,
)
logit.fit()
logit.fit_vcov()
logit.summary()

Poisson Regression

For Poisson,

\[ \mu_g = \exp(x_g'\beta). \]

The grouped log likelihood is

\[ \ell(\beta) = \sum_g \left[ y_g^+x_g'\beta - n_g\exp(x_g'\beta) \right]. \]

The score and information are

\[ U(\beta) = \sum_g x_g(y_g^+ - n_g\mu_g), \]

\[ I(\beta) = \sum_g n_g\mu_g x_gx_g'. \]

Show code

from duckreg.estimators import DuckPoissonRegression

pois = DuckPoissonRegression(
    db_path,
    "count_data",
    "y ~ x1 + x2",
    seed=1,
    method="irls",
    n_bootstraps=0,
)
pois.fit()
pois.fit_vcov()
pois.summary()

Exact IRLS vs One-Step

The canonical binary and Poisson classes accept:

Show code

method="irls"
method="one_step"

method="irls" iterates Fisher scoring on the compressed full data until the maximum absolute step is below tolerance.

method="one_step" fits a pilot model on a reservoir sample, then takes one full-data Fisher step:

\[ \hat{\beta}_{\mathrm{one}} = \hat{\beta}_0 + I_N(\hat{\beta}_0)^{-1}U_N(\hat{\beta}_0). \]

This is faster for very large tables, but it is approximate relative to the full compressed IRLS solution.

Multinomial Logit

DuckMultinomialLogisticRegression fits a baseline-category softmax for moderate numbers of labels. If label \(K\) is the baseline, the estimated labels are \(1,\ldots,K-1\) and

\[ \pi_{gk} = \frac{\exp(x_g'\beta_k)} {1+\sum_{j=1}^{K-1}\exp(x_g'\beta_j)}. \]

The score block for label \(k\) is

\[ U_k(\beta) = \sum_g x_g(c_{gk} - n_g\pi_{gk}), \]

and the Fisher information block for labels \(k,l\) is

\[ I_{kl}(\beta) = \sum_g n_g\pi_{gk} \{1[k=l]-\pi_{gl}\} x_gx_g'. \]

The computational cost is the block information matrix of size \(((K-1)p)\times((K-1)p)\).

Show code

from duckreg.estimators import DuckMultinomialLogisticRegression

mn = DuckMultinomialLogisticRegression(
    db_path,
    "class_data",
    "y ~ x1 + x2",
    seed=1,
    labels=["a", "b", "c"],
    baseline="c",
    n_bootstraps=0,
)
mn.fit()
mn.fit_vcov()
mn.summary()

Many-Label Poisson Decomposition

DuckPoissonMultinomialRegression is for many count labels. It fits one Poisson regression per label:

\[ \log \mu_{ij} = x_i'\gamma_j. \]

This replaces one coupled softmax Hessian with \(K\) independent \(p \times p\) Poisson systems.

Show code

from duckreg.estimators import DuckPoissonMultinomialRegression

pm = DuckPoissonMultinomialRegression(
    db_path,
    "token_counts",
    count_col="count",
    label_col="label",
    covars=["x1", "x2"],
    seed=1,
    n_bootstraps=0,
)
pm.fit()
pm.summary()

The Poisson decomposition is scalable and naturally parallelizable across labels. In the current code it is a separable count model, not an exact joint multinomial MLE with unit fixed effects or conditioning on document totals.

Current GLM Inference Caveat

fit_vcov() is implemented for canonical binary, Poisson, and moderate-\(K\) multinomial models. The GLM bootstrap methods currently return empty arrays, so do not rely on n_bootstraps > 0 for GLM standard errors without first implementing real resampling.