Math
Randomized numerical linear algebra is useful when an estimator spends most of its time on a large matrix whose relevant geometry is much smaller than the matrix itself. The central move is to replace a large deterministic linear algebra problem by a smaller random problem that preserves the pieces of geometry the estimator actually uses.
This page separates five families of randomized approximations now exposed in crabbymetrics:
- subspace sketches for low-rank range finding, SVD, QR, matrix completion, and factor models;
- row sketches for tall least squares;
- kernel sketches via Nyström features and random Fourier features;
- moment sketches for many-moment GMM;
- feature compression before balancing or synthetic-control-style weighting.
The important design choice is conservative: exact algorithms remain the defaults for estimator classes. Sketching is opt-in through explicit method arguments or reusable transform classes.
The projection view
Let \(A \in \mathbb{R}^{m \times n}\). Many estimators do not need all of \(A\); they need either:
- the action of \(A\) on a low-dimensional subspace;
- the dominant left/right singular subspaces;
- the Gram matrix \(A'A\) or \(AA'\);
- the solution to a least-squares or moment equation involving \(A\).
A sketch introduces a random linear map and solves a smaller problem. There are two common orientations.
A range sketch multiplies on the right:
\[
Y = A\Omega,
\qquad
\Omega \in \mathbb{R}^{n \times \ell},
\qquad
\ell = k + q,
\]
where \(k\) is the target rank and \(q\) is an oversampling buffer. If \(A\) is exactly or approximately rank \(k\), the columns of \(Y\) span approximately the dominant column space of \(A\). After orthogonalizing,
\[
Y = QR,
\qquad
Q'Q = I_\ell,
\]
we approximate
\[
A \approx QQ'A.
\]
A row sketch multiplies on the left:
\[
\widetilde A = SA,
\qquad
S \in \mathbb{R}^{s \times m},
\qquad
s \ll m.
\]
For regression, this compresses observations. The sketch should be a subspace embedding: for vectors \(v\) in the relevant column span,
\[
(1-\varepsilon)\lVert v \rVert_2^2
\le
\lVert Sv \rVert_2^2
\le
(1+\varepsilon)\lVert v \rVert_2^2.
\]
This condition is stronger than merely preserving pairwise distances for a few rows. It says that the random map approximately preserves the whole low-dimensional residual geometry.
Randomized range finding and SVD
The randomized SVD in crabbymetrics starts from the range sketch above. Given \(A \in \mathbb{R}^{m \times n}\) and a requested rank \(k\):
- draw \(\Omega \in \mathbb{R}^{n \times (k+q)}\);
- form \(Y=A\Omega\);
- optionally perform power iterations;
- orthogonalize \(Y=QR\);
- compute the small SVD of \(B=Q'A\):
\[
B = \widetilde U \Sigma V';
\]
- lift the left singular vectors:
\[
U \approx Q\widetilde U,
\qquad
A \approx U\Sigma V'.
\]
If the singular values of \(A\) decay quickly, a small \(q\) is often enough. If singular values decay slowly, the leading subspace is harder to identify. Power iteration improves separation by replacing the sampled range with roughly
\[
Y = (AA')^r A\Omega.
\]
The singular values are effectively transformed from \(\sigma_j\) to \(\sigma_j^{2r+1}\), so dominant directions stand out more sharply. The tradeoff is additional passes over \(A\).
The useful error heuristic is:
\[
\lVert A - QQ'A \rVert
\approx
\text{tail singular value scale},
\]
not machine precision. Randomized SVD is an approximation to a low-rank target, not a replacement for an exact full-rank factorization.
Randomized QR and approximate least squares
Randomized QR uses the same range basis. If \(Q\) approximately spans the column space of \(A\), then QR can be computed on the compressed coordinates and lifted back. Equivalently, randomized least squares can be viewed as solving on a sketch that preserves the column space of the design.
For a tall design \(X \in \mathbb{R}^{n \times p}\) with \(n \gg p\), OLS is
\[
\hat\beta
= \arg\min_b \lVert y-Xb \rVert_2^2.
\]
A sketched least-squares fit draws \(S \in \mathbb{R}^{s \times n}\) and solves
\[
\hat\beta_S
= \arg\min_b \lVert S(y-Xb) \rVert_2^2.
\]
If \(S\) is a subspace embedding for \(\operatorname{span}([X,y])\), then the sketched objective is uniformly close to the full objective over the relevant parameter region. That is the reason a smaller problem can recover a similar coefficient vector.
CountSketch row embeddings
The row-sketching path uses CountSketch. Each original row \(i\) receives:
- a bucket \(h(i) \in \{1,\ldots,s\}\);
- a sign \(\sigma_i \in \{-1,+1\}\).
Then rows are accumulated as
\[
(SX)_{h(i),\cdot} \mathrel{+}= \sigma_i X_{i,\cdot},
\qquad
(Sy)_{h(i)} \mathrel{+}= \sigma_i y_i.
\]
The matrix \(S\) is never materialized densely. Construction is one pass through the rows, so the cost is proportional to the number of nonzero entries touched. For dense \(X\), that is \(O(np)\) to build the sketch plus the cost of solving the \(s \times p\) compressed least-squares problem.
The sketch size \(s\) controls the bias–variance approximation tradeoff. A very small sketch is fast but noisy; a larger sketch approaches the full estimator.
Matrix completion via randomized singular-value thresholding
Panel matrix completion repeatedly solves a low-rank denoising problem. A stylized singular-value-thresholding update is
\[
M^{(t+1)}
= \mathcal{S}_\lambda(A^{(t)}),
\]
where \(\mathcal{S}_\lambda\) soft-thresholds singular values. If
\[
A^{(t)} = U\operatorname{diag}(\sigma_j)V',
\]
then
\[
\mathcal{S}_\lambda(A^{(t)})
= U\operatorname{diag}((\sigma_j-\lambda)_+)V'.
\]
When the desired panel signal is low rank, most singular values are discarded or heavily shrunk. Computing a full exact SVD in every iteration can therefore waste work. The randomized path computes an approximate rank-\(k\) SVD and applies the same soft-thresholding to the retained singular values.
This is why the MatrixCompletion sketching knobs are explicit:
svd_method="exact" keeps the deterministic path;
svd_method="randomized" uses randomized SVD;
svd_rank controls the retained rank ceiling;
- oversampling and power-iteration controls tune approximation quality.
The approximation changes the low-rank update, so it should be tested against exact SVD on small panels before being trusted on large panels.
Interactive fixed effects and randomized factor extraction
Interactive fixed effects use a low-rank factor structure such as
\[
Y_{it}(0) \approx \lambda_i'f_t + \alpha_i + \xi_t.
\]
After additive effects are handled, factor extraction is again an SVD problem on a residualized panel matrix. Exact factor extraction computes leading singular vectors of the full matrix. The randomized path approximates the same leading subspace with randomized SVD.
The target is not equality of individual factor signs or rotations. Factors are only identified up to invertible transformations. The relevant comparisons are reconstructed fitted values, counterfactual paths, and treatment-effect summaries.
Kernel approximation: Nyström features
Kernel methods replace dot products with similarities
\[
K(x_i,x_j).
\]
A full kernel matrix for \(n\) observations costs \(O(n^2)\) memory. Nyström approximation chooses \(m \ll n\) landmarks \(L = \{\ell_1,\ldots,\ell_m\}\) and forms
\[
K_{nm} = K(X,L),
\qquad
K_{mm} = K(L,L).
\]
The low-rank kernel approximation is
\[
K(X,X) \approx K_{nm}K_{mm}^{\dagger}K_{mn}.
\]
A convenient feature representation is
\[
\Phi_{Nys}(X) = K_{nm}K_{mm}^{-1/2},
\]
so that
\[
\Phi_{Nys}(X)\Phi_{Nys}(X)' \approx K(X,X).
\]
NystromBasis exposes these features as a transformer. Downstream estimators can stay ordinary linear estimators, e.g. ridge on Nyström features.
Kernel approximation: random Fourier features
For shift-invariant kernels \(K(x,z)=\kappa(x-z)\), Bochner’s theorem says that a continuous positive-definite kernel is the Fourier transform of a nonnegative measure. For the Gaussian/RBF kernel,
\[
K(x,z) = \exp\left(-\frac{\lVert x-z\rVert^2}{2b^2}\right),
\]
one can draw random frequencies
\[
\omega_j \sim N(0, b^{-2}I)
\]
and phases
\[
b_j \sim \operatorname{Uniform}(0,2\pi),
\]
then construct features
\[
\phi_j(x)
= \sqrt{\frac{2}{D}}\cos(\omega_j'x+b_j),
\qquad j=1,\ldots,D.
\]
The inner product approximates the kernel:
\[
\phi(x)'\phi(z) \approx K(x,z).
\]
RandomFourierFeatures is useful when a smooth nonlinear relationship can be learned by a linear estimator after random nonlinear expansion.
Many-moment GMM sketching
A GMM estimator solves moments
\[
\bar g(\theta) = \frac{1}{n}\sum_{i=1}^n g_i(\theta) \in \mathbb{R}^q,
\]
by minimizing
\[
Q(\theta) = \bar g(\theta)'W\bar g(\theta).
\]
When \(q\) is large, the moment system and weighting matrix can dominate computation. A moment sketch draws a projection
\[
R \in \mathbb{R}^{s \times q},
\qquad s < q,
\]
and fits the compressed moments
\[
\widetilde g(\theta) = R\bar g(\theta).
\]
The Jacobian is compressed consistently:
\[
\widetilde G(\theta) = R G(\theta),
\qquad
G(\theta)=\frac{\partial \bar g(\theta)}{\partial \theta'}.
\]
GMM.fit_sketch(...) uses a fixed Rademacher projection. The conservative interpretation is important: the fitted and summarized model is the sketched moment problem, not the full moment problem with magically free inference. Moment sketching is most appropriate when there are many redundant or highly correlated moments.
Randomized PCA before balancing weights
Balancing weights solve for weights \(w_i\) that make weighted source covariates match target covariates. In the simplest form, the balance condition is
\[
\sum_i w_i x_i \approx \bar x_{target}.
\]
If \(x_i\) is a long donor history or wide feature vector, exact balance can be high-dimensional and unstable. One explicit compression strategy is to first map covariates to randomized principal-component scores:
\[
z_i = V_k'(x_i-\bar x),
\]
where \(V_k\) is obtained by randomized SVD. Balance is then imposed on \(z_i\) rather than the full \(x_i\).
This is deliberately exposed as RandomizedPCA, not hidden inside BalancingWeights or synthetic-control estimators. Compression changes the estimand-relevant diagnostics: balance in compressed coordinates does not imply exact balance in original coordinates unless the discarded directions are irrelevant.
Practical tuning rules
- Keep exact defaults for small problems.
- Use randomized SVD when a matrix is large and plausibly low rank.
- Increase
oversamples when approximation quality is unstable.
- Increase
power_iter when singular values decay slowly.
- Use row sketches when \(n\) is large relative to the number of regressors and the least-squares geometry is well conditioned.
- Treat GMM moment sketching as a point-estimation acceleration first; be cautious with inference.
- For kernel approximations, choose the feature dimension by validation or an approximation diagnostic, not by aesthetics.
- For balancing compression, always inspect balance in original variables as a diagnostic.