Randomized Sketching And Least Squares

1 Math

Randomized numerical linear algebra is useful when an estimator spends most of its time on a large matrix whose relevant geometry is much smaller than the matrix itself. The central move is to replace a large deterministic linear algebra problem by a smaller random problem that preserves the pieces of geometry the estimator actually uses.

This page separates five families of randomized approximations now exposed in crabbymetrics:

subspace sketches for low-rank range finding, SVD, QR, matrix completion, and factor models;
row sketches for tall least squares and instrumental variables;
kernel sketches via Nyström features and random Fourier features;
moment sketches for many-moment GMM;
feature compression before balancing or synthetic-control-style weighting.

The important design choice is conservative: exact algorithms remain the defaults for estimator classes. Sketching is opt-in through explicit method arguments or reusable transform classes.

1.1 The projection view

Let \(A \in \mathbb{R}^{m \times n}\). Many estimators do not need all of \(A\); they need either:

the action of \(A\) on a low-dimensional subspace;
the dominant left/right singular subspaces;
the Gram matrix \(A'A\) or \(AA'\);
the solution to a least-squares or moment equation involving \(A\).

A sketch introduces a random linear map and solves a smaller problem. There are two common orientations.

A range sketch multiplies on the right:

\[ Y = A\Omega, \qquad \Omega \in \mathbb{R}^{n \times \ell}, \qquad \ell = k + q, \]

where \(k\) is the target rank and \(q\) is an oversampling buffer. If \(A\) is exactly or approximately rank \(k\), the columns of \(Y\) span approximately the dominant column space of \(A\). After orthogonalizing,

\[ Y = QR, \qquad Q'Q = I_\ell, \]

we approximate

\[ A \approx QQ'A. \]

A row sketch multiplies on the left:

\[ \widetilde A = SA, \qquad S \in \mathbb{R}^{s \times m}, \qquad s \ll m. \]

For regression, this compresses observations. The sketch should be a subspace embedding: for vectors \(v\) in the relevant column span,

\[ (1-\varepsilon)\lVert v \rVert_2^2 \le \lVert Sv \rVert_2^2 \le (1+\varepsilon)\lVert v \rVert_2^2. \]

This condition is stronger than merely preserving pairwise distances for a few rows. It says that the random map approximately preserves the whole low-dimensional residual geometry.

1.2 Randomized range finding and SVD

The randomized SVD in crabbymetrics starts from the range sketch above. Given \(A \in \mathbb{R}^{m \times n}\) and a requested rank \(k\):

draw \(\Omega \in \mathbb{R}^{n \times (k+q)}\);
form \(Y=A\Omega\);
optionally perform power iterations;
orthogonalize \(Y=QR\);
compute the small SVD of \(B=Q'A\):

\[ B = \widetilde U \Sigma V'; \]

lift the left singular vectors:

\[ U \approx Q\widetilde U, \qquad A \approx U\Sigma V'. \]

If the singular values of \(A\) decay quickly, a small \(q\) is often enough. If singular values decay slowly, the leading subspace is harder to identify. Power iteration improves separation by replacing the sampled range with roughly

\[ Y = (AA')^r A\Omega. \]

The singular values are effectively transformed from \(\sigma_j\) to \(\sigma_j^{2r+1}\), so dominant directions stand out more sharply. The tradeoff is additional passes over \(A\).

The useful error heuristic is:

\[ \lVert A - QQ'A \rVert \approx \text{tail singular value scale}, \]

not machine precision. Randomized SVD is an approximation to a low-rank target, not a replacement for an exact full-rank factorization.

1.3 Randomized QR and approximate least squares

Randomized QR uses the same range basis. If \(Q\) approximately spans the column space of \(A\), then QR can be computed on the compressed coordinates and lifted back. Equivalently, randomized least squares can be viewed as solving on a sketch that preserves the column space of the design.

For a tall design \(X \in \mathbb{R}^{n \times p}\) with \(n \gg p\), OLS is

\[ \hat\beta = \arg\min_b \lVert y-Xb \rVert_2^2. \]

A sketched least-squares fit draws \(S \in \mathbb{R}^{s \times n}\) and solves

\[ \hat\beta_S = \arg\min_b \lVert S(y-Xb) \rVert_2^2. \]

If \(S\) is a subspace embedding for \(\operatorname{span}([X,y])\), then the sketched objective is uniformly close to the full objective over the relevant parameter region. That is the reason a smaller problem can recover a similar coefficient vector.

1.4 CountSketch row embeddings

The row-sketching path uses CountSketch. Each original row \(i\) receives:

a bucket \(h(i) \in \{1,\ldots,s\}\);
a sign \(\sigma_i \in \{-1,+1\}\).

Then rows are accumulated as

\[ (SX)_{h(i),\cdot} \mathrel{+}= \sigma_i X_{i,\cdot}, \qquad (Sy)_{h(i)} \mathrel{+}= \sigma_i y_i. \]

The matrix \(S\) is never materialized densely. Construction is one pass through the rows, so the cost is proportional to the number of nonzero entries touched. For dense \(X\), that is \(O(np)\) to build the sketch plus the cost of solving the \(s \times p\) compressed least-squares problem.

The sketch size \(s\) controls the bias–variance approximation tradeoff. A very small sketch is fast but noisy; a larger sketch approaches the full estimator.

1.5 Row sketches for instrumental variables

For 2SLS, let

\[ X = [1, X_{endog}, X_{exog}], \qquad Z = [1, X_{exog}, Z_{excluded}]. \]

The ordinary 2SLS estimator is

\[ \hat\beta_{2SLS} = (X'P_ZX)^{-1}X'P_Zy, \qquad P_Z = Z(Z'Z)^{-1}Z'. \]

The row-sketched estimator applies the same sketch to \(X\), \(Z\), and \(y\):

\[ \hat\beta_{S,2SLS} = ((SX)'P_{SZ}(SX))^{-1}(SX)'P_{SZ}(Sy), \]

where

\[ P_{SZ} = (SZ)((SZ)'(SZ))^{-1}(SZ)'. \]

Using a shared sketch is essential. The object being approximated is the joint moment geometry

\[ Z'(y-X\beta)=0, \]

not three unrelated matrices. Sketching only one of \(X\), \(Z\), or \(y\) would change the moment equation in a less controlled way.

1.6 Matrix completion via randomized singular-value thresholding

Panel matrix completion repeatedly solves a low-rank denoising problem. A stylized singular-value-thresholding update is

\[ M^{(t+1)} = \mathcal{S}_\lambda(A^{(t)}), \]

where \(\mathcal{S}_\lambda\) soft-thresholds singular values. If

\[ A^{(t)} = U\operatorname{diag}(\sigma_j)V', \]

then

\[ \mathcal{S}_\lambda(A^{(t)}) = U\operatorname{diag}((\sigma_j-\lambda)_+)V'. \]

When the desired panel signal is low rank, most singular values are discarded or heavily shrunk. Computing a full exact SVD in every iteration can therefore waste work. The randomized path computes an approximate rank-\(k\) SVD and applies the same soft-thresholding to the retained singular values.

This is why the MatrixCompletion sketching knobs are explicit:

svd_method="exact" keeps the deterministic path;
svd_method="randomized" uses randomized SVD;
svd_rank controls the retained rank ceiling;
oversampling and power-iteration controls tune approximation quality.

The approximation changes the low-rank update, so it should be tested against exact SVD on small panels before being trusted on large panels.

1.7 Interactive fixed effects and randomized factor extraction

Interactive fixed effects use a low-rank factor structure such as

\[ Y_{it}(0) \approx \lambda_i'f_t + \alpha_i + \xi_t. \]

After additive effects are handled, factor extraction is again an SVD problem on a residualized panel matrix. Exact factor extraction computes leading singular vectors of the full matrix. The randomized path approximates the same leading subspace with randomized SVD.

The target is not equality of individual factor signs or rotations. Factors are only identified up to invertible transformations. The relevant comparisons are reconstructed fitted values, counterfactual paths, and treatment-effect summaries.

1.8 Kernel approximation: Nyström features

Kernel methods replace dot products with similarities

\[ K(x_i,x_j). \]

A full kernel matrix for \(n\) observations costs \(O(n^2)\) memory. Nyström approximation chooses \(m \ll n\) landmarks \(L = \{\ell_1,\ldots,\ell_m\}\) and forms

\[ K_{nm} = K(X,L), \qquad K_{mm} = K(L,L). \]

The low-rank kernel approximation is

\[ K(X,X) \approx K_{nm}K_{mm}^{\dagger}K_{mn}. \]

A convenient feature representation is

\[ \Phi_{Nys}(X) = K_{nm}K_{mm}^{-1/2}, \]

so that

\[ \Phi_{Nys}(X)\Phi_{Nys}(X)' \approx K(X,X). \]

NystromBasis exposes these features as a transformer. Downstream estimators can stay ordinary linear estimators, e.g. ridge on Nyström features.

1.9 Kernel approximation: random Fourier features

For shift-invariant kernels \(K(x,z)=\kappa(x-z)\), Bochner’s theorem says that a continuous positive-definite kernel is the Fourier transform of a nonnegative measure. For the Gaussian/RBF kernel,

\[ K(x,z) = \exp\left(-\frac{\lVert x-z\rVert^2}{2b^2}\right), \]

one can draw random frequencies

\[ \omega_j \sim N(0, b^{-2}I) \]

and phases

\[ b_j \sim \operatorname{Uniform}(0,2\pi), \]

then construct features

\[ \phi_j(x) = \sqrt{\frac{2}{D}}\cos(\omega_j'x+b_j), \qquad j=1,\ldots,D. \]

The inner product approximates the kernel:

\[ \phi(x)'\phi(z) \approx K(x,z). \]

RandomFourierFeatures is useful when a smooth nonlinear relationship can be learned by a linear estimator after random nonlinear expansion.

1.10 Many-moment GMM sketching

A GMM estimator solves moments

\[ \bar g(\theta) = \frac{1}{n}\sum_{i=1}^n g_i(\theta) \in \mathbb{R}^q, \]

by minimizing

\[ Q(\theta) = \bar g(\theta)'W\bar g(\theta). \]

When \(q\) is large, the moment system and weighting matrix can dominate computation. A moment sketch draws a projection

\[ R \in \mathbb{R}^{s \times q}, \qquad s < q, \]

and fits the compressed moments

\[ \widetilde g(\theta) = R\bar g(\theta). \]

The Jacobian is compressed consistently:

\[ \widetilde G(\theta) = R G(\theta), \qquad G(\theta)=\frac{\partial \bar g(\theta)}{\partial \theta'}. \]

GMM.fit_sketch(...) uses a fixed Rademacher projection. The conservative interpretation is important: the fitted and summarized model is the sketched moment problem, not the full moment problem with magically free inference. Moment sketching is most appropriate when there are many redundant or highly correlated moments.

1.11 Randomized PCA before balancing weights

Balancing weights solve for weights \(w_i\) that make weighted source covariates match target covariates. In the simplest form, the balance condition is

\[ \sum_i w_i x_i \approx \bar x_{target}. \]

If \(x_i\) is a long donor history or wide feature vector, exact balance can be high-dimensional and unstable. One explicit compression strategy is to first map covariates to randomized principal-component scores:

\[ z_i = V_k'(x_i-\bar x), \]

where \(V_k\) is obtained by randomized SVD. Balance is then imposed on \(z_i\) rather than the full \(x_i\).

This is deliberately exposed as RandomizedPCA, not hidden inside BalancingWeights or synthetic-control estimators. Compression changes the estimand-relevant diagnostics: balance in compressed coordinates does not imply exact balance in original coordinates unless the discarded directions are irrelevant.

1.12 Practical tuning rules

Keep exact defaults for small problems.
Use randomized SVD when a matrix is large and plausibly low rank.
Increase oversamples when approximation quality is unstable.
Increase power_iter when singular values decay slowly.
Use row sketches when \(n\) is large relative to the number of regressors or instruments.
Treat GMM moment sketching as a point-estimation acceleration first; be cautious with inference.
For kernel approximations, choose the feature dimension by validation or an approximation diagnostic, not by aesthetics.
For balancing compression, always inspect balance in original variables as a diagnostic.

2 Ablation

Show code

from html import escape
import time

import matplotlib.pyplot as plt
import numpy as np
from IPython.display import HTML, display

import crabbymetrics as cm

np.set_printoptions(precision=4, suppress=True)


def html_table(headers, rows):
    parts = ["<table>", "<thead>", "<tr>"]
    parts.extend(f"<th>{escape(str(header))}</th>" for header in headers)
    parts.extend(["</tr>", "</thead>", "<tbody>"])
    for row in rows:
        parts.append("<tr>")
        parts.extend(f"<td>{cell}</td>" for cell in row)
        parts.append("</tr>")
    parts.extend(["</tbody>", "</table>"])
    return "".join(parts)

2.1 OLS design

The first design is intentionally friendly to OLS: standardized Gaussian regressors, a dense linear signal, and homoskedastic additive noise. Full OLS is the numerical target. The sketch size is a multiple of the full design dimension, including the intercept.

Show code

def run_ols_once(n, p, noise, multiple, seed):
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(n, p))
    beta = rng.normal(size=p)
    intercept = 0.4
    y = intercept + x @ beta + noise * rng.normal(size=n)
    sketch_size = int(multiple * (p + 1))

    full = cm.OLS()
    t0 = time.perf_counter()
    full.fit(x, y)
    full_time = time.perf_counter() - t0
    full_summary = full.summary(vcov="vanilla")
    full_param = np.concatenate([[full_summary["intercept"]], full_summary["coef"]])

    sketch = cm.OLS()
    t0 = time.perf_counter()
    sketch.fit_sketch(x, y, sketch_size=sketch_size, seed=seed + 10_000)
    sketch_time = time.perf_counter() - t0
    sketch_summary = sketch.summary(vcov="vanilla")
    sketch_param = np.concatenate([[sketch_summary["intercept"]], sketch_summary["coef"]])

    x_test = rng.normal(size=(2000, p))
    full_pred = full.predict(x_test)
    sketch_pred = sketch.predict(x_test)

    return {
        "multiple": multiple,
        "sketch_size": sketch_size,
        "coef_rel_error": float(
            np.linalg.norm(sketch_param - full_param) / np.linalg.norm(full_param)
        ),
        "prediction_rmse_vs_full": float(np.sqrt(np.mean((sketch_pred - full_pred) ** 2))),
        "full_fit_seconds": full_time,
        "sketch_fit_seconds": sketch_time,
        "speedup": full_time / sketch_time if sketch_time > 0 else np.inf,
    }


ols_n = 20_000
ols_p = 40
ols_noise = 0.2
multiples = [2, 4, 8, 12, 16]
seeds = range(5)

ols_rows = [
    run_ols_once(ols_n, ols_p, ols_noise, multiple, seed)
    for multiple in multiples
    for seed in seeds
]
print("OLS replications:", len(ols_rows))
print("OLS design:", {"n": ols_n, "p": ols_p, "noise": ols_noise})

OLS replications: 25
OLS design: {'n': 20000, 'p': 40, 'noise': 0.2}

Show code

ols_summary = []
for multiple in multiples:
    block = [row for row in ols_rows if row["multiple"] == multiple]
    ols_summary.append(
        {
            "multiple": multiple,
            "sketch_size": block[0]["sketch_size"],
            "median_coef_rel_error": float(np.median([r["coef_rel_error"] for r in block])),
            "median_prediction_rmse_vs_full": float(
                np.median([r["prediction_rmse_vs_full"] for r in block])
            ),
            "median_speedup": float(np.median([r["speedup"] for r in block])),
        }
    )

ols_summary_rows = [
    [
        f"{entry['multiple']}x",
        entry["sketch_size"],
        f"{entry['median_coef_rel_error']:.4f}",
        f"{entry['median_prediction_rmse_vs_full']:.4f}",
        f"{entry['median_speedup']:.2f}x",
    ]
    for entry in ols_summary
]

display(
    HTML(
        html_table(
            [
                "Sketch multiple",
                "Sketch size",
                "Median coefficient relative error",
                "Median prediction RMSE vs full OLS",
                "Median speedup",
            ],
            ols_summary_rows,
        )
    )
)

Sketch multiple	Sketch size	Median coefficient relative error	Median prediction RMSE vs full OLS	Median speedup
2x	82	0.0312	0.2242	2.56x
4x	164	0.0188	0.1140	2.48x
8x	328	0.0106	0.0745	2.87x
12x	492	0.0086	0.0629	2.60x
16x	656	0.0083	0.0562	2.38x

2.2 IV design

The second design introduces one endogenous regressor whose first stage depends on excluded instruments and exogenous controls. The structural error is correlated with the first-stage shock, so IV rather than OLS is the relevant target. The comparison is full 2SLS versus sketched 2SLS.

Show code

def run_iv_once(n, multiple, seed):
    rng = np.random.default_rng(seed)
    z = rng.normal(size=(n, 3))
    x_exog = rng.normal(size=(n, 2))
    v = rng.normal(size=n)
    x_endog = 0.8 * z[:, [0]] + 0.4 * z[:, [1]] + 0.3 * x_exog[:, [0]] + v[:, None]
    eps = 0.6 * v + rng.normal(scale=0.2, size=n)
    y = 1.0 + 2.0 * x_endog[:, 0] - 0.5 * x_exog[:, 0] + 0.25 * x_exog[:, 1] + eps

    # The sketched 2SLS design has intercept + endogenous + exogenous regressors
    # and intercept + exogenous + excluded instruments.
    design_dim = max(1 + x_endog.shape[1] + x_exog.shape[1], 1 + x_exog.shape[1] + z.shape[1])
    sketch_size = int(multiple * design_dim)

    full = cm.TwoSLS()
    t0 = time.perf_counter()
    full.fit(x_endog, x_exog, z, y)
    full_time = time.perf_counter() - t0
    full_summary = full.summary(vcov="vanilla")
    full_param = np.concatenate([[full_summary["intercept"]], full_summary["coef"]])

    sketch = cm.TwoSLS()
    t0 = time.perf_counter()
    sketch.fit_sketch(x_endog, x_exog, z, y, sketch_size=sketch_size, seed=seed + 20_000)
    sketch_time = time.perf_counter() - t0
    sketch_summary = sketch.summary(vcov="vanilla")
    sketch_param = np.concatenate([[sketch_summary["intercept"]], sketch_summary["coef"]])

    x_pred = np.column_stack([x_endog, x_exog])[:2000]
    full_pred = full.predict(x_pred)
    sketch_pred = sketch.predict(x_pred)

    return {
        "multiple": multiple,
        "sketch_size": sketch_size,
        "coef_rel_error": float(
            np.linalg.norm(sketch_param - full_param) / np.linalg.norm(full_param)
        ),
        "prediction_rmse_vs_full": float(np.sqrt(np.mean((sketch_pred - full_pred) ** 2))),
        "full_fit_seconds": full_time,
        "sketch_fit_seconds": sketch_time,
        "speedup": full_time / sketch_time if sketch_time > 0 else np.inf,
    }


iv_n = 20_000
iv_rows = [run_iv_once(iv_n, multiple, seed) for multiple in multiples for seed in seeds]
print("IV replications:", len(iv_rows))
print("IV design:", {"n": iv_n, "n_endog": 1, "n_exog": 2, "n_excluded_instruments": 3})

IV replications: 25
IV design: {'n': 20000, 'n_endog': 1, 'n_exog': 2, 'n_excluded_instruments': 3}

Show code

iv_summary = []
for multiple in multiples:
    block = [row for row in iv_rows if row["multiple"] == multiple]
    iv_summary.append(
        {
            "multiple": multiple,
            "sketch_size": block[0]["sketch_size"],
            "median_coef_rel_error": float(np.median([r["coef_rel_error"] for r in block])),
            "median_prediction_rmse_vs_full": float(
                np.median([r["prediction_rmse_vs_full"] for r in block])
            ),
            "median_speedup": float(np.median([r["speedup"] for r in block])),
        }
    )

iv_summary_rows = [
    [
        f"{entry['multiple']}x",
        entry["sketch_size"],
        f"{entry['median_coef_rel_error']:.4f}",
        f"{entry['median_prediction_rmse_vs_full']:.4f}",
        f"{entry['median_speedup']:.2f}x",
    ]
    for entry in iv_summary
]

display(
    HTML(
        html_table(
            [
                "Sketch multiple",
                "Sketch size",
                "Median coefficient relative error",
                "Median prediction RMSE vs full 2SLS",
                "Median speedup",
            ],
            iv_summary_rows,
        )
    )
)

Sketch multiple	Sketch size	Median coefficient relative error	Median prediction RMSE vs full 2SLS	Median speedup
2x	12	0.1076	0.2785	0.74x
4x	24	0.1022	0.2544	0.72x
8x	48	0.0654	0.1545	0.70x
12x	72	0.0764	0.1820	0.65x
16x	96	0.0412	0.1107	0.73x

2.3 Plots

Show code

fig, axes = plt.subplots(2, 3, figsize=(14, 7), constrained_layout=True)

for row_idx, (label, summary) in enumerate([("OLS", ols_summary), ("2SLS", iv_summary)]):
    x_axis = np.array([entry["multiple"] for entry in summary])
    coef_error = np.array([entry["median_coef_rel_error"] for entry in summary])
    pred_error = np.array([entry["median_prediction_rmse_vs_full"] for entry in summary])
    speedup = np.array([entry["median_speedup"] for entry in summary])

    axes[row_idx, 0].plot(x_axis, coef_error, marker="o")
    axes[row_idx, 0].set_xlabel("Sketch size / design dimension")
    axes[row_idx, 0].set_ylabel("Median relative coefficient error")
    axes[row_idx, 0].set_title(f"{label}: coefficient accuracy")

    axes[row_idx, 1].plot(x_axis, pred_error, marker="o", color="tab:orange")
    axes[row_idx, 1].set_xlabel("Sketch size / design dimension")
    axes[row_idx, 1].set_ylabel("Median RMSE vs full fit")
    axes[row_idx, 1].set_title(f"{label}: prediction agreement")

    axes[row_idx, 2].plot(x_axis, speedup, marker="o", color="tab:green")
    axes[row_idx, 2].axhline(1.0, color="black", linestyle="--", linewidth=1)
    axes[row_idx, 2].set_xlabel("Sketch size / design dimension")
    axes[row_idx, 2].set_ylabel("Median full/sketch fit time")
    axes[row_idx, 2].set_title(f"{label}: runtime")

plt.show()

For these tall, well-conditioned designs, larger sketches move the approximate OLS and IV estimates toward the full-data estimates while keeping the solve on a much smaller synthetic dataset. Very small sketches are useful as fast exploratory approximations; larger sketches are the safer default when coefficient accuracy matters.