Randomized Sketching And Least Squares

1 Math

Randomized numerical linear algebra replaces one large deterministic linear algebra problem with a smaller random one. The approximation is useful when the random compression preserves the geometry relevant for the downstream problem.

There are two distinct sketches in this page:

subspace sketches, which find a low-dimensional basis for the range of a matrix;
row sketches, which compress observations before least squares or instrumental variables.

1.1 Subspace sketches

Let \(A\) be an \(m \times n\) matrix. If \(A\) is approximately rank \(k\), its columns mostly live in a \(k\)-dimensional subspace. A randomized range finder draws a random test matrix \(\Omega \in \mathbb{R}^{n \times (k+q)}\) and forms

\[ Y = A\Omega, \]

where \(q\) is a small oversampling parameter. The columns of \(Y\) are random linear combinations of columns of \(A\). With high probability, they span nearly the same dominant column space as \(A\). Orthogonalizing gives

\[ Y \approx QR, \qquad Q'Q = I. \]

Then \(Q\) is an approximate basis for the range of \(A\). This is the primitive behind randomized SVD and randomized QR.

For a randomized SVD, project the matrix into this basis:

\[ B = Q'A. \]

Compute the ordinary SVD of the small matrix,

\[ B = \widetilde U \Sigma V', \]

and lift the left singular vectors back:

\[ A \approx Q \widetilde U \Sigma V'. \]

For a randomized QR approximation, factor the same small projected matrix:

\[ B = \widetilde Q R, \]

then lift back:

\[ A \approx (Q \widetilde Q) R. \]

When singular values decay slowly, the leading subspace is harder to identify. Power iterations sharpen the spectrum by repeatedly applying \(A'A\) or \(AA'\) before orthogonalizing. Heuristically, singular values are raised to higher powers, so large singular directions become easier to separate from small ones.

1.2 Row sketches for OLS

OLS solves

\[ \hat\beta = \arg\min_b \lVert y - Xb \rVert_2^2, \]

where \(X\) is \(n \times p\). If \(n\) is large and \(p\) is moderate, the expensive part is the number of rows. A row sketch draws an \(s \times n\) matrix \(S\) with \(s \ll n\) and solves

\[ \hat\beta_S = \arg\min_b \lVert Sy - SXb \rVert_2^2. \]

The sketch should preserve residual norms in the space spanned by \(X\) and \(y\):

\[ \lVert S(y-Xb) \rVert_2^2 \approx \lVert y-Xb \rVert_2^2 \quad \text{for relevant } b. \]

If that approximation is good, the minimizer of the sketched objective is close to the full OLS minimizer.

The implementation uses a CountSketch embedding. Each row \(i\) is assigned to one random bucket \(h(i) \in \{1,\ldots,s\}\) and one random sign \(\sigma_i \in \{-1,+1\}\). The sketched design and outcome are accumulated as

\[ (SX)_{h(i), \cdot} \mathrel{+}= \sigma_i X_{i,\cdot}, \qquad (Sy)_{h(i)} \mathrel{+}= \sigma_i y_i. \]

This touches each original row once, so sketch construction is \(O(np)\) rather than \(O(snp)\).

1.3 Row sketches for IV / 2SLS

Two-stage least squares can be written with an endogenous/exogenous design \(X\) and instrument design \(Z\). With an intercept included where needed, the population of matrices is

\[ X = [1, X_{endog}, X_{exog}], \qquad Z = [1, X_{exog}, Z_{excluded}]. \]

The usual finite-sample 2SLS estimator is

\[ \hat\beta_{2SLS} = (X'P_ZX)^{-1}X'P_Zy, \qquad P_Z = Z(Z'Z)^{-1}Z'. \]

A row-sketched IV fit applies the same sketch \(S\) to \(X\), \(Z\), and \(y\), then runs 2SLS on the compressed problem:

\[ \hat\beta_{S,2SLS} = ((SX)'P_{SZ}(SX))^{-1}(SX)'P_{SZ}(Sy), \qquad P_{SZ} = (SZ)((SZ)'(SZ))^{-1}(SZ)'. \]

Using the same sketch for the regressors, instruments, and outcome is important: the random compression should preserve the joint geometry of the moment condition

\[ Z'(y-X\beta) = 0. \]

The result is an approximate IV fit that can be much cheaper for tall designs, while still using the same first-stage projection logic as ordinary 2SLS.

2 Ablation

Show code

from html import escape
import time

import matplotlib.pyplot as plt
import numpy as np
from IPython.display import HTML, display

import crabbymetrics as cm

np.set_printoptions(precision=4, suppress=True)


def html_table(headers, rows):
    parts = ["<table>", "<thead>", "<tr>"]
    parts.extend(f"<th>{escape(str(header))}</th>" for header in headers)
    parts.extend(["</tr>", "</thead>", "<tbody>"])
    for row in rows:
        parts.append("<tr>")
        parts.extend(f"<td>{cell}</td>" for cell in row)
        parts.append("</tr>")
    parts.extend(["</tbody>", "</table>"])
    return "".join(parts)

2.1 OLS design

The first design is intentionally friendly to OLS: standardized Gaussian regressors, a dense linear signal, and homoskedastic additive noise. Full OLS is the numerical target. The sketch size is a multiple of the full design dimension, including the intercept.

Show code

def run_ols_once(n, p, noise, multiple, seed):
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(n, p))
    beta = rng.normal(size=p)
    intercept = 0.4
    y = intercept + x @ beta + noise * rng.normal(size=n)
    sketch_size = int(multiple * (p + 1))

    full = cm.OLS()
    t0 = time.perf_counter()
    full.fit(x, y)
    full_time = time.perf_counter() - t0
    full_summary = full.summary(vcov="vanilla")
    full_param = np.concatenate([[full_summary["intercept"]], full_summary["coef"]])

    sketch = cm.OLS()
    t0 = time.perf_counter()
    sketch.fit_sketch(x, y, sketch_size=sketch_size, seed=seed + 10_000)
    sketch_time = time.perf_counter() - t0
    sketch_summary = sketch.summary(vcov="vanilla")
    sketch_param = np.concatenate([[sketch_summary["intercept"]], sketch_summary["coef"]])

    x_test = rng.normal(size=(2000, p))
    full_pred = full.predict(x_test)
    sketch_pred = sketch.predict(x_test)

    return {
        "multiple": multiple,
        "sketch_size": sketch_size,
        "coef_rel_error": float(
            np.linalg.norm(sketch_param - full_param) / np.linalg.norm(full_param)
        ),
        "prediction_rmse_vs_full": float(np.sqrt(np.mean((sketch_pred - full_pred) ** 2))),
        "full_fit_seconds": full_time,
        "sketch_fit_seconds": sketch_time,
        "speedup": full_time / sketch_time if sketch_time > 0 else np.inf,
    }


ols_n = 20_000
ols_p = 40
ols_noise = 0.2
multiples = [2, 4, 8, 12, 16]
seeds = range(5)

ols_rows = [
    run_ols_once(ols_n, ols_p, ols_noise, multiple, seed)
    for multiple in multiples
    for seed in seeds
]
print("OLS replications:", len(ols_rows))
print("OLS design:", {"n": ols_n, "p": ols_p, "noise": ols_noise})

OLS replications: 25
OLS design: {'n': 20000, 'p': 40, 'noise': 0.2}

Show code

ols_summary = []
for multiple in multiples:
    block = [row for row in ols_rows if row["multiple"] == multiple]
    ols_summary.append(
        {
            "multiple": multiple,
            "sketch_size": block[0]["sketch_size"],
            "median_coef_rel_error": float(np.median([r["coef_rel_error"] for r in block])),
            "median_prediction_rmse_vs_full": float(
                np.median([r["prediction_rmse_vs_full"] for r in block])
            ),
            "median_speedup": float(np.median([r["speedup"] for r in block])),
        }
    )

ols_summary_rows = [
    [
        f"{entry['multiple']}x",
        entry["sketch_size"],
        f"{entry['median_coef_rel_error']:.4f}",
        f"{entry['median_prediction_rmse_vs_full']:.4f}",
        f"{entry['median_speedup']:.2f}x",
    ]
    for entry in ols_summary
]

display(
    HTML(
        html_table(
            [
                "Sketch multiple",
                "Sketch size",
                "Median coefficient relative error",
                "Median prediction RMSE vs full OLS",
                "Median speedup",
            ],
            ols_summary_rows,
        )
    )
)

Sketch multiple	Sketch size	Median coefficient relative error	Median prediction RMSE vs full OLS	Median speedup
2x	82	0.0312	0.2242	3.10x
4x	164	0.0188	0.1140	2.54x
8x	328	0.0106	0.0745	2.73x
12x	492	0.0086	0.0629	2.65x
16x	656	0.0083	0.0562	2.42x

2.2 IV design

The second design introduces one endogenous regressor whose first stage depends on excluded instruments and exogenous controls. The structural error is correlated with the first-stage shock, so IV rather than OLS is the relevant target. The comparison is full 2SLS versus sketched 2SLS.

Show code

def run_iv_once(n, multiple, seed):
    rng = np.random.default_rng(seed)
    z = rng.normal(size=(n, 3))
    x_exog = rng.normal(size=(n, 2))
    v = rng.normal(size=n)
    x_endog = 0.8 * z[:, [0]] + 0.4 * z[:, [1]] + 0.3 * x_exog[:, [0]] + v[:, None]
    eps = 0.6 * v + rng.normal(scale=0.2, size=n)
    y = 1.0 + 2.0 * x_endog[:, 0] - 0.5 * x_exog[:, 0] + 0.25 * x_exog[:, 1] + eps

    # The sketched 2SLS design has intercept + endogenous + exogenous regressors
    # and intercept + exogenous + excluded instruments.
    design_dim = max(1 + x_endog.shape[1] + x_exog.shape[1], 1 + x_exog.shape[1] + z.shape[1])
    sketch_size = int(multiple * design_dim)

    full = cm.TwoSLS()
    t0 = time.perf_counter()
    full.fit(x_endog, x_exog, z, y)
    full_time = time.perf_counter() - t0
    full_summary = full.summary(vcov="vanilla")
    full_param = np.concatenate([[full_summary["intercept"]], full_summary["coef"]])

    sketch = cm.TwoSLS()
    t0 = time.perf_counter()
    sketch.fit_sketch(x_endog, x_exog, z, y, sketch_size=sketch_size, seed=seed + 20_000)
    sketch_time = time.perf_counter() - t0
    sketch_summary = sketch.summary(vcov="vanilla")
    sketch_param = np.concatenate([[sketch_summary["intercept"]], sketch_summary["coef"]])

    x_pred = np.column_stack([x_endog, x_exog])[:2000]
    full_pred = full.predict(x_pred)
    sketch_pred = sketch.predict(x_pred)

    return {
        "multiple": multiple,
        "sketch_size": sketch_size,
        "coef_rel_error": float(
            np.linalg.norm(sketch_param - full_param) / np.linalg.norm(full_param)
        ),
        "prediction_rmse_vs_full": float(np.sqrt(np.mean((sketch_pred - full_pred) ** 2))),
        "full_fit_seconds": full_time,
        "sketch_fit_seconds": sketch_time,
        "speedup": full_time / sketch_time if sketch_time > 0 else np.inf,
    }


iv_n = 20_000
iv_rows = [run_iv_once(iv_n, multiple, seed) for multiple in multiples for seed in seeds]
print("IV replications:", len(iv_rows))
print("IV design:", {"n": iv_n, "n_endog": 1, "n_exog": 2, "n_excluded_instruments": 3})

IV replications: 25
IV design: {'n': 20000, 'n_endog': 1, 'n_exog': 2, 'n_excluded_instruments': 3}

Show code

iv_summary = []
for multiple in multiples:
    block = [row for row in iv_rows if row["multiple"] == multiple]
    iv_summary.append(
        {
            "multiple": multiple,
            "sketch_size": block[0]["sketch_size"],
            "median_coef_rel_error": float(np.median([r["coef_rel_error"] for r in block])),
            "median_prediction_rmse_vs_full": float(
                np.median([r["prediction_rmse_vs_full"] for r in block])
            ),
            "median_speedup": float(np.median([r["speedup"] for r in block])),
        }
    )

iv_summary_rows = [
    [
        f"{entry['multiple']}x",
        entry["sketch_size"],
        f"{entry['median_coef_rel_error']:.4f}",
        f"{entry['median_prediction_rmse_vs_full']:.4f}",
        f"{entry['median_speedup']:.2f}x",
    ]
    for entry in iv_summary
]

display(
    HTML(
        html_table(
            [
                "Sketch multiple",
                "Sketch size",
                "Median coefficient relative error",
                "Median prediction RMSE vs full 2SLS",
                "Median speedup",
            ],
            iv_summary_rows,
        )
    )
)

Sketch multiple	Sketch size	Median coefficient relative error	Median prediction RMSE vs full 2SLS	Median speedup
2x	12	0.1076	0.2785	0.72x
4x	24	0.1022	0.2544	0.72x
8x	48	0.0654	0.1545	0.70x
12x	72	0.0764	0.1820	0.66x
16x	96	0.0412	0.1107	0.70x

2.3 Plots

Show code

fig, axes = plt.subplots(2, 3, figsize=(14, 7), constrained_layout=True)

for row_idx, (label, summary) in enumerate([("OLS", ols_summary), ("2SLS", iv_summary)]):
    x_axis = np.array([entry["multiple"] for entry in summary])
    coef_error = np.array([entry["median_coef_rel_error"] for entry in summary])
    pred_error = np.array([entry["median_prediction_rmse_vs_full"] for entry in summary])
    speedup = np.array([entry["median_speedup"] for entry in summary])

    axes[row_idx, 0].plot(x_axis, coef_error, marker="o")
    axes[row_idx, 0].set_xlabel("Sketch size / design dimension")
    axes[row_idx, 0].set_ylabel("Median relative coefficient error")
    axes[row_idx, 0].set_title(f"{label}: coefficient accuracy")

    axes[row_idx, 1].plot(x_axis, pred_error, marker="o", color="tab:orange")
    axes[row_idx, 1].set_xlabel("Sketch size / design dimension")
    axes[row_idx, 1].set_ylabel("Median RMSE vs full fit")
    axes[row_idx, 1].set_title(f"{label}: prediction agreement")

    axes[row_idx, 2].plot(x_axis, speedup, marker="o", color="tab:green")
    axes[row_idx, 2].axhline(1.0, color="black", linestyle="--", linewidth=1)
    axes[row_idx, 2].set_xlabel("Sketch size / design dimension")
    axes[row_idx, 2].set_ylabel("Median full/sketch fit time")
    axes[row_idx, 2].set_title(f"{label}: runtime")

plt.show()

For these tall, well-conditioned designs, larger sketches move the approximate OLS and IV estimates toward the full-data estimates while keeping the solve on a much smaller synthetic dataset. Very small sketches are useful as fast exploratory approximations; larger sketches are the safer default when coefficient accuracy matters.