Randomized Sketching And Kernel Approximation

1 Math

Randomized numerical linear algebra is useful when an estimator spends most of its time on a large matrix whose relevant geometry is much smaller than the matrix itself. The central move is to replace a large deterministic linear algebra problem by a smaller random problem that preserves the pieces of geometry the estimator actually uses.

This page separates five families of randomized approximations now exposed in crabbymetrics:

subspace sketches for low-rank range finding, SVD, QR, matrix completion, and factor models;
row sketches for tall least squares;
kernel sketches via Nyström features and random Fourier features;
moment sketches for many-moment GMM;
feature compression before balancing or synthetic-control-style weighting.

The important design choice is conservative: exact algorithms remain the defaults for estimator classes. Sketching is opt-in through explicit method arguments or reusable transform classes.

1.1 The projection view

Let \(A \in \mathbb{R}^{m \times n}\). Many estimators do not need all of \(A\); they need either:

the action of \(A\) on a low-dimensional subspace;
the dominant left/right singular subspaces;
the Gram matrix \(A'A\) or \(AA'\);
the solution to a least-squares or moment equation involving \(A\).

A sketch introduces a random linear map and solves a smaller problem. There are two common orientations.

A range sketch multiplies on the right:

\[ Y = A\Omega, \qquad \Omega \in \mathbb{R}^{n \times \ell}, \qquad \ell = k + q, \]

where \(k\) is the target rank and \(q\) is an oversampling buffer. If \(A\) is exactly or approximately rank \(k\), the columns of \(Y\) span approximately the dominant column space of \(A\). After orthogonalizing,

\[ Y = QR, \qquad Q'Q = I_\ell, \]

we approximate

\[ A \approx QQ'A. \]

A row sketch multiplies on the left:

\[ \widetilde A = SA, \qquad S \in \mathbb{R}^{s \times m}, \qquad s \ll m. \]

For regression, this compresses observations. The sketch should be a subspace embedding: for vectors \(v\) in the relevant column span,

\[ (1-\varepsilon)\lVert v \rVert_2^2 \le \lVert Sv \rVert_2^2 \le (1+\varepsilon)\lVert v \rVert_2^2. \]

This condition is stronger than merely preserving pairwise distances for a few rows. It says that the random map approximately preserves the whole low-dimensional residual geometry.

1.2 Randomized range finding and SVD

The randomized SVD in crabbymetrics starts from the range sketch above. Given \(A \in \mathbb{R}^{m \times n}\) and a requested rank \(k\):

draw \(\Omega \in \mathbb{R}^{n \times (k+q)}\);
form \(Y=A\Omega\);
optionally perform power iterations;
orthogonalize \(Y=QR\);
compute the small SVD of \(B=Q'A\):

\[ B = \widetilde U \Sigma V'; \]

lift the left singular vectors:

\[ U \approx Q\widetilde U, \qquad A \approx U\Sigma V'. \]

If the singular values of \(A\) decay quickly, a small \(q\) is often enough. If singular values decay slowly, the leading subspace is harder to identify. Power iteration improves separation by replacing the sampled range with roughly

\[ Y = (AA')^r A\Omega. \]

The singular values are effectively transformed from \(\sigma_j\) to \(\sigma_j^{2r+1}\), so dominant directions stand out more sharply. The tradeoff is additional passes over \(A\).

The useful error heuristic is:

\[ \lVert A - QQ'A \rVert \approx \text{tail singular value scale}, \]

not machine precision. Randomized SVD is an approximation to a low-rank target, not a replacement for an exact full-rank factorization.

1.3 Randomized QR and approximate least squares

Randomized QR uses the same range basis. If \(Q\) approximately spans the column space of \(A\), then QR can be computed on the compressed coordinates and lifted back. Equivalently, randomized least squares can be viewed as solving on a sketch that preserves the column space of the design.

For a tall design \(X \in \mathbb{R}^{n \times p}\) with \(n \gg p\), OLS is

\[ \hat\beta = \arg\min_b \lVert y-Xb \rVert_2^2. \]

A sketched least-squares fit draws \(S \in \mathbb{R}^{s \times n}\) and solves

\[ \hat\beta_S = \arg\min_b \lVert S(y-Xb) \rVert_2^2. \]

If \(S\) is a subspace embedding for \(\operatorname{span}([X,y])\), then the sketched objective is uniformly close to the full objective over the relevant parameter region. That is the reason a smaller problem can recover a similar coefficient vector.

1.4 CountSketch row embeddings

The row-sketching path uses CountSketch. Each original row \(i\) receives:

a bucket \(h(i) \in \{1,\ldots,s\}\);
a sign \(\sigma_i \in \{-1,+1\}\).

Then rows are accumulated as

\[ (SX)_{h(i),\cdot} \mathrel{+}= \sigma_i X_{i,\cdot}, \qquad (Sy)_{h(i)} \mathrel{+}= \sigma_i y_i. \]

The matrix \(S\) is never materialized densely. Construction is one pass through the rows, so the cost is proportional to the number of nonzero entries touched. For dense \(X\), that is \(O(np)\) to build the sketch plus the cost of solving the \(s \times p\) compressed least-squares problem.

The sketch size \(s\) controls the bias–variance approximation tradeoff. A very small sketch is fast but noisy; a larger sketch approaches the full estimator.

1.5 Matrix completion via randomized singular-value thresholding

Panel matrix completion repeatedly solves a low-rank denoising problem. A stylized singular-value-thresholding update is

\[ M^{(t+1)} = \mathcal{S}_\lambda(A^{(t)}), \]

where \(\mathcal{S}_\lambda\) soft-thresholds singular values. If

\[ A^{(t)} = U\operatorname{diag}(\sigma_j)V', \]

then

\[ \mathcal{S}_\lambda(A^{(t)}) = U\operatorname{diag}((\sigma_j-\lambda)_+)V'. \]

When the desired panel signal is low rank, most singular values are discarded or heavily shrunk. Computing a full exact SVD in every iteration can therefore waste work. The randomized path computes an approximate rank-\(k\) SVD and applies the same soft-thresholding to the retained singular values.

This is why the MatrixCompletion sketching knobs are explicit:

svd_method="exact" keeps the deterministic path;
svd_method="randomized" uses randomized SVD;
svd_rank controls the retained rank ceiling;
oversampling and power-iteration controls tune approximation quality.

The approximation changes the low-rank update, so it should be tested against exact SVD on small panels before being trusted on large panels.

1.6 Interactive fixed effects and randomized factor extraction

Interactive fixed effects use a low-rank factor structure such as

\[ Y_{it}(0) \approx \lambda_i'f_t + \alpha_i + \xi_t. \]

After additive effects are handled, factor extraction is again an SVD problem on a residualized panel matrix. Exact factor extraction computes leading singular vectors of the full matrix. The randomized path approximates the same leading subspace with randomized SVD.

The target is not equality of individual factor signs or rotations. Factors are only identified up to invertible transformations. The relevant comparisons are reconstructed fitted values, counterfactual paths, and treatment-effect summaries.

1.7 Kernel approximation: Nyström features

Kernel methods replace dot products with similarities

\[ K(x_i,x_j). \]

A full kernel matrix for \(n\) observations costs \(O(n^2)\) memory. Nyström approximation chooses \(m \ll n\) landmarks \(L = \{\ell_1,\ldots,\ell_m\}\) and forms

\[ K_{nm} = K(X,L), \qquad K_{mm} = K(L,L). \]

The low-rank kernel approximation is

\[ K(X,X) \approx K_{nm}K_{mm}^{\dagger}K_{mn}. \]

A convenient feature representation is

\[ \Phi_{Nys}(X) = K_{nm}K_{mm}^{-1/2}, \]

so that

\[ \Phi_{Nys}(X)\Phi_{Nys}(X)' \approx K(X,X). \]

NystromBasis exposes these features as a transformer. Downstream estimators can stay ordinary linear estimators, e.g. ridge on Nyström features.

1.8 Kernel approximation: random Fourier features

For shift-invariant kernels \(K(x,z)=\kappa(x-z)\), Bochner’s theorem says that a continuous positive-definite kernel is the Fourier transform of a nonnegative measure. For the Gaussian/RBF kernel,

\[ K(x,z) = \exp\left(-\frac{\lVert x-z\rVert^2}{2b^2}\right), \]

one can draw random frequencies

\[ \omega_j \sim N(0, b^{-2}I) \]

and phases

\[ b_j \sim \operatorname{Uniform}(0,2\pi), \]

then construct features

\[ \phi_j(x) = \sqrt{\frac{2}{D}}\cos(\omega_j'x+b_j), \qquad j=1,\ldots,D. \]

The inner product approximates the kernel:

\[ \phi(x)'\phi(z) \approx K(x,z). \]

RandomFourierFeatures is useful when a smooth nonlinear relationship can be learned by a linear estimator after random nonlinear expansion.

1.9 Many-moment GMM sketching

A GMM estimator solves moments

\[ \bar g(\theta) = \frac{1}{n}\sum_{i=1}^n g_i(\theta) \in \mathbb{R}^q, \]

by minimizing

\[ Q(\theta) = \bar g(\theta)'W\bar g(\theta). \]

When \(q\) is large, the moment system and weighting matrix can dominate computation. A moment sketch draws a projection

\[ R \in \mathbb{R}^{s \times q}, \qquad s < q, \]

and fits the compressed moments

\[ \widetilde g(\theta) = R\bar g(\theta). \]

The Jacobian is compressed consistently:

\[ \widetilde G(\theta) = R G(\theta), \qquad G(\theta)=\frac{\partial \bar g(\theta)}{\partial \theta'}. \]

GMM.fit_sketch(...) uses a fixed Rademacher projection. The conservative interpretation is important: the fitted and summarized model is the sketched moment problem, not the full moment problem with magically free inference. Moment sketching is most appropriate when there are many redundant or highly correlated moments.

1.10 Randomized PCA before balancing weights

Balancing weights solve for weights \(w_i\) that make weighted source covariates match target covariates. In the simplest form, the balance condition is

\[ \sum_i w_i x_i \approx \bar x_{target}. \]

If \(x_i\) is a long donor history or wide feature vector, exact balance can be high-dimensional and unstable. One explicit compression strategy is to first map covariates to randomized principal-component scores:

\[ z_i = V_k'(x_i-\bar x), \]

where \(V_k\) is obtained by randomized SVD. Balance is then imposed on \(z_i\) rather than the full \(x_i\).

This is deliberately exposed as RandomizedPCA, not hidden inside BalancingWeights or synthetic-control estimators. Compression changes the estimand-relevant diagnostics: balance in compressed coordinates does not imply exact balance in original coordinates unless the discarded directions are irrelevant.

1.11 Practical tuning rules

Keep exact defaults for small problems.
Use randomized SVD when a matrix is large and plausibly low rank.
Increase oversamples when approximation quality is unstable.
Increase power_iter when singular values decay slowly.
Use row sketches when \(n\) is large relative to the number of regressors and the least-squares geometry is well conditioned.
Treat GMM moment sketching as a point-estimation acceleration first; be cautious with inference.
For kernel approximations, choose the feature dimension by validation or an approximation diagnostic, not by aesthetics.
For balancing compression, always inspect balance in original variables as a diagnostic.

2 Ablation

Show code

from html import escape
import time

import matplotlib.pyplot as plt
import numpy as np
from IPython.display import HTML, display

import crabbymetrics as cm

np.set_printoptions(precision=4, suppress=True)


def html_table(headers, rows):
    parts = ["<table>", "<thead>", "<tr>"]
    parts.extend(f"<th>{escape(str(header))}</th>" for header in headers)
    parts.extend(["</tr>", "</thead>", "<tbody>"])
    for row in rows:
        parts.append("<tr>")
        parts.extend(f"<td>{cell}</td>" for cell in row)
        parts.append("</tr>")
    parts.extend(["</tbody>", "</table>"])
    return "".join(parts)

2.1 OLS design

The first design is intentionally friendly to OLS: standardized Gaussian regressors, a dense linear signal, and homoskedastic additive noise. Full OLS is the numerical target. The sketch size is a multiple of the full design dimension, including the intercept.

Show code

def run_ols_once(n, p, noise, multiple, seed):
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(n, p))
    beta = rng.normal(size=p)
    intercept = 0.4
    y = intercept + x @ beta + noise * rng.normal(size=n)
    sketch_size = int(multiple * (p + 1))

    full = cm.OLS()
    t0 = time.perf_counter()
    full.fit(x, y)
    full_time = time.perf_counter() - t0
    full_summary = full.summary(vcov="vanilla")
    full_param = np.concatenate([[full_summary["intercept"]], full_summary["coef"]])

    sketch = cm.OLS()
    t0 = time.perf_counter()
    sketch.fit_sketch(x, y, sketch_size=sketch_size, seed=seed + 10_000)
    sketch_time = time.perf_counter() - t0
    sketch_summary = sketch.summary(vcov="vanilla")
    sketch_param = np.concatenate([[sketch_summary["intercept"]], sketch_summary["coef"]])

    x_test = rng.normal(size=(2000, p))
    full_pred = full.predict(x_test)
    sketch_pred = sketch.predict(x_test)

    return {
        "multiple": multiple,
        "sketch_size": sketch_size,
        "coef_rel_error": float(
            np.linalg.norm(sketch_param - full_param) / np.linalg.norm(full_param)
        ),
        "prediction_rmse_vs_full": float(np.sqrt(np.mean((sketch_pred - full_pred) ** 2))),
        "full_fit_seconds": full_time,
        "sketch_fit_seconds": sketch_time,
        "speedup": full_time / sketch_time if sketch_time > 0 else np.inf,
    }


ols_n = 20_000
ols_p = 40
ols_noise = 0.2
multiples = [2, 4, 8, 12, 16]
seeds = range(5)

ols_rows = [
    run_ols_once(ols_n, ols_p, ols_noise, multiple, seed)
    for multiple in multiples
    for seed in seeds
]
print("OLS replications:", len(ols_rows))
print("OLS design:", {"n": ols_n, "p": ols_p, "noise": ols_noise})

OLS replications: 25
OLS design: {'n': 20000, 'p': 40, 'noise': 0.2}

Show code

ols_summary = []
for multiple in multiples:
    block = [row for row in ols_rows if row["multiple"] == multiple]
    ols_summary.append(
        {
            "multiple": multiple,
            "sketch_size": block[0]["sketch_size"],
            "median_coef_rel_error": float(np.median([r["coef_rel_error"] for r in block])),
            "median_prediction_rmse_vs_full": float(
                np.median([r["prediction_rmse_vs_full"] for r in block])
            ),
            "median_speedup": float(np.median([r["speedup"] for r in block])),
        }
    )

ols_summary_rows = [
    [
        f"{entry['multiple']}x",
        entry["sketch_size"],
        f"{entry['median_coef_rel_error']:.4f}",
        f"{entry['median_prediction_rmse_vs_full']:.4f}",
        f"{entry['median_speedup']:.2f}x",
    ]
    for entry in ols_summary
]

display(
    HTML(
        html_table(
            [
                "Sketch multiple",
                "Sketch size",
                "Median coefficient relative error",
                "Median prediction RMSE vs full OLS",
                "Median speedup",
            ],
            ols_summary_rows,
        )
    )
)

Sketch multiple	Sketch size	Median coefficient relative error	Median prediction RMSE vs full OLS	Median speedup
2x	82	0.0312	0.2242	2.75x
4x	164	0.0188	0.1140	2.82x
8x	328	0.0106	0.0745	2.68x
12x	492	0.0086	0.0629	2.50x
16x	656	0.0083	0.0562	2.32x

2.2 Kernel ridge design

The second design is nonlinear. The full target is Gaussian kernel ridge regression using the full \(n \times n\) training kernel. The randomized alternatives replace the full kernel by explicit features: Nyström features from sampled landmarks and random Fourier features. This makes the downstream regression an ordinary ridge problem on a much smaller feature matrix.

The comparison reports two things:

the relative Frobenius error of the training-kernel approximation, \(\lVert K-\Phi\Phi'\rVert_F / \lVert K\rVert_F\);
downstream test RMSE against the true regression function and against the full kernel-ridge predictor.

Show code

def rbf_kernel(x, z, bandwidth):
    x_norm = np.sum(x * x, axis=1)[:, None]
    z_norm = np.sum(z * z, axis=1)[None, :]
    sqdist = np.maximum(x_norm + z_norm - 2.0 * x @ z.T, 0.0)
    return np.exp(-sqdist / bandwidth)


def ridge_on_features(phi_train, y_train, phi_test, penalty):
    gram = phi_train.T @ phi_train
    rhs = phi_train.T @ y_train
    coef = np.linalg.solve(gram + penalty * np.eye(gram.shape[0]), rhs)
    return phi_test @ coef


def run_kernel_once(n_train, n_test, n_features, bandwidth, penalty, components, seed):
    rng = np.random.default_rng(seed)
    x_train = rng.uniform(-3.0, 3.0, size=(n_train, n_features))
    x_test = rng.uniform(-3.0, 3.0, size=(n_test, n_features))

    def signal(x):
        return (
            np.sin(1.4 * x[:, 0])
            + 0.7 * np.cos(1.1 * x[:, 1] * x[:, 2])
            + 0.3 * x[:, 3] ** 2
            - 0.2 * x[:, 4]
        )

    f_train = signal(x_train)
    f_test = signal(x_test)
    y_train = f_train + 0.15 * rng.normal(size=n_train)

    t0 = time.perf_counter()
    k_train = rbf_kernel(x_train, x_train, bandwidth)
    k_test = rbf_kernel(x_test, x_train, bandwidth)
    alpha = np.linalg.solve(k_train + penalty * np.eye(n_train), y_train)
    full_pred = k_test @ alpha
    full_time = time.perf_counter() - t0

    out = []
    for method in ["Nyström", "Random Fourier"]:
        t0 = time.perf_counter()
        if method == "Nyström":
            basis = cm.NystromBasis(
                components,
                kernel="gaussian",
                bandwidth=bandwidth,
                ridge=1e-10,
                seed=seed + 30_000,
            )
        else:
            basis = cm.RandomFourierFeatures(
                components,
                bandwidth=bandwidth,
                seed=seed + 40_000,
            )
        phi_train = basis.fit_transform(x_train)
        phi_test = basis.transform(x_test)
        sketch_pred = ridge_on_features(phi_train, y_train, phi_test, penalty)
        sketch_time = time.perf_counter() - t0
        k_approx = phi_train @ phi_train.T

        out.append(
            {
                "method": method,
                "components": components,
                "kernel_rel_error": float(
                    np.linalg.norm(k_approx - k_train, ord="fro")
                    / np.linalg.norm(k_train, ord="fro")
                ),
                "rmse_vs_truth": float(np.sqrt(np.mean((sketch_pred - f_test) ** 2))),
                "rmse_vs_full_krr": float(np.sqrt(np.mean((sketch_pred - full_pred) ** 2))),
                "full_rmse_vs_truth": float(np.sqrt(np.mean((full_pred - f_test) ** 2))),
                "full_fit_seconds": full_time,
                "sketch_fit_seconds": sketch_time,
                "speedup": full_time / sketch_time if sketch_time > 0 else np.inf,
            }
        )
    return out


kernel_n_train = 1_200
kernel_n_test = 600
kernel_n_features = 5
kernel_bandwidth = 8.0
kernel_penalty = 1.0
component_grid = [50, 100, 200, 400]
kernel_seeds = range(3)

kernel_rows = [
    row
    for components in component_grid
    for seed in kernel_seeds
    for row in run_kernel_once(
        kernel_n_train,
        kernel_n_test,
        kernel_n_features,
        kernel_bandwidth,
        kernel_penalty,
        components,
        seed,
    )
]
print("Kernel replications:", len(kernel_rows))
print(
    "Kernel design:",
    {
        "n_train": kernel_n_train,
        "n_test": kernel_n_test,
        "p": kernel_n_features,
        "bandwidth": kernel_bandwidth,
        "penalty": kernel_penalty,
    },
)

Kernel replications: 24
Kernel design: {'n_train': 1200, 'n_test': 600, 'p': 5, 'bandwidth': 8.0, 'penalty': 1.0}

Show code

kernel_summary = []
for method in ["Nyström", "Random Fourier"]:
    for components in component_grid:
        block = [
            row
            for row in kernel_rows
            if row["method"] == method and row["components"] == components
        ]
        kernel_summary.append(
            {
                "method": method,
                "components": components,
                "median_kernel_rel_error": float(
                    np.median([r["kernel_rel_error"] for r in block])
                ),
                "median_rmse_vs_truth": float(np.median([r["rmse_vs_truth"] for r in block])),
                "median_rmse_vs_full_krr": float(
                    np.median([r["rmse_vs_full_krr"] for r in block])
                ),
                "median_speedup": float(np.median([r["speedup"] for r in block])),
                "median_full_rmse_vs_truth": float(
                    np.median([r["full_rmse_vs_truth"] for r in block])
                ),
            }
        )

kernel_summary_rows = [
    [
        entry["method"],
        entry["components"],
        f"{entry['median_kernel_rel_error']:.4f}",
        f"{entry['median_rmse_vs_truth']:.4f}",
        f"{entry['median_rmse_vs_full_krr']:.4f}",
        f"{entry['median_speedup']:.2f}x",
    ]
    for entry in kernel_summary
]

display(
    HTML(
        html_table(
            [
                "Approximation",
                "Features",
                "Median kernel relative error",
                "Median test RMSE vs truth",
                "Median test RMSE vs full KRR",
                "Median full/sketch time",
            ],
            kernel_summary_rows,
        )
    )
)

full_rmse = kernel_summary[0]["median_full_rmse_vs_truth"]
print(f"Median full-kernel KRR test RMSE vs truth: {full_rmse:.4f}")

Approximation	Features	Median kernel relative error	Median test RMSE vs truth	Median test RMSE vs full KRR	Median full/sketch time
Nyström	50	0.3137	0.9244	0.5892	15.11x
Nyström	100	0.1577	0.7547	0.3639	5.59x
Nyström	200	0.0694	0.6203	0.1511	1.51x
Nyström	400	0.0200	0.5826	0.0552	0.30x
Random Fourier	50	0.8513	0.9714	0.6732	17.37x
Random Fourier	100	0.6446	0.8057	0.4599	8.58x
Random Fourier	200	0.4394	0.7697	0.3668	4.18x
Random Fourier	400	0.2927	0.6285	0.2293	1.92x

Median full-kernel KRR test RMSE vs truth: 0.5659

2.3 Plots

Show code

fig, axes = plt.subplots(2, 3, figsize=(14, 7), constrained_layout=True)

ols_x = np.array([entry["multiple"] for entry in ols_summary])
axes[0, 0].plot(ols_x, [entry["median_coef_rel_error"] for entry in ols_summary], marker="o")
axes[0, 0].set_xlabel("Sketch size / design dimension")
axes[0, 0].set_ylabel("Median relative coefficient error")
axes[0, 0].set_title("OLS: coefficient accuracy")

axes[0, 1].plot(
    ols_x,
    [entry["median_prediction_rmse_vs_full"] for entry in ols_summary],
    marker="o",
    color="tab:orange",
)
axes[0, 1].set_xlabel("Sketch size / design dimension")
axes[0, 1].set_ylabel("Median RMSE vs full OLS")
axes[0, 1].set_title("OLS: prediction agreement")

axes[0, 2].plot(ols_x, [entry["median_speedup"] for entry in ols_summary], marker="o", color="tab:green")
axes[0, 2].axhline(1.0, color="black", linestyle="--", linewidth=1)
axes[0, 2].set_xlabel("Sketch size / design dimension")
axes[0, 2].set_ylabel("Median full/sketch fit time")
axes[0, 2].set_title("OLS: runtime")

for method, color in [("Nyström", "tab:blue"), ("Random Fourier", "tab:purple")]:
    block = [entry for entry in kernel_summary if entry["method"] == method]
    x_axis = np.array([entry["components"] for entry in block])
    axes[1, 0].plot(
        x_axis,
        [entry["median_kernel_rel_error"] for entry in block],
        marker="o",
        label=method,
        color=color,
    )
    axes[1, 1].plot(
        x_axis,
        [entry["median_rmse_vs_full_krr"] for entry in block],
        marker="o",
        label=method,
        color=color,
    )
    axes[1, 2].plot(
        x_axis,
        [entry["median_speedup"] for entry in block],
        marker="o",
        label=method,
        color=color,
    )

axes[1, 0].set_xlabel("Feature dimension")
axes[1, 0].set_ylabel("Median relative Frobenius error")
axes[1, 0].set_title("Kernel: approximation error")
axes[1, 0].legend()

axes[1, 1].set_xlabel("Feature dimension")
axes[1, 1].set_ylabel("Median RMSE vs full KRR")
axes[1, 1].set_title("Kernel: downstream prediction agreement")
axes[1, 1].legend()

axes[1, 2].axhline(1.0, color="black", linestyle="--", linewidth=1)
axes[1, 2].set_xlabel("Feature dimension")
axes[1, 2].set_ylabel("Median full/sketch fit time")
axes[1, 2].set_title("Kernel: runtime")
axes[1, 2].legend()

plt.show()

For the tall OLS design, larger row sketches move the approximate coefficients and predictions toward the full-data fit while keeping the solve on a smaller synthetic dataset. For the nonlinear design, explicit kernel features trade approximation error against a much smaller ridge problem. The useful diagnostic is not just whether \(\Phi\Phi'\) resembles \(K\), but whether the downstream ridge predictions track the full kernel-ridge benchmark.