Exact kernel methods carry an O(n^2) storage cost (the
Gram matrix) and O(n^2) or O(n^3) compute per
operation. For ag-systems ensembles in the low thousands of members,
that wall hits around n = 5000: a 5000 x 5000
double-precision Gram matrix is 200 MB, and permutation HSIC iterates
over it B times.
Two well-known low-rank approximations of a kernel matrix
K (n x n) drop both costs to
O(n m) for some rank m << n:
m landmark observations, compute their Gram W
(m x m) and cross-Gram C (n x m),
then K \approx C W^{-1} C^\top = F F^\top with
F = C \cdot \mathrm{chol}(W)^{-1}. Works for any
kernel.\omega_k \sim N(0, \sigma^{-2} I) and
b_k \sim U[0, 2\pi]; the feature map
\phi(x) = \sqrt{2/D} \cos(\omega^\top x + b) gives
K \approx \Phi \Phi^\top for D features.
Data-independent (the same random projection applies to any point cloud
with the same bandwidth).hsic_test_nystrom() wires both into a drop-in
accelerated HSIC test.
Both approximations recover the exact kernel as the rank approaches
n (Nystrom) or as D grows (RFF).
library(kernR)
n <- 80L
x <- matrix(stats::rnorm(n * 2L), n, 2L)
k <- kernel_spec("rbf", bandwidth = 1.0)
K_full <- kernel_matrix(x, kernel = k)
# Nystrom at near-full rank
ny <- nystrom_factor(x, kernel = k, m = n - 1L, seed = 1L)
K_ny <- tcrossprod(ny$F)
# RFF at a moderate D
rf <- rff_features(x, kernel = k, D = 1500L, seed = 1L)
K_rf <- tcrossprod(rf$F)
rel_err <- function(K_hat) {
sqrt(sum((K_full - K_hat)^2)) / sqrt(sum(K_full^2))
}
data.frame(
method = c("nystrom", "rff"),
rank = c(ncol(ny$F), ncol(rf$F)),
rel_err = c(rel_err(K_ny), rel_err(K_rf))
)
#> method rank rel_err
#> 1 nystrom 79 1.849946e-07
#> 2 rff 1500 4.728235e-02Both should produce small relative Frobenius error on this size.
benchmark_hsic <- function(n, m, B = 49L, seed = 1L) {
set.seed(seed)
xx <- stats::rnorm(n)
y <- xx + stats::rnorm(n, sd = 0.5)
t_exact <- system.time(
res_exact <- hsic_test(xx, y, n_permutations = B, seed = seed)
)["elapsed"]
t_nystrom <- system.time(
res_nystrom <- hsic_test_nystrom(xx, y, m = m,
n_permutations = B, seed = seed)
)["elapsed"]
t_rff <- system.time(
res_rff <- hsic_test_nystrom(xx, y, method = "rff", m = m,
n_permutations = B, seed = seed)
)["elapsed"]
data.frame(
n = n,
m = m,
method = c("hsic_test (exact)", "nystrom", "rff"),
elapsed_s = round(c(t_exact, t_nystrom, t_rff), 3),
p_value = round(c(res_exact$p_value, res_nystrom$p_value,
res_rff$p_value), 3)
)
}
benchmark_hsic(n = 500L, m = 60L)
#> n m method elapsed_s p_value
#> 1 500 60 hsic_test (exact) 0.095 0.02
#> 2 500 60 nystrom 0.019 0.02
#> 3 500 60 rff 0.017 0.02For larger n, the gap widens:
benchmark_hsic(n = 1500L, m = 100L)
#> n m method elapsed_s p_value
#> 1 1500 100 hsic_test (exact) 1.310 0.02
#> 2 1500 100 nystrom 0.135 0.02
#> 3 1500 100 rff 0.132 0.02The verdict (reject vs accept) agrees across exact and approximate
tests at moderate m; the approximation is for speed, not
for detecting a different signal.
| Scenario | Pick |
|---|---|
Small n (< 500) |
hsic_test() – exact, fast enough. |
Medium n (500-5000), any kernel |
hsic_test_nystrom(method = "nystrom"). Choose
m in [50, 200]. |
Large n (> 5000), RBF kernel |
hsic_test_nystrom(method = "rff") with D
in [200, 1000]. Data-independent projection; cheapest at
large n. |
Repeated tests on different y with same
x |
Cache nystrom_factor(x) or rff_features(x)
once; pass the factor in (future API). |
hsic_test() uses the Gretton-2008 unbiased HSIC.
The Nystrom/RFF version uses the biased estimator
(1/n^2) tr(H K_x H K_y), because that form factors cleanly
through low-rank approximations. The bias is O(1/n) –
negligible in the large-n regime where these approximations
are useful.m for Nystrom. Rule of thumb:
m = ceiling(2 sqrt(n)) is a defensible starting point.
Diagnose via the Frobenius reconstruction error on a small subsample if
precision matters.D for RFF. RFF variance
scales as O(1/D); the bias is zero in expectation. Larger
D is always better in accuracy but costs memory;
D = 200-500 is usually adequate for verdict-style
tests.seed for deterministic output.