---
title: "Scaling HSIC with Nystrom and Random Fourier Features"
author: "Max Moldovan"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Scaling HSIC with Nystrom and Random Fourier Features}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 6,
  fig.height = 4,
  dpi = 150
)
set.seed(1)
```

## The scaling wall

Exact kernel methods carry an `O(n^2)` storage cost (the Gram matrix)
and `O(n^2)` or `O(n^3)` compute per operation. For ag-systems
ensembles in the low thousands of members, that wall hits around
`n = 5000`: a 5000 x 5000 double-precision Gram matrix is 200 MB, and
permutation HSIC iterates over it `B` times.

Two well-known low-rank approximations of a kernel matrix `K`
(`n x n`) drop both costs to `O(n m)` for some rank `m << n`:

- **Nystrom** (Williams & Seeger, 2001) -- sample `m` landmark
  observations, compute their Gram `W` (`m x m`) and cross-Gram `C`
  (`n x m`), then `K \approx C W^{-1} C^\top = F F^\top` with
  `F = C \cdot \mathrm{chol}(W)^{-1}`. Works for any kernel.
- **Random Fourier Features** (Rahimi & Recht, 2007) -- for
  shift-invariant kernels (currently RBF in kernR), draw
  `\omega_k \sim N(0, \sigma^{-2} I)` and
  `b_k \sim U[0, 2\pi]`; the feature map
  `\phi(x) = \sqrt{2/D} \cos(\omega^\top x + b)` gives
  `K \approx \Phi \Phi^\top` for `D` features. Data-independent (the
  same random projection applies to any point cloud with the same
  bandwidth).

`hsic_test_nystrom()` wires both into a drop-in accelerated HSIC test.

## A tiny correctness check

Both approximations recover the exact kernel as the rank approaches
`n` (Nystrom) or as `D` grows (RFF).

```{r correctness}
library(kernR)

n <- 80L
x <- matrix(stats::rnorm(n * 2L), n, 2L)
k <- kernel_spec("rbf", bandwidth = 1.0)
K_full <- kernel_matrix(x, kernel = k)

# Nystrom at near-full rank
ny <- nystrom_factor(x, kernel = k, m = n - 1L, seed = 1L)
K_ny <- tcrossprod(ny$F)

# RFF at a moderate D
rf <- rff_features(x, kernel = k, D = 1500L, seed = 1L)
K_rf <- tcrossprod(rf$F)

rel_err <- function(K_hat) {
  sqrt(sum((K_full - K_hat)^2)) / sqrt(sum(K_full^2))
}
data.frame(
  method = c("nystrom", "rff"),
  rank   = c(ncol(ny$F), ncol(rf$F)),
  rel_err = c(rel_err(K_ny), rel_err(K_rf))
)
```

Both should produce small relative Frobenius error on this size.

## Scaling demonstration

```{r scaling}
benchmark_hsic <- function(n, m, B = 49L, seed = 1L) {
  set.seed(seed)
  xx <- stats::rnorm(n)
  y  <- xx + stats::rnorm(n, sd = 0.5)

  t_exact <- system.time(
    res_exact <- hsic_test(xx, y, n_permutations = B, seed = seed)
  )["elapsed"]
  t_nystrom <- system.time(
    res_nystrom <- hsic_test_nystrom(xx, y, m = m,
                                     n_permutations = B, seed = seed)
  )["elapsed"]
  t_rff <- system.time(
    res_rff <- hsic_test_nystrom(xx, y, method = "rff", m = m,
                                 n_permutations = B, seed = seed)
  )["elapsed"]
  data.frame(
    n         = n,
    m         = m,
    method    = c("hsic_test (exact)", "nystrom", "rff"),
    elapsed_s = round(c(t_exact, t_nystrom, t_rff), 3),
    p_value   = round(c(res_exact$p_value, res_nystrom$p_value,
                        res_rff$p_value), 3)
  )
}

benchmark_hsic(n = 500L,  m = 60L)
```

For larger `n`, the gap widens:

```{r scaling-larger}
benchmark_hsic(n = 1500L, m = 100L)
```

The verdict (reject vs accept) agrees across exact and approximate
tests at moderate `m`; the approximation is for speed, not for
detecting a different signal.

## When to use which

| Scenario | Pick |
|---|---|
| Small `n` (< 500) | `hsic_test()` -- exact, fast enough. |
| Medium `n` (500-5000), any kernel | `hsic_test_nystrom(method = "nystrom")`. Choose `m` in `[50, 200]`. |
| Large `n` (> 5000), RBF kernel | `hsic_test_nystrom(method = "rff")` with `D` in `[200, 1000]`. Data-independent projection; cheapest at large n. |
| Repeated tests on different `y` with same `x` | Cache `nystrom_factor(x)` or `rff_features(x)` once; pass the factor in (future API). |

## Notes on practice

- **Bias vs unbiased estimator.** The exact `hsic_test()` uses the
  Gretton-2008 *unbiased* HSIC. The Nystrom/RFF version uses the
  *biased* estimator `(1/n^2) tr(H K_x H K_y)`, because that form
  factors cleanly through low-rank approximations. The bias is
  `O(1/n)` -- negligible in the large-`n` regime where these
  approximations are useful.
- **Choosing `m` for Nystrom.** Rule of thumb: `m = ceiling(2 sqrt(n))`
  is a defensible starting point. Diagnose via the Frobenius
  reconstruction error on a small subsample if precision matters.
- **Choosing `D` for RFF.** RFF variance scales as `O(1/D)`; the bias
  is zero in expectation. Larger `D` is always better in accuracy but
  costs memory; `D = 200-500` is usually adequate for verdict-style
  tests.
- **Reproducibility.** Both factorisations are randomised; pass `seed`
  for deterministic output.

## References

- Williams, C. K. I., & Seeger, M. (2001). Using the Nystrom method
  to speed up kernel machines. *NeurIPS*, 13.
- Rahimi, A., & Recht, B. (2007). Random features for large-scale
  kernel machines. *NeurIPS*, 20.
- Gretton, A., Bousquet, O., Smola, A., & Schölkopf, B. (2005).
  Measuring statistical dependence with Hilbert-Schmidt norms.
  *ALT*, 63-77.