---
title: "One mixture, many methods"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{One mixture, many methods}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 6,
  fig.height = 4,
  out.width = "100%"
)
set.seed(20260617)
```

```{r setup}
library(proxymix)
```

A fitted Gaussian mixture over your data is a single object, and several
classical analyses follow from it in closed form -- because conditioning,
marginalisation, and eigendecomposition are all exact on a Gaussian mixture.
This vignette shows `proxymix` standing in for five familiar tools. Each
section keeps the same shape: the **problem**, the **usual tool**, the
**`proxymix` route**, the **idea** behind the substitution, and **what it does
and does not** replace. The point is one fitted object serving several jobs,
with the trade-offs in plain view -- not that the mixture always wins.

## Regression — from `lm()` to a conditioned mixture

**Problem.** Predict an outcome \(y\) from a predictor \(x\). **Usual tool.**
`lm()` / `glm()` -- one global line. **`proxymix` route.** Fit a mixture over
the joint \((y, x)\), then read \(E[y \mid x]\) by conditioning. At \(K = 1\)
this *is* least squares; at \(K > 1\) the gate bends the line.

```{r reg-fit}
n <- 400L
x <- runif(n, -3, 3)
y <- 0.3 * x + 1.2 * pmax(x, 0) + rnorm(n, sd = 0.4)   # a bent relationship
dat <- data.frame(y = y, x = x)

joint <- gmm_target_from_samples(cbind(y, x))
fit1  <- fit_proxymix(joint, N = 1L, regime = "moment", ridge_eps = 0)   # = OLS
fit3  <- fit_proxymix(joint, N = 3L, regime = "sample", max_iter = 150L)

## At K = 1 the conditional slope equals the lm coefficient, exactly.
slope_mix <- gmm_conditionalise(fit1, given = c(NA, 1))@means[[1L]] -
  gmm_conditionalise(fit1, given = c(NA, 0))@means[[1L]]
c(proxymix = slope_mix, lm = unname(coef(lm(y ~ x, dat))["x"]))
```

```{r reg-plot, fig.cap = "One global line (K = 1, = OLS) versus a conditioned mixture (K = 3).", fig.alt = "Scatter of y against x with the OLS line and the curved mixture conditional mean."}
.cond_mean <- function(fit, xv) {
  vapply(xv, function(xx) {
    g <- gmm_conditionalise(fit, given = c(NA, xx))
    sum(g@weights * vapply(g@means, function(m) m[1L], numeric(1L)))
  }, numeric(1L))
}
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  grid <- data.frame(x = seq(-3, 3, length.out = 200L))
  grid$ols <- as.numeric(predict(lm(y ~ x, dat), newdata = grid))
  grid$mix <- .cond_mean(fit3, grid$x)
  ggplot() +
    geom_point(data = dat, aes(x, y), colour = "grey60", alpha = 0.4, size = 0.7) +
    geom_line(data = grid, aes(x, ols, colour = "lm (K = 1)"), linewidth = 0.9) +
    geom_line(data = grid, aes(x, mix, colour = "mixture (K = 3)"), linewidth = 0.9) +
    scale_colour_manual(name = NULL,
                        values = c("lm (K = 1)" = "#2C7FB8",
                                   "mixture (K = 3)" = "#D95F0E")) +
    labs(title = "Regression: one global line vs a conditioned mixture",
         x = expression(x), y = expression(y)) +
    theme_minimal(base_size = 11) +
    theme(legend.position = "top", plot.title = element_text(face = "bold"))
}
```

**Idea.** Each component supplies an affine conditional mean
\(\mu_k^{y \mid x}\); the responsibilities \(\pi_k(x)\) gate them, so the global
mean \(\sum_k \pi_k(x)\, \mu_k^{y \mid x}\) is nonlinear -- Gaussian-mixture
regression.

**Does / does not.** Captures nonlinear, heteroscedastic, even multimodal
conditional *densities*, not just the mean. It does not return classical
inference (standard errors, \(p\)-values) by default, and it assumes Gaussian
components.

## Kernel regression — from Nadaraya--Watson to a conditioned KDE

**Problem.** Predict \(y\) from \(x\) with no global functional form -- a purely
local smoother. **Usual tool.** Nadaraya--Watson kernel regression
(`stats::ksmooth`, package `np`): a kernel-weighted local average of the
responses. **`proxymix` route.** Place one Gaussian per data point over the
joint \((y, x)\) -- a kernel density estimate -- and read \(E[y \mid x]\) by
conditioning. With a Gaussian kernel that conditional mean *is* the
Nadaraya--Watson estimator, exactly. Where `lm()` sat at \(K = 1\) (one global
line), Nadaraya--Watson sits at \(K = n\) (one component per datum, fully
local), and `from_kde()` lands anywhere in between -- a single axis with
least squares at one end and kernel smoothing at the other.

```{r nw-fit}
h <- 0.4                                       # kernel bandwidth
nw <- function(xq) vapply(xq, function(q) {
  w <- dnorm(q, x, h); sum(w * y) / sum(w)     # Nadaraya--Watson, Gaussian kernel
}, numeric(1L))

## One Gaussian per datum -- a kernel density estimate of the joint (y, x) --
## then condition on x and read the conditional mean.
kde <- gmm(weights = rep(1 / n, n),
           means = lapply(seq_len(n), function(i) c(y[i], x[i])),
           covariances = rep(list(diag(c(h^2, h^2))), n))
xq <- c(-2, -0.5, 1, 2.2)
c(max_abs_diff = max(abs(nw(xq) - .cond_mean(kde, xq))))   # equal to NW
```

```{r nw-plot, fig.cap = "One axis, two endpoints: the global line (K = 1, OLS) and the fully local kernel smoother (K = n, Nadaraya-Watson).", fig.alt = "Scatter of y against x with the straight OLS line and the curved Nadaraya-Watson kernel-regression line."}
if (requireNamespace("ggplot2", quietly = TRUE)) {
  grid_nw <- data.frame(x = seq(-3, 3, length.out = 200L))
  grid_nw$ols <- as.numeric(predict(lm(y ~ x, dat), newdata = grid_nw))
  grid_nw$nw  <- nw(grid_nw$x)
  ggplot() +
    geom_point(data = dat, aes(x, y), colour = "grey60", alpha = 0.4, size = 0.7) +
    geom_line(data = grid_nw, aes(x, ols, colour = "OLS (K = 1)"), linewidth = 0.9) +
    geom_line(data = grid_nw, aes(x, nw, colour = "kernel (K = n)"), linewidth = 0.9) +
    scale_colour_manual(name = NULL,
                        values = c("OLS (K = 1)" = "#2C7FB8",
                                   "kernel (K = n)" = "#D95F0E")) +
    labs(title = "From global line to local kernel, one conditioning operation",
         x = expression(x), y = expression(y)) +
    theme_minimal(base_size = 11) +
    theme(legend.position = "top", plot.title = element_text(face = "bold"))
}
```

**Idea.** Conditioning a per-point KDE gives responsibilities
\(\pi_i(x) \propto \mathcal{N}(x \mid x_i, h^2)\) and per-component means \(y_i\),
so \(E[y \mid x] = \sum_i \pi_i(x)\, y_i\) -- the Nadaraya--Watson ratio. Kernel
regression is therefore Gaussian-mixture regression with one component per datum,
a diagonal bandwidth, and only the conditional mean read off. The bandwidth
\(h\) plays the role of the component covariance.

**Does / does not.** The same operation returns the full conditional *density*
(predictive variance, quantiles, multimodality), and `gmm_conditional_entropy()`
scores its uncertainty -- neither available from the bare Nadaraya--Watson mean.
It also works in regime (iii), where you can evaluate the joint density but have
no points to average, so a literal kernel estimator does not exist. The price is
that the one-per-datum estimate costs \(O(n)\) per query; `from_kde()` compresses
it to a \(K\)-component mixture for \(O(K)\) queries, and -- as with any kernel
method -- the bandwidth must be chosen.

## Clustering — from `kmeans()` to the mixture components

**Problem.** Group rows by similarity. **Usual tool.** `kmeans()`, or `mclust`
(which fits exactly this model). **`proxymix` route.**
`fit_proxymix(N = K, regime = "sample")` -- the components *are* the clusters
and the responsibilities are soft assignments.

```{r clust-fit}
X <- rbind(
  mvnfast::rmvn(150L, c(-2, -1), 0.5 * diag(2)),
  mvnfast::rmvn(150L, c( 2,  0), matrix(c(0.6, 0.3, 0.3, 0.4), 2L)),
  mvnfast::rmvn(150L, c( 0,  2.5), 0.3 * diag(2))
)
colnames(X) <- c("V1", "V2")
ct   <- gmm_target_from_samples(X)
fitc <- fit_proxymix(ct, N = 3L, regime = "sample", max_iter = 150L)

## Soft assignment = posterior responsibility of each component.
.responsibilities <- function(fit, x) {
  comp <- vapply(seq_len(gmm_n_components(fit)), function(k) {
    fit@weights[k] *
      mvnfast::dmvn(x, mu = fit@means[[k]], sigma = fit@covariances[[k]])
  }, numeric(nrow(x)))
  comp / rowSums(comp)
}
resp <- .responsibilities(fitc, X)
head(round(resp, 3L), 3L)
```

**Idea.** Soft, model-based clustering: a point's assignment to cluster \(k\) is
its posterior responsibility
\(\pi_k\,\mathcal{N}(x \mid \mu_k, \Sigma_k) / \sum_j \pi_j\,\mathcal{N}(x \mid \mu_j, \Sigma_j)\).

**Does / does not.** Gives soft, full-covariance, density-aware clusters
(ellipsoidal, not spherical like \(k\)-means), and it doubles as a density
estimate. It does not choose \(K\) for you (use `bic_aic()`), and `kmeans()` is
faster at very large \(n\).

## Principal components — from `prcomp()` to the fitted covariance

**Problem.** Find the directions of greatest variation. **Usual tool.**
`prcomp()`. **`proxymix` route.** Fit a single Gaussian (`N = 1`,
`regime = "moment"`) and eigendecompose its covariance.

```{r pca}
fit_pca <- fit_proxymix(ct, N = 1L, regime = "moment", ridge_eps = 0)
ev <- eigen(fit_pca@covariances[[1L]])$vectors
pr <- prcomp(X)$rotation
## Identical to the prcomp rotation, up to the sign convention of each axis.
max(abs(abs(ev) - abs(unname(pr))))
```

```{r clust-pca-plot, fig.cap = "Clustering (colour) and the principal axes (arrows) read off one mixture.", fig.alt = "Three coloured point clusters with two principal-axis arrows from the centroid."}
if (requireNamespace("ggplot2", quietly = TRUE)) {
  vals <- eigen(fit_pca@covariances[[1L]])$values
  mu   <- fit_pca@means[[1L]]
  axes <- data.frame(
    x = mu[1L], y = mu[2L],
    xend = mu[1L] + 2 * sqrt(vals) * ev[1L, ],
    yend = mu[2L] + 2 * sqrt(vals) * ev[2L, ]
  )
  pts <- data.frame(X, cluster = factor(max.col(resp)))
  ggplot() +
    geom_point(data = pts, aes(V1, V2, colour = cluster), alpha = 0.6, size = 0.9) +
    geom_segment(data = axes, aes(x = x, y = y, xend = xend, yend = yend),
                 arrow = grid::arrow(length = grid::unit(0.2, "cm")),
                 linewidth = 0.8) +
    scale_colour_viridis_d(name = "cluster", end = 0.85) +
    coord_equal() +
    labs(title = "Clustering and principal axes from one mixture",
         x = expression(x[1]), y = expression(x[2])) +
    theme_minimal(base_size = 11) +
    theme(plot.title = element_text(face = "bold"))
}
```

**Idea.** PCA is the eigendecomposition of the covariance. The moment-matched
single Gaussian carries the sample covariance exactly, so its eigenvectors are
the principal components; with \(K > 1\) you get a local PCA inside each
cluster.

**Does / does not.** Recovers PCA exactly at \(K = 1\) and a local PCA at
\(K > 1\). For a one-off linear projection `prcomp()` is more direct -- loadings,
scree plots, and biplots come for free there.

## Penalised regression — ridge as a covariance ridge

**Problem.** Stabilise a regression when predictors are collinear. **Usual
tool.** `glmnet` (ridge / lasso), `MASS::lm.ridge`. **`proxymix` route.** The
`ridge_eps` argument adds \(\lambda I\) to the component covariance, which
shrinks the conditional regression -- exactly an \(L_2\) (ridge) penalty.

```{r ridge}
lambda <- c(0, 0.5, 2, 8)
slope_pm <- vapply(lambda, function(lam) {
  f <- fit_proxymix(joint, N = 1L, regime = "moment", ridge_eps = lam)
  gmm_conditionalise(f, given = c(NA, 1))@means[[1L]] -
    gmm_conditionalise(f, given = c(NA, 0))@means[[1L]]
}, numeric(1L))
data.frame(
  lambda          = lambda,
  proxymix_slope  = round(slope_pm, 4L),
  ridge_formula   = round(cov(x, y) / (var(x) + lambda), 4L)
)
```

**Idea.** Conditioning gives \(\beta = \Sigma_{xx}^{-1}\Sigma_{xy}\). Ridging
\(\Sigma_{xx} \to \Sigma_{xx} + \lambda I\) yields
\((\Sigma_{xx} + \lambda I)^{-1}\Sigma_{xy}\) -- the ridge estimator -- so the
slope shrinks toward zero as \(\lambda\) grows.

**Does / does not.** Delivers ridge-style shrinkage and numerical stability.
There is no Gaussian-mixture analogue of the lasso: \(L_2\) shrinkage matches a
Gaussian prior, \(L_1\) sparsity a Laplace prior that lies outside the Gaussian
family -- so `proxymix` does not perform variable selection.

## A few more, in one line each

- **Imputation** -- `gmm_conditionalise()` on the observed coordinates gives the
  conditional-mean (or, via `rgmm()`, a sampled) fill-in.
- **Density estimation** -- the native use; `from_kde()` compresses a kernel
  estimate into a closed-form mixture (see `vignette("from_kde")`).
- **Kalman filtering** -- `gmm_observe()` is the mixture Kalman update (see
  `vignette("operator_calculus")`).

## Summary

| Method | Usual tool | `proxymix` route | What you gain | What you give up |
|---|---|---|---|---|
| Regression | `lm` / `glm` | mixture over \((y, x)\) + `gmm_conditionalise()` | nonlinear, heteroscedastic, multimodal conditional densities | built-in inference; Gaussian components |
| Kernel regression | `ksmooth` / `np` (Nadaraya--Watson) | per-point KDE + `gmm_conditionalise()`; compress with `from_kde()` | full conditional density, uncertainty, the evaluate-only regime | \(O(n)\) per query unless compressed; bandwidth choice |
| Clustering | `kmeans` / `mclust` | `fit_proxymix(regime = "sample")` | soft, ellipsoidal, density-aware clusters | automatic \(K\); speed at huge \(n\) |
| PCA | `prcomp` | `eigen()` of the `N = 1` covariance | local PCA per cluster at \(K > 1\) | loadings / scree / biplot conveniences |
| Ridge | `glmnet` / `lm.ridge` | `ridge_eps` on the covariance | \(L_2\) shrinkage from one object | \(L_1\) sparsity / variable selection |

## When to reach for the specialised tool

Use the purpose-built tool when its strengths are exactly what you need:
high-dimensional sparse problems suit `glmnet`; very large-\(n\) clustering suits
`kmeans`; a single linear projection suits `prcomp`; classical inferential
guarantees suit `lm` / `glm`. `proxymix` earns its place when you want **one
fitted object** to serve several of these at once, in closed form, and when the
Gaussian-mixture assumptions fit your data.

## References

* de Veaux, R. D. (1989). *Mixtures of linear regressions.* Computational
  Statistics & Data Analysis 8(3), 227--245.
* Fraley, C. and Raftery, A. E. (2002). *Model-based clustering, discriminant
  analysis, and density estimation.* Journal of the American Statistical
  Association 97(458), 611--631.
* Jolliffe, I. T. (2002). *Principal Component Analysis*, 2nd ed. Springer.
* Nadaraya, E. A. (1964). *On estimating regression.* Theory of Probability and
  Its Applications 9(1), 141--142.
* Watson, G. S. (1964). *Smooth regression analysis.* Sankhya A 26(4),
  359--372.
* Tipping, M. E. and Bishop, C. M. (1999). *Probabilistic principal component
  analysis.* Journal of the Royal Statistical Society B 61(3), 611--622.
* Hoerl, A. E. and Kennard, R. W. (1970). *Ridge regression: biased estimation
  for nonorthogonal problems.* Technometrics 12(1), 55--67.
* Hoek, J. van der and Elliott, R. J. (2024). *Mixtures of multivariate
  Gaussians.* Stochastic Analysis and Applications.
  <doi:10.1080/07362994.2024.2372605>.
```