A fitted Gaussian mixture over your data is a single object, and
several classical analyses follow from it in closed form – because
conditioning, marginalisation, and eigendecomposition are all exact on a
Gaussian mixture. This vignette shows proxymix standing in
for five familiar tools. Each section keeps the same shape: the
problem, the usual tool, the
proxymix route, the idea
behind the substitution, and what it does and does not
replace. The point is one fitted object serving several jobs, with the
trade-offs in plain view – not that the mixture always wins.
lm() to a conditioned mixtureProblem. Predict an outcome \(y\) from a predictor \(x\). Usual tool.
lm() / glm() – one global line.
proxymix route. Fit a mixture over the
joint \((y, x)\), then read \(E[y \mid x]\) by conditioning. At \(K = 1\) this is least squares; at
\(K > 1\) the gate bends the
line.
n <- 400L
x <- runif(n, -3, 3)
y <- 0.3 * x + 1.2 * pmax(x, 0) + rnorm(n, sd = 0.4) # a bent relationship
dat <- data.frame(y = y, x = x)
joint <- gmm_target_from_samples(cbind(y, x))
fit1 <- fit_proxymix(joint, N = 1L, regime = "moment", ridge_eps = 0) # = OLS
fit3 <- fit_proxymix(joint, N = 3L, regime = "sample", max_iter = 150L)
## At K = 1 the conditional slope equals the lm coefficient, exactly.
slope_mix <- gmm_conditionalise(fit1, given = c(NA, 1))@means[[1L]] -
gmm_conditionalise(fit1, given = c(NA, 0))@means[[1L]]
c(proxymix = slope_mix, lm = unname(coef(lm(y ~ x, dat))["x"]))
#> proxymix lm
#> 0.8968121 0.8968121.cond_mean <- function(fit, xv) {
vapply(xv, function(xx) {
g <- gmm_conditionalise(fit, given = c(NA, xx))
sum(g@weights * vapply(g@means, function(m) m[1L], numeric(1L)))
}, numeric(1L))
}
if (requireNamespace("ggplot2", quietly = TRUE)) {
library(ggplot2)
grid <- data.frame(x = seq(-3, 3, length.out = 200L))
grid$ols <- as.numeric(predict(lm(y ~ x, dat), newdata = grid))
grid$mix <- .cond_mean(fit3, grid$x)
ggplot() +
geom_point(data = dat, aes(x, y), colour = "grey60", alpha = 0.4, size = 0.7) +
geom_line(data = grid, aes(x, ols, colour = "lm (K = 1)"), linewidth = 0.9) +
geom_line(data = grid, aes(x, mix, colour = "mixture (K = 3)"), linewidth = 0.9) +
scale_colour_manual(name = NULL,
values = c("lm (K = 1)" = "#2C7FB8",
"mixture (K = 3)" = "#D95F0E")) +
labs(title = "Regression: one global line vs a conditioned mixture",
x = expression(x), y = expression(y)) +
theme_minimal(base_size = 11) +
theme(legend.position = "top", plot.title = element_text(face = "bold"))
}One global line (K = 1, = OLS) versus a conditioned mixture (K = 3).
Idea. Each component supplies an affine conditional mean \(\mu_k^{y \mid x}\); the responsibilities \(\pi_k(x)\) gate them, so the global mean \(\sum_k \pi_k(x)\, \mu_k^{y \mid x}\) is nonlinear – Gaussian-mixture regression.
Does / does not. Captures nonlinear, heteroscedastic, even multimodal conditional densities, not just the mean. It does not return classical inference (standard errors, \(p\)-values) by default, and it assumes Gaussian components.
Problem. Predict \(y\) from \(x\) with no global functional form – a
purely local smoother. Usual tool. Nadaraya–Watson
kernel regression (stats::ksmooth, package
np): a kernel-weighted local average of the responses.
proxymix route. Place one Gaussian per
data point over the joint \((y, x)\) –
a kernel density estimate – and read \(E[y
\mid x]\) by conditioning. With a Gaussian kernel that
conditional mean is the Nadaraya–Watson estimator, exactly.
Where lm() sat at \(K =
1\) (one global line), Nadaraya–Watson sits at \(K = n\) (one component per datum, fully
local), and from_kde() lands anywhere in between – a single
axis with least squares at one end and kernel smoothing at the
other.
h <- 0.4 # kernel bandwidth
nw <- function(xq) vapply(xq, function(q) {
w <- dnorm(q, x, h); sum(w * y) / sum(w) # Nadaraya--Watson, Gaussian kernel
}, numeric(1L))
## One Gaussian per datum -- a kernel density estimate of the joint (y, x) --
## then condition on x and read the conditional mean.
kde <- gmm(weights = rep(1 / n, n),
means = lapply(seq_len(n), function(i) c(y[i], x[i])),
covariances = rep(list(diag(c(h^2, h^2))), n))
xq <- c(-2, -0.5, 1, 2.2)
c(max_abs_diff = max(abs(nw(xq) - .cond_mean(kde, xq)))) # equal to NW
#> max_abs_diff
#> 1.110223e-16if (requireNamespace("ggplot2", quietly = TRUE)) {
grid_nw <- data.frame(x = seq(-3, 3, length.out = 200L))
grid_nw$ols <- as.numeric(predict(lm(y ~ x, dat), newdata = grid_nw))
grid_nw$nw <- nw(grid_nw$x)
ggplot() +
geom_point(data = dat, aes(x, y), colour = "grey60", alpha = 0.4, size = 0.7) +
geom_line(data = grid_nw, aes(x, ols, colour = "OLS (K = 1)"), linewidth = 0.9) +
geom_line(data = grid_nw, aes(x, nw, colour = "kernel (K = n)"), linewidth = 0.9) +
scale_colour_manual(name = NULL,
values = c("OLS (K = 1)" = "#2C7FB8",
"kernel (K = n)" = "#D95F0E")) +
labs(title = "From global line to local kernel, one conditioning operation",
x = expression(x), y = expression(y)) +
theme_minimal(base_size = 11) +
theme(legend.position = "top", plot.title = element_text(face = "bold"))
}One axis, two endpoints: the global line (K = 1, OLS) and the fully local kernel smoother (K = n, Nadaraya-Watson).
Idea. Conditioning a per-point KDE gives responsibilities \(\pi_i(x) \propto \mathcal{N}(x \mid x_i, h^2)\) and per-component means \(y_i\), so \(E[y \mid x] = \sum_i \pi_i(x)\, y_i\) – the Nadaraya–Watson ratio. Kernel regression is therefore Gaussian-mixture regression with one component per datum, a diagonal bandwidth, and only the conditional mean read off. The bandwidth \(h\) plays the role of the component covariance.
Does / does not. The same operation returns the full
conditional density (predictive variance, quantiles,
multimodality), and gmm_conditional_entropy() scores its
uncertainty – neither available from the bare Nadaraya–Watson mean. It
also works in regime (iii), where you can evaluate the joint density but
have no points to average, so a literal kernel estimator does not exist.
The price is that the one-per-datum estimate costs \(O(n)\) per query; from_kde()
compresses it to a \(K\)-component
mixture for \(O(K)\) queries, and – as
with any kernel method – the bandwidth must be chosen.
kmeans() to the mixture
componentsProblem. Group rows by similarity. Usual
tool. kmeans(), or mclust (which fits
exactly this model). proxymix route.
fit_proxymix(N = K, regime = "sample") – the components
are the clusters and the responsibilities are soft
assignments.
X <- rbind(
mvnfast::rmvn(150L, c(-2, -1), 0.5 * diag(2)),
mvnfast::rmvn(150L, c( 2, 0), matrix(c(0.6, 0.3, 0.3, 0.4), 2L)),
mvnfast::rmvn(150L, c( 0, 2.5), 0.3 * diag(2))
)
colnames(X) <- c("V1", "V2")
ct <- gmm_target_from_samples(X)
fitc <- fit_proxymix(ct, N = 3L, regime = "sample", max_iter = 150L)
## Soft assignment = posterior responsibility of each component.
.responsibilities <- function(fit, x) {
comp <- vapply(seq_len(gmm_n_components(fit)), function(k) {
fit@weights[k] *
mvnfast::dmvn(x, mu = fit@means[[k]], sigma = fit@covariances[[k]])
}, numeric(nrow(x)))
comp / rowSums(comp)
}
resp <- .responsibilities(fitc, X)
head(round(resp, 3L), 3L)
#> [,1] [,2] [,3]
#> [1,] 0.000 0 1.000
#> [2,] 0.001 0 0.999
#> [3,] 0.001 0 0.999Idea. Soft, model-based clustering: a point’s assignment to cluster \(k\) is its posterior responsibility \(\pi_k\,\mathcal{N}(x \mid \mu_k, \Sigma_k) / \sum_j \pi_j\,\mathcal{N}(x \mid \mu_j, \Sigma_j)\).
Does / does not. Gives soft, full-covariance,
density-aware clusters (ellipsoidal, not spherical like \(k\)-means), and it doubles as a density
estimate. It does not choose \(K\) for
you (use bic_aic()), and kmeans() is faster at
very large \(n\).
prcomp() to the fitted
covarianceProblem. Find the directions of greatest variation.
Usual tool. prcomp().
proxymix route. Fit a single Gaussian
(N = 1, regime = "moment") and eigendecompose
its covariance.
fit_pca <- fit_proxymix(ct, N = 1L, regime = "moment", ridge_eps = 0)
ev <- eigen(fit_pca@covariances[[1L]])$vectors
pr <- prcomp(X)$rotation
## Identical to the prcomp rotation, up to the sign convention of each axis.
max(abs(abs(ev) - abs(unname(pr))))
#> [1] 4.440892e-16if (requireNamespace("ggplot2", quietly = TRUE)) {
vals <- eigen(fit_pca@covariances[[1L]])$values
mu <- fit_pca@means[[1L]]
axes <- data.frame(
x = mu[1L], y = mu[2L],
xend = mu[1L] + 2 * sqrt(vals) * ev[1L, ],
yend = mu[2L] + 2 * sqrt(vals) * ev[2L, ]
)
pts <- data.frame(X, cluster = factor(max.col(resp)))
ggplot() +
geom_point(data = pts, aes(V1, V2, colour = cluster), alpha = 0.6, size = 0.9) +
geom_segment(data = axes, aes(x = x, y = y, xend = xend, yend = yend),
arrow = grid::arrow(length = grid::unit(0.2, "cm")),
linewidth = 0.8) +
scale_colour_viridis_d(name = "cluster", end = 0.85) +
coord_equal() +
labs(title = "Clustering and principal axes from one mixture",
x = expression(x[1]), y = expression(x[2])) +
theme_minimal(base_size = 11) +
theme(plot.title = element_text(face = "bold"))
}
#> Warning in data.frame(x = mu[1L], y = mu[2L], xend = mu[1L] + 2 * sqrt(vals) *
#> : row names were found from a short variable and have been discardedClustering (colour) and the principal axes (arrows) read off one mixture.
Idea. PCA is the eigendecomposition of the covariance. The moment-matched single Gaussian carries the sample covariance exactly, so its eigenvectors are the principal components; with \(K > 1\) you get a local PCA inside each cluster.
Does / does not. Recovers PCA exactly at \(K = 1\) and a local PCA at \(K > 1\). For a one-off linear projection
prcomp() is more direct – loadings, scree plots, and
biplots come for free there.
Problem. Stabilise a regression when predictors are
collinear. Usual tool. glmnet (ridge /
lasso), MASS::lm.ridge. proxymix
route. The ridge_eps argument adds \(\lambda I\) to the component covariance,
which shrinks the conditional regression – exactly an \(L_2\) (ridge) penalty.
lambda <- c(0, 0.5, 2, 8)
slope_pm <- vapply(lambda, function(lam) {
f <- fit_proxymix(joint, N = 1L, regime = "moment", ridge_eps = lam)
gmm_conditionalise(f, given = c(NA, 1))@means[[1L]] -
gmm_conditionalise(f, given = c(NA, 0))@means[[1L]]
}, numeric(1L))
data.frame(
lambda = lambda,
proxymix_slope = round(slope_pm, 4L),
ridge_formula = round(cov(x, y) / (var(x) + lambda), 4L)
)
#> lambda proxymix_slope ridge_formula
#> 1 0.0 0.8968 0.8968
#> 2 0.5 0.7680 0.7680
#> 3 2.0 0.5367 0.5367
#> 4 8.0 0.2434 0.2434Idea. Conditioning gives \(\beta = \Sigma_{xx}^{-1}\Sigma_{xy}\). Ridging \(\Sigma_{xx} \to \Sigma_{xx} + \lambda I\) yields \((\Sigma_{xx} + \lambda I)^{-1}\Sigma_{xy}\) – the ridge estimator – so the slope shrinks toward zero as \(\lambda\) grows.
Does / does not. Delivers ridge-style shrinkage and
numerical stability. There is no Gaussian-mixture analogue of the lasso:
\(L_2\) shrinkage matches a Gaussian
prior, \(L_1\) sparsity a Laplace prior
that lies outside the Gaussian family – so proxymix does
not perform variable selection.
gmm_conditionalise() on
the observed coordinates gives the conditional-mean (or, via
rgmm(), a sampled) fill-in.from_kde() compresses a kernel estimate into a closed-form
mixture (see vignette("from_kde")).gmm_observe() is
the mixture Kalman update (see
vignette("operator_calculus")).| Method | Usual tool | proxymix route |
What you gain | What you give up |
|---|---|---|---|---|
| Regression | lm / glm |
mixture over \((y, x)\) +
gmm_conditionalise() |
nonlinear, heteroscedastic, multimodal conditional densities | built-in inference; Gaussian components |
| Kernel regression | ksmooth / np (Nadaraya–Watson) |
per-point KDE + gmm_conditionalise(); compress with
from_kde() |
full conditional density, uncertainty, the evaluate-only regime | \(O(n)\) per query unless compressed; bandwidth choice |
| Clustering | kmeans / mclust |
fit_proxymix(regime = "sample") |
soft, ellipsoidal, density-aware clusters | automatic \(K\); speed at huge \(n\) |
| PCA | prcomp |
eigen() of the N = 1 covariance |
local PCA per cluster at \(K > 1\) | loadings / scree / biplot conveniences |
| Ridge | glmnet / lm.ridge |
ridge_eps on the covariance |
\(L_2\) shrinkage from one object | \(L_1\) sparsity / variable selection |
Use the purpose-built tool when its strengths are exactly what you
need: high-dimensional sparse problems suit glmnet; very
large-\(n\) clustering suits
kmeans; a single linear projection suits
prcomp; classical inferential guarantees suit
lm / glm. proxymix earns its
place when you want one fitted object to serve several
of these at once, in closed form, and when the Gaussian-mixture
assumptions fit your data.
lm() to a conditioned mixturekmeans() to the mixture componentsprcomp() to the fitted
covariance