---
title: "From KDE: compressing a kernel density estimate into a Gaussian-mixture proxy"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{From KDE: compressing a kernel density estimate into a Gaussian-mixture proxy}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 6,
  fig.height = 4,
  out.width = "100%"
)
set.seed(20260601)
```

```{r setup}
library(proxymix)
library(ggplot2)
```

# Why a `from_kde()` constructor

A kernel density estimate (KDE) gives a smooth, non-parametric density from a
sample of `n` points, but with three operational costs:

* **No closed-form marginals.** Integrating out a coordinate of a KDE requires
  a numerical sweep over the kernel mixture.
* **No closed-form conditioning.** There is no analogue of the Schur
  complement; one has to resample or build a conditional KDE per query.
* **No native aggregation algebra.** Pushforwards through linear operators,
  observation-channel updates, or block-sums all collapse to Monte Carlo.

`proxymix::from_kde()` *compresses* the KDE into an `N`-component Gaussian
mixture proxy (with `N` typically much smaller than `n`) by treating the KDE as
a normalised, evaluable target and running regime (iii) KLD-EM against it. The
proxy then supports the full closed-form operator set exposed by the rest of
the package: `dgmm()`, `rgmm()`, `gmm_marginalise()`, `gmm_conditionalise()`,
plus the v0.3 affine-Gaussian calculus (`gmm_affine()`, `gmm_observe()`).

This is **not new science**. It is a practical bridge between a non-parametric
density estimate and a parametric closed-form mixture representation. The bias
inherited from the KDE is reproduced in the proxy; the bandwidth controls the
bias-variance trade-off.

# Recovering a known mixture

Sample from a known bimodal mixture and ask `from_kde()` to recover it.

```{r recover-bimodal}
set.seed(1L)
true_means <- list(c(-2, 0), c(2, 0))
true_cov <- diag(2)
x <- rbind(
  mvnfast::rmvn(150L, mu = true_means[[1L]], sigma = true_cov),
  mvnfast::rmvn(150L, mu = true_means[[2L]], sigma = true_cov)
)

fit <- from_kde(
  x,
  N = 2L,
  bandwidth = "silverman",
  is_size = 2000L,
  max_iter = 60L,
  seed = 1L,
  validation_size = 2000L
)
fit
```

The proxy components sit near the true means.

```{r}
vapply(fit@means, function(mu) round(mu, 2L), numeric(2L))
```

# Quality of the importance-sampled fit

Inspect ESS and the largest self-normalised weight: a healthy regime-(iii) fit
has ESS close to `is_size` and a max-weight share of a few percent.

```{r}
ess_summary(fit)
```

The validation block reports a held-out KLD on an independent IS draw; this is
the audit-mandated safeguard against in-sample overfitting to one particular IS
realisation.

# Bandwidth sensitivity

Smaller bandwidths track the data more tightly (lower bias, higher variance);
larger bandwidths smooth the proxy. The proxy inherits this trade-off.

```{r bandwidth-sweep}
bandwidth_grid <- c(0.2, 0.5, 1.0)
fits <- lapply(bandwidth_grid, function(h) {
  from_kde(x, N = 2L, bandwidth = h, is_size = 1500L, max_iter = 40L, seed = 1L)
})
data.frame(
  bandwidth = bandwidth_grid,
  ess = vapply(fits, function(f) f@diagnostics$ess, numeric(1L)),
  max_weight = vapply(fits, function(f) f@diagnostics$max_weight, numeric(1L)),
  trace_Sigma_k1 = vapply(
    fits, function(f) sum(diag(f@covariances[[1L]])), numeric(1L)
  )
)
```

# Visualising the compression

Compare the KDE log-density and the proxy log-density on a planar grid.

```{r visualise, fig.height = 4.5}
grid <- expand.grid(
  x = seq(-5, 5, length.out = 80L),
  y = seq(-4, 4, length.out = 60L)
)
gm <- as.matrix(grid)
kde_log_density <- fit@target@log_density
grid$kde <- kde_log_density(gm)
grid$proxy <- log(dgmm(gm, fit))

ggplot() +
  geom_contour(data = grid, aes(x, y, z = kde), colour = "steelblue") +
  geom_contour(data = grid, aes(x, y, z = proxy), colour = "tomato",
               linetype = "dashed") +
  geom_point(data = as.data.frame(x), aes(V1, V2), alpha = 0.2, size = 0.6) +
  coord_equal() +
  labs(title = "KDE (solid) vs Gaussian-mixture proxy (dashed)",
       x = "x", y = "y") +
  theme_minimal(base_size = 11)
```

# Scope and limits (v0.2)

* **Dimensional cap.** `from_kde()` is well-tested for `p <= 5`; it warns
  between `p = 6` and `p = 10`, and refuses `p > 10`. The wedge of regime (iii)
  is importance sampling, whose effective sample size falls sharply in high
  dimensions. The higher-dimensional path waits for the v0.3 affine-Gaussian
  operator calculus, which composes a low-`p` proxy with linear operators
  rather than fitting in the ambient space.
* **Bandwidth shapes.** Only scalar and diagonal bandwidths are supported. A
  full-matrix bandwidth effectively encodes a covariance estimate and blurs the
  KDE-vs-mixture distinction; if that is what you want, fit the mixture
  directly with `fit_proxymix(regime = "sample")`.
* **Normalisation.** A KDE is normalised by construction, so the returned
  `gmm_target` declares `normalised = TRUE` and `log_normalizer = 0`. This
  makes `hellinger_mc()` meaningful and lets the KLD diagnostics report
  absolute values, not shifted ones.
* **What this is not.** It is not a method for *picking* the KDE bandwidth. Use
  the conventional rules of thumb (`"silverman"`, `"scott"`) or a
  cross-validation procedure outside the package, then pass the chosen
  bandwidth in.

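
The dimensional cap above rests on the claim that the effective sample size of importance sampling degrades with dimension. The base-R sketch below illustrates that effect on a toy problem; it is not the sampler used inside `from_kde()`. The standard-normal target, the proposal scale `s = 1.5`, and the helper name `ess_from_log_weights()` are all assumptions made purely for illustration.

```r
# Toy illustration only: NOT the importance sampler used inside proxymix.
# Assumed target:   N(0, I_p)
# Assumed proposal: N(0, s^2 I_p) with s = 1.5
set.seed(1L)
n <- 2000L
s <- 1.5

# Kish effective sample size computed from unnormalised log-weights.
# (Hypothetical helper, not a proxymix function.)
ess_from_log_weights <- function(log_w) {
  w <- exp(log_w - max(log_w))  # subtract the max for numerical stability
  sum(w)^2 / sum(w^2)
}

ess_by_dim <- vapply(1:10, function(p) {
  # n draws from the proposal, one row per draw
  x <- matrix(rnorm(n * p, sd = s), nrow = n, ncol = p)
  # independent coordinates: the log-density is the row sum of marginal terms
  log_target   <- rowSums(dnorm(x, log = TRUE))
  log_proposal <- rowSums(dnorm(x, sd = s, log = TRUE))
  ess_from_log_weights(log_target - log_proposal)
}, numeric(1L))

data.frame(p = 1:10, ess = round(ess_by_dim), share = round(ess_by_dim / n, 2L))
```

On this toy problem the ESS share shrinks roughly geometrically in `p`, because the weight variance compounds across independent coordinates. A mixture proposal fitted to a KDE will behave differently in detail, but the same mechanism motivates the warning and refusal thresholds.
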
# When to prefer `fit_proxymix(regime = "sample")` instead

If your goal is "find a Gaussian-mixture density given samples", the classical
EM regime is more direct, fits in linear time, and does not incur the
regime-(iii) importance-sampling tax. `from_kde()` only earns its place when
you specifically want the KDE's smoothing as a step in your pipeline, e.g. as
part of a sequence such as

```
samples |> from_kde() |> gmm_conditionalise() |> rgmm()
```

or when the KDE's bias-variance trade-off is part of what you are trying to
validate downstream.

# References

* Hoek, J. van der and Elliott, R. J. (2024). *Mixtures of multivariate
  Gaussians.* Stochastic Analysis and Applications.
* Silverman, B. W. (1986). *Density Estimation for Statistics and Data
  Analysis.* CRC Press.
* Scott, D. W. (1992). *Multivariate Density Estimation: Theory, Practice, and
  Visualization.* Wiley.