--- title: "Imputing missing data with a mixture" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Imputing missing data with a mixture} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 6, fig.height = 4.2, out.width = "100%" ) set.seed(20260620) ``` ```{r setup} library(proxymix) ``` # Imputation is conditioning When a value is missing, what we know about it is its conditional distribution given everything we did observe. If the joint distribution of the data is a Gaussian mixture, that conditional is itself a Gaussian mixture, available in closed form by the Schur-complement algebra of `gmm_conditionalise()`. `gmm_impute()` fits a mixture to a dataset that contains holes and draws completed datasets from these per-row conditionals. A mixture can be multimodal and heteroscedastic. That is the point: it represents shapes a single-Gaussian model (the multivariate-normal assumption) and a linear-Gaussian conditional (a normal regression imputation) cannot, so the imputations -- and the inference pooled from them -- stay faithful to data those models mis-specify. # A two-cluster example Two groups, each a tight cloud, with a positive within-group association between `x1` and `x2`. Roughly forty per cent of `x2` is then made missing at random, with the chance of being missing depending on the observed `x1`. ```{r data} n <- 600 lab <- sample(c(-1, 1), n, replace = TRUE) x1 <- 2 * lab + rnorm(n, 0, 0.6) x2 <- 2 * lab + 0.5 * (x1 - 2 * lab) + rnorm(n, 0, 0.6) truth <- cbind(x1 = x1, x2 = x2) X <- truth missing <- runif(n) < plogis(0.6 * x1) # missing at random, depends on x1 X[missing, "x2"] <- NA round(mean(missing), 2) ``` The marginal distribution of `x2` is bimodal: one cluster near $-2$, one near $+2$, with little mass between them. # Impute, then look at the shape ```{r impute} imp <- gmm_impute(X, N = 2L, m = 20L, seed = 1L) imp ``` A single completed dataset has no holes: ```{r complete} done <- gmm_complete(imp, 1L) anyNA(done) ``` Do the imputed values follow the two clusters, or do they pile up in the empty middle? Overlaying the imputed `x2` on the values that were actually missing shows the mixture imputer reproducing both modes. ```{r modes, fig.alt = "Density of the missing x2 values against the mixture-imputed values; both are bimodal and overlap."} imputed <- done[missing, "x2"] hidden <- truth[missing, "x2"] plot(density(hidden), lwd = 2, col = "grey30", main = "Missing x2: truth vs mixture imputation", xlab = expression(x[2]), ylim = c(0, 0.5)) lines(density(imputed), lwd = 2, col = "#1b7837") legend("topright", c("held-out truth", "mixture imputation"), col = c("grey30", "#1b7837"), lwd = 2, bty = "n") ``` A single Gaussian fitted to the same holes has one centre and one spread, so it fills the empty middle and erases the two clusters. Forcing one component (`N = 1L`) shows the contrast: ```{r single, fig.alt = "A single-Gaussian imputation places mass in the empty middle, unlike the bimodal truth."} imp1 <- gmm_impute(X, N = 1L, m = 20L, seed = 1L) imputed1 <- gmm_complete(imp1, 1L)[missing, "x2"] plot(density(hidden), lwd = 2, col = "grey30", main = "A single Gaussian fills the gap", xlab = expression(x[2]), ylim = c(0, 0.5)) lines(density(imputed1), lwd = 2, col = "#d95f02") legend("topright", c("held-out truth", "single-Gaussian imputation"), col = c("grey30", "#d95f02"), lwd = 2, bty = "n") ``` # Pooling an estimand A column mean is pooled in closed form -- the exact large-sample limit of the between-imputation variance, so there is no Monte-Carlo noise in the pooling step -- reporting the estimate, a standard error, a confidence interval, and the fraction of missing information. ```{r pool-mean} proxy_pool(imp, "x2") proxy_fmi(imp, "x2") ``` For a regression or any other model, the established place to combine an estimand across imputations is `mice`. `as_mids()` packages the completions as a `mice` object, so the joint mixture imputations flow into `mice::pool()` unchanged -- proxymix supplies the imputation model, `mice` the pooling. ```{r pool-mice, eval = requireNamespace("mice", quietly = TRUE)} library(mice) fit <- with(as_mids(imp), lm(x2 ~ x1)) summary(pool(fit)) ``` # Scope This release imputes numeric data that is missing at random: the probability that an entry is missing may depend on the observed entries but not on the missing value itself. The number of mixture components is chosen by the Bayesian information criterion when `N` is left at its default, and a supplied `seed` makes the result reproducible without disturbing the ambient random-number state.