---
title: "Missing data that depends on the missing value"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Missing data that depends on the missing value}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 6,
  fig.height = 4.2,
  out.width = "100%"
)
set.seed(20260622)
```

```{r setup}
library(proxymix)
```

# Imputation is conditioning on a gate

In proxymix you do not impute a missing entry so much as condition the fitted mixture on what its being missing tells you. For an entry missing at random that is nothing beyond the other coordinates, so the imputation law is the plain mixture conditional.

Two common situations carry more information than that. An entry can be missing because it fell in a known interval -- below a detection limit, above a ceiling -- which is *censoring*. Or it can be missing with a probability that depends on its own unobserved value, larger values dropping out more often, which is *missing not at random*. Each is a gate $\delta(y)$ multiplying the mixture conditional before the value is drawn.

```{r gate-table, echo = FALSE}
knitr::kable(data.frame(
  Mechanism   = c("missing at random", "censored", "missing not at random"),
  Gate        = c("constant", "indicator of an interval", "selection probability"),
  Constructor = c("mar()", "censored()", "mnar()")),
  align = "lll")
```

# A value-dependent mechanism biases the ordinary fit

Consider a two-component mixture where the second coordinate `y` is missing with a probability that grows with `y` itself, so the larger values are under-represented among the observed.

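
Why the observed values run short can be checked without simulation. The sketch below is a base-R illustration, not proxymix code: it assumes a single standard-normal coordinate and a logistic selection probability with a hypothetical intercept `alpha` and slope `beta`, and it integrates the densities of the observed and the missing values separately.

```r
# Assumed setting for illustration: y ~ N(0, 1), P(missing | y) logistic in y.
alpha <- -0.5   # hypothetical intercept
beta  <- 0.7    # hypothetical slope
p_miss <- function(y) plogis(alpha + beta * y)

# Overall probabilities of being missing and of being observed.
pr_miss <- integrate(function(y) dnorm(y) * p_miss(y), -Inf, Inf)$value
pr_obs  <- 1 - pr_miss

# Means of y among the observed and among the missing values.
mean_obs <- integrate(function(y) y * dnorm(y) * (1 - p_miss(y)),
                      -Inf, Inf)$value / pr_obs
mean_miss <- integrate(function(y) y * dnorm(y) * p_miss(y),
                       -Inf, Inf)$value / pr_miss

c(complete = 0, observed = mean_obs, missing = mean_miss)
```

Under these assumptions the observed mean falls below the complete-data mean of zero and the missing mean sits above it, which is the shortfall described above.
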
```{r dgp}
n <- 1200
comp <- sample(1:2, n, replace = TRUE)
mu <- rbind(c(0, 0), c(1.5, 0.5))
L <- chol(matrix(c(1, 0.6, 0.6, 1), 2))
Z <- matrix(rnorm(2 * n), n, 2) %*% L + mu[comp, ]
colnames(Z) <- c("x1", "y")
truth <- mean(Z[, 2])                            # the complete-data mean
miss <- runif(n) < plogis(-0.5 + 0.7 * Z[, 2])   # larger y, more missing
dat <- Z
dat[miss, "y"] <- NA
```

An imputer that assumes the data are missing at random fills the gap from the conditional of the observed, which is itself short of the large values, so it stays biased. Supplying the selection slope to `mnar()` lets the fit account for the mechanism and recover the mean.

```{r recover}
mar_fit  <- gmm_impute(dat, N = 2, m = 20, mechanism = mar(), seed = 1)
mnar_fit <- gmm_impute(dat, N = 2, m = 20,
                       mechanism = mnar("y", beta = 0.7), seed = 1)
c(truth = truth,
  available = mean(dat[!miss, "y"]),
  mar = proxy_pool(mar_fit, "y")$estimate,
  mnar = proxy_pool(mnar_fit, "y")$estimate)
```

The slope `beta` is the strength of the dependence on the log-odds scale, and `beta = 0` is the missing-at-random case. The intercept is not supplied -- it is calibrated to the observed fraction missing.

# The slope is a sensitivity parameter

A missing-not-at-random departure cannot be identified from the observed data, so a single estimate would overstate what the data support. The honest summary is a curve: `proxy_mnar_sensitivity()` sweeps the slope and reports the pooled mean with its interval at each value, and the analyst reads off where a conclusion would change.

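
Since the intercept is calibrated rather than supplied, every slope on a grid implies its own intercept. A minimal sketch of one way such a calibration could work follows. It is an assumption made for illustration, not proxymix's routine: `calibrate_intercept()` is a hypothetical helper, the coordinate is represented by a fixed grid of standard-normal quantiles, and the target missing fraction of 0.4 is made up.

```r
# Hypothetical helper, not part of proxymix: find the intercept at which the
# average logistic selection probability equals a target missing fraction.
calibrate_intercept <- function(y, beta, target) {
  gap <- function(alpha) mean(plogis(alpha + beta * y)) - target
  uniroot(gap, interval = c(-30, 30), tol = 1e-10)$root
}

# Deterministic stand-in for the distribution of the coordinate.
y_grid <- qnorm(ppoints(2001))

target <- 0.4                      # assumed fraction missing
betas  <- c(0, 0.4, 0.8, 1.2)      # assumed slope grid
alphas <- sapply(betas, function(b)
  calibrate_intercept(y_grid, beta = b, target = target))

data.frame(beta = betas, alpha = alphas)
```

At `beta = 0` the solution is `qlogis(0.4)`, the constant gate of the missing-at-random case. The intercept then moves as the slope grows, so that the overall missing fraction stays fixed.
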
```{r sweep}
sweep <- proxy_mnar_sensitivity(dat, "y",
                                beta_grid = seq(0, 1.2, by = 0.2),
                                N = 2, m = 20, seed = 1)
sweep[, c("beta", "estimate", "conf.low", "conf.high")]
```

```{r tornado, echo = FALSE, fig.alt = "Pooled mean of y against the assumed sensitivity slope, with confidence bands and the complete-data truth marked."}
op <- par(mar = c(4.2, 4.2, 1, 1))
plot(sweep$beta, sweep$estimate, type = "n",
     ylim = range(sweep$conf.low, sweep$conf.high, truth),
     xlab = expression("sensitivity slope " * beta),
     ylab = "pooled mean of y")
polygon(c(sweep$beta, rev(sweep$beta)),
        c(sweep$conf.low, rev(sweep$conf.high)),
        col = "grey85", border = NA)
lines(sweep$beta, sweep$estimate, lwd = 2)
points(sweep$beta, sweep$estimate, pch = 19)
abline(h = truth, lty = 2)
text(min(sweep$beta), truth, "truth", pos = 3, cex = 0.8)
par(op)
```

The estimate rises monotonically with the assumed slope and crosses the complete-data truth near the slope that generated the data, which is the value proposition of a sensitivity analysis -- without knowing the slope, sweeping it brackets the truth.

# Censoring is the exact, identifiable case

When the reason for missingness is a known bound, the mechanism is identified and there is no sensitivity parameter. A detection limit that hides values of `y` below a threshold is `censored("y", upper = threshold)`, and proxymix draws from the mixture conditional truncated to the censored interval rather than substituting a constant such as half the limit.

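
The difference between a truncated draw and a constant fill can be sketched in base R. The helper `rtrunc_norm()` below is hypothetical and handles one Gaussian component only, with made-up mean, standard deviation, and limit. For a mixture, each component's weight would also need rescaling by its mass inside the interval. It is not proxymix's sampler.

```r
# Hypothetical helper, not proxymix's sampler: inverse-CDF draws from a
# N(mean, sd^2) distribution truncated to the interval (lower, upper).
rtrunc_norm <- function(k, mean, sd, lower = -Inf, upper = Inf) {
  p_lo <- pnorm(lower, mean, sd)   # probability mass below the interval
  p_hi <- pnorm(upper, mean, sd)   # probability mass up to its upper end
  u <- runif(k, p_lo, p_hi)        # uniform on the interval's CDF range
  qnorm(u, mean, sd)               # map back to the value scale
}

set.seed(1)
limit <- 0.3                                       # assumed detection limit
draws <- rtrunc_norm(5000, mean = 0.5, sd = 1, upper = limit)

c(max_draw = max(draws), mean_draw = mean(draws), half_limit = limit / 2)
```

Every draw stays at or below the limit, and their spread follows the lower tail of the component instead of collapsing onto a single substituted value.
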
```{r censor}
thr <- 0.3
cmiss <- Z[, 2] < thr
cdat <- Z
cdat[cmiss, "y"] <- NA
cfit <- gmm_impute(cdat, N = 2, m = 20,
                   mechanism = censored("y", upper = thr), seed = 1)
c(truth = mean(Z[, 2]),
  available = mean(cdat[!cmiss, "y"]),
  half_limit = mean(ifelse(cmiss, thr / 2, Z[, 2])),
  proxymix = proxy_pool(cfit, "y")$estimate)
```

# Scope and references

The interval and value-dependent gates act on a single coordinate, and a row missing that coordinate is assumed to have its other coordinates observed, which is the detection-limit or single-outcome setting. A model estimand other than a coordinate mean is pooled by handing the completions to `mice` through `as_mids()`.

The selection-model formulation follows Diggle and Kenward (1994) and the sensitivity-analysis posture follows Little (1993) and the missing-data literature it anchors. The truncated-Gaussian algebra behind the censored path is standard, and the closed-form moments are evaluated directly.
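
For reference, the closed-form moments in question can be written out for one Gaussian component censored from above. The helper `trunc_upper_moments()` is a hypothetical name and the parameter values are made up. The sketch checks the standard formulas against numerical integration and is not proxymix's implementation.

```r
# Hypothetical helper, not part of proxymix: mean and variance of a
# N(mean, sd^2) variable conditioned on falling below `upper`.
trunc_upper_moments <- function(mean, sd, upper) {
  b <- (upper - mean) / sd          # standardised bound
  h <- dnorm(b) / pnorm(b)          # density-to-mass ratio at the bound
  c(mean = mean - sd * h,
    var  = sd^2 * (1 - b * h - h^2))
}

m <- 0.5; s <- 1; u <- 0.3          # assumed component and limit
closed <- trunc_upper_moments(m, s, u)

# Numerical check by integrating over the censored interval.
mass <- pnorm(u, m, s)
num_mean <- integrate(function(y) y * dnorm(y, m, s), -Inf, u)$value / mass
num_var <- integrate(function(y) (y - num_mean)^2 * dnorm(y, m, s),
                     -Inf, u)$value / mass

rbind(closed = closed, numeric = c(num_mean, num_var))
```

The two rows agree to numerical precision, and the truncated variance is smaller than the untruncated `sd^2`, as expected once the upper tail is removed.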