---
title: "Density-Ratio Backends and the proxymix Binding"
author: "Max Moldovan"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Density-Ratio Backends and the proxymix Binding}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
have_proxymix <- requireNamespace("proxymix", quietly = TRUE)
```

## The cross-package contract

`kernR::estimate_density_ratio()` is the entry point for every kernR estimator that needs to reweight observational samples to an interventional distribution, most notably `bd_hsic_test()` for the backdoor-HSIC causal test. The function offers four density-ratio backends behind a single signature:

| Backend    | Family                         | When to use                                         |
|------------|--------------------------------|-----------------------------------------------------|
| `logistic` | Noise-contrastive classifier   | Default; robust on smooth, unimodal densities       |
| `ranger`   | Random-forest classifier       | Flexible non-linear NCE; needs `ranger`             |
| `xgboost`  | Gradient-boosted classifier    | Strong on tabular interactions; needs `xgboost`     |
| `proxymix` | Gaussian-mixture density ratio | Multimodal/skewed densities; parametric alternative |

The `proxymix` backend is the cross-package wedge between kernR (the distributional verdict layer of the UQ ag stack) and proxymix (the Gaussian-mixture proxy / KL-density-ratio bridge, Hoek & Elliott 2024). It fits one GMM to the joint sample cloud `(x, z)`, a second GMM to the product-of-marginals cloud `(x_perm, z)`, then evaluates the analytic ratio of the two mixture densities at each observation. There is no classifier calibration step: the ratio is closed-form in the fitted parameters.

```{r dep-note, echo=FALSE, results='asis'}
if (!have_proxymix) {
  cat(
    "> **Note.** The `proxymix` package is not installed in the current ",
    "library; chunks that compare the GMM backend below are skipped. ",
    "Install via `remotes::install_github(\"max578/proxymix\")` or from ",
    "the local source tarball to reproduce the full comparison.\n\n",
    sep = ""
  )
}
```

## Four backends, one problem

A toy confounded design: `z` is a 2-D Gaussian confounder; `x` is a linear-Gaussian function of `z`; `y` carries a real causal effect from `x` plus a confounded path through `z`.

```{r data}
suppressPackageStartupMessages(library(kernR))
set.seed(2026L)
n <- 200L
z <- matrix(rnorm(n * 2L), n, 2L)
x <- z[, 1L] + rnorm(n, sd = 0.5)
y <- 0.7 * x + z[, 2L] + rnorm(n, sd = 0.4)
```

Each backend produces a `density_ratio_fit` object exposing ESS and a weight vector. We tabulate them side-by-side.

```{r logistic-ranger-xgb}
dr_logistic <- estimate_density_ratio(x, z, method = "logistic", seed = 1L)
dr_ranger <- if (requireNamespace("ranger", quietly = TRUE)) {
  estimate_density_ratio(x, z, method = "ranger", seed = 1L)
} else NULL
dr_xgb <- if (requireNamespace("xgboost", quietly = TRUE)) {
  estimate_density_ratio(x, z, method = "xgboost", seed = 1L)
} else NULL
```

```{r proxymix-fit, eval=have_proxymix}
dr_proxymix <- estimate_density_ratio(
  x, z,
  method = "proxymix",
  proxymix_components = 2L,
  seed = 1L
)
```

```{r summary-table, echo=FALSE}
rows <- list(
  list(name = "logistic", fit = dr_logistic),
  if (!is.null(dr_ranger)) list(name = "ranger", fit = dr_ranger),
  if (!is.null(dr_xgb)) list(name = "xgboost", fit = dr_xgb),
  if (have_proxymix) list(name = "proxymix", fit = dr_proxymix)
)
rows <- Filter(Negate(is.null), rows)
summary_df <- do.call(rbind, lapply(rows, function(r) {
  w <- r$fit$weights
  data.frame(
    backend = r$name,
    ess = round(r$fit$ess, 1),
    min_weight = signif(min(w), 3),
    max_weight = signif(max(w), 3),
    stringsAsFactors = FALSE
  )
}))
knitr::kable(summary_df, caption = "Density-ratio diagnostics across backends.")
```

## bd-HSIC under each backend

```{r bdhsic-classifiers}
res_logistic <- bd_hsic_test(
  x, y, z,
  density_ratio = "logistic",
  n_permutations = 199L,
  seed = 1L
)
```
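The ratio-of-fitted-densities idea behind the `proxymix` backend can be illustrated without either package. The sketch below is a base-R simplification, not the kernR or proxymix implementation: it replaces each Gaussian mixture with a single Gaussian, builds the product-of-marginals cloud by permuting `x`, and evaluates the ratio at each observation. The helper `log_dmvnorm()`, the `_demo` variable names, the direction of the ratio (product of marginals over joint), and the Kish formula for ESS are all assumptions made for illustration.

```r
# Hypothetical helper (not part of kernR or proxymix): log-density of a
# multivariate Gaussian evaluated at each row of `pts`.
log_dmvnorm <- function(pts, mu, sigma) {
  d <- ncol(pts)
  centred <- sweep(pts, 2L, mu)          # subtract the mean from every row
  chol_sigma <- chol(sigma)              # sigma = t(R) %*% R, R upper triangular
  # Solve t(R) %*% u = t(centred); colSums(u^2) is the squared Mahalanobis distance
  u <- backsolve(chol_sigma, t(centred), transpose = TRUE)
  -0.5 * colSums(u^2) - sum(log(diag(chol_sigma))) - 0.5 * d * log(2 * pi)
}

set.seed(42L)
n_demo <- 200L
z_demo <- matrix(rnorm(n_demo * 2L), n_demo, 2L)
x_demo <- z_demo[, 1L] + rnorm(n_demo, sd = 0.5)

# Joint cloud (x, z) and product-of-marginals cloud (x_perm, z):
# permuting x breaks its dependence on z while keeping both marginals.
joint_demo <- cbind(x_demo, z_demo)
prod_demo  <- cbind(sample(x_demo), z_demo)

# Single-Gaussian stand-in for each fitted mixture (assumption: one component).
log_prod  <- log_dmvnorm(joint_demo, colMeans(prod_demo),  cov(prod_demo))
log_joint <- log_dmvnorm(joint_demo, colMeans(joint_demo), cov(joint_demo))

# Assumed direction: product-of-marginals density over joint density,
# evaluated at the observed (x, z) pairs.
w_demo <- exp(log_prod - log_joint)

# Kish effective sample size of the weights (assumed definition of ESS).
ess_demo <- sum(w_demo)^2 / sum(w_demo^2)
```

With a single component the ratio can only capture linear-Gaussian dependence between `x` and `z`; a mixture fit is the natural extension when that is too coarse.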
```{r bdhsic-proxymix, eval=have_proxymix}
res_proxymix <- bd_hsic_test(
  x, y, z,
  density_ratio = "proxymix",
  n_permutations = 199L,
  seed = 1L
)
```

```{r bdhsic-table, echo=FALSE}
rows2 <- list(
  data.frame(backend = "logistic",
             statistic = signif(res_logistic$statistic, 3),
             p_value = res_logistic$p_value,
             ess = round(res_logistic$ess, 1),
             stringsAsFactors = FALSE)
)
if (have_proxymix) {
  rows2 <- c(rows2, list(
    data.frame(backend = "proxymix",
               statistic = signif(res_proxymix$statistic, 3),
               p_value = res_proxymix$p_value,
               ess = round(res_proxymix$ess, 1),
               stringsAsFactors = FALSE)
  ))
}
knitr::kable(do.call(rbind, rows2),
             caption = "bd-HSIC test under each density-ratio backend.")
```

## When to reach for proxymix

The classifier-based backends (`logistic`, `ranger`, `xgboost`) are the default for a reason: they tolerate misspecified densities, scale to high-dimensional `z`, and have well-understood calibration. Reach for `method = "proxymix"` when:

* the joint density is plausibly **multimodal** (multi-regime climate, paddock × variety designs with distinct production zones, animal cohorts with separable subpopulations): a 2-component GMM represents this cleanly where a classifier would smear across the modes;
* you need a **parametric** density-ratio whose components you can inspect, hand off to a downstream Bayesian step (via `proxymix::gmm_target_from_posterior()`), or use as the seed of a KLD-EM refinement on a target you can evaluate but not sample from;
* classifier calibration is unreliable on the cohort at hand (small `n`, sharp class imbalance after the joint-vs-marginal split, pathological feature scaling).

The `proxymix` package is GRDC-firewalled (MIT, no GRDC IP flows in) and ships its full Gaussian-mixture proxy API independently of kernR. kernR consumes it as a soft dependency via `requireNamespace()`; the binding is one-way and rebuildable from the local `proxymix_*.tar.gz` source.

## References

* Hoek, J. van der & Elliott, R. J. (2024). Mixtures of multivariate Gaussians. *Stochastic Analysis and Applications*. DOI: 10.1080/07362994.2024.2372605.
* Hu, R., Sejdinovic, D. & Evans, R. J. (2024). A kernel test for causal association via noise contrastive backdoor adjustment. *Journal of Machine Learning Research*, 25(160), 1–56.

```{r session-info}
sessionInfo()
```