---
title: "Tier-2 stubs: the research roadmap"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Tier-2 stubs: the research roadmap}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 6,
  fig.height = 4,
  out.width = "100%"
)
set.seed(20260513)
```

```{r setup}
library(proxymix)
```

`proxymix` ships five **Tier-2 stubs**: functions with stable signatures, full roxygen documentation, and signature-stability tests, whose bodies raise a `proxymix_not_yet_implemented` condition until they graduate. They mark research directions that the package's foundations support but that the package does not yet implement. `from_kde()` graduated in v0.2.0; each remaining stub will graduate to Tier 1 in a later release after its own design, validation, and stress-audit pass.

Collectively, the stubs show *why* the regime-(iii) wedge is load-bearing: each of the five ultimately calls into KLD-EM with a target that cannot be sampled.

## What the stubs are for

### `from_kde()`: KDE to Gaussian-mixture proxy *(graduated in v0.2.0)*

Take an `n` by `p` sample matrix, fit a kernel density estimate, then fit a Gaussian-mixture proxy *to that KDE* via KLD-EM. The proxy offers the closed-form mixture operations of `gmm_marginalise()` / `gmm_conditionalise()`, which the KDE lacks. This is useful as a smoothing-and-closing step in posterior summarisation.

Companion vignette: `vignettes/from_kde.Rmd`.

### `from_aggregate_likelihood()`: kernel-downsizing integration

Aggregate-likelihood downscaling concerns models of the form

\[
g(y) = \int K(y \mid x) \, f(x) \, dx
\]

where `y` is observed at aggregate scale and `x` at a finer scale. If `f` is a Gaussian-mixture proxy fitted via KLD-EM, the integral is closed-form and the downscaling likelihood becomes a Gaussian mixture in `y`. This stub plugs `proxymix` into Sejdinovic et al.'s kernel-downsizing framework as *the parametric `f` alternative to their GP `f`*.

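A minimal sketch of one case where the integral above has an explicit closed form, assuming a one-dimensional linear-Gaussian kernel `K(y | x) = N(y; a x + b, tau^2)`. That kernel, the parameter values, and the helper names `f`, `g_closed`, and `g_numeric` are illustrative assumptions, not part of `proxymix`. The code uses base R only and compares the closed-form mixture in `y` against numerical quadrature.

```r
# Hypothetical 1-D illustration; none of these names come from proxymix.
# Mixture proxy f(x) = sum_k w[k] * N(x; mu[k], s[k]^2)
w  <- c(0.3, 0.7)
mu <- c(-1, 2)
s  <- c(0.5, 1.0)

# Assumed linear-Gaussian kernel K(y | x) = N(y; a * x + b, tau^2)
a   <- 2
b   <- 0.5
tau <- 0.8

# Density of the mixture proxy at a vector of points
f <- function(x) {
  dens <- numeric(length(x))
  for (k in seq_along(w)) dens <- dens + w[k] * stats::dnorm(x, mu[k], s[k])
  dens
}

# Closed form: g(y) = sum_k w[k] * N(y; a * mu[k] + b, a^2 * s[k]^2 + tau^2)
g_closed <- function(y) {
  dens <- numeric(length(y))
  for (k in seq_along(w)) {
    dens <- dens +
      w[k] * stats::dnorm(y, a * mu[k] + b, sqrt(a^2 * s[k]^2 + tau^2))
  }
  dens
}

# Brute-force quadrature of the same integral at a single y, for comparison
g_numeric <- function(y) {
  stats::integrate(
    function(x) stats::dnorm(y, a * x + b, tau) * f(x),
    lower = -Inf, upper = Inf, rel.tol = 1e-8
  )$value
}

y_grid <- c(-3, 0, 2, 5)
max_abs_diff <- max(abs(g_closed(y_grid) - vapply(y_grid, g_numeric, numeric(1))))
max_abs_diff
```

Under this assumed kernel, each component `N(mu[k], s[k]^2)` maps to `N(a * mu[k] + b, a^2 * s[k]^2 + tau^2)` with its weight unchanged, which is why `g` remains a Gaussian mixture in `y`.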
```{r from_agg}
# The stub signals a classed condition; the handler prints its message.
tryCatch(
  from_aggregate_likelihood(matrix(0, 1, 1), latent_aggregator = identity, N = 2L),
  proxymix_not_yet_implemented = function(e) message(conditionMessage(e))
)
```

### `fit_kld_em_collider()`: DAG-constrained KLD-EM

Project each KLD-EM iteration onto the manifold of joint densities that respect the conditional-independence constraints implied by a user-supplied directed acyclic graph (the collider-regularised regression idea of Sejdinovic et al.). This is a *novel* methodological extension beyond Hoek and Elliott (2024), useful for testing causal-inference methods on Gaussian-mixture-generated joints with known DAG structure.

```{r kld_collider}
tryCatch(
  fit_kld_em_collider(banana_target(), dag = matrix(0, 2, 2), N = 2L),
  proxymix_not_yet_implemented = function(e) message(conditionMessage(e))
)
```

### `to_apsim_scenarios()`: APSIM scenario generation

Convert samples from a `gmm_fit` (typically fitted to a multivariate weather / soil / management distribution) into the tabular format consumed by APSIM scenario runners. This provides a clean bridge from `proxymix` proxies to mechanistic agronomy simulators.

```{r apsim}
# Fit a small two-component proxy to simulated samples, then call the stub.
x <- matrix(stats::rnorm(200), ncol = 2)
fit <- fit_proxymix(gmm_target_from_samples(x), N = 2L,
                    regime = "sample", max_iter = 10L)
tryCatch(
  to_apsim_scenarios(fit, n = 100L, schema = list()),
  proxymix_not_yet_implemented = function(e) message(conditionMessage(e))
)
```

### `from_simulator()`: wrap an expensive simulator as a target

Probe an expensive simulator on a designed grid of inputs, build a KDE (or empirical-likelihood surface) on its outputs, and expose the result as a `gmm_target` with an evaluable `log_density`. The simulator is treated as a black-box `f` that can be evaluated but not (cheaply) sampled, which is the wedge use case.

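The "KDE with an evaluable `log_density`" step described above can be sketched in base R. This is an illustration under stated assumptions, not the planned implementation: `make_kde_log_density()` and `toy_simulator()` are hypothetical names, the product-Gaussian kernel with `stats::bw.nrd0()` bandwidths is an assumed choice, and the sketch stops short of wrapping the result in a `gmm_target`, whose constructor is not shown in this vignette.

```r
# Hypothetical helper, not part of proxymix: a product-Gaussian KDE whose
# log-density can be evaluated pointwise at new locations.
make_kde_log_density <- function(outputs) {
  outputs <- as.matrix(outputs)
  n <- nrow(outputs)
  # One rule-of-thumb bandwidth per output dimension (an assumed choice)
  h <- apply(outputs, 2, stats::bw.nrd0)
  function(y) {
    y <- matrix(y, ncol = ncol(outputs))
    vapply(seq_len(nrow(y)), function(i) {
      # Standardised distance from query point i to every simulator output
      z <- sweep(outputs, 2, y[i, ], "-")
      z <- sweep(z, 2, h, "/")
      # Log kernel contribution of each output, summed over dimensions
      log_k <- rowSums(stats::dnorm(z, log = TRUE)) - sum(log(h))
      # Log-mean-exp for numerical stability
      m <- max(log_k)
      m + log(sum(exp(log_k - m))) - log(n)
    }, numeric(1))
  }
}

# Stand-in for an expensive simulator, evaluated once on a small design
toy_simulator <- function(x) cbind(x[, 1] + x[, 2], x[, 1] * x[, 2])
set.seed(1)
design <- matrix(stats::rnorm(40), ncol = 2)
log_density <- make_kde_log_density(toy_simulator(design))

# Evaluate the surrogate log-density at one point and at two points
log_density(c(0, 0))
log_density(rbind(c(0, 0), c(1, 1)))
```

The returned closure only evaluates a density; it never draws from the simulator again, which matches the evaluate-but-not-sample setting described above.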
```{r from_sim}
# `identity` stands in for the simulator; the design has 10 rows and 2 columns.
tryCatch(
  from_simulator(simulator = identity,
                 design = matrix(stats::rnorm(20), ncol = 2)),
  proxymix_not_yet_implemented = function(e) message(conditionMessage(e))
)
```

## Why these stubs and not others

The choice of stubs is opinionated. Three guidelines apply:

1. **Each stub must terminate in a regime-(i)–(iii) verb.** When a stub graduates to a Tier-1 implementation, it should be a thin shim around the existing fitters. No new core algorithms are introduced silently.
2. **Each stub must have a sponsor application.** `from_kde` / `from_simulator` are general-purpose; `from_aggregate_likelihood` and `fit_kld_em_collider` are anchored on Sejdinovic et al.'s recent work; `to_apsim_scenarios` is anchored on agronomy applications. *Unsponsored stubs do not appear here.*
3. **Each stub must respect the package's Tier-3 deferrals.** No stub introduces normalising flows, automatic-differentiation variational inference, or Stan / INLA interop.

## What is *not* coming

The following are explicit **Tier-3 deferrals** and will *not* appear in `proxymix` as stubs:

* **Adaptive importance sampling.** Redrawing the IS sample at each KLD-EM iteration. A real win at moderate dimensions; deferred because a clean treatment requires more design than the current scope allows.
* **Variational boosting.** Adding components until the IS-estimated KLD plateaus.
* **Normalising-flow proposals.** Out of scope for a Gaussian-mixture package; use `tensorflow` / `keras` or `torch` directly.
* **Stan / INLA interop.** Adjacent but distinct ecosystem.

## Reference

Hoek, J. van der and Elliott, R. J. (2024). *Mixtures of multivariate Gaussians.* Stochastic Analysis and Applications.