--- title: "Pre-IES Identifiability Screening with HSIC" author: "Max Moldovan" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Pre-IES Identifiability Screening with HSIC} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 6, fig.height = 4, dpi = 150 ) set.seed(1) ``` ## Why screen before calibrating? Iterative Ensemble-Smoother (IES) calibration of a mechanistic agricultural model such as APSIM is expensive: each ensemble member is a full simulator run. Spending realisations on parameters that have no detectable effect on the outputs of interest is wasted budget. A pre-IES screen answers a sharper question than variance-based Sobol analysis: *does this parameter influence the distribution of any output at all?* HSIC -- the Hilbert-Schmidt Independence Criterion -- detects both linear and non-linear, mean-shift and tail-shift effects, including those Sobol misses when variance is preserved. The workflow is three lines: 1. Build a Latin-hypercube design across plausible parameter ranges. 2. Run the simulator on the design. 3. Pass the design + outputs to `hsic_identifiability()`. ## A stubbed APSIM archetype Real APSIM workflows are heavy, so this vignette uses a small simulator stub with four parameters -- two genuinely identifiable, one weakly nonlinear, and one inert -- and two outputs (yield and biomass). ```{r stub-simulator} library(kernR) stub_apsim <- function(theta) { # theta: n x 4 matrix with columns slope, curvature, weak, inert yield <- 1.5 * theta[, "slope"] + 0.3 * stats::rnorm(nrow(theta), sd = 0.1) biomass <- theta[, "curvature"]^2 + 0.4 * theta[, "weak"] + stats::rnorm(nrow(theta), sd = 0.1) cbind(yield = yield, biomass = biomass) } ``` ## Latin-hypercube design ```{r design} bounds <- rbind( slope = c(0.0, 2.0), # active, linear in yield curvature = c(-1.0, 1.0), # active, quadratic in biomass weak = c(0.0, 1.0), # weak linear effect on biomass inert = c(0.0, 1.0) # no effect ) design <- lhs_design(n = 80L, bounds = bounds, seed = 11L) head(design) ``` ## Simulate and screen ```{r screen} outputs <- stub_apsim(design) fit <- hsic_identifiability( theta = design, y = outputs, alpha = 0.05, p_adjust = "BH", n_permutations = 299L, seed = 11L ) fit ``` The print method ranks parameters by their maximum HSIC across outputs. `slope` and `curvature` should be flagged identifiable with small p-values; `weak` may or may not survive depending on noise; `inert` should fall below the threshold. ## Visual diagnostic ```{r plot} plot(fit) ``` ## Handing off to PESTO The identifiable subset is precisely the set you would forward to `pesto_ies()` (PESTO's IES wrapper) as the parameter prior. In pseudo-R: identifiable_params <- names(fit$identifiable)[fit$identifiable] pesto::pesto_ies(..., params = identifiable_params, ...) For the cross-package workflow specification (kernR -> PESTO -> APSIM ensemble loop), see the architecture document `uq_ag_stack_design_v0.md`. ## Tabular form For downstream reporting (or to feed into a manuscript table), call `as.data.frame()`: ```{r as-df} df <- as.data.frame(fit) head(df) ``` ## Notes on practice - **Design size.** `n` should comfortably exceed the number of parameters; we recommend `n >= 10p` as a starting point for ag-systems-scale designs. - **Permutations.** 199-499 permutations is usually sufficient; the minimum achievable p-value at `n_permutations` is `1 / (n_permutations + 1)`, so `n_permutations >= 199` is the floor for testing at `alpha = 0.05`. - **Multiple testing.** With many parameters and several outputs, unadjusted p-values will produce spurious "identifiable" flags. The default Benjamini-Hochberg adjustment controls the false discovery rate across the `p x q` grid. - **Cost.** Kernel matrices are reused across the grid, so cost scales as `O((p + q) n^2)` for kernel construction plus `O(p q B n^2)` for the permutation null, where `B = n_permutations`. ## References - Hu, Z., Sejdinovic, D., & Evans, R. J. (2024). *A kernel-based statistical test for causal inference with backdoor adjustment.* Journal of Machine Learning Research, 25. - Gretton, A., Fukumizu, K., Teo, C. H., Song, L., Schölkopf, B., & Smola, A. J. (2008). A kernel statistical test of independence. *NeurIPS*, 20. - Da Veiga, S. (2015). Global sensitivity analysis with dependence measures. *Journal of Statistical Computation and Simulation*, 85(7), 1283-1305.