---
title: "Calibration and concordance: kernR's validation layer"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Calibration and concordance: kernR's validation layer}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
library(kernR)
```

`kernR` is a verdict layer: it does not fit models, it decides whether a fitted model's output should be trusted. Two questions sit at the centre of that role, and this vignette covers the two tests that answer them.

- **Calibration.** Does a sample actually follow the distribution it claims to represent? This is a one-sample, score-based question, answered by `ksd_test()` -- the kernel Stein discrepancy goodness-of-fit test.
- **Concordance.** Do several sources -- posterior draws from different inference engines, or scenario ensembles from different simulators -- agree with each other? This is a k-sample, sample-based question, answered by `concordance_test()`.

The two are complementary. Calibration compares a sample against a *density*; concordance compares samples against *each other*. Together they let `kernR` act as a falsification gate over an upstream inference pipeline.

## Calibration: `ksd_test()`

The kernel Stein discrepancy needs only the *score* of the target -- the gradient of its log density -- so the target may be unnormalised and no reference sample is required. For a multivariate-normal target the score is available in closed form via `gaussian_score()`.

```{r ksd-gaussian}
set.seed(1)
# A sample that genuinely follows the standard normal
x_ok <- matrix(rnorm(400), ncol = 2)
ksd_test(x_ok, n_boot = 199, seed = 1)
```

The verdict is *consistent with target*: the statistic sits near zero and the p-value is large. A mis-specified sample -- here shifted in mean -- is caught.

```{r ksd-shift}
x_bad <- x_ok + 0.7
ksd_test(x_bad, n_boot = 199, seed = 1)
```

When the target has no convenient closed-form score, supply it as a log density and let `numeric_score()` take the gradient by finite differences. Any additive normalising constant cancels, so an unnormalised log density is enough.

```{r ksd-numeric}
log_density <- function(z) -0.5 * rowSums(z^2)  # standard normal, unnormalised
ksd_test(x_ok, score = numeric_score(log_density), n_boot = 199, seed = 1)
```

This is the adapter that lets `ksd_test()` consume any externally supplied log-posterior evaluator: wrap the evaluator in `numeric_score()` and the test asks whether posterior draws are calibrated against that posterior, with no dependency on the producer of the evaluator.

The default base kernel is the inverse multi-quadric, which stays sensitive to mis-specification in higher dimensions where the Gaussian kernel can lose power; pass `kernel = "rbf"` for the Gaussian alternative.

## Concordance: `concordance_test()`

Where calibration checks one sample against a density, concordance checks several samples against one another. The input is a list -- one element per source -- and a named list labels the sources in the output.

```{r concordance-ok}
set.seed(2)
draws <- list(
  engine_a = matrix(rnorm(400), ncol = 2),
  engine_b = matrix(rnorm(400), ncol = 2),
  engine_c = matrix(rnorm(400), ncol = 2)
)
concordance_test(draws, n_permutations = 199, seed = 1)
```

The three sources are mutually concordant, so the verdict is *concordant*. The test is more than a single yes-or-no, though: it returns the full pairwise discrepancy matrix, so when one source departs the rejection can be read down to the offending pair.

```{r concordance-bad}
draws$engine_c <- draws$engine_c + 1  # engine_c now disagrees
fit <- concordance_test(draws, n_permutations = 199, seed = 1)
fit$pairwise
```

The pairwise matrix localises the problem: the `engine_c` row and column carry the large discrepancies, while `engine_a` and `engine_b` remain close. The single joint-permutation null keeps the overall p-value calibrated across all pairs, so this is one test with one verdict, not a multiple-comparison sweep.

## The validation pattern

Used together the two tests express a simple discipline. Concordance across independent sources is corroborating evidence -- agreement is hard to fake. Calibration against a target density is the absolute check that the agreed-upon answer is also the correct one. Divergence on either test is informative: `concordance_test()` localises *which* source departs, and `ksd_test()` identifies *whether* a sample departs from its claimed target. A pipeline whose ensembles pass both has earned more trust than one validated on the posterior mean alone.
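
To make the calibration statistic from the `ksd_test()` section concrete, the sketch below computes a squared kernel Stein discrepancy by hand in base R. It is an illustration under stated assumptions, not `kernR`'s implementation: the helper name `ksd_vstat_sketch()` is hypothetical, the target is fixed to the standard normal (score `-x`), the inverse multi-quadric kernel uses unit scale and exponent `-1/2`, and the estimate is a plain V-statistic with no bootstrap, so it yields a statistic but no p-value.

```r
# Hypothetical helper, not part of kernR: squared KSD (V-statistic) of the
# rows of `x` against a standard normal target, using the IMQ base kernel
# k(x, y) = (1 + ||x - y||^2)^beta. Unit scale and beta = -1/2 are assumptions.
ksd_vstat_sketch <- function(x, beta = -0.5) {
  d <- ncol(x)
  s <- -x                                       # score of N(0, I): grad log p(x) = -x
  gram <- tcrossprod(x)                         # x_i . x_j
  sq <- diag(gram)                              # ||x_i||^2
  d2 <- pmax(outer(sq, sq, "+") - 2 * gram, 0)  # ||x_i - x_j||^2
  q <- 1 + d2

  sx <- tcrossprod(s, x)                        # s_i . x_j
  dsx <- diag(sx)                               # s_i . x_i

  # Stein kernel: s_i.s_j k + s_i.grad_y k + s_j.grad_x k + trace(grad_x grad_y k)
  term_ss <- tcrossprod(s) * q^beta
  term_cross <- 2 * beta * q^(beta - 1) * (sx + t(sx) - outer(dsx, dsx, "+"))
  term_trace <- -4 * beta * (beta - 1) * q^(beta - 2) * d2 -
    2 * beta * d * q^(beta - 1)

  mean(term_ss + term_cross + term_trace)       # average over all pairs (i, j)
}

set.seed(1)
x_ok <- matrix(rnorm(400), ncol = 2)   # calibrated sample
x_bad <- x_ok + 0.7                    # same sample, shifted in mean
v_ok <- ksd_vstat_sketch(x_ok)
v_bad <- ksd_vstat_sketch(x_bad)
c(calibrated = v_ok, shifted = v_bad)
```

The statistic uses only the score of the target, never its normalising constant or a reference sample, which is the property the calibration section relies on. The shifted sample should produce the larger value; turning that gap into a verdict is the job of the resampling step that `ksd_test()` controls through `n_boot`.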