---
title: "Calibration and concordance: kernR's validation layer"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Calibration and concordance: kernR's validation layer}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
library(kernR)
```

`kernR` is a verdict layer: it does not fit models, it decides whether a
fitted model's output should be trusted. Two questions sit at the centre of
that role, and this vignette covers the two tests that answer them.

- **Calibration.** Does a sample actually follow the distribution it claims to
  represent? This is a one-sample, score-based question, answered by
  `ksd_test()` -- the kernel Stein discrepancy goodness-of-fit test.
- **Concordance.** Do several sources -- posterior draws from different
  inference engines, or scenario ensembles from different simulators -- agree
  with each other? This is a k-sample, sample-based question, answered by
  `concordance_test()`.

The two are complementary. Calibration compares a sample against a *density*;
concordance compares samples against *each other*. Together they let `kernR`
act as a falsification gate over an upstream inference pipeline.

## Calibration: `ksd_test()`

The kernel Stein discrepancy needs only the *score* of the target -- the
gradient of its log density -- so the target may be unnormalised and no
reference sample is required. For a multivariate-normal target the score is
available in closed form via `gaussian_score()`.

```{r ksd-gaussian}
set.seed(1)

# A sample that genuinely follows the standard normal
x_ok <- matrix(rnorm(400), ncol = 2)
ksd_test(x_ok, n_boot = 199, seed = 1)
```

The verdict is *consistent with target*: the statistic sits near zero and the
p-value is large. A mis-specified sample -- here shifted in mean -- is caught.

```{r ksd-shift}
x_bad <- x_ok + 0.7
ksd_test(x_bad, n_boot = 199, seed = 1)
```

When the target has no convenient closed-form score, supply it as a log
density and let `numeric_score()` take the gradient by finite differences. Any
additive normalising constant cancels, so an unnormalised log density is
enough.

```{r ksd-numeric}
log_density <- function(z) -0.5 * rowSums(z^2)   # standard normal, unnormalised
ksd_test(x_ok, score = numeric_score(log_density), n_boot = 199, seed = 1)
```

This is the adapter that lets `ksd_test()` consume any externally supplied
log-posterior evaluator: wrap the evaluator in `numeric_score()` and the test
asks whether posterior draws are calibrated against that posterior, with no
dependency on the producer of the evaluator.

The default base kernel is the inverse multi-quadric, which stays sensitive to
mis-specification in higher dimensions where the Gaussian kernel can lose
power; pass `kernel = "rbf"` for the Gaussian alternative.

## Concordance: `concordance_test()`

Where calibration checks one sample against a density, concordance checks
several samples against one another. The input is a list -- one element per
source -- and a named list labels the sources in the output.

```{r concordance-ok}
set.seed(2)
draws <- list(
  engine_a = matrix(rnorm(400), ncol = 2),
  engine_b = matrix(rnorm(400), ncol = 2),
  engine_c = matrix(rnorm(400), ncol = 2)
)
concordance_test(draws, n_permutations = 199, seed = 1)
```

The three sources are mutually concordant, so the verdict is *concordant*. The
test is more than a single yes-or-no, though: it returns the full pairwise
discrepancy matrix, so when one source departs the rejection can be read down
to the offending pair.

```{r concordance-bad}
draws$engine_c <- draws$engine_c + 1   # engine_c now disagrees
fit <- concordance_test(draws, n_permutations = 199, seed = 1)
fit$pairwise
```

The pairwise matrix localises the problem: the `engine_c` row and column carry
the large discrepancies, while `engine_a` and `engine_b` remain close. The
single joint-permutation null keeps the overall p-value calibrated across all
pairs, so this is one test with one verdict, not a multiple-comparison sweep.

## The validation pattern

Used together the two tests express a simple discipline. Concordance across
independent sources is corroborating evidence -- agreement is hard to fake.
Calibration against a target density is the absolute check that the agreed-upon
answer is also the correct one. Divergence on either test is informative:
`concordance_test()` localises *which* source departs, and `ksd_test()`
identifies *whether* a sample departs from its claimed target. A pipeline whose
ensembles pass both has earned more trust than one validated on the posterior
mean alone.