---
title: "Ensemble Manifests -- the Cross-Package Contract"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Ensemble Manifests -- the Cross-Package Contract}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
set.seed(20260516)
```

## Why a manifest?

A PESTO ensemble run produces a moderately large pile of state:
parameter ensemble, simulated outputs, target observations, IES
weights, the RNG seed, the lambda schedule, runtime context. To let
`kernR`, `proxymix`, and the paper-writing pipeline consume that
output **without reaching into PESTO-specific list internals**, v0.3.0
introduces a versioned S7 contract object, `pesto_ensemble_manifest`,
plus YAML+RDS persistence (with optional CSV sidecars for inspection)
and SHA-256 integrity checking.

This is Year-1 §A5 of the UQ ag-stack roadmap. It is deliberately
infrastructure rather than methodology: the value is that every
downstream consumer reads the same object, with the same provenance
guarantees, on every run.

## Constructing a manifest from an IES run

```{r}
library(PESTO)

set.seed(20260516)
npar <- 3L; nobs <- 6L; nreal <- 60L

G <- matrix(rnorm(nobs * npar), nobs, npar)
theta_true <- c(1.0, -0.5, 2.0)
y <- as.numeric(G %*% theta_true) + rnorm(nobs, sd = 0.05)
names(y) <- paste0("o", seq_len(nobs))

prior <- matrix(rnorm(nreal * npar), nreal, npar,
                dimnames = list(NULL, paste0("p", seq_len(npar))))

fit <- pesto_ies_callback(
  forward_model  = function(theta) theta %*% t(G),
  prior_ensemble = prior,
  obs            = y,
  obs_sd         = 0.05,
  noptmax        = 4L,
  verbose        = FALSE
)
```

Wrap the result:

```{r}
m <- as_manifest(fit, seed = 20260516L,
                 apsim_version = NA_character_)
print(m)
```

Slots are reachable via the standard S7 `@` accessor:

```{r}
m@run_id
m@data_hash
m@noptmax
head(m@params)
```

## Writing, reading, and verifying

`write_manifest()` emits the YAML plus three RDS sidecars (`*_params.rds`,
`*_outputs.rds`, `*_assim.rds`). RDS is used in preference to CSV so
IEEE 754 doubles round-trip bit-exactly -- the SHA-256 integrity check
would otherwise trip on formatter precision loss:

```{r}
dir <- tempfile("pesto_manifest_")
dir.create(dir)
paths <- write_manifest(m, file.path(dir, "wagga_2026_run01.yaml"))
basename(paths)
```
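
The round-trip claim can be checked in base R without PESTO. The sketch
below is illustrative only: the object and file names are hypothetical
and it does not use PESTO's own writer. It pushes the same doubles
through `saveRDS()` and `write.csv()` and compares what comes back.

```{r}
# Illustrative only: base R, independent of PESTO's own writer.
x_demo <- data.frame(value = c(0.1 + 0.2, pi, 1 / 3))

rds_file <- tempfile(fileext = ".rds")
csv_file <- tempfile(fileext = ".csv")

saveRDS(x_demo, rds_file)
write.csv(x_demo, csv_file, row.names = FALSE)

from_rds <- readRDS(rds_file)
from_csv <- read.csv(csv_file)

identical(x_demo$value, from_rds$value)          # RDS stores the raw doubles
identical(x_demo$value, from_csv$value)          # CSV goes through decimal text
isTRUE(all.equal(x_demo$value, from_csv$value))  # equal only within tolerance

unlink(c(rds_file, csv_file))
```

With R's default CSV writer the second comparison typically fails while
the third passes: the values agree to within tolerance but not bit for
bit, which is exactly the difference a cryptographic hash picks up.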

Read it back and confirm the integrity hash matches:

```{r}
m2 <- read_manifest(file.path(dir, "wagga_2026_run01.yaml"))
verify_manifest(m2)$ok
```

A peek at what the YAML actually looks like (truncated):

```{r}
# Read at most the first 14 lines; n = avoids NA padding on short files
yaml_head <- readLines(file.path(dir, "wagga_2026_run01.yaml"), n = 14L)
cat(yaml_head, sep = "\n")
```

### Inspection CSVs (optional)

For workflows that need a human-readable view of the ensemble -- quick
scans in Excel, ad-hoc exports for a domain collaborator, copy-paste
into a Quarto narrative -- `write_manifest()` accepts a
`format = "both"` argument that emits CSV sidecars **alongside** the
RDS. The hash stays bound to the RDS (bit-exact integrity preserved),
and the YAML records the inspection paths under a separate
`inspection_csv:` block so consumers can ignore them:

```{r}
dir2 <- tempfile("pesto_manifest_csv_"); dir.create(dir2)
paths2 <- write_manifest(m, file.path(dir2, "wagga_2026_run01.yaml"),
                         format = "both")
basename(paths2)
```

A `format = "csv_unverified"` mode (renamed from `"csv"` in PESTO 0.3.2)
is available for one-way exports to non-R analysts where round-trip
integrity is not required. The YAML carries
`integrity: not_verifiable` so downstream tools can branch on the
weaker contract. For such manifests `verify_manifest()` returns
`ok = NA` rather than a spurious `FALSE`: CSV formatting can change a
double in its last bits (on the order of 1 ULP), and any such change is
enough to alter the hash:

```{r}
dir3 <- tempfile("pesto_manifest_unverified_"); dir.create(dir3)
write_manifest(m, file.path(dir3, "snapshot.yaml"),
               format = "csv_unverified")
m_csv <- read_manifest(file.path(dir3, "snapshot.yaml"))
verify_manifest(m_csv)$ok      # NA -- see $message for why
```

If you need both human inspection and verifiable integrity, use
`format = "both"`. The `csv_unverified` mode is deliberately named
to flag the weaker contract at every call-site. It is for export, not
for storage you intend to re-load and trust.
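
Because `ok` can be `TRUE`, `FALSE`, or `NA`, a consumer that writes
`if (v$ok)` will error on the unverifiable case. One way to branch on
all three states is sketched below. `check_integrity()` is a
hypothetical helper, not part of PESTO, and it assumes only the `ok` and
`message` fields mentioned above.

```{r}
# Hypothetical helper (not a PESTO function). `v` is assumed to be a list
# with a logical `ok` and, optionally, a character `message`.
check_integrity <- function(v) {
  if (isTRUE(v$ok)) {
    return(invisible("verified"))
  }
  if (isFALSE(v$ok)) {
    stop("manifest integrity check failed: ", v$message, call. = FALSE)
  }
  # ok is NA: the manifest was written without a verifiable hash
  warning("manifest integrity not verifiable: ", v$message, call. = FALSE)
  invisible("not_verifiable")
}

# Stand-in results, shaped like the assumed return value
check_integrity(list(ok = TRUE))
res <- tryCatch(check_integrity(list(ok = NA, message = "CSV export")),
                warning = function(w) conditionMessage(w))
res
```

Under those assumptions, `check_integrity(verify_manifest(m_csv))` would
warn rather than stop, while a manifest whose hash does not match would
halt the pipeline.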

## Tamper-detection

If a downstream tool silently modifies the outputs sidecar (accidental
edit, partial overwrite), the recomputed hash will no longer match the
stored one:

```{r}
out_rds <- file.path(dir, "wagga_2026_run01_outputs.rds")
df <- readRDS(out_rds)
df[1, 2] <- df[1, 2] + 1e-3       # perturb one cell
saveRDS(df, out_rds, version = 3L)

m3 <- read_manifest(file.path(dir, "wagga_2026_run01.yaml"))
v  <- verify_manifest(m3)
v$ok
```

`verify_manifest()` returns the stored vs recomputed hashes so a
downstream consumer can fail fast and report the divergence cleanly.
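
To see why a change of `1e-3` in a single cell is enough, the sketch
below hashes a small data frame before and after the same kind of
perturbation. It uses `digest::digest()` with `algo = "sha256"` as a
stand-in. How PESTO actually serialises and hashes the sidecars is not
specified here, so treat this as an illustration of content hashing
rather than a reimplementation of `data_hash`. The toy values are
invented for the example.

```{r, eval = requireNamespace("digest", quietly = TRUE)}
# Illustration only: assumes the 'digest' package is installed.
# This is not PESTO's hashing routine.
toy_outputs <- data.frame(real = 1:3, o1 = c(0.51, 0.48, 0.50))

hash_before <- digest::digest(toy_outputs, algo = "sha256")

toy_tampered <- toy_outputs
toy_tampered[1, 2] <- toy_tampered[1, 2] + 1e-3   # perturb one cell

hash_after <- digest::digest(toy_tampered, algo = "sha256")

nchar(hash_before)                   # length of the hex digest
identical(hash_before, hash_after)   # compare digests before and after
```

A cryptographic hash has no notion of "close": any change to the
serialised bytes yields an unrelated digest, so the check detects that
something changed but not how much.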

## Cross-package contract

The manifest is the single object that `kernR` (HSIC identifiability,
DR-DATE counterfactuals, MMD posterior-predictive checks) and
`proxymix` (GMM density-ratio bridges) consume. By dispatching on the
S7 class, those packages never see PESTO-internal list shapes. They
read `m@params`, `m@outputs`, `m@weights`, `m@obs_target` through the
contract and let `verify_manifest()` gate on integrity.
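
As a rough sketch of what dispatching on the class looks like from the
consumer side, the block below defines a stand-in S7 class and a generic
with one method. `toy_manifest` and `effective_sample_size()` are
hypothetical names invented for this illustration. A real consumer would
register its method on the class PESTO provides, and treating the
weights as a plain numeric vector is an assumption, not something the
contract is known to guarantee.

```{r, eval = requireNamespace("S7", quietly = TRUE)}
library(S7)

# Stand-in class: NOT the real pesto_ensemble_manifest definition.
toy_manifest <- new_class(
  "toy_manifest",
  properties = list(
    run_id  = class_character,
    weights = class_numeric
  )
)

# A consumer-side generic that dispatches on its first argument
effective_sample_size <- new_generic("effective_sample_size", "x")

method(effective_sample_size, toy_manifest) <- function(x, ...) {
  w <- x@weights
  sum(w)^2 / sum(w^2)   # Kish effective sample size
}

tm <- toy_manifest(run_id = "demo", weights = c(0.25, 0.25, 0.25, 0.25))
effective_sample_size(tm)
```

The method touches only the `@weights` property, so the consumer never
depends on how the producing package lays out its internal lists. With
equal weights the Kish formula reduces to the number of realisations.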

The companion jstyle `outputs_manifest.yaml` (the project-level
artefact index) will reference per-run manifests like this one by
relative path + the same `data_hash`, so a project's manifest tree is
end-to-end hash-verifiable.

```{r, include=FALSE}
# Remove every temporary directory created in this vignette
unlink(c(dir, dir2, dir3), recursive = TRUE)
```

## Reproducibility

```{r}
sessionInfo()
```