---
title: "Ensemble Manifests -- the Cross-Package Contract"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Ensemble Manifests -- the Cross-Package Contract}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
set.seed(20260516)
```

## Why a manifest?

A PESTO ensemble run produces a moderately large pile of state: parameter ensemble, simulated outputs, target observations, IES weights, the RNG seed, the lambda schedule, runtime context. To let `kernR`, `proxymix`, and the paper-writing pipeline consume that output **without reaching into PESTO-specific list internals**, v0.3.0 introduces a versioned S7 contract object -- `pesto_ensemble_manifest` -- plus YAML+RDS persistence (with optional CSV inspection copies) and SHA-256 integrity checking.

This is Year-1 §A5 of the UQ ag-stack roadmap. It is deliberately infrastructure rather than methodology: the value is that every downstream consumer reads the same object, with the same provenance guarantees, on every run.

## Constructing a manifest from an IES run

```{r}
library(PESTO)
set.seed(20260516)

# Synthetic linear problem: 3 parameters, 6 observations, 60 realisations
npar <- 3L; nobs <- 6L; nreal <- 60L
G <- matrix(rnorm(nobs * npar), nobs, npar)
theta_true <- c(1.0, -0.5, 2.0)
y <- as.numeric(G %*% theta_true) + rnorm(nobs, sd = 0.05)
names(y) <- paste0("o", seq_len(nobs))

# Prior ensemble: one row per realisation, one column per parameter
prior <- matrix(rnorm(nreal * npar), nreal, npar,
                dimnames = list(NULL, paste0("p", seq_len(npar))))

fit <- pesto_ies_callback(
  forward_model  = function(theta) theta %*% t(G),
  prior_ensemble = prior,
  obs            = y,
  obs_sd         = 0.05,
  noptmax        = 4L,
  verbose        = FALSE
)
```

Wrap the result:

```{r}
m <- as_manifest(fit, seed = 20260516L, apsim_version = NA_character_)
print(m)
```

Slots are reachable via the standard S7 `@` accessor:

```{r}
m@run_id
m@data_hash
m@noptmax
head(m@params)
```

## Writing, reading, and verifying

`write_manifest()` emits the YAML plus three RDS sidecars (`*_params.rds`, `*_outputs.rds`, `*_assim.rds`).
RDS is used in preference to CSV so IEEE 754 doubles round-trip bit-exactly -- the SHA-256 integrity check would otherwise trip on formatter precision loss:

```{r}
dir <- tempfile("pesto_manifest_")
dir.create(dir)
paths <- write_manifest(m, file.path(dir, "wagga_2026_run01.yaml"))
basename(paths)
```

Read it back and confirm the integrity hash matches:

```{r}
m2 <- read_manifest(file.path(dir, "wagga_2026_run01.yaml"))
verify_manifest(m2)$ok
```

A peek at what the YAML actually looks like (truncated to the first 14 lines):

```{r}
# n = 14L reads at most 14 lines, so a shorter file does not print NA entries
cat(readLines(file.path(dir, "wagga_2026_run01.yaml"), n = 14L), sep = "\n")
```

### Inspection CSVs (optional)

For workflows that need a human-readable view of the ensemble -- quick scans in Excel, ad-hoc exports for a domain collaborator, copy-paste into a Quarto narrative -- `write_manifest()` accepts `format = "both"`, which emits CSV sidecars **alongside** the RDS. The hash stays bound to the RDS (bit-exact integrity preserved), and the YAML records the inspection paths under a separate `inspection_csv:` block so consumers can ignore them:

```{r}
dir2 <- tempfile("pesto_manifest_csv_"); dir.create(dir2)
paths2 <- write_manifest(m, file.path(dir2, "wagga_2026_run01.yaml"),
                         format = "both")
basename(paths2)
```

A `format = "csv_unverified"` mode (renamed from `"csv"` in PESTO 0.3.2) is available for one-way exports to non-R analysts where round-trip integrity is not required. The YAML carries `integrity: not_verifiable` so downstream tools can branch on the weaker contract.
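
The round-trip concern behind that weaker contract can be illustrated with base R alone. The sketch below is not part of PESTO: it assumes default `write.csv()` / `read.csv()` settings and a small hypothetical data frame `x`, and compares a CSV round-trip with an RDS one. Whether a particular value survives the CSV trip depends on how many significant digits the writer emits, so the CSV comparison is printed rather than asserted:

```{r}
# Hypothetical data: doubles whose decimal expansions are long
x <- data.frame(value = c(0.1 + 0.2, 1 / 3, pi, sqrt(2)))

csv_path <- tempfile(fileext = ".csv")
rds_path <- tempfile(fileext = ".rds")

write.csv(x, csv_path, row.names = FALSE)
saveRDS(x, rds_path)

x_csv <- read.csv(csv_path)
x_rds <- readRDS(rds_path)

# RDS serialises the binary representation, so the object comes back identical
identical(x, x_rds)

# CSV passes through decimal text; inspect whether the values survived
identical(x$value, x_csv$value)
max(abs(x$value - x_csv$value))

unlink(c(csv_path, rds_path))
```

A hash computed over the values inherits this behaviour: it can only be expected to match after a reload if every double comes back bit-identical.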
For a manifest written with `format = "csv_unverified"`, `verify_manifest()` returns `ok = NA` rather than a spurious `FALSE`, because CSV formatter precision loss (on the order of 1 ULP in a double) is enough to change the hash:

```{r}
dir3 <- tempfile("pesto_manifest_unverified_"); dir.create(dir3)
write_manifest(m, file.path(dir3, "snapshot.yaml"), format = "csv_unverified")
m_csv <- read_manifest(file.path(dir3, "snapshot.yaml"))
verify_manifest(m_csv)$ok  # NA -- see $message for why
```

If you need both human inspection AND verifiable integrity, use `format = "both"`. The `csv_unverified` mode is deliberately named to flag the weaker contract at every call-site. It is for export, not for storage you intend to re-load and trust.

## Tamper-detection

If a downstream tool silently modifies the outputs sidecar (accidental edit, partial overwrite), the hash will no longer match:

```{r}
out_rds <- file.path(dir, "wagga_2026_run01_outputs.rds")
df <- readRDS(out_rds)
df[1, 2] <- df[1, 2] + 1e-3  # perturb one cell
saveRDS(df, out_rds, version = 3L)

# Re-read the manifest and recompute the hash over the modified sidecar
m3 <- read_manifest(file.path(dir, "wagga_2026_run01.yaml"))
v <- verify_manifest(m3)
v$ok
```

`verify_manifest()` also returns the stored and recomputed hashes, so a downstream consumer can fail fast and report the divergence cleanly.

## Cross-package contract

The manifest is the single object that `kernR` (HSIC identifiability, DR-DATE counterfactuals, MMD posterior-predictive checks) and `proxymix` (GMM density-ratio bridges) consume. By dispatching on the S7 class, those packages never see PESTO-internal list shapes. They read `m@params`, `m@outputs`, `m@weights`, `m@obs_target` through the contract and let `verify_manifest()` gate on integrity.

The companion jstyle `outputs_manifest.yaml` (the project-level artefact index) will reference per-run manifests like this one by relative path + the same `data_hash`, so a project's manifest tree is end-to-end hash-verifiable.

```{r, include=FALSE}
# Remove every temporary directory created in this vignette
unlink(c(dir, dir2, dir3), recursive = TRUE)
```

## Reproducibility

```{r}
sessionInfo()
```