---
title: "Confidentiality and the threat model"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Confidentiality and the threat model}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
options(width = 100)
library(masque)
set.seed(1)
```

## What masque protects, and what it does not

`masque` is **not** a privacy-preserving or differential-privacy tool. It is a structurally faithful development surrogate. Read this vignette before sharing any `masque` output beyond your own machine.

The recipe returned by `mask()` is at least as sensitive as the original data. The whole design assumes that only the *synthetic* crosses the trust boundary and the recipe stays with the custodian.

| Actor holds | Wants to learn | What `masque` protects |
|---|---|---|
| Synthetic only | Original raw values | Aliased treatment / categorical vocabularies; jittered numerics; dropped ids and free text; optionally, the column names |
| Synthetic + recipe | Original raw values | Nothing -- the combination is as sensitive as the original |
| Recipe only | Original raw values | Nothing useful -- the recipe is meaningless without the synthetic |
| Synthetic + external side information | Identity of treatments / sites | Only the label vocabulary; a preserved design footprint or a `keep` column is recognisable, and side information wins |

What `masque` does: it preserves enough structure for pipelines to run unchanged, exposes the privacy-versus-fidelity trade-off through two explicit modes, records every translation in a private recipe that round-trips, and audits its own output before you share it.

What `masque` does not do: it gives no differential-privacy guarantee, it does not make output safe for public release, it does not hide rare strata, small designs, or operational metadata such as small site-by-year combinations or contact names, and it does not rewrite pipeline source code.

## Two axes of control: roles and actions

Since version 0.6.0, every column carries a `role` (what it is) and an `action` (how deeply it is masked). The mode sets the default action per role; you override any column you like. The four actions are the privacy dial:

| Action | Effect | Vocabulary visible? |
|---|---|---|
| `keep` | byte-identical pass-through | yes -- real values published |
| `scramble` | re-simulate numerics; row-permute categoricals / dates | yes -- labels stay, assignment moves |
| `alias` | scramble, then replace labels with opaque codes | no |
| `drop` | column omitted from the synthetic | n/a |

The mode chooses sensible defaults so the common case is safe without per-column work:

| Knob | `mode = "local"` | `mode = "collaborate"` |
|---|---|---|
| Treatment labels | kept | aliased (`trt_001`) |
| Categorical covariate labels | kept | aliased (`_L001`) |
| Date / time columns | row-permuted; class preserved | row-permuted; class preserved |
| Identifiers (`id`) | kept | dropped |
| Free text (`text`) | kept | dropped |
| Numeric synthesis | empirical-quantile (may emit observed values) | empirical-quantile plus within-resolution jitter; integers stochastically rounded |
| NA mask | preserved cell-by-cell | preserved cell-by-cell |
| `audit_mask()` | on demand | automatic at `mask()` time |
| `print(recipe)` | redacted | redacted; explicit `reveal_maps()` to inspect |

## Depth controls

Two further controls sharpen the depth where the defaults are not enough.

**Hiding column names.** Names themselves can identify a column (`roseworthy_yield`).
`mask(..., alias_names = TRUE)` replaces every retained column name with an opaque code, which the recipe inverts on the round-trip. Pass a character vector to hide only some names.

**Hiding design structure.** Design columns are byte-identical by default, which preserves the experimental layout exactly -- and a publicly registered trial layout is a fingerprint. To keep the structure but hide the site or block *labels*, set a design column's action to `alias`:

```{r design-alias}
df <- data.frame(
  site  = factor(rep(c("Roseworthy", "Minnipa", "Turretfield"), each = 6)),
  rep   = rep(1:3, 6),
  yield = rnorm(18)
)
roles <- propose_roles(df, detect = FALSE)
roles <- set_role(roles, "site", role = "design", action = "alias")
s <- synthetic(mask(df, roles, mode = "collaborate", seed = 1))
unique(as.character(s$site))   # real site names gone
table(s$site)                  # ... but the three-site structure intact
```

This knowingly breaks byte-identity, so it is never a default -- you ask for it explicitly.

## The leakage audit

`audit_mask()` inspects the synthetic against the original and grades the leakage of each column. In collaborate mode it runs automatically and warns at `mask()` time.

Here is a fixture built to trip it: a PII-suspected column the user retains against the default, and a categorical covariate with a frequency-one level.

```{r leaky-fixture}
df <- data.frame(
  rep            = rep(1:3, each = 20),
  variety        = factor(rep(paste0("V", 1:6), 10)),
  contact_email  = factor(rep(c("a@x", "b@y"), 30)),
  rare_treatment = factor(c(
    "only_one",
    sample(c("alpha", "beta", "gamma"), 59, replace = TRUE)
  )),
  yield          = rnorm(60, 5, 1),
  stringsAsFactors = FALSE
)
roles <- propose_roles(df, mode = "collaborate")
roles[, c("col", "role", "action", "pii_suspected")]
```

`contact_email` was auto-flagged `pii_suspected` and set to `drop`.
We override it -- pretending the custodian insists on keeping it -- and make the rare column a covariate, then mask:

```{r leaky-mask}
roles <- set_role(roles, "yield", role = "outcome")
roles <- set_role(roles, "contact_email", role = "covariate", action = "keep")
roles <- set_role(roles, "rare_treatment", role = "covariate")
m <- mask(df, roles, mode = "collaborate", seed = 1)
audit_mask(m)
```

`contact_email` (real values kept across the trust boundary) and `rare_treatment` (a frequency-one level) are flagged. `masque` records the risks in `recipe@warnings` and warns at construction time, but it does not block -- the decision to share stays with the custodian.

## Multi-table sets

When several tables share a key, `mask_set()` aliases that key *identically across all of them* so the synthetic tables still join. A linked key is the join surface; it is aliased consistently rather than permuted (permuting a key would break the join regardless of masking).

```{r set}
set_dir <- system.file("extdata", "met_set", package = "masque")
ms <- mask_set(set_dir, mode = "collaborate", seed = 1, quiet = TRUE)
ag <- synthetic(ms)$agronomy
qa <- synthetic(ms)$quality
setequal(unique(ag$gen), unique(qa$gen))   # same genotype codes in both
```

The recipe bundle is private exactly as a single recipe is; `write_set()` never writes it.

## Operational guidance

Default to collaborate mode when in doubt; local mode is for owner-only development. Treat the recipe as you treat the original data -- same security class, same access controls. Re-run `audit_mask()` before any sharing even though collaborate mode runs it for you. Never override a `pii_suspected` flag without deciding deliberately. Remember that date/time columns and a preserved design footprint both carry real operational signal; alias or roll them up when that signal is sensitive. Small designs leak -- a categorical cell count of one is high leakage, so aggregate or drop before masking.

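
The last point in the guidance, that a categorical cell count of one is high leakage, can be checked with base R before anything is passed to `mask()`. The following is a minimal sketch, not part of `masque`: the helper name `singleton_levels()` and the toy data frame are hypothetical, and `audit_mask()` may well apply different or stricter criteria.

```r
# Hypothetical helper (not a masque function): for every factor or character
# column, return the levels that occur exactly once.
singleton_levels <- function(df) {
  is_cat <- vapply(df, function(x) is.factor(x) || is.character(x), logical(1))
  out <- lapply(df[is_cat], function(x) {
    counts <- table(x, useNA = "no")
    names(counts)[counts == 1L]
  })
  # Keep only the columns that have at least one singleton level
  out[lengths(out) > 0L]
}

# Toy data (illustrative only): one treatment level appears a single time
toy <- data.frame(
  treatment = factor(c("only_one", rep(c("alpha", "beta"), each = 4))),
  site      = rep(c("A", "B", "C"), each = 3),
  yield     = c(5.1, 4.8, 5.3, 5.0, 4.9, 5.2, 5.4, 4.7, 5.0)
)

flagged <- singleton_levels(toy)
flagged
```

A non-empty result suggests merging the rare level into a broader category, or dropping the column, before masking.
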
For what the recipe stores and how the round-trip works, see *Recipe anatomy and the round-trip*.