---
title: "Confidentiality and the threat model"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Confidentiality and the threat model}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
```{r setup, include = FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
options(width = 100)
library(masque)
set.seed(1)
```
## What masque protects, and what it does not
`masque` is **not** a privacy-preserving or differential-privacy tool.
It is a structurally faithful development surrogate. Read this vignette
before sharing any `masque` output beyond your own machine.
The recipe returned by `mask()` is at least as sensitive as the original
data. The whole design assumes that only the *synthetic* crosses the
trust boundary and the recipe stays with the custodian.
| Actor holds | Wants to learn | What `masque` protects |
|---|---|---|
| Synthetic only | Original raw values | Aliased treatment / categorical vocabularies; jittered numerics; dropped ids and free text; optionally, the column names |
| Synthetic + recipe | Original raw values | Nothing -- the combination is as sensitive as the original |
| Recipe only | Original raw values | Nothing -- the recipe alone is at least as sensitive as the original data |
| Synthetic + external side information | Identity of treatments / sites | Only the label vocabulary; a preserved design footprint or a `keep` column is recognisable, and side information wins |
What `masque` does: it preserves enough structure for pipelines to run
unchanged, exposes the privacy-versus-fidelity trade-off through two
explicit modes, records every translation in a private recipe that
round-trips, and audits its own output before you share it.
What `masque` does not do: it gives no differential-privacy guarantee, it
does not make output safe for public release, it does not hide rare
strata, small designs, or operational metadata such as small
site-by-year combinations or contact names, and it does not rewrite
pipeline source code.
## Two axes of control: roles and actions
Since version 0.6.0, every column carries a `role` (what it is) and an
`action` (how deeply it is masked). The mode sets the default action per
role; you override any column you like. The four actions are the privacy
dial:
| Action | Effect | Vocabulary visible? |
|---|---|---|
| `keep` | byte-identical pass-through | yes -- real values published |
| `scramble` | re-simulate numerics; row-permute categoricals / dates | yes -- labels stay, assignment moves |
| `alias` | scramble, then replace labels with opaque codes | no |
| `drop` | column omitted from the synthetic | n/a |
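To make the four actions concrete, here is a minimal base-R sketch of the effect each one has on a single categorical column. It is an illustration under stated assumptions, not `masque`'s implementation: `apply_action()` and the `lvl_001` code format are hypothetical names invented for this example, and the numeric re-simulation that `scramble` performs is not covered.

```r
# Hypothetical helper that mimics the *effect* of each action on one factor.
# Illustrative only; this is not masque's code.
apply_action <- function(x, action = c("keep", "scramble", "alias", "drop")) {
  action <- match.arg(action)
  if (action == "keep") return(x)        # pass-through, values unchanged
  if (action == "drop") return(NULL)     # column omitted from the output
  shuffled <- x[sample.int(length(x))]   # row-permute: labels stay, assignment moves
  if (action == "scramble") return(shuffled)
  # alias: permute, then replace each label with an opaque code
  codes <- sprintf("lvl_%03d", seq_along(levels(x)))   # assumed code format
  label_map <- setNames(sample(codes), levels(x))      # real label -> code, kept private
  factor(unname(label_map[as.character(shuffled)]), levels = codes)
}

set.seed(1)
site <- factor(c("Roseworthy", "Minnipa", "Minnipa", "Turretfield"))

identical(apply_action(site, "keep"), site)   # values untouched
levels(apply_action(site, "scramble"))        # real labels still visible
levels(apply_action(site, "alias"))           # opaque codes only
is.null(apply_action(site, "drop"))           # column gone
```

Note that `alias` keeps the level frequencies intact, which is why a rare level remains a leak even after its label is hidden.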
The mode chooses sensible defaults so the common case is safe without
per-column work:
| Knob | `mode = "local"` | `mode = "collaborate"` |
|---|---|---|
| Treatment labels | kept | aliased (`trt_001`) |
| Categorical covariate labels | kept | aliased (`_L001`) |
| Date / time columns | row-permuted; class preserved | row-permuted; class preserved |
| Identifiers (`id`) | kept | dropped |
| Free text (`text`) | kept | dropped |
| Numeric synthesis | empirical-quantile (may emit observed values) | empirical-quantile plus within-resolution jitter; integers stochastically rounded |
| NA mask | preserved cell-by-cell | preserved cell-by-cell |
| `audit_mask()` | on demand | automatic at `mask()` time |
| `print(recipe)` | redacted | redacted; explicit `reveal_maps()` to inspect |
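The numeric-synthesis row can be sketched in base R. The example draws from the empirical quantile function, which can only return observed values, then adds jitter within the data's resolution and restores the NA mask. Treat it as a sketch under assumptions rather than `masque`'s algorithm: the helper names, the use of `type = 1` quantiles, and the definition of resolution as the smallest gap between distinct values are all choices made for this illustration.

```r
# Illustrative sketch only; helper names and details are assumptions.

# Empirical-quantile draw: type = 1 inverts the ECDF, so each draw is an observed value.
draw_empirical <- function(x, n = length(x)) {
  unname(quantile(x, probs = runif(n), type = 1, na.rm = TRUE))
}

# Resolution, taken here as the smallest gap between distinct observed values.
resolution <- function(x) min(diff(sort(unique(x[!is.na(x)]))))

# Jitter by at most half a resolution step, moving draws off the observed values.
jitter_within <- function(x, res) x + runif(length(x), -res / 2, res / 2)

# Stochastic rounding: round up with probability equal to the fractional part.
round_stochastic <- function(x) floor(x) + (runif(length(x)) < (x - floor(x)))

set.seed(1)
yield <- c(4.2, 5.1, NA, 4.8, 5.6, 4.9)

local_like  <- draw_empirical(yield)
collab_like <- jitter_within(draw_empirical(yield), resolution(yield))

# Preserve the NA mask cell-by-cell.
local_like[is.na(yield)]  <- NA
collab_like[is.na(yield)] <- NA

all(local_like %in% yield)            # every local-style draw is an observed value
round_stochastic(c(2.0, 2.5, 2.9))    # integers, each within one unit of the input
```

The contrast is the point: without jitter, a recipient of the synthetic sees real measurements, only detached from their rows.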
## Depth controls
Two further controls sharpen the depth where the defaults are not enough.
**Hiding column names.** Names themselves can identify a column
(`roseworthy_yield`). `mask(..., alias_names = TRUE)` replaces every
retained column name with an opaque code, which the recipe inverts on the
round-trip. Pass a character vector to hide only some names.
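The idea behind name aliasing can be shown with a small base-R sketch: replace selected names with opaque codes and keep a private map that restores them. The functions `alias_colnames()` and `restore_colnames()` and the `col_001` code format are hypothetical, written only for this illustration; they are not part of `masque`.

```r
# Illustrative sketch only; these helpers are not masque functions.
alias_colnames <- function(df, which = names(df)) {
  stopifnot(all(which %in% names(df)))
  codes <- sprintf("col_%03d", seq_along(which))   # assumed code format
  name_map <- setNames(which, codes)               # opaque code -> real name, kept private
  names(df)[match(which, names(df))] <- codes
  list(data = df, name_map = name_map)
}

restore_colnames <- function(df, name_map) {
  hit <- names(df) %in% names(name_map)
  names(df)[hit] <- unname(name_map[names(df)[hit]])
  df
}

df <- data.frame(roseworthy_yield = c(4.2, 5.1), plot = 1:2)
masked <- alias_colnames(df, which = "roseworthy_yield")

names(masked$data)                                              # identifying name hidden
identical(restore_colnames(masked$data, masked$name_map), df)   # map inverts the change
```

As with label maps, the name map is what makes the round-trip possible, so it belongs on the custodian's side of the trust boundary.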
**Hiding design structure.** Design columns are byte-identical by
default, which preserves the experimental layout exactly -- and a
publicly registered trial layout is a fingerprint. To keep the structure
but hide the site or block *labels*, set a design column's action to
`alias`:
```{r design-alias}
df <- data.frame(
  site  = factor(rep(c("Roseworthy", "Minnipa", "Turretfield"), each = 6)),
  rep   = rep(1:3, 6),
  yield = rnorm(18)
)
roles <- propose_roles(df, detect = FALSE)
roles <- set_role(roles, "site", role = "design", action = "alias")
s <- synthetic(mask(df, roles, mode = "collaborate", seed = 1))
unique(as.character(s$site)) # real site names gone
table(s$site) # ... but the three-site structure intact
```
This knowingly breaks byte-identity, so it is never a default -- you ask
for it explicitly.
## The leakage audit
`audit_mask()` inspects the synthetic against the original and grades the
leakage of each column. In collaborate mode it runs automatically and
warns at `mask()` time. Here is a fixture built to trip it: a
PII-suspected column the user retains against the default, and a
categorical covariate with a frequency-one level.
```{r leaky-fixture}
df <- data.frame(
  rep = rep(1:3, each = 20),
  variety = factor(rep(paste0("V", 1:6), 10)),
  contact_email = factor(rep(c("a@x", "b@y"), 30)),
  rare_treatment = factor(c(
    "only_one",
    sample(c("alpha", "beta", "gamma"), 59, replace = TRUE)
  )),
  yield = rnorm(60, 5, 1),
  stringsAsFactors = FALSE
)
roles <- propose_roles(df, mode = "collaborate")
roles[, c("col", "role", "action", "pii_suspected")]
```
`contact_email` was auto-flagged `pii_suspected` and set to `drop`. We
override it -- pretending the custodian insists on keeping it -- mark
`yield` as the outcome, make the rare column a covariate, then mask:
```{r leaky-mask}
roles <- set_role(roles, "yield", role = "outcome")
roles <- set_role(roles, "contact_email", role = "covariate", action = "keep")
roles <- set_role(roles, "rare_treatment", role = "covariate")
m <- mask(df, roles, mode = "collaborate", seed = 1)
audit_mask(m)
```
`contact_email` (real values kept across the trust boundary) and
`rare_treatment` (a frequency-one level) are flagged. `masque` records
the risks in `recipe@warnings` and warns at construction time, but it
does not block -- the decision to share stays with the custodian.
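The two conditions flagged above correspond to simple checks, sketched here in base R. This is not how `audit_mask()` is implemented; `singleton_levels()` and `retained_labels()` are hypothetical helpers, and the alias codes are made up. Row order is left unchanged to keep the example readable.

```r
# Illustrative checks only; not audit_mask()'s actual logic.

# Levels that occur exactly once: that row is identifiable from the count alone.
singleton_levels <- function(x) {
  counts <- table(x, useNA = "no")
  names(counts)[as.vector(counts) == 1]
}

# Labels that appear unchanged in both the original and the synthetic column.
retained_labels <- function(original, synthetic) {
  intersect(unique(as.character(original)), unique(as.character(synthetic)))
}

original <- c("only_one", "alpha", "beta", "alpha", "gamma", "beta", "gamma")
aliased  <- c("L003", "L001", "L002", "L001", "L004", "L002", "L004")

singleton_levels(original)            # the frequency-one level
singleton_levels(aliased)             # still a singleton: aliasing hides the label, not the count
retained_labels(original, original)   # a kept column exposes every real label
retained_labels(original, aliased)    # empty: no real label crosses
```

The second call shows why a frequency-one level is flagged even when the column is aliased.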
## Multi-table sets
When several tables share a key, `mask_set()` aliases that key
*identically across all of them* so the synthetic tables still join. A
linked key is the join surface; it is aliased consistently rather than
permuted (permuting a key would break the join regardless of masking).
```{r set}
set_dir <- system.file("extdata", "met_set", package = "masque")
ms <- mask_set(set_dir, mode = "collaborate", seed = 1, quiet = TRUE)
ag <- synthetic(ms)$agronomy
qa <- synthetic(ms)$quality
setequal(unique(ag$gen), unique(qa$gen)) # same genotype codes in both
```
The recipe bundle is private exactly as a single recipe is; `write_set()`
never writes it.
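Why a shared key must be aliased consistently can be seen in a small base-R sketch: one map is built from the union of keys and applied to every table, so joins on the aliased key match the same rows as before. The table contents and the `gen_001` code format are invented for the example, and this does not reproduce what `mask_set()` does internally.

```r
# Illustrative sketch only; data and code format are made up.
agronomy <- data.frame(gen = c("genA", "genB", "genA"), yield = c(4.1, 4.6, 4.3),
                       stringsAsFactors = FALSE)
quality  <- data.frame(gen = c("genB", "genA", "genC"), protein = c(11.2, 12.0, 10.8),
                       stringsAsFactors = FALSE)

# One map over the union of keys, shared by all tables.
keys <- sort(unique(c(agronomy$gen, quality$gen)))
set.seed(1)
key_map <- setNames(sample(sprintf("gen_%03d", seq_along(keys))), keys)

agronomy_syn <- transform(agronomy, gen = unname(key_map[as.character(gen)]))
quality_syn  <- transform(quality,  gen = unname(key_map[as.character(gen)]))

# The join on aliased keys has the same number of rows as the join on real keys.
nrow(merge(agronomy_syn, quality_syn, by = "gen")) ==
  nrow(merge(agronomy, quality, by = "gen"))
```

Building a separate map per table would give the same genotype different codes in each table, and the join would no longer line up.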
## Operational guidance
Default to collaborate mode when in doubt; local mode is for owner-only
development. Treat the recipe as you treat the original data -- same
security class, same access controls. Re-run `audit_mask()` before any
sharing, even though collaborate mode runs it for you. Override a
`pii_suspected` flag only as a deliberate decision. Remember that
date/time columns and a preserved design footprint both carry real
operational signal; alias or roll them up when that signal is sensitive.
Small designs leak -- a categorical cell count of one is high leakage, so
aggregate or drop before masking.
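One way to aggregate before masking is to pool rare levels into a single catch-all level. The base-R sketch below assumes a hypothetical helper, `collapse_rare()`, which is not part of `masque`. It also warns when the pooled level is itself still below the threshold, since pooling a single rare level only renames it.

```r
# Illustrative pre-masking step; `collapse_rare()` is a hypothetical helper.
collapse_rare <- function(x, min_n = 2, other = "other") {
  x <- as.character(x)
  counts <- table(x)
  rare <- names(counts)[as.vector(counts) < min_n]
  x[x %in% rare] <- other
  # Pooling does not help if the pooled level is still too small.
  if (sum(x == other, na.rm = TRUE) %in% seq_len(min_n - 1)) {
    warning("pooled level is still below min_n; consider dropping the column")
  }
  factor(x)
}

treatment <- c("only_one", "delta", "alpha", "beta", "alpha", "gamma", "beta", "gamma")

table(treatment)                  # two levels occur once
table(collapse_rare(treatment))   # both pooled into "other"; no count below 2
```

Whether pooling is acceptable depends on the analysis; when a rare level matters scientifically, dropping the column from the shared synthetic may be the safer choice.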
For what the recipe stores and how the round-trip works, see *Recipe
anatomy and the round-trip*.