---
title: "Getting started with masque"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Getting started with masque}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 6,
  fig.height = 4
)
options(width = 100)
```

# masque: development surrogates for tabular data

`masque` turns a single tabular dataset into a structurally faithful
synthetic clone that you can develop a pipeline against. It preserves
the experimental design, the NA pattern, and the global covariance of
your outcome and numeric covariates. It does **not** anonymise. Instead,
it produces a controlled substitute plus a private `recipe` that
translates between the original and synthetic namespaces, so the
finished pipeline can be run on the original data.

## Read this first: threat model

`masque` is *not* a privacy-preserving or differential-privacy tool.
It is a **structurally faithful development surrogate**. The recipe
returned by `mask()` is at least as sensitive as the original data:
never share it alongside the synthetic.

| Mode | Use case | Defaults |
|---|---|---|
| `local` | Owner develops on a realistic surrogate locally | Vocabulary preserved; numeric values may match observed |
| `collaborate` | Owner shares the synthetic with a collaborator while keeping the recipe private | Opaque aliasing of treatment and categorical-covariate levels; jitter on numerics; `ignore` columns dropped; `audit_mask()` auto-runs |

For the full threat model and limitations, see
`vignette("confidentiality")`.

## A worked example: an alpha-design field trial

We use the classical John (1987) alpha-design dataset, shipped as a
small CSV in `inst/extdata/`.

```{r load-fixture}
library(masque)

f <- system.file("extdata", "john_alpha.csv", package = "masque")
df <- read.csv(f, stringsAsFactors = TRUE)
str(df)
head(df)
```

## Step 1: propose roles, edit, validate

`propose_roles()` runs a heuristic classification of every column into
one of `{design, treatment, outcome, covariate, ignore}`. The result is
a tibble that you inspect and edit.

```{r propose-roles}
roles <- propose_roles(df)
roles
```

`plot`, `rep`, `block`, `row`, and `col` are detected as design columns
(byte-identical pass-through). `gen` is detected as a treatment factor.
`yield` defaults to `covariate`, so we re-role it as `outcome`.

```{r edit-roles}
roles$role[roles$col == "yield"] <- "outcome"
roles
```

Validation is a single call:

```{r validate-roles}
roles_validate(roles, df)
```
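
As a rough illustration of what a validation pass of this kind might
cover, the sketch below checks a toy role table by hand in base R. It
is not how `roles_validate()` is implemented: the helper
`check_roles_by_hand()` is hypothetical, and the `col` and `role`
column names are assumed from the edit above.

```{r sketch-role-check}
# Hypothetical helper, not part of masque: hand-rolled checks on a role table.
allowed_roles <- c("design", "treatment", "outcome", "covariate", "ignore")

check_roles_by_hand <- function(role_tbl, data) {
  problems <- character(0)
  # Every data column should have a role.
  no_role <- setdiff(names(data), role_tbl$col)
  if (length(no_role) > 0) {
    problems <- c(problems, paste("no role for:", toString(no_role)))
  }
  # Every role entry should point at a real column.
  not_in_data <- setdiff(role_tbl$col, names(data))
  if (length(not_in_data) > 0) {
    problems <- c(problems, paste("not in data:", toString(not_in_data)))
  }
  # Every role should be one of the five known roles.
  bad_role <- setdiff(role_tbl$role, allowed_roles)
  if (length(bad_role) > 0) {
    problems <- c(problems, paste("unknown role:", toString(bad_role)))
  }
  problems # empty when every check passes
}

toy_data <- data.frame(
  plot = 1:4,
  gen = c("G01", "G02", "G01", "G02"),
  yield = c(4.1, 3.9, 4.4, 4.0)
)
toy_roles <- data.frame(
  col = c("plot", "gen", "yield"),
  role = c("design", "treatment", "outcome"),
  stringsAsFactors = FALSE
)
check_roles_by_hand(toy_roles, toy_data)

toy_roles_bad <- toy_roles
toy_roles_bad$role[3] <- "response" # not one of the five roles
check_roles_by_hand(toy_roles_bad, toy_data)
```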

## Step 2: mask in local mode

Local mode is the owner's realistic surrogate: it preserves the
treatment vocabulary, the design pattern, and the NA mask. The
synthetic is suitable for pipeline development on the owner's machine
but not for external sharing.

```{r mask-local}
m_local <- mask(df, roles, mode = "local", seed = 2026)
```

`mask()` warns at construction time in local mode. The warning is part
of the contract: a local-mode synthetic is not for external sharing.

```{r local-summary}
m_local
```

Extract the synthetic data via `synthetic()`:

```{r local-synthetic}
synth_local <- synthetic(m_local)
head(synth_local)
```

Design columns are byte-identical to the original; treatment labels
are preserved.

```{r local-checks}
identical(synth_local$rep, df$rep)
identical(synth_local$block, df$block)
identical(levels(synth_local$gen), levels(df$gen))
```
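
The same checks can be run over several columns at once. The sketch
below uses two hypothetical base-R helpers (`passthrough_ok()` and
`na_pattern_ok()`, neither part of `masque`) on a toy data frame, with
a hand-written stand-in for the synthetic outcome.

```{r sketch-passthrough}
# Hypothetical helpers, not part of masque.
# TRUE per column when the synthetic column is identical to the original.
passthrough_ok <- function(original, synth, cols) {
  vapply(cols, function(nm) identical(original[[nm]], synth[[nm]]), logical(1))
}
# TRUE per column when NAs sit in exactly the same rows.
na_pattern_ok <- function(original, synth, cols) {
  vapply(
    cols,
    function(nm) identical(is.na(original[[nm]]), is.na(synth[[nm]])),
    logical(1)
  )
}

pt_orig <- data.frame(
  rep = c(1L, 1L, 2L, 2L),
  block = c(1L, 2L, 1L, 2L),
  yield = c(4.1, NA, 4.4, 4.0)
)
pt_synth <- pt_orig
pt_synth$yield <- c(3.8, NA, 4.6, 4.2) # stand-in for a synthetic outcome

passthrough_ok(pt_orig, pt_synth, c("rep", "block"))
na_pattern_ok(pt_orig, pt_synth, "yield")
```

In this vignette the helpers could be pointed at the real objects, for
example `passthrough_ok(df, synth_local, c("plot", "rep", "block", "row", "col"))`.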

## Step 3: mask in collaborate mode

Collaborate mode opaquely aliases the treatment and
categorical-covariate vocabularies (`G01 -> trt_001` etc.), drops
`ignore` columns, jitters numerics within their observed measurement
resolution, and auto-runs `audit_mask()`. The synthetic is suitable
for handing to a pipeline developer while the recipe stays private.

```{r mask-collab}
m_collab <- mask(df, roles, mode = "collaborate", seed = 2026)
synth_collab <- synthetic(m_collab)
head(synth_collab)
```

Note that the levels of `gen` now have the form `trt_NNN`:

```{r collab-gen}
head(levels(synth_collab$gen))
```
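
To make the idea of opaque aliasing concrete, here is a minimal base-R
sketch of a label-to-alias map. It is an illustration only:
`make_alias_map()` is a hypothetical helper, and numbering aliases in
level order is an assumption suggested by the `G01 -> trt_001` example,
not a statement about how `masque` assigns them.

```{r sketch-alias}
# Illustration only: hypothetical helper, not masque's implementation.
make_alias_map <- function(levels_orig, prefix = "trt_") {
  aliases <- sprintf("%s%03d", prefix, seq_along(levels_orig))
  setNames(aliases, levels_orig) # names = original labels, values = aliases
}

alias_gen <- factor(c("G01", "G02", "G03", "G02", "G01"))
alias_map <- make_alias_map(levels(alias_gen))
alias_map

# Re-label the factor; the underlying integer codes are untouched.
alias_gen_masked <- alias_gen
levels(alias_gen_masked) <- unname(alias_map[levels(alias_gen)])
alias_gen_masked
```

Because only the level labels change, anything that depends on which
rows share a treatment is unaffected by the aliasing.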

Original labels never leak through `print(recipe(m_collab))`:

```{r collab-recipe}
recipe(m_collab)
```

To see them you must call `reveal_maps()` explicitly.
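
The jitter on numerics can be pictured in a similar way. The sketch
below is one plausible reading of "within their observed measurement
resolution": it takes the smallest gap between distinct observed values
as the resolution and adds uniform noise of at most half that gap. Both
helpers are hypothetical and the rule `masque` applies may differ.

```{r sketch-jitter}
# Illustration only: hypothetical helpers, not masque's implementation.
# Smallest positive gap between distinct observed values.
observed_resolution <- function(x) {
  gaps <- diff(sort(unique(x[!is.na(x)])))
  if (length(gaps) == 0) {
    return(0)
  }
  min(gaps)
}

# Add uniform noise of at most half the observed resolution; NA stays NA.
jitter_within_resolution <- function(x) {
  res <- observed_resolution(x)
  x + runif(length(x), min = -res / 2, max = res / 2)
}

set.seed(1) # note: this sets the session-wide RNG state
jit_yield <- c(4.1, 3.9, NA, 4.4, 4.0)
jit_out <- jitter_within_resolution(jit_yield)

observed_resolution(jit_yield)
# Largest shift stays within half the resolution (small float tolerance).
max(abs(jit_out - jit_yield), na.rm = TRUE) <=
  observed_resolution(jit_yield) / 2 + 1e-12
# The NA pattern is unchanged.
identical(is.na(jit_out), is.na(jit_yield))
```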

## Step 4: round-trip a pipeline

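The core `masque` workflow is to develop the pipeline against the
synthetic, then translate the original data into the synthetic
namespace and run the finished pipeline on it: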

```{r roundtrip}
# Train a model against the synthetic namespace
fit <- lm(yield ~ gen + rep, data = synth_collab)

# Translate the original into the synthetic namespace via the recipe
df_in_synth <- apply_recipe(df, recipe(m_collab))
head(df_in_synth)

# Predict on the translated data
preds_synth <- predict(fit, newdata = df_in_synth)
length(preds_synth)

# Numeric predictions need no inverse map
head(preds_synth)
```
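
The translation step matters because a model fitted on aliased labels
has never seen the original ones. The toy base-R sketch below shows the
failure and the fix with a hand-written alias map. The map and the
`ns_*` objects are made up for illustration and stand in for what
`apply_recipe()` does with a real recipe.

```{r sketch-namespace}
# Toy data in the synthetic namespace (aliased treatment labels).
ns_synth <- data.frame(
  gen = factor(c("trt_001", "trt_002", "trt_001", "trt_002")),
  yield = c(4.0, 5.0, 4.2, 5.2)
)
ns_fit <- lm(yield ~ gen, data = ns_synth)

# Original data still carries the original labels.
ns_orig <- data.frame(gen = factor(c("G01", "G02")))

# Predicting on untranslated labels fails: the model has no level "G01".
ns_attempt <- tryCatch(
  predict(ns_fit, newdata = ns_orig),
  error = function(e) conditionMessage(e)
)
ns_attempt

# Translate labels with a hand-written map (hypothetical), then predict.
ns_map <- c(G01 = "trt_001", G02 = "trt_002")
ns_translated <- data.frame(
  gen = factor(
    unname(ns_map[as.character(ns_orig$gen)]),
    levels = levels(ns_synth$gen)
  )
)
ns_preds <- predict(ns_fit, newdata = ns_translated)
ns_preds
```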

If the pipeline returned factor-valued predictions (e.g., a classifier
predicting a treatment), `unmask()` translates them back into the
original vocabulary:

```{r unmask-atomic}
pred_factor_synth <- synth_collab$gen[1:5]
pred_factor_orig <- unmask(pred_factor_synth, recipe(m_collab),
  column = "gen"
)
data.frame(
  synth = as.character(pred_factor_synth),
  original = as.character(pred_factor_orig)
)
```
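
Conceptually, translating back is a lookup in the inverted alias map.
The base-R sketch below shows that idea on a hand-written map. The
helper `invert_alias()` is hypothetical and says nothing about how
`unmask()` stores or applies its maps.

```{r sketch-inverse}
# Illustration only: hypothetical helper, not masque's implementation.
invert_alias <- function(x, alias_map) {
  inverse <- setNames(names(alias_map), alias_map) # alias -> original label
  out <- unname(inverse[as.character(x)])
  # A non-missing input that maps to NA means the alias is unknown.
  if (anyNA(out[!is.na(x)])) {
    stop("some labels are not in the alias map")
  }
  factor(out, levels = names(alias_map))
}

inv_map <- c(G01 = "trt_001", G02 = "trt_002", G03 = "trt_003")
inv_pred <- factor(c("trt_002", "trt_003", NA, "trt_001"))
inv_back <- invert_alias(inv_pred, inv_map)
inv_back

# Unknown aliases are reported instead of silently becoming NA.
inv_err <- tryCatch(
  invert_alias(factor("trt_999"), inv_map),
  error = function(e) conditionMessage(e)
)
inv_err
```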

## Step 5: audit and ship

`audit_mask()` returns a per-column leakage audit. In collaborate mode
it runs automatically at `mask()` time and is stored on the object.

```{r audit}
audit_mask(m_collab)
```
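
For intuition about what a leakage audit looks for, the sketch below
computes two simple indicators in base R: the share of synthetic
numeric values that also occur in the original, and any labels shared
between the original and the synthetic. Both helpers are hypothetical,
and the checks `audit_mask()` actually performs may be different or
more extensive.

```{r sketch-leak-check}
# Illustration only: hypothetical helpers, not what audit_mask() computes.
# Share of non-missing synthetic values that also occur in the original.
value_reuse_rate <- function(original, synth) {
  synth <- synth[!is.na(synth)]
  mean(synth %in% original)
}

# Labels that appear in both the original and the synthetic.
shared_labels <- function(original, synth) {
  a <- unique(as.character(original))
  b <- unique(as.character(synth))
  intersect(a[!is.na(a)], b[!is.na(b)])
}

leak_orig_y <- c(4.1, 3.9, 4.4, 4.0)
leak_synth_y <- c(4.1, 3.85, 4.43, 4.02) # first value reused verbatim
value_reuse_rate(leak_orig_y, leak_synth_y)

shared_labels(c("G01", "G02"), c("trt_001", "trt_002")) # aliased: no overlap
shared_labels(c("G01", "G02"), c("G01", "trt_002")) # one label leaked
```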

Persist the recipe alongside the original data and treat it as equally
sensitive. Only the synthetic crosses the trust boundary.

```{r persist}
tmp <- tempfile(fileext = ".rds")
save_recipe(recipe(m_collab), tmp)
file.info(tmp)$size
rec2 <- read_recipe(tmp)
identical(rec2@masque_version, recipe(m_collab)@masque_version)
```
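
Since the recipe is sensitive, it may be worth restricting who can read
the file once it is written. The sketch below uses base R's
`Sys.chmod()` on a stand-in `.rds` file, not on a real recipe. It
assumes a Unix-like system; on Windows `Sys.chmod()` has only a limited
effect, so other access controls would be needed there.

```{r sketch-permissions}
# Stand-in file: a plain list saved with saveRDS(), not a real recipe.
perm_path <- tempfile(fileext = ".rds")
saveRDS(list(note = "stand-in for a recipe"), perm_path)

# Owner read/write only; use_umask = FALSE applies the mode exactly.
Sys.chmod(perm_path, mode = "0600", use_umask = FALSE)
perm_mode <- file.info(perm_path)$mode
perm_mode

unlink(perm_path) # clean up the temporary file
```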

## Next steps

- `vignette("confidentiality")` — the full threat model, mode comparison,
  and `audit_mask()` walk-through with deliberately leaky fixtures.
- `vignette("recipe_anatomy")` — what's inside a recipe, runtime vs full,
  `print()` vs `reveal_maps()`.
- `vignette("roadmap")` — features deliberately deferred from the
  current release.