| Title: | Structurally Faithful Development Surrogates for Tabular Data |
|---|---|
| Description: | Turns a single tabular dataset into a structurally faithful synthetic clone suitable for pipeline development. Experimental-design columns and the NA pattern are preserved exactly; treatment and categorical-covariate level vocabularies are optionally aliased; outcome and numeric-covariate values are re-simulated via a Gaussian copula that preserves the global covariance structure. A private 'recipe' object round-trips a pipeline written against the synthetic clone onto the original data. Not a differential-privacy or anonymisation tool: outputs are development surrogates, not public-release-safe artefacts. |
| Authors: | Max Moldovan [aut, cre] (ORCID: <https://orcid.org/0000-0001-9680-8474>, affiliation: Adelaide University) |
| Maintainer: | Max Moldovan <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.4.1 |
| Built: | 2026-05-31 11:35:12 UTC |
| Source: | https://github.com/max578/masque |
Forward translation: takes an original-namespace data frame and
returns it renamed and re-labelled to match the synthetic namespace
produced by mask(). Use this to run a pipeline (trained on the
synthetic) against the original data without modifying pipeline code.
apply_recipe(original, rec, check_integrity = TRUE)

| original | A data frame in the original namespace. |
|---|---|
| rec | A |
| check_integrity | Logical. When |
Operations applied (in order):
1. Verify integrity by comparing the NA mask of original to the SHA-256 fingerprint stored on the recipe (controlled by check_integrity).
2. Drop columns that mask() dropped (in collaborate mode this is every ignore column; in local mode no columns are dropped).
3. Subset and reorder to the columns the recipe knows about.
4. Re-label factors / characters for any column with a level map held by the recipe (i.e., treatment and categorical covariates in collaborate mode; or treatment in local mode with opt-in permutation). Unknown non-NA values fail closed.
5. Rename columns per recipe@column_name_map (currently NULL; reserved for a future opt-in column-aliasing flag, see vignette("roadmap")).
Numeric columns are passed through unchanged: for numeric columns the synthetic namespace is the same as the original namespace. NA cells in the input remain NA in the output (no synthesis is performed here).
A tibble in the synthetic namespace, ready for the pipeline.
r <- propose_roles(iris)
r$role[r$col == "Sepal.Length"] <- "outcome"
r$role[r$col == "Species"] <- "covariate"
m <- mask(iris, r, mode = "collaborate", seed = 1)
rec <- recipe(m)
iris_in_synth_space <- apply_recipe(iris, rec)
head(iris_in_synth_space)
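The fail-closed re-labelling step listed among the operations can be illustrated in base R. This is a minimal sketch, not masque's implementation: `relabel_fail_closed()` and the example `level_map` are hypothetical names introduced here, and the alias strings only imitate the `<col>_L01` pattern described for collaborate mode.

```r
# Hypothetical helper (not part of masque): translate original labels to
# synthetic aliases, stopping instead of returning NA when a non-NA value
# is missing from the map.
relabel_fail_closed <- function(x, level_map) {
  x_chr <- as.character(x)
  unknown <- setdiff(unique(x_chr[!is.na(x_chr)]), names(level_map))
  if (length(unknown) > 0) {
    stop("Unknown level(s) not in map: ", paste(unknown, collapse = ", "))
  }
  out <- unname(level_map[x_chr])  # NA input stays NA
  factor(out, levels = unname(level_map))
}

# Assumed example map: names are original levels, values are aliases.
level_map <- c(setosa = "Species_L01",
               versicolor = "Species_L02",
               virginica = "Species_L03")
x <- factor(c("setosa", NA, "virginica"))
relabelled <- relabel_fail_closed(x, level_map)
```

Stopping on an unknown value, rather than coercing it to NA, keeps a vocabulary mismatch from silently changing the NA pattern that the rest of the workflow relies on.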
Returns a per-column audit tibble and prints a severity-grouped report.
Auto-runs in mode = "collaborate" at mask() time and stores the
result on the masque object (m@audit); for local-mode audits or
explicit re-audits, pass the original data frame via original.
audit_mask(m, original = NULL, print = TRUE)

| m | A |
|---|---|
| original | Optional. Required when |
| print | Logical; if TRUE (default), print a styled report. |
Each row of the returned tibble holds:
- col: column name in the original.
- role: assigned role.
- kind: storage kind.
- leakage_class: low, medium, or high.
- n_unique_levels: distinct non-NA values (categorical only).
- freq_min: minimum per-level frequency (categorical only).
- exact_match_pct: percentage of synthetic cells equal to the original cell (numeric only; cell-by-cell).
- na_pct: percentage of NA cells in the original column.
- na_pattern_uniqueness: fraction of rows in the original with a globally unique NA pattern (one number per data frame, repeated on every row).
- alias_status: aliased, passthrough, or dropped.
- notes: short human summary.
Classification heuristics (CODEX-aligned):
- Retained PII-pattern column -> high.
- Treatment unaliased in collaborate -> high.
- Categorical covariate with a frequency-1 level in collaborate -> high.
- Outcome with exact-match-pct > 1%
- Numeric covariate with exact-match-pct > 5% -> medium.
- Ignore column retained in local -> low (informational).
Step 7 will lower numeric exact-match-pct under collaborate by adding
within-resolution jitter; until then, expect medium leakage on
collaborate-mode numerics.
The audit tibble, returned invisibly.
r <- propose_roles(iris)
r$role[r$col == "Sepal.Length"] <- "outcome"
r$role[r$col == "Species"] <- "covariate"
m <- mask(iris, r, mode = "collaborate", seed = 1)
audit_mask(m)
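Two of the audit columns, exact_match_pct and na_pattern_uniqueness, are defined precisely enough to sketch in base R. The helpers below are illustrative only; their names are hypothetical, and the choice to compute the match percentage over cells that are non-NA in both vectors is an assumption rather than documented masque behaviour.

```r
# Hypothetical helper: percentage of cells where the synthetic value equals
# the original value, compared cell by cell.
# Assumption: cells that are NA in either vector are excluded.
exact_match_pct <- function(orig, synth) {
  ok <- !is.na(orig) & !is.na(synth)
  100 * mean(orig[ok] == synth[ok])
}

# Hypothetical helper: fraction of rows whose NA pattern occurs exactly once
# in the whole data frame.
na_pattern_uniqueness <- function(df) {
  pattern <- apply(is.na(df), 1, function(r) paste(as.integer(r), collapse = ""))
  counts <- table(pattern)
  mean(as.vector(counts[unname(pattern)]) == 1)
}

pct <- exact_match_pct(c(1, 2, 3, 4), c(1, 5, 3, NA))
toy <- data.frame(a = c(1, 2, NA, 4), b = c(1, 2, 3, NA))
uniq <- na_pattern_uniqueness(toy)
```

In the toy frame, rows 3 and 4 each have an NA pattern that no other row shares, so half of the rows are uniquely identifiable from their missingness alone.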
Inspects df and returns an S7 design_summary describing the most
likely experimental design — one of "CRD", "RCBD",
"IBD/alpha-lattice", "row-column", "split-plot", "factorial",
or "none" (observational / no detectable design).
detect_design(df, roles = NULL, interactive = FALSE, threshold = 0.5, tie_delta = 0.02)

| df | A data frame. |
|---|---|
| roles | Optional roles tibble (as returned by |
| interactive | If |
| threshold | Minimum top-rule score for a class to be reported. Below this, |
| tie_delta | Score difference within which two rules are treated as tied. Default |
Detection runs six independent rules; each returns a score. The
orchestrator picks the highest-scoring class above threshold. Ties
within tie_delta are broken in favour of the simpler design
(CRD < RCBD < factorial < IBD < row-column < split-plot).
The detector never edits df. Its job is to recommend a role
assignment, surface the evidence, and (optionally) draw a sanity
check via plot().
An S7 design_summary object with slots class_label,
treatment_col, block_cols, whole_plot_col, sub_plot_col,
spatial_cols, scores, evidence, recommended_roles,
candidates, warnings.
propose_roles() for the role tibble that feeds detection;
plot_design_summary() for the sanity-check visualisation.
# Classic alpha-lattice (24 genotypes, 3 reps, 6 blocks per rep).
if (requireNamespace("agridat", quietly = TRUE)) {
  d <- agridat::john.alpha
  ds <- detect_design(d)
  print(ds)
}

# Observational data frame -> class_label "none".
detect_design(mtcars)
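The selection rule described for detection (threshold, then tie-breaking toward the simpler design) can be sketched as follows. `pick_design()` is a hypothetical function written for illustration; it assumes one score per class and is not the orchestrator used inside detect_design().

```r
# Hypothetical sketch of threshold + tie-break selection.
# `scores` is a named numeric vector, one entry per candidate class.
pick_design <- function(scores, threshold = 0.5, tie_delta = 0.02) {
  # Simplicity order taken from the text: earlier means simpler.
  simplicity <- c("CRD", "RCBD", "factorial",
                  "IBD/alpha-lattice", "row-column", "split-plot")
  stopifnot(all(names(scores) %in% simplicity))
  top <- max(scores)
  if (top < threshold) return("none")
  # Every class within tie_delta of the best score counts as tied.
  tied <- names(scores)[top - scores <= tie_delta]
  simplicity[min(match(tied, simplicity))]
}

pick_design(c(RCBD = 0.80, "IBD/alpha-lattice" = 0.81))
```

Here the two scores differ by less than tie_delta, so the simpler class (RCBD) is reported even though the alpha-lattice rule scored marginally higher.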
Takes one data frame and a user-edited roles tibble (from
propose_roles()) and produces a synthetic clone whose experimental
design and NA pattern are preserved, while outcome and numeric-covariate
values are re-simulated via a Gaussian copula and categorical-covariate
values are row-permuted. Returns a masque S7 object holding the
synthetic data and a private masque_recipe.
mask(df, roles, mode = c("local", "collaborate"), seed = NULL, ...)

| df | A data frame. |
|---|---|
| roles | A tibble produced by |
| mode | Either |
| seed | Optional integer for reproducibility. |
| ... | Currently ignored. |
mode = "local" keeps original column / level vocabularies and warns
that the synthetic is for owner development only. mode = "collaborate"
opaque-aliases treatment and categorical-covariate level vocabularies
(trt_001, <col>_L01) and drops ignore columns; the resulting
synthetic can be passed to a collaborator while the recipe stays
private. In collaborate mode, numeric draws are jittered within
their measurement resolution, integer columns are stochastically
rounded, and audit_mask() runs automatically.
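Stochastic rounding, mentioned above for integer columns, is a standard technique: a value is rounded up with probability equal to its fractional part, so the rounded values are unbiased on average. The sketch below shows the general idea only; `stochastic_round()` is a hypothetical helper, and whether masque rounds in exactly this way is an assumption.

```r
# Hypothetical helper: round each value down or up at random, rounding up
# with probability equal to the fractional part.
stochastic_round <- function(x) {
  lower <- floor(x)
  frac <- x - lower
  lower + (runif(length(x)) < frac)
}

set.seed(1)
draws <- stochastic_round(rep(2.3, 10000))
mean(draws)  # close to 2.3, whereas round(2.3) would always give 2
```

Deterministic rounding would map every 2.3 to 2 and bias the column mean downward; the stochastic version keeps the mean near the continuous draw while still emitting whole numbers.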
A masque S7 object. Use synthetic() and recipe() to
extract the components.
- design: Byte-identical pass-through.
- treatment: Local: pass-through (optional opt-in seeded permutation via roles$mask_levels = "permute"). Collaborate: opaque alias trt_NNN.
- outcome + numeric covariate: Re-simulated jointly via a Gaussian copula on global Pearson covariance. Empirical-quantile marginals (type 1: returns observed values).
- covariate: Row-permuted within non-NA positions. Local: vocabulary preserved. Collaborate: opaque alias <col>_LNN.
- ignore: Local: passes through. Collaborate: dropped.
RNG state is preserved across the call.
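One common way to get this behaviour in R is to snapshot `.Random.seed` before seeding and restore it on exit. The sketch below illustrates that pattern; `with_local_seed()` is a hypothetical helper, and it is an assumption that mask() uses a similar mechanism.

```r
# Hypothetical helper: evaluate `expr` under a temporary seed, then restore
# the caller's RNG state exactly as it was.
with_local_seed <- function(seed, expr) {
  genv <- globalenv()
  had_seed <- exists(".Random.seed", envir = genv, inherits = FALSE)
  old_seed <- if (had_seed) get(".Random.seed", envir = genv) else NULL
  on.exit({
    if (had_seed) {
      assign(".Random.seed", old_seed, envir = genv)
    } else if (exists(".Random.seed", envir = genv, inherits = FALSE)) {
      rm(".Random.seed", envir = genv)
    }
  })
  if (!is.null(seed)) set.seed(seed)
  expr  # lazy argument: evaluated here, after the seed is set
}

set.seed(42)
state_before <- .Random.seed
a <- with_local_seed(1, runif(3))
b <- with_local_seed(1, runif(3))
```

After both calls the global RNG state equals `state_before`, and `a` and `b` are identical because each call ran under the same temporary seed.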
propose_roles(), roles_validate(), synthetic(),
recipe(), reveal_maps().
r <- propose_roles(iris)
r$role[r$col == "Sepal.Length"] <- "outcome"
m <- suppressWarnings(mask(iris, r, seed = 1))
head(synthetic(m))
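For readers unfamiliar with the technique, the following base-R sketch shows a generic Gaussian copula resampler with type-1 empirical-quantile marginals. It is not masque's code: `copula_resample()` is a hypothetical function, it handles complete numeric data only, and estimating the dependence from the Pearson correlation of rank-based normal scores is an assumption about one reasonable way to do it.

```r
# Hypothetical sketch: resample a complete numeric matrix with a Gaussian copula.
copula_resample <- function(X) {
  X <- as.matrix(X)
  stopifnot(is.numeric(X), !anyNA(X))
  n <- nrow(X)
  p <- ncol(X)
  # 1. Map each column to normal scores through its ranks.
  Z <- apply(X, 2, function(x) qnorm(rank(x) / (n + 1)))
  # 2. Estimate the dependence structure on the normal-score scale.
  R <- cor(Z)
  # 3. Draw correlated standard normals with that correlation.
  Z_new <- matrix(rnorm(n * p), nrow = n) %*% chol(R)
  # 4. Map back through each column's empirical quantiles.
  #    type = 1 returns observed values only (no interpolation).
  U <- pnorm(Z_new)
  out <- vapply(seq_len(p), function(j) {
    quantile(X[, j], probs = U[, j], type = 1, names = FALSE)
  }, numeric(n))
  colnames(out) <- colnames(X)
  out
}

set.seed(1)
synth <- copula_resample(iris[, 1:4])
head(synth)
```

Because type-1 quantiles return observed values, every synthetic cell is a value that occurs somewhere in the original column. That is why the audit reports an exact-match percentage for numeric columns.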
Plots the structure that drove the detect_design() verdict. The
panel layout depends on the detected class:
plot_design_summary(x, df, engine = c("base", "ggplot2"), ...)

| x | A |
|---|---|
| df | The data frame that was passed to |
| engine | |
| ... | Ignored. |
- CRD, factorial, none -> frequency-of-treatment + NA-pattern.
- RCBD, IBD/alpha-lattice -> treatment x block replication tile.
- row-column -> spatial layout tile (row x col, fill = treatment).
- split-plot -> factor-nesting tree + within-block replication.
Output is purely diagnostic; do not use it as a publication figure
(use desplot::desplot() or ggplot2-based packages for that).
The input x, invisibly. Called for the plot side-effect.
if (requireNamespace("agridat", quietly = TRUE)) {
  d <- agridat::john.alpha
  ds <- detect_design(d)
  plot(ds, df = d)
}
Generates a heuristic role tibble for every column of df. The user is
expected to inspect this tibble and edit it before passing it to mask().
Heuristics are seeds, not law.
propose_roles(df, detect = TRUE)

| df | A data frame. Must have at least one column. |
|---|---|
| detect | Logical scalar (default |
Roles are exactly one of:
- design: Byte-identical pass-through. Trial / site / replicate / block / plot / row / column / year etc.
- treatment: Same factor cardinality and per-level frequency; optional label aliasing or seeded permutation.
- outcome: Re-simulated via Gaussian copula. Multiple allowed.
- covariate: Numeric: Gaussian copula (joint with outcomes). Categorical: row-permuted, levels preserved (local) or aliased (collaborate).
- ignore: Dropped or passed through depending on mask() options; auto-assigned for date/time, free text, and PII-pattern names.
Default classification rules, applied in order:
1. PII-pattern column names (contact, email, phone, gps, latitude/longitude, postcode, ssn, password, owner, farmer, operator, etc., case-insensitive substring) -> ignore with pii_suspected = TRUE.
2. Date / POSIXct / POSIXlt / difftime columns -> ignore.
3. ID-pattern names (\\bid\\b, _id$, ^id_) -> ignore.
4. Design-pattern names (rep, block, row, col(umn)?, range, plot(no)?, site, env(ironment)?, trial, year, season, colrep, tos) -> design.
5. Treatment-pattern names (treatment, variety, cultivar, genotype, ^trt, ^dose) -> treatment.
6. Character columns with > 50% unique values among non-NA values -> ignore (likely free text).
7. Everything else -> covariate. The user re-classifies one or more columns as outcome.
Failing to designate at least one outcome is a hard error at mask()
time (via roles_validate()).
Since masque 0.3.0, propose_roles() also calls detect_design() by
default (detect = TRUE) and applies the detected design's
recommended_roles on top of the name-based heuristic. This promotes
structurally-identified block / treatment columns even when the
column names do not match the design / treatment regexes. The
resulting design summary is stashed as attr(roles, "design") so the
user can plot() it or inspect alternates. Pass detect = FALSE to
recover the v0.2.x name-only behaviour byte-for-byte.
A tibble with one row per column, containing:
- col: column name.
- role: one of design, treatment, outcome, covariate, ignore.
- kind: storage kind (numeric, integer, factor, character, logical, date, datetime, other).
- freq_or_range: brief summary string (range for numeric, level count for factor, etc.).
- pii_suspected: TRUE if column name matches a PII pattern.
- notes: short explanation of the auto-classification.
roles_validate() for the fail-closed validation applied at
mask() time.
propose_roles(iris)
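The ordered, name-based part of the classification can be approximated with a few regular expressions. The sketch below is deliberately reduced: `guess_role()` is a hypothetical function, it uses only a subset of the listed patterns, it skips the content-based rules (date classes, free-text detection), and anchoring the design and treatment names to the whole column name is an assumption.

```r
# Hypothetical, simplified name-only classifier; rules are tried in order
# and the first match wins.
guess_role <- function(col_name) {
  nm <- tolower(col_name)
  # PII patterns: case-insensitive substring match (subset of the list).
  if (grepl("email|phone|postcode|password", nm)) return("ignore")
  # ID patterns.
  if (grepl("\\bid\\b|_id$|^id_", nm, perl = TRUE)) return("ignore")
  # Design patterns (subset), assumed to match the whole name.
  if (grepl("^(rep|block|row|col(umn)?|site|trial|year)$", nm)) return("design")
  # Treatment patterns.
  if (grepl("^(treatment|variety|cultivar|genotype)$|^trt|^dose", nm)) {
    return("treatment")
  }
  "covariate"
}

cols <- c("plot_id", "block", "variety", "owner_email", "yield")
roles <- vapply(cols, guess_role, character(1))
```

With these inputs, `yield` falls through to covariate, which is why the user must re-classify at least one column as outcome before calling mask().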
Loads a recipe written by save_recipe(). Validates that the file
contains a masque_recipe and informs (does not error) if the recipe
was written by a different package version than the one currently
installed.
read_recipe(path)

| path | File path. |
|---|---|
A masque_recipe object.
The recipe is private: at least as sensitive as the original data.
Never share alongside the synthetic. By default print(recipe(m))
redacts all level maps; use reveal_maps() for an explicit reveal.
recipe(m)

| m | A |
|---|---|
A masque_recipe S7 object.
synthetic(), reveal_maps(), mask().
r <- propose_roles(iris)
r$role[r$col == "Sepal.Length"] <- "outcome"
m <- suppressWarnings(mask(iris, r, seed = 1))
recipe(m)
Explicit, audited reveal: prints each original -> synthetic level map
held by the recipe. Use sparingly. Recipe maps are at least as
sensitive as the original data; printing them defeats the redaction
built into print() and summary().
reveal_maps(rec)

| rec | A |
|---|---|
rec, invisibly.
r <- propose_roles(iris)
r$role[r$col == "Sepal.Length"] <- "outcome"
r$role[r$col == "Species"] <- "covariate"
m <- suppressWarnings(mask(iris, r, mode = "collaborate", seed = 1))
rec <- recipe(m)
reveal_maps(rec)
Fail-closed validation of a roles tibble before mask() consumes it.
Errors are raised for every misuse the v0.2 spec calls out.
roles_validate(roles, df = NULL)

| roles | A tibble produced by |
|---|---|
| df | Optional data frame. If supplied, |
Hard errors:
- missing required columns (col, role, kind);
- unknown role string (not in c("design","treatment","outcome","covariate","ignore"));
- any NA role;
- zero columns flagged outcome;
- more than one column flagged treatment (joint-treatment masking is not yet supported by mask());
- duplicate col entries;
- if df supplied: any df column missing from roles, or any roles column missing from df.
Returns the validated roles invisibly (mirrors stopifnot-style use).
roles, invisibly.
r <- propose_roles(iris)
r$role[r$col == "Sepal.Length"] <- "outcome"
roles_validate(r, iris)
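The hard-error checks listed for this function map directly onto a handful of base-R tests. The function below is a hedged sketch of that logic, not the package's validator: `validate_roles_sketch()` is a hypothetical name and the error messages are invented for illustration.

```r
# Hypothetical sketch of fail-closed role validation.
validate_roles_sketch <- function(roles, df = NULL) {
  allowed <- c("design", "treatment", "outcome", "covariate", "ignore")
  missing_cols <- setdiff(c("col", "role", "kind"), names(roles))
  if (length(missing_cols) > 0) {
    stop("missing required columns: ", paste(missing_cols, collapse = ", "))
  }
  if (anyNA(roles$role)) stop("NA role found")
  if (!all(roles$role %in% allowed)) stop("unknown role string")
  if (sum(roles$role == "outcome") == 0) stop("no column flagged outcome")
  if (sum(roles$role == "treatment") > 1) stop("more than one treatment column")
  if (anyDuplicated(roles$col) > 0) stop("duplicate col entries")
  if (!is.null(df) && !setequal(roles$col, names(df))) {
    stop("roles and df columns do not match")
  }
  invisible(roles)  # returned invisibly, stopifnot-style
}

good <- data.frame(col = c("y", "trt"),
                   role = c("outcome", "treatment"),
                   kind = c("numeric", "factor"))
validate_roles_sketch(good)
```

Each check stops at the first violation, so a roles tibble that passes is safe to hand to downstream code without further defensive checks.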
Writes the recipe to a single .rds file. The default is
runtime-minimal:
no simulator state (copula covariance, raw margins) is written, only the
translation maps, factor metadata, storage classes, integrity fingerprint,
and warnings. This keeps the saved artefact small and reduces the
information that would leak if the recipe file alone were shared.
save_recipe(rec, path, include_simulator = FALSE)

| rec | A |
|---|---|
| path | File path. By convention, |
| include_simulator | Logical. Reserved for a future release. Currently a no-op (recipe is always written runtime-minimal). |
include_simulator = TRUE is accepted but is currently a no-op: the
recipe does not carry simulator state. The flag is reserved for a
future release that will let read_recipe() regenerate fresh
synthetic samples without access to the original data (see
vignette("roadmap")).
Recipes are at least as sensitive as the original data. Protect the saved file at the same security class as the original.
path, invisibly.
r <- propose_roles(iris)
r$role[r$col == "Sepal.Length"] <- "outcome"
m <- mask(iris, r, mode = "collaborate", seed = 1)
tmp <- tempfile(fileext = ".rds")
save_recipe(recipe(m), tmp)
rec2 <- read_recipe(tmp)
Replaces the latitude / longitude values in a masqued data frame with coordinates anchored at user-supplied centroids and clustered to preserve the original's site-count-per-anchor structure. The function never reads the real coordinates beyond counting how many distinct sites the original holds per anchor level — so it leaks the replication-per-site distribution and the count of distinct sites, nothing more.
synthesise_geospatial(synth, original, anchor_col, lat_col, lon_col, anchor_centroids, site_spread_deg = 0.6, jitter_deg = 0.05, seed = NULL)

| synth | A synthetic data frame (typically |
|---|---|
| original | The original data frame from which |
| anchor_col | Name of the column whose levels anchor each cluster (e.g., |
| lat_col, lon_col | Column names of the latitude and longitude fields to overwrite in |
| anchor_centroids | Named list keyed by anchor levels; each element is a length-2 numeric named |
| site_spread_deg | Half-width of the box (in decimal degrees) around each anchor centroid within which fake site centroids are uniformly placed. Default |
| jitter_deg | Within-site uniform jitter (in decimal degrees) added to each row's assigned site centroid. Default |
| seed | Optional integer seed for reproducibility. The function uses |
Typical use: after mask() produces a synthetic with copula-drawn or
missing coordinates, call synthesise_geospatial() to substitute
plausible points. The synthetic ends up with:
- the same number of distinct sites per anchor level (e.g., if the original has five distinct trial sites in NSW, the synthetic will have five fake sites in NSW);
- the same replication-per-site distribution (each fake site receives a share of the synthetic rows proportional to its real counterpart's count);
- site-level clustering (small jitter within site; larger spread between sites within each anchor centroid's neighbourhood).
What the function does not preserve:
- the real positions of the sites (they are random within a user-defined neighbourhood of each anchor centroid);
- the relative spacing or bearings between real sites;
- any spatial autocorrelation in the outcome.
Coordinates that are NA in the original remain NA in the
synthetic — the NA pattern is preserved cell-by-cell.
synth, with lat_col and lon_col overwritten by the
re-anchored coordinates.
# Toy example: 50 rows split across two states.
set.seed(1)
n <- 50
df <- data.frame(
  state = sample(c("NSW", "VIC"), n, replace = TRUE),
  lat = stats::rnorm(n, -33, 0.3),
  lon = stats::rnorm(n, 145, 0.3),
  y = stats::rnorm(n)
)
roles <- propose_roles(df, detect = FALSE)
roles$role[roles$col == "y"] <- "outcome"
roles$role[roles$col %in% c("lat", "lon")] <- "covariate"
roles$role[roles$col == "state"] <- "design"
m <- mask(df, roles, mode = "collaborate", seed = 1L)
centroids <- list(
  NSW = c(lat = -32.5, lon = 147),
  VIC = c(lat = -36.5, lon = 144)
)
synth_geo <- synthesise_geospatial(
  synthetic(m), df,
  anchor_col = "state",
  lat_col = "lat",
  lon_col = "lon",
  anchor_centroids = centroids,
  seed = 2L
)
head(synth_geo[, c("state", "lat", "lon")])
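The placement scheme described for this function (fake site centroids in a box around each anchor centroid, then per-row jitter) can be sketched without masque. `place_fake_sites()` is a hypothetical helper. It assigns rows to sites uniformly at random, which simplifies the proportional allocation described above, and it treats both spread parameters as half-widths, which is an assumption for jitter_deg.

```r
# Hypothetical sketch: generate clustered fake coordinates per anchor level.
# `n_sites` is a named vector giving the number of fake sites per anchor.
place_fake_sites <- function(anchor, n_sites, centroids,
                             site_spread_deg = 0.6, jitter_deg = 0.05) {
  lat <- lon <- rep(NA_real_, length(anchor))
  for (lvl in names(centroids)) {
    rows <- which(anchor == lvl)
    if (length(rows) == 0) next
    k <- n_sites[[lvl]]
    # Fake site centroids: uniform in a box around the anchor centroid.
    site_lat <- centroids[[lvl]][["lat"]] + runif(k, -site_spread_deg, site_spread_deg)
    site_lon <- centroids[[lvl]][["lon"]] + runif(k, -site_spread_deg, site_spread_deg)
    # Simplification: rows are assigned to sites uniformly at random.
    site <- sample.int(k, length(rows), replace = TRUE)
    # Small uniform jitter around the assigned site centroid.
    lat[rows] <- site_lat[site] + runif(length(rows), -jitter_deg, jitter_deg)
    lon[rows] <- site_lon[site] + runif(length(rows), -jitter_deg, jitter_deg)
  }
  data.frame(lat = lat, lon = lon)
}

set.seed(2)
anchor <- rep(c("NSW", "VIC"), times = c(6, 4))
centroids <- list(NSW = c(lat = -32.5, lon = 147),
                  VIC = c(lat = -36.5, lon = 144))
coords <- place_fake_sites(anchor, n_sites = c(NSW = 3, VIC = 2), centroids)
```

No real coordinate enters the computation: the only inputs derived from the original data are the anchor labels and the per-anchor site counts.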
Extract the synthetic data from a masque object.
synthetic(m)

| m | A |
|---|---|
A tibble: the synthetic data frame.
r <- propose_roles(iris)
r$role[r$col == "Sepal.Length"] <- "outcome"
m <- suppressWarnings(mask(iris, r, seed = 1))
head(synthetic(m))
Inverse of apply_recipe(). Accepts a data frame or an atomic vector.
For an atomic factor / character vector with a recipe that holds
multiple level maps, column must name which map to invert. Atomic
numeric, integer, logical, and Date / POSIXct vectors are
returned unchanged (no inverse map applies — these are pass-through
under apply_recipe() too).
unmask(x, rec, column = NULL)

| x | A data frame or an atomic vector to translate from synthetic-namespace to original-namespace. |
|---|---|
| rec | A |
| column | Optional column name. Only consulted for atomic factor / character |
The most common pattern is round-tripping pipeline predictions:
fit <- my_model(synthetic(m))                             # train on synthetic
orig_in_synth_space <- apply_recipe(original, recipe(m)) # forward
preds_synth <- predict(fit, orig_in_synth_space)
preds_orig <- unmask(preds_synth, recipe(m))              # inverse
Unknown levels (synthetic aliases not in the recipe's map) fail
closed with an informative error rather than silently coercing to
NA.
An object of the same type as x, in the original namespace.
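The inverse translation of an atomic character or factor vector amounts to flipping a level map and refusing unknown aliases. The sketch below illustrates that idea in base R; `invert_levels()` and the example map are hypothetical and do not reflect how the recipe stores its maps internally.

```r
# Hypothetical helper: map synthetic aliases back to original labels,
# failing closed on any alias that is not in the map.
invert_levels <- function(x, level_map) {
  # level_map: names are original levels, values are synthetic aliases.
  inverse <- stats::setNames(names(level_map), level_map)
  x_chr <- as.character(x)
  unknown <- setdiff(unique(x_chr[!is.na(x_chr)]), names(inverse))
  if (length(unknown) > 0) {
    stop("Unknown alias(es) not in map: ", paste(unknown, collapse = ", "))
  }
  unname(inverse[x_chr])  # NA stays NA
}

level_map <- c(control = "trt_001", low = "trt_002", high = "trt_003")
preds_synth <- c("trt_003", "trt_001", NA)
preds_orig <- invert_levels(preds_synth, level_map)
```

An alias such as `trt_999` raises an error instead of becoming NA, so a pipeline that emits labels outside the recipe's vocabulary is caught at the translation boundary.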