Package 'masque' reference manual

Title:	Structurally Faithful Development Surrogates for Tabular Data
Description:	Turns a confidential tabular dataset - a single table, a folder of files, or a multi-sheet workbook - into a structurally faithful synthetic clone suitable for pipeline development. Experimental-design allocation and the NA pattern are preserved exactly; categorical design, treatment, and covariate vocabularies can be reversibly aliased; outcome and numeric-covariate values are re-simulated via a Gaussian copula that preserves the global covariance structure. Columns shared across tables are aliased consistently so the synthetic tables still join. A private 'recipe' object round-trips a pipeline written against the synthetic clone onto the original data. Not a differential-privacy or anonymisation tool: outputs are development surrogates, not public-release-safe artefacts.
Authors:	Max Moldovan [aut, cre] (ORCID: <https://orcid.org/0000-0001-9680-8474>, affiliation: Adelaide University)
Maintainer:	Max Moldovan <[email protected]>
License:	MIT + file LICENSE
Version:	0.9.2
Built:	2026-07-23 05:12:12 UTC
Source:	https://github.com/max578/masque

Translate a data frame into the synthetic namespace

Description

Forward translation: takes an original-namespace data frame and returns it renamed and re-labelled to match the synthetic namespace produced by mask(). Use this to run a pipeline (trained on the synthetic) against the original data without modifying pipeline code.

Usage

apply_recipe(original, rec, check_integrity = TRUE)

Arguments

original

A data frame in the original namespace.

rec

A masque_recipe object (e.g. from recipe(m)).

check_integrity

Logical. When TRUE (default), verifies that the NA mask of original matches the recipe's recorded integrity_fp. Mismatches error with guidance. Pass FALSE to bypass when the missingness has legitimately changed since the recipe was built.

Details

Operations applied (in order):

Verify integrity by comparing the NA mask of original to the SHA-256 fingerprint stored on the recipe (controlled by check_integrity).
Drop columns that mask() dropped (in collaborate mode this is every ignore column; in local mode no columns are dropped).
Subset and reorder to the columns the recipe knows about.
Re-label factors / characters for any column with a level map held by the recipe (i.e., treatment and categorical covariates in collaborate mode; or treatment in local mode with opt-in permutation). Unknown non-NA values fail closed.
Rename columns per recipe@column_name_map when mask() was called with alias_names (otherwise NULL, and names are unchanged).

Numeric columns are passed through unchanged: the synthetic-namespace for numeric columns is the same as the original. NA cells in the input remain NA in the output (no synthesis is performed here).
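The integrity check and the fail-closed re-labelling step can be illustrated in base R. This is a minimal sketch, not the package's implementation: level_map, relabel_fail_closed() and same_na_mask() are hypothetical names, and the NA masks are compared directly rather than through the SHA-256 fingerprint the recipe stores.

```r
# Hypothetical level map: original label -> synthetic alias.
level_map <- c(
  setosa     = "Species_L001",
  versicolor = "Species_L002",
  virginica  = "Species_L003"
)

# Re-label a factor / character column; unknown non-NA values raise an error.
relabel_fail_closed <- function(x, map) {
  x <- as.character(x)
  unknown <- setdiff(unique(x[!is.na(x)]), names(map))
  if (length(unknown) > 0L) {
    stop("Values not in the level map: ", paste(unknown, collapse = ", "))
  }
  unname(map[x]) # NA cells stay NA
}

# Compare two NA masks cell by cell (stand-in for the fingerprint check).
same_na_mask <- function(a, b) {
  identical(unname(is.na(a)), unname(is.na(b)))
}

relabelled <- relabel_fail_closed(factor(c("setosa", NA, "virginica")), level_map)
mask_ok <- same_na_mask(iris, iris)
```

Failing closed means an unexpected label stops the translation instead of leaking through unaliased.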

Value

A tibble in the synthetic namespace, ready for the pipeline.

Examples

r <- propose_roles(iris)
r$role[r$col == "Sepal.Length"] <- "outcome"
r$role[r$col == "Species"] <- "covariate"
m <- mask(iris, r, mode = "collaborate", seed = 1)
rec <- recipe(m)
iris_in_synth_space <- apply_recipe(iris, rec)
head(iris_in_synth_space)

Audit a masque object for leakage and shareability risks

Description

Returns a per-column audit tibble and prints a severity-grouped report. Auto-runs in mode = "collaborate" at mask() time and stores the result on the masque object (m@audit); for local-mode audits or explicit re-audits, pass the original data frame via original.

Usage

audit_mask(m, original = NULL, print = TRUE)

Arguments

m

A masque object from mask().

original

Optional. Required when m@audit is NULL (typically in local mode). Used to recompute exact-match-pct etc. on demand.

print

Logical; if TRUE (default), print a styled report.

Details

Each row of the returned tibble holds:

col: column name in the original.
role: assigned role.
kind: storage kind.
leakage_class: low, medium, or high.
n_unique_levels: distinct non-NA values (categorical only).
freq_min: minimum per-level frequency (categorical only).
exact_match_pct: percentage of synthetic cells equal to the original cell (numeric only; cell-by-cell).
na_pct: percentage of NA cells in the original column.
na_pattern_uniqueness: fraction of rows in the original with a globally unique NA pattern (one number per data frame, repeated on every row).
alias_status: aliased, passthrough, or dropped.
notes: short human summary.

Classification heuristics (CODEX-aligned):

Retained PII-pattern column -> high.
Treatment unaliased in collaborate -> high.
Categorical covariate with a frequency-1 level in collaborate -> high.
Outcome with exact-match-pct > 1%
Numeric covariate with exact-match-pct > 5% -> medium.
Ignore column retained in local -> low (informational).

Step 7 will lower numeric exact-match-pct under collaborate by adding within-resolution jitter; until then, expect medium leakage on collaborate-mode numerics.
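To make the numeric audit columns concrete, the sketch below computes them in base R. The helper names are hypothetical, and the choice of denominator for exact_match_pct (cells that are non-NA in both vectors) is an assumption; the package may count cells differently.

```r
# Percentage of synthetic cells equal to the original cell, cell by cell.
# Assumption: only cells that are non-NA in both vectors are counted.
exact_match_pct <- function(orig, synth) {
  ok <- !is.na(orig) & !is.na(synth)
  if (!any(ok)) return(NA_real_)
  100 * mean(orig[ok] == synth[ok])
}

# Percentage of NA cells in a column.
na_pct <- function(x) 100 * mean(is.na(x))

# Fraction of rows whose NA pattern occurs exactly once in the data frame.
na_pattern_uniqueness <- function(df) {
  pattern <- apply(is.na(df), 1L, function(r) paste(as.integer(r), collapse = ""))
  counts <- table(pattern)
  mean(as.vector(counts[pattern]) == 1L)
}

orig  <- c(1.0, 2.5, NA, 4.0)
synth <- c(1.0, 3.1, NA, 4.2)
toy   <- data.frame(a = c(1, NA, 3, 4), b = c(1, 2, NA, 4))

match_pct  <- exact_match_pct(orig, synth)  # one of three comparable cells matches
missing    <- na_pct(orig)
uniqueness <- na_pattern_uniqueness(toy)    # rows 2 and 3 have unique patterns
```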

Value

The audit tibble, returned invisibly.

Examples

r <- propose_roles(iris)
r$role[r$col == "Sepal.Length"] <- "outcome"
r$role[r$col == "Species"] <- "covariate"
m <- mask(iris, r, mode = "collaborate", seed = 1)
audit_mask(m)

Tidy a dirty table's column names and category labels before masking

Description

Real custodian tables arrive with column names that are not valid R names ("Yield (t/ha)", "Site Name"), leading or trailing whitespace in names and factor / character values, and the occasional near-duplicate label ("north" vs "North" vs " north"). clean_table() makes the safe fixes - legalising names and trimming whitespace - loudly, and reports the unsafe ones (near-duplicate labels) without touching them, because merging two labels that only look alike is a judgement call masque must not make silently.

Usage

clean_table(df, clean = c("auto", "report", "off"), quiet = FALSE)

Arguments

df

A data frame.

clean

One of "auto" (default - legalise names, trim whitespace, report near-duplicates), "report" (legalise names, report what whitespace / near-duplicate changes would be made but apply none), or "off" (legalise names only, skip all other hygiene). Column-name legalisation is applied in every mode – an invalid name silently rewritten downstream corrupts the clone – and is surfaced as a masque_name_repaired warning; only the whitespace and near-duplicate handling is governed by the mode.

quiet

Logical. When FALSE (default) a cli summary of the fixes and advisories is printed. Set TRUE to suppress it (the report object is returned either way).

Details

The corrections are returned alongside the cleaned data so mask() can record them in the recipe and apply_recipe() can re-apply the identical cleaning to a fresh copy of the original. Cleaning is therefore part of the round-trip contract, not a destructive pre-step.
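The three kinds of finding can be sketched in base R. The sketch assumes make.names() as a stand-in for the package's name legalisation and a case-insensitive comparison as a stand-in for its near-duplicate test; neither is confirmed to be what clean_table() does internally.

```r
df <- data.frame(
  `Site Name` = c("north ", "north", "North"),
  `Yield (t/ha)` = c(3.1, 2.9, 5.0),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Safe fix 1: legalise column names and record original -> clean.
clean_names <- make.names(names(df), unique = TRUE)
name_map <- setNames(clean_names, names(df))
name_map <- name_map[names(name_map) != name_map]

# Safe fix 2: trim whitespace in values and record original -> clean.
raw <- df[["Site Name"]]
trimmed <- trimws(raw)
changed <- !is.na(raw) & raw != trimmed
level_fixes <- setNames(trimmed[changed], raw[changed])

# Report only: labels that differ only by case are listed, never merged.
labels <- unique(trimmed[!is.na(trimmed)])
key <- tolower(labels)
near_duplicates <- labels[key %in% key[duplicated(key)]]
```

Keeping the name and level maps is what lets the same cleaning be replayed later on a fresh copy of the original.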

Value

An object of class masque_cleaning: a list with

data - the cleaned (or, under report / off, unchanged) data frame;
name_map - named character original -> clean for every column whose name changed (empty if none);
level_fixes - named list, one entry per column whose values were trimmed, each a named character original -> clean;
near_duplicates - a data frame of report-only label pairs (col, a, b, kind) that look like typos but were left untouched;
mode - the clean mode applied.

Examples

df <- data.frame(
  `Site Name` = c("north ", "north", "South"),
  `Yield (t/ha)` = c(3.1, 2.9, 5.0),
  check.names = FALSE
)
cl <- clean_table(df, quiet = TRUE)
names(cl$data)
cl$near_duplicates

Detect environment scope and experimental-design structure

Description

Inspects df and returns an S7 design_summary. Environment scope and experimental-design class are separate conclusions: a table can be a multi-environment trial (MET) even when its within-environment randomisation cannot be recovered from the recorded columns.

Usage

detect_design(
  df,
  roles = NULL,
  interactive = FALSE,
  threshold = 0.5,
  tie_delta = 0.02,
  env = NULL
)

Arguments

df

A data frame.

roles

Optional roles tibble (as returned by propose_roles()). When supplied, declared roles constrain candidate generation, and columns roled treatment define the treatment basis.

interactive

If TRUE, when the top-2 rule scores are within tie_delta the user is asked to choose between them via a cli menu. Default FALSE.

threshold

Minimum top-rule score for a class to be reported. Below this, class_label is "none". Default 0.5.

tie_delta

Score difference within which two rules are treated as tied. Default 0.02 — tight enough that 0.05-point score differences (the typical name-bonus / coverage gap) are decisive.

env

Environment specification. NULL performs conservative automatic resolution and leaves ambiguous or weak evidence unresolved. FALSE disables MET handling and runs the legacy whole-table path. A character vector names one or more columns whose interaction defines the environment.

Details

With env = NULL, exact environment names and a bounded set of site-year patterns are assessed conservatively. A site-only candidate auto-resolves only when treatments are replicated across sites. Weak or competing evidence produces an explicit uncertain result rather than a guessed single trial. Supply env to define the environment basis, or use env = FALSE to run the pre-0.9 whole-table path exactly.

After the scope step, the pooled legacy detector runs six independent design rules. Each returns a score in [0, 1]. The highest-scoring class above threshold is one of "CRD", "RCBD", "IBD/alpha-lattice", "row-column", "split-plot", "factorial", or "none". Ties within tie_delta favour the simpler design. For a MET, per-environment classes, treatment connectivity, and near-disjoint experiment groups are diagnostics only. Dense connectivity calculations that exceed the package safety bound are reported as not_computed. None of these diagnostics proves the original randomisation protocol.

The detector never edits df. Its job is to recommend a role assignment, surface the evidence, and (optionally) draw a sanity check via plot().
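The interplay of threshold and tie_delta can be sketched as a small selection function. pick_class() is a hypothetical helper, and treating the order of the score vector as the simplest-to-most-complex ranking is an assumption made for illustration; the package's own simplicity ordering is not documented here.

```r
# scores: named numeric vector, assumed ordered from simplest to most complex.
pick_class <- function(scores, threshold = 0.5, tie_delta = 0.02) {
  best <- max(scores)
  if (best < threshold) {
    return("none")
  }
  # Every rule within tie_delta of the best score counts as tied;
  # the first (simplest) tied rule wins.
  tied <- scores >= best - tie_delta
  names(scores)[which(tied)[1L]]
}

scores <- c(
  CRD = 0.40, RCBD = 0.81, "IBD/alpha-lattice" = 0.82,
  "row-column" = 0.30, "split-plot" = 0.10, factorial = 0.20
)
chosen <- pick_class(scores)
```

Here RCBD and IBD/alpha-lattice differ by 0.01, inside the default tie_delta, so the simpler RCBD is reported.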

Value

An S7 design_summary object. Legacy design fields include class_label, treatment_col, block_cols, whole_plot_col, sub_plot_col, spatial_cols, scores, evidence, recommended_roles, candidates, and warnings. Scope fields include scope_label, scope_status, scope_confidence, is_met, env_cols, env_method, n_env, group_cols, connectivity, per_env, and within_design_label.

Examples

# Classic alpha-lattice (24 genotypes, 3 reps, 6 blocks per rep).
if (requireNamespace("agridat", quietly = TRUE)) {
  d <- agridat::john.alpha
  ds <- detect_design(d)
  print(ds)
}

# Observational data frame -> class_label "none".
detect_design(mtcars)

# Explicit two-environment trial.
met <- expand.grid(
  env = factor(c("E1", "E2")),
  rep = factor(seq_len(2L)),
  gen = factor(c("G1", "G2", "G3"))
)
met$yield <- seq_len(nrow(met))
met_design <- detect_design(met, env = "env")
met_design@scope_label
met_design@connectivity$status

Coarsen geographic coordinates by an on-land privacy jitter

Description

Displaces each latitude / longitude pair by a small random distance, in place, so a synthetic table can carry realistic coordinates without revealing the true location of a field, farm or site. Two geographic-masking schemes are offered. The default "donut" displaces every point by a distance drawn uniformly (by area) from an annulus between min_km and max_km, in a random direction: this guarantees a minimum displacement (a privacy floor – no point is left almost where it started) while bounding the maximum drift. The "gaussian" scheme adds independent normal noise of standard deviation sd_km to each axis; it is simpler but has an unbounded tail, so a handful of points can move much further than intended.

Usage

jitter_coordinates(
  df,
  lat_col,
  lon_col,
  method = c("donut", "gaussian"),
  min_km = 5,
  max_km = 20,
  sd_km = 10,
  on_land = TRUE,
  max_tries = 100L,
  seed = NULL
)

Arguments

df

A data frame containing the coordinate columns.

lat_col, lon_col

Column names of the latitude and longitude (numeric, in decimal degrees).

method

"donut" (default) or "gaussian".

min_km, max_km

Inner and outer radii of the donut, in kilometres (used when method = "donut"). Every point moves at least min_km and at most max_km.

sd_km

Per-axis standard deviation in kilometres (used when method = "gaussian").

on_land

Controls the land constraint. TRUE (default) rejects any displacement that lands in the sea, using maps::map.where() (requires the maps package). A function function(lon, lat) returning a logical vector supplies your own test (for example an sf-based high-resolution coastline). FALSE applies no constraint.

max_tries

Maximum re-draws per point before giving up and leaving it unchanged (with a warning). Default 100.

seed

Optional integer seed for reproducibility. The caller's RNG state is preserved.

Details

The displacement is computed in kilometres and converted to degrees with a cos(latitude) correction on longitude, so the true ground distance matches the requested magnitude at any latitude. Each candidate is checked against a land mask and re-drawn until it falls on land, so a coastal point is never pushed offshore. The original NA pattern is preserved cell-by-cell and the two axes stay paired (if either coordinate is missing, both are set to NA).
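The donut draw and the degree conversion described above can be sketched in base R. donut_offset() is a hypothetical helper, the constant of 111.32 km per degree is an approximation assumed here, and the sketch omits the land check, NA handling and rounding that jitter_coordinates() performs.

```r
donut_offset <- function(lat, lon, min_km = 5, max_km = 20) {
  n <- length(lat)
  # Uniform by area over the annulus: draw r^2 uniformly, then take the root.
  r_km <- sqrt(runif(n, min_km^2, max_km^2))
  theta <- runif(n, 0, 2 * pi)
  km_per_deg <- 111.32 # approximate km per degree of latitude (assumption)
  dlat <- (r_km * cos(theta)) / km_per_deg
  # A degree of longitude shrinks with cos(latitude), so divide by it.
  dlon <- (r_km * sin(theta)) / (km_per_deg * cos(lat * pi / 180))
  data.frame(lat = lat + dlat, lon = lon + dlon, r_km = r_km)
}

set.seed(1)
lat <- c(-34.9, -35.2, -33.6)
lon <- c(138.6, 142.0, 148.2)
moved <- donut_offset(lat, lon)
```

Drawing the radius directly from a uniform distribution would over-sample the inner edge of the ring; taking the square root of a uniform draw on r^2 spreads points evenly over the ring's area.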

Value

df, with lat_col and lon_col overwritten by the jittered coordinates (rounded to five decimal places, about one metre).

Choosing the magnitude

The right displacement is not a universal constant: it is calibrated to the density of the entities you are protecting, so that the masked point is spatially k-anonymous (roughly, at least k comparable entities lie closer to the masked point than the true one). Individual-level urban health data is typically masked with a standard deviation of about 1 km, because cities are dense. Agricultural fields and farms are orders of magnitude sparser, so a comparable level of protection needs a much larger displacement – a donut of roughly 5 to 20 km (the default) moves a point across several properties while keeping it in the same agroclimatic region. For a formal guarantee, calibrate min_km / max_km to the local field density to hit a target k-anonymity rather than relying on the default.

References

Hampton, K. H., Fitch, M. K., Allshouse, W. B., Doherty, I. A., Gesink, D. C., Leone, P. A., Serre, M. L., & Miller, W. C. (2010). Mapping health data: improved privacy protection with donut method geomasking. American Journal of Epidemiology, 172(9), 1062-1069. doi:10.1093/aje/kwq248

Zandbergen, P. A. (2014). Ensuring confidentiality of geocoded health data: assessing geographic masking strategies for individual-level data. Advances in Medicine, 2014, 567049. doi:10.1155/2014/567049

Examples

df <- data.frame(
  site = c("A", "B", "C"),
  lat  = c(-34.9, -35.2, -33.6),
  lon  = c(138.6, 142.0, 148.2)
)
# Move each site 5-20 km, staying on land (needs the `maps` package):
if (requireNamespace("maps", quietly = TRUE)) {
  jitter_coordinates(df, "lat", "lon", min_km = 5, max_km = 20, seed = 1)
}

Mask a tabular dataset into a structurally faithful development surrogate

Description

Takes one data frame and a user-edited roles table (from propose_roles(), possibly adjusted with set_role()) and produces a synthetic clone according to each column's action. Returns a masque S7 object holding the synthetic data and a private masque_recipe.

Usage

mask(
  df,
  roles,
  mode = c("local", "collaborate"),
  seed = NULL,
  clean = c("auto", "report", "off"),
  alias_names = FALSE,
  conditional = FALSE,
  coords = NULL,
  .shared_maps = list(),
  ...
)

Arguments

df

A data frame.

roles

A roles table from propose_roles() (possibly edited). Tables from masque <= 0.5.0 are upgraded with a deprecation warning; see roles_validate().

mode

Either "local" or "collaborate". When omitted, inherit attr(roles, "mode"), falling back to "local" for a roles table with no mode provenance. An explicit collaborate-to-local downgrade raises a classed masque_mode_downgrade warning.

seed

Optional integer for reproducibility.

clean

Label and column-name hygiene before masking, passed to clean_table(): one of "auto" (default - trim whitespace and report near-duplicate labels), "report" (report only), or "off" (skip). Invalid column names are legalised in every mode – an invalid name silently rewritten during synthesis corrupts the clone – and the repair is raised as a masque_name_repaired warning. The roles table's column references are remapped to the legalised names, the fixes are recorded in the recipe, and apply_recipe() / unmask() reverse them on the round-trip.

alias_names

Hide the column names themselves. FALSE (the default) keeps them. TRUE replaces every retained column name with an opaque alias (col_001, col_002, ... in column order). A character vector names just the columns to alias. The original-to-alias map is stored in the recipe and inverted by apply_recipe() / unmask(), so a pipeline written against the aliased synthetic round-trips. Column names are the last identifying surface a kept or design column exposes; alias them when even the schema is sensitive.

conditional

Logical scalar (default FALSE). The collaborate-grade conditional clone. When FALSE, scrambled numeric columns are re-simulated from one global Gaussian copula - marginals and global covariance survive, but the treatment-to-outcome relationship does not, so a causal model fitted on the clone recovers a null effect. When TRUE, the numeric block is re-simulated within each treatment-by-design stratum, so a row's synthetic outcome inherits the location of the treatment that row carries. A causal model fitted on the conditional clone recovers the real treatment effect within sampling tolerance - the data-side analogue of preserving a conditional mean embedding rather than a pooled marginal. The conditioning columns (treatment plus retained design) are recorded on the recipe. With no treatment or design column to condition on, the path degrades cleanly to the global copula and a note is emitted.

coords

Optional geographic-coordinate declaration. Supply one or more latitude/longitude pairs and each is coarsened in place by an on-land jitter (see jitter_coordinates()) instead of being copula-scrambled into implausible locations. A pair is a named vector c(lat = "lat_col", lon = "lon_col") or a named list that also carries jitter parameters (method, min_km, max_km, sd_km, on_land); pass several pairs as a list. Defaults to a donut of 5-20 km on land. A declared pair always survives masking, coarsened; the recipe records that it was coarsened and apply_recipe() retargets to the real coordinates.

.shared_maps

Internal. A named list of pre-computed original -> alias level maps for cross-table linked columns, set by mask_set(). Not for direct use.

...

Must be empty. An unused argument (for example a misspelled name) errors rather than being silently ignored.

Details

mode = "local" marks the synthetic for owner development only; the reminder is recorded on the recipe and shown when the object prints. mode = "collaborate" additionally jitters re-simulated numerics within their measurement resolution (stochastically rounding integers) and runs audit_mask() automatically; a HIGH finding is raised as a classed warning (masque_high_leakage) and blocks the package-managed writers (masque()'s out, write_set()) until it is resolved or explicitly overridden. Which columns are aliased, kept, or dropped is decided by the action column of roles - propose_roles() resolves mode-appropriate defaults, so the table you reviewed is the plan that runs.

Collaborate mode adjusts the transformations and runs the audit; it does not model where the output will go. Whether a synthetic table is appropriate for a given collaborator, environment, or jurisdiction is a release decision that stays with the data custodian - masque informs that decision, it does not make it.

Value

A masque S7 object. Use synthetic() and recipe() to extract the components.

Behaviour by action

keep: Byte-identical pass-through, both modes.
scramble: Numeric outcome / covariate columns are re-simulated jointly via a Gaussian copula on the global Pearson covariance, with empirical-quantile marginals. Categorical, date, and text columns are row-permuted within non-NA positions, class preserved. Treatment columns get a seeded label permutation - the assignment structure never moves.
alias: As scramble where applicable, plus opaque label substitution: treatments become trt_NNN (<col>_trt_NNN when two or more treatment factors are aliased), categorical covariates <col>_LNNN, design labels <col>_DNNN (in place - structure intact), ids <col>_INNN (in place - row linkage intact), text values <col>_TNNN.
drop: Column excluded from the synthetic, both modes.

The NA mask of every retained column is preserved cell-by-cell. RNG state is preserved across the call.
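The scramble action for numeric columns can be illustrated with a textbook Gaussian copula in base R. This is a sketch under stated assumptions, not the package's code: copula_resimulate() is a hypothetical helper, the dependence is estimated from rank-based normal scores, and the pairwise correlation matrix is assumed to be positive definite so that chol() succeeds.

```r
copula_resimulate <- function(df) {
  stopifnot(all(vapply(df, is.numeric, logical(1))))
  n <- nrow(df)
  # 1. Map each column to normal scores through its ranks (NA stays NA).
  z <- vapply(
    df,
    function(x) qnorm(rank(x, na.last = "keep") / (sum(!is.na(x)) + 1)),
    numeric(n)
  )
  # 2. Estimate the global dependence structure on the normal scale.
  sigma <- cor(z, use = "pairwise.complete.obs")
  # 3. Draw fresh correlated normals with that structure.
  z_new <- matrix(rnorm(n * ncol(df)), nrow = n) %*% chol(sigma)
  # 4. Back-transform through empirical quantiles, then restore the NA mask.
  out <- df
  for (j in seq_along(df)) {
    u <- pnorm(z_new[, j])
    out[[j]] <- as.numeric(quantile(df[[j]], probs = u, na.rm = TRUE, names = FALSE))
    out[[j]][is.na(df[[j]])] <- NA
  }
  out
}

set.seed(1)
d <- iris[1:4]
d[c(3, 10), 2] <- NA
s <- copula_resimulate(d)
```

Because the back-transform uses each column's own empirical quantiles, synthetic values stay within the observed range, while row-level links to the original values are broken.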

Examples

r <- propose_roles(iris)
r <- set_role(r, "Sepal.Length", role = "outcome")
m <- mask(iris, r, seed = 1)
head(synthetic(m))

Mask a multi-table set with cross-table-consistent aliasing

Description

The set-level counterpart to mask(). Takes a folder of tabular files, an Excel workbook, or a named list of data frames and produces one synthetic table per input table plus a single private recipe bundle. Columns that are shared across tables - a site code, a genotype name, a plot id appearing in several tables - are aliased identically everywhere they occur, so a join written against the synthetic set still resolves on the masked data.

Usage

mask_set(
  input,
  roles = NULL,
  links = NULL,
  mode = c("local", "collaborate"),
  seed = NULL,
  clean = c("auto", "report", "off"),
  alias_names = FALSE,
  conditional = FALSE,
  quiet = FALSE
)

Arguments

input

A folder path, an .xlsx / .xls path, or a named list of data frames - anything read_set() accepts.

roles

Optional named list of roles tables, one per input table (names must match). When NULL (default), propose_roles() runs on each table for the chosen mode. Supply edited tables for full control.

links

Optional override for link detection: a character vector of column names to treat as cross-table links, or FALSE to disable linking entirely (each table masked independently). When NULL (default), links are detected automatically.

mode

Either "local" or "collaborate". When omitted and roles are supplied, inherit their common mode. Tables prepared for different modes must be reconciled explicitly. Otherwise default to "local".

seed

Optional integer for reproducibility.

clean

Hygiene mode passed to clean_table() for every table ("auto", "report", or "off").

alias_names

Hide column names. FALSE (default) keeps them; TRUE aliases every non-link column (linked join keys keep their names so the synthetic set stays joinable).

conditional

Logical scalar (default FALSE). Passed through to each per-table mask() call: when TRUE, every table's numeric block is re-simulated within its own treatment-by-design strata so the treatment-to-outcome relationship survives the clone. See mask() for the full account.

quiet

Logical; suppress the link / hygiene report.

Value

A masque_set S7 object. Use synthetic() for the named list of synthetic tables and recipe() for the private recipe bundle.

Links

A link is a column name that appears in two or more tables with a compatible kind and overlapping values - the join keys of the set. mask_set() proposes links automatically and prints them; pass links to override. Linked columns are aliased in place with a shared map (built over the union of values across all tables), so identical original values map to identical aliases in every table. Set a linked column's action to keep in its roles table to pass it through unmasked instead.
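The shared-map idea can be sketched in base R: build one alias map over the union of values, then apply it to every table. build_shared_map() and apply_map() are hypothetical helpers and the alias format is illustrative only.

```r
tables <- list(
  plots = data.frame(site = c("A", "A", "B", "C"), yield = c(3.1, 2.9, 4.0, 3.7)),
  sites = data.frame(site = c("A", "B"), rainfall = c(420, 560))
)

# One map over the union of values across all tables (NA is dropped by sort()).
build_shared_map <- function(tables, col) {
  values <- sort(unique(unlist(lapply(tables, function(t) as.character(t[[col]])))))
  aliases <- sprintf("%s_L%03d", col, sample(length(values)))
  setNames(aliases, values)
}

# Replace the column in place using the shared map.
apply_map <- function(t, col, map) {
  t[[col]] <- unname(map[as.character(t[[col]])])
  t
}

set.seed(1)
site_map <- build_shared_map(tables, "site")
masked <- lapply(tables, apply_map, col = "site", map = site_map)
joined <- merge(masked$plots, masked$sites, by = "site")
```

Because both tables draw from the same map, the join on the masked set returns as many rows as the join on the originals.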

Examples

tables <- list(
  plots = data.frame(
    site = c("A", "A", "B", "B"),
    gen = c("x", "y", "x", "y"),
    yield = c(3.1, 2.9, 4.0, 3.7)
  ),
  sites = data.frame(
    site = c("A", "B"),
    rainfall = c(420, 560)
  )
)
m <- mask_set(tables, mode = "collaborate", seed = 1, quiet = TRUE)
synthetic(m)$plots$site
synthetic(m)$sites$site # same aliases -> the join still works

Mask a dataset end to end with one guided call

Description

The front door. masque() walks the whole procedure - read the data, propose column roles, (in an interactive session) pause for you to review the plan, mask, audit, and optionally write the result - from a single call. It dispatches on the input: a single file or data frame goes through mask(); a folder, an Excel workbook, or a named list of tables goes through mask_set().

Usage

masque(
  input,
  roles = NULL,
  out = NULL,
  mode = c("local", "collaborate"),
  seed = NULL,
  clean = c("auto", "report", "off"),
  alias_names = FALSE,
  conditional = FALSE,
  ask = interactive(),
  overwrite = FALSE,
  quiet = FALSE,
  allow_high = FALSE
)

Arguments

input

A data frame, a single tabular file (.csv / .tsv / .fst), a folder of such files, an Excel workbook, or a named list of data frames.

roles

Optional. A roles table (single-table input) or named list of roles tables (set input). When supplied, the interactive review is skipped.

out

Optional output path. For a single table, a .csv file (or .xlsx with writexl). For a set, a folder (one CSV per table) or an .xlsx workbook. When NULL (default) nothing is written.

mode

Either "local" (default) or "collaborate".

seed

Optional integer for reproducibility.

clean

Hygiene mode passed to clean_table() ("auto", "report", "off").

alias_names

Hide column names; see mask() / mask_set().

conditional

Logical scalar (default FALSE). The conditional clone mode passed through to mask() / mask_set(): when TRUE, numeric columns are re-simulated within each treatment-by-design stratum so the treatment-to-outcome relationship survives the clone. See mask() for the full account.

ask

Whether to pause for interactive review when roles is not supplied. Defaults to interactive(). Set FALSE to proceed with the proposed plan without prompting.

overwrite

Passed to the writer when out is set.

quiet

Suppress progress messages. Warnings - including the HIGH-leakage finding - are never suppressed.

allow_high

Logical (default FALSE). When out is set and the collaborate-mode audit flagged HIGH leakage, the write is refused. Pass TRUE to write anyway after your own review; the override raises a masque_high_override warning and is recorded in the recipe's warnings, so the exception stays auditable.

Details

It is also fully scriptable. Pass an edited roles table (or named list of them) to skip the interactive review, and an out path to write the masked result in one go. The returned object is the same masque / masque_set you would get from the lower-level verbs, so anything you can do with those you can do with the result here.

Value

A masque object (single-table input) or a masque_set object (set input), invisibly. Use synthetic() and recipe().

The guided flow

Read the input into one or more clean rectangular tables.
Propose roles for every column (skipped if you pass roles).
Review - in an interactive session with no roles supplied, the proposed plan is printed and you are asked to proceed, edit, or stop. Editing opens the roles table in the spreadsheet editor (utils::edit()) where the platform provides one; when it cannot start (for example macOS without XQuartz, or a headless session), a console editor takes over - pick a column, then a role and an action from numbered menus, with every change applied through set_role() so default actions re-resolve exactly as on the scriptable path. With ask = FALSE (the default in non-interactive use) the proposed plan is used as-is, with a note.
Mask the data in the chosen mode.
Audit - in collaborate mode the leakage audit runs and its headline is printed. A HIGH finding surfaces as a classed warning (masque_high_leakage) - it is never silenced by the guided flow.
Write - if out is set, the masked data is written there (mirroring the input format), unless the audit left unresolved HIGH findings: then the write is refused (nothing is written) and the flagged columns are listed. Resolve them and mask again, or pass allow_high = TRUE after your own review - the override is warned (masque_high_override) and recorded on the recipe. The private recipe is never written automatically; persist it yourself with save_recipe().
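The write gate in the last step can be sketched as a small guard in base R. guarded_write() is a hypothetical helper that assumes an audit table with col and leakage_class columns; it only illustrates the refuse-unless-overridden logic and is not how masque() is implemented.

```r
guarded_write <- function(synth, audit, out, allow_high = FALSE) {
  flagged <- audit$col[audit$leakage_class == "high"]
  if (length(flagged) > 0L && !allow_high) {
    # Refuse: nothing is written and the flagged columns are listed.
    message("Write refused; HIGH leakage in: ", paste(flagged, collapse = ", "))
    return(invisible(FALSE))
  }
  if (length(flagged) > 0L) {
    # Explicit override: warn so the exception stays visible.
    warning("Writing despite HIGH leakage in: ", paste(flagged, collapse = ", "))
  }
  utils::write.csv(synth, out, row.names = FALSE)
  invisible(TRUE)
}

audit <- data.frame(col = c("yield", "site"), leakage_class = c("low", "high"))
out_path <- tempfile(fileext = ".csv")
refused <- suppressMessages(guarded_write(iris, audit, out_path))
```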

Examples

# Scripted single-table use (no prompt because roles are supplied):
r <- propose_roles(iris)
r <- set_role(r, "Sepal.Length", role = "outcome")
m <- masque(iris, roles = r, seed = 1, ask = FALSE, quiet = TRUE)
head(synthetic(m))

Sanity-check visualisation for detected scope and design

Description

Plots the structure that drove the detect_design() verdict. A detected MET (multi-environment trial) uses a compact environment-coverage overview by default; set environment to draw one selected environment. For a single trial, the panel depends on the detected design class, as listed under Details.

Usage

plot_design_summary(
  x,
  df,
  engine = c("base", "ggplot2"),
  environment = NULL,
  ...
)

Arguments

x

A design_summary object from detect_design().

df

The data frame that was passed to detect_design(). It is used to draw replication tiles, spatial layouts, and the NA-pattern. The data are deliberately not stored in the summary.

engine

"base" (default) or "ggplot2". The latter requires ggplot2 and falls back to "base" with a warning if unavailable.

environment

Optional single environment label. For a detected MET, the default is a compact all-environment overview. Supplying a label draws the selected environment's field layout or legacy design diagnostic.

...

Ignored.

Details

CRD, factorial, none -> frequency-of-treatment + NA-pattern.
RCBD, IBD/alpha-lattice -> treatment x block replication tile.
row-column -> spatial layout tile (row x col, fill = treatment).
split-plot -> factor-nesting tree + within-block replication.

Output is purely diagnostic. Do not use it as a publication figure (use desplot::desplot() or ggplot2-based packages for that).

Value

The input x, invisibly. Called for the plot side-effect.

Examples

if (requireNamespace("agridat", quietly = TRUE)) {
  d <- agridat::john.alpha
  ds <- detect_design(d)
  plot(ds, df = d)
}

met <- expand.grid(
  env = factor(c("E1", "E2")),
  rep = factor(seq_len(2L)),
  gen = factor(c("G1", "G2", "G3"))
)
met$yield <- seq_len(nrow(met))
met_design <- detect_design(met, env = "env")
plot(met_design, df = met)

Propose role and action classifications for the columns of a data frame

Description

Generates a heuristic two-axis classification for every column of df: a role (what the column is) and an action (what mask() will do to it). The user is expected to inspect this table and edit it - directly or via set_role() - before passing it to mask(). Heuristics are seeds, not law.

Usage

propose_roles(df, mode = c("local", "collaborate"), detect = TRUE)

Arguments

df

A data frame. Must have at least one column.

mode

The masking mode the table is being prepared for: "local" (default) or "collaborate". Stored as attr(roles, "mode") and used to resolve default actions.

detect

Logical scalar (default TRUE). When TRUE, run detect_design() and overlay its recommended role hints.

Value

A tibble with one row per column: col, role, action, kind (storage kind: numeric, integer, factor, character, logical, date, datetime, other), freq_or_range, pii_suspected, and notes. The target mode is stored as attr(roles, "mode").

The two axes

role describes the column and determines the mechanics of any synthesis:

design: Experimental / structural columns (site, block, rep, plot, year). Mechanics of alias: labels are substituted in place, structure intact.
treatment: Assignment columns (variety, genotype, dose). Labels are remapped in place - the assignment structure never moves. scramble = seeded label permutation; alias = opaque labels (trt_001).
outcome: Numeric response columns. scramble re-simulates via the Gaussian copula, jointly with scrambled numeric covariates. Multiple outcomes are supported.
covariate: Everything measured alongside. Numeric: copula re-simulation. Categorical: row permutation, plus opaque label aliasing under alias.
date: Date / POSIX / difftime columns. scramble row-permutes within the observed values; class and NA pattern are preserved.
id: Identifier columns. Never scrambled (that would break row linkage); alias substitutes opaque per-value labels in place, preserving linkage.
text: Free-text columns. scramble row-permutes; alias tokenises each distinct string.
other: Classes masque cannot synthesise (list columns, exotic S4, ...). Keep or drop only.

action sets the masking depth per column:

keep: Byte-identical pass-through, both modes.
scramble: Re-simulate (numeric) or row-permute (categorical / date / text); original label vocabulary remains visible.
alias: Scramble where applicable, plus opaque label substitution - the vocabulary itself is hidden.
drop: Column excluded from the synthetic, both modes.

The proposed action column is resolved for mode, so the table you edit shows the actual masking plan. Re-assigning a column's role with set_role() re-resolves its default action; a direct roles$role[...] <- ... edit leaves action untouched (set it to NA to have mask() re-resolve the default).

Default classification rules, applied in order

PII-pattern column names (contact, email, phone, gps, latitude / longitude, postcode, ssn, password, owner, farmer, operator, etc., case-insensitive substring) -> pii_suspected = TRUE and action drop in both modes. Re-role deliberately if the column must survive.
Date / POSIXct / POSIXlt / difftime columns -> role date, action scramble (row permutation).
ID-pattern names (\\bid\\b, _id$, ^id_) -> role id; kept in local mode, dropped in collaborate mode.
Design-pattern names (rep, block, row, col(umn)?, range, plot(no)?, site, env(ironment)?, trial, year, season, colrep, tos) -> role design, action keep.
Treatment-pattern names (treatment, variety, cultivar, genotype, ^trt, ^dose) -> role treatment; kept in local mode, aliased in collaborate mode.
Character columns in which more than 50% of the non-NA values are unique -> role text; kept in local mode, dropped in collaborate mode.
Unsupported classes -> role other, action keep, with a note.
Everything else -> role covariate, action scramble. Re-role response variables as outcome.

No outcome is required: with no column roled outcome, the copula simply re-simulates all scrambled numeric columns jointly.

Since masque 0.3.0, propose_roles() also calls detect_design() by default (detect = TRUE) and applies the detected design's recommended_roles on top of the name-based heuristic, re-resolving the default action for any promoted column. The design summary is stashed as attr(roles, "design"). Pass detect = FALSE for the name-only heuristic.
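
The ordered, first-match-wins nature of the name-based rules above can be sketched in base R. This is an approximation for illustration, not masque's implementation: guess_role() is a hypothetical helper, the PII list is abbreviated, the class-based rules (dates, unsupported classes, high-cardinality text) are left out, and anchoring the design and treatment names to the whole column name is an assumption.

```r
# Illustrative only: approximates the name-based rules with base R regexes.
# guess_role() is hypothetical and not part of masque.
guess_role <- function(nm) {
  lower <- tolower(nm)
  pii <- paste0("contact|email|phone|gps|latitude|longitude|postcode|",
                "ssn|password|owner|farmer|operator")
  id <- "\\bid\\b|_id$|^id_"
  # Assumption: design / treatment names must match the whole column name.
  design <- paste0("^(rep|block|row|col(umn)?|range|plot(no)?|site|",
                   "env(ironment)?|trial|year|season|colrep|tos)$")
  treatment <- "^(treatment|variety|cultivar|genotype)$|^trt|^dose"
  # Rules are tried in order; the first match wins.
  if (grepl(pii, lower)) return("pii")
  if (grepl(id, lower, perl = TRUE)) return("id")
  if (grepl(design, lower)) return("design")
  if (grepl(treatment, lower)) return("treatment")
  "covariate"
}

cols <- c("plot_id", "Block", "variety", "farmer_name", "yield", "trt_code")
roles_guess <- vapply(cols, guess_role, character(1))
roles_guess
```

Because the ID rule is tried before the design rule, plot_id is classified as an identifier even though it contains a design word.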

Multi-environment trials

Since masque 0.9.1, high-confidence environment columns detected by detect_design() are promoted to role design. In local mode their values remain byte-identical. In collaborate mode categorical environment labels default to alias, which preserves row assignment and factor codes while hiding the vocabulary. Numeric environment columns default to keep and raise a masque_environment_disclosure warning for explicit review.

Weak or competing environment candidates never change roles automatically. Inspect attr(roles, "design") and use set_role() when domain knowledge should override an uncertain result. Preserving environment allocation does not imply that synthesised outcomes preserve treatment-by-environment effects. An explicitly chosen action is pinned and is not overwritten by a later design-role promotion.

Examples

propose_roles(iris)
propose_roles(iris, mode = "collaborate")

met <- expand.grid(
  env = factor(c("E1", "E2")),
  rep = factor(seq_len(2L)),
  gen = factor(c("G1", "G2", "G3"))
)
propose_roles(met, mode = "collaborate")[, c("col", "role", "action")]

Read a masque recipe from disk

Description

Loads a recipe written by save_recipe(). Validates that the file contains a masque_recipe and informs (does not error) if the recipe was written by a different package version than the one currently installed.

Usage

read_recipe(path)

Arguments

path

File path.

Value

A masque_recipe object.

Read a set of tables from a folder, an Excel workbook, or a list

Description

Ingests a multi-table dataset into a named list of data frames, ready for mask_set(). Three input shapes are accepted: a folder path, an Excel workbook, or a named list of data frames (see Details).

Usage

read_set(input, sheets = NULL, pattern = NULL)

Arguments

input

A folder path, an .xlsx / .xls path, or a named list of data frames.

sheets

For an Excel workbook, an optional character vector of sheet names to read (default: all sheets).

pattern

For a folder, an optional regular expression to select files (default: all .csv / .tsv / .fst).

Details

a folder path: Every .csv, .tsv, and .fst file in the folder becomes one table, named by its file name (without extension). CSV / TSV are read with data.table::fread(); .fst needs the Suggested fst package.
an Excel workbook (.xlsx / .xls): Every sheet becomes one table, named by its sheet name. Needs the Suggested readxl package.
a named list of data frames: Returned as-is after validation.

Only clean rectangular tables are supported. A sheet or file that does not read as a rectangle - missing header row (auto-named ...1 columns), zero rows, or zero columns - fails with an explanatory error naming the offending table. masque does not attempt to recover merged cells, multi-row headers, or junk rows; tidy the source first.
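
The folder input shape can be approximated in base R as follows. This is a sketch under stated assumptions, not masque's code: read.csv() stands in for data.table::fread() so the example has no dependencies, the folder name is invented, and only the zero-row / zero-column part of the rectangle check is shown.

```r
# Illustrative only: base R stand-in for reading a folder of CSV files.
folder <- file.path(tempdir(), "masque_folder_sketch")
dir.create(folder, showWarnings = FALSE)
write.csv(data.frame(site = c("A", "B"), yield = c(3.1, 4.2)),
          file.path(folder, "plots.csv"), row.names = FALSE)
write.csv(data.frame(site = c("A", "B"), rainfall = c(400, 550)),
          file.path(folder, "sites.csv"), row.names = FALSE)

files <- list.files(folder, pattern = "\\.csv$", full.names = TRUE)
tables <- lapply(files, read.csv)
# Name each table by its file name without the extension.
names(tables) <- tools::file_path_sans_ext(basename(files))

# Fail with the offending table's name if it is not a usable rectangle.
for (nm in names(tables)) {
  tab <- tables[[nm]]
  if (nrow(tab) == 0L || ncol(tab) == 0L) {
    stop("Table '", nm, "' has zero rows or zero columns.")
  }
}
names(tables)
```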

Value

A named list of data frames.

Examples

tables <- list(
  plots = data.frame(site = c("A", "B"), yield = c(3.1, 4.2)),
  sites = data.frame(site = c("A", "B"), rainfall = c(400, 550))
)
read_set(tables)

Extract the recipe from a masque object

Description

The recipe is private: it is at least as sensitive as the original data. Never share it alongside the synthetic data. By default print(recipe(m)) redacts all level maps; use reveal_maps() for an explicit reveal.

Usage

recipe(m)

Arguments

m

A masque object from mask(), or a masque_set from mask_set().

Value

A masque_recipe, or for a set a masque_recipe_set bundle.

Examples

r <- propose_roles(iris)
r$role[r$col == "Sepal.Length"] <- "outcome"
m <- mask(iris, r, seed = 1)
recipe(m)

Reveal the level maps held inside a recipe

Description

Explicit, audited reveal: prints each original -> synthetic level map held by the recipe. Use sparingly. Recipe maps are at least as sensitive as the original data; printing them defeats the redaction built into print() and summary().

Usage

reveal_maps(rec)

Arguments

rec

A masque_recipe object (e.g., from recipe(m)).

Value

rec, invisibly.

Examples

r <- propose_roles(iris)
r$role[r$col == "Sepal.Length"] <- "outcome"
r$role[r$col == "Species"] <- "covariate"
m <- mask(iris, r, mode = "collaborate", seed = 1)
rec <- recipe(m)
reveal_maps(rec)

List every role and action combination masque accepts

Description

Renders the two-axis vocabulary as a table: every role paired with every action, the storage kinds the pair works for, and the reason the pair is constrained when it is. The table is generated from the same compatibility rules roles_validate() applies at mask() time, so what it shows is exactly what a roles table is allowed to say.

Usage

role_options(kind = NULL)

Arguments

kind

Optional storage kind to filter for: one of "numeric", "integer", "factor", "character", "logical", "date", "datetime", or "other". When supplied, only the combinations workable for a column of that kind are returned. The default NULL returns the full grid.

Details

A column's kind is never chosen: propose_roles() derives it from the column's class. To move a column to a different kind, convert the column in the data and re-propose.

Value

A tibble with one row per role-action pair and four columns: role, action, kinds (the storage kinds the pair is valid for: "all", "none", "all except other", or a comma-separated list), and notes (why the pair is constrained, empty when it is not).

Examples

role_options()
role_options(kind = "factor")
subset(role_options(), role == "design")

Validate a roles table

Description

Fail-closed validation of a two-axis roles table before mask() consumes it. Returns the validated table - with any NA actions resolved to their (role, kind, mode) defaults - so callers can use the return value directly.

Usage

roles_validate(roles, df = NULL, mode = NULL)

Arguments

roles

A roles table from propose_roles() (possibly edited), or a v1 roles tibble (deprecated, upgraded with a warning).

df

Optional data frame. If supplied, roles is checked for one-to-one column-name correspondence with df.

mode

Optional mode ("local" or "collaborate") used to resolve NA actions and to upgrade v1 tables. Defaults to attr(roles, "mode"), falling back to "local".

Details

Tables produced by masque 0.5.0 and earlier (no action column; v1 vocabulary with keep / ignore roles and the optional mask_levels column) are upgraded in place with a deprecation warning. The upgrade preserves the v1 semantics exactly: v1 keep becomes action keep; v1 ignore becomes role id / text / other with action keep in local mode and drop in collaborate mode; v1 treatment mask_levels = "permute" becomes action scramble.

Hard errors:

missing required columns (col, role, action, kind);
unknown role (not in design, treatment, outcome, covariate, date, id, text, other) or unknown action (not in keep, scramble, alias, drop);
any NA role (an NA action is allowed - it resolves to the default for the row's role and kind);
an incompatible (role, action, kind) combination, e.g. design + scramble, numeric + alias, id + scramble, other + anything but keep / drop;
duplicate col entries;
if df supplied: any df column missing from roles, or any roles column missing from df.

Loud advisories (warnings, not errors):

every action is keep - the "synthetic" would equal the original byte-for-byte;
the table was proposed for one mode but is being validated for another (actions are taken as-is; defaults are not re-resolved).
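
To make the fail-closed idea concrete, a few of the hard-error checks listed above can be written in base R. check_roles() is a hypothetical helper written for this illustration; it covers only some of the rules (it does not check role / action / kind compatibility) and its error messages are invented, not masque's.

```r
# Illustrative only: a subset of the hard-error checks, in base R.
# check_roles() is hypothetical and not part of masque.
check_roles <- function(roles, df = NULL) {
  required <- c("col", "role", "action", "kind")
  known_roles <- c("design", "treatment", "outcome", "covariate",
                   "date", "id", "text", "other")
  known_actions <- c("keep", "scramble", "alias", "drop")

  missing_cols <- setdiff(required, names(roles))
  if (length(missing_cols) > 0L) {
    stop("missing required columns: ", paste(missing_cols, collapse = ", "))
  }
  if (anyNA(roles$role) || !all(roles$role %in% known_roles)) {
    stop("NA or unknown role")
  }
  # An NA action is allowed; it would be resolved to a default later.
  if (!all(is.na(roles$action) | roles$action %in% known_actions)) {
    stop("unknown action")
  }
  if (anyDuplicated(roles$col) > 0L) stop("duplicate col entries")
  if (!is.null(df) && !setequal(roles$col, names(df))) {
    stop("roles and df do not cover the same columns")
  }
  invisible(roles)
}

ok <- data.frame(col = c("y", "site"), role = c("outcome", "design"),
                 action = c(NA, "keep"), kind = c("numeric", "factor"))
check_roles(ok, data.frame(y = 1, site = "A"))

bad <- ok
bad$role[1] <- "response"  # not in the role vocabulary
msg <- tryCatch(check_roles(bad), error = function(e) conditionMessage(e))
msg
```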

Value

The validated (and possibly upgraded / resolved) roles table, invisibly.

Examples

r <- propose_roles(iris)
r <- set_role(r, "Sepal.Length", role = "outcome")
roles_validate(r, iris)

Save a masque recipe to disk

Description

Writes the recipe to a single .rds file. The default is runtime-minimal: no simulator state (copula covariance, raw margins) is written, only the translation maps, factor metadata, storage classes, integrity fingerprint, and warnings. This keeps the saved artefact small and reduces the information that would leak if the recipe file alone were shared.

Usage

save_recipe(rec, path, include_simulator = FALSE)

Arguments

rec

A masque_recipe object, e.g. from recipe(m).

path

File path. By convention, .rds extension.

include_simulator

Logical. Reserved for a future release. Currently a no-op (recipe is always written runtime-minimal).

Details

include_simulator = TRUE is accepted but is currently a no-op: the recipe does not carry simulator state. The flag is reserved for a future release that will let read_recipe() regenerate fresh synthetic samples without access to the original data.

Recipes are at least as sensitive as the original data. Protect the saved file at the same security class as the original. Note that save_recipe() writes plain R serialisation - it does not encrypt. The saved recipe is a re-identification key: store it under your organisation's access controls and key-management practice, not alongside the synthetic output.
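
One small, optional hardening step after saving is to restrict the file's permissions. The sketch below is an assumption about local practice, not something masque does: saveRDS() stands in for save_recipe() so the example runs without the package, and the list written is a placeholder rather than a real recipe. Permission bits are no substitute for encryption or for your organisation's access controls.

```r
# Illustrative only: saveRDS() stands in for save_recipe(); the object
# written here is a placeholder, not a real masque_recipe.
path <- tempfile(fileext = ".rds")
saveRDS(list(placeholder = TRUE), path)

# Restrict the file to owner read / write. Sys.chmod() has only a limited
# effect on Windows, where file ACLs should be used instead.
Sys.chmod(path, mode = "0600", use_umask = FALSE)
file.info(path)$mode
```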

Value

path, invisibly.

Examples

r <- propose_roles(iris)
r$role[r$col == "Sepal.Length"] <- "outcome"
m <- mask(iris, r, mode = "collaborate", seed = 1)
tmp <- tempfile(fileext = ".rds")
save_recipe(recipe(m), tmp)
rec2 <- read_recipe(tmp)

Set the role and action of one or more columns in a roles table

Description

Ergonomic editor for the two-axis roles table returned by propose_roles(). Setting a role without an explicit action re-resolves the action to the default for the new role, the column's kind, and the mode the table was proposed for, so a re-roled column never silently carries a stale action from its previous role.

Usage

set_role(roles, cols, role = NULL, action = NULL)

Arguments

roles

A roles table from propose_roles() (possibly edited).

cols

Character vector of column names to edit. Every entry must be present in roles$col.

role

Optional single role to assign to all of cols. One of "design", "treatment", "outcome", "covariate", "date", "id", "text", "other".

action

Optional single action to assign to all of cols. One of "keep", "scramble", "alias", "drop". When NULL (the default) and role was supplied, the action is re-resolved to the default for the new role.

Details

Direct edits (roles$role[roles$col == "x"] <- "outcome") remain fully supported; this helper exists because a direct role edit leaves roles$action untouched, which is occasionally what you want and frequently not.

Value

The edited roles table.

Examples

r <- propose_roles(iris)
r <- set_role(r, "Sepal.Length", role = "outcome")
r <- set_role(r, "Species", action = "keep")
r[, c("col", "role", "action")]

Re-anchor synthetic geospatial coordinates at plausible-but-fake locations

Description

Replaces the latitude / longitude values in a masked data frame with coordinates anchored at user-supplied centroids and clustered to preserve the original's site-count-per-anchor structure. The function never reads the real coordinates beyond counting how many distinct sites the original holds per anchor level, so it leaks the replication-per-site distribution and the count of distinct sites, nothing more.

Usage

synthesise_geospatial(
  synth,
  original,
  anchor_col,
  lat_col,
  lon_col,
  anchor_centroids,
  site_spread_deg = 0.6,
  jitter_deg = 0.05,
  seed = NULL
)

Arguments

synth

A synthetic data frame (typically synthetic(mask(...))).

original

The original data frame from which synth was derived (needed only to count distinct sites per anchor).

anchor_col

Name of the column whose levels anchor each cluster (e.g., "M_STATE"). Must exist in both synth and original.

lat_col, lon_col

Column names of the latitude and longitude fields to overwrite in synth.

anchor_centroids

Named list keyed by anchor levels; each element is a length-2 numeric named c(lat, lon). The user supplies plausible centroids (e.g., state centroids); the function never infers them from the original to avoid leaking position information.

site_spread_deg

Half-width of the box (in decimal degrees) around each anchor centroid within which fake site centroids are uniformly placed. Default 0.6.

jitter_deg

Within-site uniform jitter (in decimal degrees) added to each row's assigned site centroid. Default 0.05.

seed

Optional integer seed for reproducibility. The function uses withr::local_preserve_seed() so the caller's RNG state is left untouched.

Details

Typical use: after mask() produces a synthetic with copula-drawn or missing coordinates, call synthesise_geospatial() to substitute plausible points. The synthetic ends up with:

the same number of distinct sites per anchor level (e.g., if the original has five distinct trial sites in NSW, the synthetic will have five fake sites in NSW);
the original's per-site replication distribution (each fake site receives a share of the synthetic rows proportional to its real counterpart's count);
within-site tight clustering and between-site spread (small jitter within site; larger spread between sites within each anchor centroid's neighbourhood).

What the function does not preserve:

the real positions of the sites (they are random within a user-defined neighbourhood of each anchor centroid);
the relative spacing or bearings between real sites;
any spatial autocorrelation in the outcome.

Coordinates that are NA in the original remain NA in the synthetic - the NA pattern is preserved cell-by-cell.
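
The placement described above can be sketched in base R for a single anchor level. This is an illustration of the idea, not masque's code: the variable names mirror the documented arguments, the per-site row counts are invented, and treating jitter_deg as a half-width (like site_spread_deg) is an assumption.

```r
# Illustrative only: base R sketch of the placement idea for one anchor.
set.seed(2)
centroid <- c(lat = -32.5, lon = 147)  # user-supplied anchor centroid
site_spread_deg <- 0.6
jitter_deg <- 0.05
rows_per_site <- c(12, 8, 5)           # invented counts for three sites

n_sites <- length(rows_per_site)
# Fake site centroids: uniform in a box around the anchor centroid.
site_lat <- centroid[["lat"]] +
  runif(n_sites, -site_spread_deg, site_spread_deg)
site_lon <- centroid[["lon"]] +
  runif(n_sites, -site_spread_deg, site_spread_deg)

# Each row gets its site centroid plus a small uniform jitter
# (assumption: jitter_deg is a half-width).
site_of_row <- rep(seq_len(n_sites), times = rows_per_site)
n_rows <- length(site_of_row)
lat <- site_lat[site_of_row] + runif(n_rows, -jitter_deg, jitter_deg)
lon <- site_lon[site_of_row] + runif(n_rows, -jitter_deg, jitter_deg)

# Every point stays within spread + jitter of the anchor centroid.
all(abs(lat - centroid[["lat"]]) <= site_spread_deg + jitter_deg)
table(site_of_row)
```

Only the counts in rows_per_site come from the original data in this scheme, which is why the real positions are not recoverable from the output.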

Value

synth, with lat_col and lon_col overwritten by the re-anchored coordinates.

Examples


# Toy example: 50 rows split across two states.
set.seed(1)
n <- 50
df <- data.frame(
  state = sample(c("NSW", "VIC"), n, replace = TRUE),
  lat   = stats::rnorm(n, -33, 0.3),
  lon   = stats::rnorm(n, 145, 0.3),
  y     = stats::rnorm(n)
)
roles <- propose_roles(df, detect = FALSE)
roles$role[roles$col == "y"] <- "outcome"
roles$role[roles$col %in% c("lat", "lon")] <- "covariate"
roles$role[roles$col == "state"] <- "design"
m <- mask(df, roles, mode = "collaborate", seed = 1L)
centroids <- list(
  NSW = c(lat = -32.5, lon = 147),
  VIC = c(lat = -36.5, lon = 144)
)
synth_geo <- synthesise_geospatial(
  synthetic(m), df,
  anchor_col = "state", lat_col = "lat", lon_col = "lon",
  anchor_centroids = centroids, seed = 2L
)
head(synth_geo[, c("state", "lat", "lon")])


Extract the synthetic data from a masque object

Description

Returns the synthetic data held by a masque object, or the named list of synthetic tables held by a masque_set.

Usage

synthetic(m)

Arguments

m

A masque object from mask(), or a masque_set from mask_set().

Value

For a masque, a tibble (the synthetic data frame); for a masque_set, a named list of synthetic tables.

Examples

r <- propose_roles(iris)
r$role[r$col == "Sepal.Length"] <- "outcome"
m <- mask(iris, r, seed = 1)
head(synthetic(m))

Translate data from the synthetic namespace back to the original

Description

Inverse of apply_recipe(). Accepts a data frame or an atomic vector. For an atomic factor / character vector with a recipe that holds multiple level maps, column must name which map to invert. Atomic numeric, integer, logical, and Date / POSIXct vectors are returned unchanged (no inverse map applies; these are pass-through under apply_recipe() too).

Usage

unmask(x, rec, column = NULL)

Arguments

x

A data frame or an atomic vector to translate from synthetic-namespace to original-namespace.

rec

A masque_recipe object.

column

Optional column name. Only consulted for atomic factor / character x when the recipe holds more than one level map. Ignored for atomic numeric / logical / Date-like input (pass-through), but if supplied is validated against the recipe's known columns.

Details

The most common pattern is round-tripping pipeline predictions:

fit                 <- my_model(synthetic(m))            # train on synthetic
orig_in_synth_space <- apply_recipe(original, recipe(m)) # forward
preds_synth         <- predict(fit, orig_in_synth_space)
preds_orig          <- unmask(preds_synth, recipe(m))    # inverse

Unknown levels (synthetic aliases not in the recipe's map) fail closed with an informative error rather than silently coercing to NA.
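
The fail-closed behaviour can be illustrated with a plain named-vector lookup in base R. invert_levels() and the alias labels are hypothetical, chosen for this sketch; they do not reflect how a masque_recipe stores its maps or the wording of masque's error.

```r
# Illustrative only: a base R inverse lookup that fails closed.
# invert_levels() and the alias labels are hypothetical.
level_map <- c(setosa = "lvl_001", versicolor = "lvl_002",
               virginica = "lvl_003")  # original -> synthetic alias

invert_levels <- function(x, map) {
  inverse <- setNames(names(map), unname(map))  # alias -> original
  unknown <- setdiff(unique(x[!is.na(x)]), names(inverse))
  if (length(unknown) > 0L) {
    # Refuse rather than silently coercing unknown aliases to NA.
    stop("unknown synthetic level(s): ", paste(unknown, collapse = ", "))
  }
  unname(inverse[x])  # genuine NA input stays NA
}

back <- invert_levels(c("lvl_002", NA, "lvl_001"), level_map)
back
err <- tryCatch(invert_levels("lvl_999", level_map),
                error = function(e) conditionMessage(e))
err
```

Distinguishing a genuine NA from an unmapped alias is the point of the check: the first is preserved, the second stops the translation.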

Value

An object of the same type as x, in the original namespace.

Write a masked set to disk, mirroring the input format

Description

Writes the synthetic tables of a mask_set() result to disk. The output format mirrors the request: a .xlsx path produces one workbook with one sheet per table; a folder path produces one .csv per table. The private recipe bundle is never written by this function - persist it separately with save_recipe() and protect it at the same security class as the original data.

Usage

write_set(m, path, overwrite = FALSE, allow_high = FALSE)

Arguments

m

A masque_set object from mask_set().

path

Output location. A path ending in .xlsx writes a workbook (needs the Suggested writexl package); any other path is treated as a folder, created if necessary, and receives one ⁠<table>.csv⁠ per table.

overwrite

Logical. When FALSE (default), writing over an existing file or a non-empty folder errors.

allow_high

Logical (default FALSE). Override the HIGH-leakage write refusal after your own review; the override is raised as a masque_high_override warning.

Details

In collaborate mode the mask-time audit gates the write: unresolved HIGH leakage findings refuse the write (nothing is written) until the flagged columns are re-roled, aliased, or dropped - or the refusal is explicitly overridden with allow_high = TRUE, which raises a masque_high_override warning so the exception stays visible.

Value

path, invisibly.

Examples

tables <- list(
  plots = data.frame(site = c("A", "B"), yield = c(3.1, 4.2)),
  sites = data.frame(site = c("A", "B"), rain = c(420, 560))
)
m <- mask_set(tables, seed = 1, quiet = TRUE)
dir <- file.path(tempdir(), "masked_set")
write_set(m, dir)
list.files(dir)
