---
title: "Cohort specification --- the canonical study cohort"
author: "Max Moldovan, Usman Iqbal"
date: "2026-05-08"
output:
  rmarkdown::html_vignette:
    toc: true
    toc_depth: 3
vignette: >
  %\VignetteIndexEntry{Cohort specification --- the canonical study cohort}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

> **Subtitle.** PIC v1.1.0 pediatric ICU mortality, canonical study cohort. Authors: Max Moldovan (Adelaide, ORCID 0000-0001-9680-8474), Usman Iqbal (Bond). Spec version **1.1.0** (2026-05-08).
>
> **Abstract.** Authoritative cohort definition for the PIC v1.1.0 pediatric ICU mortality analysis. The calibration-first manuscript pipeline inherits this specification verbatim. Deviations are forbidden; sensitivity analyses operate on the same base cohort with documented filters.

# Status

**Frozen contract --- version 1.1.0 (2026-05-08).** Inherited by the calibration-first manuscript pipeline. Changes require a new minor version, an entry in the changelog at the foot of this vignette, and an explicit Methods paragraph in any paper that adopts the new version. Sensitivity analyses do **not** require a version bump --- they operate on the base cohort with documented filters.

The cohort builder lives in `R/cohort.R::build_cohort()`; this vignette is the human-readable specification it implements. The `testthat` invariants in `tests/testthat/test-cohort-invariants.R` are the executable contract.

# Data source

- **Database**: Paediatric Intensive Care (PIC) v1.1.0, Children's Hospital, Zhejiang University School of Medicine (Hangzhou, China). Available to registered users under a data use agreement (DUA) via PhysioNet.
- **Cohort window**: ICU admissions 2010-01 to 2018-12 (the v1.1.0 release span; confirmed against `ADMISSIONS.csv` at gate G1).
- **Mounting**: read-only via the `data_links/pic_v110/` symlink. `picMort::pic_paths()` is the only sanctioned access route. The Box-synced source folder is never read directly by package code.
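
The mounting rule can be illustrated with a minimal sketch. The helper `resolve_pic_paths()` below is hypothetical and is not the `picMort::pic_paths()` implementation; the table list and the `<TABLE>.csv` file naming are assumptions made for the example. It resolves paths under a mounted root and stops if any expected CSV is absent, which is the behaviour the Reproducibility section attributes to `pic_paths()`.

```r
# Hypothetical sketch, not the picMort implementation: resolve source tables
# under a mounted root and fail loudly if any expected CSV is missing.
resolve_pic_paths <- function(root,
                              tables = c("PATIENTS", "ADMISSIONS", "ICUSTAYS")) {
  paths <- file.path(root, paste0(tables, ".csv"))
  names(paths) <- tables
  missing_tables <- tables[!file.exists(paths)]
  if (length(missing_tables) > 0) {
    stop("Missing PIC source file(s): ",
         paste(missing_tables, collapse = ", "), call. = FALSE)
  }
  paths
}

# A fresh temporary directory stands in for data_links/pic_v110/.
demo_root <- tempfile("pic_demo_")
dir.create(demo_root)
invisible(file.create(file.path(demo_root, c("PATIENTS.csv", "ADMISSIONS.csv"))))

# Half-mounted folder: the call errors instead of returning a partial set.
partial <- tryCatch(resolve_pic_paths(demo_root),
                    error = function(e) conditionMessage(e))

# Fully mounted folder: a named character vector of paths is returned.
invisible(file.create(file.path(demo_root, "ICUSTAYS.csv")))
complete <- resolve_pic_paths(demo_root)
```

Routing every read through one resolver of this kind means a partial mount surfaces as an error at the first access rather than as a silently smaller cohort.
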

# Unit of analysis

One row per **first ICU stay** per patient. Multiple ICU stays within the same hospitalization collapse to the index stay; ICU readmissions in later hospitalizations are excluded from this cohort entirely. Rationale: the prediction-window framing requires a well-defined T0; mixing first and subsequent stays leaks information through the patient's prior trajectory.

# Inclusion criteria

| # | Criterion | Source |
|---|---|---|
| I1 | Age at hospital admission $\in [0, 18]$ years | `PATIENTS.DOB`, `ADMISSIONS.ADMITTIME` |
| I2 | Valid ICU admission and discharge timestamps (non-null, in window) | `ICUSTAYS.INTIME`, `ICUSTAYS.OUTTIME` |
| I3 | First ICU stay per patient (`row_number() == 1` ordered by `INTIME`) | `ICUSTAYS` |
| I4 | ICU length of stay $\geq$ `min_los_hours` (default **24 h**) | derived |

# Exclusion criteria

| # | Criterion | Note |
|---|---|---|
| E1 | Age $> 18$ y at admission | safety margin: also exclude rows with implausible negative age |
| E2 | Any of `INTIME`, `OUTTIME`, `ADMITTIME`, `DISCHTIME` missing | hard fail |
| E3 | `OUTTIME < INTIME` or `DISCHTIME < ADMITTIME` | data-quality flag, dropped |
| E4 | ICU LOS $<$ `min_los_hours` | excluded from prediction-window analysis; counted in attrition |
| E5 | Duplicate `subject_id` after first-stay selection | sanity check; should be empty |

Each exclusion is logged with a reason in `cohort_attrition()` and rendered as a CONSORT-style flow diagram in the manuscript.

# Outcome

**Primary**: `hospital_expire_flag` --- in-hospital mortality, derived from `ADMISSIONS.HOSPITAL_EXPIRE_FLAG`. Cross-checked against `ADMISSIONS.DEATHTIME` (any non-null `DEATHTIME` within the index hospitalization $\Rightarrow$ flag = 1; mismatches logged and reviewed).
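
Two of the rules above can be sketched in base R on toy data: index-stay selection with an attrition log (E3, I3, I4) and the `DEATHTIME` cross-check. This is an illustration under stated assumptions, not the `build_cohort()` implementation. The function name `select_index_stays()`, the step order (E3, then I3, then I4, so a patient whose first stay is too short is dropped rather than replaced by a later stay), and all data values are assumptions.

```r
# Toy stand-ins for ICUSTAYS and ADMISSIONS; all values are invented.
icustays <- data.frame(
  subject_id = c(1L, 1L, 2L, 3L, 4L),
  hadm_id    = c(101L, 102L, 201L, 301L, 401L),
  icustay_id = c(11L, 12L, 21L, 31L, 41L),
  intime  = as.POSIXct(c("2060-01-01 08:00", "2060-03-01 10:00", "2060-02-01 09:00",
                         "2060-02-05 12:00", "2060-02-07 06:00"), tz = "UTC"),
  outtime = as.POSIXct(c("2060-01-03 08:00", "2060-03-04 10:00", "2060-02-01 20:00",
                         "2060-02-04 12:00", "2060-02-09 06:00"), tz = "UTC")
)
admissions <- data.frame(
  hadm_id = c(101L, 401L),
  hospital_expire_flag = c(0L, 0L),
  deathtime = as.POSIXct(c(NA, "2060-02-09 05:00"), tz = "UTC")
)

# Hypothetical helper: apply E3, I3, I4 in that (assumed) order and log counts.
select_index_stays <- function(stays, min_los_hours = 24) {
  attrition <- data.frame(step = "input", n = nrow(stays))
  record <- function(log, step, df) rbind(log, data.frame(step = step, n = nrow(df)))

  # E3: discharge before admission is a data-quality failure.
  stays <- stays[stays$outtime >= stays$intime, , drop = FALSE]
  attrition <- record(attrition, "E3: outtime >= intime", stays)

  # I3: keep the earliest stay per subject.
  stays <- stays[order(stays$subject_id, stays$intime), , drop = FALSE]
  stays <- stays[!duplicated(stays$subject_id), , drop = FALSE]
  attrition <- record(attrition, "I3: first ICU stay", stays)

  # I4 / E4: minimum ICU length of stay.
  stays$los_hours <- as.numeric(difftime(stays$outtime, stays$intime, units = "hours"))
  stays <- stays[stays$los_hours >= min_los_hours, , drop = FALSE]
  attrition <- record(attrition, "I4: LOS >= min_los_hours", stays)

  attr(stays, "attrition") <- attrition
  stays
}

cohort <- select_index_stays(icustays)

# Outcome cross-check: a non-null deathtime must imply flag = 1.
linked <- merge(cohort[, c("subject_id", "hadm_id")], admissions, by = "hadm_id")
mismatches <- linked[!is.na(linked$deathtime) & linked$hospital_expire_flag != 1L, ,
                     drop = FALSE]
```

On this toy input, `attr(cohort, "attrition")` records the row count after each step, and `mismatches` holds the admissions that would be logged for review.
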

**Rationale (D1, locked default)**: in-hospital mortality is the most defensible pediatric-ICU outcome on PIC because (i) the database has no post-discharge follow-up, so 30-day mortality requires opaque imputation for any patient discharged alive before day 30; (ii) it is the outcome PIM3 was developed to predict, so the comparator is principled; (iii) earlier reference pipelines used 30-day mortality with unclear linkage --- a defect this specification explicitly fixes.

# Prediction window

Features lock at **T0 + 24 h**, where T0 = `ICUSTAYS.INTIME`. No feature in the matrix may be derived from any timestamp at or after $T_0 + 24$ h. The hard rule: **no LOS, no discharge-time, no post-window vital, no post-window lab, ever.** `audit_no_leakage()` runs as a runtime invariant in the targets pipeline (gate G2).

**Sensitivity arm (D2 secondary default)**: T0 + 12 h. Reported as a row-level robustness check in the manuscript supplement; no separate cohort build --- only the feature window changes.
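
The window rule can be sketched in base R as follows. The functions `with_cutoff()`, `window_events()` and `check_no_leakage()` are hypothetical stand-ins (the last one for `audit_no_leakage()`, whose real signature is not given here), and the event table with a `charttime` column is an assumed shape. Only the upper bound stated above is enforced; whether pre-T0 events are admissible is not specified in this vignette, so the sketch does not filter them.

```r
# Toy cohort row and event table; all values are invented.
cohort <- data.frame(
  icustay_id = 11L,
  intime = as.POSIXct("2060-01-01 08:00", tz = "UTC")
)
events <- data.frame(
  icustay_id = 11L,
  charttime = as.POSIXct(c("2060-01-01 09:00", "2060-01-01 21:00",
                           "2060-01-02 08:00", "2060-01-02 10:00"), tz = "UTC"),
  value = c(120, 118, 110, 105)
)

# Attach each event's cutoff, T0 + window_hours (POSIXct arithmetic is in seconds).
with_cutoff <- function(events, cohort, window_hours) {
  merged <- merge(events, cohort[, c("icustay_id", "intime")], by = "icustay_id")
  merged$cutoff <- merged$intime + window_hours * 3600
  merged
}

# Keep only events strictly before the cutoff; an event at exactly T0 + window is dropped.
window_events <- function(events, cohort, window_hours = 24) {
  merged <- with_cutoff(events, cohort, window_hours)
  merged[merged$charttime < merged$cutoff, names(events), drop = FALSE]
}

# Hypothetical audit: error if any event sits at or after the cutoff.
check_no_leakage <- function(events, cohort, window_hours = 24) {
  merged <- with_cutoff(events, cohort, window_hours)
  n_late <- sum(merged$charttime >= merged$cutoff)
  if (n_late > 0) {
    stop(n_late, " event(s) at or after T0 + ", window_hours, " h", call. = FALSE)
  }
  invisible(TRUE)
}

features_24h <- window_events(events, cohort, window_hours = 24)  # primary window
features_12h <- window_events(events, cohort, window_hours = 12)  # sensitivity arm
```

Because the window is a parameter of the feature step, the 12 h sensitivity arm reuses the same cohort rows and changes only `window_hours`.
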

# Variable contract

The cohort table returned by `build_cohort()` has exactly these columns, in this order, with these types:

| Variable | Type | Source | Notes |
|---|---|---|---|
| `subject_id` | integer | `PATIENTS` | primary key |
| `hadm_id` | integer | `ADMISSIONS` | join key |
| `icustay_id` | integer | `ICUSTAYS` | join key for events |
| `intime` | POSIXct (UTC) | `ICUSTAYS.INTIME` | T0 |
| `outtime` | POSIXct (UTC) | `ICUSTAYS.OUTTIME` | excluded from features |
| `los_hours` | double | derived | excluded from features |
| `age_months` | integer | derived from `DOB`, `ADMITTIME` | for pediatric subgroups |
| `age_years` | double | derived | float for plotting |
| `sex` | factor `c("F","M")` | `PATIENTS.GENDER` | unknown $\Rightarrow$ NA, excluded |
| `hospital_expire_flag` | integer 0/1 | `ADMISSIONS` | outcome |
| `admit_year` | integer | `ADMITTIME` | for stratification + internal-external split |
| `is_surgical` | logical | derived from `SURGERY_VITAL_SIGNS` non-empty | binary |
| `primary_icd_chapter` | factor | `DIAGNOSES_ICD` $\bowtie$ `D_ICD_DIAGNOSES` | top-N + "other" |

Time-varying features (vitals, labs, interventions) live in the **feature matrix** built by `build_features()`, not in the cohort table. The cohort table is row-static. Weight at admission is a PIM3 input and is sourced inside `compute_pim3()`, not the cohort table (spec v1.1.0 change; see Changelog).

# Splits and resampling

- Stratified train/test split: 70/30, stratified jointly by `hospital_expire_flag` and `admit_year` quartile. Test set frozen at study start; touched once for final reporting.
- Inner resampling (training set only): 5×5 nested CV --- outer 5-fold for performance estimation, inner 5-fold for hyperparameter selection. Bootstrap percentile / BCa CIs (1,000 reps) on outer-fold metrics.
- Internal-external arm (D4 default): train on $\text{admit\_year} \le t$, test on $\text{admit\_year} = t + 1$, sliding through the cohort years.
  Reported alongside the random split as a temporal-generalisation robustness check.

# Invariants (executable contract)

Every cohort build must satisfy the invariants below. Violations raise errors in `assert_cohort_invariants()` and fail the targets pipeline.

```r
list(
  n_min            = 8500L,            # PIC v1.1.0 + min_los_hours = 24
  n_max            = 9000L,
  mortality_rate   = c(0.075, 0.095),
  age_range_years  = c(0, 18),
  sex_levels       = c("F", "M"),
  distinct_subject = TRUE,             # one row per subject_id
  no_overlap_stays = TRUE              # ICU stays do not overlap within subject
)
```

**Observed at G1 (2026-05-08, locked).** n = **8,736**; mortality **0.0844** (737 / 8,736 deaths); age range [0, 17.842] y; admit-year span 2060–2118 (PIC date-anonymisation preserves relative ordering); sex F/M = 3,684 / 5,052; surgical fraction 36 %.

Cohort attrition is recorded in `attr(cohort, "attrition")`: the LOS ≥ 24 h filter removes **4,075 of 12,811** (32 %) eligible stays, the largest single exclusion. The 32 % short-stay attrition is a manuscript Discussion point on selection bias for the prediction-window framing (many of the excluded stays are likely "well kid briefly observed" or "very-early death"; the T0 + 12 h sensitivity arm probes this).

# Reproducibility

- Seed: `20260508` (set at every stochastic step).
- Build: `targets::tar_make()` reproduces every number in the manuscript from raw CSVs in $<$ 30 min on a 16 GB laptop after `data_links/` is mounted.
- Environment: `renv.lock` at the package root pins every dependency.
- Provenance: `picMort::pic_paths()` raises if any source CSV is missing, so a half-mounted Box folder cannot silently produce a partial cohort.

# Out of scope

- Discharge disposition (alive-with-impairment, alive-without-impairment).
- ICU readmission risk.
- Length-of-stay prediction.
- `EMR_SYMPTOMS` text features.
- PRISM-IV (kept conditionally; included only if every required field is reconstructable from PIC without proxy --- decision logged at G3).

# Reporting

The TRIPOD+AI checklist for the calibration-first manuscript points to this specification for items 4a (study design), 5a (data source), 5b (inclusion / exclusion), 6a (outcome) and 6b (predictors).

# Pedagogy (D5 default --- balanced)

The package is a teaching artefact as well as the engine behind the manuscript: function names mirror the clinical concepts (`build_cohort`, `compute_pim3`, `decision_curve`), and the targets DAG mirrors the methods narrative one-to-one. Where pipeline mechanics could obscure clinical interpretation, callout boxes ("Why this matters clinically") sit alongside callouts for mechanics ("Why this matters mechanically"). Both are first-class.

# Changelog

- **1.0.0 (2026-05-08)** --- initial frozen spec. D1–D4 defaults locked per series kickoff; D5 balanced clinical-and-mechanical pedagogy.
- **1.1.0 (2026-05-08)** --- minor revision after gate G1 landed. (a) Removed `weight_kg_admit` from the cohort table; it is now sourced in `compute_pim3()` instead of `build_cohort()`, which keeps the cohort contract a thin row-static frame and avoids reading the 190 MB `CHARTEVENTS` table at cohort-build time. (b) Concrete invariants committed against PIC v1.1.0 (n = 8,736, mortality 0.0844). (c) Attrition cascade documented; the LOS ≥ 24 h filter is the dominant exclusion (32 %) and a Discussion point for selection bias.
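
The figures committed in changelog items (b) and (c) can be checked for internal consistency with a few lines of arithmetic. The sketch below is a plain consistency check, not part of `assert_cohort_invariants()`; it uses only numbers quoted in this vignette, and the variable names are illustrative.

```r
# Invariant bounds copied from the "Invariants" section of this vignette.
spec <- list(n_min = 8500L, n_max = 9000L, mortality_rate = c(0.075, 0.095))

# Counts reported at gate G1.
eligible_stays <- 12811L
short_stays    <- 4075L
deaths         <- 737L
sex_counts     <- c(F = 3684L, M = 5052L)

n_cohort           <- eligible_stays - short_stays
mortality          <- deaths / n_cohort
attrition_fraction <- short_stays / eligible_stays

checks <- c(
  n_matches_reported  = n_cohort == 8736L,
  n_in_bounds         = n_cohort >= spec$n_min && n_cohort <= spec$n_max,
  mortality_reported  = abs(mortality - 0.0844) < 5e-5,
  mortality_in_bounds = mortality >= spec$mortality_rate[1] &&
                        mortality <= spec$mortality_rate[2],
  attrition_is_32_pct = round(100 * attrition_fraction) == 32,
  sex_counts_sum_to_n = sum(sex_counts) == n_cohort
)
stopifnot(all(checks))
```
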