---
title: "Kernel-Based Downscaling: Vector and Distribution Inputs"
author: "Max Moldovan"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Kernel-Based Downscaling: Vector and Distribution Inputs}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 6,
  fig.height = 4,
  dpi = 150
)
set.seed(1)
```

## Three flavours of kernel downscaling

kernR ships three downscaling primitives, covering distinct
data-availability regimes:

1. **`kernel_downscale()`** -- vector-in / vector-out. Conditional mean
   embedding (CME) regression: each training observation is a single
   coarse-resolution vector; predict fine-resolution outputs at new
   coarse inputs. Standard kernel ridge regression in
   conditional-distribution form. Used when "coarse" and "fine"
   live on the same kind of object (e.g. one row per year).

2. **`dist_regression()`** -- bag-in / vector-out. Distribution
   regression: each training input is a *bag of points* (a sample from
   a distribution); predict a scalar or vector at the bag level. Used
   when the coarse covariate is itself a distribution rather than a
   single vector -- e.g. a paddock's soil-core measurements, an
   ensemble of climate-model realisations.

3. **`aggregate_downscale()`** -- aggregate-likelihood inversion of a
   *known* aggregator. No paired (coarse, fine) training data — only
   the aggregate observation `y` and the operator `T` such that
   `Y = T(X) + eps`. A Gaussian-mixture prior on the latent `X` is
   inverted in closed form (linear-Gaussian `T`) or by importance
   sampling within each prior component (non-linear `T`). The
   Sejdinovic kernel-downsizing / aggregate-likelihood direction;
   companion to the proxymix `from_aggregate_likelihood()` Tier-2 stub.

The first two have closed-form solutions; the third has a closed form
for linear aggregators and a Monte-Carlo path for non-linear. All
three use orders of magnitude less data than deep-learning
downscalers and carry interpretable bandwidth / regularisation /
component-count knobs.

## (1) `kernel_downscale()`: coarse climate → paddock yield

```{r downscale-setup}
library(kernR)

# Training: 8 years of coarse climate -> paddock-level yield/biomass
n <- 80L
coarse <- matrix(
  c(stats::rnorm(n, mean = 18, sd = 2),    # mean monthly temperature
    stats::rnorm(n, mean = 450, sd = 80)), # cumulative rainfall (mm)
  ncol = 2L,
  dimnames = list(NULL, c("temp", "rainfall"))
)
truth <- function(z) {
  cbind(
    yield   = 0.02 * z[, "rainfall"] - 0.1 * (z[, "temp"] - 18)^2 +
              stats::rnorm(nrow(z), sd = 0.5),
    biomass = 0.03 * z[, "rainfall"] + 0.05 * z[, "temp"] +
              stats::rnorm(nrow(z), sd = 0.7)
  )
}
fine <- truth(coarse)
```

Predict at a held-out coarse grid:

```{r downscale-fit}
new_coarse <- matrix(
  c(stats::rnorm(20L, 18, 2),
    stats::rnorm(20L, 450, 80)),
  ncol = 2L,
  dimnames = list(NULL, c("temp", "rainfall"))
)
fit <- kernel_downscale(coarse, fine, new_coarse)
fit
```

```{r downscale-preds}
head(fit$prediction)
```

The `lambda` was CV-selected via leave-one-out; the bandwidths follow
the median heuristic per kernel. Both can be overridden.

### When to reach for `kernel_downscale()`

- Both training and prediction have the same coarse-resolution
  *shape*: each row is one coarse vector.
- You want point predictions at the fine resolution, not a full
  conditional distribution.
- Sample size in the tens-to-thousands range; closed-form ridge
  scales as `O(n^3)` for the train-time solve but predicts in
  `O(n_new * n_train)`.

For very large training sets, plug `nystrom_factor()` outputs into a
custom CME by writing `K_x = F F^T` and re-solving the ridge in
factor space -- see [`fit_cme()`] for the operator-level access.

## (2) `dist_regression()`: distribution → scalar

The simplest illustrative case: predict the *mean* of a distribution
from the distribution itself, given a finite sample.

```{r dr-mean-stub}
M <- 60L
mu_train <- stats::runif(M, -3, 3)
bags_train <- lapply(mu_train, function(m) {
  matrix(stats::rnorm(40L, mean = m, sd = 1), ncol = 1L)
})

fit_dr <- dist_regression(bags_train, y = mu_train, outer = "linear")
fit_dr
```

```{r dr-mean-predict}
# Predict on fresh bags
M_new <- 10L
mu_new <- stats::runif(M_new, -3, 3)
bags_new <- lapply(mu_new, function(m) {
  matrix(stats::rnorm(40L, mean = m, sd = 1), ncol = 1L)
})
pred <- predict(fit_dr, bags_new)
data.frame(
  truth      = round(mu_new, 3L),
  prediction = round(pred,    3L),
  abs_error  = round(abs(pred - mu_new), 3L)
)
```

### Bag-level vector targets

`y` can also be a matrix when each bag carries multiple scalar
targets:

```{r dr-multivariate}
y_mat <- cbind(mean = mu_train, mean_sq = mu_train^2)
fit_mv <- dist_regression(bags_train, y = y_mat, outer = "linear")
predict(fit_mv, bags_new[1:3])
```

### Outer kernel choice

- **`outer = "linear"`** -- inner-product of empirical embeddings. Most
  appropriate when the target is well captured by the first moment of
  the bag (e.g. predicting bag mean).
- **`outer = "rbf"`** -- Gaussian on embedding-space distance. Picks
  up higher-order distributional features. Use when the target depends
  on more than the bag mean (variance, skew, tails). Bandwidth follows
  the median heuristic on the embedding-space pairwise distances.

```{r dr-variance}
# Variance-targeting example: zero-mean bags with varying SDs
s_train <- stats::runif(M, 0.5, 2.5)
bags_var <- lapply(s_train, function(s) {
  matrix(stats::rnorm(60L, mean = 0, sd = s), ncol = 1L)
})

# Linear outer: insensitive to bag SD (means are all near zero)
fit_lin <- dist_regression(bags_var, y = s_train, outer = "linear")
fit_lin

# RBF outer: embedding-space distance differs even with equal means
fit_rbf <- dist_regression(bags_var, y = s_train, outer = "rbf")
fit_rbf
```

### Variable bag sizes

Bag sizes may differ across the design (e.g. paddocks with different
numbers of soil cores). The double-sum embedding handles this
naturally via `mean()` over the inner Gram:

```{r dr-variable-size}
mu_v <- stats::runif(20L, -2, 2)
bags_v <- lapply(mu_v, function(m) {
  n_i <- sample(20:80, 1L)
  matrix(stats::rnorm(n_i, mean = m), ncol = 1L)
})
fit_v <- dist_regression(bags_v, y = mu_v, outer = "linear")
range(vapply(fit_v$bags_train, nrow, integer(1L)))
```

### When to reach for `dist_regression()`

- Each training row is itself a *collection of measurements* rather
  than a single vector (paddock soil cores, ensemble realisations,
  spectral samples, repeated experimental measurements).
- The target is a property of the underlying distribution, not of a
  single point.
- Sample size in bags: 10s-100s of points per bag; number of bags
  10s-low 1000s. Cost is `O(M^2 \cdot \bar{m}^2)` for the inner Gram
  where `M` is bag count, `\bar{m}` mean bag size.

## (3) `aggregate_downscale()`: aggregate-likelihood inversion

Often the downscaling problem looks **nothing** like supervised
regression: no paired (coarse, fine) data is available. Instead,
the observation is `y = T(x) + eps` where `T` is a *known* operator
(spatial average, temporal average, satellite footprint convolution,
a non-linear sensor model), `eps` is observation noise, and a
parametric prior on the fine-scale latent `x` is available — for
example, fitted via `proxymix::fit_proxymix()` on historical
fine-scale data, or specified directly. The job is to recover the
posterior `p(x | y)`.

`aggregate_downscale()` dispatches on the aggregator's class:

- **Linear matrix `A`** (`y = A x + eps`): closed-form per-component
  Kalman update plus reweighted mixture weights. Cost `O(N (dim_x^3 +
  dim_y^3))`.
- **Function `T(x)`** (non-linear): importance sampling within each
  prior component (`n_samples_per_component` draws, default 200).
  Per-component effective sample size is reported and an ESS-floor
  warning fires when IS collapses. Cost `O(N M (dim_x + dim_y))` per
  call.

Linear example — spatial averaging of two adjacent paddocks against a
two-cluster prior:

```{r aggregate-linear}
A <- matrix(c(0.5, 0.5), nrow = 1L)   # spatial average
prior <- list(
  means       = list(c(0, 0), c(2, 2)),
  covariances = list(diag(2L), diag(2L)),
  weights     = c(0.5, 0.5)
)
# Observe an aggregate that points to the second cluster
fit_lin <- aggregate_downscale(y = 1.8, aggregator = A,
                               latent_prior = prior, sigma_y = 0.15)
fit_lin
```

The posterior mixture weights shift toward the cluster consistent
with the observation (here, the `(2, 2)` cluster). The per-component
posterior is exact (Kalman update) and the cross-component law-of-
total-covariance is honest.

Non-linear example — a sinusoidal aggregator (sensor non-linearity):

```{r aggregate-nonlinear}
agg_fn <- function(x) matrix(sin(rowSums(x)), ncol = 1L)
fit_nl <- aggregate_downscale(y = 0.5, aggregator = agg_fn,
                              latent_prior = prior, sigma_y = 0.1,
                              n_samples_per_component = 400L,
                              seed = 1L)
fit_nl$ess_per_component
fit_nl$posterior_mean
```

The per-component ESS is the reliability gate — if it collapses,
the function warns and the posterior should be treated as
exploratory. Draw posterior samples for downstream uncertainty
propagation:

```{r aggregate-sample}
samples <- posterior_sample_aggregate(fit_nl, n = 500L, seed = 2L)
head(samples)
```

### When to reach for `aggregate_downscale()`

- Only **aggregate** observations are available (no paired
  fine-scale ground truth).
- The aggregation operator `T` is known (spatial averaging,
  temporal averaging, sensor convolution, mass-balance constraint).
- A parametric latent prior is available — e.g. from
  `proxymix::fit_proxymix()` on historical fine-scale data, or
  user-supplied from a domain model.
- You want a closed-form posterior when `T` is linear, or a clean
  importance-sampling pass when it is not — without committing to
  MCMC.

Pair with `proxymix::fit_proxymix()` (the canonical prior fitter)
and PESTO's manifest contract (the cross-package handoff) for the
full APSIM → posterior-prior → downscaled-fine-scale chain.

## Comparison

| Aspect | `kernel_downscale()` | `dist_regression()` | `aggregate_downscale()` |
|---|---|---|---|
| Training data | Paired (coarse, fine) | Bags + scalar/vector targets | None paired; aggregate `y` only |
| Coarse-fine map | Learned (regression) | Learned (regression) | **Known** (`y = T(x) + eps`) |
| Latent prior | Implicit (empirical) | Implicit | Explicit (parametric GMM) |
| Closed form | Yes | Yes | Yes (linear `T`) / IS (non-linear) |
| Multi-output | Yes | Yes | Yes |
| Posterior uncertainty | Point + ridge CI | Point + ridge CI | Full mixture posterior |

## References

- Park, J., Muandet, K., Fukumizu, K., & Sejdinovic, D. (2013).
  *Kernel embeddings of conditional distributions.* IEEE Signal
  Processing Magazine, 30(4), 98-111.
- Szabó, Z., Sriperumbudur, B. K., Póczos, B., & Gretton, A. (2016).
  Learning theory for distribution regression. *Journal of Machine
  Learning Research*, 17(152), 1-40.
- Muandet, K., Fukumizu, K., Sriperumbudur, B., & Schölkopf, B.
  (2017). *Kernel mean embedding of distributions: A review and
  beyond.* Foundations and Trends in Machine Learning, 10(1-2),
  1-141.