When the Ground Truth Is Missing: Validating Generative Model Outputs Through Downstream Task Performance and Predictive Entropy Calibration
Abstract
Generative deep learning models are increasingly used to bridge distribution shifts between data acquired in real-world conditions and the data on which clinical or operational predictors were trained. However, when these generative tools are applied as a domain-adaptation step, two practical questions become difficult to answer rigorously: (i) is a particular generated sample faithful enough to be safely consumed by the downstream model, and (ii) when no clean reference signal is available at inference time, what objective evidence supports trusting the generation at all? We address both questions by reframing the validation problem as a decision-theoretic one. Rather than measuring how closely a generated waveform resembles an unobserved ground truth, we evaluate trustworthiness through the predictive entropy of a fixed downstream classifier consuming the generated input. We instantiate the framework on a wearable photoplethysmography (PPG) atrial fibrillation (AF) detection task, augment the test domain with additive noise to enlarge the train-test domain gap, and use a one-dimensional Pix2pix-style generator with a UNet backbone to denoise inputs back toward the source domain. Across 15,377 held-out PPG segments, denoising recovers 5 percentage points of AUC and 4.5 points of balanced accuracy lost to noise injection, while filtering on entropy retains a low-uncertainty subset that exceeds the clean-source baseline (AUC 0.85 vs. 0.84). Reliability diagrams confirm that the entropy estimate behaves as a calibrated decision cost, not merely a heuristic. The approach generalizes to any setting where a generative model feeds a downstream predictor, and offers a principled answer when standard error metrics are unavailable.
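As a rough illustration of the entropy-based filtering described above, the following sketch computes the predictive entropy of a binary classifier's output and keeps only low-uncertainty samples. It assumes the downstream classifier emits a single AF probability per segment. The function names, the use of bits as the entropy unit, and the threshold value are hypothetical choices made for this example and are not taken from the study.

```python
import numpy as np

def binary_predictive_entropy(p_af, eps=1e-12):
    """Shannon entropy (in bits) of a Bernoulli prediction with P(AF) = p_af."""
    # Clip to avoid log(0) when the classifier is fully confident.
    p = np.clip(np.asarray(p_af, dtype=float), eps, 1.0 - eps)
    return -(p * np.log2(p) + (1.0 - p) * np.log2(1.0 - p))

def low_entropy_mask(p_af, threshold=0.5):
    """Boolean mask selecting predictions whose entropy is at or below threshold.

    The threshold is an illustrative assumption; in practice it would be
    tuned on validation data.
    """
    return binary_predictive_entropy(p_af) <= threshold

# Hypothetical classifier outputs on denoised PPG segments.
p_af = np.array([0.50, 0.99, 0.02, 0.30])
kept = p_af[low_entropy_mask(p_af)]  # segments retained for downstream use
```

Entropy peaks at 1 bit when the classifier is maximally undecided (P(AF) = 0.5) and approaches 0 as the prediction becomes confident, so thresholding it discards the generated samples on which the fixed classifier is least certain.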
