When the Ground Truth Is Missing: Validating Generative Model Outputs Through Downstream Task Performance and Predictive Entropy Calibration

Diogo Ferreira; Cláudia Mendes; Tiago Almeida

doi:10.63646/jaiaa.2024.020402

Published 2024-12-30

Diogo Ferreira

Department of Computer Science, University of Beira Interior, Covilhã, Portugal

Cláudia Mendes

School of Engineering (ISEP), Polytechnic Institute of Porto, Porto, Portugal

Tiago Almeida*

School of Technology and Management, Polytechnic of Leiria, Leiria, Portugal
tiago.almeida@ipleiria.pt

Abstract

Generative deep learning models are increasingly used to bridge distribution shifts between data acquired in real-world conditions and the data on which clinical or operational predictors were trained. However, when these generative tools are applied as a domain-adaptation step, two practical questions become difficult to answer rigorously: (i) is a particular generated sample faithful enough to be safely consumed by the downstream model, and (ii) when no clean reference signal is available at inference time, what objective evidence supports trusting the generation at all? We address both questions by reframing the validation problem as a decision-theoretic one. Rather than measuring how closely a generated waveform resembles an unobserved ground truth, we evaluate trustworthiness through the predictive entropy of a fixed downstream classifier consuming the generated input. We instantiate the framework on a wearable photoplethysmography (PPG) atrial fibrillation (AF) detection task, augment the test domain with additive noise to enlarge the train-test domain gap, and use a one-dimensional Pix2pix-style generator with a UNet backbone to denoise inputs back toward the source domain. Across 15,377 held-out PPG segments, denoising recovers 5 percentage points of AUC and 4.5 points of balanced accuracy lost to noise injection, while filtering on entropy retains a low-uncertainty subset that exceeds the clean-source baseline (AUC 0.85 vs. 0.84). Reliability diagrams confirm that the entropy estimate behaves as a calibrated decision cost, not merely a heuristic. The approach generalizes to any setting where a generative model feeds a downstream predictor, and offers a principled answer when standard error metrics are unavailable.
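The entropy-based filtering described in the abstract can be illustrated with a minimal sketch. It assumes a binary AF classifier that outputs a probability per segment; the function names, the example probabilities, the use of base-2 logarithms, and the 0.5-bit threshold are all hypothetical choices made for illustration and are not taken from the article.

```python
import math


def binary_entropy(p, eps=1e-12):
    """Shannon entropy, in bits, of a Bernoulli prediction with P(AF) = p."""
    # Clamp away from 0 and 1 so that log2 is always defined.
    p = min(max(p, eps), 1.0 - eps)
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def filter_by_entropy(probs, max_entropy):
    """Return indices of samples whose predictive entropy is <= max_entropy."""
    return [i for i, p in enumerate(probs) if binary_entropy(p) <= max_entropy]


# Hypothetical classifier outputs P(AF) for five denoised PPG segments.
probs = [0.02, 0.48, 0.91, 0.55, 0.99]

# Assumed threshold of 0.5 bits; in practice it would be tuned on held-out data.
kept = filter_by_entropy(probs, max_entropy=0.5)
```

Here the confident predictions (indices 0, 2, and 4) are retained, while the two segments with probabilities near 0.5 are set aside as too uncertain. Downstream metrics such as AUC would then be computed on the retained subset only.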

Keywords: Uncertainty quantification; Generative deep learning; Domain adaptation; Photoplethysmography; Atrial fibrillation; Calibration; Decision theory

This work is licensed under a Creative Commons Attribution 4.0 International License.

How to Cite

Ferreira, D., Mendes, C., & Almeida, T. (2024). When the Ground Truth Is Missing: Validating Generative Model Outputs Through Downstream Task Performance and Predictive Entropy Calibration. Journal of AI Analytics and Applications, 2(4), 10-31. https://doi.org/10.63646/jaiaa.2024.020402
