Task-Guided Confidence Scoring for Synthetic Time-Series Outputs in Health-Oriented Machine Learning Systems
Abstract
Generative models are increasingly used to produce synthetic physiological time series in health-oriented machine learning, whether to denoise wearable recordings, adapt signals across acquisition domains, augment scarce training data, or impute missing segments. Yet the same flexibility that makes these models useful also lets them introduce plausible-looking but misleading artefacts, which is a serious liability when the synthetic signal feeds a clinical decision. This review argues that the trustworthiness of a synthetic output cannot be judged in isolation from the task it is meant to support, and it develops a task-guided confidence scoring perspective that grounds the quality of each generated signal in the expected cost of the downstream decision it influences. We organise the argument around four ideas: that conventional distributional and realism metrics answer the wrong question for deployment; that a useful confidence signal must be per-instance, available before ground truth, and aligned with the decision at hand; that such a signal can be derived from the behaviour of the downstream task model and externally grounded by checking whether lower-confidence outputs do in fact incur higher realised decision cost; and that the resulting scores enable principled gating of low-confidence outputs. Using wearable photoplethysmography and atrial-fibrillation screening as a running example, we synthesise reporting strategies across modalities, contrast their properties, and map the deployment, governance, and clinical-translation considerations that determine whether confidence scoring delivers value in practice. The perspective offers a transferable diagnostic for deciding when a synthetic time-series output is safe to use.
