Hallucination Risks in Generative Deep Learning for Wearable Cardiovascular Monitoring: A Systematic Review of Quantitative Evaluation Methods
Abstract
Background: Generative deep learning is increasingly used to denoise, complete, translate, and synthesize wearable photoplethysmography and electrocardiography signals. These models can also create hallucinated cardiovascular structure that appears plausible while altering rhythms, morphology, or downstream risk estimates.

Objective: This systematic methods review evaluates quantitative approaches for detecting and managing hallucination risk in generative wearable cardiovascular monitoring.

Methods: We followed PRISMA-aligned evidence mapping and coded 46 reports published from 2017 to 2026 that involved generative time-series modeling, wearable cardiovascular signals, or decision-linked uncertainty evaluation. Metrics were grouped by signal fidelity, physiological feature preservation, distributional realism, downstream task utility, calibration, out-of-distribution stress testing, fairness, and expert review.

Results: Pointwise error and signal-to-noise measures were the most common evaluation tools, but they were weak proxies for local clinical harm when paired clean targets were unavailable. Physiological feature metrics and downstream classifiers were more decision-relevant, yet they could miss subgroup failures and model-induced rhythm artifacts. Only a small subset of reports quantified uncertainty calibration or used deferral analysis.

Conclusion: No single metric adequately evaluates hallucination risk. We propose a layered evaluation framework that combines paired fidelity, physiological constraints, task-specific decision loss, uncertainty calibration, and stress testing before generative models are deployed in wearable cardiovascular monitoring.
