Multimodal Fusion Strategies for Affective Computing: Audio, Visual, and Physiological Signals

Mei-Ling Zhou*
Department of Brain and Cognitive Sciences, MIT, Cambridge, MA 02139, USA
meizhou@mit.edu
Kwame Asante
Department of Electrical Engineering, Kwame Nkrumah University of Science and Technology, Kumasi AK385, Ghana
Irina Petrov
Faculty of Computer Science, Higher School of Economics, Moscow 101000, Russia

Abstract

Affective computing (the computational recognition of, modelling of, and response to human emotion) has been a research area for nearly three decades, but the convergence of multimodal deep learning, wearable sensor technology, and large-scale pre-trained models has opened new possibilities and thrown old problems into sharper relief. This review surveys multimodal fusion strategies for affective computing, focusing on the combination of three signal modalities that together cover the principal pathways through which emotion is expressed: audio (speech prosody, paralinguistic features), visual (facial action units, body posture, gaze), and physiological signals (EEG, EDA, heart rate variability). We analyse fusion architectures across four families (feature-level, decision-level, model-level, and hybrid) and evaluate their performance on six benchmark datasets spanning three affective recognition tasks: valence-arousal regression, discrete emotion classification, and pain intensity estimation. A key finding is that physiological modalities are systematically underrepresented in multimodal fusion research relative to their information value, partly because of data-collection constraints and partly because of cultural assumptions about which emotional expressions are universal and which are culturally specific. We identify cross-cultural generalisation and privacy-preserving physiological sensing as the most critical open research directions.
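To make the first two fusion families concrete, the following minimal sketch contrasts feature-level (early) fusion, which concatenates per-modality features before a single joint classifier, with decision-level (late) fusion, which trains one classifier per modality and combines their posteriors. All feature dimensions, the six-class emotion taxonomy, the linear classifiers, and the per-modality weights are illustrative assumptions for exposition, not values or models from the surveyed systems.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in feature vectors for one sample (dimensions assumed).
audio = rng.normal(size=128)   # e.g. prosodic / paralinguistic embedding
visual = rng.normal(size=256)  # e.g. facial action unit features
physio = rng.normal(size=32)   # e.g. EDA and HRV statistics

def softmax(z):
    z = z - z.max()            # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Feature-level (early) fusion: concatenate modalities, one joint classifier.
W_joint = rng.normal(size=(6, 128 + 256 + 32))  # 6 assumed emotion classes
p_early = softmax(W_joint @ np.concatenate([audio, visual, physio]))

# Decision-level (late) fusion: per-modality classifiers, weighted posteriors.
W_a = rng.normal(size=(6, 128))
W_v = rng.normal(size=(6, 256))
W_p = rng.normal(size=(6, 32))
posteriors = [softmax(W @ x) for W, x in ((W_a, audio), (W_v, visual), (W_p, physio))]
weights = [0.4, 0.4, 0.2]      # hypothetical per-modality reliability weights
p_late = sum(w * p for w, p in zip(weights, posteriors))

print(p_early.round(3))
print(p_late.round(3))

One practical consequence of this design difference: when a modality is missing at test time (a common situation for physiological sensors), late fusion can simply drop that posterior and renormalise the remaining weights, whereas early fusion must impute or retrain.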

How to Cite

Zhou, M.-L., Asante, K., & Petrov, I. (2026). Multimodal Fusion Strategies for Affective Computing: Audio, Visual, and Physiological Signals. DATAMIND, 4(1), 1–5. https://doi.org/10.63646/