Multimodal Fusion Strategies for Affective Computing: Audio, Visual, and Physiological Signals

Mei-Ling Zhou*
Department of Brain and Cognitive Sciences, MIT, Cambridge, MA 02139, USA
meizhou@mit.edu
Kwame Asante
Department of Electrical Engineering, Kwame Nkrumah University of Science and Technology, Kumasi AK385, Ghana
Irina Petrov
Faculty of Computer Science, Higher School of Economics, Moscow 101000, Russia

Abstract

Affective computing (the computational recognition of, modelling of, and response to human emotion) has been a research area for nearly three decades, but the convergence of multimodal deep learning, wearable sensor technology, and large-scale pre-trained models has opened new possibilities and thrown old problems into sharper relief. This review surveys multimodal fusion strategies for affective computing, focusing on the combination of three signal modalities that together cover the principal pathways through which emotion is expressed: audio (speech prosody, paralinguistic features), visual (facial action units, body posture, gaze), and physiological signals (EEG, EDA, heart rate variability). We analyse fusion architectures across four families (feature-level, decision-level, model-level, and hybrid) and evaluate their performance on six benchmark datasets spanning three affective recognition tasks: valence-arousal regression, discrete emotion classification, and pain intensity estimation. A key finding is that physiological modalities are systematically underrepresented in multimodal fusion research relative to their information value, partly because of data-collection constraints and partly because of cultural assumptions about which emotional expressions are universal and which are culturally specific. We identify cross-cultural generalisation and privacy-preserving physiological sensing as the most critical open research directions.
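To make the first two fusion families concrete, the following minimal sketch contrasts feature-level (early) fusion, which concatenates per-modality features before a single joint classifier, with decision-level (late) fusion, which trains one classifier per modality and combines their posteriors. All feature dimensions, the six-class emotion taxonomy, the linear classifiers, and the per-modality weights are illustrative assumptions for exposition, not values or models from the surveyed systems.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in feature vectors for one sample (dimensions assumed).
audio = rng.normal(size=128)   # e.g. prosodic / paralinguistic embedding
visual = rng.normal(size=256)  # e.g. facial action unit features
physio = rng.normal(size=32)   # e.g. EDA and HRV statistics

def softmax(z):
    z = z - z.max()            # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Feature-level (early) fusion: concatenate modalities, one joint classifier.
W_joint = rng.normal(size=(6, 128 + 256 + 32))  # 6 assumed emotion classes
p_early = softmax(W_joint @ np.concatenate([audio, visual, physio]))

# Decision-level (late) fusion: per-modality classifiers, weighted posteriors.
W_a = rng.normal(size=(6, 128))
W_v = rng.normal(size=(6, 256))
W_p = rng.normal(size=(6, 32))
posteriors = [softmax(W @ x) for W, x in ((W_a, audio), (W_v, visual), (W_p, physio))]
weights = [0.4, 0.4, 0.2]      # hypothetical per-modality reliability weights
p_late = sum(w * p for w, p in zip(weights, posteriors))

print(p_early.round(3))
print(p_late.round(3))

One practical consequence of this design difference: when a modality is missing at test time (a common situation for physiological sensors), late fusion can simply drop that posterior and renormalise the remaining weights, whereas early fusion must impute or retrain.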

How to Cite

Zhou, M.-L., Asante, K., & Petrov, I. (2026). Multimodal Fusion Strategies for Affective Computing: Audio, Visual, and Physiological Signals. DATAMIND, 4(1), 1–5. https://doi.org/10.63646/