Abstract

Human emotion is a core link between cognition, behavior, and physiology, and its accurate recognition is crucial for advancing intelligent human-computer interaction, mental health diagnosis, and related fields. Most current research achieves multi-domain feature fusion through simple concatenation or weighted fusion at the algorithmic level, failing to fully reveal and exploit the contributions of individual features to emotional processing. To address these issues, this study first performs comprehensive feature extraction across the temporal, frequency, and spatial domains, and then proposes CLANet, a hybrid encoder model integrating a 3D-CNN, an LSTM, and an attention mechanism. The model captures local dynamic patterns while also integrating global spatial configurations, thereby providing a novel approach to emotion recognition and improving recognition accuracy. Experiments on the SEED IV dataset demonstrate that CLANet achieves a test accuracy of 93.8%, outperforming baseline models such as support vector machines (SVMs), BiLSTM, hierarchical LSTM, and EEGNet. Furthermore, the fusion of multi-domain (temporal, frequency, and spatial) features significantly enhances recognition performance, reaching a maximum accuracy of 94.0% in the θ band. This study provides a more physiologically relevant architecture for EEG-based emotion recognition and offers technical support for its practical application in related fields.