A Transfer Learning-Based Text-Centric Model for Multimodal Sentiment Analysis
DOI: https://doi.org/10.22399/ijcesen.1548

Keywords: Multimodal Sentiment Analysis, Transfer Learning, Text-Centric Model, Information Fusion

Abstract
Multimodal sentiment analysis (MMSA) extracts effective information from heterogeneous modalities and then fuses the multimodal data to perform sentiment analysis. With the development of big data and machine learning, MMSA has become a hot research direction in multimodal learning and natural language processing. Although various feature extraction and information fusion methods have been proposed, challenges remain. First, in feature extraction, pre-trained models trained on large datasets can yield higher-quality features, but how best to apply these models to extract the most useful features still requires study. Second, the currently popular fusion methods pay little attention to the interaction between modalities and the retention of each modality's basic information. To overcome these problems, this paper proposes a multimodal sentiment analysis model that treats text features as the core modality and video and audio features as auxiliary modalities, using a modality attention mechanism to capture the intrinsic connections between modalities. The attention mechanism takes the video and audio features as the focus and enhances the text modality with their fusion. To improve the quality of the extracted features, the method adopts a transfer learning strategy, using pre-trained models for feature processing. The proposed method is evaluated on the CMU-MOSI dataset. Experimental results show that the model outperforms traditional and baseline methods on both sentiment score prediction and sentiment classification tasks.
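The abstract describes the architecture only at a high level. The sketch below illustrates one plausible reading of the text-centric attention fusion it outlines: text features serve as the query, while audio and video features act as keys and values, so the auxiliary modalities refine the core text representation before a sentiment score is regressed. The class name, layer sizes, and feature dimensions (768-d text such as BERT output, 74-d audio, 35-d video, as commonly paired with CMU-MOSI) are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class TextCentricFusion(nn.Module):
    """Minimal sketch of text-centric cross-modal attention fusion.

    Text is the core modality (query); audio and video are auxiliary
    modalities (keys/values). All dimensions are assumptions.
    """

    def __init__(self, d_text=768, d_audio=74, d_video=35, d_model=128, n_heads=4):
        super().__init__()
        # Project each modality into a shared space.
        self.proj_t = nn.Linear(d_text, d_model)
        self.proj_a = nn.Linear(d_audio, d_model)
        self.proj_v = nn.Linear(d_video, d_model)
        # One cross-modal attention block per auxiliary modality.
        self.attn_a = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.attn_v = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Regression head for a sentiment score in [-3, 3] (CMU-MOSI convention).
        self.head = nn.Sequential(nn.LayerNorm(d_model), nn.Linear(d_model, 1))

    def forward(self, text, audio, video):
        t, a, v = self.proj_t(text), self.proj_a(audio), self.proj_v(video)
        # Text queries attend over audio and video features separately...
        t_a, _ = self.attn_a(t, a, a)
        t_v, _ = self.attn_v(t, v, v)
        # ...and the fused auxiliary information enhances the text stream.
        fused = t + t_a + t_v
        return self.head(fused.mean(dim=1))  # pool over tokens, predict score

# Example: batch of 2 clips, 20 text tokens, 50 audio/video frames (shapes assumed).
model = TextCentricFusion()
score = model(torch.randn(2, 20, 768), torch.randn(2, 50, 74), torch.randn(2, 50, 35))
print(score.shape)  # torch.Size([2, 1])
```

In a transfer learning setup like the one the abstract describes, the 768-d text input would come from a pre-trained encoder such as BERT (frozen or fine-tuned), with the audio and video features produced by pre-trained extractors; only the fusion layers above would then need training from scratch.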
License
Copyright (c) 2025 International Journal of Computational and Experimental Science and Engineering

This work is licensed under a Creative Commons Attribution 4.0 International License.