Arabic Topic Detection: A Comprehensive Review of Recent Advances

Noor S. Dawood; Salma A. Mahmood

doi:10.22399/ijcesen.3424

Authors

Noor S. Dawood, Computer Science Department, College of Computer Science and Information Technology, Basrah, Iraq
Salma A. Mahmood

DOI:

https://doi.org/10.22399/ijcesen.3424

Keywords:

Natural Language Processing, Arabic Language Processing, Topic Detection, Large Language Models, Machine Learning

Abstract

The integration of machine learning (ML) techniques and large language models (LLMs) such as BERT and GPT has significantly transformed topic detection and short-text analysis, particularly on platforms like Twitter. These models outperform traditional rule-based and statistical approaches by using transformer architectures and semantic embeddings (e.g., word embeddings) to uncover latent themes and contextual relationships in text. Even in low-resource language settings, LLMs can capture semantic nuances and support robust text classification and dynamic topic modeling. However, Arabic-language applications face distinct challenges, primarily the scarcity of high-quality, task-specific annotated datasets, especially for tasks such as synthetic content identification and fake news detection. Training effective models for Arabic requires extensive corpora, careful linguistic preprocessing, and sensitivity to morphological complexity and dialectal variability. In addition, LLMs face computational constraints on input length, which restrict scalability when processing large volumes of text. Future research should therefore focus on hybrid frameworks that combine domain-specific contextual fine-tuning, cross-lingual transfer learning, and better management of computational memory, in order to overcome these obstacles and realize the full potential of ML-driven text analytics in resource-constrained settings.
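As an illustration of the kind of linguistic preprocessing the abstract refers to, the following is a minimal Python sketch of orthographic normalization for Arabic text before it is passed to a topic model or classifier. It is not taken from the reviewed work: the function name `normalize_arabic` and the specific choices (stripping diacritics, removing tatweel, unifying alef variants) are assumptions made for this example, and whether each step helps depends on the task and dialect.

```python
import re

# Assumption: these normalization rules are common illustrative choices,
# not a procedure prescribed by the review.

# Short-vowel and related marks (tashkeel), U+064B..U+0652, plus superscript alef U+0670
_DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")
# Tatweel (kashida), a purely typographic elongation character
_TATWEEL = "\u0640"
# Alef with madda, hamza above, and hamza below
_ALEF_VARIANTS = re.compile(r"[\u0622\u0623\u0625]")
_BARE_ALEF = "\u0627"


def normalize_arabic(text: str) -> str:
    """Return a lightly normalized copy of an Arabic string (hypothetical helper)."""
    text = _DIACRITICS.sub("", text)             # drop diacritics
    text = text.replace(_TATWEEL, "")            # drop elongation characters
    text = _ALEF_VARIANTS.sub(_BARE_ALEF, text)  # unify alef forms to bare alef
    return re.sub(r"\s+", " ", text).strip()     # collapse whitespace


if __name__ == "__main__":
    print(normalize_arabic("\u0623\u064E\u0647\u0652\u0644\u0627\u064B"))
```

Normalization of this kind reduces surface variation in short texts such as tweets, but it can also discard distinctions that matter for some tasks, so it is best treated as a configurable step rather than a fixed default.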

