Sentiment Analysis for Transliterated Hindi and Marathi Language using Machine Learning Approach

Rishikesh Janardan Sutar; Kamalakar Ravindra Desai

doi:10.22399/ijcesen.3115

Authors

Rishikesh Janardan Sutar
Kamalakar Ravindra Desai

DOI:

https://doi.org/10.22399/ijcesen.3115

Keywords:

Sentiment analysis, Hindi-Marathi Transliterated Text, Spelling Variations, Sentiment words dictionary, Lexical analysis

Abstract

Sentiment analysis of transliterated regional languages such as Hindi and Marathi has attracted growing research interest because of the linguistic diversity and informal nature of user-generated content. However, most existing approaches are limited by datasets that fail to capture the wide range of transliteration-based spelling variations inherent in such text. To address this gap, the present study introduces a manually curated sentiment word dictionary for Hindi and Marathi, enriched with diverse transliterated spellings and associated sentiment weights. Using this resource, multiple sentence-level datasets were developed, including Hindi, Marathi, and real-world YouTube comment datasets, in which each sentence is annotated with an average sentiment score derived from its constituent sentiment words. A sentiment classification framework was then designed using three feature extraction strategies: Count Vectorizer (CV), TF-IDF Vectorizer, and a Graph Embedding Technique (GET) combined with Rank-Based Selection (RBS). These features were used to train and evaluate three machine learning classifiers: Support Vector Machine (SVM), Logistic Regression (LR), and Random Forest (RF). The framework relies mainly on manually engineered linguistic features and graph-based representations. Experimental results show that SVM consistently outperforms LR and RF across all feature configurations. Among all combinations, SVM with TF-IDF achieved the highest accuracy, while SVM with GET+RBS performed robustly across datasets. Furthermore, the Hindi, Marathi, and mixed Hindi-Marathi datasets yielded accuracies that were comparable to one another and higher than those obtained on the YouTube comments dataset, confirming the advantage of structured transliterated corpora in sentiment analysis.
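
The sentence-level annotation described in the abstract (an average sentiment score derived from the sentiment words in a sentence, with each word stored alongside its transliterated spelling variants) can be illustrated with a minimal sketch. The dictionary entries, weights, spelling variants, and the function names `sentence_score` and `label` below are hypothetical and chosen only for illustration; they are not taken from the authors' dictionary or code, and the thresholding at zero is an assumption.

```python
from statistics import mean

# Hypothetical dictionary: canonical word -> (sentiment weight, spelling variants).
# Entries and weights are illustrative only, not the authors' actual resource.
SENTIMENT_DICT = {
    "accha": (1.0, ["accha", "acha", "achha"]),        # Hindi: good
    "kharab": (-1.0, ["kharab", "khrab", "kharaab"]),  # Hindi/Marathi: bad
    "chhan": (1.0, ["chhan", "chan", "chaan"]),        # Marathi: nice
    "vait": (-1.0, ["vait", "vaait", "waait"]),        # Marathi: bad
}

# Flatten to a lookup table so every spelling variant maps to its weight.
VARIANT_WEIGHTS = {
    variant: weight
    for weight, variants in SENTIMENT_DICT.values()
    for variant in variants
}


def sentence_score(sentence):
    """Average the weights of the sentiment words found in the sentence."""
    tokens = [tok.strip(".,!?") for tok in sentence.lower().split()]
    weights = [VARIANT_WEIGHTS[tok] for tok in tokens if tok in VARIANT_WEIGHTS]
    # Sentences without any dictionary word are treated as neutral (assumption).
    return mean(weights) if weights else 0.0


def label(score):
    """Map an average score to a class label (zero threshold is an assumption)."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


if __name__ == "__main__":
    for text in ["yeh movie bahut achha hai", "picture khup chaan aahe!"]:
        print(text, "->", label(sentence_score(text)))
```

Mapping every spelling variant to a single weight is what lets differently spelled forms of the same word contribute identically to the sentence score; labels produced this way could then serve as targets for the CV, TF-IDF, or GET+RBS feature pipelines.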
