Optimizing Type II Diabetes Prediction Through Hybrid Big Data Analytics and H-SMOTE Tree Methodology

K.S. Praveenkumar; R. Gunasundari

doi:10.22399/ijcesen.727

Authors

K.S. Praveenkumar Research Scholar
R. Gunasundari


Keywords:

Hybrid Big Data Analytics, Type II Diabetes Prediction, H-SMOTE Tree, Data Preprocessing, Feature Selection, Healthcare Decision Making

Abstract

In recent years, Type II diabetes has become far more common worldwide, creating major problems for both healthcare systems and individuals. Big data analytics has shown potential for forecasting and managing chronic illnesses such as Type II diabetes. This paper proposes a hybrid approach that combines big data analytics techniques with an H-SMOTE tree algorithm to predict Type II diabetes. The method addresses the class imbalance present in medical datasets and improves prediction accuracy by combining data preprocessing, feature selection, and classification. First, the raw data are cleaned, standardised, and transformed to prepare them for analysis. Then, feature selection techniques identify the factors most useful for predicting Type II diabetes, which lowers the dimensionality of the data and simplifies the predictive model. In the classification phase, the H-SMOTE tree is applied. It combines two existing techniques: the Hoeffding Adaptive Tree (HAT) and the Synthetic Minority Oversampling Technique (SMOTE). The H-SMOTE tree handles imbalanced data by creating synthetic samples for the under-represented class while adapting the decision tree structure as new data arrive. Experiments show that the H-SMOTE tree model predicts Type II diabetes (T2DM) more accurately than other machine learning methods, both classic and recent. This advantage holds across several metrics, including sensitivity (how well true positives are identified), specificity (how well false positives are avoided), and overall performance as captured by the AUC-ROC score. The proposed method is also robust and scalable, which makes it suitable for the large medical datasets common in healthcare.
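As a rough illustration of the oversampling idea mentioned in the abstract, the following Python sketch creates synthetic minority-class samples by interpolating between a minority sample and one of its nearest minority neighbours, which is the core step of SMOTE. It is not the authors' implementation: the function name `smote_oversample`, the parameters `k` and `seed`, and the toy feature values are assumptions made for this example, and the Hoeffding Adaptive Tree component is not shown.

```python
import math
import random


def smote_oversample(minority, n_new, k=3, seed=0):
    """Return n_new synthetic points, each interpolated between a minority
    sample and one of its k nearest minority neighbours (the core SMOTE idea).

    Hypothetical helper written for illustration; not taken from the paper.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base point.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        # Pick a random position on the segment between base and neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy, made-up feature vectors (for example, two scaled clinical measurements).
# These values are assumptions and do not come from the paper's dataset.
minority = [[0.90, 0.80], [0.85, 0.75], [0.95, 0.70], [0.80, 0.90]]
new_points = smote_oversample(minority, n_new=6)
print(len(new_points), "synthetic minority samples generated")
```

Because each synthetic point lies on a line segment between two real minority samples, the new points stay within the region the minority class already occupies. In a streaming setting such as the one the abstract describes, samples like these would be fed to the incremental tree learner alongside the real data.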

