Evaluating Machine Learning Models for House Price Prediction with Different Sampling Techniques
DOI:
https://doi.org/10.22399/ijcesen.2870
Keywords:
Machine Learning, Ensemble Methods, House Price Prediction, Data Sampling Techniques
Abstract
This study investigates how advanced sampling techniques interact with machine learning models when predicting residential property sale prices, using a diverse dataset that covers structural, locational, and economic attributes. With an emphasis on Stratified Extreme Ranked Set Sampling (SERSS), the research systematically evaluates the impact of five sampling methods (SERSS, Cluster, Bootstrap, Systematic, and Random Sampling) on several machine learning algorithms, including CatBoost, Random Forest, ElasticNet, and FIkNN. The findings show that SERSS significantly improves the generalizability and robustness of predictive models by capturing both central and extreme tendencies in the data, and that it preserves dataset variability better than the traditional methods. The ensemble methods CatBoost and Random Forest and the similarity-based algorithm FIkNN consistently delivered the best predictive accuracy, achieving Mean Absolute Error (MAE) values between $85 and $650 and high R² values under the structured sampling techniques. Conversely, unstructured methods such as Random Sampling introduced biases that led to substantial deviations in the predictions. These results underscore the importance of matching the sampling methodology to model-specific characteristics in order to optimize performance. The study offers researchers and practitioners in predictive modeling a framework for combining sampling strategies with advanced machine learning models on heterogeneous datasets.
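As a rough illustration of three of the sampling methods named in the abstract, together with the MAE metric, the following sketch uses only the Python standard library. It is not the authors' implementation: the function names, the `seed` parameter, and the toy `prices` list are assumptions made for this example, and SERSS and Cluster Sampling are left out because the abstract does not describe their procedure.

```python
import random


def simple_random_sample(rows, n, seed=0):
    """Draw n rows without replacement; every row is equally likely."""
    rng = random.Random(seed)
    return rng.sample(rows, n)


def systematic_sample(rows, n, seed=0):
    """Take every k-th row after a random start, with k = len(rows) // n."""
    k = len(rows) // n  # sampling interval; assumes 1 <= n <= len(rows)
    start = random.Random(seed).randrange(k)
    return [rows[start + i * k] for i in range(n)]


def bootstrap_sample(rows, n, seed=0):
    """Draw n rows with replacement, so a row may appear more than once."""
    rng = random.Random(seed)
    return [rng.choice(rows) for _ in range(n)]


def mean_absolute_error(actual, predicted):
    """Average absolute gap between actual and predicted prices."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical sale prices in dollars, sorted ascending (not the study's data).
prices = [100_000 + 5_000 * i for i in range(20)]

random_rows = simple_random_sample(prices, 5)
systematic_rows = systematic_sample(prices, 5)
bootstrap_rows = bootstrap_sample(prices, 5)
```

On a sorted list, the systematic sample is spread evenly across the price range, whereas the random and bootstrap samples may cluster in one part of it. That difference in coverage is the kind of effect the abstract attributes to structured versus unstructured sampling.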
Copyright (c) 2025 International Journal of Computational and Experimental Science and Engineering

This work is licensed under a Creative Commons Attribution 4.0 International License.