Classification of Medical Complaints: Comparative Analysis of Machine Learning Algorithms with Determination of Dominant Factors Using Information Gain

Authors

  • Catherine Santoso Prasetya, Information System, Institut Informatika Indonesia
  • I Gede Wiarta Sena, Institut Informatika Indonesia, Surabaya
  • Matthew Austen Fernando, Dongseo University

DOI:

https://doi.org/10.24002/ijis.v8i2.13128

Abstract

This research compares three machine learning algorithms, Random Forest (RF), Decision Tree (DT), and K-Nearest Neighbors (KNN), for classifying illnesses influenced by climate, patient history, and clinical indicators. The dataset, obtained from Kaggle, contains 5,200 records combining meteorological and symptom data. Two pre-processing scenarios were tested to examine their impact on model performance: (1) Min-Max normalization, and (2) normalization followed by balancing with the Synthetic Minority Over-sampling Technique (SMOTE). Results show that normalization significantly improves KNN’s performance, increasing its accuracy from 0.324 on raw data to 0.968. In the first scenario, Random Forest achieved the highest accuracy of 0.985, followed by Decision Tree with 0.974 and KNN with 0.968. After applying SMOTE, Random Forest maintained a stable accuracy of 0.985, while the accuracies of Decision Tree and KNN decreased slightly to 0.964. These findings indicate that Random Forest is the most robust and consistent algorithm for this classification task. Furthermore, the study reveals that SMOTE does not always enhance accuracy and must be applied selectively. Information gain analysis identifies symptom features as the strongest predictors. Overall, this research provides guidance for selecting the optimal algorithm and pre-processing strategy for building effective weather-related disease classification systems.
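The pipeline the abstract describes can be sketched with scikit-learn. Everything below is an illustrative stand-in, not the authors' code: synthetic imbalanced data replaces the Kaggle dataset, classifiers use default hyperparameters, and mutual information is used as a proxy for the information-gain feature ranking.

```python
# Sketch of the paper's comparison pipeline on synthetic data (assumption:
# the real dataset mixes meteorological readings and symptom indicators).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import mutual_info_classif
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

# Imbalanced multi-class data standing in for the 5,200-record dataset.
X, y = make_classification(n_samples=1000, n_features=10, n_informative=6,
                           n_classes=3, n_clusters_per_class=1,
                           weights=[0.6, 0.3, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Scenario 1: Min-Max normalization (fit on the training split only,
# so no test-set information leaks into the scaler).
scaler = MinMaxScaler().fit(X_tr)
X_tr_s, X_te_s = scaler.transform(X_tr), scaler.transform(X_te)

accs = {}
for name, clf in [("RF", RandomForestClassifier(random_state=0)),
                  ("DT", DecisionTreeClassifier(random_state=0)),
                  ("KNN", KNeighborsClassifier())]:
    accs[name] = accuracy_score(y_te, clf.fit(X_tr_s, y_tr).predict(X_te_s))
print(accs)

# Scenario 2 would additionally oversample the training split, e.g. with
# imblearn.over_sampling.SMOTE().fit_resample(X_tr_s, y_tr).

# Information-gain-style ranking: mutual information scores each feature's
# dependence on the class label; the top indices are the dominant factors.
mi = mutual_info_classif(X_tr_s, y_tr, random_state=0)
top_features = np.argsort(mi)[::-1]
```

Note that KNN is the algorithm most affected by Min-Max scaling, since its Euclidean distances are otherwise dominated by large-range features; tree-based models such as RF and DT are scale-invariant, which is consistent with the accuracy pattern the study reports.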

Keywords: Classification of Diseases, Decision Tree, K-Nearest Neighbors, Random Forest, SMOTE

References

[1] P. Khan et al., “Machine Learning and Deep Learning Approaches for Brain Disease Diagnosis: Principles and Recent Advances,” IEEE Access, vol. 9, pp. 37622–37655, 2021, doi: 10.1109/ACCESS.2021.3062484.

[2] M. M. Ahsan, S. A. Luna, and Z. Siddique, “Machine-Learning-Based Disease Diagnosis: A Comprehensive Review,” Healthcare, vol. 10, no. 3, p. 541, Mar. 2022, doi: 10.3390/healthcare10030541.

[3] S. Asif et al., “Advancements and Prospects of Machine Learning in Medical Diagnostics: Unveiling the Future of Diagnostic Precision,” Arch. Comput. Methods Eng., vol. 32, no. 2, pp. 853–883, Mar. 2025, doi: 10.1007/s11831-024-10148-w.

[4] I. G. W. Sena and A. W. R. Emanuel, “Mobile Legend Game Prediction Using Machine Learning Regression Method,” JURTEKSI (Jurnal Teknologi dan Sistem Informasi), vol. 9, no. 2, pp. 221–230, Mar. 2023, doi: 10.33330/jurteksi.v9i2.1866.

[5] Z. Azam, M. M. Islam, and M. N. Huda, “Comparative Analysis of Intrusion Detection Systems and Machine Learning-Based Model Analysis Through Decision Tree,” IEEE Access, vol. 11, pp. 80348–80391, 2023, doi: 10.1109/ACCESS.2023.3296444.

[6] Z. Azam, Md. M. Islam, and M. N. Huda, “Comparative Analysis of Intrusion Detection Systems and Machine Learning-Based Model Analysis Through Decision Tree,” IEEE Access, vol. 11, pp. 80348–80391, 2023, doi: 10.1109/ACCESS.2023.3296444.

[7] N. S. Sediatmoko, Y. Nataliani, and I. Suryady, “Sentiment Analysis of Customer Review Using Classification Algorithms and SMOTE for Handling Imbalanced Class,” Indonesian Journal of Information Systems, vol. 7, no. 1, pp. 38–52, Aug. 2024, doi: 10.24002/ijis.v7i1.8879.

[8] A. Saad Hussein, T. Li, C. W. Yohannese, and K. Bashir, “A-SMOTE: A New Preprocessing Approach for Highly Imbalanced Datasets by Improving SMOTE,” Int. J. Comput. Intell. Syst., vol. 12, no. 2, p. 1412, 2019, doi: 10.2991/ijcis.d.191114.002.

[9] T. Wongvorachan, S. He, and O. Bulut, “A Comparison of Undersampling, Oversampling, and SMOTE Methods for Dealing with Imbalanced Classification in Educational Data Mining,” Information, vol. 14, no. 1, p. 54, Jan. 2023, doi: 10.3390/info14010054.

[10] N. N. A. Nanda, Y. Farida, and W. D. Utami, “Implementation of SMOTE to Improve the Performance of Random Forest Classification in Credit Risk Assessment in Banking,” INTENSIF: Jurnal Ilmiah Penelitian dan Penerapan Teknologi Sistem Informasi, vol. 9, no. 2, pp. 158–177, Jul. 2025, doi: 10.29407/intensif.v9i2.23930.

[11] A. I. Marqués, V. García, and J. S. Sánchez, “A literature review on the application of evolutionary computing to credit scoring,” J. Oper. Res. Soc., vol. 64, no. 9, pp. 1384–1399, Sep. 2013, doi: 10.1057/jors.2012.145.

[12] S. Bhatore, L. Mohan, and Y. R. Reddy, “Machine learning techniques for credit risk evaluation: a systematic literature review,” Journal of Banking and Financial Technology, vol. 4, no. 1, pp. 111–138, Apr. 2020, doi: 10.1007/s42786-020-00020-3.

[13] N. Pudjihartono, T. Fadason, A. W. Kempa-Liehr, and J. M. O’Sullivan, “A Review of Feature Selection Methods for Machine Learning-Based Disease Risk Prediction,” Frontiers in Bioinformatics, vol. 2, Jun. 2022, doi: 10.3389/fbinf.2022.927312.

[14] M. M. Ahsan, S. A. Luna, and Z. Siddique, “Machine-Learning-Based Disease Diagnosis: A Comprehensive Review,” Healthcare, vol. 10, no. 3, p. 541, Mar. 2022, doi: 10.3390/healthcare10030541.

[15] T. Rahmawati, Alexander Wirapraja, and E. C. Soesilo, “Sistem Pendukung Keputusan Penentuan Dosen Pembimbing Tugas Akhir Menggunakan Fuzzy Dan Simple Additive Weighting Berbasis Android: Studi Kasus IKADO Surabaya,” KONSTELASI: Konvergensi Teknologi dan Sistem Informasi, vol. 2, no. 1, Apr. 2022, doi: 10.24002/konstelasi.v2i1.5632.

[16] T. Suresh, T. A. Assegie, S. Rajkumar, and N. Komal Kumar, “A hybrid approach to medical decision-making: diagnosis of heart disease with machine-learning model,” International Journal of Electrical and Computer Engineering (IJECE), vol. 12, no. 2, p. 1831, Apr. 2022, doi: 10.11591/ijece.v12i2.pp1831-1838.

[17] A. S. Barkah, S. R. Selamat, Z. Z. Abidin, and R. Wahyudi, “Impact of Data Balancing and Feature Selection on Machine Learning-based Network Intrusion Detection,” JOIV : International Journal on Informatics Visualization, vol. 7, no. 1, p. 241, Feb. 2023, doi: 10.30630/joiv.7.1.1041.

[18] A. Shan, I. Amir, and M. Kamal, “Weather-related Disease Prediction Dataset,” May 2024, Zenodo. doi: 10.5281/zenodo.11366485.

[19] Q. H. Nguyen et al., “Influence of Data Splitting on Performance of Machine Learning Models in Prediction of Shear Strength of Soil,” Math. Probl. Eng., vol. 2021, pp. 1–15, Feb. 2021, doi: 10.1155/2021/4832864.

[20] Y. Dimas Pratama and A. Salam, “Comparison of Data Normalization Techniques on KNN Classification Performance for Pima Indians Diabetes Dataset,” Journal of Applied Informatics and Computing, vol. 9, no. 3, pp. 693–706, Jun. 2025, doi: 10.30871/jaic.v9i3.9353.

[21] P. Soltanzadeh and M. Hashemzadeh, “RCSMOTE: Range-Controlled synthetic minority over-sampling technique for handling the class imbalance problem,” Inf. Sci. (Ny)., vol. 542, pp. 92–111, Jan. 2021, doi: 10.1016/j.ins.2020.07.014.

[22] M. B. Al Snousy, H. M. El-Deeb, K. Badran, and I. A. Al Khlil, “Suite of decision tree-based classification algorithms on cancer gene expression data,” Egypt. Informatics J., vol. 12, no. 2, pp. 73–82, Jul. 2011, doi: 10.1016/j.eij.2011.04.003.

[23] H. A. Salman, A. Kalakech, and A. Steiti, “Random Forest Algorithm Overview,” Babylonian J. Mach. Learn., vol. 2024, pp. 69–79, Jun. 2024, doi: 10.58496/BJML/2024/007.

[24] A. Pandey and A. Jain, “Comparative Analysis of KNN Algorithm using Various Normalization Techniques,” Int. J. Comput. Netw. Inf. Secur., vol. 9, no. 11, pp. 36–42, Nov. 2017, doi: 10.5815/ijcnis.2017.11.04.

[25] I. Handayani, “Application of K-Nearest Neighbor Algorithm on Classification of Disk Hernia and Spondylolisthesis in Vertebral Column,” Indonesian Journal of Information Systems, vol. 2, no. 1, pp. 57–66, Aug. 2019, doi: 10.24002/ijis.v2i1.2352.

[26] H. M and S. M.N, “A Review on Evaluation Metrics for Data Classification Evaluations,” Int. J. Data Min. Knowl. Manag. Process, vol. 5, no. 2, pp. 01–11, Mar. 2015, doi: 10.5121/ijdkp.2015.5201.

[27] F. Gong, L. Jiang, H. Zhang, D. Wang, and X. Guo, “Gain ratio weighted inverted specific-class distance measure for nominal attributes,” Int. J. Mach. Learn. Cybern., vol. 11, no. 10, pp. 2237–2246, Oct. 2020, doi: 10.1007/s13042-020-01112-8.

[28] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, “SMOTE: Synthetic Minority Over-sampling Technique,” Journal of Artificial Intelligence Research, vol. 16, pp. 321–357, Jun. 2002, doi: 10.1613/jair.953.

[29] O. Graham and P. Henderson, “Advancing Explainable Artificial Intelligence for Clinical Decision Support: Techniques, Challenges, and Evaluation Frameworks in High-Stakes Medical Environments,” May 28, 2025. doi: 10.20944/preprints202505.2281.v1.

Published

2026-02-28

How to Cite

Prasetya, C. S., Sena, I. G. W., & Fernando, M. A. (2026). Classification of Medical Complaints: Comparative Analysis of Machine Learning Algorithms with Determination of Dominant Factors Using Information Gain. Indonesian Journal of Information Systems, 8(2), 212–226. https://doi.org/10.24002/ijis.v8i2.13128