Importance of Feature Selection for Multiple Disease Classification
Keywords:
feature selection, machine learning, disease diagnosis, classification accuracy
Abstract
The performance of machine learning models in disease classification depends heavily on effective feature selection. This study evaluates two feature selection methods, Boruta and Recursive Feature Elimination (RFE), in combination with tree-based and ensemble models (Random Forest, Decision Tree, Gradient Boosting, LightGBM, and XGBoost) on Electronic Health Records (EHR) data. The results show that combining Boruta with LightGBM achieves the highest accuracy, 99%. Feature selection improves precision by keeping the relevant variables and removing unnecessary ones. Further analysis shows that features such as Red Blood Cells, Insulin, Heart Rate, and Cholesterol significantly influence the classification of specific diseases. These findings highlight the importance of feature selection in multi-disease classification and medical data analysis, and its role in improving the efficiency of machine learning systems. Future research should develop more flexible feature selection methods and test the models on more diverse disease datasets.
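The shadow-feature idea behind Boruta can be illustrated with a minimal sketch. This is not the study's pipeline: Boruta proper ranks features with Random Forest importances, whereas this sketch swaps in a simple correlation-ratio score (eta squared) so that it runs on the standard library alone. The function names, the toy data, and the feature names `informative` and `noise` are all hypothetical.

```python
import random
from statistics import mean


def eta_squared(values, labels):
    """Share of a feature's variance explained by class membership (0 to 1)."""
    grand = mean(values)
    total = sum((v - grand) ** 2 for v in values)
    if total == 0:
        return 0.0
    between = 0.0
    for cls in set(labels):
        group = [v for v, y in zip(values, labels) if y == cls]
        between += len(group) * (mean(group) - grand) ** 2
    return between / total


def shadow_feature_selection(features, labels, n_iter=20, seed=0):
    """Keep features that beat the best shuffled ("shadow") copy in most rounds.

    Assumption: eta squared stands in for the Random Forest importance
    that Boruta itself uses.
    """
    rng = random.Random(seed)
    hits = {name: 0 for name in features}
    for _ in range(n_iter):
        # Shuffling a column breaks its link to the labels, so the best
        # shadow score estimates how high a score arises by chance.
        shadow_scores = []
        for values in features.values():
            shuffled = list(values)
            rng.shuffle(shuffled)
            shadow_scores.append(eta_squared(shuffled, labels))
        threshold = max(shadow_scores)
        for name, values in features.items():
            if eta_squared(values, labels) > threshold:
                hits[name] += 1
    return [name for name, count in hits.items() if count > n_iter / 2]


# Toy data: three classes with four samples each (hypothetical values).
labels = [0] * 4 + [1] * 4 + [2] * 4
features = {
    # Values cluster tightly by class, so class explains almost all variance.
    "informative": [1.0, 1.1, 0.9, 1.0, 5.0, 5.1, 4.9, 5.0, 9.0, 9.1, 8.9, 9.0],
    # The same pattern repeats in every class, so class explains nothing.
    "noise": [1.0, 2.0, 3.0, 4.0] * 3,
}
selected = shadow_feature_selection(features, labels)
print(selected)
```

In a setting like the one the abstract describes, the retained features would then be passed to a classifier such as LightGBM. The comparison against shuffled copies is what separates this approach from a fixed importance cutoff.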
License

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Copyright of this journal is assigned to Jurnal Buana Informatika as the journal publisher with the knowledge of the author, while the moral right of the publication belongs to the author. All printed and electronic publications are open access for educational, research, and library purposes. The editorial board is not responsible for copyright violations arising from uses other than the purposes mentioned above. Reproduction of any part of this journal (printed or online) is allowed only with written permission from Jurnal Buana Informatika.