Comparative Analysis of Machine Learning Algorithms and Imbalanced Data Handling in the Classification of Potable Water Quality


Generosa Lukhayu Pritalia, Universitas Atma Jaya Yogyakarta



Keywords: Machine Learning, Algorithm, Imbalanced Data, Classification, Water Quality



Abstract. Water is essential for survival. Water quality must now be monitored, assessed, and classified to understand the impact of industrialization. Water quality classification has been carried out with traditional methods such as WQI and Storet, as well as with machine learning methods. In machine learning, imbalanced data can bias a model toward predicting the majority class. In addition, using all features in the classification process can degrade classification performance and increase computation time. To address these problems, this study proposes several approaches: resampling the data to balance the classes, selecting the most suitable and contributing features, and comparing the performance of machine learning algorithms in classifying potable water. Handling the imbalanced data and applying feature selection improved algorithm performance; in particular, accuracy increased by 24.8% compared with a previous study. The best overall performance was obtained with Random Forest, which achieved 87% precision, 84% recall, a 16% miss rate, an 85% F-measure, and 85% test accuracy using the seven best features. However, the lowest miss rate, another important aspect, was 15% and was obtained with the Decision Tree algorithm.
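The metrics reported in the abstract are related to one another: the miss rate is the false negative rate, so it equals one minus recall, and the F-measure is the harmonic mean of precision and recall. The following is a minimal sketch of these standard definitions, not the study's own code. The function name `classification_metrics` and the confusion-matrix counts are hypothetical and chosen only for illustration; they are not taken from the study's data, and the choice of which class counts as "positive" is likewise an assumption.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute standard binary classification metrics from confusion-matrix counts.

    tp, fp, fn, tn: true positives, false positives, false negatives, true negatives.
    Assumes the positive class has at least one actual and one predicted example.
    """
    if tp + fp == 0 or tp + fn == 0:
        raise ValueError("precision or recall is undefined for these counts")

    precision = tp / (tp + fp)   # share of predicted positives that are correct
    recall = tp / (tp + fn)      # share of actual positives that are found
    miss_rate = fn / (tp + fn)   # false negative rate, equal to 1 - recall
    f_measure = 2 * precision * recall / (precision + recall)  # harmonic mean
    accuracy = (tp + tn) / (tp + fp + fn + tn)

    return {
        "precision": precision,
        "recall": recall,
        "miss_rate": miss_rate,
        "f_measure": f_measure,
        "accuracy": accuracy,
    }


# Hypothetical counts for illustration only (not the study's confusion matrix).
metrics = classification_metrics(tp=84, fp=13, fn=16, tn=87)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Because recall and miss rate share the same denominator, a recall of 84% always corresponds to a miss rate of 16%, which matches the pair of values the abstract reports for Random Forest.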


