Comparative Analysis of Classification Methods of KNN and Naïve Bayes to Determine Stress Level of Junior High School Students

Stress is generally defined as a state of mental disturbance that arises in response to adversity. Junior high school students are often unaware of the stress they experience. This research compares two classification methods, KNN and Naïve Bayes, for determining stress level. The data were gathered from 254 respondents at the Catholic Junior High School of Don Bosco Bitung. Both the k-fold cross-validation and percentage split tests showed that Naïve Bayes outperformed KNN. In cross-validation, KNN reached its highest accuracy of 86.61% at k=3, whereas Naïve Bayes reached 87.40%. In the percentage split test, the average accuracy of Naïve Bayes, 88.31%, was also higher than that of KNN. Naïve Bayes likewise achieved higher precision and recall than KNN, at 88.30% and 87.40% respectively in cross-validation.


Introduction
Stress is a thought or feeling that occurs in response to adversity or threats, which are called stressors [1]. According to [1], stress can motivate someone to reach certain goals, but it can also cause health problems such as indigestion and insomnia, especially in junior high school (JHS) students. Hence, determining students' stress level should be a concern so that it can be handled promptly. Unfortunately, the stress that students face is often detected too late. Besides the lack of personnel and consultation time at school, another obstacle to identifying stress level is the students' reluctance to seek counseling about their problems.
Stress level can be determined with classification methods from data mining. The study conducted by [2] reports that a data mining classification method performs better than the tested system. Among the available methods, K-Nearest Neighbors (KNN) and Naïve Bayes are the most frequently used [3]. These two methods were chosen because KNN can handle large amounts of noisy training data [4], a notion supported by [5], who claims that KNN can handle large training sets. In addition, Naïve Bayes is fast when applied to large data sets and is easy to understand [3]. [6] adds that increasing the amount of data given to Naïve Bayes can increase its accuracy. Both methods yield sufficient accuracy [7].
This study aims to compare the accuracy of the KNN and Naïve Bayes methods, using WEKA, in determining the stress level of JHS students. The data were obtained from a questionnaire given to 254 respondents, consisting of the respondents' identity and a set of questions. [8] explains that stress is a term commonly used to describe an unstable emotional condition caused by anger, frustration, fatigue, or pressure. Furthermore, [8] notes that stress can theoretically be viewed as an effort to withstand the physiological reaction to a pressing or dangerous condition, which is called a stressor. According to [9], stress can be classified into three categories: low stress, medium stress, and acute stress. Moreover, [10] explains that the stress level depends on how someone is exposed to the stressor, as follows:

Definition of Stress
a. Low stress. This is the early phase of responding to a stressor, which acts as a warning to resist. It is accompanied by strong physical arousal: the person shows strong feelings of anxiety, anger, and fear, an increased heart rate and breathing rhythm, and sweating.
b. Medium stress. In this phase, the body slowly returns to its normal state, characterized by reduced intensity and recovery of the energy spent. The arousal is still high but differs from the previous level. The visible symptoms at this stage are fatigue, being easily offended, and anger.
c. Acute stress. This stage occurs when someone is exposed to the stressor continuously. The intensity of heart rate and breathing decreases, but the ongoing stress drains the person's energy. It may also be followed by impaired heart and kidney function, allergies, and depression.

K-Nearest Neighbors
K-Nearest Neighbors (KNN) is an algorithm used for estimation and prediction and is frequently used in classification [11]. [11] explain that classification is similar to estimation, except that the target variable is categorical rather than numeric. Classification works by:
a. Examining a data set that contains the target variable and the predictor variables, which is used as training data.
b. Taking new data whose target value is not recorded and, on the basis of the training data, determining a classification for them.
KNN is a form of instance-based learning: the training data are stored, and a new record is classified by comparing it with the training data and assigning the majority class among the most similar training records [11]. This majority is drawn from a set number k of nearest neighbors [12]. The distance function most commonly used in KNN is the Euclidean distance, formulated as follows:

$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

Where: $d$ = Euclidean distance, $x_i$ = test data, $y_i$ = training data, and $n$ = the number of attributes.
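The distance computation and majority vote described above can be sketched in a few lines of Python. This is a minimal illustration rather than the Weka implementation used in the study; the function names `euclidean` and `knn_predict` and the toy data are assumptions made for the example.

```python
import math
from collections import Counter


def euclidean(x, y):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


def knn_predict(train_X, train_y, query, k=3):
    """Return the majority label among the k training rows nearest to query."""
    # Sort the training rows by distance to the query and keep the k closest.
    nearest = sorted(
        zip(train_X, train_y),
        key=lambda pair: euclidean(pair[0], query),
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy data (hypothetical, not taken from the study): two numeric features.
train_X = [[1.0, 1.0], [1.2, 0.8], [3.0, 3.2], [3.1, 2.9], [5.0, 5.1]]
train_y = ["low", "low", "medium", "medium", "acute"]

prediction = knn_predict(train_X, train_y, [2.9, 3.0], k=3)
print(prediction)
```

With k=3 the two closest rows carry the label "medium", so the vote returns that class; a different k can change the outcome, which is why the study tests a range of k values.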

Naïve Bayes
Naïve Bayes is derived from Bayes' theorem under the assumption that all features are conditionally independent of each other given the target variable [13]. Bayes' theorem expresses the probability of an event using existing knowledge of related conditions. It is calculated from the following equation:

$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}$$

Where A and B are events, P(A) is the probability of event A, P(B) is the probability of event B, P(A | B) is the conditional probability of event A given event B, and P(B | A) is the conditional probability of event B given event A [13].
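To make the independence assumption concrete, the sketch below scores each class as the prior multiplied by the per-feature conditional probabilities and picks the largest. It is a simplified categorical variant with add-one (Laplace) smoothing, which is an assumption of this example and not necessarily what Weka applies; the function names and toy records are likewise hypothetical.

```python
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class feature-value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)   # (feature index, label) -> value counts
    feature_values = defaultdict(set)     # feature index -> distinct values seen
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
            feature_values[i].add(value)
    return class_counts, value_counts, feature_values


def predict_naive_bayes(model, row):
    """Return the label maximising P(label) * product of P(value_i | label)."""
    class_counts, value_counts, feature_values = model
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total  # prior P(label)
        for i, value in enumerate(row):
            # Add-one smoothing keeps unseen values from zeroing the product.
            numerator = value_counts[(i, label)][value] + 1
            denominator = count + len(feature_values[i])
            score *= numerator / denominator
        scores[label] = score
    return max(scores, key=scores.get)


# Toy records (hypothetical): (how often the student worries, sleep quality).
rows = [("often", "poor"), ("often", "poor"), ("rarely", "good"),
        ("rarely", "good"), ("often", "good")]
labels = ["acute", "acute", "low", "low", "low"]

model = train_naive_bayes(rows, labels)
print(predict_naive_bayes(model, ("often", "poor")))
```

The shared denominator P(B) is omitted because it is the same for every class and does not change which class scores highest.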

Methodology
This study was conducted in the following stages:
a. Literature study. Relevant theories and studies related to the topic were analyzed. The literature includes books, electronic journals, and other reliable sources that form the theoretical framework.
b. Data collection. The data came from students of Don Bosco Bitung Catholic JHS, which had 699 students according to the School Monthly Report of October 2017. The data used for this study were obtained from 254 of these students and were taken from the previous study conducted by [14]. They comprise age, class, gender, number of children in the family, the student's birth order in the family, and a 20-question questionnaire.
c. Data analysis. The data were analyzed using Weka to determine the stress level with the two methods, KNN and Naïve Bayes. The results of the two classification methods were then compared in terms of accuracy.

Results
This study used 254 records taken from the study of [14], which gathered the data from students of Don Bosco Bitung Catholic JHS. That study showed that KNN could be applied to classify the stress level of JHS students, but it did not report the accuracy obtained [14]. It used k=5 as the neighborhood parameter [14]. Because KNN requires numeric variables [11], the gender variable was converted from F and M into decimal values, F=80 and M=76, so that the data could be processed by KNN [15]. Stress was then classified into three levels, low stress, medium stress, and acute stress [9]; the distribution of stress levels among the JHS students can be seen in Table 1.

Testing on KNN
Out of 254 respondents, there were 36 students with low stress, 191 students with medium stress, and 27 students with acute stress (Table 1). KNN was then tested with values from k=1 to k=40 using Weka 3.9 under two test models: k-fold cross-validation with folds=10, and percentage split. In the percentage split, the proportion of data used for training was set to 90%, 80%, 70%, and 60% [16]. Based on the data presented in Table 1, the highest accuracy values were taken to form the confusion matrices presented in Table 4 and Table 5. Table 6 then shows the precision and recall for the highest accuracy value.
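The two test models can be illustrated with a short Python sketch that only partitions record indices. It assumes that the training size in a percentage split is rounded to the nearest integer and that folds are assigned without shuffling or stratification; Weka's own procedure may randomize and stratify, so this shows the idea rather than reproducing the tool. The function names are hypothetical.

```python
def percentage_split_sizes(n_instances, train_percent):
    """Return (train_size, test_size) for a percentage split."""
    # Assumption: the training size is rounded to the nearest integer.
    train_size = int(n_instances * train_percent / 100 + 0.5)
    return train_size, n_instances - train_size


def k_fold_indices(n_instances, n_folds=10):
    """Yield (train_indices, test_indices) so each record is tested once."""
    indices = list(range(n_instances))
    for fold in range(n_folds):
        test = indices[fold::n_folds]          # every n_folds-th record
        test_set = set(test)
        train = [i for i in indices if i not in test_set]
        yield train, test


sizes = {p: percentage_split_sizes(254, p) for p in (90, 80, 70, 60)}
folds = list(k_fold_indices(254, n_folds=10))

for percent, (train_size, test_size) in sizes.items():
    print(percent, train_size, test_size)
```

For 254 records, these splits leave 25, 51, 76, and 102 records for testing, which matches the test-set sizes reported in the Discussion.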

Test on Naïve Bayes
Unlike KNN, which must be tested repeatedly to determine the value of k, Naïve Bayes requires only one k-fold cross-validation test. Because the value of k used as the classification reference is not fixed, the accuracy of KNN changes from test to test, whereas Naïve Bayes needs only a single test, which yielded a cross-validation accuracy of 87.40%; the confusion matrix obtained from WEKA can be seen in Table 7. The accuracy results for the percentage split can be seen in Table 8 and the confusion matrix of each test in Table 9. The precision and recall of the test can then be seen in Table 9.
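As an illustration of how precision and recall follow from a confusion matrix, the sketch below computes class-weighted averages. It assumes that the reported figures are weighted by the number of records in each actual class, as in a weighted-average summary; the matrix is a made-up toy example and does not reproduce any table from the study.

```python
def weighted_precision_recall(matrix):
    """Weighted-average precision and recall from a square confusion matrix.

    matrix[i][j] counts records of actual class i predicted as class j.
    """
    n_classes = len(matrix)
    total = sum(sum(row) for row in matrix)
    precision = 0.0
    recall = 0.0
    for c in range(n_classes):
        true_pos = matrix[c][c]
        actual = sum(matrix[c])                                  # row sum
        predicted = sum(matrix[r][c] for r in range(n_classes))  # column sum
        weight = actual / total
        recall += weight * (true_pos / actual if actual else 0.0)
        precision += weight * (true_pos / predicted if predicted else 0.0)
    return precision, recall


# Toy 3-class matrix (hypothetical counts): rows are actual, columns predicted.
toy = [[8, 2, 0],
       [1, 15, 1],
       [0, 1, 2]]

precision, recall = weighted_precision_recall(toy)
print(round(precision, 4), round(recall, 4))
```

Under this weighting, recall equals overall accuracy, because each class's recall is multiplied by that class's share of the records.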

Comparison of KNN and Naïve Bayes
When the accuracy of KNN and Naïve Bayes is compared, the KNN accuracy depends on the number of nearest neighbors that is set. Therefore, the KNN value used is the highest accuracy obtained, regardless of the number of neighbors tested.

Discussion
The study of [17] explains that k=1 gives an inflexible result because KNN then uses only the single nearest neighbor among the stored records. However, a large number of neighbors blurs the result as well; in that study, k=13 was the optimum among the values k=1 to k=49, with an accuracy of 75.14% [17]. This result is supported by [18], who used k=13 and obtained an accuracy of 97.28%; on the other hand, k=7 yielded an accuracy of only 54% in [15]. In another test over values from 1 to 40, k=3 was obtained with an accuracy of 93% [19]. For Naïve Bayes, an accuracy of 78.69% was found to depend on the amount of training data [6]. [20] similarly concludes that the amounts of test data and training data can affect the accuracy of Naïve Bayes, which was 80% in that case. An accuracy of 90.57% was obtained in the study conducted by [21], which is higher than in the application to heart disease risk prediction [22], where the accuracy was 78%. The comparison of KNN and Naïve Bayes by [7] shows the superiority of Naïve Bayes, with an accuracy of 98.1% against 95.3% for KNN. This is also supported by the study of [3], in which Naïve Bayes was more accurate than KNN (72.5% against 57.5%) in predicting divorce cases in Cimahi, and by the study of [23] on the classification of Indonesian articles, in which Naïve Bayes reached 70% against 40% for KNN. Naïve Bayes also appears to be superior to Support Vector Machine [24] and Neural Network [2], although [5] shows that KNN and Naïve Bayes give comparable results. In contrast, [25], in determining the feasibility of planting teak trees, finds that KNN is superior to Naïve Bayes, with an accuracy of 96.66% against 82.63%. This finding is supported by [26] for text document classification, where KNN reached an accuracy of 55.17%, surpassing Naïve Bayes at 39.01%.
Table 4 shows that, in the k-fold cross-validation test for KNN, 220 records were classified correctly and 34 incorrectly. In the percentage split test shown in Table 5, the number of records tested was no longer 254 but depended on the split: 25 test records for the 90% split, 51 for the 80% split, 76 for the 70% split, and 102 for the 60% split. Of the 25 test records, 20 were classified correctly and 5 incorrectly; of the 51 test records, 42 were correct and 9 incorrect; of the 76 test records, 63 were correct and 13 incorrect; and of the 102 test records, 90 were correct and 12 incorrect. Table 7 shows a total of 222 correctly classified and 32 incorrectly classified records in the k-fold cross-validation test for Naïve Bayes. In the percentage split results in Table 9, the number of test records again depended on the split: 25 for the 90% split, 51 for the 80% split, 76 for the 70% split, and 102 for the 60% split. Of the 25 test records, 22 were classified correctly and 3 incorrectly; of the 51 test records, 46 were correct and 5 incorrect; of the 76 test records, 66 were correct and 10 incorrect; and of the 102 test records, 90 were correct and 12 incorrect.
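The accuracy figures follow directly from the counts given above, as the short calculation below shows. It uses only numbers stated in this section; the helper name `accuracy_percent` is an assumption of the example.

```python
def accuracy_percent(correct, total):
    """Accuracy as a percentage of correctly classified records."""
    return 100.0 * correct / total


# Cross-validation counts stated in the text.
knn_cv = accuracy_percent(220, 254)
nb_cv = accuracy_percent(222, 254)

# Naive Bayes percentage split counts stated in the text: (correct, test size).
nb_split = {90: (22, 25), 80: (46, 51), 70: (66, 76), 60: (90, 102)}
nb_split_acc = {p: accuracy_percent(c, t) for p, (c, t) in nb_split.items()}
nb_split_mean = sum(nb_split_acc.values()) / len(nb_split_acc)

print(round(knn_cv, 2), round(nb_cv, 2))
print({p: round(a, 2) for p, a in nb_split_acc.items()})
print(round(nb_split_mean, 2))
```

The cross-validation values correspond to the reported 86.61% for KNN and 87.40% for Naïve Bayes, and the mean over the four Naïve Bayes splits comes to roughly 88.3%, in line with the reported average.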
Based on the results above, the comparison of KNN and Naïve Bayes in determining the stress level from 254 records shows that:
a. Both KNN and Naïve Bayes can be used to determine the stress level, since both have accuracy values above 70%.
b. Naïve Bayes outperforms KNN in the k-fold cross-validation and percentage split tests, with a Naïve Bayes accuracy of 87.40% in cross-validation and an average of 88.31% over the percentage splits. In the cross-validation test, the accuracy of Naïve Bayes is higher than that of KNN. In the percentage split test, KNN and Naïve Bayes have the same accuracy, 88.23%, with 60% training data; Naïve Bayes outperforms KNN at the 70% and 80% splits; and at the 90% split the two methods again have the same value.

Conclusion
Based on the discussion, it can be concluded that changing the amount of data affects accuracy, precision, and recall in both the k-fold cross-validation and percentage split tests. In the cross-validation test, KNN reaches its highest accuracy of 86.61% at k=3, with a precision of 86.60% and a recall of 87.40%. With the same value of k, the percentage split test produces different results: an accuracy of 88.00%, a precision of 89.60%, and a recall of 88.00%. In the percentage split tests of 80%, 70%, and 60%, k=5 is found to be the optimal value.