Emotion Classification in Indonesian Language: A CNN Approach with Hyperband Tuning

Abstract. Emotion Classification in Indonesian Language: A CNN Approach with Hyperband Tuning. Reliable emotion classification techniques are currently in high demand in several fields. This study proposes the use of a Convolutional Neural Network (CNN) optimized with the Hyperband Tuner (HT) to perform emotion classification in the Indonesian language effectively. Experiments with various feature extraction techniques, including CountVectorizer (CV), TF-IDF, and Keras Tokenizer (KT), were also conducted to explore the best combination of feature extraction and CNN on the available dataset. The proposed methodology was evaluated and compared with K-Nearest Neighbors (KNN), Decision Tree (DT), Naive Bayes (NB), and Boosting SVM. The experimental results show that the method used in this study outperforms existing techniques, as evidenced by the accuracy, precision, recall, and score


Introduction
Emotions are crucial to human intellect, encompassing rational decision-making, interaction with others, attitude, thought, learning, and creativity [1]. Consequently, understanding and accurately detecting emotions have become increasingly essential in fields related to decision-making, such as neuroscience, psychology, and behavioral science. Furthermore, emotion detection is vital for human-robot interaction, enabling robots to provide empathetic responses to human needs, particularly in applications such as chatbots and customer service robots [2], [3].
Once emotions are detected, the subsequent critical step is to assign them to predefined emotion classes, a task known as emotion classification. Despite recent advancements, emotion classification beyond four categories still needs to be improved. Yudha and Riyanarto employed Support Vector Machine (SVM), Naïve Bayes (NB), and K-Nearest Neighbors (KNN) to classify five distinct emotion types [4], revealing NB's superior performance with 63% accuracy. Similarly, Ahmad Zamsuri et al. utilized KNN for classification [5], but the model's accuracy remained below 60% for six emotion types. Moreover, model performance dropped when dealing with imbalanced datasets [6].
In recent years, the landscape of emotion classification has been substantially influenced by the Convolutional Neural Network (CNN), a learning technique known for its ability to automatically learn and extract intricate features from raw data. However, its effectiveness relies on properly tuning its hyperparameters, a critical yet time-consuming process. Inadequate tuning can degrade model performance, which underscores the significance of meticulous hyperparameter tuning [7].
In this study, our objective was to tune the hyperparameters of a CNN model for classifying emotion in the Indonesian language. A Hyperband Tuner (HT) was employed, which provided an efficient approach to tuning the hyperparameters. Additionally, an analysis was conducted to determine the optimal feature extraction methods and CNN configurations for a specific dataset. Various techniques for feature extraction, including CountVectorizer (CV), Term Frequency-Inverse Document Frequency (TF-IDF), and Keras Tokenizer (KT), were evaluated, and the resulting models were compared with state-of-the-art techniques such as NB [4], [8], KNN [5], [9], Decision Tree (DT) [10], and Boosting SVM [11]. Combining the findings from hyperparameter optimization, feature extraction analysis, and comparative evaluation makes it possible to identify the combination of techniques that yields the highest performance in the emotion classification task for the Indonesian language. Furthermore, the potential applications of emotion classification in business and psychology were briefly discussed.

Literature Review

Feature Extraction
Feature extraction is the process of extracting the essential parts of raw data. In Natural Language Processing (NLP), the data obtained through extraction consists of a set of essential words. Several prevalent techniques for feature extraction from a set of words include the following:

CountVectorizer (CV)
The CountVectorizer (CV) technique is an approach for quantifying words. It counts the frequency of each term in a given text, which is why it is also called the raw count method. This technique generates a feature matrix whose elements are mostly zero. To illustrate the CountVectorizer technique, the matrix construction is demonstrated using the sample dataset provided in Figure 1.

Figure 1. Sample Dataset
The initial step of this method is to determine the unique words across all documents. Based on the dataset in Figure 1, the unique words extracted are "benci", "dengan", "frustasi", "ini", "keadaan", "lelah", "saya", and "suasana". Using CV, the number of occurrences of each unique word in every document is tabulated, as depicted in Table 1. The highlighted area represents the feature matrix of CV.
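The counting step can be sketched in plain Python. The three documents below are an assumption: they are reconstructed from the unique-word list above and from the index sequences reported later for the Keras Tokenizer example, since Figure 1 is not reproduced here. The helper name `count_vectorize` is illustrative and not part of any library; scikit-learn's `CountVectorizer` offers a production implementation of the same idea.

```python
from collections import Counter

# Assumed sample documents (reconstructed; Figure 1 is not reproduced here).
docs = [
    "saya benci dengan suasana ini",
    "saya frustasi dengan keadaan ini",
    "saya lelah dengan keadaan ini",
]


def count_vectorize(documents):
    """Return the sorted vocabulary and the raw-count feature matrix."""
    tokenized = [doc.split() for doc in documents]
    # Unique words over all documents, sorted alphabetically.
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)  # missing words count as 0
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix


vocabulary, cv_matrix = count_vectorize(docs)
print(vocabulary)
for name, row in zip(["Doc 1", "Doc 2", "Doc 3"], cv_matrix):
    print(name, row)
```

Each row has one column per unique word, so most entries become zero as the vocabulary grows, which is the sparsity noted above.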

Term Frequency-Inverse Document Frequency (TF-IDF)
TF-IDF is a quantitative measure that indicates the significance of a specific term with respect to a particular document. Unlike CountVectorizer, which only counts the frequency of each word, TF-IDF also accounts for how common the word is across the dataset. TF-IDF is calculated by multiplying the Term Frequency (TF) by the Inverse Document Frequency (IDF). The TF of a unique word in a particular document is calculated using Equation 1.

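$$TF(i, j) = \frac{\text{number of occurrences of word } i \text{ in document } j}{\text{total number of words in document } j} \tag{1}$$

where $TF(i, j)$ denotes the term frequency of a unique word $i$ within a particular document $j$ in a dataset. Next, a specific word's IDF is calculated using Equation 2.

$$IDF(i) = \log\left(\frac{N}{DF(i)}\right) \tag{2}$$

where $IDF(i)$ represents the Inverse Document Frequency of a unique word $i$, calculated as the logarithm of the ratio of the total number of documents in the dataset ($N$) to the number of documents containing $i$ ($DF(i)$). Last, by multiplying Equations 1 and 2, the TF-IDF output for a unique word $i$ within a particular document $j$, denoted by $TFIDF(i, j)$, can be calculated using Equation 3.

$$TFIDF(i, j) = TF(i, j) \times IDF(i) \tag{3}$$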
Table 2 illustrates the TF-IDF technique applied to the dataset from Figure 1. The highlighted columns represent the resulting feature matrix obtained through TF-IDF.
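The TF-IDF computation can be sketched as follows. Several points are assumptions rather than details confirmed by the text: TF is taken as the word count divided by the document length, IDF uses the natural logarithm of the number of documents over the number of documents containing the word, and the three sample documents are reconstructed because Figure 1 is not reproduced here. The function names are illustrative. Library implementations such as scikit-learn's `TfidfVectorizer` apply smoothing and normalization by default, so their numbers will differ from this sketch.

```python
import math

# Assumed sample documents (reconstructed; Figure 1 is not reproduced here).
docs = [
    "saya benci dengan suasana ini",
    "saya frustasi dengan keadaan ini",
    "saya lelah dengan keadaan ini",
]


def tf(word, tokens):
    """Term frequency: occurrences of the word over the document length."""
    return tokens.count(word) / len(tokens)


def idf(word, tokenized_docs):
    """Inverse document frequency with a natural logarithm (assumed base)."""
    containing = sum(1 for tokens in tokenized_docs if word in tokens)
    return math.log(len(tokenized_docs) / containing)


def tfidf_matrix(documents):
    """Return the sorted vocabulary and the TF-IDF feature matrix."""
    tokenized = [doc.split() for doc in documents]
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    matrix = [
        [tf(word, tokens) * idf(word, tokenized) for word in vocabulary]
        for tokens in tokenized
    ]
    return vocabulary, matrix


tfidf_vocabulary, tfidf_values = tfidf_matrix(docs)
for word, value in zip(tfidf_vocabulary, tfidf_values[0]):
    print(f"{word:10s} {value:.4f}")
```

Words that occur in every document, such as "saya", receive a weight of zero under this definition, whereas a word unique to one document, such as "benci", receives the largest weight.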

Keras Tokenizer (KT)
The Keras Tokenizer function divides textual data into tokens that the model can understand. The method works with a Keras model to preprocess textual data and produce tokenized representations that the model can use for further processing [12]. In contrast to CV and TF-IDF, the Keras Tokenizer requires an initial index of words before converting each document in the dataset into a numerical representation.
As an illustrative example, Figure 2 shows the list of word indices corresponding to the sample dataset depicted in Figure 1. The index number "1" is exclusively reserved to represent out-of-vocabulary (OOV) words, that is, words that have never been indexed. To form the feature matrix, each word within a document is replaced by its corresponding index. Therefore, "Doc 1" is turned into [8 2 3 9 5], where "saya" corresponds to 8, "benci" to 2, "dengan" to 3, "suasana" to 9, and "ini" to 5. The same process is applied to the remaining documents of the dataset. Table 3 presents the feature matrix generated by the KT technique applied to the sample dataset depicted in Figure 1.

Doc 1    8  2  3  9  5
Doc 2    8  4  3  6  5
Doc 3    8  7  3  6  5
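The indexing scheme of this example can be reproduced with a small stand-in written in plain Python. This is a sketch under stated assumptions, not the Keras implementation: the sample documents are reconstructed from the index sequences above, the helper names are hypothetical, and indices are assigned alphabetically starting at 2 so that the output matches the example. The actual Keras `Tokenizer` orders its word index by descending word frequency and places the OOV token at index 1 when `oov_token` is set, so its indices for the same text would differ.

```python
OOV_INDEX = 1  # index reserved for out-of-vocabulary words

# Assumed sample documents (reconstructed from the index sequences above).
docs = [
    "saya benci dengan suasana ini",
    "saya frustasi dengan keadaan ini",
    "saya lelah dengan keadaan ini",
]


def build_word_index(documents):
    """Map each unique word to an integer, starting after the OOV index."""
    vocabulary = sorted({word for doc in documents for word in doc.split()})
    return {word: i for i, word in enumerate(vocabulary, start=OOV_INDEX + 1)}


def texts_to_sequences(documents, word_index):
    """Replace every word by its index; unseen words map to OOV_INDEX."""
    return [
        [word_index.get(word, OOV_INDEX) for word in doc.split()]
        for doc in documents
    ]


word_index = build_word_index(docs)
sequences = texts_to_sequences(docs, word_index)
print(sequences)

# A word that was never indexed ("bahagia") falls back to the OOV index.
unseen = texts_to_sequences(["saya bahagia"], word_index)
print(unseen)
```

Unlike the CV and TF-IDF matrices, each row here preserves word order, and its length equals the document length rather than the vocabulary size.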

Convolutional Neural Network (CNN)
CNN is a deep learning technique. It uses the mathematical operation of convolution, as opposed to the conventional approach of general matrix multiplication, in at least one of its layers. The technique has generally been employed for image recognition, ascertaining the optimal classification for a given input [13].
CNN comprises three primary layers, namely the convolutional layer, the pooling layer, and the fully connected layer. Figure 3 depicts the general architecture of the CNN. The convolutional layer performs computations on the input data by filtering it, while the pooling layer selects the maximum value from the filtered matrix. Finally, the fully connected layer calculates the class scores. The result is then categorized according to the class scores, using the softmax activation function for multiclass classification. The softmax activation function is given in Equation 4.

$$\mathrm{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{n} e^{z_j}} \tag{4}$$

where $e^{z_i}$ is the exponential of the $i$-th element of the input vector $z$, and the denominator sums the exponential values over all $n$ classes in the input vector.
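A toy forward pass through the layers described above can be sketched in plain Python. This is only an illustration under assumptions: the input signal, the kernel, and the class scores are made-up values, the fully connected layer is omitted (its output is replaced by the hypothetical `class_scores`), and the helper names are not taken from any library.

```python
import math


def conv1d(sequence, kernel):
    """Slide the kernel over the sequence without padding (stride 1)."""
    k = len(kernel)
    return [
        sum(sequence[i + j] * kernel[j] for j in range(k))
        for i in range(len(sequence) - k + 1)
    ]


def global_max_pool(feature_map):
    """Keep only the strongest response of the filter."""
    return max(feature_map)


def softmax(scores):
    """Softmax, shifted by the maximum score for numerical stability."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


signal = [1.0, 3.0, 2.0, 5.0, 4.0]   # made-up input values
kernel = [1.0, 0.0, -1.0]            # made-up filter weights
feature_map = conv1d(signal, kernel)
pooled = global_max_pool(feature_map)

class_scores = [2.0, 1.0, 0.1]       # hypothetical fully connected outputs
probabilities = softmax(class_scores)
predicted_class = probabilities.index(max(probabilities))
print(feature_map, pooled, probabilities, predicted_class)
```

Subtracting the maximum score before exponentiation leaves the softmax result unchanged while avoiding overflow for large scores.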
Configuring a CNN requires setting variables known as hyperparameters, which fall into two categories. First, network structure hyperparameters include kernel size, kernel type, stride, padding, number of hidden layers, and activation function. Second, network training hyperparameters include learning rate, momentum, number of epochs, and batch size. The values of these hyperparameters influence the performance of a CNN, and tuning errors can decrease a system's overall performance [7].

Hyperband Tuner (HT)
The Hyperband Tuner (HT) is a technique for automated hyperparameter optimization. Hyperparameter tuning involves evaluating various candidate models to identify the best one at the lowest time cost. The present study employs the Hyperband algorithm to optimize the hyperparameter selection process. In theory, this method can adapt to unknown convergence rates as a function of the hyperparameters. Thus, HT can cope with the complexity of searching over combinations of hyperparameters. Moreover, its performance has been reported to exceed that of the frequently employed Bayesian optimization technique [7].
In HT, a predefined budget is allocated to a collection of hyperparameter configurations. Each configuration is trained and evaluated, and the algorithm discards the worst-performing half while retaining the better-performing half. This iterative process continues with progressively more resources allocated to the promising configurations. By adapting its allocation of resources, Hyperband efficiently explores a wide range of hyperparameter combinations, facilitating quicker convergence.

The dataset used in this study was curated by Riccosan et al. [14] and comprises 7,080 sentences annotated with six distinct emotion labels: anger, fear, joy, love, sadness, and neutral. Specifically, the dataset consists of 1,130 anger-related sentences, 911 fear-related sentences, 1,275 joy-related sentences, 760 love-related sentences, 1,003 sadness-related sentences, and 2,001 neutral sentences. The dataset was partitioned into 80% training and 20% testing sets. Moreover, during the training process, 20% of the training data was reserved for validation.
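The elimination procedure described above for HT can be sketched as a single successive-halving bracket. This is a simplified illustration, not the tuner used in the study: the full Hyperband algorithm runs several such brackets with different trade-offs between the number of configurations and the starting budget, the scoring function `toy_validation_accuracy` is a hypothetical stand-in for actually training a model, and the candidate configurations are made up. Setting `factor=2` corresponds to discarding half of the configurations per round, while `factor=3` keeps roughly the top third.

```python
def successive_halving(configs, evaluate, min_epochs=1, factor=3):
    """Keep the top 1/factor configurations per round, growing the budget."""
    survivors = list(configs)
    epochs = min_epochs
    rounds = []  # (number of configurations evaluated, epochs per config)
    while len(survivors) > 1:
        ranked = sorted(
            survivors, key=lambda cfg: evaluate(cfg, epochs), reverse=True
        )
        rounds.append((len(ranked), epochs))
        survivors = ranked[: max(1, len(ranked) // factor)]
        epochs *= factor  # promising configurations get more resources
    return survivors[0], rounds


def toy_validation_accuracy(config, epochs):
    """Hypothetical stand-in for training `config` for `epochs` epochs."""
    return config["quality"] * (1.0 - 0.5 ** epochs)


# Made-up candidates; a real search would sample hyperparameter values.
candidates = [
    {"name": f"config_{i}", "quality": 0.50 + 0.03 * i} for i in range(9)
]
best_config, rounds = successive_halving(candidates, toy_validation_accuracy)
print(best_config["name"], rounds)
```

With nine candidates and a factor of 3, all nine are evaluated briefly, the best three are retrained with three times the budget, and a single configuration survives.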

Data Preprocessing
Data preprocessing is an essential step before inputting the dataset into the model. In the initial preprocessing of this study, special characters and punctuation were removed, since the dataset analysis revealed the presence of non-emotional characters such as "¢¢". Subsequently, double spaces were eliminated and repeated characters were normalized. Certain irrelevant repeated characters were identified within this dataset, such as "Mahalll", which was originally "Mahal". In addition, non-emotion-related details, such as time adverbs ("malam", "pagi"), were removed.
Furthermore, we improved the text by removing stop words using the Natural Language Toolkit (NLTK) and converting slang words based on the dictionary established by Saputri et al. [15]. The dataset was subsequently stemmed and lemmatized using the Sastrawi Library [16].
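The rule-based cleaning steps can be sketched with regular expressions. This is a minimal illustration under assumptions: the function name `preprocess` is hypothetical, the list of time adverbs contains only the two examples mentioned in the text, and runs of three or more identical letters are collapsed to a single letter. Stop-word removal, slang conversion, and stemming are omitted because they depend on NLTK, the slang dictionary of Saputri et al., and the Sastrawi Library.

```python
import re

# Only the two examples given in the text; the full list is not specified.
TIME_ADVERBS = {"malam", "pagi"}


def preprocess(text):
    """Apply the rule-based cleaning steps to one sentence."""
    # Replace punctuation and special characters (e.g. "¢") with spaces.
    text = re.sub(r"[^\w\s]", " ", text)
    # Collapse runs of three or more identical letters: "Mahalll" -> "Mahal".
    text = re.sub(r"([A-Za-z])\1{2,}", r"\1", text)
    # Splitting on whitespace also removes double spaces.
    tokens = [t for t in text.split() if t.lower() not in TIME_ADVERBS]
    return " ".join(tokens)


cleaned = preprocess("Mahalll  banget ¢¢ pagi ini!!!")
print(cleaned)
```

Collapsing only runs of three or more letters leaves legitimate double letters intact, although words whose correct spelling ends in a doubled letter would still need dictionary-based handling.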

Implementation of CNN with Hyperband Tuner
The experiments were conducted on the Google Colaboratory platform. Specifically, a high-RAM instance (32 GB) with an NVIDIA V100 GPU was used. This high-RAM environment and the V100 GPU were crucial for obtaining reliable experimental results.
In implementing HT, validation accuracy was selected as the objective, and the models were trained for a maximum of 50 epochs. The factor was set to 3, and overwriting of previous results was enabled. Additionally, to prevent overfitting, the stop_early callback function was utilized. Subsequently, given the imbalanced dataset, the inverse class frequency (ICF) weighting technique was applied to address the skewed distribution of the classes. ICF weighting assigns higher weights to the minority classes and lower weights to the majority classes. The class weight for each class can be computed using Equation 5.

$$CW(c) = \frac{NA}{C \times NC} \tag{5}$$

where $CW(c)$ represents the class weight for class $c$, $NA$ is the total number of documents in the dataset, $C$ is the total number of classes in the dataset, and $NC$ is the number of documents in the dataset that belong to class $c$. Once the class weights for all classes in the dataset had been obtained, they were normalized using Equation 6.

$$NCW(c) = \frac{CW(c)}{TCW} \tag{6}$$

where $NCW(c)$ represents the normalized class weight for class $c$ and $TCW$ is the total class weight.
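The class weighting can be illustrated with the class counts reported for the dataset. This sketch rests on assumptions: it takes the raw weight of a class to be the total number of documents divided by the product of the number of classes and the class size, which is the common inverse class frequency form, and it uses the counts of the full dataset for simplicity, whereas in practice the weights would normally be computed from the training split. The function name is illustrative.

```python
# Class counts as reported for the full dataset (7,080 sentences).
class_counts = {
    "anger": 1130,
    "fear": 911,
    "joy": 1275,
    "love": 760,
    "sadness": 1003,
    "neutral": 2001,
}


def inverse_class_frequency_weights(counts):
    """Return normalized inverse-class-frequency weights (assumed form)."""
    total_docs = sum(counts.values())   # NA
    num_classes = len(counts)           # C
    raw = {
        label: total_docs / (num_classes * n)   # CW(c)
        for label, n in counts.items()
    }
    total_weight = sum(raw.values())    # TCW
    return {label: w / total_weight for label, w in raw.items()}  # NCW(c)


normalized_weights = inverse_class_frequency_weights(class_counts)
for label, weight in sorted(normalized_weights.items(), key=lambda kv: -kv[1]):
    print(f"{label:8s} {weight:.4f}")
```

The smallest class ("love") receives the largest weight and the largest class ("neutral") the smallest, and the normalized weights sum to one.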
In the combination of HT and CNN with CV (CV+CNN+HT) and the combination of HT and CNN with TF-IDF (TFIDF+CNN+HT) models, the focus was on tuning the following parameters: the unit configuration and activation function of the input dense layer, the target shape of the reshape layer, the filters and kernel size of the CNN layer, the activation function of the CNN layer, the unit configuration and activation function of the second dense layer (after Global Max Pooling), and the dropout rate of the dropout layer.
In the case of the combination with KT (KT+CNN+HT), a different approach was adopted for the input layer. Instead of using a dense layer, an embedding layer was utilized. In this case, the tuned parameter was the embedding size, as opposed to the unit configuration of the dense layer. The remaining tuned parameters for KT+CNN+HT were the same as those used for CV+CNN+HT and TFIDF+CNN+HT.
The Hyperband tuning process was conducted three times for each combination (CV+CNN+HT, TFIDF+CNN+HT, KT+CNN+HT). Subsequently, each model was trained five times using the tuned parameters, and the mean performance was calculated as a measure of stability. Ensuring a stable model was crucial, as a significant performance gap indicated suboptimal parameter selection. The highest performance achieved in each category was then compared, and the corresponding parameters were taken as the best-tuned parameters of each combination. Finally, the best-tuned parameters from the optimal combination were compared with previous research methods to assess their effectiveness and relevance. Figure 4 shows the implementation of CNN+HT with several feature extractions.

$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN} \tag{7}$$

$$Precision = \frac{TP}{TP + FP} \tag{8}$$

$$Recall = \frac{TP}{TP + FN} \tag{9}$$

$$F1\text{-}Score = \frac{2 \times Precision \times Recall}{Precision + Recall} \tag{10}$$

where TP (True Positive) represents the instances accurately classified into a particular class, while TN (True Negative) refers to the instances correctly classified as not belonging to that class. FP (False Positive) indicates the instances that have been incorrectly assigned to a specific class. Conversely, FN (False Negative) refers to the instances of that class that have been incorrectly assigned to another class.
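The evaluation metrics can be computed from a confusion matrix as sketched below. The matrix values are made up for illustration, the helper names are hypothetical, and rows are assumed to hold the true classes while columns hold the predicted classes. Averaging the per-class scores weighted by class size is also an assumption; it is suggested by the fact that the recall reported in this study equals the accuracy, which is a property of weighted-average recall.

```python
def per_class_counts(cm, c):
    """Return (TP, TN, FP, FN) for class c; rows = true, columns = predicted."""
    tp = cm[c][c]
    fn = sum(cm[c]) - tp
    fp = sum(row[c] for row in cm) - tp
    tn = sum(map(sum, cm)) - tp - fn - fp
    return tp, tn, fp, fn


def precision_recall_f1(cm, c):
    """Per-class precision, recall, and F1-score."""
    tp, _tn, fp, fn = per_class_counts(cm, c)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def accuracy(cm):
    """Share of all instances that lie on the diagonal."""
    return sum(cm[i][i] for i in range(len(cm))) / sum(map(sum, cm))


def weighted_average(cm, metric_index):
    """Average a per-class metric (0=precision, 1=recall, 2=F1) by support."""
    total = sum(map(sum, cm))
    return sum(
        sum(cm[c]) * precision_recall_f1(cm, c)[metric_index]
        for c in range(len(cm))
    ) / total


# Made-up 3-class confusion matrix for illustration only.
confusion = [
    [5, 1, 0],
    [2, 6, 1],
    [0, 1, 4],
]
print(accuracy(confusion))
print([weighted_average(confusion, k) for k in range(3)])
```

In the multiclass setting, the overall accuracy is computed once over all classes, whereas precision, recall, and F1-score are computed per class and then averaged.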

Feature Extraction+CNN+HT
The results of applying HT to CNN with distinct feature extraction techniques are presented in Table 4. The findings demonstrate high consistency in the tuning outcomes within each of the three feature extraction methods, with a marginal deviation of approximately 1%. Among the three methods investigated, KT yields superior results compared to the other two techniques. The KT+CNN method consistently achieves performance levels of over 71% across multiple evaluation metrics, including accuracy, precision, recall, and F1-score. This performance surpasses that of the CNN model with the alternative feature extraction methods, which ranges from 67% to 68%. The best-tuned KT+CNN+HT model (KT+CNN+HT 3) identified in this study achieved an accuracy of 71.5655%, precision of 71.5483%, recall of 71.5655%, and F1-score of 71.0041%. In addition, the confusion matrix of KT+CNN+HT 3 is shown in Figure 5.

Discussion and Comparison
Based on our experiment, HT performed well by consistently generating stable CNN models across diverse feature extraction methods, including TF-IDF, CV, and KT. From our experience, KT offers notable advantages over CV and TF-IDF as the feature extraction method in a CNN architecture. KT enables the direct feeding of tokenized sequences, obviating the need for data reshaping. Additionally, KT provides automatic padding and truncation, streamlining the input preprocessing stage.
The outcomes of our experimental analysis establish the superiority of KT relative to the alternative feature extraction methods. KT's performance can be attributed to its effective handling of out-of-vocabulary (OOV) words. It is highly likely that the test set contained words that did not appear in the training set, and KT provides a mechanism to handle this OOV issue by assigning a specific token to represent such words. Another contributing factor to the enhanced performance of KT compared to TF-IDF and CV was its natural ability to provide word-level representation. By providing word-level representation, KT could capture more specific meaning, context, and dependencies between words than TF-IDF and CV.
While KT demonstrated superior performance in our experiments, it also presented a drawback: it demanded a longer training process than the CV and TF-IDF techniques. KT's average duration per epoch was 5.125 seconds, significantly longer than the 3 seconds for TF-IDF and the 2.25 seconds per epoch observed with CV, the quickest of the three. These disparities underscore the heightened computational overhead of employing KT.
Furthermore, a comparative analysis was conducted to assess the performance of our tuned CNN model (KT+CNN+HT) against other state-of-the-art techniques. Although the precision of our model was slightly lower than that of TFIDF combined with Boosting SVM [11], the comprehensive evaluation in Table 6 indicated that our tuned CNN model outperformed previous techniques regarding accuracy, recall, and F1-score.

However, despite surpassing the performance of previous techniques, our model's accuracy of 71.5655% in text-based Indonesian emotion classification underscores the intricate challenges involved. The scarcity of linguistic resources due to the low-resource nature of the Indonesian language hampers our model's ability to capture nuanced language intricacies. Despite efforts to mitigate class imbalance in the dataset, accurate prediction of minority classes remains hindered. The complexity of multiclass classification, encompassing six distinct emotions, further exacerbates the challenge, straining predictive accuracy. Additionally, the dataset covers a wide array of topics, which is generally more challenging than a single-topic dataset. The presence of lengthy sentences within the dataset exacerbates this issue by introducing noise and reducing precision. Nevertheless, this advancement holds significant value by exceeding the outcomes of preceding methods. It forms an essential foundation for future work, displaying promise for real-world applications.

The Potential of Emotion Classification in Psychology
The application of emotion classification holds considerable potential within the realm of psychology. Emotion classification is a valuable tool for various psychological applications, including predicting loneliness through analysis of social media accounts [17]. Furthermore, it supports foundational research in psychology, facilitating investigations into fundamental questions such as the factors that contribute to human thriving, the determinants of happiness, and how individuals express emotions, through mining social media posts [18].
However, although emotion classification models can accurately classify text inputs by leveraging inherent patterns and features within the text, it is imperative to acknowledge that textual inputs may not consistently reflect the actual emotional state of individuals [18]. The deliberate downplaying or exaggeration of emotions, the use of euphemisms, and other linguistic strategies can cause the text to diverge from genuine feelings. Furthermore, most existing emotion classification datasets encompass only six basic emotions, although far more emotional states exist. Therefore, at its current stage of development, emotion classification should be regarded as a reference or tool for psychologists and other professionals when making decisions or assessments related to emotional states, rather than as a source of comprehensive understanding of an individual's emotions [19].

The Potential of Emotion Classification in Business
Emotion classification within a business context presents considerable advantages to companies. Its primary function is to provide meaningful insight into consumer opinion, thereby promoting a thorough understanding of customers' emotional reactions toward a company's products, services, or brand. This knowledge enables companies to make well-informed decisions and to tackle customer issues efficiently. Furthermore, precise classification of customer emotions allows organizations to customize their interactions and promotional strategies, increasing customer engagement and loyalty [20]. However, before implementing emotion classification, it is crucial to consider system integration and scalability, data privacy and ethical practices, and cost.

Conclusions and Recommendations
The findings indicate that employing the CNN methodology with HT yields noteworthy outcomes. The best CNN model was achieved by combining CNN with KT and tuning it with HT (KT+CNN+HT). While the precision of KT+CNN+HT is 2% lower than that of the TFIDF+Boosting SVM approach, it achieves notably better results on the other metrics. The combination results in an accuracy of 71.5655%, precision of 71.5483%, recall of 71.5655%, and F1-score of 71.0041%. Lastly, future research may involve building ensemble models fine-tuned with the Hyperband Tuner (HT) while integrating recent techniques tailored to address imbalanced datasets.
Furthermore, the prospective applications of emotion classification within the domains of business and psychology were briefly examined. Emotion classification holds promise as a guiding reference for psychologists in making decisions about emotional states. Similarly, in business, integrating this capability can improve entrepreneurs' ability to gauge consumer demand for specific products or services, enabling a more versatile approach. The possibilities within these domains warrant dedicated exploration and analysis in future investigations.

Figure 3. General Architecture of CNN

Figure 4. Implementation of CNN+HT with Several Feature Extractions