Optimizing Semantic Clustering of Cultural Heritage Question-Answering Corpora Using Sentence-BERT Embeddings and PCA-Enhanced K-Means
DOI:
https://doi.org/10.24002/konstelasi.v6i1.15153
Keywords:
K-Means, Sentence-BERT MiniLM, PCA, Sentence Clustering, Internal Validation Metrics
Abstract
This study examines semantic text clustering with all-MiniLM-L6-v2 sentence embeddings and K-Means on a Dewi Durga question-answering corpus drawn from Indian, Javanese, and Balinese cultural contexts. The dataset contains 1,620 Context-Question-Answer entries extracted from Chapters 1-22. Text preprocessing comprised structural checking, missing-value inspection, duplicate detection, case folding, removal of non-alphanumeric characters, and whitespace normalization. Each context was then encoded as a 384-dimensional dense embedding vector. In the Auto K mode, the number of clusters was evaluated for K = 2 to 10 using the Silhouette Score, Davies-Bouldin Index, and Calinski-Harabasz Index; in the Manual mode, K was fixed at 5 as a comparative setting for more detailed thematic interpretation. Six embedding transformation scenarios were tested in both modes. Auto_K_S5, which combines normalization with PCA to 50 components, achieved the strongest internal validation performance, with a Silhouette Score of 0.098899, a Davies-Bouldin Index of 2.912914, and a Calinski-Harabasz Index of 186.476974. Manual_K5_S3 produced more granular themes related to ritual, mythology, history, archaeology, and religious narrative.
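
The preprocessing steps named in the abstract could be sketched roughly as below. This is a minimal illustration, not the authors' code: the function names `clean_text` and `preprocess`, the dictionary keys `context`, `question`, and `answer`, the choice to drop incomplete or duplicate entries, and the choice to replace non-alphanumeric characters with a space are all assumptions.

```python
import re


def clean_text(text: str) -> str:
    """Case-fold, strip non-alphanumeric characters, normalize whitespace."""
    text = text.lower()  # case folding
    # Assumption: ASCII letters and digits only; anything else becomes a space.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    # Collapse runs of whitespace and trim both ends.
    return re.sub(r"\s+", " ", text).strip()


def preprocess(entries):
    """Return cleaned, de-duplicated Context-Question-Answer entries.

    `entries` is a list of dicts with (hypothetical) keys
    "context", "question", and "answer".
    """
    required = ("context", "question", "answer")
    seen = set()
    cleaned = []
    for entry in entries:
        # Structural check and missing-value inspection:
        # every required field must be a non-empty string.
        if any(
            not isinstance(entry.get(key), str) or not entry[key].strip()
            for key in required
        ):
            continue
        row = {key: clean_text(entry[key]) for key in required}
        # Duplicate detection on the cleaned triple (an assumption;
        # duplicates could also be detected on the raw text).
        fingerprint = tuple(row[key] for key in required)
        if fingerprint in seen:
            continue
        seen.add(fingerprint)
        cleaned.append(row)
    return cleaned
```

Note that the ASCII-only character class would also strip diacritics and non-Latin script, so a corpus containing transliterated Sanskrit or Javanese terms might need a Unicode-aware pattern instead.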

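The Auto K evaluation with normalization and PCA to 50 components could look roughly like the following scikit-learn sketch. It is an illustration under stated assumptions rather than the study's implementation: the function name `auto_k_sweep`, the use of row-wise L2 normalization, the seed, and `n_init=10` are assumptions, and synthetic 384-dimensional vectors stand in for the sentence embeddings. The abstract does not say how the three indices were combined into a final choice of K, so the sketch only reports them.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)
from sklearn.preprocessing import normalize


def auto_k_sweep(embeddings, k_values=range(2, 11), n_components=50, seed=42):
    """Score K-Means for each K on normalized, PCA-reduced embeddings."""
    # Assumption: "normalization" means scaling each row to unit L2 norm.
    X = normalize(np.asarray(embeddings, dtype=float))
    # Reduce the 384-dimensional vectors to n_components dimensions.
    X = PCA(n_components=n_components, random_state=seed).fit_transform(X)
    results = []
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
        results.append(
            {
                "k": k,
                "silhouette": silhouette_score(X, labels),  # higher is better
                "davies_bouldin": davies_bouldin_score(X, labels),  # lower is better
                "calinski_harabasz": calinski_harabasz_score(X, labels),  # higher is better
            }
        )
    return results


if __name__ == "__main__":
    # Synthetic stand-in for the embeddings. With the sentence-transformers
    # package, real vectors could be obtained with
    # SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2").encode(texts)
    vectors, _ = make_blobs(n_samples=300, n_features=384, centers=3, random_state=0)
    for row in auto_k_sweep(vectors):
        print(row)
```

One simple selection rule is to take the K with the highest Silhouette Score, for example `max(results, key=lambda r: r["silhouette"])`, but that is only one of several reasonable ways to reconcile the three indices.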







