Optimizing Semantic Clustering of Cultural Heritage Question-Answering Corpora Using Sentence-BERT Embeddings and PCA-Enhanced K-Means

Nala Widyadhana; Nur Cahyo Wibowo; Tri Lathif Mardi Suryanto

doi:10.24002/konstelasi.v6i1.15153

Authors

Nala Widyadhana, Universitas Pembangunan Nasional “Veteran” Jawa Timur
Nur Cahyo Wibowo, Universitas Pembangunan Nasional “Veteran” Jawa Timur
Tri Lathif Mardi Suryanto, Universitas Pembangunan Nasional “Veteran” Jawa Timur

DOI:

https://doi.org/10.24002/konstelasi.v6i1.15153

Keywords:

K-Means, Sentence-BERT MiniLM, PCA, Sentence Clustering, Internal Validation Metrics

Abstract

This study examines semantic text clustering with all-MiniLM-L6-v2 sentence embeddings and K-Means on a question-answering corpus about Dewi Durga drawn from Indian, Javanese, and Balinese cultural contexts. The dataset contains 1,620 Context-Question-Answer entries extracted from Chapters 1-22. Text preprocessing consisted of structural checking, missing-value inspection, duplicate detection, case folding, removal of non-alphanumeric characters, and whitespace normalization. Each context was then transformed into a 384-dimensional dense embedding vector. Two modes were compared. In Auto K mode, the number of clusters was evaluated for K from 2 to 10 using the Silhouette Score, Davies-Bouldin Index, and Calinski-Harabasz Index. In Manual K mode, K was fixed at 5 as a comparative setting for more detailed thematic interpretation. Six embedding transformation scenarios were tested in both modes. Auto_K_S5, which combines normalization with PCA to 50 components, achieved the strongest internal validation performance: a Silhouette Score of 0.098899, a Davies-Bouldin Index of 2.912914, and a Calinski-Harabasz Index of 186.476974. Manual_K5_S3 produced more granular themes related to ritual, mythology, history, archaeology, and religious narrative.
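The pipeline summarized in the abstract can be illustrated with a minimal Python sketch using scikit-learn. It is not the authors' implementation. The function names `clean_text` and `auto_k`, the choice of L2 normalization, the `n_init` and seed values, and the use of the Silhouette Score alone to pick the best K are all assumptions made for illustration; the abstract does not say how the three metrics are combined. Random vectors stand in for the 384-dimensional all-MiniLM-L6-v2 embeddings so that the example runs without downloading a model.

```python
import re

import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)
from sklearn.preprocessing import normalize


def clean_text(text: str) -> str:
    """Case folding, non-alphanumeric removal, whitespace normalization."""
    text = text.lower()
    # Replace anything that is not a letter, digit, or whitespace with a space.
    text = re.sub(r"[^\w\s]|_", " ", text)
    # Collapse runs of whitespace and trim the ends.
    return re.sub(r"\s+", " ", text).strip()


def auto_k(embeddings, k_values=range(2, 11), n_components=50, seed=0):
    """Score K-Means for each K on normalized, PCA-reduced embeddings."""
    # Assumption: "normalization" is taken to mean row-wise L2 normalization.
    unit = normalize(embeddings)
    reduced = PCA(n_components=n_components, random_state=seed).fit_transform(unit)
    results = []
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(reduced)
        results.append(
            {
                "k": k,
                "silhouette": silhouette_score(reduced, labels),  # higher is better
                "davies_bouldin": davies_bouldin_score(reduced, labels),  # lower is better
                "calinski_harabasz": calinski_harabasz_score(reduced, labels),  # higher is better
            }
        )
    return results


# Stand-in data: random vectors with the same width as MiniLM embeddings.
rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(200, 384))

scores = auto_k(fake_embeddings)
# Assumption: pick K by the highest Silhouette Score only.
best = max(scores, key=lambda row: row["silhouette"])
print(clean_text("  Dewi Durga:  the   Goddess! "))
print("best K by silhouette:", best["k"])
```

With real data, the stand-in array would be replaced by the embeddings of the cleaned context strings, for example from the sentence-transformers library via `SentenceTransformer("all-MiniLM-L6-v2").encode(texts)`. Setting `k_values=[5]` gives a rough analogue of the Manual K = 5 setting.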

KONSTELASI: Konvergensi Teknologi dan Sistem Informasi