Pembentukan Vector Space Model Bahasa Indonesia Menggunakan Metode Word to Vector

Authors

  • Yulius Denny Prabowo, Institut Teknologi dan Bisnis Kalbis, https://orcid.org/0000-0002-5632-3744
  • Tedi Lesmana Marselino, Institut Teknologi dan Bisnis Kalbis
  • Meylisa Suryawiguna, Institut Teknologi dan Bisnis Kalbis

DOI:

https://doi.org/10.24002/jbi.v10i1.2053

Abstract

Abstract.

Extracting information from a large amount of structured data requires expensive computation. The Vector Space Model works by mapping words into a continuous vector space in which semantically similar words are mapped to nearby points. The model assumes that words appearing in the same context share the same semantic meaning. In practice there are two different approaches: count-based methods (e.g., Latent Semantic Analysis) and predictive methods (e.g., the Neural Probabilistic Language Model). This study aims to apply the Word2Vec method using the Continuous Bag of Words approach to the Indonesian language. The research data were obtained by crawling several online news portals. The expected result of the research is a vector mapping of Indonesian words based on the data used.
Keywords: vector space model, word to vector, Indonesian vector space model.


Abstrak.

Extracting information from a large collection of structured data requires expensive computation. The Vector Space Model method works by mapping words into a continuous vector space in which semantically similar words are mapped to adjacent regions of the space. The Vector Space Model assumes that words appearing in the same context share the same semantic meaning. In practice there are two different approaches: count-based methods (e.g., Latent Semantic Analysis) and predictive methods (e.g., the Neural Probabilistic Language Model). This study aims to apply the Word2Vec method using the Continuous Bag of Words model to the Indonesian language. The research data were obtained by crawling several online news portals. The expected result is a vector mapping of Indonesian words based on the data used.
Keywords: vector space model, word to vector, Indonesian word vectors.
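
The mapping described in the abstract can be sketched with standard Word2Vec tooling. The snippet below is a minimal illustration, not the authors' actual pipeline: it assumes the crawled news articles have already been cleaned and tokenized into lists of lowercase Indonesian words, and it uses the gensim library's Word2Vec class with sg=0 to select the Continuous Bag of Words architecture; the toy corpus and hyperparameters are placeholders.

```python
# Minimal CBOW Word2Vec sketch (illustrative only; the corpus and settings are
# placeholders, not the configuration used in the paper).
from gensim.models import Word2Vec

# Placeholder corpus: each item stands for one tokenized sentence taken from the
# crawled Indonesian news portals.
kalimat = [
    ["presiden", "membuka", "rapat", "kabinet", "di", "istana"],
    ["menteri", "menghadiri", "rapat", "koordinasi", "di", "jakarta"],
    ["harga", "beras", "naik", "menjelang", "lebaran"],
]

model = Word2Vec(
    sentences=kalimat,
    vector_size=100,   # dimensionality of the word vectors
    window=5,          # context window around the target word
    min_count=1,       # keep rare words so the toy corpus is not filtered out
    sg=0,              # 0 = Continuous Bag of Words, 1 = skip-gram
)

# Words that occur in similar contexts should end up close together in the vector space.
print(model.wv.most_similar("rapat", topn=3))
```

With a real corpus of crawled news text, model.wv holds the resulting Indonesian word-to-vector mapping, and nearest-neighbour queries such as the one above can be used to check whether semantically related words are indeed adjacent in the space.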

Author Biography

Yulius Denny Prabowo, Institut Teknologi dan Bisnis Kalbis

Faculty of Creative Industries (Fakultas Industri Kreatif)

Informatics Study Program (Program Studi Informatika)

Institut Teknologi dan Bisnis Kalbis

References

Y. Bengio, R. Ducharme, P. Vincent, and C. Janvin, "A Neural Probabilistic Language Model", Journal of Machine Learning Research, vol. 3, 2003. [Online] http://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf

R. Collobert and J. Weston, "A Unified Architecture for Natural Language Processing: Deep Neural Networks with Multitask Learning", International Conference on Machine Learning (ICML), 2008. [Online] https://ronan.collobert.com/pub/matos/2008_nlp_icml.pdf

M. Faruqui and C. Dyer, "Improving Vector Space Word Representations Using Multilingual Correlation", Carnegie Mellon University, 2014, pp. 236-244.

R. Kiros, Y. Zhu, R. Salakhutdinov, R. Zemel, R. Urtasun, A. Torralba, and S. Fidler, "Skip-Thought Vectors", Advances in Neural Information Processing Systems, 2015. [accessed 7 January 2018]

L. Wolf, Y. Hanani, K. Bar, and N. Dershowitz, "Joint Word2Vec Networks for Bilingual Semantic Representations", International Journal of Computational Linguistics and Applications, vol. 5, 2014. [accessed 7 January 2018]

T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Efficient Estimation of Word Representations in Vector Space", arXiv, 2013. [Online] https://arxiv.org/pdf/1301.3781.pdf [accessed 7 January 2018]

T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean, "Distributed Representations of Words and Phrases and their Compositionality", Neural Information Processing Systems Conference, 2013. [Online] https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf [accessed 7 January 2018]

T. Mikolov, W. Yih, and G. Zweig, "Linguistic Regularities in Continuous Space Word Representations", Association for Computational Linguistics, 2013. [Online] http://www.aclweb.org/anthology/N13-1090 [accessed 7 January 2018]

X. Rong, "Word2Vec Parameter Learning Explained", arXiv, 2016. [Online] https://arxiv.org/pdf/1411.2738v3.pdf [accessed 7 January 2018]

Y. Goldberg and O. Levy, "Word2Vec Explained: Deriving Mikolov et al.'s Negative-Sampling Word-Embedding Method", arXiv, 2014. [Online] https://arxiv.org/pdf/1402.3722v1.pdf [accessed 7 January 2018]

M. Baroni and A. Lenci, "Distributional Memory: A General Framework for Corpus-based Semantics", Computational Linguistics, 2010. [accessed 7 January 2018]

E. Grave, P. Bojanowski, P. Gupta, A. Joulin, and T. Mikolov, "Learning Word Vectors for 157 Languages", International Conference on Language Resources and Evaluation, 2018. [Online] https://arxiv.org/abs/1802.06893 [accessed July 2018]

L. T. Eren and K. Metin, "Vector Space Models in Detection of Semantically Non-compositional Word Combinations in Turkish", Analysis of Images, Social Networks and Texts (AIST) 2018, Lecture Notes in Computer Science, vol. 11179, Springer.

S. Taylor and T. Brychcín, "The representation of some phrases in Arabic word semantic vector spaces", Open Computer Science, 8(1): 182-193.

Published

2019-04-26