Article Info

Corpus Development for Malay Sentiment Analysis Using Semi Supervised Approach

Ezuana Sukawai, Nazlia Omar


Research on sentiment analysis has attracted considerable interest in recent years. However, research on Malay sentiment analysis remains limited, and resources for it are scarce. The aim of this study is to develop a Malay sentiment corpus using a semi-supervised approach. The corpus is built from Twitter data using a combination of lexicon-based and machine learning approaches. A sentiment lexicon is used to build the seed training set from unlabelled data. In addition, sentiment emoticons are used as a point of comparison for the accuracy of the lexicon-based approach. Once the seed training set has been prepared, new training instances are added using the seed set and a machine learning classification method. The machine learning classification process consists of pre-processing, feature extraction and classification. Five types of classifiers are considered for the classification task. Based on the experimental results, the lexicon-based approach combined with the Multinomial Naïve Bayes algorithm performs best for Malay sentiment corpus development.
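The semi-supervised pipeline summarised above can be illustrated with a minimal sketch in pure Python. It is not the authors' implementation. The tiny lexicon, the whitespace tokenizer, the confidence threshold, the round limit and all function names are assumptions made for illustration, and the hand-written Multinomial Naïve Bayes uses plain word counts with Laplace smoothing in place of whatever pre-processing and features the study used.

```python
import math
from collections import Counter

# Hypothetical toy lexicon (illustrative Malay words only); the lexicon
# used in the study is not reproduced here.
LEXICON = {"bagus": 1, "suka": 1, "gembira": 1,
           "teruk": -1, "benci": -1, "sedih": -1}


def tokenize(text):
    # Assumed pre-processing: lowercase and split on whitespace.
    return text.lower().split()


def lexicon_label(text):
    # Sum lexicon polarities; return None when the score is neutral.
    score = sum(LEXICON.get(tok, 0) for tok in tokenize(text))
    if score > 0:
        return "pos"
    if score < 0:
        return "neg"
    return None


class MultinomialNB:
    """Multinomial Naive Bayes over word counts with add-one smoothing."""

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.class_counts = Counter(labels)
        self.word_counts = {c: Counter() for c in self.classes}
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(tokenize(doc))
        self.vocab = set().union(*self.word_counts.values())
        return self

    def predict_proba(self, doc):
        total = sum(self.class_counts.values())
        log_scores = {}
        for c in self.classes:
            denom = sum(self.word_counts[c].values()) + len(self.vocab)
            score = math.log(self.class_counts[c] / total)
            for tok in tokenize(doc):
                if tok in self.vocab:  # words outside the vocabulary are skipped
                    score += math.log((self.word_counts[c][tok] + 1) / denom)
            log_scores[c] = score
        # Normalise log scores into probabilities.
        top = max(log_scores.values())
        exp = {c: math.exp(s - top) for c, s in log_scores.items()}
        norm = sum(exp.values())
        return {c: v / norm for c, v in exp.items()}


def build_corpus(unlabelled, threshold=0.8, max_rounds=5):
    # Step 1: the lexicon labels the seed set; the rest stays in the pool.
    labelled, pool = [], []
    for text in unlabelled:
        label = lexicon_label(text)
        if label is None:
            pool.append(text)
        else:
            labelled.append((text, label))
    # Step 2: self-training; confident predictions join the training set.
    for _ in range(max_rounds):
        if not pool or len({lab for _, lab in labelled}) < 2:
            break
        model = MultinomialNB().fit([t for t, _ in labelled],
                                    [lab for _, lab in labelled])
        confident, remaining = [], []
        for text in pool:
            probs = model.predict_proba(text)
            best = max(probs, key=probs.get)
            if probs[best] >= threshold:
                confident.append((text, best))
            else:
                remaining.append(text)
        if not confident:
            break
        labelled.extend(confident)
        pool = remaining
    return labelled, pool
```

Calling `build_corpus(tweets)` on a list of tweet strings returns the labelled corpus and the tweets that were never labelled with sufficient confidence. A lower `threshold` grows the corpus faster at the cost of noisier labels, so the value would need tuning against held-out annotated data.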


Corpus, Sentiment Analysis, Sentiment Lexicon, Classification, Semi Supervised, Twitter


Knowledge Technology