SAINS MALAYSIANA

Sains Malaysiana 49(2)(2020): 447-459

http://dx.doi.org/10.17576/jsm-2020-4902-24

Ensemble Learning for Multidimensional Poverty Classification

(Pembelajaran Ensembel untuk Pengelasan Kemiskinan Pelbagai Dimensi)

AZURALIZA ABU BAKAR*, RUSNITA HAMDAN & NOR SAMSIAH SANI

Center for Artificial Intelligence Technology, Faculty of Information Science & Technology, 46300 UKM Bangi, Selangor Darul Ehsan, Malaysia

Received: 13 March 2019/Accepted: 10 November 2019

ABSTRACT

The poverty rate in Malaysia is determined through financial or income-based indices and measurements. Periodic measurements are conducted through the Household Expenditure and Income Survey (HEIS) twice every five years and are subsequently used to generate a Poverty Line Income (PLI), from which poverty levels are determined through statistical methods. Such uni-dimensional measurement, however, is unable to portray overall deprivation conditions, especially as experienced by the urban population. In addition, the United Nations Development Programme (UNDP) has introduced a set of multidimensional poverty measurements, but these have yet to be applied in Malaysia. In view of this, the potential of Machine Learning (ML) approaches to produce new poverty measurement methods is of interest, made possible by the existence of rich poverty databases such as the eKasih database maintained by the Malaysian Government. The goal of this study was to determine whether an ensemble learning method (random forest) can classify poverty, and hence produce multidimensional poverty indicators, compared with base learner methods using the eKasih dataset. The CRoss Industry Standard Process for Data Mining (CRISP-DM) methodology was used to ensure that the data mining and ML processes were conducted properly. Besides random forest, we also examined decision tree and general linear methods to benchmark their performance and determine the method with the highest accuracy. Fifteen variables were then ranked using the varImp method to identify the important variables. The analysis showed that Per Capita Income, State, Ethnic, Strata, Religion, Occupation and Education were the most important variables in the classification of poverty, which reached 99% accuracy using the Random Forest algorithm.

Keywords: Machine learning; multidimensional poverty; random forest
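
The benchmarking workflow described in the abstract (an ensemble learner compared against base learners, followed by a variable-importance ranking) can be sketched roughly as follows. This is a minimal illustration only, not the authors' implementation: it uses Python with scikit-learn rather than the tooling used in the study, the data are synthetic stand-ins rather than eKasih records, the feature names and the labelling rule are invented for the example, logistic regression is assumed as a representative linear method, and `feature_importances_` is used as a rough analogue of the varImp ranking.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): NOT the eKasih dataset.
rng = np.random.default_rng(0)
n = 1000
feature_names = ["per_capita_income", "household_size", "strata", "education_years"]
X = np.column_stack([
    rng.uniform(100, 3000, n),  # hypothetical monthly per capita income
    rng.integers(1, 10, n),     # hypothetical household size
    rng.integers(0, 2, n),      # hypothetical strata flag (0 = rural, 1 = urban)
    rng.integers(0, 17, n),     # hypothetical years of schooling
])

# Invented labelling rule (assumption): poor if income is low, or if income
# is moderately low and schooling is short. Not an official poverty definition.
income, schooling = X[:, 0], X[:, 3]
y = ((income < 1000) | ((income < 1500) & (schooling < 6))).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# One ensemble learner and two base learners for comparison.
models = {
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "logistic_regression": make_pipeline(StandardScaler(), LogisticRegression()),
}

# Fit each model and record its accuracy on the held-out test split.
accuracy = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    accuracy[name] = accuracy_score(y_test, model.predict(X_test))

# Rank variables by the forest's impurity-based importance scores.
importances = models["random_forest"].feature_importances_
ranking = sorted(zip(feature_names, importances), key=lambda p: p[1], reverse=True)

for name, score in ranking:
    print(f"{name}: {score:.3f}")
print(accuracy)
```

Because the synthetic labels are driven mainly by income, the ranking should place the income variable first; with real survey data the ordering and the accuracies would depend on the data and on how the models are tuned.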

*Corresponding author; email: azu1328@yahoo.com
