SAINS MALAYSIANA

Sains Malaysiana 46(6)(2017): 1001–1010

http://dx.doi.org/10.17576/jsm-2017-4606-20

New Discrimination Procedure of Location Model for Handling Large Categorical Variables

(Prosedur Diskriminasi Baharu Model Lokasi untuk Mengendalikan Pemboleh Ubah Kategori Besar)

HASHIBAH HAMID*, LONG MEI MEI & SHARIPAH SOAAD SYED YAHAYA

Statistics Department, School of Quantitative Sciences, Universiti Utara Malaysia College of Arts and Sciences, 06010 UUM Sintok, Kedah Darul Aman, Malaysia

Received: 16 October 2015/Accepted: 28 November 2016

ABSTRACT

The location model is a predictive discriminant rule that classifies new observations into one of two predefined groups based on a mixture of continuous and categorical variables. Its ability to classify a new observation correctly depends heavily on the number of multinomial cells created by the categorical variables. This study first conducts a preliminary investigation showing that the location model fitted by maximum likelihood estimation has a high misclassification rate, up to 45% on average, when there are more than six categorical variables, across all 36 data sets tested. The model predicts poorly with a large number of categorical variables even when the sample size is large. To reduce this high misclassification rate, a new strategy is embedded in the discriminant rule by introducing nonlinear principal component analysis (NPCA) into the classical location model (cLM), mainly to handle the large number of categorical variables. The new strategy is investigated on simulated and real data sets, with the misclassification rate estimated by the leave-one-out method. The numerical results demonstrate the feasibility of the proposed model: its misclassification rate is dramatically lower than that of the cLM in all 18 data settings. A practical application to a real data set shows a significant improvement, with results comparable to the best of the methods compared. Overall, the findings show that the proposed model extends the range of applicability of the location model, which was previously limited to six categorical variables for acceptable performance. This study shows that the proposed model, with its new discrimination procedure, can be used as an alternative for mixed-variable classification problems, primarily when the number of categorical variables is large.

Keywords: Large categorical variables; leave-one-out method; location model; nonlinear principal component analysis; misclassification rate
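As a rough illustration of why the number of categorical variables matters, the sketch below assumes that all categorical variables are binary (an assumption; the abstract says only "categorical"), so that b variables define 2^b multinomial cells per group. It also shows a generic leave-one-out estimate of the misclassification rate. All function names are hypothetical, and `nearest_mean` is a toy stand-in classifier, not the location model or the NPCA-based procedure proposed in the paper.

```python
from typing import Callable, List, Sequence


def multinomial_cells(n_binary: int) -> int:
    """Cells formed by n_binary binary variables (assumed binary): 2 ** n_binary."""
    return 2 ** n_binary


def leave_one_out_error(
    X: Sequence[Sequence[float]],
    y: Sequence[int],
    fit_predict: Callable[[List[Sequence[float]], List[int], Sequence[float]], int],
) -> float:
    """Hold out each observation once, train on the rest, count wrong predictions."""
    errors = 0
    for i in range(len(X)):
        train_X = [x for j, x in enumerate(X) if j != i]
        train_y = [label for j, label in enumerate(y) if j != i]
        if fit_predict(train_X, train_y, X[i]) != y[i]:
            errors += 1
    return errors / len(X)


def nearest_mean(train_X, train_y, x):
    """Toy stand-in (NOT the location model): assign x to the closest group mean."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(train_y)):
        rows = [r for r, l in zip(train_X, train_y) if l == label]
        mean = [sum(col) / len(rows) for col in zip(*rows)]
        dist = sum((a - b) ** 2 for a, b in zip(x, mean))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


if __name__ == "__main__":
    # Cell counts grow exponentially with the number of binary variables.
    for b in (2, 6, 10):
        print(b, "binary variables ->", multinomial_cells(b), "cells per group")

    # Made-up, well-separated toy data with one continuous variable.
    X = [[0.0], [0.2], [0.1], [1.0], [1.2], [0.9]]
    y = [0, 0, 0, 1, 1, 1]
    print("leave-one-out error:", leave_one_out_error(X, y, nearest_mean))
```

Under the binary assumption, six variables already give 64 cells per group, and each cell needs enough observations for its parameters to be estimated. That is one way to read the abstract's remark that performance degrades beyond six categorical variables even with large samples.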



*Corresponding author; email: hashibah@uum.edu.my
