SAINS MALAYSIANA

Sains Malaysiana 51(12)(2022): 4153-4160

http://doi.org/10.17576/jsm-2022-5112-22

Performance Analysis and Discrimination Procedure of Two-Group Location Model with Some Continuous and High-Dimensional of Binary Variables

(Analisis Prestasi dan Prosedur Pembezaan Model Lokasi Dua Kumpulan dengan Sebilangan Pemboleh Ubah Selanjar dan Dimensi Tinggi Pemboleh Ubah Binari)

HASHIBAH HAMID¹,*, FRIDAY ZINZENDOFF OKWONU², NOR AISHAH AHAD¹ & HASLIZA ABDUL RAHIM³

¹School of Quantitative Sciences, UUM College of Arts and Sciences, Universiti Utara Malaysia, 06010 UUM Sintok, Kedah, Malaysia

²Department of Mathematics, Delta State University, Abraka, Delta, P.M.B.1, Nigeria

³School of Computer and Communication Engineering, Universiti Malaysia Perlis, 02600 UniMAP Arau, Perlis, Malaysia

Received: 23 April 2022/Accepted: 26 August 2022

Abstract

The primary goal of this research was to evaluate the performance of recently constructed smoothed location models (SLMs) for discrimination, each combined with one of two kinds of multiple correspondence analysis (MCA) to handle the high dimensionality arising from the binary variables. A previous study combining SLM with MCA and with principal component analysis (PCA) showed that the misclassification rate remained very high when the number of binary variables was large. Two new SLMs are therefore constructed in this paper to address this problem. The first model combines SLM with Burt MCA (denoted SLM+Burt), and the second combines SLM with joint correspondence analysis (denoted SLM+JCA). The findings showed that both models performed well for all sample sizes (n) and all numbers of binary variables (b) under investigation, except for the case n=60 and b=25 under the SLM+JCA model. Overall, the SLM+JCA model outperformed the SLM+Burt model. Moreover, the concepts and procedures of two-group discrimination used in this paper can be extended to multi-class classification, since practitioners often deal with many groups and complex sets of variables.

Keywords: Discrimination; large binary variables; misclassification rate; multiple correspondence analysis; smoothed location model
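
As a brief illustration of the Burt MCA mentioned in the abstract: the Burt variant of MCA operates on the Burt matrix, which stacks all pairwise cross-tabulations of the categorical variables. The following is a minimal Python sketch of that construction for binary variables only. The function names and the toy data are hypothetical and are not taken from the paper, and the sketch does not reproduce the authors' SLM+Burt procedure.

```python
import numpy as np

def indicator_matrix(X):
    """Expand an (n, b) array of 0/1 values into an (n, 2b) indicator matrix.

    Columns 2j and 2j+1 flag categories 0 and 1 of binary variable j.
    """
    X = np.asarray(X, dtype=int)
    n, b = X.shape
    Z = np.zeros((n, 2 * b), dtype=int)
    rows = np.arange(n)
    for j in range(b):
        # For each object, set the column matching its category of variable j.
        Z[rows, 2 * j + X[:, j]] = 1
    return Z

def burt_matrix(X):
    """Burt matrix B = Z'Z: all pairwise cross-tabulations of the variables."""
    Z = indicator_matrix(X)
    return Z.T @ Z

# Hypothetical toy data (assumption): n = 6 objects, b = 3 binary variables.
X_toy = np.array([
    [0, 1, 1],
    [1, 1, 0],
    [0, 0, 1],
    [1, 0, 0],
    [1, 1, 1],
    [0, 1, 0],
])
B_toy = burt_matrix(X_toy)
```

Each 2 x 2 diagonal block of `B_toy` holds the category counts of one variable, and each off-diagonal block is the contingency table of a pair of variables. An MCA of this matrix would then supply the reduced set of components on which a location model could be built.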

Abstrak

Matlamat utama penyelidikan ini adalah untuk menilai analisis prestasi model lokasi terlicin (SLMs) yang dibina sebelum ini untuk tujuan pembezaan dengan menggabungkan dua jenis analisis kesepadanan berganda (MCA) bagi menangani masalah dimensi tinggi yang berlaku daripada pemboleh ubah binari. Kajian terdahulu mengenai SLM bersama-sama dengan MCA serta analisis komponen utama (PCA), menunjukkan bahawa kadar salah pengelasan masih sangat tinggi dengan sejumlah besar bilangan pemboleh ubah binari. Oleh itu, dalam kajian ini, dua SLMs baharu dibina untuk menyelesaikan masalah khusus ini. Model pertama terhasil daripada gabungan SLM dengan Burt MCA (ditandakan sebagai SLM+Burt), dan yang kedua adalah dengan analisis kesepadanan bersama (ditandakan sebagai SLM+JCA). Hasil kajian menunjukkan bahawa kedua-dua model menunjukkan prestasi yang baik untuk semua saiz sampel (n) dan semua pemboleh ubah binari (b) di bawah kajian, kecuali untuk kes n=60 dan b=25 bagi model SLM+JCA. Secara keseluruhan, model SLM+JCA menghasilkan prestasi yang lebih baik berbanding model SLM+Burt. Selain itu, konsep dan prosedur pembezaan untuk pengelasan dua kumpulan yang dijalankan dalam kajian ini boleh diperluaskan kepada pengelasan berbilang kumpulan kerana pengamal sering berurusan dengan banyak kumpulan dan kerumitan pemboleh ubah.

Kata kunci: Analisis kesepadanan berganda; diskriminasi; kadar salah pengelasan; model lokasi terlicin; pembezaan; pemboleh ubah binari besar

*Corresponding author; email: hashibah@uum.edu.my
