Sains Malaysiana 50(9)(2021): 2579-2589

http://doi.org/10.17576/jsm-2021-5009-07

 

Enhanced Dimensionality Reduction Methods for Classifying Malaria Vector Dataset using Decision Tree

(Peningkatan Kaedah Pengurangan Kedimensian untuk Mengelaskan Set Data Vektor Malaria menggunakan Pokok Keputusan)

 

MICHEAL OLAOLU AROWOLO*, MARION OLUBUNMI ADEBIYI & AYODELE ARIYO ADEBIYI

 

Department of Computer Science, Landmark University, Omu-Aran, Nigeria

 

Received: 6 October 2020/Accepted: 21 January 2021

 

ABSTRACT

RNA-Seq data are used in biological applications and in decision making for the classification of genes. Much recent work has focused on reducing the dimensionality of RNA-Seq data, and dimensionality reduction approaches have been proposed for extracting the relevant information from a given dataset. In this study, a novel optimized dimensionality reduction algorithm is proposed by combining an optimized genetic algorithm with Principal Component Analysis and Independent Component Analysis (GA-O-PCA and GAO-ICA), which are used to identify an optimum feature subset and latent correlated features, respectively. A decision tree classifier is applied to the reduced Anopheles gambiae mosquito dataset to enhance accuracy and scalability in gene expression analysis. The proposed algorithm extracts relevant features from the high-dimensional input feature space, using feature ranking and earlier experience. The performance of the model is evaluated and validated using classification accuracy and compared with existing approaches in the literature. The experimental results are promising for feature selection and classification in gene expression data analysis and indicate that the approach is a capable addition to existing data mining techniques.

Keywords: Decision tree; independent component analysis; malaria vector; optimized genetic algorithm; principal component analysis
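
As a rough illustration of the kind of pipeline the abstract describes, the following Python sketch wraps a simple genetic algorithm around PCA and a decision tree using scikit-learn. It is not the authors' GA-O-PCA implementation: the synthetic data from `make_classification`, the function names `fitness` and `ga_select`, and every hyperparameter (population size, number of generations, mutation rate, number of components) are assumptions chosen only for demonstration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier


def fitness(mask, X, y, n_components=5, seed=0):
    """Cross-validated accuracy of PCA + decision tree on the masked features."""
    if mask.sum() < n_components:  # too few features to extract the components
        return 0.0
    model = make_pipeline(
        PCA(n_components=n_components, random_state=seed),
        DecisionTreeClassifier(random_state=seed),
    )
    return float(cross_val_score(model, X[:, mask], y, cv=3).mean())


def ga_select(X, y, pop_size=12, generations=8, mutation_rate=0.02, seed=0):
    """Hypothetical GA wrapper: evolve boolean feature masks, return the best one."""
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    population = rng.random((pop_size, n_features)) < 0.5
    best_mask, best_score = None, -1.0
    for _ in range(generations):
        scores = np.array([fitness(ind, X, y) for ind in population])
        if scores.max() > best_score:
            best_score = float(scores.max())
            best_mask = population[scores.argmax()].copy()
        # Tournament selection: each parent is the fitter of two random individuals.
        a, b = rng.integers(pop_size, size=(2, pop_size))
        parents = population[np.where(scores[a] >= scores[b], a, b)]
        # Uniform crossover between each parent and its neighbour in the list.
        partners = np.roll(parents, 1, axis=0)
        take_parent = rng.random(parents.shape) < 0.5
        children = np.where(take_parent, parents, partners)
        # Bit-flip mutation, then elitism: keep the best mask seen so far.
        children ^= rng.random(children.shape) < mutation_rate
        children[0] = best_mask
        population = children
    return best_mask, best_score


# Synthetic stand-in for a high-dimensional expression matrix (an assumption,
# not the Anopheles gambiae dataset used in the study).
X, y = make_classification(
    n_samples=120, n_features=60, n_informative=8, n_redundant=0, random_state=0
)
mask, score = ga_select(X, y)
print(f"selected {int(mask.sum())} of {X.shape[1]} features, CV accuracy {score:.3f}")
```

Replacing `PCA` with scikit-learn's `FastICA` would give an analogous GA plus ICA variant. Because the same cross-validation folds guide the search and produce the reported score, that score is optimistic; an independent held-out set would be needed for a fair accuracy estimate.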

ABSTRAK

Data RNA-Seq digunakan untuk aplikasi biologi dan membuat keputusan untuk pengelasan gen. Banyak kajian kebelakangan ini memfokus untuk mengurangkan dimensi data RNA-Seq. Pendekatan pengurangan dimensi telah diusulkan dalam pengambilan maklumat yang relevan dalam data yang diberikan. Dalam kajian ini, algoritma pengurangan dimensi optimum baharu dicadangkan dengan menggabungkan algoritma genetik yang dioptimumkan dengan Analisis Komponen Utama dan Analisis Komponen Bebas (GA-O-PCA dan GAO-ICA), yang digunakan untuk mengenal pasti ciri subset optimum dan korelasi laten. Pengelas menggunakan Pokok keputusan pada kumpulan data terturun nyamuk anopheles gambiae untuk meningkatkan ketepatan dan kebolehan pengukuran dalam analisis ekspresi gen. Algoritma yang dicadangkan digunakan untuk mengambil ciri yang relevan berdasarkan ruang ciri input dimensi tinggi. Ciri pemeringkatan dan pengalaman sebelumnya digunakan. Prestasi model dinilai dan disahkan menggunakan ketepatan pengelasan untuk membandingkan pendekatan sedia ada dalam kepustakaan. Hasil uji kaji yang dicapai terbukti menjanjikan ciri pemilihan dan pengelasan dalam analisis data ekspresi gen dan menentukan bahawa pendekatan tersebut merupakan pengumpulan yang mampu dilakukan terhadap teknik perlombongan data yang berlaku.

Kata kunci: Algoritma genetik yang dioptimumkan; analisis komponen bebas; analisis komponen utama; Pokok keputusan; vektor malaria

 


*Corresponding author; email: arowolo.olaolu@lmu.edu.ng