SAINS MALAYSIANA

Sains Malaysiana 50(3)(2021): 753-768

http://doi.org/10.17576/jsm-2021-5003-17

Predicting 30-Day Mortality after an Acute Coronary Syndrome (ACS) using Machine Learning Methods for Feature Selection, Classification and Visualisation

NANYONGA AZIIDA¹, SORAYYA MALEK¹*, FIRDAUS AZIZ¹, KHAIRUL SHAFIQ IBRAHIM² & SAZZLI KASIM²

¹Bioinformatics Division, Institute of Biological Sciences, University of Malaya, 50603 Kuala Lumpur, Federal Territory, Malaysia

²Department of Cardiology, Faculty of Medicine, Universiti Teknologi MARA (UiTM), Sungai Buloh Campus, Jalan Hospital, 47000 Sungai Buloh, Selangor Darul Ehsan, Malaysia

Received: 23 December 2019/Accepted: 26 August 2020

ABSTRACT

Hybrid combinations of feature selection, classification and visualisation using machine learning (ML) methods have the potential to improve both the understanding and the prediction of 30-day mortality in patients with cardiovascular disease using population-specific data. Identifying a combination of feature selection method and classifier algorithm that performs well in mortality studies is essential and has not been reported before. Feature selection methods including Boruta, Random Forest (RF), Elastic Net (EN), Recursive Feature Elimination (RFE), Learning Vector Quantization (LVQ), Genetic Algorithm (GA), Cluster Dendrogram (CD), Support Vector Machine (SVM) and Logistic Regression (LR) were combined with RF, SVM, LR and EN classifiers for 30-day mortality prediction. ML models were constructed using data from 302 patients and 54 input variables from the Malaysian National Cardiovascular Disease Database. The best ML model was validated against the Thrombolysis in Myocardial Infarction (TIMI) risk score using an additional dataset of 102 patients. A Self-Organising Feature Map (SOM) was used to visualise mortality-related factors post-ACS. Model performance, measured by the area under the curve (AUC), ranged from 0.48 to 0.80. The best-performing model (AUC = 0.80) was a hybrid combination of the RF variable importance method, sequential backward selection and the RF classifier using five predictors (age, triglyceride, creatinine, troponin and total cholesterol). On the additional dataset, the best ML model outperformed the TIMI score (AUC = 0.75 vs. AUC = 0.60). The findings of this study provide a basis for developing an online, ML-based, population-specific risk scoring calculator.

Keywords: Acute coronary syndrome; feature selection; hybrid model; machine learning; self-organising maps
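As an illustration of two ideas named in the abstract, the sketch below shows how an AUC can be computed from risk scores and how sequential backward selection can use that AUC as its criterion. It is a minimal, standard-library-only sketch and not the authors' implementation: the helper names (`auc`, `backward_select`, `evaluate`) are hypothetical, the six patient records are invented toy values, and a plain sum of the selected features stands in for the RF classifier used in the study.

```python
def auc(labels, scores):
    """Probability that a positive case outranks a negative one (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def backward_select(features, evaluate):
    """Sequential backward selection.

    Starts from all features; at each step drops the feature whose removal
    leaves the highest-scoring subset, and remembers the best subset seen.
    """
    current = list(features)
    best_subset, best_score = list(current), evaluate(current)
    while len(current) > 1:
        candidates = [[f for f in current if f != drop] for drop in current]
        score, current = max(((evaluate(c), c) for c in candidates),
                             key=lambda pair: pair[0])
        if score >= best_score:  # on ties, prefer the smaller subset
            best_subset, best_score = list(current), score
    return best_subset, best_score


# Invented toy data (NOT from the study): 1 = died within 30 days, 0 = survived.
labels = [1, 1, 1, 0, 0, 0]
data = {
    "age":        [9, 7, 4, 5, 2, 1],
    "creatinine": [8, 3, 6, 2, 4, 1],
    "noise":      [1, 9, 2, 8, 3, 7],
}


def evaluate(subset):
    # Assumption: a simple sum of features replaces the RF classifier here.
    risk = [sum(data[f][i] for f in subset) for i in range(len(labels))]
    return auc(labels, risk)


selected, score = backward_select(list(data), evaluate)
print(selected, score)  # ['age', 'creatinine'] 1.0
```

In this toy run the uninformative `noise` column is dropped first because removing it raises the AUC, while removing either remaining feature lowers it. In a realistic setting the criterion would be estimated on held-out data rather than on the data used to fit the model.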

*Corresponding author; email: sorayya@um.edu.my