Application of Bioinformatics in Cancer Genomic Research

Written by: Nur Alyaa Afifah Md Shahri, Muhammad Redha Bin Abdullah Zawawi and Siti Aishah Sulaiman

Date Published: 02 December 2022

INSTITUT BIOLOGI MOLEKUL PERUBATAN UKM


 

Bioinformatics is a subdiscipline of biology and computing that uses computational tools to collect, classify, store, analyse and visualise biological data, mainly DNA and amino acid sequences [1]. The term “bioinformatics” was first used by two Dutch biologists, Paulien Hogeweg and Ben Hesper, in the early 1970s. They used the term to describe the study of information processing that could, in theory, help explain living organisms [2].

Since then, advances in computing, data processing, and genome sequencing technologies have made it possible to sequence far more DNA and protein, culminating in the sequencing of the human genome by the Human Genome Project. The first nearly complete version (92%) was completed in 2003 [3], and the gapless version (100%) was published in 2022 [4]. As a result, the most critical task in bioinformatics today is to analyse and interpret these large datasets in order to understand human health and biology, particularly the mechanisms of disease development [5].

Cancer remains a constant battle worldwide and a significant national health burden. According to the National Strategic Plan for Cancer Control Programme 2021-2025, cancer is one of Malaysia’s top five causes of death, accounting for 12.18% of mortality in 2019 [6]. The Malaysia National Cancer Registry (MNCR) 2012-2016 reported that most breast, colorectal, and cervical cancer patients were diagnosed at a late stage, while relative survival was highest for patients diagnosed at stage I (Table 1) [6], underscoring the importance of early cancer detection.

 

Table 1. Relative survival rate (%) by stage at diagnosis and cancer type for patients diagnosed in 2007-2011 and followed up to 2016 [6].

Cancer type     1-year, Stage I   1-year, Stage IV   5-year, Stage I   5-year, Stage IV
Breast          97.8              66.8               87.5              23.3
Colorectal      87.8              55.1               75.8              17.3
Cervix uteri    94.3              53.0               75.3              23.0
Lung            63.3              29.6               37.1              6.3
Nasopharynx     94.0              66.2               63.7              26.9

 

Researchers currently use a variety of bioinformatics tools in cancer research and early detection. Most follow the same basic protocol: sequencing the DNA or genome, comparing the data from healthy individuals and cancer patients, and then interpreting the findings in terms of cancer development and progression [5]. This area of research is known as cancer genomics.

The most popular application of bioinformatics in cancer genomic research is genomic analysis, which detects specific DNA alterations (genetic variants) that could influence disease development or phenotypes [7]. This process typically relies on high-throughput sequencing technologies such as Next-Generation Sequencing (NGS). There are three standard NGS techniques: 1) whole genome sequencing (WGS), which sequences the whole genome, including both the exons (coding) and the introns (non-coding); 2) whole exome sequencing (WES), which covers only the exons (coding regions); and 3) targeted sequencing, which focuses on selected genes or regions of the genome [7].
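To make the idea of targeted sequencing more concrete, the sketch below checks whether a genomic position falls inside a panel of selected regions. It is only an illustration: the region coordinates, the `TARGET_REGIONS` list, and the function names are hypothetical, and the regions are assumed to be non-overlapping, 0-based, half-open intervals (the convention used by BED files).

```python
from bisect import bisect_right

# Hypothetical target panel: (chromosome, start, end), 0-based and half-open.
# The coordinates are made up for illustration and are not real gene locations.
TARGET_REGIONS = [
    ("chr5", 1000, 2000),
    ("chr12", 500, 900),
    ("chr17", 3000, 3500),
    ("chr17", 4000, 4200),
]


def build_index(regions):
    """Group regions by chromosome and sort them by start position."""
    index = {}
    for chrom, start, end in regions:
        index.setdefault(chrom, []).append((start, end))
    for intervals in index.values():
        intervals.sort()
    return index


def is_covered(index, chrom, pos):
    """Return True if position `pos` lies inside one of the target regions.

    Assumes the regions on each chromosome do not overlap.
    """
    intervals = index.get(chrom, [])
    starts = [start for start, _ in intervals]
    # Find the last region that starts at or before `pos`.
    i = bisect_right(starts, pos) - 1
    return i >= 0 and intervals[i][0] <= pos < intervals[i][1]


index = build_index(TARGET_REGIONS)
print(is_covered(index, "chr17", 3100))  # inside the first chr17 region
print(is_covered(index, "chr17", 3600))  # between the two chr17 regions
```

In this picture, WGS corresponds to having no region filter at all, WES to a panel that lists every exon, and targeted sequencing to a short list such as the one above.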

The bioinformatics analysis of these genome data typically involves data pre-processing, variant calling, downstream biological analysis, and assessment of clinical impact (Figure 1). Following DNA extraction and library preparation, the sequencing process generates raw data (FASTQ files) containing the DNA sequences. The analysis begins with a quality check to remove reads affected by low base quality, adapter contamination, or base composition bias [8]. The sequences are then mapped to the reference genome to find where they differ from it, followed by variant identification and annotation [9]. Identifying the genetic variants responsible for cancer development opens the way to using them as an early screening tool, and many commercial kits are now available for family and individual cancer screening [10]. Advances in computational tools and software have also improved the identification of structural DNA alterations (deletions, duplications, and translocations) [5], further improving the understanding of cancer development.

Figure 1 The general bioinformatics workflow of whole genome sequencing (WGS) analysis.
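As an illustration of the quality-check step in this workflow, the following minimal sketch reads FASTQ records and keeps only reads whose mean base quality reaches a threshold. It assumes four-line FASTQ records with Phred+33 quality encoding. The threshold of 20, the function names, and the two example reads are assumptions made for this example, and dedicated quality-control tools perform many more checks (adapter content, base composition, and so on).

```python
import io


def read_fastq(handle):
    """Yield (name, sequence, quality_string) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().strip()
        if not header:
            break  # end of file (or a blank line)
        seq = handle.readline().strip()
        handle.readline()  # the '+' separator line is ignored
        qual = handle.readline().strip()
        yield header[1:], seq, qual


def mean_phred(qual, offset=33):
    """Mean Phred score of a quality string, assuming Phred+33 encoding."""
    return sum(ord(char) - offset for char in qual) / len(qual)


def filter_reads(handle, min_mean_quality=20):
    """Keep reads whose mean base quality meets the (hypothetical) threshold."""
    return [
        (name, seq)
        for name, seq, qual in read_fastq(handle)
        if qual and mean_phred(qual) >= min_mean_quality
    ]


# Two made-up reads: 'I' encodes Phred 40 (high quality), '#' encodes Phred 2 (low).
EXAMPLE_FASTQ = """@read1
ACGTACGT
+
IIIIIIII
@read2
ACGTACGT
+
########
"""

kept = filter_reads(io.StringIO(EXAMPLE_FASTQ))
print(kept)  # only read1 passes the filter
```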

 

At UMBI, one of the WGS projects is the ‘Cancer Genome: Mapping Molecular Networks for Biomarkers and Anti-Cancer Drug Discovery.’ In this study, DNA was extracted from the tumour tissue and matched control tissue of colorectal cancer (CRC) patients and sequenced by WGS. The genetic variants were then annotated to interpret their potential pathogenicity [11]. The study found that the TP53, APC, and KRAS genes are the most commonly mutated in Malaysian CRC patients, and that the Wnt signalling pathway is the main pathway affected. In the same study, the researchers also identified structural variants (SVs) among Malaysian CRC patients [12]. Fifty-six pathogenic somatic structural variants (41 large deletions and 15 duplications) were identified (Figure 2). Deletions affecting the RBFOX1 gene, found in two patients, were concluded to be associated with poor prognosis. This study was the first to provide a comprehensive view of genomic rearrangements, covering multiple classes of SVs, in Malaysian CRC patients.

Figure 2 Structural Variants Identified in One of the Colorectal Cancer Patients [13].
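The logic of comparing tumour tissue with a matched control can be sketched in a few lines. In the simplified example below, a variant counts as somatic if it appears in a patient's tumour sample but not in the matched control, and genes are then ranked by the number of patients carrying a somatic variant in them. The patient identifiers, gene names (`GENE_A` and so on), and variants are all hypothetical placeholders, not data from the UMBI study, and real somatic variant callers use statistical models rather than a plain set difference.

```python
from collections import Counter


def somatic_variants(tumour, normal):
    """Variants seen in the tumour sample but absent from the matched control."""
    return set(tumour) - set(normal)


def mutated_gene_counts(patients):
    """Count, for each gene, how many patients carry at least one somatic variant in it."""
    counts = Counter()
    for sample in patients.values():
        somatic = somatic_variants(sample["tumour"], sample["normal"])
        genes = {variant[0] for variant in somatic}  # count each gene once per patient
        counts.update(genes)
    return counts


# Each variant is (gene, chromosome, position, reference base, alternate base).
# All values are made up for illustration.
patients = {
    "P1": {
        "tumour": [
            ("GENE_A", "chr1", 100, "C", "T"),
            ("GENE_B", "chr2", 200, "G", "A"),
            ("GENE_C", "chr3", 300, "A", "G"),
        ],
        # GENE_C is also present in the control, so it is treated as inherited.
        "normal": [("GENE_C", "chr3", 300, "A", "G")],
    },
    "P2": {
        "tumour": [
            ("GENE_A", "chr1", 150, "G", "T"),
            ("GENE_A", "chr1", 180, "C", "A"),
        ],
        "normal": [],
    },
}

counts = mutated_gene_counts(patients)
print(counts.most_common())
```

Counting each gene once per patient, rather than once per variant, keeps a single heavily mutated tumour from dominating the ranking.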

 

Besides genomic analysis, other bioinformatics tools are also available for cancer research. One of them is artificial intelligence (AI), a field of research in which computers mimic human intelligence to solve problems [14]. Two important subfields of AI are machine learning and deep learning. Machine learning builds mathematical and statistical approaches into its algorithms, so that the machine analyses the data, learns from that data, and applies what it has learned to make decisions [14]. Well-known services such as Spotify, Netflix, and Amazon use machine-learning algorithms to collect users’ preferences and offer personalised recommendations [15]. Deep learning, in turn, is an evolution of machine learning that uses layers of algorithms and computing units (artificial neural networks) to improve outcomes [14].
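The "analyse, learn, decide" cycle of machine learning can be illustrated with one of the simplest possible models, a nearest-centroid classifier. This is a teaching sketch only: the two numeric features per sample (imagined here as normalised measurements of two marker genes), the training values, and the labels are all invented for the example, and real studies use far larger datasets and more capable models.

```python
def train_centroids(samples, labels):
    """'Learn' from the data by computing the mean feature vector (centroid) of each class."""
    sums, counts = {}, {}
    for features, label in zip(samples, labels):
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict(centroids, features):
    """Decide on the class whose centroid is closest (squared Euclidean distance)."""
    def distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(centroid, features))

    return min(centroids, key=lambda label: distance(centroids[label]))


# Hypothetical training data: two features per sample and a known label.
train_x = [[0.9, 0.8], [1.0, 0.7], [0.1, 0.2], [0.2, 0.1]]
train_y = ["tumour", "tumour", "normal", "normal"]

model = train_centroids(train_x, train_y)
print(predict(model, [0.8, 0.9]))  # closest to the "tumour" centroid
print(predict(model, [0.0, 0.3]))  # closest to the "normal" centroid
```

A deep learning model follows the same train-then-predict pattern, but replaces the single averaging step with many stacked layers whose parameters are adjusted during training.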

AI also improves cancer genomic research in several ways. First, it allows multitask learning that combines the output data with other related data from various online databases and publications [14]. Since cancer is a complex disease, integrating multi-layered data is preferable for a big-picture understanding. In one study, analysis with cognitive computing (Watson for Genomics, WfG) identified additional genomic changes with potential clinical impact in 323 cancer patients [16]. Importantly, these genomic alterations had not been detected by the conventional molecular tumour board review, and the analysis of each case took less than 3 minutes [16], suggesting that AI can improve both the yield and the speed of the analysis. Second, AI improves the variant calling process. For example, Google’s DeepVariant converts the aligned sequencing data (BAM file) into images similar to a genome browser view and then uses a neural network based on the Inception architecture to determine the variants from their likelihoods [17]. Notably, DeepVariant outperforms the conventional GATK in the precision of calling true variants [18, 19].
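The idea of turning aligned reads into an image-like input can be sketched as follows. The example stacks reads into a small matrix, one row per read and one column per reference position, and then computes the fraction of reads that disagree with the reference at one position. This is a deliberately simplified illustration and not DeepVariant's actual encoding, which is richer than a single base code per cell. The reads, the base codes in `BASE_CODE`, and the function names are hypothetical, and reads are assumed to align without insertions or deletions.

```python
# Arbitrary numeric codes for illustration; 0 means "no read coverage".
BASE_CODE = {"A": 1, "C": 2, "G": 3, "T": 4}


def pileup_matrix(reads, window_start, window_len):
    """Build a matrix with one row per read and one column per reference position.

    Each read is (start_position, sequence) and is assumed to be gapless.
    """
    matrix = []
    for start, seq in reads:
        row = [0] * window_len
        for offset, base in enumerate(seq):
            col = start + offset - window_start
            if 0 <= col < window_len:
                row[col] = BASE_CODE.get(base, 0)
        matrix.append(row)
    return matrix


def alt_allele_fraction(matrix, col, ref_base):
    """Fraction of covering reads whose base at column `col` differs from the reference."""
    ref_code = BASE_CODE[ref_base]
    covered = [row[col] for row in matrix if row[col] != 0]
    if not covered:
        return 0.0
    return sum(1 for code in covered if code != ref_code) / len(covered)


# Made-up reads around positions 100-104, where the reference is assumed to read ACGTA.
reads = [(100, "ACGTA"), (101, "CGTAC"), (102, "TTACG"), (98, "GGACT")]
matrix = pileup_matrix(reads, window_start=100, window_len=5)
for row in matrix:
    print(row)

# Position 102 is column 2; the assumed reference base there is G.
print(alt_allele_fraction(matrix, col=2, ref_base="G"))
```

A conventional caller would apply a statistical model to summaries like this fraction, whereas an image-based approach hands the whole matrix to a neural network and lets it learn which patterns indicate a true variant.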

Besides cancer genomic research, bioinformatics is also applicable to other fields. Identifying additional genetic markers by combining various data types allows the design of newer drugs tailored to individuals (personalised medicine) [5]. Moreover, AI applications are expected to further enhance predictive analysis to prevent disease or identify susceptible individuals. In conclusion, bioinformatics is an innovative approach that can provide critical insights into cancer research.

 

References:

 

  1. Adams, D. and National Human Genome Research Institute (NHGRI). Bioinformatics. Genetics Glossary 2022 6 September 2022 [cited 2022 November 25]; Available from: https://www.genome.gov/genetics-glossary/Bioinformatics.
  2. Hogeweg, P., The roots of bioinformatics in theoretical biology. PLoS computational biology 2011, 7(3), e1002021.
  3. International Human Genome Sequencing Consortium, Finishing the euchromatic sequence of the human genome. Nature 2004, 431(7011), 931-45.
  4. Nurk, S., Koren, S., Rhie, A., Rautiainen, M., Bzikadze, A.V., Mikheenko, A., Vollger, M.R., Altemose, N., Uralsky, L., Gershman, A., et al., The complete sequence of a human genome. Science 2022, 376(6588), 44-53.
  5. Canzoneri, R., Lacunza, E., and Abba, M.C., Genomics and bioinformatics as pillars of precision medicine in oncology. Medicina 2019, 79(Spec 6/1), 587-592.
  6. Ministry of Health Malaysia, National strategic plan for cancer control programme 2021-2025. 2021, Ministry of Health Malaysia.
  7. Manzoni, C., Kia, D.A., Vandrovcova, J., Hardy, J., Wood, N.W., Lewis, P.A., and Ferrari, R., Genome, transcriptome and proteome: The rise of omics data and their integration in biomedical sciences. Briefings in bioinformatics 2018, 19(2), 286-302.
  8. Trivedi, U.H., Cézard, T., Bridgett, S., Montazam, A., Nichols, J., Blaxter, M., and Gharbi, K., Quality control of next-generation sequencing data without a reference. Frontiers in genetics 2014, 5.
  9. Berger, M.F. and Mardis, E.R., The emerging clinical relevance of genomics in cancer medicine. Nature reviews. Clinical oncology 2018, 15(6), 353-365.
  10. Schienda, J. and Stopfer, J., Cancer genetic counseling-current practice and future challenges. Cold Spring Harbor perspectives in medicine 2020, 10(6).
  11. Mohd Yunos, R., Ab Mutalib, N., Khoo, J., Saidin, S., Ishak, M., Abu, N., Mohd Yusof, N., Mahamad Nazir, N., Rose, I., Sagap, I., et al., Uncovering the landscape of somatic mutations in Malaysian colorectal cancer patients via whole genome sequencing. 2019. 10.3389/conf.fphar.2018.63.00133.
  12. Mohd Yunos, R., Ab Mutalib, N., Jia-Shiun, K., Saidin, S., Ishak, M., Md Yusof, N., Mahamad Nadzir, N., Md Rose, I., Sagap, I., Mazlan, L., et al., Identification of structural variants in Malaysian colorectal cancer. 2019.
  13. Jamal, R., Ab Mutalib, N., and Mohd Yunos, R. Uncovering the molecular landscape of 50 Malaysian colorectal cancer genomes. UMBI News 2019 30 December 2019 [cited 2022 November 25]; Available from: https://www.ukm.my/umbi/news/uncovering-the-molecular-landscape-of-50-malaysian-colorectal-cancer-genomes/
  14. Shimizu, H. and Nakayama, K.I., Artificial intelligence in oncology. Cancer science 2020, 111(5), 1452-1460.
  15. Coursera. Deep learning vs. Machine learning: Beginner’s guide. 2022 [cited 2022 November 25]; Available from: https://www.coursera.org/articles/ai-vs-deep-learning-vs-machine-learning-beginners-guide.
  16. Patel, N.M., Michelini, V.V., Snell, J.M., Balu, S., Hoyle, A.P., Parker, J.S., Hayward, M.C., Eberhard, D.A., Salazar, A.H., McNeillie, P., et al., Enhancing next-generation sequencing-guided cancer care through cognitive computing. The oncologist 2018, 23(2), 179-185.
  17. Poplin, R., Chang, P.-C., Alexander, D., Schwartz, S., Colthurst, T., Ku, A., Newburger, D., Dijamco, J., Nguyen, N., Afshar, P.T., et al., A universal SNP and small-indel variant caller using deep neural networks. Nature biotechnology 2018, 36(10), 983-987.
  18. Supernat, A., Vidarsson, O.V., Steen, V.M., and Stokowy, T., Comparison of three variant callers for human whole genome sequencing. Scientific reports 2018, 8(1), 17851.
  19. Lin, Y.L., Chang, P.C., Hsu, C., Hung, M.Z., Chien, Y.H., Hwu, W.L., Lai, F., and Lee, N.C., Comparison of GATK and DeepVariant by trio sequencing. Scientific reports 2022, 12(1), 1809.