The Importance of Bioinformatics in Molecular Biology Research (Part I)

Part I: Introduction to Bioinformatics & RNA-Seq

Siti Nur Hasanah Mohd Yusuf, Mira Farzana Mohamad Mokhtar & Saiful Effendi Syafruddin

UKM Medical Molecular Biology Institute, Universiti Kebangsaan Malaysia

Date: 5 April 2021

The field of bioinformatics has evolved rapidly and is indispensable in the current era of big data and the Fourth Industrial Revolution (IR 4.0). It is an interdisciplinary field that combines computer science, biology and mathematics. With the growing number of large-scale genomics, transcriptomics and proteomics studies, bioinformatics and related computational biology tools are essential to manage, analyze and interpret the enormous volumes of data these studies generate (1). Bioinformatics also involves the development of algorithms and software that facilitate data analysis and make it more robust. For algorithm and software development, it is vital to use a suitable operating system (OS) and to have a good working knowledge of at least one programming language.

An OS is a collection of programs that makes the computer work by managing hardware and software resources and providing common services for computer programs (2). Operating systems available today include Microsoft Windows, Linux, macOS, iOS, Berkeley Software Distribution (BSD), Android, BlackBerry OS and Chrome OS. In bioinformatics, Linux is the most commonly used OS because it is free and open source, so its resources and programs are directly accessible to the public. As a result, most of the available bioinformatics programs and packages are developed on Linux, and they are generally easier to install and run on it than on other systems. Linux is also well suited to data analysis because it is a multi-user, multi-tasking system that allows several users to work and run programs simultaneously on a single platform.

(Image: Python and Linux logos. Source: Google Images)

Biological data analysis requires running bioinformatics pipelines, which may combine tools written in various programming languages. One of the most popular programming languages for bioinformatics today is Python, owing to its diverse applications in the biosciences, including sequence-based bioinformatics, molecular evolution, phylogenomics, systems biology and structural biology (3). It is widely used for scripting, either to run existing software in a data analysis or to develop new software that implements a set of algorithms. To help scientists analyze biological data, the open-source package Biopython was developed; it provides libraries and toolkits that support more advanced programming projects.
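As a small illustration of the kind of scripting described above, the following sketch performs three basic sequence operations in plain Python, without relying on any external package. The function names and the example sequence are made up for illustration only.

```python
# Minimal sketch of sequence scripting in plain Python.
# The sequence and function names are illustrative, not taken from any library.

COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def gc_content(seq: str) -> float:
    """Return the fraction of G and C bases in a DNA sequence."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def transcribe(seq: str) -> str:
    """Return the mRNA equivalent of a coding-strand DNA sequence (T -> U)."""
    return seq.upper().replace("T", "U")


example = "ATGCGTAC"  # made-up sequence
print(gc_content(example))          # 0.5
print(reverse_complement(example))  # GTACGCAT
print(transcribe(example))          # AUGCGUAC
```

In practice, a dedicated library would handle ambiguous bases, file parsing and many other details that this sketch ignores.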

RNA-Sequencing (RNA-Seq)

RNA sequencing (RNA-Seq) is a technique that uses high-throughput sequencing (HTS) to profile and characterize the transcriptome of biological samples. RNA-Seq was developed more than a decade ago, following the emergence of next-generation sequencing technologies. Compared with the conventional Sanger sequencing method, HTS enables hundreds to millions of DNA/RNA molecules from multiple samples to be sequenced simultaneously. This technology has significantly reduced the time and resources needed for large-scale sequencing projects. To put this into perspective, the sequencing of the first human genome took about 13 years to complete, involved large-scale international scientific collaboration and cost about USD 3 billion. Nowadays, with HTS technology, the whole genomes of multiple samples can be sequenced in about a day at a cost of less than USD 1000 per genome.

The central dogma of biology has now expanded beyond the traditional “DNA-RNA-Protein” notion due to the discovery of numerous non-coding RNA species, including transfer RNAs, ribosomal RNAs, small RNAs such as microRNAs (miRNAs), and long non-coding RNAs (lncRNAs). Beyond measuring gene expression, RNA-Seq also enables the discovery of novel transcripts and the detection of allele-specific expression (4). Nowadays, RNA-Seq is widely used in molecular biology research and disease studies, particularly in cancer research and therapy, including biomarker discovery, drug resistance and immunotherapy (5). The technique requires bioinformatics expertise to analyse the sequencing data using Linux and programming languages such as Python, Perl and Java. A common RNA-Seq analysis workflow involves generating sequencing reads from the transcriptome, aligning the reads to a reference genome, assembling transcripts and then performing differential gene expression analysis (Figure 1), with each step run from the command line. The expression level of each gene is estimated by counting the number of reads that align to its transcripts.
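The read-counting step can be sketched in a few lines of Python. This is a simplified illustration, not the method of any particular tool: the gene coordinates and read positions below are invented, each read is represented only by its chromosome and leftmost mapped position, and a read is assigned to the first gene whose interval contains that position.

```python
# Simplified sketch of read counting. All data below are hypothetical.

# Gene annotation: name -> (chromosome, start, end), 1-based inclusive.
GENES = {
    "geneA": ("chr1", 100, 500),
    "geneB": ("chr1", 800, 1200),
    "geneC": ("chr2", 50, 300),
}

# Aligned reads: (chromosome, leftmost mapped position).
READS = [
    ("chr1", 120), ("chr1", 450), ("chr1", 900),
    ("chr2", 60), ("chr2", 299), ("chr2", 700), ("chr1", 130),
]


def count_reads(genes, reads):
    """Count how many reads start inside each gene interval."""
    counts = {name: 0 for name in genes}
    for chrom, pos in reads:
        for name, (g_chrom, start, end) in genes.items():
            if chrom == g_chrom and start <= pos <= end:
                counts[name] += 1
                break  # assign each read to at most one gene
    return counts


print(count_reads(GENES, READS))
# {'geneA': 3, 'geneB': 1, 'geneC': 2}
```

The read at chr2:700 falls outside every annotated gene and is therefore not counted. Real counting tools additionally deal with read length, strand, spliced alignments and reads that overlap several genes.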

Figure 1: RNA-Seq analysis workflow (6)

The read counts for each transcript are then analysed using tools such as edgeR, DESeq2 and limma (6) to identify genes that are differentially expressed between two conditions (for example, normal and cancer cells). Differential expression analysis helps to identify target genes by applying statistical tests that separate genuine changes in expression level from the variability between samples.
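To give a feel for the arithmetic behind a comparison between two conditions, the sketch below normalises raw counts to counts per million (CPM) and computes a log2 fold change for each gene. It is only an illustration under strong assumptions: the counts are invented, there is one sample per condition, and no statistical test is performed. It does not reproduce how edgeR, DESeq2 or limma work, since those packages fit statistical models across replicate samples.

```python
import math

# Hypothetical raw read counts for one sample per condition.
NORMAL = {"geneA": 500, "geneB": 300, "geneC": 200}
CANCER = {"geneA": 1500, "geneB": 300, "geneC": 200}


def cpm(counts):
    """Scale raw counts to counts per million to correct for library size."""
    total = sum(counts.values())
    return {gene: c * 1_000_000 / total for gene, c in counts.items()}


def log2_fold_change(control, treated, pseudocount=1.0):
    """Return log2(treated / control) per gene on CPM values.

    The pseudocount avoids division by zero for unexpressed genes.
    """
    c_cpm = cpm(control)
    t_cpm = cpm(treated)
    return {
        gene: math.log2((t_cpm[gene] + pseudocount) / (c_cpm[gene] + pseudocount))
        for gene in control
    }


for gene, lfc in log2_fold_change(NORMAL, CANCER).items():
    print(f"{gene}: log2FC = {lfc:.2f}")
```

A positive value indicates higher relative expression in the second condition and a negative value indicates lower relative expression. Note that geneB and geneC have identical raw counts in both samples but come out negative after normalisation, because the second library is twice as large.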

In conclusion, applying bioinformatics tools to biological data provides useful information that helps answer questions in molecular research. However, this requires computationally fluent scientists who keep up with advances in computational biology methods by being comfortable with Linux, knowing a programming language such as Python and understanding core computer science principles.

References:

  1. Zhang S-Y, Liu S-L. Bioinformatics. In: Maloy S, Hughes K, editors. Brenner’s Encyclopedia of Genetics (Second Edition). San Diego: Academic Press; 2013. p. 338–40.
  2. Liu S, Liu Z. Introduction to Linux and Command Line Tools for Bioinformatics. Bioinformatics in Aquaculture. 2017. p. 1–29. (Wiley Online Books). 
  3. Ekmekci B, McAnany CE, Mura C. An Introduction to Programming for Bioscientists: A Python-Based Primer. PLoS Comput Biol. 2016 Jun;12(6):e1004867. 
  4. Kukurba KR, Montgomery SB. RNA Sequencing and Analysis. Cold Spring Harb Protoc. 2015;2015(11):951-969. doi:10.1101/pdb.top084970
  5. Hong M, Tao S, Zhang L, et al. RNA sequencing: new technologies and applications in cancer research. J Hematol Oncol. 2020;13(1):166. doi:10.1186/s13045-020-01005-x
  6. Ji F, Sadreyev RI. RNA-seq: Basic Bioinformatics Analysis. Curr Protoc Mol Biol. 2018;124(1):e68. doi:10.1002/cpmb.68