Drug Discovery Via Chemical Fingerprint and Similarity Analysis

Written by: Muhammad Redha Abdullah Zawawi

Published date: 1 Jun 2022

Machine learning has emerged as a hot topic in computer-assisted drug discovery and chemical genomics. Chemical fingerprints of small molecules have developed into a valuable tool for understanding protein complexes and cellular networks at a larger scale. To date, thousands of drug-like compound structures have been made freely available in various chemical and biological databases, such as PubChem, KNApSAcK, ZINC, ChEMBL and a few others [1]. Virtual screening, which is widely used in new drug development, relies on the similar property principle (SPP): two or more compounds with high structural similarity are assumed to share similar biological and physiological activities. Compounds from a large database are therefore ranked in descending order of their structural similarity to a reference compound with known biological activity, so that the most similar candidates appear first [1,2].

The workflow of drug discovery can be divided into three main components: (i) molecular fingerprinting, (ii) detection of chemical similarity subnetworks, and (iii) classification using Ward’s clustering method (Fig 1). Molecular fingerprinting is a fundamental technique for computing the structural similarity of compounds from their 2D structures: the presence or absence of each molecular feature (i.e., substructural fragment) is encoded as one bit of a binary vector (present = 1, absent = 0). The similarity score is most commonly computed as the Tanimoto coefficient (Tc), which measures the fraction of bits shared by two vectors. The Tc score between two compounds (e.g., “X” as a natural product and “Y” as a commercialised drug) is calculated using equation (1), where X and Y denote the number of features present in each compound and XY denotes the number of features common to both X and Y (Fig 1A) [3,4]:

Tc(X, Y) = XY / (X + Y - XY)    (1)
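As a minimal sketch of equation (1), the Python snippet below computes Tc for two binary fingerprints. The 8-bit vectors and the names `natural_product` and `drug` are hypothetical and chosen only for illustration; real fingerprints are usually much longer and are generated by a cheminformatics toolkit. Returning 0.0 when neither fingerprint has any bit set is an assumed convention, not something stated in the text.

```python
def tanimoto(x, y):
    """Tanimoto coefficient between two equal-length binary fingerprints."""
    if len(x) != len(y):
        raise ValueError("fingerprints must have the same length")
    on_x = sum(x)                               # features present in X
    on_y = sum(y)                               # features present in Y
    common = sum(a & b for a, b in zip(x, y))   # features present in both (XY)
    union = on_x + on_y - common                # denominator of equation (1)
    # Assumed convention: two empty fingerprints score 0.0.
    return common / union if union else 0.0


# Hypothetical 8-bit fingerprints (1 = feature present, 0 = absent).
natural_product = [1, 1, 0, 1, 0, 0, 1, 0]
drug = [1, 1, 0, 0, 0, 1, 1, 0]

score = tanimoto(natural_product, drug)
print(f"Tc = {score:.2f}")  # 3 shared bits / (4 + 4 - 3) -> Tc = 0.60
```

Here each compound has four features, three of which are shared, so Tc = 3 / (4 + 4 - 3) = 0.6.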

When the Tc value exceeds a chosen threshold, X and Y are considered structurally similar, but the score alone gives no definitive assessment of their biological similarity. For example, it does not reveal which chemical groups the two compounds have in common. To overcome this shortcoming, a network approach called the chemical similarity network has been proposed to predict drug candidates: compounds are represented as nodes, and two structurally similar compounds are linked by an edge with the Tc score as its supporting evidence [5,6]. At this stage, the chemical similarity network can be clustered into subnetworks to determine the chemical diversity of the input data and to identify specific groups of drug-like natural compounds with high structural similarity to commercially available drugs (Fig 1B). A subnetwork, also called a cluster or module, is defined as a highly interconnected region of a network. Detecting such subnetworks from chemical similarity data is crucial because it allows the functional annotation of an undocumented compound to be inferred from the annotated compounds in the same subnetwork [3-6].
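The network step can be sketched as follows. This is an illustration under stated assumptions, not the method used in the cited studies: the fingerprints are hypothetical sets of "on" bit indices, the threshold of 0.5 is arbitrary, and subnetworks are approximated by plain connected components rather than by a dedicated module-detection algorithm.

```python
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto coefficient for fingerprints stored as sets of 'on' bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def similarity_network(fingerprints, threshold):
    """Return an edge (name1, name2, Tc) for every pair with Tc >= threshold."""
    edges = []
    for (n1, f1), (n2, f2) in combinations(fingerprints.items(), 2):
        tc = tanimoto(f1, f2)
        if tc >= threshold:
            edges.append((n1, n2, tc))
    return edges


def subnetworks(nodes, edges):
    """Group nodes into connected components (a simple stand-in for modules)."""
    neighbours = {n: set() for n in nodes}
    for n1, n2, _ in edges:
        neighbours[n1].add(n2)
        neighbours[n2].add(n1)
    seen, components = set(), []
    for start in nodes:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:                      # depth-first traversal from `start`
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbours[node] - component)
        seen |= component
        components.append(component)
    return components


# Hypothetical compounds: two commercial drugs and three natural products (np).
fingerprints = {
    "drug_A": {1, 2, 3, 4},
    "np_1": {1, 2, 3, 5},
    "np_2": {2, 3, 4, 5},
    "drug_B": {10, 11, 12},
    "np_3": {10, 11, 13},
}

edges = similarity_network(fingerprints, threshold=0.5)  # illustrative cut-off
modules = subnetworks(fingerprints, edges)
for module in modules:
    print(sorted(module))
```

In this toy data, np_1 and np_2 fall into the same subnetwork as drug_A, and np_3 groups with drug_B, which is the kind of grouping used to transfer annotations from known drugs to undocumented compounds.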

Finally, diseases are clustered based on the similarity of their drug-like compounds. Clustering is an unsupervised learning method that groups a set of objects into clusters according to a similarity or distance measure (Fig 1C). This approach has been adopted in many fields, including machine learning, image analysis and bioinformatics. Hierarchical clustering with Ward’s method can be applied to disease-cluster matrices to classify diseases by their compound content, linking disease and compound information [3,4]. Ultimately, such classification not only reveals the tree-like (dendrogram) relationships among drug-like compounds as potential drugs for specific diseases, but it can also be used in bioprospecting to predict medicinal properties.
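To make the clustering step concrete, the sketch below implements Ward's criterion directly in plain Python: at each step it merges the two clusters whose union gives the smallest increase in within-cluster sum of squares. The disease names and the disease-cluster matrix are hypothetical (each row is assumed to count how many compounds linked to a disease fall in each subnetwork), and this naive version is meant for reading, not for large data sets. In practice a library routine such as `scipy.cluster.hierarchy.linkage` with `method="ward"` would normally be used instead.

```python
def ward_clustering(points, n_clusters):
    """Agglomerative clustering with Ward's criterion (naive sketch)."""
    clusters = [[i] for i in range(len(points))]   # start with singletons
    dim = len(points[0])

    def centroid(members):
        return [sum(points[i][d] for i in members) / len(members)
                for d in range(dim)]

    def merge_cost(a, b):
        # Increase in within-cluster sum of squares if a and b are merged.
        ca, cb = centroid(a), centroid(b)
        sq_dist = sum((u - v) ** 2 for u, v in zip(ca, cb))
        return len(a) * len(b) / (len(a) + len(b)) * sq_dist

    while len(clusters) > n_clusters:
        pairs = ((i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters)))
        i, j = min(pairs,
                   key=lambda p: merge_cost(clusters[p[0]], clusters[p[1]]))
        clusters[i] = clusters[i] + clusters[j]    # merge j into i
        del clusters[j]                            # j > i, so index i is unaffected
    return clusters


# Hypothetical disease-cluster matrix: rows are diseases, columns are
# subnetworks, values are assumed counts of associated compounds.
diseases = ["disease_A", "disease_B", "disease_C", "disease_D"]
matrix = [
    [5, 1, 0],
    [4, 2, 0],
    [0, 1, 6],
    [0, 0, 5],
]

groups = ward_clustering(matrix, n_clusters=2)
for group in groups:
    print([diseases[i] for i in group])
```

With this toy matrix, disease_A and disease_B end up in one group and disease_C and disease_D in the other, because their compound profiles across subnetworks are most alike.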

Figure 1. (A) Chemical similarity search using 2D chemical fingerprints in drug discovery. (B) A chemical similarity network clusters natural products and commercialised drugs into subnetworks according to their shared chemical features. (C) Ward hierarchical clustering based on disease-cluster matrices is applied to classify undocumented natural products as potential drug candidates based on the similarity of their chemical structures to commercialised drugs for a given disease.


  1. Lo, Y. C., & Torres, J. Z. (2016). Chemical similarity networks for drug discovery. Special Topics in Drug Discovery, 53.
  2. Lo, Y. C., Rensi, S. E., Torng, W., & Altman, R. B. (2018). Machine learning in chemoinformatics and drug discovery. Drug Discovery Today, 23(8), 1538-1546.
  3. Abdullah, A. A., Altaf-Ul-Amin, M., Ono, N., Sato, T., Sugiura, T., Morita, A. H., … & Kanaya, S. (2015). Development and mining of a volatile organic compound database. BioMed Research International, 2015.
  4. Liu, K., Abdullah, A. A., Huang, M., Nishioka, T., Altaf-Ul-Amin, M., & Kanaya, S. (2017). Novel approach to classify plants based on metabolite-content similarity. BioMed Research International, 2017.
  5. Seo, M., Shin, H. K., Myung, Y., Hwang, S., & No, K. T. (2020). Development of Natural Compound Molecular Fingerprint (NC-MFP) with the Dictionary of Natural Products (DNP) for natural product-based drug development. Journal of Cheminformatics, 12(1), 1-17.
  6. Safizadeh, H., Simpkins, S. W., Nelson, J., Li, S. C., Piotrowski, J. S., Yoshimura, M., … & Myers, C. L. (2021). Improving Measures of Chemical Structural Similarity Using Machine Learning on Chemical–Genetic Interactions. Journal of Chemical Information and Modeling, 61(9), 4156-4172.