Sains Malaysiana 33(2): 157-172 (2004)

Pengajian Kuantitatif / Quantitative Studies


On the Significance of Topological-indices Based Non-binary Molecular Similarity Measures


Naomie Salim

Faculty of Computer Science and Information Technology

Universiti Teknologi Malaysia, Skudai, Johor


John Holliday & Peter Willett

Krebs Institute for Biomolecular Research and Department of Information Studies

University of Sheffield

Western Bank, Sheffield S10 2TN, UK






Kertas kerja ini membincangkan mengenai kajian untuk melihat sejauh mana nilai keserupaan bukan binari yang dihasilkan melalui perbandingan indeks topologi sebatian mampu mewakili perbezaan atau keserupaan ciri fizikal dan kimia sebatian yang dibandingkan. Di dalam kajian ini, nilai log P yang diperolehi daripada ujikaji makmal telah dibandingkan dengan nilai log P jangkaan yang diambil daripada purata log P sebatian yang mempunyai pelbagai julat nilai keserupaan tertinggi berdasarkan perbandingan indeks topologi kesemua sebatian di dalam pangkalan data dengan sebatian berkenaan. Analisa menunjukkan yang pengiraan keserupaan bukan binari menggunakan angkali Cosine, Simpson dan Pearson boleh memberikan nilai keserupaan yang mengelirukan apabila sesetengah jenis sebatian dibandingkan. Nilai keserupaan yang melibatkan 1% sebatian paling serupa berdasarkan angkali Tanimoto atau Euclidean didapati mampu menggambarkan keserupaan ciri fizikal dan kimia sebatian yang dibandingkan. Justeru, carian atau pemilihan berfokus bagi mendapatkan 1% sebatian paling serupa dengan sesuatu sebatian menggunakan angkali Tanimoto dan Euclidean ke atas perwakilan bukan binari sebatian dijangka berkecenderungan memberikan hasil yang lebih memuaskan berbanding dengan pemilihan rambang. Nilai keserupaan yang melibatkan 5% sebatian paling berbeza berdasarkan angkali Tanimoto juga didapati mampu menggambarkan perbezaan ciri fizikal dan kimia molekul yang dibandingkan. Ini menunjukkan yang pemilihan rasional berdasarkan angkali Tanimoto bagi memilih subset yang terdiri daripada 5% molekul paling rencam dari sebuah pangkalan data molekul yang mempunyai perwakilan bukan binari berkecenderungan untuk memberikan hasil yang lebih baik daripada pemilihan secara rambang. Walau bagaimanapun, di dalam kedua-dua pemilihan berfokus atau rencam menggunakan angkali yang dinyatakan, semakin banyak sebatian yang dipilih, hasil yang didapati semakin menyerupai pemilihan secara rawak dari segi keserupaan atau kerencaman ciri fizikal dan kimia.





This paper describes experiments that examine how well the full range of non-binary similarity values based on topological indices represents the physicochemical similarity between compounds. Measured log P values were compared with log P values predicted from compounds at different ranges of similarity, where similarity was calculated from various topological indices of the compounds. The analysis shows that the non-binary Cosine, Simpson and Pearson coefficients may give misleading results when certain compounds are compared. Similarity values for the 1% most similar compounds, based on the non-binary Tanimoto or Euclidean coefficients, were found to represent the physicochemical similarities between the molecules compared. Therefore, for searches requiring around the 1% most similar compounds, rational selection methods based on the non-binary Tanimoto or Euclidean coefficients are likely to produce better results than random selection. Similarity values for the 5% most dissimilar compounds, based on the non-binary Tanimoto coefficient, were also found to represent the physicochemical dissimilarities between the molecules compared. Therefore, for diverse selection requiring fewer than the 5% most dissimilar compounds, rational selection methods based on the non-binary Tanimoto coefficient are likely to produce better results than random selection. However, in both focused and diverse selection using these coefficients, as more compounds are selected, the selection increasingly resembles random selection in terms of physicochemical similarity and dissimilarity.
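The procedure summarised above (rank database compounds by a non-binary coefficient, then average the log P of the top-ranked fraction to obtain a predicted value) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the `fraction` parameter, the requirement that descriptor vectors be non-zero, and the three-compound toy database are all hypothetical, and the descriptor values are invented purely to make the example run.

```python
import math


def tanimoto(x, y):
    """Non-binary (continuous) Tanimoto coefficient of two descriptor vectors.

    Assumes neither vector is all zeros.
    """
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (sum(a * a for a in x) + sum(b * b for b in y) - dot)


def cosine(x, y):
    """Non-binary Cosine coefficient; insensitive to vector magnitude."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))


def euclidean_distance(x, y):
    """Euclidean distance; smaller values mean more similar compounds."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def predict_log_p(target, database, fraction=0.01, similarity=tanimoto):
    """Mean log P of the `fraction` of database compounds most similar to target.

    `database` is a list of (descriptor_vector, measured_log_p) pairs.
    At least one neighbour is always used.
    """
    ranked = sorted(database, key=lambda rec: similarity(target, rec[0]), reverse=True)
    k = max(1, round(fraction * len(ranked)))
    return sum(log_p for _, log_p in ranked[:k]) / k


# Hypothetical toy data: (topological-index vector, measured log P).
toy_database = [
    ([1.0, 2.0, 3.0], 1.5),
    ([1.1, 2.1, 2.9], 1.7),
    ([9.0, 0.5, 0.1], -0.4),
]
toy_target = [1.0, 2.0, 3.1]

nearest_prediction = predict_log_p(toy_target, toy_database, fraction=0.01)
wider_prediction = predict_log_p(toy_target, toy_database, fraction=0.67)
```

To rank by Euclidean distance with the same helper, one option is to pass `similarity=lambda x, y: -euclidean_distance(x, y)`, so that the smallest distances sort first. Note also that for proportional vectors such as `[1, 2, 3]` and `[2, 4, 6]` the Cosine coefficient is 1 while the Tanimoto coefficient is 2/3; the abstract does not say whether this magnitude insensitivity is what underlies the misleading Cosine results it reports.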




