BIRCH Tree Clustering
PI: Ramon Miranda-Quintana
University of Florida, Department of Chemistry and Quantum Theory Project
2024 Spring ~ Summer
Research Focus: Within the realm of cheminformatics, molecular similarity refers to how much in common two molecules have with each other in terms of functional groups, atoms, and chemical bonds. When calculating the similarity value, molecular fingerprints are a popular way to represent these molecules. These fingerprints are essentially strings composed of 1s and 0s with 1 indicating that a feature is present, and 0 for the absence of a feature. Using these fingerprints, a Tanimoto Similarity coefficient can be calculated to quantify the area of overlap between two fingerprints (i.e. having the same byte on the same index). With this coefficient, one can assess how much in common the molecules have with each other on a scale between 0 to 1, with 0 indicating no similarity and 1 for being fully identical. Unfortunately, functions from popular cheminformatic libraries that implement the Tanimoto coefficient only allow comparisons between two singular molecules at a time. This means that if we were to run a full scale molecularity search on a dataset, it would require O(N squared) time complexity where N is the number of molecules, significantly hindering its execution time. The MQ Lab presents a novel method to reduce this horrendous time complexity to a linear trend by developing the iSIMs approach which uses the concept of extended similarity. That is, performing the Tanimoto Similarity index on behalf of the entire dataset as an average by leveraging an attribute called linear sums. With this brilliant implementation, it takes significantly less time to go through the dataset and cluster out molecules based on into their respective groups based on a similarity index.
Responsibilities: My main responsibility for this project was to further optimize the given code whether it be through simplifying arduous, repetitive code or even by coming up with new tricks (such as “folding” the molecule in half) to speed up the execution time. Because some of these tricks sacrifice the clustering accuracy to a certain degree, we plan on releasing two separate versions of the code: a main, albeit “slow”, version and a fast version with the tricks. I was also responsible for running time analysis and bottlenecking which areas of the code were critical in time length (offering insights on which parts should be prioritized) using Excel, and renovating an existing BIRCH C code with the new implementations brought by the group. I mainly used Python, Linux, and RDKit for this project, but I’ve also played around with other libraries such as ChemFp and PyBind to test out functionalities.