Eliminating the hype of big data for computational drug discovery and chemical biology - chemogenomic active learning

Drug discovery works by finding compounds that bind to proteins and induce downstream effects in cell signaling that are therapeutic. However, there are practically an infinite number of small molecules, and screening even a large subset of them in a wet lab is not only incredibly expensive, it is largely wasteful since success rates of finding an active compound are typically 1%. So, people have worked on computational prediction of drug discovery for several decades. More recently, people have been blindly jumping on to the bandwagon led by the hype of big data and complex machine learning (namely deep learning), without even asking the fundamental question of whether such are necessary in the first place for effective building of compound activity models. A new method known as chemogenomic active learning (Reker et al., Future Med Chem (2017) p381-402), implemented in part with scikit-learn, numpy, scipy, and matplotlib, demonstrated through repeated experiment that large (40,000+ example) bioactivity databases could be analyzed so efficiently that 25% or less of the data could be highly predictive (MCC=0.8) on entire datasets when the most informative compound-protein bioactivity datapoints were selected. The implications of the Reker et al. work are significant - not only does the new CGAL method dispel the increasingly popular trend to blindly include increasingly more data or blindly replace SVMs with deep learning for computational drug discovery, but CGAL could lead to substantially faster drug lead identification by identifying new initial hits and/or suggesting directions to proceed after hits are adaptively incorporated during cycles of predict-experiment. By executing small subsets of wet-lab validations rather than massive, iterative screening campaigns, and continually feeding the cycles back to CGAL, research laboratories could save substantial amounts of money in projects were hit and lead identification are the goals.