World’s Largest Open Protein-Ligand Dataset 

Until recently, drug discovery workflows in (bio)pharma made only limited use of computational methods. Pharma has usually relied on experimental methods, starting with target (or protein) discovery, followed by lead (or ligand) identification and optimization, before embarking on clinical trials. Under this process, the industry has seen more than 10 years to reach a marketable drug, more than $2B (US dollars) to develop a new drug, and a conversion rate below 20% for pipeline drugs that make it to market. Moreover, exploration of the design space around the causal target(s) was limited, resulting in the low conversion rate for drugs in the pipeline. In the past five years, cloud computing and AI/ML methods have gained traction, resulting in a faster drug discovery process, reduced costs, and increased drug success rates. The biggest lever for achieving these goals is to extend the use of computational techniques across the workflow. INAI and IHub-Data are teaming up with ecosystem partners to create the World’s Largest Open Protein-Ligand Dataset.

Accurate prediction of the binding affinity of Protein-Ligand (PL) complexes is of paramount importance in drug discovery. PDBbind is the largest open-source, manually curated database of experimentally measured binding affinities for PL complex structures. Although experimental data are supposed to reflect real processes in biological systems, results depend on the type of experimental approach, the study conditions, and how the experiment is performed. Because PDBbind data come from different wet labs, they inherently carry noise that is difficult to mitigate. Obtaining binding affinities of PL complexes through experimental assays that mitigate these issues is expensive and time-consuming. Further, the dataset (~19.5K entries) is restricted to fully bound states of the PL complexes and does not include their higher-energy states, which are equally important.

To alleviate this problem, computational methods are being adopted to predict the binding affinity of PL complexes (for example, PLAS-5K). One such method for computing binding free energy is Molecular Mechanics Poisson-Boltzmann Surface Area (MMPBSA). With MMPBSA, one can calculate the binding free energy, or binding affinity, of PL complexes, including their higher-energy states, which helps in learning the energetics of ligands going from the unbound to the bound state. This can lead to a total dataset size of well over 200K, which would make it the World’s Largest Open PL Dataset. To validate the dataset, one can apply computational methods to a subset of PL complexes (alchemical methods, which predict binding free energy more accurately but are an order of magnitude more computationally expensive than MMPBSA) and experimental assays to a still smaller number of PL complexes.
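As a rough illustration of the kind of calculation MMPBSA involves, the sketch below estimates a binding free energy as the average, over simulation snapshots, of the complex energy minus the receptor and ligand energies. It is a minimal sketch under stated assumptions, not the pipeline used for this dataset: the function name `mmpbsa_binding_energy` and the sample numbers are hypothetical, each per-frame energy is assumed to already combine the molecular mechanics, Poisson-Boltzmann, and surface area terms, and entropy contributions are ignored.

```python
from statistics import mean, stdev


def mmpbsa_binding_energy(complex_e, receptor_e, ligand_e):
    """Estimate a binding free energy from per-frame energies (kcal/mol).

    Hypothetical helper: each input is a list with one total energy per
    snapshot (molecular mechanics + polar solvation + nonpolar solvation).
    Entropy terms are deliberately left out of this sketch.
    Returns (mean binding energy, standard error of the mean).
    """
    if not (len(complex_e) == len(receptor_e) == len(ligand_e)):
        raise ValueError("all three energy series need the same number of frames")
    if not complex_e:
        raise ValueError("at least one frame is required")

    # Per-frame difference: dG = G_complex - G_receptor - G_ligand
    per_frame = [c - r - l for c, r, l in zip(complex_e, receptor_e, ligand_e)]

    avg = mean(per_frame)
    # The standard error needs at least two frames; report 0.0 otherwise.
    sem = stdev(per_frame) / len(per_frame) ** 0.5 if len(per_frame) > 1 else 0.0
    return avg, sem


if __name__ == "__main__":
    # Illustrative numbers only, not taken from any real simulation.
    dg, err = mmpbsa_binding_energy(
        complex_e=[-110.0, -112.0, -108.0],
        receptor_e=[-70.0, -71.0, -69.0],
        ligand_e=[-30.0, -30.0, -30.0],
    )
    print(f"Estimated binding energy: {dg:.2f} +/- {err:.2f} kcal/mol")
```

A more negative value indicates stronger predicted binding. Running the same averaging over snapshots of partially bound, higher-energy states is one way such a dataset could cover more than the fully bound structure.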

For this computational exercise, IIITH and IHub-Data, along with collaborators at IITD, have created a dataset of 20K binding affinities of bound structures, an effort that took more than a year. As a next step, high-energy structures for each of these complexes have to be computed, and partner Insilico Medicine will perform more accurate alchemical calculations to obtain the binding affinities of a subset of these structures for validation. Given the high and massively parallel computational demand of building the World’s Largest Open PL Dataset, INAI, IHub-Data, and other ecosystem partners intend to generate and validate the PL dataset on CPU-based server farms and launch it in Q2’23 to enable a better and faster drug discovery process.