Since joining Cyclica nearly two years ago, I’ve conducted several R&D projects investigating and applying aspects of Cyclica’s technology, and supported numerous innovative drug discovery projects, working with inspiring and talented scientists across the globe. As a biochemist by training, I’m particularly interested in the intersection of biology, chemistry and pharmacology that goes into drug discovery.
Finding an active molecule for a protein target of interest is the first step on a drug discovery journey; at Cyclica we use our technology to find high-quality chemical starting points faster. While identifying molecules that bind is essential, we believe it is equally important to consider, early in the hit-finding process, the properties of a molecule that matter for its optimization into a drug. This is what motivated us to develop POEM, a supervised machine learning algorithm that we use to build robust property models. Drug property models built with POEM are critical to our drug discovery programs. As an example, I would like to share my experience using POEM to construct a highly predictive hERG channel activity model to aid in the discovery of safer medicines. Considering hERG liability, and thus cardiac safety, early in drug discovery can help reduce drug attrition.
Generating a POEM model requires only a list of compounds alongside their known property values. Most often, the chemical activity datasets we retrieve are highly skewed, with the imbalance between labels at times spanning multiple orders of magnitude. We sought to explore the impact of these highly skewed datasets on our POEM models. The importance of a balanced dataset was apparent throughout model exploration; for example, our investigations demonstrated that models generated with highly skewed data would classify blind query compounds as whichever label was in excess. These observations led us to re-evaluate our existing models and strengthen our internal model standards by removing those with either low specificity or low sensitivity. For our hERG POEM model, we used a balanced dataset of ~6000 data points, containing an equal number of active and inactive compounds (for complete details on how we curated the dataset, see our application note). Applied to a blind test dataset, the resulting hERG model yielded a sensitivity of 0.83 and a specificity of 0.93, making it our most performant POEM model to date.
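To make the effect of skewed data concrete, here is a minimal Python sketch. It is not POEM itself, nor the curation procedure described in our application note. It assumes a toy list of (compound, label) pairs, balances it by randomly undersampling the larger class (one common approach, chosen here purely for illustration), and computes sensitivity as TP / (TP + FN) and specificity as TN / (TN + FP). The function names and the toy data are hypothetical.

```python
import random
from collections import Counter


def balance_by_undersampling(records, seed=0):
    """Return a class-balanced subset of (compound_id, label) pairs.

    Labels are 1 (active) or 0 (inactive). The larger class is randomly
    undersampled to the size of the smaller one. This is an illustrative
    assumption, not the curation method used for the hERG dataset.
    """
    rng = random.Random(seed)
    actives = [r for r in records if r[1] == 1]
    inactives = [r for r in records if r[1] == 0]
    n = min(len(actives), len(inactives))
    balanced = rng.sample(actives, n) + rng.sample(inactives, n)
    rng.shuffle(balanced)
    return balanced


def sensitivity_specificity(y_true, y_pred):
    """Compute (sensitivity, specificity) from binary labels and predictions."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Toy, highly skewed dataset: 10 actives versus 990 inactives (made-up data).
records = [(f"cpd{i}", 1) for i in range(10)]
records += [(f"cpd{i}", 0) for i in range(10, 1000)]

# A degenerate "model" that always predicts the label in excess (inactive).
y_true = [label for _, label in records]
majority_pred = [0] * len(y_true)
sens, spec = sensitivity_specificity(y_true, majority_pred)
print(f"always-inactive model: sensitivity={sens:.2f}, specificity={spec:.2f}")

# Balancing the data removes the incentive to favor one label.
balanced = balance_by_undersampling(records)
label_counts = Counter(label for _, label in balanced)
print(f"balanced label counts: {dict(label_counts)}")
```

The always-inactive predictor scores perfect specificity while missing every active compound, which is why both metrics need to be checked together when judging a model.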
Since we recently expanded the capabilities of our Ligand Express platform to allow our users to easily generate custom predictive models using our POEM algorithm, this application note is intended to provide our users with an overview of our approach. Without the need for hyperparameter tuning, POEM models can be rapidly generated alongside their cross-validation metrics, chemical landscape plots and cluster plots, which can be used to assess the performance of the model. Importantly, custom POEM models can be applied as a selective pressure during the design process, allowing our partners to design towards properties specific to their research objectives.