Recently, the Reproducibility Project: Cancer Biology published work suggesting that “more than half of high-impact cancer lab studies could not be replicated in controversial analysis.” The scientific community jumped on the findings and has been discussing the implications for pre-clinical drug discovery and the impact on clinical programs, as well as exploring opportunities to solve the problem. Derek Lowe, as is often the case, swiftly turned around an analysis that summarizes the findings, discusses the underlying and pervasive issues in scientific research, and suggests a path forward to mitigate the reproducibility crisis (spoiler: both the authors and Derek suggest that publishers adopt more rigorous requirements for researchers to share data, protocols, and methods).
I’m not here to discuss the systemic data reproducibility issues that are inherent in scientific research. Instead, I want to discuss the implications for artificial intelligence (AI) - machine learning and deep learning - techniques that rely on data. We all know the cliche “garbage in, garbage out”. When “original positive results only replicated 40% of the time,” that seems to suggest that 60% of data may very well be garbage. That certainly seems like cause to ring the alarm bell on AI. What follows is my view on the reproducibility issue and how we’ve approached it at Cyclica.
Reproducibility in biological and chemical research has been an issue for decades. An Amgen paper from 2012 suggested that they were “unable to reproduce the findings in 47 of 53 ‘landmark’ cancer papers.” It’s common for researchers to selectively publish positive data, and many journals do not require or encourage the inclusion of negative data, so it’s hard to obtain balanced and unbiased training data for machine learning that reflects the full picture. That said, relying on inconsistent, imbalanced, or poor data is not just an artificial intelligence (AI) problem; it is a human intelligence (HI) problem. Humans review many of the same papers that are used for training and generate hypotheses on the basis of those papers. If anything, AI techniques, if properly constructed and trained, can mitigate some of the negative bias and noise that is inherent in the dataset. A good example of the pitfalls such negative bias can cause is discussed in our blog post on the Database of Useful Decoys: Enhanced (DUD-E).
With all that said, it is true that one of the limitations of AI is that it is reliant on three things: data, data, and more data. Where data isn’t balanced, of high quality, or accessible, the predictive power of AI will fall off a cliff. That’s why at Cyclica we start with a fundamental understanding of protein structure, combine that with large amounts of quality-controlled experimental drug-target interaction (DTI) data, and use that to train MatchMaker, our proteome-wide DTI prediction model. Structural data enhances the model's ability to generalize, which in turn enhances its ability to tolerate noise in the training data. In addition, we avoid the missing negative bias by using randomized assumed negatives. This is actually essential to MatchMaker, as it is the only feasible way to eliminate all sorts of biases that would otherwise short-circuit the target dependency of model predictions. Without it, the model would learn to distinguish negatives from positives with little or no regard to the target. You can learn more about MatchMaker here.
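To make the idea of randomized assumed negatives a little more concrete, here is a minimal Python sketch. It is an illustration under simple assumptions, not MatchMaker's actual procedure: the function name `sample_assumed_negatives`, the toy ligand and target identifiers, and the one-negative-per-positive ratio are all hypothetical. Each ligand from a known positive pair is re-paired with a randomly drawn target it is not known to bind, so every ligand appears equally often among positives and negatives, and a model cannot separate the two classes from the ligand alone.

```python
import random


def sample_assumed_negatives(positives, seed=0, max_tries=100):
    """Pair each ligand with a random target it is not known to bind.

    `positives` is a list of (ligand_id, target_id) pairs. The returned pairs
    are *assumed* negatives: they are unmeasured, not confirmed inactive.
    """
    rng = random.Random(seed)
    known = set(positives)
    # Drawing from the list of positive targets (with repeats) keeps each
    # target roughly as frequent among negatives as among positives.
    target_pool = [target for _ligand, target in positives]
    negatives = []
    for ligand, _target in positives:
        for _attempt in range(max_tries):
            candidate = (ligand, rng.choice(target_pool))
            if candidate not in known:
                negatives.append(candidate)
                break
    return negatives


# Toy data (hypothetical identifiers, for illustration only).
positives = [
    ("ligand_a", "target_1"),
    ("ligand_b", "target_1"),
    ("ligand_c", "target_2"),
    ("ligand_d", "target_3"),
]
negatives = sample_assumed_negatives(positives)

# Labeled training set: 1 = measured positive, 0 = assumed negative.
dataset = [(l, t, 1) for l, t in positives] + [(l, t, 0) for l, t in negatives]
```

Because the ligands are identical across the two classes in this sketch, the only information that separates a positive from a negative is which target the ligand is paired with, which is the target dependency described above.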
With MatchMaker, because we train a single large model covering all known targets, and train it with millions of proteome-wide experimental data points, the model can learn from many proteins at once and deduce general principles that apply universally. This enables MatchMaker to tackle targets that do not yet have chemical data associated with them, and allows us to answer the key question about the application of AI: “can an AI-augmented drug discovery platform identify novel hits, not only for well-characterized targets, but for targets with little to no previous data?” This capability has been shown in practice on several of our collaborative projects. Here and here, you can find studies that demonstrate MatchMaker's generalizability to new, data-less protein targets. This can be attributed to MatchMaker's unique combination of protein structure-based and ligand-based approaches. Recently, we showcased some foundational work to the scientific community, where we undertook an open-source project with the Structural Genomics Consortium (SGC) to develop a tool compound for DCAF1, an undruggable target that is involved in ubiquitin-mediated proteasome degradation and is a member of the WD40 repeat protein family, a target class garnering a lot of interest for novel drug development.
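One common way to check this kind of generalization to data-less targets is to hold out whole targets, rather than individual data points, when evaluating a model. The Python sketch below illustrates that idea only; it is not a description of how the studies linked above were run, and the function name `split_by_target`, the toy records, and the 25% holdout fraction are assumptions made for the example.

```python
import random


def split_by_target(records, holdout_fraction=0.25, seed=0):
    """Split (ligand, target, label) records so held-out targets never
    appear in the training set."""
    rng = random.Random(seed)
    targets = sorted({target for _ligand, target, _label in records})
    rng.shuffle(targets)
    n_holdout = max(1, round(len(targets) * holdout_fraction))
    holdout_targets = set(targets[:n_holdout])
    train = [r for r in records if r[1] not in holdout_targets]
    test = [r for r in records if r[1] in holdout_targets]
    return train, test, holdout_targets


# Toy records (hypothetical identifiers and labels, for illustration only).
records = [
    ("ligand_a", "target_1", 1),
    ("ligand_b", "target_1", 0),
    ("ligand_c", "target_2", 1),
    ("ligand_d", "target_3", 1),
    ("ligand_e", "target_4", 0),
]
train, test, holdout_targets = split_by_target(records)
```

A model scored on `test` in this setup has seen no chemical data for the held-out targets during training, which mimics the situation of a new target with no previous data.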
One last way we address this that I’ll mention (there are many others) is by layering a wealth of systems biology data on top of drug-target interaction prediction data in the interface of our discovery platform. This allows our drug discovery team, who use the platform when working on any one of our dozens of drug programs, to interrogate the data in real time. For example, we’re able to place a predicted drug-target interaction into biological context through network graphs (shown below), which not only show the targets that MatchMaker predicts will interact with the query molecule, but also allow our scientists to review the underlying data source that makes the biological link between the target and the disease area of interest.
There’s always so much more work to do, and fighting to get our hands on more, higher quality, balanced, and fair data is important. We do this by deeply evaluating public and acquired datasets prior to integration, feeding both positive and negative data back into our models where appropriate, and partnering with leading institutions globally to integrate their vetted data.
The day has not arrived when a machine can train a machine to create a drug. While that level of futuristic thinking is inspiring to some, I believe we need to ground ourselves in the truth: AI is not a silver bullet, and neither is HI. The two have to work in unison for us to make a demonstrable impact on how drugs are discovered and brought to patients faster and more equitably.
Thanks to Andreas Windemuth for the quick review of this article and to my marketing team for all of their support.