Discarding information to make predictions: Thoughts on how to predict drug properties robustly

When discussing information, especially in the context of using it to make predictions, I often hear that “more information is better”. Emphasis is placed on “using all the information” in a dataset, i.e. “finding the signal” and “extracting information”. When using deep learning approaches to train prediction models, it’s common to speculate on what specific information the model learned. Given a large amount of data, useful information is “buried” somewhere in it, and somehow the model is able to determine what information is needed (unless it isn’t, as is all too often the case). Information is always front of mind, making it feel like a scarce and valuable resource. In my view, though, the most interesting questions start when you ask about the trained model itself: where is the information now? Is it somehow encoded into the weights and structure of the network? Can it be “extracted”? Is the process of training a model reversible?

The author Tor Nørretranders would say that a trained model has a high “exformation” content. In his book “The User Illusion” (1991), he defines exformation as “explicitly discarded information”. He gives an example of a novelist and a publisher communicating about the sales of a new book. The novelist sends the publisher a single character: “?”. The publisher responds “!”, and the novelist is relieved to know the book is doing well. In classical information-theoretic terms, the information content of those messages is extremely low; however, the exformation content (everything both parties already know and deliberately leave unsaid) is quite high, which makes each message an efficient summary.

The idea of exformation is tightly linked to information entropy, but with far less emphasis on statistics, and it is quite useful to consider when building machine learning models for property prediction. Consider, for example, the application of machine learning to predict properties from small molecule structures, which presents unique challenges. Foremost among these is that the way you represent the molecule in a machine-friendly form strongly impacts your final outcomes. There are string-based representations, like SMILES or SELFIES; there are also so-called chemical fingerprints. Fingerprinting methods algorithmically process the structure of a molecule and record the output as a vector (often a sparse integer vector). The utility of these fingerprint encodings is that they let you compare molecules for similarity, using measures like the Tanimoto (Jaccard) coefficient, which reduces a pair of vectors to a single number corresponding to the supposed similarity (or distance) of the two molecules being compared. However, a number of concerns arise from this process:

  1. Fingerprinting is typically irreversible. This means that information is lost (discarded intentionally) when a molecule is fingerprinted.

  2. Fingerprints are neither complete nor unique. No single fingerprint contains “all” the information about a given molecule's structure. For the same reason, it is possible (though very unlikely) for two different molecules to produce the same fingerprint. It’s worth noting that sometimes this is desirable, for example if you want to ignore the presence or absence of explicit hydrogens.

  3. Distributions of Tanimoto distances are not the same between fingerprint types. A set of Tanimoto distances produced from Daylight fingerprints will be dramatically different from the set of Tanimoto distances produced from Morgan (Extended Connectivity) fingerprints, despite being produced from the same molecules!

  4. Fingerprints introduce “algorithmic noise”. By design, each fingerprinting method (even newer ones, like neural network based autoencoders) is biased towards different aspects of molecular structure. For example, some emphasize pharmacophoric features, while others focus on topology. This bakes in a bias: two molecules can read as similar according to one fingerprint, but not according to another fingerprint, or to a medicinal chemist who thinks about the molecules in a different way.
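
To make concerns 2 and 3 concrete, here is a minimal sketch in Python. It assumes two deliberately crude toy "fingerprints" (sets of SMILES character n-grams) rather than real Daylight or Morgan fingerprints, and the helper names `char_ngrams` and `tanimoto` are hypothetical. It only illustrates that the same pair of molecules can get very different Tanimoto similarities depending on the fingerprint, and that a lossy fingerprint can map two different molecules to the same feature set.

```python
def char_ngrams(smiles: str, n: int) -> set:
    """Toy fingerprint (an assumption for illustration): the set of
    length-n substrings of a SMILES string. Real fingerprints work on
    the molecular graph, not on the raw string."""
    return {smiles[i:i + n] for i in range(len(smiles) - n + 1)}


def tanimoto(a: set, b: set) -> float:
    """Tanimoto (Jaccard) similarity of two feature sets:
    |intersection| / |union|."""
    if not a and not b:
        return 1.0  # convention: two empty feature sets are identical
    return len(a & b) / len(a | b)


ethanol, propanol = "CCO", "CCCO"

# Fingerprint A: single characters. Both molecules collapse to {"C", "O"},
# so two different molecules share one fingerprint.
sim_a = tanimoto(char_ngrams(ethanol, 1), char_ngrams(propanol, 1))

# Fingerprint B: three-character fragments. {"CCO"} vs {"CCC", "CCO"}.
sim_b = tanimoto(char_ngrams(ethanol, 3), char_ngrams(propanol, 3))

print(sim_a)  # 1.0
print(sim_b)  # 0.5
```

Neither number is "the" similarity of ethanol and propanol; each one reflects what its fingerprint chose to keep and what it threw away.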

Upon contemplating these issues one may be tempted to create an “uber fingerprint” to address these shortcomings. However, if we think about this challenge from an exformation perspective, then a far simpler solution exists! Specifically, if we abandon the idea that we need to keep all similarity measurements to maximize predictive power, then we can aggressively discard information and simplify the problem. With this in mind we developed POEM, which uses a Pareto scheme to consider multiple fingerprints simultaneously, and with each step in the algorithm we discard information:

Molecule 2D Structure (as SMILES) -> 10 different fingerprints -> 10 sets of distances to a set of reference molecules -> matrix of dominance relationships between reference molecules -> relative similarity scores (to our prediction target molecule) -> normalized weights -> probability of predicted property (e.g. 0.86 probability to pass the blood brain barrier)

This information loss is acceptable, because we only need this similarity information to make a single prediction. For a given query molecule we can calculate very robust measures of relative similarity to each reference molecule. We can’t compare these similarity calculations to those for other query molecules, but the good news is that we don’t need to. So, at each step of the POEM algorithm, we carefully discard information in an irreversible manner; at the end of the process we’re left with a single number: the probability of our predicted property (i.e. the molecular equivalent of “!”). As an added bonus, since we keep a record of how we moved through each step of the process, we learn how the prediction was made (we don’t need to reverse anything if we just remember what we did), which increases the interpretability of the results.
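
The core idea of using Pareto dominance to combine several fingerprints can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the POEM implementation: the function names are hypothetical, the distances and labels are made-up toy values, and the weighting rule `1 / (1 + number of dominating references)` is an assumption chosen for brevity (the paper linked below describes the actual scheme). A reference molecule "dominates" another if it is at least as close to the query under every fingerprint and strictly closer under at least one.

```python
from typing import List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if distance vector `a` is no worse than `b` in every
    fingerprint and strictly better (smaller) in at least one."""
    pairs = list(zip(a, b))
    return all(x <= y for x, y in pairs) and any(x < y for x, y in pairs)


def dominated_counts(distances: List[Sequence[float]]) -> List[int]:
    """For each reference, count how many other references dominate it."""
    n = len(distances)
    return [
        sum(dominates(distances[j], distances[i]) for j in range(n) if j != i)
        for i in range(n)
    ]


def pareto_weights(distances: List[Sequence[float]]) -> List[float]:
    """Assumed weighting (not from the paper): references dominated by
    fewer others get more weight; weights are normalized to sum to 1."""
    raw = [1.0 / (1 + k) for k in dominated_counts(distances)]
    total = sum(raw)
    return [r / total for r in raw]


def predict_probability(distances, labels) -> float:
    """Weighted vote of reference labels (1 = has the property, 0 = not)."""
    return sum(w * y for w, y in zip(pareto_weights(distances), labels))


# Toy data: 4 reference molecules, each with distances to one query
# molecule under 3 different fingerprints (illustrative values only).
distances = [
    [0.2, 0.3, 0.1],
    [0.4, 0.5, 0.3],
    [0.1, 0.6, 0.2],
    [0.9, 0.9, 0.9],
]
labels = [1, 1, 0, 0]

counts = dominated_counts(distances)
probability = predict_probability(distances, labels)
print(counts)
print(round(probability, 3))
```

Note how much is thrown away along the way: the raw distances are reduced to dominance relationships, those to counts, and the counts to a handful of weights. The weights only make sense for this one query molecule, which is all a single prediction requires.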


If you’re interested in reading more about how information and entropy relate to our ability to make models, you can read more here: https://onezero.medium.com/when-is-ai-trustworthy-when-is-ai-useful-215aaee24a6f.

If you’re interested in reading more about POEM, you can get a detailed explanation in the paper, published recently here: https://doi.org/10.1088/2632-2153/ab891b.

If you want to learn more about how we at Cyclica use methods like these to solve problems in drug design, you can read more on our website at: https://www.cyclicarx.com. For a short summary of POEM, I encourage you to read more about it here!
