The Complexity of Data in Machine Learning for Drug Discovery

Machine Learning (ML) pipelines follow an iterative, multi-component lifecycle that depends heavily on the available data. Data is not only the basis of ML models but also a critical factor in determining their usefulness and performance. More important than the sheer volume of data used to train, validate, and analyze ML models, however, is its quality, which is fundamental to the success of any ML project.


In drug discovery in particular, ML pipelines are built on biological and chemical data. Biological data is complex, which is not surprising: it results from studies of complex phenomena and systems in which several entangled variables are in play. Expecting unidimensional data or a single data type is therefore unrealistic. Instead, biological data is heterogeneous, conditional, high-dimensional, and research-biased.

  • Heterogeneous. Molecular biology has developed specialized data-driven methods and technologies to characterize and quantify biological systems at different levels: genomics, proteomics, transcriptomics, and metabolomics, among others. Each is derived from specific experimental assays and reported in a different format [1].
  • Conditional. Developing ML algorithms for drug discovery requires ground truths, which are determined by reported experimental measurements. However, reproducing an experiment may yield different, and sometimes conflicting, outcomes. For example, running the same protocol in two different labs may lead to different results, so ground truths can be elusive [2].
  • High-dimensional. In biological data, the number of samples (observations) is often limited by the available resources and is much smaller than the number of variables (features). For example, human genome expression arrays can probe the expression of ca. 47,000 transcripts in a single sample [3], so the number of variables can exceed the number of samples by orders of magnitude.
  • Research-biased. Chemical and biological findings are usually focused on, or biased towards, specific outputs. More data tends to be available in areas with higher market value or on topics that are considered ‘hot’. A remarkable example of this bias is the increase in scientific output of COVID-19-related studies between December 2019 and April 2020 [4].
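
To make the high-dimensional point concrete, here is a minimal sketch using purely synthetic noise rather than real expression data. It shows that when features vastly outnumber samples, some feature will correlate strongly with the target by chance alone. The function names, sample count, and feature count are illustrative assumptions.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def chance_correlations(n_samples, n_features, seed=0):
    """Return |r| between each pure-noise feature and a pure-noise target."""
    rng = random.Random(seed)
    target = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
    correlations = []
    for _ in range(n_features):
        feature = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
        correlations.append(abs(pearson(feature, target)))
    return correlations


if __name__ == "__main__":
    # 10 samples and 2,000 features are arbitrary, illustrative choices.
    corr = chance_correlations(n_samples=10, n_features=2000)
    print("strongest of the first 5 features:", round(max(corr[:5]), 2))
    print("strongest of all 2,000 features:", round(max(corr), 2))
```

Neither the features nor the target carry any signal here, so every strong correlation is spurious. This is why feature selection and validation need particular care when samples are scarce.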

    In addition to this complexity, rapid advances in technology and computing power mean that large amounts of data are being generated at a fast pace (Fig. 1). This is both a blessing and a curse, because high volumes of data do not directly translate into information. Large amounts of complex data make data engineering challenging: significant data sanitization, standardization, and selection steps are required, and they can dramatically affect model performance. Importantly, these tasks do not generalize, because data curation depends heavily on the nature of the problem. Each problem requires well-structured data with its own ground truths, and high-quality data is critical to developing ML solutions that address specific problems in drug discovery.
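
As a rough illustration of the sanitization and standardization steps described above, the sketch below converts potency values to a common unit, drops unusable rows, merges replicate measurements, and flags compounds whose replicates disagree. The record layout, the unit table, and the 10-fold conflict threshold are all assumptions made for the example; a real curation step would be tailored to the problem at hand.

```python
from collections import defaultdict
from statistics import median

# Conversion factors to nanomolar; this unit set is an illustrative assumption.
TO_NM = {"nM": 1.0, "uM": 1e3, "mM": 1e6}


def curate(records, max_fold=10.0):
    """Standardize units, drop unusable rows, and merge replicates.

    records: iterable of (compound_id, value, unit) tuples (hypothetical schema).
    Returns {compound_id: {"value_nM": ..., "n": ..., "conflict": ...}}.
    """
    grouped = defaultdict(list)
    for compound_id, value, unit in records:
        # Sanitization: skip missing, non-positive, or unknown-unit entries.
        if value is None or value <= 0 or unit not in TO_NM:
            continue
        # Standardization: express every measurement in nanomolar.
        grouped[compound_id].append(value * TO_NM[unit])

    curated = {}
    for compound_id, values in grouped.items():
        curated[compound_id] = {
            "value_nM": median(values),  # one consensus value per compound
            "n": len(values),            # number of replicates kept
            # Flag replicates that differ by more than max_fold.
            "conflict": max(values) / min(values) > max_fold,
        }
    return curated


if __name__ == "__main__":
    example = [
        ("A", 250.0, "nM"),
        ("A", 0.5, "uM"),   # same compound, reported in another unit
        ("B", 5.0, "nM"),
        ("B", 1.0, "uM"),   # 200-fold disagreement with the row above
        ("C", None, "nM"),  # missing value, dropped
    ]
    for compound_id, summary in curate(example).items():
        print(compound_id, summary)
```

Flagging conflicts instead of silently averaging them keeps the decision about which measurement to trust with a domain expert.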


    Figure 1. Examples of fast-paced generation of biological data. (A) The overall growth of released structures per year in the Protein Data Bank. (B) Summary of entities and quantities available in the ChEMBL 31 database.

    Feedback loop

    Collecting data is just the first step in the ML lifecycle for drug discovery (Fig. 2). Raw data requires a significant amount of work before it becomes meaningful information. Once well-structured data is ready, the next step is to develop the algorithm, whose accuracy will depend on the quality of that data. After the model is built, its performance is assessed, a key step in the development cycle of ML pipelines in drug discovery that typically takes multiple iterations. During this process the model is refined and the accuracy of its predictions can be further improved. The assessment often suggests either retraining the model with updated data or implementing a new feature to maintain or improve performance.


    Figure 2. Machine Learning iterative lifecycle.

    However, ML models are not static. After deployment, it is almost certain that more data will become available to plug into the pipeline and keep the loop active. The quantity and diversity of biological data make it challenging to integrate into ML pipelines for drug discovery, and it is almost impossible to shield a model from changes in the data. Without updated data, the predictive power of an ML model will decrease over time. Feedback loops in drug discovery pipelines are therefore paramount: new data is collected, the data distribution changes, and the model needs to be retrained. In the end, we don’t aim for perfection; we aim to keep improving accuracy.
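
One simple way to picture the retraining trigger in such a feedback loop is to compare newly collected values against the data the model was trained on. The sketch below uses a standardized mean shift as the drift signal. This heuristic, the function names, and the 0.5 threshold are assumptions chosen for illustration, not a description of any particular production pipeline.

```python
from statistics import mean, stdev


def mean_shift(reference, new_batch):
    """Absolute shift of the batch mean, in units of the reference std dev."""
    ref_sd = stdev(reference)
    if ref_sd == 0:
        raise ValueError("reference values have zero variance")
    return abs(mean(new_batch) - mean(reference)) / ref_sd


def needs_retraining(reference, new_batch, threshold=0.5):
    """Return True when the new batch has drifted past the chosen threshold."""
    return mean_shift(reference, new_batch) > threshold


if __name__ == "__main__":
    # Hypothetical values of one feature (or label) seen during training.
    training_values = [1.0, 2.0, 3.0, 4.0, 5.0]
    similar_batch = [2.0, 3.0, 4.0]
    shifted_batch = [6.0, 7.0, 8.0]
    print("similar batch:", needs_retraining(training_values, similar_batch))
    print("shifted batch:", needs_retraining(training_values, shifted_batch))
```

A mean shift only captures one kind of change. In practice, a check like this would sit alongside monitoring of the model's predictive performance on newly labeled data.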

    Data is dynamic, and its growth can sometimes occur unexpectedly. It is therefore important to build adaptable ML models in drug discovery pipelines that can evolve together with the data.


    [1] D. Gomez-Cabrero et al., “Data integration in the era of omics: Current and future challenges,” BMC Systems Biology, vol. 8, no. Suppl 2, p. I1, 2014, doi: 10.1186/1752-0509-8-s2-i1.

    [2] S. Webb, “Deep learning for biology,” Nature, vol. 554, no. 7693, pp. 555–557, Feb. 2018, doi: 10.1038/d41586-018-02174-z.

    [3] R. Clarke et al., “The properties of high-dimensional data spaces: implications for exploring gene and protein expression data,” Nature Reviews Cancer, vol. 8, no. 1, pp. 37–49, Jan. 2008, doi: 10.1038/nrc2294.

    [4] R. Lucas-Dominguez, A. Alonso-Arroyo, A. Vidal-Infer, and R. Aleixandre-Benavent, “The sharing of research data facing the COVID-19 pandemic,” Scientometrics, vol. 126, no. 6, pp. 4975–4990, Apr. 2021, doi: 10.1007/s11192-021-03971-6.



Estefania Barreto-Ojeda, PhD

Computational Scientist
