Research bias. Chemical and biological findings tend to be focused on, or biased towards, specific outputs: data are more abundant in areas with higher market value or on topics considered ‘hot’. A remarkable example of this bias is the surge in scientific production of COVID-19-related studies between December 2019 and April 2020 [4].
In addition to the above-mentioned complexity of biological data, rapid advances in technology and computing power mean that large amounts of data are generated at a fast pace (Fig. 1). This is both a blessing and a curse, because high volumes of data do not directly translate into information. Large amounts of complex data make data engineering challenging: significant sanitization, standardization, and selection steps are required, and they can dramatically affect model performance. Importantly, these tasks are not generalizable; data curation is highly dependent on the nature of the problem. Every problem is different and requires well-structured data with their respective ground truths, and high-quality data is critical to developing ML solutions that address specific problems in drug discovery. A minimal curation sketch is given after Figure 1.

Figure 1. Examples of fast-paced generation of biological data. (A) The overall growth of released structures per year in the Protein Data Bank. (B) Summary of entities and quantities available in the ChEMBL 31 database.
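To illustrate the kind of curation these pipelines require, the sketch below cleans a hypothetical bioactivity table with pandas. It is only a minimal example: the column names (compound_id, smiles, ic50_value, ic50_units) and the unit conversions are assumptions for illustration, not a prescription from the text.

```python
import pandas as pd

# Hypothetical raw bioactivity table; column names are illustrative only.
raw = pd.DataFrame({
    "compound_id": ["C1", "C2", "C2", "C3", "C4"],
    "smiles": ["CCO", "c1ccccc1O", "c1ccccc1O", None, "CC(=O)O"],
    "ic50_value": [120.0, 0.5, 0.5, 3.2, 45.0],
    "ic50_units": ["nM", "uM", "uM", "nM", "nM"],
})

# Sanitization: drop records with missing structures.
clean = raw.dropna(subset=["smiles"])

# Standardization: express all activities in a single unit (nM).
to_nm = {"nM": 1.0, "uM": 1_000.0, "mM": 1_000_000.0}
clean = clean.assign(ic50_nm=clean["ic50_value"] * clean["ic50_units"].map(to_nm))

# Selection: remove duplicate measurements and keep only the relevant columns.
clean = (clean.drop_duplicates(subset=["compound_id", "ic50_nm"])
              [["compound_id", "smiles", "ic50_nm"]])

print(clean)
```

Even in this toy case, each step (dropping, converting, deduplicating) changes which records reach the model, which is why these choices can dramatically affect performance.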
Feedback loop
Collecting data is just the first step in the ML lifecycle for drug discovery (Fig. 2); a significant amount of work is required to transform it into meaningful information. Once well-structured data are ready, the next step is to develop the algorithm, whose accuracy will depend on the quality of the data. After the model is built, its performance must be assessed, a key step in the development cycle of ML pipelines in drug discovery that typically occurs over multiple iterations. During this process, the ML model is refined so that the accuracy of its predictions can be further improved. The result often suggests either retraining the model with updated data or implementing a new feature to maintain or improve model performance. A sketch of such an iterative loop follows Figure 2.

Figure 2. Machine Learning iterative lifecycle.
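The loop of building, assessing, and refining a model can be sketched in a few lines. The example below is a minimal illustration using scikit-learn on synthetic data; the model choice, hyperparameter grid, and descriptor/label stand-ins are assumptions, not the specific pipeline described in the text.

```python
# A minimal sketch of the iterative train / evaluate / refine cycle on synthetic data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))            # stand-in for molecular descriptors
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # stand-in for an activity label

best_score, best_model = -np.inf, None
for n_estimators in (50, 100, 200):       # each pass is one refinement iteration
    model = RandomForestClassifier(n_estimators=n_estimators, random_state=0)
    score = cross_val_score(model, X, y, cv=5).mean()   # assess performance
    if score > best_score:
        best_score, best_model = score, model

best_model.fit(X, y)                      # keep the best iteration so far
print(f"cross-validated accuracy: {best_score:.3f}")
```

In a real pipeline the same structure repeats whenever updated data arrive: re-assess, refine, and retrain.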
However, ML models are not static. After deployment, it is almost certain that more data will become available to feed into the pipeline and keep the loop active. The quantity and diversity of biological data make their integration into ML pipelines for drug discovery a great challenge, and it is almost impossible to prevent changes in the data from having an impact. Without updated data, the predictive power of our ML model will decrease over time. Feedback loops in drug discovery pipelines are therefore paramount: new data is collected, the data distribution changes, and the model needs to be retrained. In the end, we don’t aim for perfection; we aim for improving accuracy. One way to keep such a loop active is sketched below.
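A common way to keep the feedback loop active is to monitor incoming data for distribution shift and trigger retraining when drift is detected. The sketch below is one possible implementation, not the method prescribed here: it assumes a per-feature two-sample Kolmogorov–Smirnov test and an arbitrary significance threshold.

```python
# Minimal drift check: compare each feature of newly collected data against the
# training data with a two-sample KS test and retrain if any feature has drifted.
import numpy as np
from scipy.stats import ks_2samp
from sklearn.ensemble import RandomForestClassifier

def has_drifted(X_train, X_new, alpha=0.01):
    """Flag drift if any feature distribution differs significantly (assumed threshold)."""
    return any(ks_2samp(X_train[:, j], X_new[:, j]).pvalue < alpha
               for j in range(X_train.shape[1]))

rng = np.random.default_rng(1)
X_train = rng.normal(size=(500, 10))
y_train = (X_train[:, 0] > 0).astype(int)
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Newly collected batch with a shifted first feature (simulated drift).
X_new = rng.normal(size=(200, 10))
X_new[:, 0] += 1.5
y_new = (X_new[:, 0] > 1.5).astype(int)

if has_drifted(X_train, X_new):
    # Keep the feedback loop active: retrain on the pooled, updated data.
    model = RandomForestClassifier(random_state=0).fit(
        np.vstack([X_train, X_new]), np.concatenate([y_train, y_new]))
    print("drift detected: model retrained on updated data")
```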
Data is dynamic, and sometimes its growth can occur unexpectedly. It is therefore important to build adaptable ML models in drug discovery pipelines, capable of evolving together with the data.
References
[1] D. Gomez-Cabrero et al., “Data integration in the era of omics: Current and future challenges,” BMC Systems Biology, vol. 8, no. Suppl 2, p. I1, 2014, doi: 10.1186/1752-0509-8-s2-i1.
[2] S. Webb, “Deep learning for biology,” Nature, vol. 554, no. 7693, pp. 555–557, Feb. 2018, doi: 10.1038/d41586-018-02174-z.
[3] R. Clarke et al., “The properties of high-dimensional data spaces: implications for exploring gene and protein expression data,” Nature Reviews Cancer, vol. 8, no. 1, pp. 37–49, Jan. 2008, doi: 10.1038/nrc2294.
[4] R. Lucas-Dominguez, A. Alonso-Arroyo, A. Vidal-Infer, and R. Aleixandre-Benavent, “The sharing of research data facing the COVID-19 pandemic,” Scientometrics, vol. 126, no. 6, pp. 4975–4990, Apr. 2021, doi: 10.1007/s11192-021-03971-6.