AlphaFold2 coverage on multiple proteomes and impact on training MatchMaker
In the previous post of our AlphaFold Series, we examined AlphaFold coverage of the human proteome, broken down by target class. This fourth post of the series applies the same methodology to profile the opportunities enabled by AlphaFold2 (AF2) in the other species modelled in the EBI-DeepMind database.
Non-Human Structural Biology and the Impact of AlphaFold2
With human health and medicine driving global efforts in life science research, it's unsurprising that human proteins comprise the largest portion of experimentally determined three-dimensional structures (RCSB statistics page). Moreover, the remaining entries primarily reflect human health interests, including model organisms for medical research, agricultural species, and pathogens. However, given the global interest in microbiomes, industrial chemical processing, biomaterial production, and energy production, among others, structural biology initiatives in these realms are expanding!
Figure 1: Distribution of source organisms of protein structures in the Protein Data Bank, adapted from the RCSB PDB at https://www.rcsb.org/stats/distribution-modified-organism-gene on November 17th, 2021 (excludes naturally-sourced proteins).
Here we assess the impact of AF2 on expanding structural coverage of non-human proteins. The methodology for this multi-species protein structural coverage study can be found in our last blog post and Cyclica’s 2017 proteome structural coverage study. Specifically, we examine the residue-level coverage available for each species in the PDB, the SwissModel Repository, and the AF2 Database, as well as a breakdown by pLDDT confidence values. While most non-human species have much lower PDB coverage than human, they were previously well-served by homology modelling approaches. Relative to the human proteome, prokaryotic proteomes are the most accessible to homology modelling methods, presumably due to lower proportions of intrinsically disordered proteins (IDPs). Conversely, Plasmodium has low proportional coverage, as apicomplexan parasites are known to have high IDP ratios.
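As a rough illustration of what a residue-level coverage tally can look like, here is a minimal Python sketch. It is not the pipeline used for this study: the function names, the interval representation, and the rule of crediting each residue to the first source that covers it (PDB, then SwissModel, then AF2) are all assumptions made for the example.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical representation: 1-based, inclusive residue ranges on one protein.
Interval = Tuple[int, int]


def residues(intervals: List[Interval]) -> Set[int]:
    """Expand a list of residue ranges into a set of residue positions."""
    covered: Set[int] = set()
    for start, end in intervals:
        covered.update(range(start, end + 1))
    return covered


def incremental_coverage(
    length: int, sources: List[Tuple[str, List[Interval]]]
) -> Dict[str, float]:
    """Fraction of residues first covered by each source, in priority order.

    Assumption for this sketch: a residue already covered by an earlier
    source is not counted again for a later one.
    """
    if length <= 0:
        raise ValueError("protein length must be positive")
    seen: Set[int] = set()
    fractions: Dict[str, float] = {}
    for name, intervals in sources:
        in_range = {r for r in residues(intervals) if 1 <= r <= length}
        new = in_range - seen
        fractions[name] = len(new) / length
        seen |= new
    fractions["uncovered"] = (length - len(seen)) / length
    return fractions


# Toy 100-residue protein with made-up coverage ranges.
example = incremental_coverage(
    100,
    [
        ("PDB", [(1, 20)]),
        ("SwissModel", [(11, 50)]),
        ("AF2", [(1, 90)]),
    ],
)
print(example)
```

Summing the per-protein residue counts across a proteome, rather than averaging the per-protein fractions, would give a proteome-level figure weighted by protein length.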
Figure 2: Residue-level proteome coverage for 11 species with structural representation in the PDB, SwissModel, and EBI AlphaFold2 model databases.
Our coverage analysis reproduced the findings of a community assessment released last September by Akdel et al., who estimated that the new database adds roughly 25% high-confidence structured protein content relative to SwissModel. This community assessment further comments on the accuracy and utility of the pLDDT confidence values, which may predict disordered residues more accurately than dedicated disorder prediction models. Moreover, the residue-specific nature of the confidence values allows users to determine when the de novo AF2 models could outperform distantly related homology models. Lastly, Akdel et al. demonstrate the use of AF2 structures in variant effect prediction and pocket detection, concluding that these modelled structures can be applied to diverse structural biology problems with near-experimental quality when confidence scores are properly factored in.
Figure 3: Breakdown of AlphaFold pLDDT confidence value bins for proteome segments not previously accessible to homology modelling approaches.
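For readers who want to produce a similar breakdown on their own data, the sketch below bins per-residue pLDDT scores and reports the fraction in each bin. The cut-offs (90, 70, 50) follow the confidence bands shown on the AlphaFold Database pages; whether the figure above uses exactly these bins is an assumption, and the function names and scores are invented for the example.

```python
from collections import Counter
from typing import Dict, Iterable

# Lower bounds of each band, checked from highest to lowest.
# Assumption: thresholds mirror the AlphaFold Database colour bands.
BANDS = [(90.0, "very high"), (70.0, "confident"), (50.0, "low")]
LABELS = ["very high", "confident", "low", "very low"]


def plddt_band(score: float) -> str:
    """Map a single pLDDT score (0 to 100) to a confidence band label."""
    for threshold, label in BANDS:
        if score >= threshold:
            return label
    return "very low"


def band_fractions(scores: Iterable[float]) -> Dict[str, float]:
    """Fraction of residues falling in each pLDDT band."""
    scores = list(scores)
    if not scores:
        raise ValueError("no pLDDT scores supplied")
    counts = Counter(plddt_band(s) for s in scores)
    return {label: counts.get(label, 0) / len(scores) for label in LABELS}


# Made-up per-residue scores for a short segment.
segment_scores = [95.2, 88.0, 71.5, 64.3, 42.0, 30.1, 91.0, 55.5]
fractions = band_fractions(segment_scores)
print(fractions)
```

Restricting the input to residues that no PDB entry or homology model covers would give the kind of "newly accessible" breakdown described in the caption above.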
Impact on Cyclica’s MatchMaker
So far, our internal observations with AF2 proteomes echo the conclusions of the community assessment. Specifically, we used the AF2 proteomes both to create new proteome screening libraries and to train MatchMaker, our universal drug-target interaction (DTI) prediction model (see our recent review on DTI models in Current Protocols). MatchMaker uses the local structures of 3D pockets as a basis to learn biophysical patterns between proteins and ligands that can generalize to previously unseen systems. Introducing AF2 protein structures into our training data pipelines and screening proteomes has so far increased the number of DTIs we can map to protein structures by ~18% and the overall number of testable proteins for our validation studies by ~20%. In our most challenging validation test, which simulates dataless targets, the introduction of AF2 data to MatchMaker led to a 25% increase in the number of highly-predictive proteins. Taken together, the addition of AF2 proteomes to MatchMaker substantially improves our ability to interrogate molecular interactions for both human and non-human proteins. Notably, though, AF2 does not replace the use of experimental structures altogether, as the latter provide more information on quaternary structure, ligand binding sites, macromolecular interfaces, and post-translational modifications.
In the next post of our AlphaFold Series, we will introduce a new deep learning framework that addresses some of the limitations of using modelled structures mentioned above. It was developed in collaboration with Bo Wang at the Vector Institute in Toronto and will be revealed at the 2021 NeurIPS MLSB workshop on December 13th, 2021.
Stay tuned!
Dr. Stephen MacKinnon, Vice President, Research & Development
Dr. David Kuter, Director of Scientific Computing
Jay Huang, Computational Scientist
Edited by Dr. Ali Madani, Machine Learning Team Lead