In this post of our AlphaFold Series, we look at the gaps in proteome coverage addressed by AlphaFold, broken down by protein class.
The concept of structural coverage has been a longstanding topic of interest to Cyclica, as we believe that modeling a drug’s total impact on human physiology is key to developing safer medicines. In 2017, we published a structural coverage study, which examined pharmaceutically important protein categories. This coverage analysis measured the sequence similarity of each human protein to the PDB to estimate the portion of proteins ‘accessible’ to reliable homology modeling prediction engines. This study inspired Cyclica’s transition from pose-dependent, docking-based proteome screening to MatchMaker, a pose-independent, deep learning-based next-gen alternative. MatchMaker was intentionally designed with relaxed requirements on atomistic-level accuracy, training and evaluating with modeled binding site structures from homology model databases.
So, what is the proteome coverage of AlphaFold2?
We modified our 2017 analysis to look specifically at the human proteome coverage from the PDB, SwissModel Repository, and AlphaFold Database. As with our previous work, we calculated structural coverage on a residue-by-residue basis by mapping available structures to UniProt’s canonical human proteome, then applied filters for several pharmaceutically relevant protein keywords. In this iteration, however, we did not exclude disordered residues. Additionally, rather than binning residues by BLAST-based similarity thresholds, we first mapped each protein to its structures using the respective database accessions, then performed residue-level mapping using pairwise sequence alignments.
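As a rough illustration of the residue-level mapping described above, the following Python sketch counts which positions of a protein are covered by at least one aligned structure. It is not the pipeline used for this analysis: the function names are hypothetical, the alignments are assumed to arrive as pairs of gapped strings from any pairwise aligner, and any protein residue aligned to a structure residue is assumed to count as covered, whether or not the residues match.

```python
def covered_positions(aligned_protein: str, aligned_structure: str) -> set[int]:
    """Return 1-based protein positions aligned to a structure residue.

    Both arguments are rows of one pairwise alignment, with '-' for gaps.
    Assumption: an aligned position counts as covered even on a mismatch.
    """
    if len(aligned_protein) != len(aligned_structure):
        raise ValueError("aligned sequences must have equal length")
    covered = set()
    position = 0  # index into the ungapped protein sequence
    for protein_char, structure_char in zip(aligned_protein, aligned_structure):
        if protein_char == "-":
            continue  # extra residue in the structure; no protein position here
        position += 1
        if structure_char != "-":
            covered.add(position)
    return covered


def residue_coverage(protein_length: int, alignments) -> float:
    """Fraction of protein residues covered by at least one structure."""
    if protein_length <= 0:
        raise ValueError("protein_length must be positive")
    covered = set()
    for aligned_protein, aligned_structure in alignments:
        covered |= covered_positions(aligned_protein, aligned_structure)
    return len(covered) / protein_length


# Toy example: a 10-residue protein with two partially overlapping structures.
example_alignments = [
    ("MKTAYIAKQR", "--TAYI----"),
    ("MKTAYIA-KQR", "-----IAGKQ-"),
]
coverage = residue_coverage(10, example_alignments)
```

Taking the union of covered positions means a residue seen in several structures is counted once, which keeps the per-protein coverage between 0 and 1.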
Figure 1. Residue-level coverage of the human proteome from three sources of experimental and predicted protein structures.
If we only consider protein coverage, AlphaFold provides models for nearly 100% of all residues in the canonical UniProt proteome, with five times the coverage of experimental sources (PDB) and twice the reach of pre-existing homology model databases (SwissModel). However, this coverage observation is a product of AlphaFold’s decision to release a structure model for every portion of every sequence, independent of prediction quality or disorder status. In contrast, the SwissModel repository will not model residues if a relevant template is not identified or if prediction quality drops below a predetermined threshold.
How reliable are these net new contributions from AlphaFold?
While AlphaFold does not exclude low-confidence portions of protein structures from its model database, it provides a residue-level confidence metric, pLDDT, which ranges from 0 to 100. We further examined the residue coverage exclusive to AlphaFold, corresponding to the blue area in Figure 1, by binning residues according to their pLDDT scores. Over half of this new coverage has ‘Low Confidence’ pLDDT scores under 50, which the AlphaFold database guidelines consider likely to be disordered. In contrast, approximately a third of the new coverage relative to homology model databases falls in the ‘Confident’ or ‘Very High’ categories, where predictions are expected to yield correct backbone topology. For pose-independent drug design technologies such as MatchMaker, this increase translates to thousands of newly structurally enabled proteins across all target classes.
Figure 2. Breakdown of AlphaFold pLDDT confidence bins for proteome segments not previously accessible to homology modeling approaches.
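The binning behind a breakdown like Figure 2 can be sketched in a few lines of Python. This is a minimal illustration rather than the code used for the analysis: the function names are hypothetical, the cut-offs of 50, 70, and 90 follow the AlphaFold Database confidence bands, the bins are labeled by score range rather than by name, and placing a score that falls exactly on a cut-off into the higher bin is an assumption.

```python
from bisect import bisect_right
from collections import Counter

# Cut-offs follow the AlphaFold Database confidence bands.
THRESHOLDS = (50.0, 70.0, 90.0)
# Range labels are illustrative; boundary scores go to the higher bin (assumption).
LABELS = ("<50", "50-70", "70-90", ">=90")


def plddt_bin(score: float) -> str:
    """Return the confidence bin label for a single pLDDT score."""
    if not 0.0 <= score <= 100.0:
        raise ValueError(f"pLDDT must be within 0-100, got {score}")
    return LABELS[bisect_right(THRESHOLDS, score)]


def bin_fractions(scores) -> dict[str, float]:
    """Fraction of residues falling in each pLDDT bin."""
    counts = Counter(plddt_bin(score) for score in scores)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no scores provided")
    return {label: counts[label] / total for label in LABELS}


# Toy example: per-residue scores for six residues covered only by AlphaFold.
fractions = bin_fractions([30.0, 45.0, 65.0, 80.0, 95.0, 20.0])
```

In practice, the per-residue scores would be read from the AlphaFold Database coordinate files, which store pLDDT in the B-factor field.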
In the next post of our AlphaFold Series, we will look into coverage considerations for non-human species. Given the existing research bias towards human proteins in the PDB, we expect the boost in structural coverage attributed to AlphaFold2 will be even more pronounced in non-mammalian species.
Dr. Stephen MacKinnon, VP, Research and Development
Dr. David Kuter, Director of Scientific Computing