Protein residue characterization using AlphaFold

In this post of our AlphaFold Series, we present new technology developed at Cyclica Inc., in collaboration with Dr. Bo Wang’s group at the Vector Institute, for protein residue characterization using protein structures from the AlphaFold2 (AF2) Database.

Protein structure prediction has undoubtedly made a tremendous leap in predictive accuracy with this year’s release of AlphaFold2 and RoseTTAFold. As focus begins to shift towards the application of predicted protein structures, we must acknowledge their limitations relative to experimentally determined structures. Specifically, predicted structures lack the quaternary structure, ligand binding sites, macromolecular interfaces, and post-translational modifications often found in experimental structures, all of which provide research scientists with important context for interpreting protein structures.

At this year’s NeurIPS MLSB workshop, we are introducing a new graph convolutional network (GCN) that uses modelled protein structures from the AlphaFold Database to perform residue classification tasks. The model is designed to augment any residue classification task with 3D structural information obtained from a modelled structure, creating geometrically aware representations. For instance, post-translational modifications determined via mass spectrometry, genetic mutation datasets, or UniProt sequence annotations can be modelled directly, without the need for experimental structure coverage in the PDB.


Figure 1. GCN framework that embeds a representation of an AlphaFold2-predicted protein structure and performs residue classification tasks.

Our framework generates protein residue graphs constructed from each predicted protein structure in the AlphaFold human proteome, where nodes correspond to individual residues and edges correspond to inter-residue contacts within pre-set distances. The generated graphs and node features extracted from protein structure information are then used in the training of GCN models.
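To make the graph construction concrete, here is a minimal sketch rather than the framework's actual code. It assumes each residue is represented by a single 3D coordinate (for example its C-alpha atom), that a contact is any pair of residues within the cut-off, and that the graph convolution takes the widely used symmetric-normalized form ReLU(D^-1/2 (A + I) D^-1/2 X W). The function names, toy coordinates, and random features are all hypothetical.

```python
import numpy as np

def contact_adjacency(coords, cutoff):
    """Binary residue contact graph: an edge joins two residues whose
    coordinates lie within `cutoff` angstroms. No self-loops."""
    coords = np.asarray(coords, dtype=float)
    # Pairwise Euclidean distances via broadcasting: shape (n, n)
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    adj = (dist <= cutoff).astype(float)
    np.fill_diagonal(adj, 0.0)
    return adj

def gcn_layer(adj, features, weight):
    """One graph convolution step (assumed form, not confirmed by the post):
    ReLU(D^-1/2 (A + I) D^-1/2 X W)."""
    a_hat = adj + np.eye(adj.shape[0])       # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    norm = a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(norm @ features @ weight, 0.0)

# Toy chain of four residues spaced 3.8 angstroms apart along the x-axis
coords = np.array([[0.0, 0.0, 0.0],
                   [3.8, 0.0, 0.0],
                   [7.6, 0.0, 0.0],
                   [11.4, 0.0, 0.0]])
adj = contact_adjacency(coords, cutoff=5.0)

rng = np.random.default_rng(0)
features = rng.normal(size=(4, 3))   # hypothetical per-residue node features
weight = rng.normal(size=(3, 2))     # untrained weights, for shape only
hidden = gcn_layer(adj, features, weight)
```

With a 5Å cut-off only sequence neighbours are connected in this toy chain, so each residue's hidden representation mixes its own features with those of its immediate neighbours.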

We used our framework to characterize residues across more than ten different tasks. Without any task-specific model tuning, it achieved high performance across them, for example ROC-AUC values of 0.7-0.9 for the ligand binding, peptide binding, nucleic acid binding and metal ion binding tasks (Figure 2). This experiment was conducted using 3630 proteins comprising 1,749,863 residues. The 3630 proteins were randomly split into 3160 training proteins and 5x100 validation proteins; only validation performance is shown in Figure 2.
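A protein-level split of this kind can be sketched as follows. This is an illustration under assumptions, not our pipeline: it reads "5x100" as five disjoint validation sets of equal size, and the function name, protein identifiers, and toy sizes are hypothetical. Splitting by protein rather than by residue keeps all residues of one protein on the same side of the split.

```python
import numpy as np

def split_proteins(protein_ids, n_val_sets, val_size, seed=0):
    """Randomly partition proteins into one training set and
    `n_val_sets` disjoint validation sets of `val_size` proteins each."""
    ids = np.array(protein_ids)
    n_val = n_val_sets * val_size
    if n_val >= len(ids):
        raise ValueError("validation sets would leave no training proteins")
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(ids)          # shuffled copy of the IDs
    val_sets = [shuffled[i * val_size:(i + 1) * val_size].tolist()
                for i in range(n_val_sets)]
    train = shuffled[n_val:].tolist()
    return train, val_sets

# Toy example: 30 hypothetical proteins, five validation sets of four
protein_ids = [f"P{i:03d}" for i in range(30)]
train_ids, val_sets = split_proteins(protein_ids, n_val_sets=5, val_size=4)
```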

We generated separate models for four distance cut-offs when creating the protein contact network (5Å, 8Å, 10Å, and 15Å), which led to a few interesting observations. For example, the distance cut-off has its largest effect on nucleic acid binding site prediction, with a maximum ROC-AUC difference of 0.062 between cut-offs. For this task, the preference for contact networks built with larger distance cut-offs may be a sign that the model is recognizing long-range bulk electrostatic effects. Such observations helped us explain how modeling parameters like the distance cut-off affect the performance of our deep learning framework on task-specific residue characterization. These results gave us confidence in the flexibility of our residue characterization framework, built on top of AlphaFold2 protein structure data, and in its ability to achieve high performance across multiple residue characterization tasks.
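The structural effect of the cut-off itself is easy to see in a small sketch. The example below assumes one coordinate per residue and uses a synthetic random-walk chain, not a real protein, so it illustrates only how the contact network becomes denser as the cut-off grows and says nothing about model performance. The helper name and the synthetic data are hypothetical.

```python
import numpy as np

def mean_degree(coords, cutoff):
    """Average number of contacts per residue at a given cut-off (angstroms)."""
    coords = np.asarray(coords, dtype=float)
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    contacts = dist <= cutoff
    np.fill_diagonal(contacts, False)        # ignore self-contacts
    return contacts.sum(axis=1).mean()

# Synthetic chain: 200 residues, consecutive residues 3.8 angstroms apart
# in random directions (not a physically realistic fold)
rng = np.random.default_rng(0)
steps = rng.normal(size=(200, 3))
steps *= 3.8 / np.linalg.norm(steps, axis=1, keepdims=True)
coords = np.cumsum(steps, axis=0)

cutoffs = [5.0, 8.0, 10.0, 15.0]
degrees = {c: mean_degree(coords, c) for c in cutoffs}
for c in cutoffs:
    print(f"cut-off {c:>4.1f} A: mean degree {degrees[c]:.2f}")
```

Because every contact at a smaller cut-off is also a contact at a larger one, the mean degree can only grow with the cut-off, which is why larger cut-offs expose each residue to more distant neighbours in a single convolution step.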



Figure 2. Proof-of-concept area under the receiver operating characteristic curve (ROC-AUC) for identifying residues that are part of ligand and peptide binding sites. ROC-AUC values of 0.5 and 1 correspond to random and perfect model performance, respectively.
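For readers less familiar with the metric, the sketch below computes ROC-AUC from its rank interpretation: the probability that a randomly chosen positive residue receives a higher score than a randomly chosen negative one, with ties counted as one half. It is an illustration only, the labels and scores are made up, and the pairwise comparison is quadratic in memory, so it suits small examples rather than proteome-scale evaluation.

```python
import numpy as np

def roc_auc(labels, scores):
    """ROC-AUC as the fraction of (positive, negative) pairs in which
    the positive residue is scored higher; ties count as one half."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[labels], scores[~labels]
    if pos.size == 0 or neg.size == 0:
        raise ValueError("both classes must be present")
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (pos.size * neg.size)

# Made-up per-residue labels (1 = binding site) and model scores
labels = [0, 0, 1, 1]
perfect = roc_auc(labels, [0.1, 0.2, 0.8, 0.9])       # positives always ranked higher
uninformative = roc_auc(labels, [0.5, 0.5, 0.5, 0.5])  # every pair is a tie
partial = roc_auc(labels, [0.1, 0.4, 0.35, 0.8])       # 3 of 4 pairs ranked correctly
```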

This work has been accepted for presentation at the NeurIPS 2021 MLSB workshop and will subsequently be released as an open-source code base upon peer-reviewed publication. By providing this machine learning framework to the community, we hope to accelerate the discovery of new AF2 proteome applications. Researchers with residue-level datasets who want to build robust predictive models will be able to integrate the structural information provided by DeepMind and EMBL-EBI directly.

Dr. Nasim Abdollahi, Machine Learning Researcher
Dr. Ali Madani, Machine Learning Team Lead

Edited by Dr. Stephen MacKinnon, VP of Research & Development
