Demystifying Protein Structure Prediction Models: AlphaFold, RoseTTAFold, ESMFold and beyond

The purpose of this blog post is to provide an introduction to protein structure prediction models and to explain the important differences between them, with a focus on input data and application.

Proteins are the machines of the body. They are large molecules (i.e. macromolecules) that play a key role in determining the phenotype and fate of living cells, not only in the human body but in all living species. They are gatekeepers of disease and the physical manifestation of DNA’s genetic code, including its disease-causing variants and mutations. Accordingly, small molecule drug therapies are designed to strategically target pathological protein states in order to treat human diseases.

Targeting proteins with synthetically generated chemical compounds is a complex process that starts with understanding the structure of the target proteins. These structures have traditionally been determined using costly and elaborate experimental processes like X-ray crystallography and cryo-electron microscopy. However, sustained efforts over the last two decades eventually produced computational models such as AlphaFold2, whose structure predictions come close to experimental accuracy.


Starting with AlphaFold2 [1], multiple computational models for protein structure prediction have emerged, competing along three dimensions: 1) accuracy; 2) speed; and 3) reliance on additional gene sequencing information for inference (i.e. structure prediction). While the ultimate goal of each model is to accurately predict the location of each atom in the protein, inference requirements can have a notable impact on how a model is applied.

Need for MSA

AlphaFold2 [1] and RoseTTAFold [2] were the first two deep-learning-based models to predict protein structures with high accuracy. Both take Multiple Sequence Alignments (MSAs) as input, which map the evolutionary relationships between corresponding residues of genetically related sequences. MSAs are derived from large, public, genome-wide sequencing databases that have grown exponentially since the emergence of next-generation sequencing in the late 2000s. It is widely accepted that MSA-dependent structure prediction tools gain 3D positional clues from pairs of residues that co-evolve over time, since co-evolution implies spatial proximity. Because MSA-dependent models are driven by evolutionary information, their structure prediction applications are limited to naturally occurring protein sequences.
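
To make the co-evolution idea concrete, here is a minimal sketch that scores pairs of alignment columns by mutual information, a classic covariation measure. It is not how AlphaFold2 or RoseTTAFold process MSAs internally; the toy alignment and the function names are hypothetical, chosen only for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def column(msa, i):
    """Return the residues found at alignment position i across all sequences."""
    return [seq[i] for seq in msa]


def mutual_information(msa, i, j):
    """Mutual information (in bits) between alignment columns i and j."""
    n = len(msa)
    counts_i = Counter(column(msa, i))
    counts_j = Counter(column(msa, j))
    counts_ij = Counter(zip(column(msa, i), column(msa, j)))
    mi = 0.0
    for (a, b), count in counts_ij.items():
        p_ab = count / n
        p_a = counts_i[a] / n
        p_b = counts_j[b] / n
        mi += p_ab * math.log2(p_ab / (p_a * p_b))
    return mi


# Hypothetical toy alignment (not real data): positions 1 and 3 always
# change together (K with E, D with R), the signature of co-evolution.
toy_msa = [
    "AKGEL",
    "AKGEL",
    "ADGRL",
    "ADGRL",
    "AKGEV",
    "ADGRV",
]

scores = {
    (i, j): mutual_information(toy_msa, i, j)
    for i, j in combinations(range(len(toy_msa[0])), 2)
}
top_pair = max(scores, key=scores.get)
print(top_pair, round(scores[top_pair], 3))  # (1, 3) 1.0
```

The pair with the highest score is the one whose residues vary in lockstep, which is the kind of signal that hints at spatial contact. With only a single sequence there is no column variation to measure, which is why this signal is unavailable without an alignment.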

More recent tools attempt to eliminate the need for MSAs by applying language modeling to individual protein sequences (Figure 1). For example, OmegaFold [3] has a language modeling component called OmegaPLM that uses transformers and attention mechanisms to learn per-residue and residue-pair representations for each protein sequence. OmegaFold, along with other single-sequence models such as HelixFold-Single [4] and Meta’s ESMFold [5], has higher potential for predicting the structures of orphan proteins and for antibody design, because these models do not require MSAs as input. However, for proteins that do have MSAs, they are generally less accurate than AlphaFold2 and RoseTTAFold. In contrast to MSA-dependent models, the domain of applicability of language-model-based approaches may extend beyond naturally occurring protein sequences; this may include structure prediction for mutated proteins or protein engineering tasks.
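
As a rough illustration of what "per-residue" and "residue-pair" representations look like, the sketch below runs one pass of scaled dot-product self-attention over a single sequence. It is a toy under stated assumptions, not OmegaPLM or any published architecture: the embedding function is a hypothetical fixed lookup, and the learned query, key, and value projections of a real transformer are omitted.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def embed(residue, dim=4):
    """Hypothetical fixed embedding: one deterministic vector per amino acid.
    A real language model learns these values during training."""
    k = AMINO_ACIDS.index(residue) + 1
    return [math.sin(k * (d + 1)) for d in range(dim)]


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(sequence, dim=4):
    """Return (per_residue, pair) for a single sequence of length L.

    pair[i][j]     attention weight residue i places on residue j (L x L)
    per_residue[i] attention-weighted mix of all embeddings (L x dim)
    """
    x = [embed(r, dim) for r in sequence]
    scale = math.sqrt(dim)
    pair = []
    for q in x:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in x]
        pair.append(softmax(scores))
    per_residue = [
        [sum(w * v[d] for w, v in zip(weights, x)) for d in range(dim)]
        for weights in pair
    ]
    return per_residue, pair


# Hypothetical short sequence, used only to show the output shapes.
per_residue, pair = self_attention("MKTAYIAK")
print(len(per_residue), len(per_residue[0]))  # 8 4
print(len(pair), len(pair[0]))  # 8 8
```

The point is only the shape of the outputs: one vector per residue and one value per residue pair, both computed from a single sequence with no alignment required.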

Figure 1. Schematic illustration of MSA-based versus language-model-based protein structure prediction models.

Speed versus accuracy

Single-sequence models have shorter prediction runtimes than models that require MSAs as input. For example, ESMFold is reported to be up to 60 times faster than AlphaFold2 for short protein sequences [5], although the difference is smaller for long sequences. Lower computational cost, or higher speed, matters when protein structures must be predicted repeatedly. But if a database of precomputed structures is available, removing the need for on-the-fly prediction in applications like small molecule drug discovery, prediction speed and cost become less important. Inference speed and its dependence on protein size may also have implications for future protein engineering applications, notably those that use sequence optimization approaches, and for future models capable of predicting large multi-subunit protein structures.
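
A quick back-of-the-envelope calculation shows why speed matters mostly for repeated prediction. The per-structure runtime and the number of variants below are hypothetical placeholders, not measured benchmarks; only the 60-fold short-sequence ratio comes from the text above.

```python
def total_hours(n_sequences, seconds_per_structure):
    """Wall-clock hours to predict n_sequences one after another."""
    return n_sequences * seconds_per_structure / 3600


# Hypothetical placeholder: seconds per short protein for an MSA-based model.
MSA_SECONDS = 600.0
# Short-sequence speed-up factor quoted in the text.
SPEEDUP = 60
single_seq_seconds = MSA_SECONDS / SPEEDUP

# Hypothetical workload: one target predicted once vs. a sequence
# optimization loop that scores many candidate variants.
n_variants = 10_000

one_off_hours = total_hours(1, MSA_SECONDS)
msa_hours = total_hours(n_variants, MSA_SECONDS)
single_hours = total_hours(n_variants, single_seq_seconds)

print(f"one-off, MSA-based:      {one_off_hours:.2f} h")
print(f"10k variants, MSA-based: {msa_hours:.0f} h")
print(f"10k variants, single-seq: {single_hours:.0f} h")
```

Under these assumed numbers, a one-off prediction costs minutes either way, while a ten-thousand-variant loop differs by weeks of compute. That is the regime where single-sequence models have a practical edge.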

Databases or tools?

AlphaFold2, the most famous of these models, developed by Google’s DeepMind, entered drug discovery pipelines not only because of its high accuracy but also because predicted protein structures for multiple organisms are available in a public database. In a drug discovery process, for example one using small molecules, the protein structure is the starting point for a given disease-target protein pair, and structure prediction does not need to be repeated. Hence, the structures that are already available suffice for most computational drug discovery and design pipelines. Precomputed databases are particularly helpful for applications that use many protein structures. Possible applications include proteome-scale structural similarity tasks and the use of predicted protein structures as a feature embedding strategy for other predictive applications, such as protein residue characterization.
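
The "database first, tool second" pattern can be sketched as a simple lookup that only falls back to prediction on a miss. Everything here is a hypothetical stand-in: the directory plays the role of a precomputed structure database, and `fake_predict` stands in for an expensive model call; neither reflects the layout or API of any real database.

```python
import tempfile
from pathlib import Path


def get_structure(protein_id, database_dir, predict_fn):
    """Return (structure_text, source), preferring a precomputed file.

    source is "database" on a hit, or "predicted" when predict_fn had to run.
    """
    path = Path(database_dir) / f"{protein_id}.pdb"
    if path.exists():
        return path.read_text(), "database"
    structure = predict_fn(protein_id)
    path.write_text(structure)  # store the result so the next request is a lookup
    return structure, "predicted"


def fake_predict(protein_id):
    """Hypothetical stand-in for an expensive structure prediction call."""
    return f"REMARK placeholder structure for {protein_id}\n"


with tempfile.TemporaryDirectory() as db:
    first = get_structure("TARGET_A", db, fake_predict)
    second = get_structure("TARGET_A", db, fake_predict)

print(first[1], second[1])  # predicted database
```

Once a structure exists in the store, every later request is a file read, so the speed of the underlying model stops mattering for that target.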

Local versus global performance

Although one model may be more accurate than another on a shared benchmark such as the 14th Critical Assessment of protein Structure Prediction (CASP14), no single model necessarily beats all others across every application, protein class, and species. For example, there have been efforts to demonstrate the accuracy of RoseTTAFold on mutation effect prediction [6]. Hence, the intended application needs to be considered when models are assessed and chosen based on performance. Stated differently, the ‘best overall’ protein structure prediction tool may not be the ‘best for every task’.
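
The difference between "best overall" and "best per task" is easy to see with a small table of scores. The model names and numbers below are entirely made up for illustration; they are not benchmark results for any real tool.

```python
# Hypothetical accuracy scores per task (made-up numbers, higher is better).
scores = {
    "model_A": {"natural_proteins": 0.92, "orphan_proteins": 0.60, "mutation_effects": 0.62},
    "model_B": {"natural_proteins": 0.80, "orphan_proteins": 0.70, "mutation_effects": 0.58},
    "model_C": {"natural_proteins": 0.82, "orphan_proteins": 0.50, "mutation_effects": 0.75},
}


def best_overall(scores):
    """Model with the highest mean score across all tasks."""
    return max(scores, key=lambda m: sum(scores[m].values()) / len(scores[m]))


def best_per_task(scores):
    """Map each task to the model that scores highest on that task alone."""
    tasks = next(iter(scores.values())).keys()
    return {task: max(scores, key=lambda m: scores[m][task]) for task in tasks}


print(best_overall(scores))  # model_A
print(best_per_task(scores))
```

In this made-up table, model_A wins on average yet loses two of the three individual tasks, which is why a leaderboard ranking alone is a weak basis for choosing a model for a specific application.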

Functional considerations, such as inference requirements (i.e. MSA dependence), inference speed, database availability, and the ability to run predictive models on a local computer, will determine a model’s domain of application. We are excited to see what new tools and databases become available in the next few years, and how each improvement will address existing limitations and expand the overall usability and applicability of predicted protein structures.

References

  1. Jumper, John, et al. "Highly accurate protein structure prediction with AlphaFold." Nature 596.7873 (2021): 583-589.
  2. Baek, Minkyung, et al. "Accurate prediction of protein structures and interactions using a three-track neural network." Science 373.6557 (2021): 871-876.
  3. Wu, Ruidong, et al. "High-resolution de novo structure prediction from primary sequence." bioRxiv (2022).
  4. Fang, Xiaomin, et al. "HelixFold-Single: MSA-free protein structure prediction by using protein language model as an alternative." arXiv preprint arXiv:2207.13921 (2022).
  5. Lin, Zeming, et al. "Language models of protein sequences at the scale of evolution enable accurate structure prediction." bioRxiv (2022).
  6. Mansoor, Sanaa, et al. "Accurate Mutation Effect Prediction using RoseTTAFold." bioRxiv (2022).

Authors: Dr. Ali Madani, Director of Machine Learning and Dr. Stephen MacKinnon, Chief Platform Officer

 

Dr. Ali Madani, Director of Machine Learning


Ali develops new deep learning models to improve drug-target interaction prediction. He completed his Ph.D. in Computational Biology at the University of Toronto, developing new feature selection approaches from omics profiles of patient tumors that are predictive of their survival and their response to drugs.
