The Protein Universe: Structural Biology Has Entered Its Phylogenetic Era
Over the past year, DeepMind and EBI released a complete set of protein structure models covering the entire proteomes of 48 species, including humans, key research organisms and several additional species of global health interest. These data have tremendous potential in the development of new medicines, so long as their inherent limitations are factored into how they are used. While individual AlphaFold2 (AF2) structures may lack the atomistic detail and molecular context offered by experimental structure determination approaches (X-ray, NMR, cryo-EM), the sheer scope and completeness of this dataset offers an alternative form of value, particularly favoring proteome-scale applications.
This current DeepMind/EBI release extends well beyond the realm of medical relevance. With 200 million protein structures now released, there are roughly *one thousand* AF2-modeled structures for every experimentally determined structure, counting X-ray, cryo-EM, and NMR combined. At this time, the impact can only be imagined! Maybe this opens new opportunities in infectious diseases, pandemic readiness, agriculture, veterinary science or the pet industry. Perhaps it will even inspire new biomaterials based on the inter-species analysis of silk proteins from all sequenced spider genomes. Time will tell!
But inter-species analysis is really what this new release offers to the community. Now, rather than ask how a protein of interest behaves in human vs mouse, we can ask how it behaves in humans, chimps, gorillas, orangutans… as far down the evolutionary tree as is applicable to the question at hand. For instance, comparing light-sensitive opsin proteins among all vertebrates could offer structural insights into the evolution of color vision.
Prior to my life at Cyclica, my academic research had me scripting through hundreds of thousands of protein structures, documenting the fundamental rules of homomeric assembly. As a structural bioinformatician, I hope this tidal wave of new data inspires new basic science studies that challenge our fundamental understanding of proteins. Individual protein systems will now have access to a “phylogenetic” dimension of data. I’m eager to see how some hypothetical future 3D variation of a phylogenetic tree could inform our understanding of a protein's evolutionary trajectory.
Granted, any basic science study centered on the systematic analysis of AF2 data will inevitably spark a debate on the validity of interpreting these AI predictions as ground truth. Had this conversation happened last year, my answer would have been a resounding ‘no’. Having since worked directly with AF2 proteomes, observing successful applications and few artefacts, I admit I could be swayed.
If we accept that evolutionary co-variation is mostly a product of 3D spatial proximity, then the complete set of protein sequences that make up AF2’s input multiple sequence alignments (MSAs) should be considered a form of experimental data. Stated differently, the AF2 model effectively uses experimental sequencing data to identify long-range contacts and builds a conforming protein structure. A similar strategy is used by NMR for structure determination, where magnetic resonance is instead used to identify long-range contacts. Hybrid NMR + Rosetta approaches developed by the Baker group over the past 20 years have equally muddied the boundaries between structure prediction and experimental structure determination.
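To make the co-variation idea concrete, here is a minimal Python sketch that scores every pair of alignment columns by mutual information, a classical statistic for spotting positions that mutate together. This is an illustration only, not how AF2 works internally: AF2 learns from the MSA with a neural network rather than computing mutual information. The toy alignment and the function names are hypothetical, and a real analysis would also need sequence weighting and phylogenetic corrections that are left out here.

```python
import math
from collections import Counter
from itertools import combinations

# Hypothetical toy alignment (not real data): positions 1 and 3 always
# change together (K with D, R with E); the other positions do not.
msa = [
    "AKLDE",
    "AKLDE",
    "ARLEE",
    "ARLEE",
    "AKVDE",
    "ARVEE",
]

def column(alignment, i):
    """Return the residues found at position i across all sequences."""
    return [seq[i] for seq in alignment]

def mutual_information(col_a, col_b):
    """Mutual information (in bits) between two alignment columns."""
    n = len(col_a)
    freq_a = Counter(col_a)
    freq_b = Counter(col_b)
    freq_ab = Counter(zip(col_a, col_b))
    mi = 0.0
    for (a, b), count in freq_ab.items():
        p_joint = count / n
        p_independent = (freq_a[a] / n) * (freq_b[b] / n)
        mi += p_joint * math.log2(p_joint / p_independent)
    return mi

def covariation_scores(alignment):
    """Score every pair of columns; higher means stronger co-variation."""
    length = len(alignment[0])
    return {
        (i, j): mutual_information(column(alignment, i), column(alignment, j))
        for i, j in combinations(range(length), 2)
    }

scores = covariation_scores(msa)
best_pair = max(scores, key=scores.get)
print(best_pair, round(scores[best_pair], 3))
```

In this toy case the highest-scoring pair is the one that was built to co-vary, which is the kind of signal that, under the assumption stated above, hints at a spatial contact between the two positions.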
All experimentally determined structures are artificial representations of real biological systems. And yet, we accept them as valid structure models despite their flaws and artefacts, since we understand and embrace their limitations. Is AF2 really an AI-based structure prediction? Or AI-based structure determination? Is the distinction meaningful?