Dec 01 2020
AI Mostly Solves Protein Folding
I wasn’t planning on writing about artificial intelligence (AI) two days in a row, but I also can’t plan the news. I also couldn’t pass up this item – London-based AI company DeepMind has mostly solved the extremely difficult problem of protein folding. If you are not already familiar with this issue, this may not sound like a big deal, but it is. So first let’s give some background on the problem itself.
Biology is largely about proteins. Proteins are what genes code for; they make up enzymes, receptors, structural building blocks, antibodies, the basic machinery of cells, and more. Yes, lipids and carbohydrates are critical as well, but these are largely chaperoned by proteins. Proteins determine whether a cell is a liver or heart cell, and largely whether an organism is a human or a sea cucumber.
Proteins are composed of a sequence of amino acids, drawn from a repertoire of 20 different amino acids. The specific sequence of amino acids is determined by the GATC genetic code in DNA, with a three-letter code for each amino acid. But a protein is more than just a sequence of amino acids. By itself a long chain of amino acids is a polypeptide – it doesn’t become a protein until that long chain is folded into a unique three-dimensional shape. Predicting how a long chain of amino acids will fold into a precise shape is the protein folding problem.
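To make the three-letter code concrete, here is a minimal Python sketch of reading DNA three letters at a time and looking up the amino acid each triplet stands for. The codon assignments shown are from the standard genetic code, but the table is deliberately partial (the full table has 64 entries), and the function name and the example sequence are made up for illustration.

```python
# Partial codon table for illustration only; the full standard table has 64 entries.
CODON_TABLE = {
    "ATG": "Met",  # methionine, also the usual start signal
    "GGC": "Gly",
    "GCC": "Ala",
    "AAA": "Lys",
    "TGG": "Trp",
    "TAA": "Stop",
    "TAG": "Stop",
    "TGA": "Stop",
}


def translate(dna):
    """Read a coding-strand DNA string codon by codon until a stop codon."""
    dna = dna.upper()
    residues = []
    usable_length = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    for i in range(0, usable_length, 3):
        codon = dna[i:i + 3]
        residue = CODON_TABLE.get(codon)
        if residue is None:
            raise KeyError(f"codon {codon} is not in this partial table")
        if residue == "Stop":
            break
        residues.append(residue)
    return residues


# Hypothetical 18-letter sequence: five amino acids followed by a stop codon.
print(translate("ATGGGCGCCAAATGGTAA"))
```

Note that this only gets you the polypeptide, the flat list of links in the chain. It says nothing about the shape that chain will fold into, which is the hard part.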
This is more difficult than it may at first seem. Imagine a chain with each link one of 20 possible different shapes, and that chain can be hundreds or even thousands of links long (titin is the largest known protein; its human variant consists of 34,351 amino acids). The number of possible ways to fold the protein gets magnified with each additional link. The resulting number of possibilities is staggering – too many for even the most powerful computer to crunch through. How a sequence of amino acids actually folds is therefore determined mostly by direct laboratory study, using techniques such as X-ray crystallography and NMR spectroscopy. But this takes a long time – years for the largest proteins.
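A back-of-the-envelope calculation shows why brute force is hopeless. The sketch below assumes, purely for illustration, that each link in the chain can take one of three local conformations and that a computer can check a trillion conformations per second. Neither number comes from the research discussed here; they are placeholders chosen to show how fast the total grows. The result is returned as a power of ten because the raw numbers are too large to handle directly.

```python
import math

SECONDS_PER_YEAR = 365.25 * 24 * 3600


def log10_years_to_enumerate(n_residues, states_per_residue=3, checks_per_second=1e12):
    """Return log10 of the years needed to check every conformation.

    Assumes states_per_residue ** n_residues total conformations, which is
    a simplification made for illustration only.
    """
    log10_conformations = n_residues * math.log10(states_per_residue)
    return log10_conformations - math.log10(checks_per_second) - math.log10(SECONDS_PER_YEAR)


# A modest 100-link chain under the assumptions above.
print(f"100 links: about 10^{log10_years_to_enumerate(100):.0f} years")
```

Under these assumptions even a 100-link chain would take on the order of 10^28 years to enumerate, and every extra link multiplies that figure by three.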
But to be clear – it is not as if we have made no progress at all until now. We have actually made great progress, and the latest advance is just a large incremental step. But it is big enough to note. Computers can already make reasonable predictions of folding for some proteins. There are also three different aspects of the protein folding problem: how do specific amino acids affect folding, what is the mechanism of protein folding, and how to predict the final folding pattern from the amino acid sequence alone. We have largely solved the first two, and are now working on the third part.
On that score there has, in fact, been a challenge since 1994, much like the Turing test for AI – the CASP (Critical Assessment of Techniques for Protein Structure Prediction). Every two years the challenge takes 100 proteins whose structure was determined through gold standard laboratory techniques, and the task is to match that structure by using a computer to predict it just from the amino acid sequence. It is this challenge that the current news is about. The DeepMind program scored 90 or better (on a 0-100 scale) on two-thirds of the proteins, and scored well on the rest. A score of 90 or better is considered on par with laboratory techniques.
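For a sense of how a 0-100 score like this can be computed, here is a simplified Python sketch in the spirit of GDT_TS, the global distance test that CASP uses as its headline measure. It averages the percentage of positions in the predicted structure that land within 1, 2, 4, and 8 angstroms of the experimental structure. Two loud assumptions: the two structures are taken to be already superimposed (the real measure searches for the best superposition), and the coordinates in the example are made up, not CASP data.

```python
import math


def gdt_ts_like(predicted, experimental, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Average, over the cutoffs, of the percent of positions within each cutoff.

    Both inputs are equal-length lists of (x, y, z) coordinates in angstroms
    and are assumed to be pre-aligned, which the real GDT_TS does not assume.
    """
    if not predicted or len(predicted) != len(experimental):
        raise ValueError("structures must be non-empty and the same length")
    distances = [math.dist(p, q) for p, q in zip(predicted, experimental)]
    fractions = [sum(d <= c for d in distances) / len(distances) for c in cutoffs]
    return 100.0 * sum(fractions) / len(cutoffs)


# Made-up four-position example: errors of 0.5, 1.5, 3.0, and 9.0 angstroms.
experimental = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
predicted = [(0.5, 0.0, 0.0), (3.8, 1.5, 0.0), (7.6, 0.0, 3.0), (11.4, 9.0, 0.0)]
print(gdt_ts_like(predicted, experimental))  # 56.25
```

A perfect prediction scores 100 in this scheme, and a score of 90 means that nearly every position sits within an angstrom or two of where the laboratory measurement puts it.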
Looking at the results from CASP-13 in 2018, the best results are all in the 60-73 range, so this is a significant improvement. It’s also interesting to note that thousands of models are submitted by hundreds of labs around the world, and these are only the best results. So yes, this is an important threshold achievement.
Why should the general public care about all this? Granted, this kind of research is a couple of steps removed from direct applications that will benefit the public, but this is no different than the human genome project, or research breakthroughs into stem-cell technology, or similar major basic science advances. This kind of breakthrough will help up and down the research chain. In fact, it will accelerate biological research itself. These are often the most impactful scientific breakthroughs that don’t get the attention they deserve, because they make research itself faster, cheaper, and more efficient. But only lab jockeys really appreciate their utility and impact.
In this case, there is also a translational research implication that has more obvious benefits. We can now sequence genes very quickly and cheaply, and from that sequence we can easily translate the exact sequence of amino acids. Predicting how that polypeptide will fold into a functional protein now closes the loop – we can go fully from a genetic sequence to a protein. This will help in virtually every aspect of biological research, given the long list of critical functions that proteins carry out. This will help us better understand diseases, how drugs interact with cells, and how viruses invade bodies and interact with the immune system. We are also learning that some diseases, like Alzheimer’s disease, may ultimately be a problem of misfolded proteins.
But don’t expect a flood of innovations tomorrow. That kind of hype surrounding the human genome project was ultimately, in my opinion, harmful to public perception of science. These kinds of basic breakthroughs take years and even decades to percolate through the system and produce tangible benefits. We are now reaping the benefits of the human genome project, and are still waiting for many stem-cell breakthroughs in the clinic two decades later. But the benefits will come. For now it is enough to celebrate a milestone achievement in basic science. It also reinforces the potential benefits of AI and deep learning – this is definitely another win in the AI column.