If you thought about each of our cells like a miniature city, our proteins would play every role from transportation to housing, from banking to agriculture. They have complex three-dimensional structures and they do pretty much everything in our cells. I’d argue that our proteins should receive more attention than our genome because our genome is ironically the simplest parts of who we here. As information propagates through the transcriptome (RNA) to the proteome, our cells get more and more complex, and that’s why protein structures are crucial to try and decipher this complexity.
Of course — one of the best ways to understand protein function is through the structure. Yet this protein folding problem is something that we’ve been struggling to solve for decades. If you allowed a protein with an average number of amino acid residues to fold, the protein would take an eternity (no exaggerations) to sample all possible conformations (If you’re more interested in this, check out something called Levinthal’s paradox (it essentially states that there are an astronomical number of protein conformations)). It’s like having a super complex math problem and trying all possible values on a continuous spectrum from one to infinity, hoping to eventually land somewhere.
Thankfully, we don’t use this approach to protein structure determination, namely because, similar to a math problem, we do know partially how to solve the protein structure problem. There are two main approaches.
The physics-based approach
The physics-based approach uses what is called “force fields.” Here’s essentially the equation for a force field:
UCHARMM = U(bonded) — U(non-bonded)
The U bonded term represents the energy between two bonded atoms whereas the U(non-bonded term) represents what is known as van’ der Waals forces, temporary dipole-dipole electrostatic attractions between two atoms that aren’t bonded.
Each of these two energy terms breaks up into more subtle but important subsections.
U(bonded) = U(bond) + U(angle) + U(UB) + U(dihedral) + U(improper) + U(CMAP)
The Non-bonded Parameters
The non-bonded terms consist of partial dipole-dipole attractions, dipole-dipole repulsions, and full electrostatic attractions based on Columb's law.
The Statistics based approach
Rather than using physics and the exact distance, position, and angle between each pair of amino acid residues, Rosetta, a protein folding software developed by the Baker Lab and other collaborators at the Institute for Protein Design instead uses statistics to model proteins. In other words, given a certain set of residues with a certain set of parameters, what’s the probability of seeing a set of amino acids occupy a certain conformation?
Think about it if you saw a ball at the top of a ramp, you could predict that it would roll down the ramp given your basic intuitions about gravity and synthesized information from previous experiences. In other words, you wouldn’t need to model the exact position of every atom composing the ball, the ramp, and the surrounding molecules in the air.
Rosetta and other statistical modeling software use this same kind of “intuition,” which allows it to make more assumptions as compared to a force field, thus allowing for quicker and more convenient calculations.
You can think of the initial polypeptide sequence as the ball starting at the top of the ramp, and the action of it rolling down the ramp as energy minimization. The goal is to naturally let the protein fold so as to minimize energy to achieve the lowest free energy.
The way that Rosetta determines the energy of a structure, or where the ball is in the hilly landscape is through different scoring terms. These scoring terms can be attributed to different aspects of chemical bonding and interactions.
Some examples of scoring terms that you would see in Rosetta:
- fa_atr, fa_rep → These two terms represent the attraction and repulsion energy terms
Other energy terms include solvation energy, which is calculated by taking the standard free energy of the solvent-exposed structure when all groups are exposed and subtracting the interactions between groups so that groups that aren’t solvent-exposed are taken into consideration.
Another approach to protein modeling, which goes hand in hand with Rosetta is molecular dynamics simulations (MD). Molecular dynamics does exactly what it sounds like — it simulates the motion of different molecules.
The equations below show how these molecular dynamics simulations work:
The first equation is relatively straightforward — it tells us that the new position of a particular atom is determined by its last position and the velocity (similar to distance = rate * time).
The second equation shows us how to calculate the new velocity of the atom. Further, this is calculated by taking the negative derivative of the potential energy.
Think about it this way, the more negative the instantaneous rate of the decrease of potential energy with respect to changes in x, the faster x can move to reach the potential energy minimum. (The more negative the derivate of the potential energy function (U), a larger value is added to the initial velocity (negative * negative = positive).
The only problem with this approach is that structures can get situated in potential energy minima — that is, they have reached a low point in energy compared to their immediate surroundings in the energy landscape, but there is a lower point in which the structure could be situated.
The way to avoid this process is through what’s known as simulated annealing.
Let’s take a small detour to understand exactly how simulated annealing functions. When I was a kid, I used to do origami. And when folding complex structures in origami, sometimes you actually need to unfold the structure in order to make more creases, so that when you fold the structure back up, the creases are waiting there for you, allowing you to fold the structure into something even more complex.
That’s exactly what happens with simulated annealing. We simulate a higher temperature in a process called Metropolis simulated annealing, and this allows the folded protein structure to escape low energy minima.
Remember the ball going down the ramp? This process allows the ball to go back up the ramp, with the hopes that it will find an even steeper downhill path.
You might be thinking — what if the protein structure temperature gets so high that the structure never makes it way downhill? The way to avoid this is through implementing an odds ratio. Odds ratios basically set the probability that the protein will occupy a higher energy state.
Here’s the equation:
As you can see, an increased temperature means a decrease in the magnitude of the negative exponent to which e is raised. Therefore, a larger temperature value will result in a smaller exponent, which would thereby result in a larger probability of sampling a test point in the energy landscape.
Rosetta AbInitio Demo
Rosetta is a biophysical modeling software that uses simulations and scoring functions to optimize structures. In this project, I used what are known as ab initio protein folding methods to uncover the three dimensional structure of the nsp9 RNA binding domain of SARS-COV2 given solely sequence information.
The way that this is done is through fragment library generation. Fragment libraries software .gz files containing hundreds of thousands of 3–9 length amino acid sequences. These sequences also have their own unique secondary structures and thus can be used to approximate the secondary and ultimately tertiary structures of the proteins themselves.
The first thing that we do in rosetta in order to generate these fragments is submit a set of files including our protein sequence FASTA file, the secondary structure prediction (each amino acid is characterized as belonging to an alpha helix, beta pleated sheet, and loop regions) file, and the actual pdb file (to see how our generated protein structures fare in comparison to the actual protein structure.
My inputted options file for fragment generation. I generated 3 and 9 amino acid long fragments (1000 in total) using the secondary structure prediction file in addition to the sequence and protein database file (for benchmarking and validation)
In the file above you can that there is a “BestFragmentsProtocols/input_files/simple.wghts” file. Essentially, fragment picking involves multiple scoring metrics, each of which correlates to a different method of secondary structure and protein geometry evaluation. The weights file assigns weights to each method and you can use it to consider one type of evaluation over another.
Here you can see that there are three types of scores. There’s the RamaScore, SecondarySimilarity, and Fragment Crmsd.
The RamaScore metric essentially evaluates the number of identical residues between the target sequence and chosen fragment.
The Secondary Similarity metric does exactly what you think it doe — it compares the secondary structures of the fragments to the inputted ss2 file. Finally the fragment Crmsd essentially takes the root mean square deviation of the alpha carbon atom on each individual amino acid.
Now, I was able to submit the task to the Fragment picker in rosetta and get two fragment files out.
This is what one of the outputted fragment files look like after the fragment picker has done its job. Let’s go over what each row of each line means. The “vall_pos” argument gives the position of each fragment in relation to the original pdb fasta sequence. A pdbid and fragment id is assigned to each fragment and the values for each of the three metrics along with a total weighted score (a linear combination of each of the individual scores).
Now that we have the fragments, we can now conduct ab initio folding procedures. Ab initio folding is conducted by taking each of the fragments and constructing them together to approximate the secondary and tertiary structures of a protein.
Ab initio folding uses monte-carlo simulations to simulate protein folding. It can output thousands of files for a particular protein, each with a differing root mean square deviation (rmsd) when compared to the protein target structure (in my case the 6W4B file, representing nsp9 RBD (RNA binding domain). In some cases, the exact protein structure isn’t available, and in this case, you use what is known as homologous structures to approximate the general structure of the protein.
What’s known as a folding funnel is present when the RMSD goes down with the energy of the protein structure (a structure should be more negative as the structure comes closer to approximating the structure of the native protein).
Given my limited computational resources, I was able to fold a structure with an RMSD of ~6 angstroms when compared to the native nsp9-RBD pdb file. (This was the best closest structure to the native structure out of 65 generated structures).
In the past few years, however, we’ve begun to use deep learning to make our fragment libraries, and this has helped us surpass the ability of biophysical models to conduct fragment assemblies and make models with higher levels of accuracy to the native structure and lower root mean square deviations.
One such model is DeepFragLib, which a machine learning model utilizing what are known as bidirectional LSTM models to make predictions about the best fragments to represent a particular protein structure.
In addition to the classifier, there is also a regression model that takes in a given set of fragments and outputs a predicted root mean square deviation to estimate the quality of structures outputted given a set of fragments. The regression model is ResneXt, which is a modified version of ResNet.
The BLSTMs essentially conduct binary classification on the fragment library of interest, and each model is trained on particular fragment size.
DeepFragLib workflow. The CLA models are the classification BLSTMs that sample fragments and decide which ones are most representative of a given structure.
The authors of the paper used two major tricks including cyclic dilation and knowledge distillation.
Cyclic dilation occurs when there is spacing between the units in a convolutional kernel.
Notice how the “receptive field” or the area covered by the green 3 by 3 grid projected onto the blue convolutional feature space covers a larger area on the blue grid than a 3 by 3 space of the same size as the green square.
This allows the convolutional filter to gain access to and capture patterns in the entire sequence as opposed to a particular region. This is essentially what the scientists did in order to distinguish near-native fragments from decoy fragments in the ResneXt based rmsd prediction model.
Knowledge distillation takes a given neural network, with it’s millions of parameters, nodes, and edges, and condenses
The results of the DeepFragLib fragment picker (red) as opposed to other algorithms
In the above visual, you can see that DeepFragLib provides higher values as compared to the other fragment pickers for the three metrics of choice (precision, position averaged precision, and coverage).
Coverage is the number of residues that are captured by a near native fragment in the fragment library. Precision is the proportion of native fragments in the library, independent of the query sequence itself. Finally, position averaged precision is the average of the proportion of native sequences for each given residue starting point.
A set of distributions showing how the DeepFragLib library compares when compared to other protein fragment picking softwares (e.g. NNMake from Rosetta).
The equation for TM-score, when L(target) is the length of amino acid sequences, L(aligned) is the amount of overlap between the two sequences (likely the length of the shorter sequence if no overhang), and d0(L(target)), a constant used for normalization.
TM-score is the best ultimate method for determining how well a fragment picking algorithm does as it shows how close the tertiary structure of the generated ab initio folded protein resembles the native structure (or the homologous structure most similar in sequence to the native structure).
(Images in this section are from the DeepFragLib paper)
Why is this Important?
When COVID-19 first came to light, we didn’t have any protein structures for the key viral players in it’s life cycle, including proteases (that chop up RNA for processing) along with helicases and polymerases that allow for the transcription of the RNA. Getting these structures took weeks — and we couldn’t sit still while the virus claimed thousands of lives.
Rosetta allowed scientists to model these proteins: since we already have genome sequencing technology that allows scientists to get the genome sequence of the virus in a matter of minutes to hours at maximum, we could translate the genetic sequence to a protein sequence given the codon code discovered decades ago.
And that’s where ab initio comes in. Given these protein sequences, scientists were able to get somewhat accurate representations of the three-dimensional structure of proteins — giving scientists a headstart towards understanding the biology of the virus, which would ultimately go on to claim more than one million lives.
Nevertheless, the structures that we’re getting right now aren’t the most accurate and don’t suffice for drug discovery and that’s why we still have to ultimately resort to traditional methods like x-ray diffraction as opposed to these more technologically advanced methods.
If you’d like to discuss any of the topics above, I’d love to get in touch with you! I’m actually currently trying to optimize ML models to make predictions about efficient aptamer drugs for coronavirus protein targets (N protein). (Send me an email at firstname.lastname@example.org or message me on LinkedIn) Also feel free to check out my website at mukundhmurthy.com.
Please sign up for my monthly newsletter here if you’re interested in following my progress :)