Ab Initio Protein Folding

An overview of some current methods for taking a protein’s sequence and getting its three-dimensional structure

Mukundh Murthy
May 10, 2020

Proteins do pretty much everything in our cells, from moderating regulatory mechanisms to catalyzing important metabolic and cellular reactions. Therefore, acquiring structures for these proteins is super important. However, conventional methods for protein structure determination are extremely time- and cost-intensive. X-ray diffraction, the more traditional method, can take weeks to months to yield a high-resolution structure because of the inherent inefficiencies in the procedures used to prepare proteins for crystallization.

In the past decade, we’ve seen a rapid increase in the number of computational tools for structure prediction. The most prominent of these is Rosetta, a biophysical modeling tool that can run ab initio protein folding simulations (that is, it takes a sequence as input and outputs a three-dimensional protein structure file). These kinds of computational tools are extremely valuable because they let us simulate and acquire protein structures without all of the resources required by more traditional physical methods.

In the next few articles, I’ll be going through a few tutorials on the different in silico methods for protein structure prediction, covering both purely biophysical methods (e.g. Rosetta) and more cutting-edge deep learning methods.

Rosetta is a biophysical modeling software suite that uses simulations and scoring functions to optimize structures. In this project, I used ab initio protein folding methods to predict the three-dimensional structure of the nsp9 RNA binding domain of SARS-CoV-2 given solely sequence information.

The way that this is done is through fragment library generation. Fragment libraries are .gz files containing hundreds of thousands of short amino acid sequences, 3 to 9 residues long. Each of these fragments comes with its own known secondary structure, so fragments can be used to approximate the secondary and ultimately the tertiary structure of the target protein.
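To make the idea of a fragment concrete, here is a minimal sketch that slides 3- and 9-residue windows along a query sequence. These windows are the positions that a fragment library has to supply candidates for. The sequence and the function name `fragment_windows` are made up for illustration; this is not Rosetta code.

```python
def fragment_windows(sequence, size):
    """Return (start_position, window) pairs for every window of `size` residues."""
    return [
        (start + 1, sequence[start:start + size])  # 1-based start positions
        for start in range(len(sequence) - size + 1)
    ]


# Hypothetical 12-residue query, for illustration only (not the nsp9 sequence).
query = "MKTAYIAKQRQI"
three_mers = fragment_windows(query, 3)
nine_mers = fragment_windows(query, 9)

print(len(three_mers), "3-residue windows;", len(nine_mers), "9-residue windows")
```

A real fragment picker then fills each window with many candidate fragments taken from known structures, rather than just slicing the query itself.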

The first thing that we do in Rosetta to generate these fragments is submit a set of files: our protein sequence FASTA file, the secondary structure prediction file (in which each amino acid is classified as belonging to an alpha helix, a beta pleated sheet, or a loop region), and the actual PDB file (to see how our generated protein structures fare in comparison to the actual protein structure).

My options file for fragment generation. I generated 3- and 9-amino-acid-long fragments (1000 in total) using the secondary structure prediction file in addition to the sequence and the PDB (Protein Data Bank) file (for benchmarking and validation).

In the file above, you can see that there is a “BestFragmentsProtocols/input_files/simple.wghts” file. Essentially, fragment picking involves multiple scoring metrics, each of which corresponds to a different way of evaluating secondary structure and protein geometry. The weights file assigns a weight to each metric, and you can use it to prioritize one type of evaluation over another.

Here you can see that there are three types of scores: RamaScore, SecondarySimilarity, and FragmentCrmsd.

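The RamaScore metric essentially evaluates how well each fragment’s backbone torsion angles (phi and psi) fit the Ramachandran distribution expected for the predicted secondary structure at that position. (Counting identical residues between the target sequence and the chosen fragment is the job of a separate sequence identity score.)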

The SecondarySimilarity metric does exactly what you think it does: it compares the secondary structures of the fragments to the inputted ss2 file. Finally, FragmentCrmsd essentially takes the root mean square deviation between the alpha carbon atoms of the fragment and the corresponding alpha carbon atoms in the reference structure.
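As a rough illustration of the alpha carbon rmsd, here is a minimal sketch in plain Python. It assumes the two sets of coordinates are already superimposed (a real comparison first finds the best rotation and translation), and the coordinates and the name `ca_rmsd` are made up for the example.

```python
import math


def ca_rmsd(coords_a, coords_b):
    """RMSD between two equal-length lists of (x, y, z) alpha carbon coordinates.

    Assumption: the two coordinate sets are already superimposed.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))


# Made-up coordinates for a 3-residue fragment and its native counterpart.
fragment = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
native = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 3.0, 0.0)]

print(round(ca_rmsd(fragment, native), 3))  # one atom is 3 angstroms off
```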

Now, I was able to submit the task to the fragment picker in Rosetta and get two fragment files out.

This is what one of the outputted fragment files looks like after the fragment picker has done its job. Let’s go over what each column means. The “vall_pos” column gives the position of each fragment within its source protein in the vall, the database of known structures that fragments are picked from. Each fragment is also assigned a pdbid and a fragment id, and each line lists the values for each of the three metrics along with a total weighted score (a linear combination of the individual scores).
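The total weighted score is just a weighted sum, which can be sketched in a few lines. The weights and per-fragment scores below are hypothetical placeholders, not the contents of simple.wghts or of my output file, and I am assuming that a lower total means a better fragment.

```python
def total_score(scores, weights):
    """Weighted sum of per-metric scores; metrics without a weight count as 0."""
    return sum(weights.get(name, 0.0) * value for name, value in scores.items())


# Hypothetical weights and per-fragment scores, for illustration only.
weights = {"RamaScore": 1.0, "SecondarySimilarity": 0.5, "FragmentCrmsd": 0.0}
fragments = {
    "frag_a": {"RamaScore": 2.0, "SecondarySimilarity": 4.0, "FragmentCrmsd": 1.2},
    "frag_b": {"RamaScore": 1.0, "SecondarySimilarity": 3.0, "FragmentCrmsd": 0.4},
}

# Assumption: lower total score = better fragment.
ranked = sorted(fragments, key=lambda name: total_score(fragments[name], weights))
print(ranked)
```

Setting a weight to zero, as done for FragmentCrmsd here, keeps a metric in the report without letting it influence which fragments are picked.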

Now that we have the fragments, we can conduct the ab initio folding procedure. Ab initio folding works by taking the fragments and assembling them to approximate the secondary and tertiary structure of the protein.

Ab initio folding uses Monte Carlo simulations to simulate protein folding. It can output thousands of structures for a particular protein, each with a different root mean square deviation (rmsd) from the target structure (in my case the 6W4B file, representing the nsp9 RBD, or RNA binding domain). In some cases, the exact protein structure isn’t available, and then you use what are known as homologous structures to approximate the general structure of the protein.
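To give a feel for what a Monte Carlo search does, here is a toy sketch. It assumes the Metropolis acceptance rule (always accept moves that lower the energy, sometimes accept moves that raise it), and it replaces fragment insertion with a random step on a one-dimensional made-up energy landscape. The function names and parameters are mine, and Rosetta’s actual protocol is far more elaborate.

```python
import math
import random


def metropolis_accept(delta_energy, temperature, rng):
    """Accept downhill moves always; uphill moves with probability exp(-dE/T)."""
    if delta_energy <= 0:
        return True
    return rng.random() < math.exp(-delta_energy / temperature)


def toy_search(energy, start, steps, temperature, seed=0):
    """Random walk on integers that keeps moves passing the Metropolis test."""
    rng = random.Random(seed)
    current = best = start
    for _ in range(steps):
        candidate = current + rng.choice((-1, 1))  # stand-in for a fragment insertion
        if metropolis_accept(energy(candidate) - energy(current), temperature, rng):
            current = candidate
            if energy(current) < energy(best):
                best = current
    return best


# Toy energy landscape with its minimum at x = 3 (purely illustrative).
best = toy_search(lambda x: (x - 3) ** 2, start=20, steps=2000, temperature=1.0)
print(best)
```

Occasionally accepting uphill moves is what lets the search escape local minima, which matters a lot on a real protein energy landscape.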

What’s known as a folding funnel is present when the rmsd goes down along with the energy of the protein structure (the energy score should become more negative as a structure comes closer to the structure of the native protein).
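One simple way to check for a funnel is to see whether rmsd and energy are positively correlated across the generated structures. The sketch below does that with made-up (rmsd, score) pairs; they are not my actual results, and the helper name `pearson` is just for this example.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up (rmsd in angstroms, energy score) pairs, for illustration only.
decoys = [(2.1, -150.0), (4.0, -120.0), (6.3, -95.0), (9.8, -60.0), (12.5, -40.0)]
rmsds, scores = zip(*decoys)

funnel_correlation = pearson(rmsds, scores)
lowest_energy_decoy = min(decoys, key=lambda decoy: decoy[1])
print(round(funnel_correlation, 2), lowest_energy_decoy)
```

A correlation close to 1 means the lowest-energy structures are also the ones closest to the native structure, which is the funnel shape you hope to see.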

A protein alignment generated in PyMOL with ~6.0 Å rmsd

Given my limited computational resources, I was able to fold a structure with an rmsd of ~6 angstroms when compared to the native nsp9 RBD PDB file. (This was the closest structure to the native structure out of 65 generated structures.)

In the past few years, however, we’ve begun to use deep learning to build our fragment libraries. This has helped us surpass purely biophysical approaches to fragment assembly and produce models that are closer to the native structure, with lower root mean square deviations.

One such model is DeepFragLib, a machine learning model that uses bidirectional LSTMs (BLSTMs) to make predictions about the best fragments to represent a particular protein structure.

In addition to the classifier, there is also a regression model that takes in a given set of fragments and outputs a predicted root mean square deviation to estimate the quality of structures outputted from that set of fragments. The regression model is based on ResNeXt, a modified version of ResNet that splits each residual block into several parallel convolutional paths whose outputs are added together.

The BLSTMs essentially conduct binary classification on the fragment library of interest, and each model is trained on a particular fragment size.

DeepFragLib workflow. The CLA models are the classification BLSTMs that sample fragments and decide which ones are most representative of a given structure.

The authors of the paper used two major tricks: cyclic dilation and knowledge distillation.

Dilation occurs when there is spacing between the units in a convolutional kernel, so the kernel samples its input at intervals rather than at adjacent positions.

Notice how the “receptive field”, the area that the green 3 by 3 kernel covers when projected onto the blue feature map, is larger than the area an ordinary 3 by 3 kernel with no gaps would cover.

This allows the convolutional filter to gain access to and capture patterns across the entire sequence as opposed to a particular region. This is essentially what the scientists did in order to distinguish near-native fragments from decoy fragments in the ResNeXt-based rmsd prediction model.
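Here is a minimal one-dimensional sketch of a dilated convolution, since protein sequences are one-dimensional even though the picture above is 2D. It is a plain-Python illustration of the general technique, not the DeepFragLib implementation, and the signal, kernel, and function names are made up.

```python
def receptive_field(kernel_size, dilation):
    """Number of input positions spanned by one dilated kernel."""
    return dilation * (kernel_size - 1) + 1


def dilated_conv1d(signal, kernel, dilation=1):
    """'Valid' 1D convolution (no kernel flip) with `dilation` spacing between taps."""
    span = receptive_field(len(kernel), dilation)
    return [
        sum(w * signal[start + i * dilation] for i, w in enumerate(kernel))
        for start in range(len(signal) - span + 1)
    ]


signal = [1, 2, 3, 4, 5, 6, 7]
kernel = [1, 0, -1]

dense = dilated_conv1d(signal, kernel, dilation=1)   # taps at offsets 0, 1, 2
sparse = dilated_conv1d(signal, kernel, dilation=2)  # taps at offsets 0, 2, 4

print(receptive_field(3, 1), dense)
print(receptive_field(3, 2), sparse)
```

The same three weights span 3 positions without dilation and 5 positions with a dilation of 2, so stacking dilated layers widens the view of the sequence without adding parameters.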

Knowledge distillation takes a given neural network, with its millions of parameters, nodes, and edges, and condenses it into a smaller network that is trained to reproduce the larger network’s outputs.


The results of the DeepFragLib fragment picker (red) as opposed to other algorithms

In the above visual, you can see that DeepFragLib provides higher values as compared to the other fragment pickers for the three metrics of choice (precision, position averaged precision, and coverage).

Coverage is the fraction of residues in the query that are covered by at least one near-native fragment in the fragment library. Precision is the proportion of near-native fragments in the library as a whole, regardless of where they fall in the query sequence. Finally, position averaged precision is the proportion of near-native fragments computed separately for each residue starting point and then averaged over those starting points.
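These three metrics can be sketched in a few lines of Python. I am assuming a fragment counts as near-native when its rmsd falls below some cutoff; the 1.5 angstrom cutoff, the tiny library, and the function name `library_metrics` are all made up for illustration and are not taken from the paper.

```python
from collections import defaultdict


def library_metrics(fragments, sequence_length, cutoff):
    """fragments: list of (start, length, rmsd) tuples with 1-based start positions."""
    near_native = [frag for frag in fragments if frag[2] < cutoff]
    precision = len(near_native) / len(fragments)

    # Residues spanned by at least one near-native fragment.
    covered = set()
    for start, length, _ in near_native:
        covered.update(range(start, start + length))
    coverage = len(covered) / sequence_length

    # Precision per starting position, then averaged over positions.
    by_start = defaultdict(list)
    for start, _, rmsd in fragments:
        by_start[start].append(rmsd < cutoff)
    per_position = [sum(flags) / len(flags) for flags in by_start.values()]
    position_averaged_precision = sum(per_position) / len(per_position)

    return precision, coverage, position_averaged_precision


# Hypothetical library: four fragments at position 1, two at position 4.
library = [(1, 3, 0.5), (1, 3, 2.0), (1, 3, 2.5), (1, 3, 3.0), (4, 3, 0.8), (4, 3, 0.9)]
precision, coverage, pap = library_metrics(library, sequence_length=10, cutoff=1.5)
print(precision, coverage, pap)
```

Notice that precision and position averaged precision differ here because position 1 has more fragments than position 4, so it dominates the plain average but not the per-position one.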

A set of distributions showing how the DeepFragLib library compares to other protein fragment picking software (e.g. NNMake from Rosetta).
The equation for TM-score, where L(target) is the length of the target amino acid sequence, L(aligned) is the number of aligned residues between the two structures (likely the length of the shorter sequence if there is no overhang), and d0(L(target)) is a constant used for normalization.
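For reference, the standard definition of TM-score can be written out as follows. The symbol d_i, which the caption above does not mention, is the distance between the i-th pair of aligned residues after superposition, and the maximum is taken over possible superpositions of the two structures.

```latex
\mathrm{TM\text{-}score} = \max\left[ \frac{1}{L_{\text{target}}} \sum_{i=1}^{L_{\text{aligned}}} \frac{1}{1 + \left( \dfrac{d_i}{d_0(L_{\text{target}})} \right)^{2}} \right],
\qquad
d_0(L_{\text{target}}) = 1.24 \sqrt[3]{L_{\text{target}} - 15} - 1.8
```

The score lies between 0 and 1, with 1 meaning a perfect match, and the length-dependent d0 keeps scores comparable between proteins of different sizes.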

TM-score is ultimately the best measure of how well a fragment picking algorithm does, as it shows how closely the tertiary structure of the ab initio folded protein resembles the native structure (or the homologous structure most similar in sequence to the native structure).

Why is this Important?

When COVID-19 first came to light, we didn’t have any protein structures for the key viral players in its life cycle, including proteases (which chop up the viral polyprotein into functional proteins) along with the helicases and polymerases that allow for the replication and transcription of the viral RNA. Getting these structures took weeks, and we couldn’t sit still while the virus claimed thousands of lives.

Rosetta allowed scientists to model these proteins. Since we already have sequencing technology that can produce the genome sequence of the virus in a matter of minutes to hours, we could translate the genetic sequence into protein sequences using the genetic code (the codon table) worked out decades ago.

And that’s where ab initio folding comes in. Given these protein sequences, scientists were able to get somewhat accurate representations of the three-dimensional structures of the proteins, giving them a head start towards understanding the biology of the virus, which would ultimately go on to claim more than one million lives.

Nevertheless, the structures that we’re getting right now aren’t the most accurate and don’t suffice for drug discovery. That’s why we still ultimately have to resort to traditional methods like X-ray diffraction rather than relying on these computational methods alone.

Thanks for reading! Feel free to check out my other articles on Medium and connect with me on LinkedIn!

If you’d like to discuss any of the topics above, I’d love to get in touch with you! I’m actually currently trying to optimize ML models to make predictions about efficient drugs for coronavirus protein targets. (Send me an email at mukundh.murthy@icloud.com or message me on LinkedIn)

Please sign up for my monthly newsletter if you’re interested in following my progress, and feel free to check out my website at mukundhmurthy.com


