Repurposing LSTMs to Generate Potential Leads for the Coronavirus
If you haven’t already, I would highly recommend reading my primer article on the coronavirus here.
This past few weeks, the coronavirus, which originally came from a seafood market in Wuhan, China, has been dominating our headlines. It’s starting to impact us more close to home and schools and offices close down. More generally, the virus has stealthily spread to almost 137,000 people and has collectively killed 5,000 deaths across the world. The number of cases in the United States is on an exponential rise with more than 100 confirmed cases in Washington, New York, California, and Massachusetts. Schools and businesses have shut down, the Dow Jones has collapsed to a record low since the 1987 stock market class, and sports institutions like the NBA have canceled games for the first time since almost a century. Experts have been advising that we stockpile food and sanitizing equipment as they predict a large scale pandemic that could result in a quarantine situation for the majority of the population. The urgency and magnitude of the situation is further revealed by the recent funneling of 1.5 trillion dollars from the Federal Reserve into financial securities. The worst part is — it looks like the pandemic might only get worse since we don’t have a vaccine or small molecule drug to cure the virus.
Now here’s the million dollar question: What can we do?
When the ebola virus his, around 10,000 people died in Western Africa. The only reason that we were able to stop the Ebola Virus’s spread was because it could only spread through direct contact. In other words, we were unprepared, we got lucky, and we didn’t have an organized and prepared system to fight the virus.
According to long-term futurists, our goal as millennials in the 21st century is to elevate the human race to a state of divinity almost, through the merger of infotech and biotech. Nevertheless, we always knew that more epidemics were coming, and that our unpreparedness could lead to large amounts of decimation and unexpected global financial consequences.
In his 2015 TED talk “The next outbreak? We’re not ready,” Bill Gates asserts that “The failure to prepare could allow the next epidemic to be dramatically more devastating than Ebola” and that “If anything kills over 10 million people in the next few decades, it’s most likely to be a highly infectious virus rather than a war.”
What’s one way to show that we’re more prepared this time?: Through Developments in Exponential Technology
Over the past few weeks, I’ve seen multiple developments in AI that are able to use inputs such as CT scans to different between positive test cases for the coronavirus and other similar respiratory tract infections. There was an actually an amazing webinar recently, which showed the successful development of similar algorithms and models which showed extremely high rates of sensitivity (true positive rate, around 94%) and specificity (true negative rate, around 99%) for classify COVID-19 along with other diseases. However, there is a caveat: viral testing is the only specific diagnosis method, as CT scans of patients with the coronavirus overlap with those of patients with the flu, H1N1, SARS, and MERS.
Due to the nature of the infodemic alongside the simultaneous pandemic, many people don’t know that the this isn’t the first time that humanity has dealt with “the coronavirus.” In fact, the “coronavirus” represents a large umbrella over subclasses of viruses that cause the common cold. And we have had previous outbreaks of novel strains in the past.
The SARS coronavirus strain appeared in China in the year 2002, but was actually very quickly and efficiently through top down government and community quarantines that limited human-human spread. Nevertheless, this year marked our starting point for our discovery of drugs that cure any coronavirus strain-related symptoms. Although we never finished it, we began the drug discovery process for SARS targets, and through biochemical potency assays and preclinical trials, we found a class of HIV protease and nucleoside analogue inhibitors that looked efficacious. Almost two decades later, it’s imperative that we finish what we started off.
Machine learning models can not only be used to classify symptoms but also to generate new drugs for the coronavirus. As always, clinical trials will need to be used to validate generated results, but using SARS-CoV and MERS-CoV inhibitors as chemical templates and generating variations of these efficacious molecules that still resemble the originals in terms of chemical similarity allows us to …
- take advantage of our previous experience with coronavirus strains
- based on sequence and genetic analyses, use specific homology models (e.g. bat coronavirus, or SARS-CoV) as templates to generate new drugs.
- saves years out of the extremely long small molecule drug development process.
It’s essentially a very slight variation of drug repurposing for coronavirus receptors. In essence, we’re taking previous drugs that look like they would have worked, modifying them slightly, and then evaluating their new efficacy. It’s also likely that given the time-effective nature of this problem, it would be more safe and effective to invest more into small molecule drugs to cure the virus rather than vaccines to prevent infection acquisition. By cutting out the portion of the drug discovery process where we have to screen trillions of molecules and generate leads, we can get quality results much quicker. Not only that, but previous coronavirus vaccines have caused worsening, not improvement.
In addition, here are the potential downsides with making vaccines as compared to repurposed drugs:
- Vaccines have the potential to be more dangerous. Scientists need to make sure that they remove all of the harmful/toxic genes from the 2019-nCoV genome before prescribing it as a vaccine.
- A vaccine that can be made quickly generally is harder to manufacture in large quantities. On the other hand, vaccines that can be scaled up more easily require a longer period of time to make (6–8 months).
- We still need to conduct preclinical studies for the vaccines whereas these results have already been published for previous strains of the 2019-nCoV.
- There are blurry lines as to what qualifies as “safe and effective” for vaccines. With all of the above time commitments and research efforts, the vaccine might only turn out to work for 50% of patients.
- The 2019-nCoV virus readily undergoes mutations that will change the efficacy of the vaccine.
Now that we have a general background into the coronavirus and current developments, let’s dive into what I worked on.
Here’s a breakdown of what I’m working on:
In order to generate variations of previous drug that may be efficacious for the 2019-nCoV, I…
- Chose a target of the 2019-nCoV to experiment on and generate drug molecules for.
- Repurposed an existing LSTM model to generate drugs for the coronavirus.
- I then used PyRX — a spin-off of the docking software Autodock to dock the generated drugs to the original SARS target as well as the new 2019-nCoV target. This allows us to test the efficacy of our new drugs.
Step 1: Choosing a Target
In drug discovery, a target is the protein molecule that you choose to inhibit or activate based on its given functionality and role in biochemical pathways. The first thing to be done is choosing a given target. If we think as each small-molecule-target pair as a lock and a key, it makes sense that we have to try and find which lock that we want to try and open before we try to design a key.
But in order to find the lock or best target for the virus, we have to understand how the virus replication and life cycle works.
OK. Woah. Let’s break this down.
- Break into cell
- Make use of its machinery
- Use machinery and code together to proliferate and infect other parts
Now let’s apply each of these steps to the coronavirus. Imagine that you’re the coronavirus. You’re going to break into a human cell.
1. Break into Cell
The first thing you have to do is break into the cell in order to get access to all of its unique and helpful machinery. You use your spike glycoproteins, large molecules on the outside of your surface to bind to the receptors of the cell and you gain secret access.
2. Make use of Machinery using your Handy Toolbox
Now that you’ve gotten secret access into the cell, you’ve got to make use of its machinery for your advantage.
The first thing that do is release your secret weapon, your genome, which allows you to do all the burglary in the cell. Your genome consists of two open reading frames, which allow you to transcribe two different versions of polyproteins.
You’re still not ready to hijack the cell’s machinery though. You still have set up your hijacking device. The polyproteins need to be turned “on” by breaking them apart into fifteen non-structural proteins, which are about to do all of the dirty work. You break apart the polyproteins using proteases, which are like handy scissors. You specifically use what are called papain-like proteases (PLpro) and 3C-lke proteases (3CLpro).
Ok now you’re part way through your set up process but you’re not yet done.
You now use all of the non-structural proteins (nsp) generated by the scissoring process to generate transcription complex, which transcribes the viruses positive strand RNA to negative strand mRNA for the synthesis of proteins.
3. Use machinery to proliferate and infect
Use the mRNA produced by the machinery to translate structural and accessory proteins and proliferate
Target options for the coronavirus:
There is currently lots of effort going on right now for drug repurposing. Chloroquine and Remdesivir, two drugs shown to be potent inhibitors for the RNA-dependent RNA polymerase (which converts the positive RNA strand to the mRNA negative version). However, I decided to take a different angle on this project, and decided to use 3CLpro as a target, since without 3CLpro functionality, the virus cannot replicate.
Again, these are the molecular scissors which cut up the polyprotein made up of thousands of amino acids into multiple non-structured proteins which compose the replication-transcription complex.
Now we have an understanding of the viral replication system and potential efficacious targets which we can choose. How do we find existing inhibitors for the SARS-CoV strain based on previous research.
Step 2: Generating Variations of the Molecules
In order to generate variations of the molecules I…
- Used SQL to parse through the ChemBL drug database and find relevant inhibitors for SARS-CoV, which were efficacious beyond a certain nM threshold. Prepare data for training.
- Train the LSTM model on the millions of molecules in the ChemBL database and generate variations.
- “Fine-tune” the model on a particular target and have it generate ~100 molecules with similar chemical similarities to the original representative molecules from ChemBL.
Ok now before I start on my specific project, a quick crash course on LSTMs. Those of who having prior knowledge of expertise in this area may skip this part.
LSTMs are a type of machine learning algorithm (along with CNNs, RNNs, etc.) that stand for Long-Short Term Memory networks. They are used for sequence analysis and predicting the future of sequences given its past. They are most commonly used in sentiment analysis and Natural-language processing for text-to speech and general language analysis. However, they can also be used to predict chemical structures through character generation of what’s known as SMILES strings. We’ll get to more about SMILES in the next section.
LSTMs are useful because they combat the common vanishing gradient problem. This means that as algorithms backpropagate through layers, the value of the resulting gradients get smaller and smaller. Therefore, as the gradients get smaller and smaller, the machine learning model is unable to learn as fast as it could before, regardless of alpha, its learning rate. LSTMs on the other hand learn to regulate the flow and useful information through the use of different gates and vector operations.
The basic premise of an LSTM is that it has multiple cells, each to evaluate a specific word or character in a sequence. It is able to combine input information into the current cell with “hidden state” information from previous cells to pass on relevant information to the next cell.
Here is a description of the different types of gates:
- Forget Gate: This gate takes an addition of the input and hidden state vectors from a previous cell and then applies the sigmoid activation function, which squishes values between 0 and 1. The “forgotten” elements are represented by vector elements closer to zero.
- Input Gate: The input gate decides what element of the vector element to retain. In effect it decides what to “remember.” The input vector + hidden state information is applied to a sigmoid function and then separately applied to a tanh activation function. The two output vectors are then multiplied together to determine the new input gate output.
- Cell state Calculation: The “cell state” is calculated by taking the output from the forget gate output, multiplying with the previous cell state and then adding the input information. Therefore, the cell can retain information from the previous cell while also learning new information.
- Output Gate: The output gate prepares the output of the LSTM cell, which includes preparing the current cells cell state for export as well as preparing the new hidden state.
The process repeats for subsequent layers of the LSTM. With all of this information, the LSTM can make predictions about future elements in the sequence!
TL;DR: Why LSTM’s?
LSTMs are generative models, which means they will allow us to generate new drugs by adding elements to a sequence
LSTMs have long term memory meaning that they can store and recognize patterns across a large variety of sequences
The sequences that I analyzed for my project were SMILES strings, which are 1D sequence representations for 2D chemical organic molecule structures.
SMILES (stands for simplified molecular-input line-entry system) strings are sequences of characters that represent a two dimensional chemical structure as shown below:
The SMILES strings above represents the molecule below:
I’m not going to go into the exact conversion between smiles string format and chemical molecule, because that requires a thorough understanding of organic chemistry and stereochemistry which we do not have time for in this article. Nevertheless, there are several observations that we can make, such as the fact that the CO represents the C-O bond in the molecule, and the hashtag represents the triple bond between Carbon and Nitrogen in the molecule.
I trained an existing LSTM model on millions of molecules from the ChemBl database, allowing the model to adjust its weights and parameters so that it can accurately predict the next character given the first n elements of a SMILES string.
molregno IN (
standard_type IN ("Kd", "Ki", "Kb", "IC50", "EC50")
AND standard_units = "nM"
AND standard_value < 1000
AND standard_relation IN ("<", "<<", "<=", "=")
molecule_type = "Small molecule"
This is the SQL code that I used to parse through the database.
- The standard types — Kd, Ki, Kb, IC50, EC50 — are molecular and enzymatic metrics for the efficacy of the drugs — in other words, how effective they are in inhibiting or activating the respective drug target.
- The standard value is the threshold that I set for the activity of the drugs in the Chembl database. In other words, drugs that take a concentration of less than 1000nM (nanomolar) to reach the standard efficacy or inhibition were selected from the database.
Now we have all of the small molecule drugs downloaded into our model. Let’s look into how we construct the LSTM model and how we train it.
st = SmilesTokenizer()
n_table = len(st.table)
weight_init = RandomNormal(mean=0.0,
seed=self.config.seed)self.model = Sequential()
Here you can see that there are two LSTM layers stacked on top of each other along with a subsequent dense layer that actually makes the predictions for subsequent SMILES characters given an original fragment.
Also not the instantiation of the SmilesTokenizer class towards the beginning of this code. The SMILES tokenizer takes each smile string for each molecule and basically one hot encodes each character in the string (one-hot encoding is where each element of the string is represented by vector of zeroes where the position of the one determines the identity of the specific element. After all of this data is curated, we’re ready to train.
Training looked something like this:
history = self.model.fit_generator(
Here you can see the many conventional machine learning hyperparameters, including numer of epochs, and the splitting of data into traning and validation sets. All of these hyperparameters were predefined in a separate configuration file.
def _generate(self, sequence):
while (sequence[-1] != 'E') and (len(self.st.tokenize(sequence)) <=
x = self.st.one_hot_encode(self.st.tokenize(sequence))
preds = self.model.predict_on_batch(x)[-1]
next_idx = self.sample_with_temp(preds)
sequence += self.st.table[next_idx]sequence = sequence[1:].rstrip('E')
return sequencedef sample_with_temp(self, preds):
streched = np.log(preds) / self.config.sampling_temp
streched_probs = np.exp(streched) / np.sum(np.exp(streched))
return np.random.choice(range(len(streched)), p=streched_probs)
Sampling with Temperature: In sampling, the LSTM uses the model’s pre-trained and newly updated weights to generate entirely new sequences! You can see in the code above that characters are added until the sequence meets the maximum length. The temperature factor (represented by T in the softmax function) represents another variable that can be optimized to control the inverse relationship between chemical structural diversity and valid SMILES strings (the more diverse the chemical structures, the smaller the percentage of valid SMILES strings).
I generated 30,000 samples based on the Chembl input data and it generated a PCA distribution as shown below. The distribution shows us that chemical properties of the original and generated molecules mostly overlap, which is exactly what we want.
Now that the LSTM model has been trained and parameters have been set to generate valid SMILES strings, we now need to modify our model a tiny bit so that it generates not just any drug, but drugs that bind efficiently to 3CLpro.
How do we do this?
We do this through a process called fine-tuning, where we select a few drugs (5–20) from the latent space of efficacious inhibitors for 3Clpro and then generate variations of those specific molecules. This was done through the use use of a process called “fine-tuning.”
config = process_config('experiments/2019-12-23/LSTM_Chem/config.json')modeler = LSTMChem(config, session='finetune')
finetune_dl = DataLoader(config, data_type='finetune')finetuner = LSTMChemFinetuner(modeler, finetune_dl)
finetuner.finetune()finetuned_smiles = finetuner.sample(num=100)
We’ve generated 100 small molecule variations of the 10 representative drugs that we chose.
Now here comes the conceptually tricky part. There were three datasets we dealt with.
- Representative molecules for fine tuning — these are the molecules that the model created variations of
- Original inhibitors for SARS 3Clpro — this data set includes the representative molecules but also includes more molecules, which might have slightly smaller binding affinities.
- Generated moleclues — these are the molecules that we generated through LSTM sampling
Ok now there’s a slight problem. Some of the molecules that we generated are extremely similar to the original selected training molecules. This isn’t very useful. Although generative models can traditionally be used to generate and output that resembles the input, making variations is what makes them more useful. At the same time, you don’t want generated molecules to be too distant from the original molecules in the principle component space. Therefore, we need a metric to measure chemical similarity: Tanimoto Similarity (or coefficient)
Here’s the original code that I then modified to select generated molecules that were chemically similar to the original molecules but not the same. I paired up generated molecules to the original inhibitors with the closest chemical similarity.
finetuned_smiles = finetuner.sample(num=100)TM_similarity = 
idxs = 
for Fbv in Fbvs:
a = np.array(DataStructs.BulkTanimotoSimilarity(Fbv, Sbvs))
m = a < 1
idx0 = np.where(m, a,np.nanmin(a)-1).argmax()
idx = np.where(m.any(), idx0, np.nan)TM_similarity.append(DataStructs.BulkTanimotoSimilarity(Fbv, Sbvs)[int(idx)])
nsmols = [smols[idx] for idx in idxs]
Here Fbvs is the set of chemical fingerprints for the original SARS inhibitors and nsmols is the set of generated new molecules. By finding the indexes for nsmols which have the highest Tanimoto chemical similarity with Fbvs, we can pair up the generated molecules with the original ones which have the highest chemical similarity.
More generally, here’s the vector showing the calculated tanimoto similarities between each finetuning molecule and the most chemically similar original inhibitor.
[0.9354838709677419, 0.9375, 0.9411764705882353, 0.76, 0.9333333333333333, 0.9411764705882353, 0.8823529411764706, 0.9375, 0.9393939393939394, 0.8103448275862069]
As you can see all of the Tanimoto similarities (on a scale from 0 to 1) are relatively high. Molecules with Tanimoto coefficients above 0.85 are generally called “chemically similar.”
Yay! Now we have our generated molecules. Onto the final step of the process.
Step 3: Protein-ligand docking
Now we have generated new “keys” for the 3Clpro lock with various signatures of ridges and bumps, we need to test the efficacy. How can we do so without actually being in a lab? We can do so through the use of virtual chemical simulation platforms called docking platforms.
Autock Vina is a popular docking platform, and I used a platform called PyRx, which makes docking a little bit easier.
The first thing I had to do was convert the generated SMILES strings for the new molecules into SDF files (just a different file format for chemical molecules. Other formats includ Mol2, pdb, etc.).
Now I had to download the structure of the lock that I was trying to open. I went on the pdb website looking for the structure of SARS-CoV 3Clpro, but then something caught me by surprise.
Scientists had already discovered the 3D crystal structure of 3Clpro for the 2019-nCoV!!! This meant that I could use the Wuhan coronavirus structure itself for testing the efficacy of the generated drugs rather than using the old SARS virus and extrapolating to this new strain.
I had originally planed to use sequence homology to map the mutations from 2019-nCoV to the SARS 3Clpro structure and use the nCSM-lig web server to calculate differences in binding affinities, but instead I used the new 3D crystal structure during docking.
Here’s how docking worked:
- First I downloaded all of the relevant ligands and proteins to the pyrx platform.
- I minimized the free energy of the ligands and chose the most stable conformation
- Found catalytic residues of 2019-nCoV 3Clpro and restricted the binding site to proximal area near the active site.
- Ran autodock vina and outputted ligand-protein complex pdbqt file along with binding affinity
Conclusion and Next Steps
This project show us that AI can be used to generate drug target leads much faster than the traditional drug development process.
Nevertheless, AI can’t completely validate drugs by itself.
The best way to generate the drugs with strongest binding affinity would likely be to repeat this process two or three times (meaning after finding the drugs with strongest binding affinity from autodock, re-input them back into the LSTM algorithm as representative molecules and have the model generate more variations.
For the future, I plan to conduct ADME and toxicity simulations on some of the molecules at which point I might reach out to research labs and pharma companies who might be willing to help me test out compounds. I’m also in the process of researching about and developing a novel Quantitative Structure-Activity Relationship algorithm for the coronavirus.
Credit and Works Consulted
Credit goes to topazape for developing the LSTM model that I was able to repurpose.
Autodock Vina: O. Trott, A. J. Olson, AutoDock Vina: improving the speed and accuracy of docking with a new scoring function, efficient optimization and multithreading, Journal of Computational Chemistry 31 (2010) 455–461
nCSM-lig: Pires, Douglas E V, and David B Ascher. “CSM-lig: a web server for assessing and comparing protein-small molecule affinities.” Nucleic acids research vol. 44,W1 (2016): W557–61. doi:10.1093/nar/gkw390
Generative LSTM paper: Gupta, Anvita, et al. “Generative Recurrent Networks for De Novo Drug Design.” Molecular Informatics, vol. 37, no. 1–2, 2017, p. 1700111., doi:10.1002/minf.201700111.
Homology Modeling Paper: Stoermer, Martin. “Homology Models of Wuhan Coronavirus 3CLpro Protease.” 2020, doi:10.26434/chemrxiv.11637294.v1.
If you’d like to discuss any of the topics above, I’d love to get in touch with you! I’m actually currently trying to optimize ML models to make predictions about efficient drugs for coronavirus protein targets. (Send me an email at email@example.com or message me on LinkedIn)
Please sign up for my monthly newsletter here if you’re interested in following my progress :)