Repurposing LSTMs to Generate Potential Leads for the Coronavirus

In his 2015 TED talk “The next outbreak? We’re not ready,” Bill Gates asserts that “The failure to prepare could allow the next epidemic to be dramatically more devastating than Ebola” and that “If anything kills over 10 million people in the next few decades, it’s most likely to be a highly infectious virus rather than a war.”

  • take advantage of our previous experience with coronavirus strains
  • based on sequence and genetic analyses, use specific homology models (e.g., bat coronavirus or SARS-CoV) as templates to generate new drugs.
  • this saves years in the extremely long small-molecule drug development process.
  • Vaccines have the potential to be more dangerous: scientists need to make sure that all of the harmful/toxic genes are removed from the 2019-nCoV genome before it is administered as a vaccine.
  • A vaccine that can be made quickly is generally harder to manufacture in large quantities; on the other hand, vaccines that scale up more easily take longer to make (6–8 months).
  • We still need to conduct preclinical studies for the vaccines, whereas these results have already been published for related coronaviruses such as SARS-CoV.
  • The lines are blurry as to what qualifies as “safe and effective” for vaccines. Even with all of the above time commitments and research efforts, the vaccine might only turn out to work for 50% of patients.
  • The 2019-nCoV virus readily undergoes mutations that can reduce the efficacy of a vaccine.

Here’s a breakdown of what I’m working on:

  1. Chose a target protein of 2019-nCoV to experiment on and generate drug molecules for.
  2. Repurposed an existing LSTM model to generate candidate drugs for the coronavirus.
  3. Used PyRx (a spin-off of the docking software AutoDock) to dock the generated drugs against both the original SARS-CoV target and the new 2019-nCoV target. This lets us estimate how well the new drugs bind.

Step 1: Choosing a Target

  1. Break into cell
  2. Make use of its machinery
  3. Use machinery and code together to proliferate and infect other parts

1. Break into Cell

2. Make use of Machinery using your Handy Toolbox

Step 2: Generating Variations of the Molecules

  1. Used SQL to parse the ChEMBL drug database and find relevant inhibitors of SARS-CoV that were efficacious beyond a certain nM threshold, then prepared the data for training.
  2. Trained the LSTM model on the millions of molecules in the ChEMBL database and generated variations.
  3. “Fine-tuned” the model on a particular target and had it generate ~100 molecules chemically similar to the original representative molecules from ChEMBL.


A diagram of an LSTM showing all of its gates
A Key for the various operations shown above
  • Forget Gate: This gate takes the concatenation of the input vector and the hidden state vector from the previous cell and applies the sigmoid activation function, which squishes values between 0 and 1. The “forgotten” elements are represented by vector elements closer to zero.
  • Input Gate: The input gate decides which elements of the vector to retain; in effect, it decides what to “remember.” The concatenated input and hidden state is passed through a sigmoid function and, separately, through a tanh activation function. The two output vectors are then multiplied elementwise to produce the input gate output.
  • Cell State Calculation: The new “cell state” is calculated by multiplying the forget gate output with the previous cell state and then adding the input gate output. The cell can therefore retain information from the previous cell while also learning new information.
  • Output Gate: The output gate prepares the output of the LSTM cell: it passes the current cell’s state through tanh and multiplies it by a sigmoid-gated combination of the input and hidden state to produce the new hidden state.
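The four gate equations above can be sketched numerically. Below is a minimal NumPy implementation of a single LSTM cell step; the weight shapes and random initialization are illustrative only, not the trained model's parameters.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM cell step on the concatenated [h_prev, x] vector."""
    n = h_prev.size
    z = np.concatenate([h_prev, x])   # combined previous hidden state + input
    pre = W @ z + b                   # pre-activations for all four gates
    f = sigmoid(pre[:n])              # forget gate: ~0 = forget, ~1 = keep
    i = sigmoid(pre[n:2 * n])         # input gate: what to remember
    g = np.tanh(pre[2 * n:3 * n])     # candidate cell values
    o = sigmoid(pre[3 * n:])          # output gate
    c = f * c_prev + i * g            # new cell state: kept memory + new info
    h = o * np.tanh(c)                # new hidden state
    return h, c

rng = np.random.default_rng(0)
n_hidden, n_input = 4, 3
W = rng.normal(size=(4 * n_hidden, n_hidden + n_input))
b = np.zeros(4 * n_hidden)
h, c = lstm_step(rng.normal(size=n_input),
                 np.zeros(n_hidden), np.zeros(n_hidden), W, b)
```

Because the hidden state is an output-gated tanh of the cell state, every element of h stays strictly between -1 and 1.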


-- Reconstructed query; the table names follow the public ChEMBL schema.
SELECT DISTINCT canonical_smiles
FROM compound_structures
WHERE molregno IN (
    SELECT DISTINCT molregno
    FROM activities
    WHERE standard_type IN ("Kd", "Ki", "Kb", "IC50", "EC50")
      AND standard_units = "nM"
      AND standard_value < 1000
      AND standard_relation IN ("<", "<<", "<=", "=")
)
AND molregno IN (
    SELECT molregno
    FROM molecule_dictionary
    WHERE molecule_type = "Small molecule"
);
  • The standard types — Kd, Ki, Kb, IC50, EC50 — are molecular and enzymatic metrics for the efficacy of the drugs — in other words, how effective they are in inhibiting or activating the respective drug target.
  • The standard value is the activity threshold that I set for the drugs in the ChEMBL database. In other words, drugs that reach the standard efficacy or inhibition at a concentration of less than 1000 nM (nanomolar) were selected from the database.
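To make the filter concrete, here is a self-contained sketch that runs the same WHERE clause against a toy in-memory table. The column names follow the ChEMBL activities schema, but the rows are invented for illustration.

```python
import sqlite3

# Toy stand-in for the ChEMBL activities table (made-up rows).
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE activities
               (molregno INT, standard_type TEXT, standard_relation TEXT,
                standard_value REAL, standard_units TEXT)""")
rows = [
    (1, "IC50", "=", 350.0,  "nM"),   # potent -> kept
    (2, "IC50", "=", 5000.0, "nM"),   # too weak -> dropped
    (3, "Ki",   "<", 80.0,   "nM"),   # potent -> kept
    (4, "IC50", ">", 500.0,  "nM"),   # wrong relation -> dropped
    (5, "IC50", "=", 0.4,    "uM"),   # wrong units -> dropped
]
con.executemany("INSERT INTO activities VALUES (?,?,?,?,?)", rows)

kept = [r[0] for r in con.execute("""
    SELECT DISTINCT molregno FROM activities
    WHERE standard_type IN ('Kd','Ki','Kb','IC50','EC50')
      AND standard_units = 'nM'
      AND standard_value < 1000
      AND standard_relation IN ('<','<<','<=','=')
    ORDER BY molregno""")]
print(kept)  # -> [1, 3]
```

Only rows measured in nM, with the right relation, and below the 1000 nM cutoff survive the filter.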
# Reconstructed from the truncated snippet: two stacked LSTM layers
# followed by a softmax Dense layer over the token table.
st = SmilesTokenizer()
n_table = len(st.table)
weight_init = RandomNormal(mean=0.0, ...)  # remaining arguments truncated in the original

self.model = Sequential()
self.model.add(LSTM(units=self.config.units, input_shape=(None, n_table),
                    return_sequences=True, kernel_initializer=weight_init))
self.model.add(LSTM(units=self.config.units, input_shape=(None, n_table),
                    return_sequences=True, kernel_initializer=weight_init))
self.model.add(Dense(units=n_table, activation='softmax',
                     kernel_initializer=weight_init))

def train(self):
    history = self.model.fit_generator(...)  # generator arguments truncated in the original
def _generate(self, sequence):
    # Sample tokens until the end token 'E' appears or the sequence
    # reaches the maximum length (the exact cap was truncated in the original).
    while (sequence[-1] != 'E') and (len(sequence) <= self.config.smiles_max_length):
        # Tokenize and one-hot encode the running sequence for the model
        # (reconstructed; the original snippet elides how x is built).
        x = self.st.one_hot_encode(self.st.tokenize(sequence))
        preds = self.model.predict_on_batch(x)[0][-1]
        next_idx = self.sample_with_temp(preds)
        sequence += self.st.table[next_idx]  # append the sampled token
    # Drop the start token 'G' and any trailing end tokens.
    sequence = sequence[1:].rstrip('E')
    return sequence

def sample_with_temp(self, preds):
    # Temperature-scaled sampling from the softmax output.
    stretched = np.log(preds) / self.config.sampling_temp
    stretched_probs = np.exp(stretched) / np.sum(np.exp(stretched))
    return np.random.choice(range(len(stretched)), p=stretched_probs)
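The effect of the sampling temperature can be seen directly: dividing the log-probabilities by a temperature below 1 sharpens the distribution, while a temperature above 1 flattens it, which is why hotter sampling yields more diverse molecules. The probabilities below are hypothetical.

```python
import numpy as np

def sample_probs(preds, temp):
    """Rescale log-probabilities by the temperature and re-normalize."""
    stretched = np.log(preds) / temp
    return np.exp(stretched) / np.sum(np.exp(stretched))

preds = np.array([0.7, 0.2, 0.1])   # hypothetical softmax output
low = sample_probs(preds, 0.5)      # sharper: the top token dominates
high = sample_probs(preds, 2.0)     # flatter: more structural diversity
```

At temperature 1.0 the distribution is unchanged; as the temperature grows, the probabilities approach uniform.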
a) For training, the SMILES strings were padded with ‘A’ characters until they reached the length of the longest SMILES string in the training batch. The first n-1 characters of each sample were given as input, and the last n-1 characters were the “label” for that sample. b) During sampling, ‘G’ was used to represent the starting point, and subsequent characters were predicted using the two LSTM layers and the Dense layer’s softmax function. c) A higher temperature factor leads to greater structural diversity.
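The padding and one-character shift in (a) amount to the following, shown here on toy token strings rather than real SMILES:

```python
# 'G' marks the start of a sequence, 'E' the end, and 'A' is padding.
batch = ["GCCOE", "GCCE"]                       # toy tokenized strings
max_len = max(len(s) for s in batch)
padded = [s + "A" * (max_len - len(s)) for s in batch]

inputs = [s[:-1] for s in padded]               # first n-1 characters
targets = [s[1:] for s in padded]               # last n-1 characters

print(inputs)    # ['GCCO', 'GCCE']
print(targets)   # ['CCOE', 'CCEA']
```

Each target is simply the input shifted left by one position, so at every step the model learns to predict the next token.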
Note how the original and generated molecules have similar chemical properties. This is what we want, since we don’t want the generated drugs to be significantly different from previous efficacious drugs.
Another visual showing how the properties of the molecules are very similar. MW stands for molecular weight and logP is a measure of the molecule’s hydrophobicity (how polar/nonpolar the molecule is).
config = process_config('experiments/2019-12-23/LSTM_Chem/config.json')
modeler = LSTMChem(config, session='finetune')
finetune_dl = DataLoader(config, data_type='finetune')
finetuner = LSTMChemFinetuner(modeler, finetune_dl)
finetuned_smiles = finetuner.sample(num=100)
  1. Representative molecules for fine-tuning — these are the molecules that the model created variations of.
  2. Original inhibitors for SARS 3CLpro — this dataset includes the representative molecules plus additional molecules, which may have slightly weaker binding affinities.
  3. Generated molecules — these are the molecules that we generated through LSTM sampling.
All three datasets (fine-tuning molecules, original inhibitors, and generated molecules) superimposed on each other.
TM_similarity = []
idxs = []
for Fbv in Fbvs:
    # Tanimoto similarity of this generated fingerprint against every
    # original inhibitor fingerprint.
    a = np.array(DataStructs.BulkTanimotoSimilarity(Fbv, Sbvs))
    # Ignore exact matches (similarity == 1) and take the closest
    # remaining inhibitor.
    m = a < 1
    idx = int(np.where(m, a, np.nanmin(a) - 1).argmax())
    idxs.append(idx)
    TM_similarity.append(a[idx])

nsmols = [smols[idx] for idx in idxs]
Here is an example of two chemically similar molecules: the one on the left is an originally discovered SARS inhibitor, and the one on the right is the most chemically similar generated molecule.
[0.9354838709677419, 0.9375, 0.9411764705882353, 0.76, 0.9333333333333333, 0.9411764705882353, 0.8823529411764706, 0.9375, 0.9393939393939394, 0.8103448275862069]
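Each value above is a Tanimoto coefficient between a generated molecule and its nearest original inhibitor. On binary fingerprints, the coefficient is just the intersection over the union of the “on” bits, which this pure-Python sketch (with made-up bit indices) computes the same way RDKit does for each pair:

```python
def tanimoto(a, b):
    """Tanimoto similarity between two sets of 'on' fingerprint bits."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

fp1 = [1, 4, 7, 9]    # hypothetical on-bit indices
fp2 = [1, 4, 9, 12]
print(tanimoto(fp1, fp2))  # -> 0.6  (3 shared bits / 5 total bits)
```

A coefficient near 1 means the generated molecule shares almost all of its substructural features with a known inhibitor.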
PyMol visualizations of some of the molecules that I generated.

Step 3: Protein-ligand docking

  1. First, I loaded all of the relevant ligands and proteins into the PyRx platform.
  2. I minimized the free energy of the ligands and chose the most stable conformation of each.
  3. I found the catalytic residues of the 2019-nCoV 3CLpro and restricted the binding site to the area proximal to the active site.
  4. I ran AutoDock Vina and output the ligand-protein complex as a pdbqt file along with its binding affinity.
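Step 4 can also be driven by a Vina configuration file. The fragment below is a sketch: the file names are placeholders, and the grid center and size values would have to come from the catalytic residues identified in step 3.

```text
receptor = 3clpro.pdbqt
ligand = generated_ligand.pdbqt
center_x = 10.0
center_y = 12.0
center_z = 68.0
size_x = 20
size_y = 20
size_z = 20
out = docked_complex.pdbqt
```

Restricting the search box to the active site (rather than the whole protein surface) is what makes the reported binding affinities meaningful for inhibition.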

Conclusion and Next Steps

Credit and Works Consulted




Mukundh Murthy

Innovator passionate about the intersection between structural biology, machine learning, and cheminformatics. Currently @ 99andbeyond.