Repurposing LSTMs to Generate Potential Leads for the Coronavirus

In his 2015 TED talk “The next outbreak? We’re not ready,” Bill Gates asserts that “The failure to prepare could allow the next epidemic to be dramatically more devastating than Ebola” and that “If anything kills over 10 million people in the next few decades, it’s most likely to be a highly infectious virus rather than a war.”

What’s one way to show that we’re more prepared this time? Through developments in exponential technology.

  • Take advantage of our previous experience with related coronavirus strains.
  • Based on sequence and genetic analyses, use homology models of related viruses (e.g. bat coronavirus or SARS-CoV) as templates to generate new drugs.
  • This saves years in the extremely long small-molecule drug development process.
  • Vaccines, by contrast, have the potential to be more dangerous: scientists need to make sure that all harmful/toxic genes are removed from the 2019-nCoV genome before administering it as a vaccine.
  • A vaccine that can be made quickly is generally harder to manufacture in large quantities. On the other hand, vaccines that can be scaled up more easily require a longer period of time to make (6–8 months).
  • We would still need to conduct preclinical studies for any new vaccine, whereas such results have already been published for inhibitors of related coronaviruses like SARS-CoV.
  • There are blurry lines as to what qualifies as “safe and effective” for vaccines. After all of the above time commitments and research efforts, the vaccine might only turn out to work for 50% of patients.
  • The 2019-nCoV virus readily undergoes mutations that can change the efficacy of a vaccine.

Here’s a breakdown of what I’m working on:

In order to generate variations of previous drugs that may be efficacious against 2019-nCoV, I…

  1. Chose a target of the 2019-nCoV to experiment on and generate drug molecules for.
  2. Repurposed an existing LSTM model to generate drugs for the coronavirus.
  3. Used PyRx, a spin-off of the docking software AutoDock, to dock the generated drugs against the original SARS target as well as the new 2019-nCoV target, which lets us test the efficacy of the new drugs.

Step 1: Choosing a Target

In drug discovery, a target is the protein molecule that you choose to inhibit or activate based on its functionality and role in biochemical pathways. The first step is choosing that target. If we think of each small-molecule and target pair as a key and a lock, it makes sense that we have to find which lock we want to open before we try to design a key. To find the right lock, it helps to walk through what the virus does to a host cell:

  1. Break into the cell
  2. Make use of its machinery
  3. Use the machinery and its genetic code to proliferate and infect other cells

1. Break into Cell

The first thing you have to do is break into the cell in order to get access to all of its unique and helpful machinery. You use your spike glycoproteins, the large molecules studding your surface, to bind to the cell’s receptors, and you gain secret access.

2. Make use of Machinery using your Handy Toolbox

Now that you’ve gotten secret access into the cell, you’ve got to make use of its machinery to your advantage.

Step 2: Generating Variations of the Molecules

In order to generate variations of the molecules I…

  1. Used SQL to parse the ChEMBL drug database and find relevant inhibitors of SARS-CoV that were efficacious below a certain nM threshold, then prepared the data for training.
  2. Trained the LSTM model on the millions of molecules in the ChEMBL database so that it learns to generate valid SMILES strings.
  3. “Fine-tuned” the model on a particular target, having it generate ~100 molecules chemically similar to the original representative molecules from ChEMBL.

LSTMs

LSTM stands for Long Short-Term Memory. LSTMs are a type of neural network (a variant of RNNs, alongside architectures like CNNs) used for sequence analysis: predicting the future of a sequence given its past. They are most commonly used in sentiment analysis and natural language processing, for tasks like text-to-speech and general language analysis. However, they can also be used to predict chemical structures through character-by-character generation of what are known as SMILES strings. We’ll get to more about SMILES in the next section.

A diagram of an LSTM showing all of its gates
A Key for the various operations shown above
  • Forget Gate: This gate takes the input vector and the previous cell’s hidden state, combines them through a learned linear transform, and applies the sigmoid activation function, which squishes values between 0 and 1. The “forgotten” elements of the cell state are those multiplied by values close to zero.
  • Input Gate: The input gate decides which elements of the candidate vector to retain; in effect, it decides what to “remember.” The combined input and hidden-state information is passed through a sigmoid function and, separately, through a tanh activation function. The two output vectors are then multiplied element-wise to determine the input gate’s output.
  • Cell State Calculation: The new “cell state” is calculated by multiplying the forget gate’s output with the previous cell state and then adding the input gate’s output. The cell can therefore retain information from the previous step while also learning new information.
  • Output Gate: The output gate prepares the output of the LSTM cell: the updated cell state is passed through tanh and multiplied element-wise by a sigmoid-gated combination of the input and previous hidden state, producing the new hidden state. (A numeric sketch of these equations follows this list.)
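To make the gate arithmetic concrete, here is a minimal NumPy sketch of a single LSTM time step under the standard formulation. This is illustrative only, not code from this project; lstm_step and its parameter names are hypothetical.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    # W (4H x D), U (4H x H), and b (4H,) stack the parameters for the
    # forget (f), input (i), candidate (g), and output (o) transforms.
    H = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    f = sigmoid(z[0:H])        # forget gate: what to discard from c_prev
    i = sigmoid(z[H:2*H])      # input gate: what new information to write
    g = np.tanh(z[2*H:3*H])    # candidate values
    o = sigmoid(z[3*H:4*H])    # output gate
    c = f * c_prev + i * g     # new cell state: keep old info, add new
    h = o * np.tanh(c)         # new hidden state
    return h, c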

COc(c1)cccc1C#N

The SMILES string above represents a small aromatic molecule, a methoxy-substituted benzonitrile.
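SMILES (Simplified Molecular-Input Line-Entry System) strings encode a molecule’s atoms and bonds as plain text, which is what lets us treat molecule generation as character-level sequence prediction. As a quick illustration (a sketch assuming RDKit is installed; RDKit is separate from the LSTM model itself), we can parse the string and compute some of the properties discussed later:

from rdkit import Chem
from rdkit.Chem import Descriptors, Crippen

mol = Chem.MolFromSmiles('COc(c1)cccc1C#N')  # the SMILES string above
print(Chem.MolToSmiles(mol))    # RDKit's canonical form of the same molecule
print(Descriptors.MolWt(mol))   # molecular weight (MW)
print(Crippen.MolLogP(mol))     # logP, a measure of hydrophobicity

With SMILES as the working representation, the next step is pulling relevant training molecules out of the ChEMBL database with SQL: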

SELECT DISTINCT canonical_smiles
FROM compound_structures
WHERE molregno IN (
    SELECT DISTINCT molregno
    FROM activities
    WHERE standard_type IN ('Kd', 'Ki', 'Kb', 'IC50', 'EC50')
      AND standard_units = 'nM'
      AND standard_value < 1000
      AND standard_relation IN ('<', '<<', '<=', '=')
    INTERSECT
    SELECT molregno
    FROM molecule_dictionary
    WHERE molecule_type = 'Small molecule'
);
  • The standard types (Kd, Ki, Kb, IC50, and EC50) are molecular and enzymatic metrics for the efficacy of the drugs: in other words, how effective they are at inhibiting or activating their respective drug target.
  • The standard value is the activity threshold that I set for drugs in the ChEMBL database: only drugs that reach the standard efficacy or inhibition at a concentration of less than 1000 nM (nanomolar) were selected.
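Here is a sketch of how the query above might be run locally. ChEMBL distributes SQLite dumps of the full database; the file names below ('chembl.db', 'sars_inhibitors.sql') are placeholders, not paths from this project.

import sqlite3

# 'chembl.db' is a placeholder path to a local ChEMBL SQLite dump;
# 'sars_inhibitors.sql' is a hypothetical file holding the query above.
query = open('sars_inhibitors.sql').read()
with sqlite3.connect('chembl.db') as conn:
    smiles = [row[0] for row in conn.execute(query)]
print(len(smiles), 'candidate SMILES strings retrieved')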
# Excerpted from the LSTMChem model class (topazape's LSTM_Chem repository).
from keras.models import Sequential
from keras.layers import LSTM, Dense
from keras.initializers import RandomNormal
import numpy as np

st = SmilesTokenizer()  # tokenizer from the repo; st.table is the token vocabulary
n_table = len(st.table)

weight_init = RandomNormal(mean=0.0, stddev=0.05, seed=self.config.seed)

# Two stacked LSTM layers followed by a softmax over the token vocabulary.
self.model = Sequential()
self.model.add(
    LSTM(units=self.config.units,
         input_shape=(None, n_table),
         return_sequences=True,
         kernel_initializer=weight_init,
         dropout=0.3))
self.model.add(
    LSTM(units=self.config.units,
         input_shape=(None, n_table),
         return_sequences=True,
         kernel_initializer=weight_init,
         dropout=0.5))
self.model.add(
    Dense(units=n_table,
          activation='softmax',
          kernel_initializer=weight_init))

def train(self):
    history = self.model.fit_generator(
        self.train_data_loader,
        steps_per_epoch=len(self.train_data_loader),
        epochs=self.config.num_epochs,
        verbose=self.config.verbose_training,
        validation_data=self.valid_data_loader,
        validation_steps=len(self.valid_data_loader),
        use_multiprocessing=True,
        shuffle=True,
        callbacks=self.callbacks)

def _generate(self, sequence):
    # Sample one token at a time until the end token 'E' appears or the
    # maximum SMILES length is reached.
    while (sequence[-1] != 'E') and \
            (len(self.st.tokenize(sequence)) <= self.config.smiles_max_length):
        x = self.st.one_hot_encode(self.st.tokenize(sequence))
        preds = self.model.predict_on_batch(x)[0][-1]
        next_idx = self.sample_with_temp(preds)
        sequence += self.st.table[next_idx]
    # Strip the start token 'G' and any trailing end token 'E'.
    sequence = sequence[1:].rstrip('E')
    return sequence

def sample_with_temp(self, preds):
    # Temperature sampling: rescale the softmax outputs, renormalize, and
    # draw the next token index from the resulting distribution.
    stretched = np.log(preds) / self.config.sampling_temp
    stretched_probs = np.exp(stretched) / np.sum(np.exp(stretched))
    return np.random.choice(range(len(stretched)), p=stretched_probs)
a) For training, the SMILES strings were padded with “A” characters until they reached the length of the longest SMILES string in the training batch. The first n−1 characters of each sample were given as input, and the last n−1 characters were the “label” for that sample. b) During sampling, “G” was used as the start token, and subsequent characters were predicted using the two LSTM layers and the Dense softmax layer. c) A higher temperature factor leads to greater structural diversity.
Note how the original and generated molecules have similar chemical properties. This is what we want, since we don’t want the generated drugs to be significantly different from previous efficacious drugs.
Another visual showing how the properties of the molecules are very similar. MW stands for molecular weight and logP is a measure of the molecule’s hydrophobicity (how polar/nonpolar the molecule is).
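The comparison in the visual above can be reproduced with RDKit descriptors. A minimal sketch, assuming smiles_orig and smiles_gen are lists of the original and generated SMILES strings (hypothetical names):

from rdkit import Chem
from rdkit.Chem import Descriptors, Crippen

def mw_logp(smiles_list):
    # Returns (molecular weight, logP) pairs, skipping unparseable SMILES.
    props = []
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        if mol is not None:
            props.append((Descriptors.MolWt(mol), Crippen.MolLogP(mol)))
    return props

orig_props = mw_logp(smiles_orig)  # original ChEMBL inhibitors
gen_props = mw_logp(smiles_gen)    # LSTM-generated molecules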
config = process_config('experiments/2019-12-23/LSTM_Chem/config.json')
modeler = LSTMChem(config, session='finetune')
finetune_dl = DataLoader(config, data_type='finetune')
finetuner = LSTMChemFinetuner(modeler, finetune_dl)
finetuner.finetune()
finetuned_smiles = finetuner.sample(num=100)
  1. Representative molecules for fine-tuning: the molecules that the model created variations of.
  2. Original inhibitors for SARS 3CLpro: this dataset includes the representative molecules but also additional molecules, which may have slightly weaker binding affinities.
  3. Generated molecules: the molecules that we generated through LSTM sampling.
All three datasets (fine-tuning molecules, original inhibitors, and generated molecules) superimposed on each other.
# For each generated-molecule fingerprint (Fbvs), find the most chemically
# similar original-inhibitor fingerprint (Sbvs), excluding exact matches.
TM_similarity = []
idxs = []
for Fbv in Fbvs:
    a = np.array(DataStructs.BulkTanimotoSimilarity(Fbv, Sbvs))
    m = a < 1  # mask out identical molecules (Tanimoto similarity of 1)
    idx = int(np.where(m, a, np.nanmin(a) - 1).argmax())
    TM_similarity.append(a[idx])
    idxs.append(idx)

# The closest original inhibitor for each generated molecule.
nsmols = [smols[idx] for idx in idxs]
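Fbvs and Sbvs in the loop above are fingerprint bit vectors for the generated and original molecules; they aren’t defined in the snippet, but one plausible construction (an assumption, not necessarily what the original code used) is RDKit Morgan fingerprints:

from rdkit import Chem
from rdkit.Chem import AllChem

def fingerprints(smiles_list):
    # Morgan (ECFP4-like) bit-vector fingerprints for a list of SMILES.
    mols = (Chem.MolFromSmiles(s) for s in smiles_list)
    return [AllChem.GetMorganFingerprintAsBitVect(m, 2, nBits=2048)
            for m in mols if m is not None]

Fbvs = fingerprints(finetuned_smiles)  # generated molecules
Sbvs = fingerprints(smols)             # original SARS inhibitors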
Here is an example of two chemically similar molecules: the one on the left is an originally discovered SARS inhibitor, and the one on the right is the most chemically similar generated molecule.
The nearest-neighbor Tanimoto similarities for ten of the generated molecules:
[0.9354838709677419, 0.9375, 0.9411764705882353, 0.76, 0.9333333333333333, 0.9411764705882353, 0.8823529411764706, 0.9375, 0.9393939393939394, 0.8103448275862069]
PyMol visualizations of some of the molecules that I generated.

Step 3: Protein-ligand docking

Now that we have generated new “keys” for the 3CLpro lock, each with its own signature of ridges and bumps, we need to test their efficacy. How can we do so without actually being in a lab? Through virtual chemical simulation tools known as docking platforms.

  1. First, I loaded all of the relevant ligands and proteins into the PyRx platform.
  2. I minimized the free energy of the ligands and chose the most stable conformation of each.
  3. I found the catalytic residues of the 2019-nCoV 3CLpro and restricted the binding site to the area proximal to the active site.
  4. I ran AutoDock Vina and output the ligand-protein complex .pdbqt file along with a binding affinity (a sketch of this step follows below).
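Here is a rough sketch of what step 4 looks like when driven from Python instead of the PyRx GUI. The file names and search-box coordinates are placeholders, not values from this project; the flags are standard AutoDock Vina command-line options:

import subprocess

# Placeholder file names and box coordinates; in practice the box is
# centered on the 3CLpro catalytic residues identified in step 3.
subprocess.run([
    'vina',
    '--receptor', 'nCoV_3CLpro.pdbqt',      # prepared protein target
    '--ligand', 'generated_ligand.pdbqt',   # one generated molecule
    '--center_x', '10.0', '--center_y', '12.0', '--center_z', '68.0',
    '--size_x', '20', '--size_y', '20', '--size_z', '20',
    '--exhaustiveness', '8',
    '--out', 'docked_complex.pdbqt',        # docked poses and affinities
], check=True)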

Conclusion and Next Steps

This project shows that AI can be used to generate drug leads much faster than the traditional drug development process allows.

Credit and Works Consulted

Credit goes to topazape for developing the LSTM model that I was able to repurpose.
