Machine Learning in Drug Discovery

Mukundh Murthy
11 min read · Jun 25, 2020

Why is it that we still haven’t been able to figure out how to cure diseases like Alzheimer’s? It’s mainly because:

  1. We don’t understand the main mechanism of the disease (the pathways involved in misfolded amyloid accumulation). That means that when we choose a target, we’re not entirely sure whether it is ‘druggable’, that is, whether modulating it with a drug will actually produce the effect we intend.
  2. We don’t entirely understand the relationship between small molecule structure and physiological properties and side effects.

We’ll see how machine learning is giving us a better shot at solving both of these problems.

There are three main stages of drug discovery. Stage 1 involves identifying and validating the chosen target. Stage 2 involves narrowing down leads and optimizing them based on molecular properties. Finally, Stage 3 involves clinical trials that check for pharmacodynamic and pharmacokinetic side effects.

The diagram above shows a simplified version of the drug discovery pipeline. Each of the three technologies outlined in it (ML x Protein Folding, multi-omics, and virtual clinical trials) will guide our exploration of how machine learning is revolutionizing the field of drug discovery.

Navigating Chemical Space

The space of possible small molecules contains more than 10⁶³ candidates; screening narrows this down to roughly 10²² and eventually 10¹⁴ molecules.

Drug Screening Pipeline

Screening through this space often takes more than 2 decades.

There are two main ways that small molecule drugs are designed: structure-based and ligand-based drug discovery. In structure-based drug discovery, the 3-dimensional structure of the protein target is known; in ligand-based drug discovery, no target structure is available, so design relies on molecules already known to bind the target.
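To make the ligand-based idea concrete, here is a minimal sketch of similarity screening: molecules are reduced to binary fingerprints and candidates are ranked by Tanimoto similarity to a known active. The fingerprints and molecule names below are invented for illustration; real pipelines derive the bits from molecular substructures (e.g. ECFP) using a cheminformatics toolkit.

```python
# Ligand-based screening sketch: rank candidate molecules by Tanimoto
# similarity to a known active, with fingerprints as sets of "on" bits.

def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient between two binary fingerprints (as sets)."""
    if not fp_a and not fp_b:
        return 0.0
    return len(fp_a & fp_b) / len(fp_a | fp_b)

known_active = {1, 4, 7, 9, 15}        # toy fingerprint of a known ligand
candidates = {
    "mol_A": {1, 4, 7, 9, 20},         # shares most bits
    "mol_B": {2, 3, 5, 8, 11},         # shares nothing
    "mol_C": {1, 4, 9, 15, 16, 7},     # near-identical to the active
}

ranked = sorted(candidates,
                key=lambda m: tanimoto(candidates[m], known_active),
                reverse=True)
print(ranked)   # ['mol_C', 'mol_A', 'mol_B']
```

In a real screen the same ranking step runs over millions of fingerprints, which is why it is so much cheaper than docking every candidate.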


  • Check out this article to learn more about why drugs take two decades to make
  • RoBERTa HuggingFace pre-trained model — check out this pretrained transformer model on the ZINC small molecule database that one of my friends made

Evaluating The Efficacy of Drugs

How is the efficacy of a drug evaluated? Most often through the free energy of binding. However, this approach often leads to side effects being discovered later, in clinical trials. Side effects often stem from the pharmacokinetic and pharmacodynamic behavior of drugs, which varies from person to person because mutations alter how drugs are metabolized and how well they bind. Measuring drugs solely by physical properties such as logP and the outdated Lipinski rule of 5 leaves toxicities and side effects undiscovered.

QSAR (quantitative structure–activity relationship) modeling aims to predict the physical and biological properties of drugs from structural and pharmacophore features; tools like Chemprop apply message-passing graph neural networks to exactly this task.

Companies Changing the Status Quo

OneThreeBio is revolutionizing the way that small molecule properties are predicted.

OneThreeBio is working towards a future where machine learning models can predict the clinical outcomes of small molecules through data-driven approaches.

Enormous sums are lost each year on failed clinical trials. Much of that loss could be avoided by halting trials earlier for drugs with unfavorable characteristics and properties.

In a 2016 paper, scientists at OneThreeBio used both physicochemical and toxicity measures of the small molecule and of its target to assess overall toxicity.

Although we’ve had metrics for toxicity for decades, these metrics are often rigid and inaccurate, and may actually prevent us from finding efficacious drugs rather than eliminating toxic ones. For example, Lipinski’s rule of 5 (Ro5) sets rigid cutoffs: no more than 5 hydrogen bond donors, no more than 10 hydrogen bond acceptors, a molecular mass under 500 daltons, and a logP under 5. Other metrics such as QED (a quantitative drug-likeness score) have been defined, but these haven’t reduced trial attrition rates.
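Since Ro5 keeps coming up, here it is expressed as code: a minimal sketch assuming the conventional reading that a molecule may violate at most one of the four cutoffs. The property values below are illustrative placeholders, not computed from real structures.

```python
# Lipinski's rule of five as a simple filter. In practice the four
# properties are computed from the molecular structure by a
# cheminformatics toolkit; here they are supplied directly.

RULES = {
    "mol_weight":  lambda p: p["mol_weight"] <= 500,   # daltons
    "logp":        lambda p: p["logp"] <= 5,
    "h_donors":    lambda p: p["h_donors"] <= 5,
    "h_acceptors": lambda p: p["h_acceptors"] <= 10,
}

def ro5_violations(props):
    """Names of the Ro5 cutoffs this molecule breaks."""
    return [name for name, ok in RULES.items() if not ok(props)]

def passes_ro5(props):
    # Conventional reading: at most one violation is tolerated.
    return len(ro5_violations(props)) <= 1

aspirin_like = {"mol_weight": 180.2, "logp": 1.2,
                "h_donors": 1, "h_acceptors": 4}
large_peptide = {"mol_weight": 1200.0, "logp": 6.3,
                 "h_donors": 8, "h_acceptors": 14}

print(passes_ro5(aspirin_like))    # True
print(passes_ro5(large_peptide))   # False: breaks all four cutoffs
```

The rigidity criticized above is visible in the code: each rule is a hard boolean cutoff, so a molecule at 501 daltons fails exactly like one at 1200.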

OneThreeBio defined a new metric, PROCTOR (predicting the odds of clinical trial outcomes), that integrates multiple types of data to make a more informed decision about the toxicity of a small molecule drug.

Instead of looking solely at structural features of the small molecule, they utilized properties of the target as well, including gene expression levels, mutation frequency, and network connectivity, defined as the number of gene neighbors of the drug target along with its “betweenness”: the number of shortest paths in the network that pass through the given gene.
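To make those two connectivity features concrete, here is a toy sketch: degree is just the neighbor count, and betweenness counts shortest paths passing through a gene. The interaction graph below is invented for illustration and is not the network used in the paper.

```python
# Degree and betweenness on a tiny made-up protein-interaction graph
# (a linear pathway: PIK3CA - KRAS - RAF1 - MAP2K1 - MAPK1).

from collections import deque

GRAPH = {
    "KRAS":   {"RAF1", "PIK3CA"},
    "RAF1":   {"KRAS", "MAP2K1"},
    "MAP2K1": {"RAF1", "MAPK1"},
    "MAPK1":  {"MAP2K1"},
    "PIK3CA": {"KRAS"},
}

def bfs_dist_counts(graph, source):
    """Distances and shortest-path counts from source to every node."""
    dist, count = {source: 0}, {source: 1}
    q = deque([source])
    while q:
        u = q.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                count[v] = count[u]
                q.append(v)
            elif dist[v] == dist[u] + 1:
                count[v] += count[u]
    return dist, count

def betweenness(graph, node):
    """How many shortest (s, t) paths, s != t != node, pass through node."""
    total = 0
    others = [n for n in graph if n != node]
    for i, s in enumerate(others):
        ds, cs = bfs_dist_counts(graph, s)
        for t in others[i + 1:]:
            dn, cn = bfs_dist_counts(graph, node)
            # node lies on a shortest s-t path iff distances add up exactly
            if ds[node] + dn[t] == ds[t]:
                total += cs[node] * cn[t]
    return total

degree = {g: len(nbrs) for g, nbrs in GRAPH.items()}
print(degree["KRAS"])               # 2 neighbors
print(betweenness(GRAPH, "RAF1"))   # 4: RAF1 sits mid-pathway
```

On this toy pathway, a mid-chain gene like RAF1 has high betweenness while an endpoint like MAPK1 has zero, which is exactly the "hub" signal the feature is meant to capture.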

The scientists found many features were correlated with each other and utilized principal component analysis to decorrelate features.
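A minimal sketch of that decorrelation step: project centered data onto the eigenvectors of its covariance matrix, which yields components with (near-)zero pairwise covariance. This is plain PCA in NumPy on synthetic data, not the paper's actual pipeline.

```python
import numpy as np

# Build toy data where feature 2 is almost a copy of feature 1,
# so the raw features are strongly correlated.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
X = np.column_stack([x,
                     2 * x + rng.normal(scale=0.1, size=200),
                     rng.normal(size=200)])

Xc = X - X.mean(axis=0)                    # center each feature
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)     # eigendecomposition (symmetric)
components = Xc @ eigvecs                  # decorrelated principal scores

new_cov = np.cov(components, rowvar=False)
off_diag = new_cov - np.diag(np.diag(new_cov))
print(bool(np.abs(off_diag).max() < 1e-8))  # True: components uncorrelated
```

The random forest then trains on `components` instead of the raw, redundant features.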

Ultimately, they found that target characteristics played the greater role in the random forest’s classification decisions.

In another paper, scientists integrated inhibition assay data and cell expression data with traditional small molecule data to classify whether two small molecules share the same protein target, using Bayesian optimization.


Protein Representations

As you saw in the beginning, target acquisition and validation are the first steps in drug discovery. If the small molecule is targeting a protein-protein interaction, then multiple protein surfaces must be characterized in order to design an efficacious small molecule.

The protein folding problem, however, has been extremely hard to solve. Check out my article here to learn more about existing methods.

Recent cutting-edge studies include UniRep, where scientists in the George Church lab at Harvard utilized an mLSTM (multiplicative LSTM) to acquire a different type of representation for proteins. By feeding sequences into the recurrent model, they were able to take the resulting hidden-state vector and use it as a representation.

The main way that protein structures are being brought into machine learning for drug discovery is through sequence representations. Sequence data are far more numerous than solved structures, and they implicitly encode partial structural information that machine learning models can learn to extract.
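As a toy stand-in for a learned sequence representation, here is k-mer counting, which maps a variable-length amino acid sequence to a fixed-length vector. Learned embeddings like UniRep's hidden state are far richer; this only illustrates the sequence-to-vector idea, and the example sequence is arbitrary.

```python
# Fixed-length featurization of a protein sequence via k-mer counts.

from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # the 20 standard residues

def kmer_vector(seq, k=2):
    """Count vector over all 20^k possible k-mers, in a fixed order."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    vocab = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
    return [counts[kmer] for kmer in vocab]

vec = kmer_vector("MKTAYIAKQR", k=2)
print(len(vec), sum(vec))   # 400 9: 20^2 possible 2-mers, 9 overlapping ones
```

Any downstream model (a binding-energy regressor, say) can consume this vector regardless of the original sequence length, which is the same contract a learned embedding provides.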

A recent paper used amino acid sequences as another input into a transformer-based architecture to determine the binding energy of a small molecule to its protein target.


Protein and RNA biologic discovery with machine learning

The astronomical number of possible sequences with just a few nucleotides shows how hard it is to navigate the fitness landscape. Machine learning algorithms that allow us to more quickly traverse the fitness landscape will revolutionize the way that we think about not only protein and RNA biologic drug discovery, but also regular small molecule drug discovery.

Navigating evolutionary landscapes with machine learning

Just consider a 15 nucleotide sequence of RNA. There are 4¹⁵ possible sequences that could emerge — that’s more than one billion. And that's only for nucleic acids, which have 4 possible nucleotides. Protein sequences, on the other hand, have 20 possible amino acids, so the entire sequence space would consist of 20¹⁵ possible sequences for a fifteen amino acid sequence.
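The arithmetic above checks out exactly:

```python
# Sequence-space sizes from the paragraph above, computed exactly.

rna_space = 4 ** 15        # 15-nucleotide RNA sequences (4-letter alphabet)
protein_space = 20 ** 15   # 15-residue peptides (20-letter alphabet)

print(rna_space)                     # 1073741824, just over a billion
print(protein_space // rna_space)    # 5**15: the protein space is
                                     # ~30 billion times larger
```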

Machine learning models are being used to more easily navigate these fitness landscapes to converge on sequences that satisfy the user-defined goal (whether that be binding affinity to a protein or a physical property like fluorescence).

More specifically, here are two types of models being utilized:

  • Gaussian process models — these models are characterized by a prior distribution and a posterior distribution over functions. Fit to a dataset of input features and labels, they model uncertainty across the feature space. A Gaussian process is specified by a mean function and a covariance function (the kernel), which encodes how similar the outputs at two inputs are expected to be based on their distance in feature space.
  • Variational autoencoders — these models are used as a form of representation learning, where the model compresses a given sequence or structure into a latent space whose variables are learned (generally nonlinear) combinations of the original features. These latent variables are often more conceptually valuable than the original variables themselves.

VAEs can also be used to capture phylogenetic relationships between proteins and to assess similarities and differences in function.
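A minimal Gaussian process regression sketch in NumPy shows the prior/posterior machinery described above: an RBF kernel plays the role of the covariance function, and conditioning on a few observed points yields a posterior mean plus an uncertainty that grows away from the data. The 1-D "fitness" observations below are invented.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0):
    """Squared-exponential covariance between two 1-D input arrays."""
    d2 = (a[:, None] - b[None, :]) ** 2
    return np.exp(-0.5 * d2 / length_scale ** 2)

# Three observed (position, fitness) points on a toy 1-D axis.
X_train = np.array([0.0, 1.0, 3.0])
y_train = np.array([0.0, 0.8, 0.1])
X_test = np.array([1.0, 5.0])        # one seen point, one far from all data
noise = 1e-6                         # jitter for numerical stability

K = rbf_kernel(X_train, X_train) + noise * np.eye(len(X_train))
K_s = rbf_kernel(X_train, X_test)
K_ss = rbf_kernel(X_test, X_test)

# Standard GP posterior conditioning on the training data.
alpha = np.linalg.solve(K, y_train)
mean = K_s.T @ alpha
cov = K_ss - K_s.T @ np.linalg.solve(K, K_s)
std = np.sqrt(np.clip(np.diag(cov), 0.0, None))

print(round(float(mean[0]), 2))   # 0.8: recovers the observed value at x=1
print(bool(std[0] < std[1]))      # True: uncertainty grows far from the data
```

That calibrated uncertainty is what makes GPs useful for guiding which sequence to test next (active learning on the fitness landscape).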


  • Repository of DNA and RNA binding trained models for genomics — Kipoi
  • RNA splicing classification GitHub code in Keras (convolutional neural network (CNN))

Using ML to Understand Biological Mechanisms

Understanding actual biological mechanisms is key to unlocking new therapeutics. That means drawing on multi-omics data and protein-protein interaction (PPI) research, and confronting the ‘undruggability’ problem: many targets are intrinsically disordered, lack a defined binding site, or are controlled allosterically.

One of the ways that we’ve been working to understand disease mechanisms is by building disease modules, which help us understand how certain genes are clustered together in different tissues, as well as abnormalities in gene expression between normal and diseased individuals. The expression data used here is transcriptomic RNA-sequencing data.

We’re beginning to gain the ability to interpret the successive layers of neural networks, with deeper layers uncovering deeper levels of biological complexity.

Of course, there’s also the possibility that we’ll develop tools to gather more robust kinds of data. Recently, scientists developed a data-gathering method called Prox-seq that measures the proximity of proteins to each other within the context of a cell: antibodies carrying single-stranded DNA tags are made to bind specific regions on proteins, and the tags hybridize when two antibodies come into close contact. This will give us richer data for drug studies of protein-protein interactions (PPIs).



ML in Biophysics

Understanding biological mechanisms means understanding the detailed and complex interactions that take place between the macromolecules in our cells.

We normally assess the binding between, say, a peptide and a protein through a process called docking, often refined with molecular dynamics: extremely computationally intensive simulations in which even a microsecond of simulated time can take weeks to compute. Thus, we’re unable to simulate more biologically relevant timescales (milliseconds).

Simple feedforward neural networks are allowing us to take the high-dimensional feature spaces of molecular dynamics simulations and extract simpler, more abstract features that are more easily interpretable.

Simulations on even larger timescales (i.e. the one-second timescale on which amyloid aggregation occurs) will allow for the modeling of real-time pathology → this could enable a whole new era of disease research and an incredible understanding of chronic diseases like cancer and neurodegenerative disorders.

Company Case Study — Atomwise

Atomwise is using convolutional neural networks to acquire a receptive field that captures interactions between atoms in a protein structure (structure-based drug discovery).

Just like CNNs for facial recognition take features like edges and curved lines and compose them into more abstract features (eyes, a nose), Atomwise’s software takes different types of chemical interactions, such as pi-stacking and hydrogen bonds, and composes them into more abstract structures, such as different binding pockets.
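The compositional idea can be sketched with a toy 1-D convolution: a small filter slides over a sequence of encoded interaction types and fires where a local motif appears, and pooling then reports whether the motif occurred anywhere. Atomwise's actual models are 3-D CNNs over voxelized structures; the interaction sequences and filter below are invented for illustration.

```python
# Toy 1-D convolution over one-hot-encoded chemical interaction types.

INTERACTIONS = ["hbond", "pi_stack", "hydrophobic"]

def one_hot(seq):
    return [[1.0 if t == k else 0.0 for k in INTERACTIONS] for t in seq]

def conv1d_max(seq, kernel):
    """Slide kernel over a one-hot sequence, ReLU, then global max-pool."""
    x = one_hot(seq)
    width = len(kernel)
    best = 0.0
    for i in range(len(x) - width + 1):
        s = sum(w * v
                for row, wrow in zip(x[i:i + width], kernel)
                for v, w in zip(row, wrow))
        best = max(best, max(s, 0.0))   # ReLU, then max over positions
    return best

# A filter that detects an hbond immediately followed by a pi-stack.
motif_filter = [[1.0, 0.0, 0.0],   # position 0 should be hbond
                [0.0, 1.0, 0.0]]   # position 1 should be pi_stack

pocket_a = ["hydrophobic", "hbond", "pi_stack", "hbond"]
pocket_b = ["hydrophobic", "hydrophobic", "hbond"]

print(conv1d_max(pocket_a, motif_filter))   # 2.0: motif present
print(conv1d_max(pocket_b, motif_filter))   # 0.0: motif absent
```

Stacking layers of such filters is what lets a network compose pairwise interactions into pocket-scale patterns.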


Undruggable Targets

Discovering a drug isn’t as simple as choosing an important protein participating in a vital signal transduction cascade — that’s what makes diseases like cancer so hard to treat. Instead, many targets have become known as ‘undruggable’ — this is the next main challenge that we must face in drug discovery.

Daphne Koller — “Many (perhaps most) of the “low-hanging fruit” — druggable targets that have a significant effect on a large population — have been discovered. If so, then the next phase of drug development will need to focus on drugs that are more specialized — whose effects may be context-specific, and which apply only to a subset of patients.”

KRAS and MYC are probably two of the most notorious undruggable targets in all of biomedicine — they just might be the solution to cancer.

MYC is a transcription factor that plays a key role in cell proliferation, leading to tumor formation and metastasis. KRAS and the other proteins in the RAS family are a set of molecular switches that interact with downstream effectors to regulate cytoplasmic signaling. Finally, these signals propagate and activate transcription factors that turn on gene expression.

The reason why proteins like MYC and KRAS are undruggable is that these proteins either

  1. Function through protein-protein interactions
  2. Are intrinsically disordered, meaning they don’t have a rigid, static structure (a substantial fraction of proteins contain such disordered regions)

Everything you’ve learned here about machine learning’s role in our understanding of pathways and protein structure is contributing to the effort to decipher these undruggable targets.

More specifically, we’re using artificial intelligence to uncover new protein-protein interactions that we can then use to design novel peptide therapeutics.


  • Delphi — PPI prediction using sequence information alone


Ultimately, deep learning will revolutionize the way that we think about medicine — from helping us understand our connectome to giving us a more thorough understanding of biological pathways and mechanisms.

Nevertheless, a number of problems remain to be solved.

In Deep Genomics’ review “Deep learning in biomedicine,” Wainberg et al. bring up the need for model transparency and accountability toward the various stakeholders involved in disease diagnosis and therapeutic treatment (including friends, family, physicians, labs, and companies).

DNN transparency is something we need to keep improving — not only for the sake of transparency itself, but also to allow for causal statements that justify a model’s classification and help avoid confusion with confounding variables.

Thanks for reading! Feel free to check out my other articles on Medium and connect with me on LinkedIn!

If you’d like to discuss any of the topics above, I’d love to get in touch with you! I’m currently trying to optimize ML models to make predictions about efficient aptamer drugs for coronavirus protein targets (the N protein). Send me an email or message me on LinkedIn, and feel free to check out my website.

Please sign up for my monthly newsletter here if you’re interested in following my progress :)

Book Recs

  • Deep Learning for the Life Sciences — Bharath Ramsundar
  • Deep Medicine: How Artificial Intelligence Can Make Healthcare Human — Eric Topol

Key Takeaways

  • Machine learning is helping us understand the relationship between structure and function, both for protein targets and for the small molecules themselves — mainly through computing abstract representations.
  • Navigating physicochemical space manually is time- and cost-intensive. ML, and a new revolution in active learning, is allowing models to help design the experiments themselves in order to minimize the time and money spent.
  • Biological systems are complex, with dynamic relationships between nodes and interactions. While we cannot possibly know and measure every variable, ML allows us to extract what’s important.



Mukundh Murthy

Innovator passionate about the intersection between structural biology, machine learning, and cheminformatics. Currently @ 99andbeyond.