Takeaways from the Harvard-Stanford Symposium on Drug Discovery

Mukundh Murthy
Dec 3, 2020

2020 has objectively been a hard time for science, with the rearrangement of project priorities and funding and the lack of serendipitous in-person conversations at conferences. We won’t have pictures like those taken at the Solvay conference, scientists shoulder to shoulder. However, we do have Zoom screenshots, and that’ll do :)

I recently had the amazing opportunity to attend the Harvard-Stanford Symposium on Drug Discovery! Topics ranged from domain extrapolation in chemical spaces to protein-protein interaction (PPI) prediction studies and the design of COVID diagnostics. Below are summaries of the individual talks.

Regina Barzilay — MIT

Professor Barzilay outlined four components to improving the use of AI in drug discovery:

  • Representation
  • Generalization in Low Resource Training
  • Uncertainty Estimation
  • Mechanism Understanding


Molecules have many different representations. Here are some of the most prominent.

  • SMILES string → a string-based representation of molecules
  • ECFP Bit Fingerprint → A vector made of zeros and ones. Each element in the vector represents the presence of a particular molecular substructure.
  • RDKit descriptors → RDKit descriptors are engineered features. That means they are specific properties computed from the molecular structure as opposed to the structure itself. Descriptors include information on the molecule’s polarity, hybridization, and physical properties like LogP (a measure of hydrophobicity).
  • Graphs → Graphs are simple network structures with atoms as nodes and bonds as edges.
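To make the list above concrete, here is a toy sketch of one molecule (ethanol) in each form. Only the SMILES string is standard. The graph, the hashed fingerprint, and the descriptors are hand-built stand-ins that I made up for illustration (`toy_fingerprint` and `toy_descriptors` are hypothetical names), and in practice you would get the real versions from a cheminformatics toolkit such as RDKit.

```python
import zlib

# Ethanol as a SMILES string.
smiles = "CCO"

# The same molecule as a graph: atoms are nodes, bonds are edges
# (hydrogens are left implicit, as they are in the SMILES string).
atoms = ["C", "C", "O"]
bonds = [(0, 1), (1, 2)]


def neighbors(i, bonds):
    """Indices of the atoms bonded to atom i."""
    return [b if a == i else a for a, b in bonds if i in (a, b)]


def toy_fingerprint(atoms, bonds, n_bits=32):
    """Toy stand-in for an ECFP-style fingerprint (NOT the real algorithm).

    Each atom's radius-1 environment (its symbol plus its sorted neighbor
    symbols) is hashed to one position in a fixed-length bit vector.
    """
    bits = [0] * n_bits
    for i, symbol in enumerate(atoms):
        env = symbol + ":" + "".join(sorted(atoms[j] for j in neighbors(i, bonds)))
        bits[zlib.crc32(env.encode()) % n_bits] = 1
    return bits


def toy_descriptors(atoms, bonds):
    """Toy engineered features: whole-molecule counts, not the structure itself."""
    return {
        "heavy_atoms": len(atoms),
        "heteroatoms": sum(1 for a in atoms if a != "C"),
        "bonds": len(bonds),
    }


fingerprint = toy_fingerprint(atoms, bonds)
descriptors = toy_descriptors(atoms, bonds)
print(smiles, fingerprint, descriptors)
```

Notice that the fingerprint and the descriptors both throw away which atom is connected to which, while the graph keeps it.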

Barzilay’s lab has been working on the graph representation of molecules as it is the ‘raw’-est representation of molecules. That means that since the representation has not been manipulated from the original molecular structure, models can learn the most information from the graphs.

The problem comes, however, in turning the raw graph representation of molecules into numerical vectors and matrices that models can then interpret.

An easy way to aggregate across nodes is to featurize each atom as a vector based on specific atomic properties and its environment, and then simply sum the vectors for all atoms in a molecule. However, when this is done, much of the information about the molecular structure is lost.
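Here is a small sketch of that information loss. The atom features (an is-carbon flag and the number of bonded neighbors) are a deliberately crude choice of mine, not the featurization any real model uses. With them, butane and isobutane, which are different molecules, collapse to the same summed vector.

```python
def degree(i, bonds):
    """Number of bonds that atom i takes part in."""
    return sum(1 for a, b in bonds if i in (a, b))


def atom_features(atoms, bonds):
    """Toy per-atom feature vector: [is_carbon, degree]."""
    return [[1 if s == "C" else 0, degree(i, bonds)] for i, s in enumerate(atoms)]


def sum_pool(features):
    """Sum the per-atom vectors column by column into one molecule vector."""
    return [sum(column) for column in zip(*features)]


# Heavy-atom graphs: butane is a straight chain, isobutane is branched.
butane = (["C"] * 4, [(0, 1), (1, 2), (2, 3)])
isobutane = (["C"] * 4, [(0, 1), (0, 2), (0, 3)])

butane_vec = sum_pool(atom_features(*butane))
isobutane_vec = sum_pool(atom_features(*isobutane))
print(butane_vec, isobutane_vec)  # both are [4, 6]
```

Richer atom features make collisions like this rarer, but the sum still discards how the atoms are arranged relative to one another.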

The Barzilay lab at MIT is using a distance metric called the Wasserstein distance to aggregate node embeddings. (Here’s a helpful video on Wasserstein distances). Instead of summing the node embeddings over all atoms and sending the result into an MLP, they add trainable parameters in the form of point clouds of embeddings. The final molecule embedding is then built from the Wasserstein distance between each point cloud and the set of node embeddings for the molecule, giving one distance per point cloud. Each distance is found by minimizing the total transport cost: a transport plan (a set of values that dictate how much of each node embedding is matched to each embedding in the point cloud) is multiplied, entry by entry, by the L2 distance between the corresponding node embedding and point-cloud embedding, and the products are summed.
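To get a feel for this pooling step, here is a minimal sketch under strong simplifying assumptions of mine: the point clouds are fixed rather than trained, every cloud has exactly as many points as the molecule has atoms, and all points carry equal weight. In that special case the best transport plan is a one-to-one matching, so brute force over permutations gives the exact answer for tiny inputs. The function names are hypothetical, and this is not the lab's implementation.

```python
from itertools import permutations
from math import dist  # Euclidean (L2) distance, Python 3.8+


def wasserstein_equal_size(cloud_a, cloud_b):
    """Exact Wasserstein distance between two equal-size, uniformly weighted
    point clouds, using the L2 distance as the ground cost.

    With equal sizes and uniform weights, an optimal transport plan is a
    one-to-one matching, so we try every matching and keep the cheapest.
    Only feasible for a handful of points.
    """
    n = len(cloud_a)
    if n != len(cloud_b):
        raise ValueError("this sketch assumes equal-size point clouds")
    best_total = min(
        sum(dist(cloud_a[i], cloud_b[match[i]]) for i in range(n))
        for match in permutations(range(n))
    )
    return best_total / n  # each matched pair carries mass 1/n


def prototype_pool(node_embeddings, point_clouds):
    """Molecule embedding: one Wasserstein distance per point cloud."""
    return [wasserstein_equal_size(node_embeddings, cloud) for cloud in point_clouds]


# Three 2-D node embeddings for a made-up molecule.
node_embeddings = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]

# Two made-up point clouds: the same points reordered, and a shifted copy.
point_clouds = [
    [(1.0, 0.0), (0.0, 1.0), (0.0, 0.0)],
    [(3.0, 4.0), (4.0, 4.0), (3.0, 5.0)],
]

pooled = prototype_pool(node_embeddings, point_clouds)
print(pooled)
```

The first entry is zero because the first cloud is just the node embeddings in a different order, which shows that the distance does not depend on atom ordering. In the real method the clouds are learned, so each entry ends up measuring how far the molecule is from a learned pattern.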

Andreas Bender — AstraZeneca

Visual from Andreas Bender’s talk

This was a great bird’s eye view talk about drug discovery. Here are my key takeaways.

Question → Data → Representation → Method

In drug discovery and in machine learning in general, it’s easy to fall prey to starting with the data, or even with a method, instead of starting with the fundamental question that you’re trying to answer.

When we do this, results often only display marginal improvements over existing results. Ultimately, only new questions can result in the breakthroughs we’re looking for.

Going back to the visual above, you can see that there are three points in the pyramid of drug discovery (molecular structure, protein/mode of action, and phenotype). Right now, the bridge between molecular structure and protein mode of action is somewhat solid; however, the other two bridges (phenotypic response data, and pathways such as signaling or target-disease association) are not.

Model Validation is Process Validation

The way we validate models prospectively (even when using supposedly unbiased validation sets such as MUV or, more recently, LIT-PCBA) is inherently flawed since we don’t have an underlying understanding of the chemical space, so we’re essentially trading one type of bias for another.

Also, even if we were to design an entirely unbiased dataset, the endpoints that we’re using are proxies. So we’re essentially training a model within a larger conceptual model. Even if we were to optimize for a particular metric such as binding affinity and generalize across all of chemical space, we would still need to traverse the biological chasm (the bottom bridge of the pyramid from protein to phenotype).

This is where I believe companies like Cellarity are changing the paradigm around model endpoints. Instead of having the endpoint represent more arbitrary quantities like LogD or IC50/inhibition, we can instead model cellular phenotypes and clinical outcomes directly, and in doing so, we can harness biological complexity.

James Collins — BROAD Institute

Dr. Collins showed several innovations in point-of-care testing for COVID-19, including a diagnostic face mask and other tests using Cas13. However, the work I found more interesting was a collaboration with the Barzilay lab, attempting to discover new antibiotics to overcome antibiotic resistance in bacteria.

Visual from Dr. Collins’s talk

The group trained Chemprop models on a binary classification task, using around 2K molecules, to predict whether a molecule was bactericidal against E. coli. After training, they applied the model to the Broad Institute’s Drug Repurposing Hub and ranked molecules based on three criteria.

  1. The molecule’s structural similarity to known antibiotics (using a Tanimoto threshold; as I understood it, low similarity was preferred so that hits would be structurally new)
  2. The molecule’s toxicity according to ClinTox
  3. And most importantly, its predicted potency as an antibacterial
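As a rough sketch of how a ranking like this could be wired together: the Tanimoto coefficient between two fingerprints is the number of shared on-bits divided by the number of on-bits in either. Everything else below is my own illustration, not the group's pipeline. The candidate records, the 0.4 threshold, and the `shortlist` helper are hypothetical, and I am assuming that candidates too similar to a known antibiotic get filtered out.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bit indices."""
    a, b = set(fp_a), set(fp_b)
    union = a | b
    # Two empty fingerprints are treated as dissimilar here; conventions vary.
    return len(a & b) / len(union) if union else 0.0


def shortlist(candidates, known_antibiotics, max_similarity=0.4):
    """Keep non-toxic candidates unlike every known antibiotic,
    ordered from highest to lowest predicted activity.

    `max_similarity` is an arbitrary illustrative threshold.
    """
    kept = [
        c for c in candidates
        if not c["predicted_toxic"]
        and all(tanimoto(c["fingerprint"], k) <= max_similarity for k in known_antibiotics)
    ]
    return sorted(kept, key=lambda c: c["predicted_activity"], reverse=True)


# Made-up fingerprints and model outputs, purely for illustration.
known_antibiotics = [{1, 2, 3, 4}]
candidates = [
    {"name": "A", "fingerprint": {1, 2, 3, 4}, "predicted_activity": 0.90, "predicted_toxic": False},
    {"name": "B", "fingerprint": {7, 8, 9}, "predicted_activity": 0.80, "predicted_toxic": False},
    {"name": "C", "fingerprint": {10, 11}, "predicted_activity": 0.95, "predicted_toxic": True},
    {"name": "D", "fingerprint": {1, 8, 9}, "predicted_activity": 0.60, "predicted_toxic": False},
]

ranked_names = [c["name"] for c in shortlist(candidates, known_antibiotics)]
print(ranked_names)  # ['B', 'D']
```

Candidate A is dropped for matching a known antibiotic exactly and C for predicted toxicity, which leaves B and D ranked by predicted activity.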

Out of the molecules in the repurposing set, they found halicin (previously investigated as a diabetes drug) to be the molecule that satisfied all three criteria. They further validated the molecule through in vivo tests (wound and mouse models) and even identified halicin’s mechanism of action.

Halicin acts by dissipating the pH gradient across the bacterial cell membrane, thereby disrupting the membrane potential and the proton motive force that the cell relies on to synthesize ATP.

Yoshua Bengio — MILA

This was, in my opinion, the most exciting talk of the evening. Yoshua Bengio talked about work being done at MILA where active learning was being used to fight against COVID-19.

The entire workflow looked like this:

primary screen of RNA transcripts → heterograph with targets and molecules (representing biological network) → active learning to suggest new combinations of molecules → do a dose response assay → update knowledge graph and repeat

The really exciting part of this study is the use of active learning to explore uncertain but high-potential parts of the chemical space. The active learning algorithm used was Gaussian process upper confidence bound (GP-UCB). The probability modeled by the Gaussian process was P(synergistic drug effect | certain combination of drugs).
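The core idea of GP-UCB fits in a few lines: fit a Gaussian process to the combinations tested so far, then pick the untested candidate with the highest value of predicted mean plus `beta` times predicted standard deviation. The sketch below is my own toy version in pure Python and makes several assumptions that are not from the talk: combinations are encoded as small made-up feature vectors, the GP regresses a synergy score with a zero prior mean and an RBF kernel, and `beta`, the noise level, and all function names are illustrative choices.

```python
import math


def rbf(x, y, length_scale=1.0):
    """Squared-exponential kernel between two feature vectors."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-0.5 * sq / length_scale ** 2)


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def gp_posterior(X, y, x_new, noise=1e-3):
    """Posterior mean and standard deviation of a zero-mean GP at x_new."""
    K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(X)]
         for i, a in enumerate(X)]
    k_star = [rbf(a, x_new) for a in X]
    mean = sum(k * w for k, w in zip(k_star, solve(K, y)))
    var = rbf(x_new, x_new) - sum(k * w for k, w in zip(k_star, solve(K, k_star)))
    return mean, math.sqrt(max(var, 0.0))


def gp_ucb_pick(X, y, candidates, beta=2.0):
    """Index of the candidate with the highest upper confidence bound."""
    scores = []
    for c in candidates:
        mean, std = gp_posterior(X, y, c)
        scores.append(mean + beta * std)
    return max(range(len(scores)), key=lambda i: scores[i])


# Made-up data: two tested combinations and their measured synergy scores.
X = [(0.0, 0.0), (1.0, 0.0)]
y = [0.2, 0.8]

# Candidates: one already tested, one near the best result, one far from all data.
candidates = [(0.0, 0.0), (1.0, 0.1), (5.0, 5.0)]

print(gp_ucb_pick(X, y, candidates, beta=2.0))  # explores: picks the far-away candidate
print(gp_ucb_pick(X, y, candidates, beta=0.0))  # exploits: picks the one near the best result
```

The `beta` knob is what makes this "uncertain but high potential": a larger value rewards candidates the model knows little about, while a value of zero just chases the best predicted mean.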

Active learning in this project has allowed the scientists to cover a much larger chemical space than a brute-force search could, especially given that there are far more drug combinations than individual drugs.
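A quick back-of-the-envelope calculation shows how fast combinations outgrow single drugs. The library size of 1,000 is an arbitrary number I picked for illustration, not a figure from the talk.

```python
from math import comb

n_drugs = 1000  # hypothetical library size

pairs = comb(n_drugs, 2)    # unordered two-drug combinations
triples = comb(n_drugs, 3)  # unordered three-drug combinations

print(n_drugs, pairs, triples)  # 1000 499500 166167000
```

And that is before accounting for different doses of each drug, which multiplies the count again.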

Visual from Dr. Bengio’s Talk. Graph ML applied on the host interactome and outside perturbations (drug synergies) provides more meaningful representations

Challenges and Future Directions

Deriving an accurate causal graph of the biological mechanism would allow the active learning algorithm to selectively explore parts of the chemical space that are more likely to have therapeutic and phenotypic effects.

However, integrating the multi-omics information (proteomic, transcriptomic, genomic) from a large variety of experiments and settings leads to a large amount of noise in biological datasets, making it harder for models to find a signal. Hopefully, including more logistical features in datasets (such as experiment type, batch number) could help models tease apart noise from signal within each particular subset of the data.
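One simple way to hand that kind of bookkeeping information to a model is to append it to each sample as one-hot features. This is only a sketch of the idea: the sample records, field names, and `with_metadata` helper are hypothetical, and one-hot covariates are just one of several ways to account for batch effects.

```python
def one_hot(value, vocabulary):
    """One-hot encode `value` against an ordered list of possible values."""
    return [1.0 if value == v else 0.0 for v in vocabulary]


def with_metadata(samples):
    """Append one-hot experiment-type and batch features to each sample's measurements."""
    experiment_types = sorted({s["experiment_type"] for s in samples})
    batches = sorted({s["batch"] for s in samples})
    return [
        s["measurements"]
        + one_hot(s["experiment_type"], experiment_types)
        + one_hot(s["batch"], batches)
        for s in samples
    ]


# Made-up samples from two experiment types and two batches.
samples = [
    {"measurements": [0.4, 1.2], "experiment_type": "proteomic", "batch": 1},
    {"measurements": [0.5, 1.1], "experiment_type": "transcriptomic", "batch": 1},
    {"measurements": [0.9, 0.3], "experiment_type": "proteomic", "batch": 2},
]

rows = with_metadata(samples)
for row in rows:
    print(row)
```

With these extra columns, a model at least has the chance to learn that a shift in the measurements comes from the batch rather than from the biology.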

Hey! I’m Mukundh Murthy, a 17 year old passionate about the intersection between machine learning and drug discovery. Thanks for reading this article! I hope you found it helpful :)

Feel free to check out my other articles on Medium and connect with me on LinkedIn!

If you’d like to discuss any of the topics above, I’d love to get in touch with you! (Send me an email at mukundh.murthy@icloud.com or message me on LinkedIn) Also, feel free to check out my website at mukundh-murthy.squarespace.com.

Please sign up for my monthly newsletter here if you’re interested in following my progress :)


