Using CNNs to Identify Transcription Factor Binding Motifs in DNA Sequences

Mukundh Murthy
6 min read · Jan 29, 2020


Imagine a world where artificial intelligence can diagnose an unknown disease and recommend a treatment through genome sequencing combined with electronic health record and time-series biosensor data. This could become a reality.

Each human being produces an unimaginable amount of -omics data every single second. Why not use deep neural networks to analyze the billions of interactions between the different layers of data that make up our health, generating insights that can be used to develop new therapeutics and suggest treatments?

All the layers of -omics data produced by a human being that can be analyzed by DNNs for holistic diagnosis

Before we can even begin to think about this reality, however, we need to understand how to analyze single layers of data and the relationships between them; only then can we compile and integrate all of these sources of data.

I decided to try replicating a CNN created by members of a collaboration between Stanford and the Scripps Institute. In doing so, I analyzed interactions between the human proteome and the genome.

More specifically, I looked at how certain transcription factors (proteins that are essential for the transcription of DNA) selectively bind to very specific sequences of DNA (also known as DNA motifs). This information can be used to explain the biological mechanisms behind diseases and identify new protein drug targets.

Step 1: Data Curation

The first step in the process involved curating the data by converting the genome sequence into a form recognizable as input into the neural network.

I first uploaded my sequences and converted the sequences into a pandas dataframe as shown below.
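As a sketch of this step (the file name, format, and example sequences here are placeholders, since the original data file isn't shown), raw sequences can be read into a pandas DataFrame like this:

```python
import io

import pandas as pd

# Stand-in for a real file with one sequence per line, e.g.
# pd.read_csv("sequences.txt", header=None, names=["sequence"])
raw = io.StringIO("CCGAGGGCTA\nGAGTTTATAT\nCGACCGAACT\n")
df = pd.read_csv(raw, header=None, names=["sequence"])
print(df.head())
```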

Since the independent nitrogenous bases (A, T, C, G) cannot be used as inputs into the neural network, each base needed to be converted into a 4x1 vector made of zeroes and ones in a process called one-hot encoding. For example, the nitrogenous base Adenine would be encoded as the vector [1 0 0 0] while Thymine would be encoded as [0 0 0 1].
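A minimal NumPy implementation of this encoding (assuming the base ordering A, C, G, T, which matches the A → [1 0 0 0] and T → [0 0 0 1] examples above):

```python
import numpy as np

BASES = "ACGT"  # assumed ordering: A, C, G, T

def one_hot_encode(seq):
    """Convert a DNA string into a (len(seq), 4) matrix of zeros and ones."""
    mapping = {base: i for i, base in enumerate(BASES)}
    encoded = np.zeros((len(seq), 4), dtype=np.float32)
    for pos, base in enumerate(seq):
        encoded[pos, mapping[base]] = 1.0
    return encoded

encoded = one_hot_encode("ATCG")
print(encoded)  # first row is A -> [1, 0, 0, 0], second is T -> [0, 0, 0, 1]
```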

Here is an example of a one-hot encoded DNA sequence; the position of the 1 in each column tells us exactly which nucleotide (A, T, C, or G) is being represented

So far, I’d only dealt with the input data and features. The binary output data (whether a particular transcription factor binds to the sequence or not) also needed to be curated. Although the binary data was originally represented as a 0 or 1, each label was then one-hot encoded into a 1x2 vector, where the position of the 1 indicates whether or not the transcription factor binds.
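The label encoding can be sketched the same way (the mapping 0 → [1 0] and 1 → [0 1] is an assumption about the ordering):

```python
import numpy as np

# Binary labels: 1 = the transcription factor binds, 0 = it does not
labels = np.array([0, 1, 1, 0])

# One-hot encode each label into a 1x2 vector: 0 -> [1, 0], 1 -> [0, 1]
one_hot_labels = np.eye(2, dtype=np.float32)[labels]
print(one_hot_labels)
```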

In machine learning, data needs to be randomly split into a training set and a validation (or testing) set. The loss value for each set tells us how well the model performs on that particular set. If the validation loss is substantially greater than the training loss, the model is overfitting: it has memorized the training sequences and will not generalize to sequences outside the training set.

Here the input data and labels are being split into training sets and testing sets.
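With scikit-learn, the split looks like this (the 75/25 ratio and array shapes are placeholders, not the original values):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 100 one-hot encoded sequences of length 50, with binary labels
X = np.random.rand(100, 50, 4)
y = np.random.randint(0, 2, size=100)

# Randomly hold out 25% of the data for validation/testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
print(X_train.shape, X_test.shape)
```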

Step 2: Making and Training the CNN Model

Convolutional Neural Networks are used to extract hidden features within data (in this case, DNA sequences). The data is passed through learned convolutional filters, each of which selects for a particular feature. A max pooling layer then reduces the dimensionality of the data. Finally, after the features are learned, the model can be used to classify the data.
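To make the convolution and max pooling steps concrete, here is a toy NumPy sketch: a single hand-built filter that scores every length-3 window of a one-hot sequence for the (made-up) motif ACG, followed by max pooling:

```python
import numpy as np

# One-hot encode a short sequence (assumed base ordering A, C, G, T)
seq = "ACGTACGT"
mapping = {"A": 0, "C": 1, "G": 2, "T": 3}
x = np.zeros((len(seq), 4))
for i, base in enumerate(seq):
    x[i, mapping[base]] = 1.0

# One convolutional filter of width 3 that "looks for" the motif ACG
filt = np.zeros((3, 4))
filt[0, mapping["A"]] = 1.0
filt[1, mapping["C"]] = 1.0
filt[2, mapping["G"]] = 1.0

# Valid 1D convolution (cross-correlation, as in deep learning frameworks):
# slide the filter along the sequence and sum the elementwise products
conv = np.array([np.sum(x[i:i + 3] * filt) for i in range(len(seq) - 3 + 1)])
print(conv)  # peaks at 3.0 wherever ACG occurs

# Max pooling with window 2: keep only the strongest activation per window
pooled = conv[: len(conv) // 2 * 2].reshape(-1, 2).max(axis=1)
print(pooled)
```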

First, I defined the CNN model that would be used for training.

This CNN model consists of five main layers. The convolutional layer is used for extracting features from the input, and the max pooling layer is used to consolidate that information. The dense layers are used to predict whether or not a particular DNA sequence is a binding site for a transcription factor.
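A minimal Keras sketch of such a five-layer architecture (the filter count, kernel size, input length, and dense-layer sizes below are illustrative assumptions, not the original hyperparameters):

```python
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    tf.keras.Input(shape=(50, 4)),                         # 50-bp one-hot sequence
    layers.Conv1D(32, kernel_size=12, activation="relu"),  # feature extraction
    layers.MaxPooling1D(pool_size=4),                      # consolidate information
    layers.Flatten(),
    layers.Dense(16, activation="relu"),
    layers.Dense(2, activation="softmax"),                 # binds / does not bind
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
model.summary()
```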
A plot of training vs validation loss

I then plotted the loss (binary cross-entropy) for the validation vs. the training data sets. You can see that at some point the two curves cross and the validation loss becomes greater than the training loss, which means the model is becoming too specific to the sequences in the training set. The goal is to capture the model’s parameters near that crossover point, while the validation loss is still low, so that the model generalizes to sequences outside the training set.

Step 3: Visualizing Final Accuracy and Measuring Saliency

I then outputted a “confusion matrix,” which can be used to visualize the overall accuracy of the CNN (it captures the number of true positives, false positives, true negatives, and false negatives within a data set). The true positive rate was 98%, meaning the neural network correctly classified DNA sequences that bind transcription factors 98% of the time!
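With hypothetical labels and predictions (the values below are made up for illustration), the confusion matrix and true positive rate can be computed like this:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and model predictions for 10 sequences
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 1])
y_pred = np.array([1, 0, 1, 1, 0, 1, 1, 0, 0, 1])

cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()  # scikit-learn's binary ordering: tn, fp, fn, tp
print(cm)
print(f"true positive rate: {tp / (tp + fn):.2f}")
```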

Now here’s where things get interesting. We know that the CNN accurately classifies the DNA sequences 98% of the time, but we don’t know how the CNN reaches this conclusion, or any essential biological mechanisms that might be at play. This is commonly known as the “Black Box” problem within AI, where models are opaque and it is difficult to interpret how the model reached its conclusion.

Fortunately, with this dataset and model, I was able to partially deduce a biological mechanism for transcription factor binding by measuring saliency values, which essentially measure how “important” a particular nucleotide is for transcription factor binding. This method can be used to elucidate particular “motifs” where the transcription factors bind.

Saliency is calculated by backpropagating gradients all the way to the input: instead of using the gradients to update the network’s weights (as in training), we compute the gradient of the model’s output with respect to each input nucleotide. The magnitude of that gradient is the “saliency” of that nucleotide.
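The idea can be illustrated with a toy linear “model” in NumPy (the weights here are random placeholders, not learned values): for a linear score, the gradient with respect to the input is just the weight matrix, so the saliency at each position is the gradient magnitude at the base actually present there.

```python
import numpy as np

rng = np.random.default_rng(0)

# One-hot sequence of length 6 (assumed base ordering A, C, G, T)
x = np.zeros((6, 4))
for i, base in enumerate([0, 2, 1, 3, 0, 2]):
    x[i, base] = 1.0

W = rng.normal(size=(6, 4))  # stand-in for "learned" weights

score = np.sum(W * x)        # toy linear model: score = sum(W * x)
grad = W                     # d(score)/dx is exactly W for a linear model

# Gradient-times-input saliency: magnitude at the observed base per position
saliency = np.abs(grad * x).sum(axis=1)
print(saliency)              # one importance value per sequence position
```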

Saliency Values are shown above. The columns highlighted in orange correspond to the nucleotides with the greatest saliency values (those that compose a DNA transcription factor binding motif)

After calculating the saliency magnitudes for the bases, I identified CGACCGAACTC as the most common binding motif for transcription factors.

Future Directions and Key Takeaways

Models like these can be used to identify the mechanism by which specific coding and noncoding variants act. For example, a variant that disrupts the CGACCGAACTC binding motif could result in the upregulation or downregulation (gain or loss of function) of a particular gene.

Nevertheless, it’s important to acknowledge that these are relatively simplified models. Genes operate in highly complex gene regulatory networks, and in order to gain real insights that incorporate multi-faceted medical data, we’ll need to make greater sense of these networks to understand how genes interact with each other and with other aspects of our proteome and metabolome.

Thanks for reading! Feel free to check out my other articles on Medium and connect with me on LinkedIn!

If you’d like to discuss any of the topics above, I’d love to get in touch with you! (Send me an email at mukundh.murthy@icloud.com or message me on LinkedIn)

Please sign up for my monthly newsletter here if you’re interested in following my progress :)
