Deep-Learning for Computational Genomics
Written by Nicolas Kim '26
Edited by Surya Khatri '24
Depiction of a neural network
With diseases such as cancer, diabetes, and Alzheimer’s having an increasingly pervasive impact, have you ever wondered how these diseases afflict countless individuals who (outwardly, at least) showed no propensity for them? The answer lies in the same mechanism that explains everything from why you look so similar to your parents, to why you feel tired when your sleep schedule is thrown off-kilter, to why your knees scab over when you scrape them.
What unites these highly disparate phenomena is that, at their core, they are controlled by your body’s genetic information and by the products encoded by the specific genes that comprise it. More specifically, the regulation of these genes, and the corresponding fluctuations in their expression, changes the amount of protein each gene produces, which in turn decides whether or not a cellular response occurs through the action of these encoded proteins.
Many factors are involved in regulating gene expression, ranging from mutations that disrupt a particular DNA sequence to a malfunctioning protein. One prominent group of regulators, however, is the collection of enzymes that modify histones, the proteins around which strands of DNA are spooled to form condensed chromosomes in the nucleus. The modification of these proteins by the addition or removal of chemical groups is an epigenetic change (i.e., a change that affects gene expression levels without altering the underlying DNA sequence) that in turn modifies the accessibility of DNA to other proteins that can transform the information it encodes into proteins. For example, the acetylation of histones makes DNA less tightly bound to them, leaving it accessible to DNA-binding proteins that can increase levels of gene expression.
Beyond the general role of histone modifications in controlling gene expression, different histone modifications have been found to have different effects on the regulation of certain genes, and the malfunctioning of histone regulation is implicated in cancer. Now, with advancements in artificial intelligence, several computational models have been developed to predict gene expression levels from different histone modifications.
Current Advancements: DeepChrome as a Computational Tool
Some existing computational approaches employ Random Forests, Support Vector Machines, and linear regression models to represent the relationships between differential histone modifications and levels of gene expression. So far, however, such models have failed to capture the subtleties in how a histone modification (represented by a “signal” in sequencing data) is distributed around a given genomic region. Instead, they either average the “signal” of a modification over the entire region (as in past uses of linear regression models) or select only the most “relevant” parts of the region for a given signal (as in the case of Support Vector Machines) [2,3].
DeepChrome, however, is a newer model rising to the fore of this subfield of computational genomics, using deep learning methods to map the modification profiles in large-scale input data to different levels of gene expression. Unlike the aforementioned methods, DeepChrome incorporates the full representation of a given genomic region into its analysis, automatically modeling the interactions between different parts of the region for a given signal, for any given input across cell types.
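To see why averaging a signal over a whole region loses information, consider the minimal sketch below. The read counts are made up for illustration, and the function name region_average is hypothetical; the sketch is not taken from any of the cited models.

```python
import numpy as np

# Hypothetical read counts for one histone mark across 10 bins of a region
# (made-up numbers, for illustration only).
peak_near_start = np.array([9, 8, 1, 0, 0, 0, 0, 0, 1, 1], dtype=float)
peak_near_end = np.array([1, 1, 0, 0, 0, 0, 0, 1, 8, 9], dtype=float)


def region_average(signal):
    """Collapse a binned signal into a single mean value."""
    return float(signal.mean())


# Both profiles have the same average, so an averaged feature cannot tell
# them apart, even though the signal sits in different parts of the region.
same_average = region_average(peak_near_start) == region_average(peak_near_end)
same_profile = np.array_equal(peak_near_start, peak_near_end)
print(same_average, same_profile)  # True False
```

A model that only ever sees the average treats these two regions as identical, while a model that sees every bin can still distinguish them.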
DeepChrome itself uses a convolutional neural network (CNN) – a type of neural network often used to detect specific features in visual data to perform tasks like face recognition or image classification – for its implementation. Like any other neural network, a CNN seeks to make a computer perform a task after being “trained” on representative data for executing said task.
Neural networks themselves are composed of thousands of interconnected nodes, or artificial neurons, that seek to mimic the interactions between real neurons in the human brain. These nodes are subdivided into distinct layers, each of which transforms its input data and spits out an output. Much like how the information from a firing human neuron is relayed to others for the completion of a task, each layer of neurons in a CNN feeds into the next layer (in other words, one layer’s output data becomes the next layer’s input data) until the layers terminate in a final output layer that represents the network’s prediction for that input, shaped by what the network learned from its training data.
For a CNN, at each convolutional layer, a filter of numeric weights, which seeks to detect a particular feature in the input data, is applied to the matrix representing the layer’s input. This filter is convolved (in other words, “moved”) across the input matrix. At each position, its elements are multiplied by the corresponding elements of the input matrix and the products are summed into a single number, so that a smaller output map is produced from the original input. Each value in this output map is then passed through an activation function, which decides how much of that value, if any, is sent along as input to the next layer; a common choice simply sets every value below a threshold to zero. In effect, this step models whether or not a real human neuron should “fire” in response to a signal (in this case, the previous layer’s output) being passed along to it. These two steps in tandem (filtering the input data and applying an activation function), together with a few final steps that apply other functions and operations, ultimately lead to a final output framed in terms of the task that the CNN seeks to accomplish.
CNNs “learn” as they are trained on more and more data, adjusting the values of the weights and thresholds involved in convolution as they go so that their predictions steadily improve. DeepChrome in turn uses this underlying logic to solve complex genomics problems, extracting relevant features from the histone modification data to determine which particular modification profiles correspond to higher or lower levels of gene expression.
To do so, DeepChrome takes as its input large amounts of data representing histone modification profiles across different cell types and genomic regions. Each input is represented as a matrix in which each row represents a different histone modification and each column is a separate “bin” subdividing the genomic region to be evaluated. DeepChrome feeds this input through multiple layers (each time taking the part of the input “most” correlated with gene expression to be the next input) and formulates the question of how these profiles influence gene expression as a binary classification task: any given pattern of histone modifications, including the interactions between different modifications, is classified as corresponding to either high (+1) or low (-1) gene expression, and the model outputs that value.
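The idea of one layer’s output becoming the next layer’s input can be sketched in a few lines. This is a toy illustration, not DeepChrome’s architecture: the layer sizes, the random weights, and the helper name dense_layer are all assumptions made for the example.

```python
import numpy as np


def dense_layer(inputs, weights, bias):
    """One layer: a weighted sum of the inputs plus a bias, then ReLU."""
    return np.maximum(0.0, weights @ inputs + bias)


rng = np.random.default_rng(0)
x = rng.normal(size=4)  # made-up input with 4 features

# Hypothetical layer sizes: 4 inputs -> 3 hidden nodes -> 2 output nodes.
w1, b1 = rng.normal(size=(3, 4)), np.zeros(3)
w2, b2 = rng.normal(size=(2, 3)), np.zeros(2)

hidden = dense_layer(x, w1, b1)       # output of layer 1 ...
output = dense_layer(hidden, w2, b2)  # ... becomes the input of layer 2
print(hidden.shape, output.shape)  # (3,) (2,)
```

Real networks stack many more layers and learn their weights from data rather than drawing them at random, but the chaining pattern is the same.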
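The filtering and activation steps can be written out by hand on a tiny example. The sketch below assumes a filter that spans every row of the input and slides only along the columns, and it uses ReLU (which sets negative values to zero) as the activation; the input values, the filter weights, and the name convolve_bins are illustrative choices, not details from the DeepChrome paper. As in most deep learning libraries, the filter is not flipped before it is applied.

```python
import numpy as np


def convolve_bins(x, kernel, bias=0.0):
    """Slide `kernel` along the columns of `x`; each position yields one number."""
    k = kernel.shape[1]
    n_positions = x.shape[1] - k + 1
    out = np.empty(n_positions)
    for i in range(n_positions):
        window = x[:, i:i + k]
        # Multiply element by element, then sum the products.
        out[i] = np.sum(window * kernel) + bias
    return out


def relu(values):
    """Activation function: pass positive values through, zero out the rest."""
    return np.maximum(0.0, values)


# Made-up input: 2 rows (features) by 6 columns (positions).
x = np.array([[0.0, 1.0, 3.0, 1.0, 0.0, 0.0],
              [2.0, 2.0, 0.0, 0.0, 1.0, 1.0]])
# Hypothetical filter: responds when row 0 is high and row 1 is low.
kernel = np.array([[1.0, 1.0],
                   [-1.0, -1.0]])

feature_map = convolve_bins(x, kernel)
activated = relu(feature_map)
print(feature_map)  # [-3.  2.  4.  0. -2.]
print(activated)    # [0. 2. 4. 0. 0.]
```

The output map has five values for six input columns, which is the “smaller output map” described above, and only the positions where the filter’s pattern is present survive the activation.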
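What “adjusting the values of weights” means in practice can be illustrated with gradient descent on a single linear layer. This is a deliberately simplified stand-in: real CNNs apply the same idea through many layers using backpropagation, and the data, learning rate, and step count below are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
features = rng.normal(size=(200, 3))       # made-up training inputs
true_weights = np.array([1.5, -2.0, 0.5])  # hypothetical "right answer"
targets = features @ true_weights


def mean_squared_error(weights):
    """Average squared gap between predictions and targets."""
    return float(np.mean((features @ weights - targets) ** 2))


weights = np.zeros(3)  # start from an uninformed guess
learning_rate = 0.1
loss_before = mean_squared_error(weights)

for _ in range(200):
    error = features @ weights - targets
    # Gradient of the mean squared error with respect to the weights.
    gradient = 2.0 * features.T @ error / len(targets)
    # Nudge the weights in the direction that reduces the error.
    weights -= learning_rate * gradient

loss_after = mean_squared_error(weights)
print(loss_before > loss_after)
```

Each pass nudges the weights a little closer to values that reproduce the training targets, which is the sense in which the network optimizes its own predictions.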
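A rough sketch of this setup is shown below. The four marks are the ones named later in this article, but the bin count, the random counts, and the threshold passed to expression_label are assumptions for illustration. Max pooling is shown as one common way a CNN keeps only its strongest responses; treating it as the step described above is also an assumption.

```python
import numpy as np

# Rows: histone modifications named in this article.
marks = ["H3K4me3", "H3K36me3", "H3K9me3", "H3K27me3"]
n_bins = 8  # hypothetical number of bins across the genomic region

rng = np.random.default_rng(1)
# Made-up signal counts: one row per mark, one column per bin.
x = rng.poisson(lam=3.0, size=(len(marks), n_bins)).astype(float)


def max_pool(values, width):
    """Keep the largest value in each non-overlapping window of `width`."""
    usable = len(values) - len(values) % width
    return values[:usable].reshape(-1, width).max(axis=1)


def expression_label(expression, threshold):
    """Return +1 for high expression and -1 for low expression."""
    return 1 if expression > threshold else -1


print(x.shape)                                             # (4, 8)
print(max_pool(np.array([1.0, 5.0, 2.0, 3.0, 9.0, 0.0]), 2))  # [5. 3. 9.]
print(expression_label(7.2, threshold=5.0))                # 1
print(expression_label(1.0, threshold=5.0))                # -1
```

During training, each input matrix would be paired with one such label, so the network learns to map a pattern of marks across bins to a single high-or-low answer.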
DeepChrome Results and Future Steps
After being trained on data spanning multiple cell types, DeepChrome outperformed all of the baseline models (from Support Vector Machines to Random Forest Classifiers) trained on the same data by a substantial margin, showing the greatest ability to distinguish between high and low levels of gene expression for the given data. Beyond these performance metrics, two other findings stand out. First, repressor marks such as H3K9me3 and H3K27me3 (both methylation marks) cooperate in “silencing” the expression of genes. Second, the results validate the canonical understanding that modification marks such as H3K4me3 and H3K36me3 are correlated with high levels of gene expression. Ultimately, the results make a strong case for using DeepChrome in the future exploration of gene regulation by epigenetic changes such as histone modifications, as well as in identifying how specific genes are regulated by particular epigenetic marks, opening up avenues for the development of therapeutics that can manipulate these changes to prevent disease.
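The “ability to distinguish between high and low levels” of expression can be given a number. One common choice for binary classifiers is the area under the ROC curve (AUC), sketched below; this article does not name the metric, so using AUC here is an assumption, and the labels and scores are made up.

```python
import numpy as np


def auc_score(labels, scores):
    """Fraction of (high, low) pairs in which the high example scores higher."""
    labels = np.asarray(labels)
    scores = np.asarray(scores, dtype=float)
    high = scores[labels == 1]
    low = scores[labels == -1]
    # Compare every high-expression score with every low-expression score.
    diff = high[:, None] - low[None, :]
    wins = np.sum(diff > 0) + 0.5 * np.sum(diff == 0)  # ties count as half
    return float(wins / (len(high) * len(low)))


labels = [1, 1, 1, -1, -1]           # +1 = high expression, -1 = low
scores = [0.9, 0.8, 0.3, 0.4, 0.1]   # made-up model outputs
print(round(auc_score(labels, scores), 3))  # 0.833
```

A score of 1.0 means every high-expression gene is ranked above every low-expression gene, while 0.5 is no better than guessing.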
[1] Karlić, R., Chung, H. R., Lasserre, J., Vlahovicek, K., & Vingron, M. (2010). Histone modification levels are predictive for gene expression. Proceedings of the National Academy of Sciences of the United States of America, 107(7), 2926–2931. https://doi.org/10.1073/pnas.0909344107
[2] Alipanahi, B., Delong, A., Weirauch, M. T., & Frey, B. J. (2015). Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nature Biotechnology, 33(8), 831–838. https://doi.org/10.1038/nbt.3300
[3] Singh, R., Lanchantin, J., Robins, G., & Qi, Y. (2016). DeepChrome: deep-learning for predicting gene expression from histone modifications. Bioinformatics (Oxford, England), 32(17), i639–i648. https://doi.org/10.1093/bioinformatics/btw427