h5li/Lab-Research-On-DNA-Methylation

Lab-Research-On-DNA-Methylation

Project Title: Interpreting Correlations Between DNA Methylation Levels and DNA Sequences

Overview:

DNA methylation is known to regulate gene expression, yet methylation levels show different patterns across brain cell types. We want to understand where these differences come from and what causes the variation in methylation levels. We hypothesize that the differences are caused by DNA sequence features that correlate with cell type. The purpose of this project is to explain the correlation between DNA methylation levels and DNA sequences with the help of machine learning models. We first extract features from DNA sequences and fit linear models such as Lasso regression. Features with nonzero coefficients are candidates for further analysis. We evaluate model performance with the coefficient of determination (R²). To improve prediction performance, we will also use neural networks to extract higher-level features and help predict DNA methylation levels.
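As a rough illustration of the linear-model step, the sketch below fits a Lasso by coordinate descent in plain Python on synthetic data, then reports the number of nonzero coefficients together with the mean squared error and R². It is not the project's code. The function names (`lasso_fit`, `make_toy_data`, and so on), the toy data, and the alpha values are all assumptions made for illustration, and a library implementation such as scikit-learn's `Lasso` would likely be used in practice.

```python
import random


def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; exactly zero inside the band."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_fit(X, y, alpha, n_iter=200):
    """Minimize (1/(2n)) * ||y - Xw - b||^2 + alpha * ||w||_1 by coordinate descent."""
    n, p = len(X), len(X[0])
    col_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    # Center features and target so the intercept can be recovered afterwards.
    Xc = [[row[j] - col_mean[j] for j in range(p)] for row in X]
    residual = [yi - y_mean for yi in y]  # residual for w = 0
    z = [sum(Xc[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    w = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            if z[j] == 0.0:
                continue  # constant feature carries no information
            # Covariance of feature j with the residual, with j's own term added back.
            rho = sum(Xc[i][j] * (residual[i] + w[j] * Xc[i][j]) for i in range(n)) / n
            new_wj = soft_threshold(rho, alpha) / z[j]
            delta = new_wj - w[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= delta * Xc[i][j]
                w[j] = new_wj
    intercept = y_mean - sum(w[j] * col_mean[j] for j in range(p))
    return w, intercept


def predict(X, w, intercept):
    return [intercept + sum(wj * xj for wj, xj in zip(w, row)) for row in X]


def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def make_toy_data(n=100, p=8, seed=0):
    """Synthetic stand-in for real features: only columns 0 and 3 affect y."""
    rng = random.Random(seed)
    X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
    y = [2.0 * row[0] - 1.5 * row[3] + rng.gauss(0, 0.1) for row in X]
    return X, y


if __name__ == "__main__":
    X, y = make_toy_data()
    for alpha in (0.01, 0.1, 1.0):  # illustrative values only
        w, b = lasso_fit(X, y, alpha)
        y_hat = predict(X, w, b)
        nonzero = sum(1 for c in w if c != 0.0)
        print(alpha, nonzero, mse(y, y_hat), r2_score(y, y_hat))
```

Larger values of alpha push more coefficients to exactly zero, so sweeping alpha trades prediction error against the number of features that remain for interpretation.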

Brief Description of Projects and Methods:

DNA methylation is well known to modify gene function and affect gene expression. This project aims to predict and interpret DNA methylation at differentially methylated regions (DMRs), which have different methylation levels in different brain cell types. Our dataset comes from whole-genome bisulfite sequencing of mouse brain samples, combined with the corresponding DNA sequences. The data are large-scale and high-dimensional, with ~60,000 DMRs for each of 16 cell types. We evaluate models by the mean squared error between predicted and observed methylation values.

Features are extracted by scanning DNA sequences and counting the occurrences of each of the 2080 possible 6-base-pair sequences (k-mers, k = 6); 2080 is the number of distinct 6-mers when a sequence and its reverse complement are counted as the same k-mer. Beyond k-mer counts, further features will be added, such as CpG attributes, DNA structure, and histone modifications of those DNA sequences. The first method we apply is LASSO regression. By varying the regularization parameter, alpha, we will see how mean squared error changes with the number of nonzero coefficients. Afterwards, nonlinear machine learning models will be used to further improve prediction and help us interpret the correlations. For example, a convolutional neural network may help extract high-level features, letting us interpret these DNA sequences from a more macroscopic view.
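The k-mer counting step can be sketched as follows. This is a minimal illustration, not the project's implementation. It assumes that the 2080 features arise from merging each 6-mer with its reverse complement (4096 raw 6-mers, 64 of which are their own reverse complement, giving (4096 + 64) / 2 = 2080). The function names are hypothetical, and skipping windows that contain non-ACGT characters is also an assumption.

```python
from collections import Counter
from itertools import product

BASES = set("ACGT")
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Complement each base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]


def canonical(kmer):
    """Represent a k-mer and its reverse complement by the lexicographically smaller one."""
    return min(kmer, reverse_complement(kmer))


def canonical_kmers(k=6):
    """Sorted list of all canonical k-mers; this fixes the feature order."""
    return sorted({canonical("".join(p)) for p in product("ACGT", repeat=k)})


def kmer_features(seq, k=6, vocab=None):
    """Count canonical k-mers in seq and return counts in vocab order."""
    if vocab is None:
        vocab = canonical_kmers(k)
    counts = Counter()
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= BASES:  # assumption: skip windows containing N or other codes
            counts[canonical(kmer)] += 1
    return [counts[kmer] for kmer in vocab]


if __name__ == "__main__":
    vocab = canonical_kmers(6)
    features = kmer_features("ACGTTGCAACGTNACGTACGT", k=6, vocab=vocab)
    print(len(vocab), sum(features))
```

Building the vocabulary once and passing it to `kmer_features` keeps the feature order identical across all regions, which matters when the rows are stacked into a matrix for regression.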
