GitHub - littleblackfish/apis-model: Discriminative modeling to support cis-regulation of DNA methylation

This repository contains a discriminatory modeling framework that is utilized downstream of the analysis found in apis-methylation. As in the case of the other repository, it is not polished for public use but mainly documents the workflow for sake of reproducibility.

In conclusion of upstream analysis we observe that a substantial fraction of DNA methylation signal in the honey bee (and some other Hymenopterans) is robust to caste, developmental stage, tissue-of-origin and environmental history. Lacking dynamic response, we remark that these static epigenetic patterns can be efficiently carried by the underlying genetic sequence, and predict that they are strongly coupled to it. We show this to be the case through effectiveness of models that predict robustly methylated sites based on surrounding sequence. We term this approach of deferring mechanistic explanation in favor of no-frills empirical support "in-silico assaying".

The primary task here is to predict a robustly methylated CpG site given a sequence context of a few hundred bases. We set up logistic regression models trained on kmer compositions as a linear baseline, and show that shallow convolutional neural networks offer significantly improved performance, yielding over 0.93 area under precision-recall curves (AUPRC) for the honey bee.

A secondary task is to predict non-robustly methylated CpG sites, which still make up a very small subset of the genome. We show that the same modeling approach yields reduced performance where AUPRC generally remains below 0.6, suggesting decreased coupling to sequence context for dynamically methylated CpG sites.

A tertiary task is to predict robustly methylated sites within the methylome, reinforcing distinct cis-regulation of robustly methylated sites.

A number of scripts are provided to encode output from apis-methylation in preparation for modeling. In general, we start with pandas DataFrames that are indexed by seqid and position of all CpG sites in the genome, further annotated in terms of genomic features and methylation levels. In summary:

extract_windows.py extracts windows of a given radius for each CpG site from a reference genome into a multifasta file, optionally excluding overlapping windows. Methylation levels are json-encoded in fasta descriptions.
encode_kmer.py prepares a kmer composition representation of a given multifasta file.
encode_onehot.py prepares a one-hot encoded representation of a given multifasta file.

With sequences in place, those scripts with the "train_" prefix execute training of various models for either task, with cross validation over chromosome based folds. Logistic models, being cheap, are utilized in largely disposable fashion while convolutional models are retained. Initial hyperparameter sampling for convolutional models are stored in a json file, while a single set was chosen for wider application.

Finally, performance of the resulting models are summarized in some Jupyter notebooks.

Name		Name	Last commit message	Last commit date
Latest commit History 21 Commits
figures		figures
models		models
notebooks		notebooks
scripts		scripts
.gitignore		.gitignore
README.md		README.md
environment.yml		environment.yml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

figures

figures

models

models

notebooks

notebooks

scripts

scripts

.gitignore

.gitignore

README.md

README.md

environment.yml

environment.yml

Repository files navigation

About

Releases 1

Packages

Languages

littleblackfish/apis-model

Folders and files

Latest commit

History

Repository files navigation

About

Topics

Resources

Stars

Watchers

Forks

Languages