DeepLncCTCF for identification and analysis of consensus RNA motifs binding to the genome regulator CTCF

shuzhenkuang/DeepLncCTCF

DeepLncCTCF is a deep learning model for discovering the RNA recognition patterns of CTCF and for identifying candidate lncRNAs that may interact with CTCF. It combines a convolutional neural network (CNN) with an attention-based bidirectional long short-term memory (BLSTM) network. We implemented the DeepLncCTCF model in Python using Keras 2.2.4 on a high-performance computing cluster.

This documentation is part of the supplementary information release for DeepLncCTCF. For details of this work, please refer to our paper "Identification and analysis of consensus RNA motifs binding to the genome regulator CTCF" (S. Kuang and L. Wang, 2020).

Requirements

  • python3
  • numpy
  • pandas
  • scikit-learn (imported as sklearn)
  • keras >=2.0
  • tensorflow
  • h5py
  • scipy
  • Biopython (imported as Bio)

Input Format

The input files are in FASTA format, but the description line (the line beginning with the ">" symbol) must start with the class label. An example sequence is as follows:

>1
CGGCCUCCCCAGCGCAGGGCUCCUCGUUUGAGGGGAGGUGACUUCCCUCCCAGCAGGCUCUUGGACACAGUAAGCUUCCCCAGCCCUGCCUGAGCAGCCUUUCCUCCUUGCCCUGUUCCCCACCUCCCGGCUCCAGGUGAGCGGGCCCUGGAGCUUGCAGUCGGAGGGCCUUGGGCAAGAUCGCCUCCUCCCCUCCAGCCC
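
To make the format concrete, the following is a minimal sketch of how such a labeled FASTA file could be read and one-hot encoded. It is not taken from the repository scripts: the helper names `read_labeled_fasta` and `one_hot` are hypothetical, the label is assumed to be the first whitespace-separated token after ">", and the use of "0" for the negative class is an assumption (only ">1" appears in the example above).

```python
from io import StringIO

ALPHABET = "ACGU"  # RNA alphabet, matching the example sequence above


def read_labeled_fasta(handle):
    """Yield (label, sequence) pairs from a FASTA stream.

    Assumption: the class label is the first whitespace-separated
    token after ">" on the description line.
    """
    label, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if label is not None:
                yield label, "".join(chunks)
            label = line[1:].split()[0]
            chunks = []
        else:
            # Sequences may be wrapped over several lines.
            chunks.append(line.upper())
    if label is not None:
        yield label, "".join(chunks)


def one_hot(seq):
    """Encode a sequence as rows of 4 indicators (A, C, G, U).

    Bases outside the alphabet (for example N) become all-zero rows.
    """
    return [[1 if base == a else 0 for a in ALPHABET] for base in seq]


# Toy input; the ">0" record is a hypothetical negative example.
example = StringIO(">1\nCGGCCUCC\nCCAG\n>0\nAUGN\n")
records = list(read_labeled_fasta(example))
encoded = one_hot(records[1][1])
```

Keeping the label as a string avoids guessing at its type; convert it with `int(label)` if your labels are numeric.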

Training and Evaluation

The data we used to construct the model are available in the Data directory. To train your own model with DeepLncCTCF, substitute your own data for the input files. The command line to train and evaluate DeepLncCTCF is as follows:

$ python train.py -f human_positive_seq.fa -n human_negative_seq.fa -o human.output

During training, the best weights are automatically stored in an "hdf5" file. The weights of our fully trained model are available in the Weights directory.

Testing

To evaluate the model on a separate test set, run the following command line:

$ python test.py -f test_positive_seq.fa -n test_negative_seq.fa -o test.output

Before running it, make sure to download the "hdf5" file from the Weights directory or generate your own best weights by training.

Motif Visualization

To visualize the kernels of the first convolution layer and obtain their frequency and location information, run the following command line:

$ python get_motifs.py -f human_positive_seq.fa -n human_negative_seq.fa

As with testing, an "hdf5" file containing the best weights is required.
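
One common way to turn a convolution kernel into a motif is to collect the subsequences that activate it most strongly and tally the base frequencies at each position. Whether get_motifs.py does exactly this is not stated here, so the following is only a sketch of that general technique. The function name `position_frequency_matrix` and the toy subsequences are hypothetical.

```python
from collections import Counter

ALPHABET = "ACGU"


def position_frequency_matrix(subsequences):
    """Return per-position base frequencies for equal-length subsequences.

    Each row is a dict mapping A, C, G, U to a frequency in [0, 1].
    """
    if not subsequences:
        raise ValueError("need at least one subsequence")
    width = len(subsequences[0])
    if any(len(s) != width for s in subsequences):
        raise ValueError("all subsequences must have the same length")

    matrix = []
    for pos in range(width):
        counts = Counter(s[pos] for s in subsequences)
        # Only count bases in the alphabet, so ambiguous bases are ignored.
        total = sum(counts[b] for b in ALPHABET)
        matrix.append(
            {b: (counts[b] / total if total else 0.0) for b in ALPHABET}
        )
    return matrix


# Hypothetical subsequences that activated one kernel.
hits = ["CCUC", "CCAC", "GCUC", "CCUC"]
pfm = position_frequency_matrix(hits)
```

A matrix like this can then be passed to a sequence-logo tool for plotting.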

Predicting CTCF-binding RNA sites on lncRNAs

We applied the trained DeepLncCTCF model to predict CTCF-binding RNA sites on human lncRNAs, and used these predictions to select candidate CTCF-binding lncRNAs. To predict CTCF-binding RNA sites on lncRNAs using the trained DeepLncCTCF model, run the following command line:

$ python prediction.py -f lncRNA_seq.fa -o prediction.output
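
Full-length lncRNAs are usually much longer than a single binding site. One plausible way to prepare them for site-level prediction is to cut each transcript into overlapping fixed-length windows. The sketch below illustrates that idea only. How prediction.py actually segments its input is not described here, and the `sliding_windows` helper and the window and step sizes are hypothetical.

```python
def sliding_windows(seq, window, step):
    """Yield (start, subsequence) for each full-length window over seq.

    Trailing residues that do not fill a whole window are dropped.
    """
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    for start in range(0, len(seq) - window + 1, step):
        yield start, seq[start:start + window]


# Toy transcript; window and step are illustrative, not the model's values.
lncrna = "AUGGCCUCCCAGCGCAGG"
windows = list(sliding_windows(lncrna, window=8, step=5))
```

Each window start offset can be kept alongside its prediction so that predicted sites can be mapped back to coordinates on the original transcript.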
