miRNA_models

This repository contains models for predicting miRNA-mediated repression described in our paper.

Requirements

Python 3.6 or higher
Python packages listed in requirements.txt (replace tensorflow-gpu with tensorflow if you are not using GPUs). To install them, run pip install -r requirements.txt.
RNAplfold from the ViennaRNA package
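
Before running the pipeline, it can help to confirm that the interpreter and RNAplfold are actually available. The following is a minimal sketch, not part of this repository; the function name check_environment is hypothetical, and it only checks the Python version and whether RNAplfold is on PATH (it does not verify the packages in requirements.txt).

```python
import shutil
import sys

MIN_PYTHON = (3, 6)  # minimum version stated in the requirements above


def check_environment(min_python=MIN_PYTHON, executable="RNAplfold"):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    if sys.version_info < min_python:
        problems.append(
            "Python {}.{}+ is required, found {}.{}".format(
                min_python[0], min_python[1], sys.version_info[0], sys.version_info[1]
            )
        )
    # shutil.which returns None when the executable is not on PATH
    if shutil.which(executable) is None:
        problems.append("{} was not found on PATH".format(executable))
    return problems


for problem in check_environment():
    print("WARNING:", problem)
```

If nothing is printed, both checks passed.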

Modules

cnn

This module contains code for building and training the combined CNN and biochemical models (train.py) and for monitoring training progress with TensorBoard. The parse_data_utils file contains helper functions for parsing tfrecords data and assembling it into the correct input format. The function models.seq2ka_predictor() builds the CNN that predicts KD from miRNA and target sequences.

Here is an example of how to predict relative KD values using the trained model from our paper:

python cnn/generate_12mer_kds.py \
--name all \
--mirdata sample_data/inputs/mirseqs.txt \
--mirlen 10 \
--passenger \
--load_model cnn/trained_model/model-100 \
--outfile sample_data/outputs/kds/MIR_kds.txt

rnaplfold

This module folds target sites in many different sequence contexts to calculate the basal accessibility of each site. We recommend using the --only_canon flag to calculate this value for canonical sites only, which avoids very long compute times.

First partition 12-nt kmers into 10 files:

python rnaplfold/partition_seqs.py \
--mirseqs sample_data/inputs/mirseqs.txt \
--nbins 10 \
--outdir sample_data/outputs/SA_background/sequences \
--only_canon \
--passenger

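Then generate 200 random contexts for each sequence in each file and fold them. The example below submits each job with bsub (the LSF scheduler); adapt the submission command to your own cluster. You may want to use a Makefile or snakemake to automate this step: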

for mirname in mir122_pass mir133_pass ; do \
for ix in 0 1 2 3 4 5 6 7 8 9 ; do \
bsub -R "rusage[mem=4096]" \
python rnaplfold/get_SA_bg.py \
--sequence_file sample_data/outputs/SA_background/sequences/canon_"$mirname"_"$ix".txt \
--temp_folder sample_data/outputs/SA_background/bg_vals/canon_"$mirname"_"$ix"_TEMP \
--num_bg 200 \
--num_processes 24 \
--outfile sample_data/outputs/SA_background/bg_vals/canon_"$mirname"_"$ix"_bg_vals.txt; \
done; \
done
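
If no job scheduler is available, the same set of jobs can be generated from Python. The sketch below is an illustration under stated assumptions, not a script from this repository: build_commands and run_all are hypothetical helper names, the flags and paths are copied from the loop above, and run_all executes the jobs one after another instead of submitting them to a cluster.

```python
import subprocess

MIRNAMES = ["mir122_pass", "mir133_pass"]
BASE = "sample_data/outputs/SA_background"


def build_commands(mirnames=MIRNAMES, nbins=10, num_bg=200, num_processes=24):
    """Build one get_SA_bg.py command (as an argument list) per miRNA and bin."""
    commands = []
    for mirname in mirnames:
        for ix in range(nbins):
            stem = "canon_{}_{}".format(mirname, ix)
            commands.append([
                "python", "rnaplfold/get_SA_bg.py",
                "--sequence_file", "{}/sequences/{}.txt".format(BASE, stem),
                "--temp_folder", "{}/bg_vals/{}_TEMP".format(BASE, stem),
                "--num_bg", str(num_bg),
                "--num_processes", str(num_processes),
                "--outfile", "{}/bg_vals/{}_bg_vals.txt".format(BASE, stem),
            ])
    return commands


def run_all(commands, dry_run=True):
    """Print each command; run them one after another when dry_run is False."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)  # stop at the first failing job
```

Calling run_all(build_commands()) prints the 20 commands without running them; pass dry_run=False to execute them from the repository root.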

Finally, parse RNAplfold outputs for all background sequences. Because the contexts are random, the final results will be very slightly different each time.

python rnaplfold/combine_results.py \
--mirseqs sample_data/inputs/mirseqs.txt \
--nbins 10 \
--num_bg 200 \
--infile_seqs sample_data/outputs/SA_background/sequences/canon_MIR_IX.txt \
--infile_bg sample_data/outputs/SA_background/bg_vals/canon_MIR_IX_bg_vals.txt \
--outfile sample_data/outputs/SA_background/bg_vals_processed/canon_MIR_bg_vals.txt \
--passenger

The partitioned sequence and bg_vals files can be deleted at this point.
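
To illustrate why results based on random contexts differ slightly between runs, here is a minimal sketch of embedding a site in random flanking sequence. It is not the code used by get_SA_bg.py: the function name random_contexts, the flank length of 20, the ACGU alphabet, and the seed parameter are all assumptions made for this example.

```python
import random

NUCLEOTIDES = "ACGU"  # assumed alphabet for this illustration


def random_contexts(site, num_bg=200, flank_len=20, seed=None):
    """Embed `site` between random flanks; flank_len is an illustrative value only."""
    rng = random.Random(seed)  # a fixed seed makes the contexts reproducible
    contexts = []
    for _ in range(num_bg):
        upstream = "".join(rng.choice(NUCLEOTIDES) for _ in range(flank_len))
        downstream = "".join(rng.choice(NUCLEOTIDES) for _ in range(flank_len))
        contexts.append(upstream + site + downstream)
    return contexts
```

With seed=None each call draws different flanks, so any value averaged over the contexts will vary a little from run to run; passing the same integer seed twice returns identical contexts.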

get_features

This module contains code that preprocesses data for both the biochemical model and the CNN. For best results, supply PCT scores.

First, navigate to a temporary folder and fold the ORF + UTR3 sequences using RNAplfold:

mkdir -p sample_data/outputs/rnaplfold/TEMP
cd sample_data/outputs/rnaplfold/TEMP
RNAplfold -L 40 -W 80 -u 15 < ../../../../sample_data/inputs/orf_utr3.fa
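
The same folding step can be driven from Python without changing directories in the shell. This is a sketch under stated assumptions rather than a script shipped with this repository: rnaplfold_command and fold_transcripts are hypothetical names, and the only RNAplfold options used are the -L, -W, and -u values shown above.

```python
import os
import subprocess


def rnaplfold_command(max_span=40, window=80, unpaired_len=15):
    """Argument list matching the command above: RNAplfold -L 40 -W 80 -u 15."""
    return ["RNAplfold", "-L", str(max_span), "-W", str(window), "-u", str(unpaired_len)]


def fold_transcripts(fasta_path, temp_dir):
    """Run RNAplfold inside temp_dir, feeding fasta_path on standard input."""
    os.makedirs(temp_dir, exist_ok=True)
    fasta_path = os.path.abspath(fasta_path)  # resolve before changing the working directory
    with open(fasta_path) as fasta:
        # cwd=temp_dir makes RNAplfold write its output files there, replacing the cd step
        subprocess.run(rnaplfold_command(), stdin=fasta, cwd=temp_dir, check=True)
```

For the sample data, this would be called from the repository root as fold_transcripts("sample_data/inputs/orf_utr3.fa", "sample_data/outputs/rnaplfold/TEMP").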

Then navigate back to the repository root (cd ../../../..) and process the results for easier querying later:

python rnaplfold/process_mRNA_folding.py \
--transcripts sample_data/inputs/transcripts.txt \
--indir sample_data/outputs/rnaplfold/TEMP \
--outdir sample_data/outputs/rnaplfold/rnaplfold_orf_utr3/

The TEMP files can be deleted at this point. To calculate all features:

for mirname in mir122 mir133 mir122_pass mir133_pass ; do \
python get_features/write_sites.py \
--transcripts sample_data/inputs/transcripts.txt \
--mir "$mirname" \
--mirseqs sample_data/inputs/mirseqs.txt \
--kds sample_data/outputs/kds/"$mirname"_kds.txt \
--sa_bg sample_data/outputs/SA_background/bg_vals_processed/canon_"$mirname"_bg_vals.txt \
--rnaplfold_dir sample_data/outputs/rnaplfold/rnaplfold_orf_utr3/ \
--pct_file sample_data/inputs/pcts.txt \
--overlap_dist 12 \
--upstream_limit 15 \
--outfile sample_data/outputs/features/"$mirname".txt ; \
done
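
Because write_sites.py depends on the outputs of the earlier steps, it can be useful to check that every input exists before starting the loop. The sketch below is illustrative only: feature_inputs and missing_inputs are hypothetical helpers, and the paths are copied from the command above.

```python
import os


def feature_inputs(mirname, root="sample_data"):
    """Map each write_sites.py input flag to the path used in the loop above."""
    return {
        "--transcripts": "{}/inputs/transcripts.txt".format(root),
        "--mirseqs": "{}/inputs/mirseqs.txt".format(root),
        "--kds": "{}/outputs/kds/{}_kds.txt".format(root, mirname),
        "--sa_bg": "{}/outputs/SA_background/bg_vals_processed/canon_{}_bg_vals.txt".format(root, mirname),
        "--rnaplfold_dir": "{}/outputs/rnaplfold/rnaplfold_orf_utr3/".format(root),
        "--pct_file": "{}/inputs/pcts.txt".format(root),
    }


def missing_inputs(mirname, root="sample_data"):
    """Return the (flag, path) pairs whose path does not exist yet."""
    return [(flag, path) for flag, path in feature_inputs(mirname, root).items()
            if not os.path.exists(path)]
```

An empty list from missing_inputs("mir122") means all inputs for that miRNA are in place; otherwise the returned flags show which earlier step still needs to be run.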

biochem_model

This module contains code for building, training, and using the biochemical and biochemical+ models.

To use our trained parameters to predict scores:

for mirname in mir122 mir133 ; do \
python biochem_model/predict.py \
--features sample_data/outputs/features/"$mirname".txt \
--features_pass sample_data/outputs/features/"$mirname"_pass.txt \
--model biochem_model/trained_models/biochemplus.json \
--freeAGO -6.5 \
--freeAGO_pass -7.5 \
--outfile sample_data/outputs/predictions/"$mirname".txt ; \
done

The predictions for an mRNA will change slightly depending on the cohort of other mRNAs supplied, because the structural accessibility score for noncanonical sites of a miRNA is determined by the average value of all canonical sites in the given mRNAs. If you instead calculate background values for all sites of a miRNA, this dependence no longer applies.
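
The cohort dependence can be seen in a toy example. The sketch below is not the repository's implementation: the function name, the mRNA labels, and the accessibility values are all made up for illustration, and it only shows that an average taken over canonical sites shifts when the set of mRNAs changes.

```python
def mean_canonical_accessibility(cohort):
    """Average the accessibility values of all canonical sites across the cohort."""
    values = [v for sites in cohort.values() for v in sites]
    return sum(values) / len(values)


# Hypothetical accessibility values of canonical sites, keyed by mRNA
cohort_a = {"mRNA_1": [0.25, 0.5], "mRNA_2": [0.75]}
cohort_b = {"mRNA_1": [0.25, 0.5], "mRNA_3": [1.0]}

# mRNA_1 is identical in both cohorts, yet the value that would be assigned to
# its noncanonical sites differs because the cohort-wide average differs.
print(mean_canonical_accessibility(cohort_a))
print(mean_canonical_accessibility(cohort_b))
```

Swapping mRNA_2 for mRNA_3 changes the average, and therefore the score any noncanonical site in mRNA_1 would receive under this scheme.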
