mmsplice

Predict splicing variant effect from VCF

Paper: Cheng et al. https://doi.org/10.1101/438986

Usage example

```
pip install mmsplice
```

Preparation

1. Prepare annotation (gtf) file

Standard human gene annotation files in GTF format can be downloaded from Ensembl or GENCODE. MMSplice can work directly with those files; however, some filtering is highly recommended.

Filter for protein coding genes.
Filter out duplicated exons. The same exon can be annotated multiple times if it appears in multiple transcripts. This will cause duplicated predictions.
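
The two filters above can be sketched in plain Python. This is only an illustration and not part of MMSplice: the function name `filter_gtf_lines` is hypothetical, the gene biotype is assumed to be stored in a `gene_biotype` (Ensembl) or `gene_type` (GENCODE) attribute, and two exon records are assumed to be duplicates when chromosome, start, end and strand all match.

```python
import re

# Assumption: Ensembl writes gene_biotype "...", GENCODE writes gene_type "...".
BIOTYPE = re.compile(r'gene_(?:bio)?type "([^"]+)"')


def filter_gtf_lines(lines):
    """Yield GTF lines of protein-coding genes, keeping each exon only once."""
    seen_exons = set()
    for line in lines:
        if line.startswith("#"):
            yield line  # keep header comments unchanged
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # skip malformed records
        match = BIOTYPE.search(fields[8])
        if match is None or match.group(1) != "protein_coding":
            continue
        if fields[2] == "exon":
            # chromosome, start, end, strand identify an exon in this sketch
            key = (fields[0], fields[3], fields[4], fields[6])
            if key in seen_exons:
                continue
            seen_exons.add(key)
        yield line
```

To use it on a file, open the input GTF for reading and an output file for writing, then pass the result of `filter_gtf_lines(infile)` to `outfile.writelines(...)`.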

We provide a filtered version here. Note that this version has chromosome names in the format chr*. You may need to remove the chr prefix to match the chromosome names in your fasta file.

2. Prepare variant (VCF) file

Any correctly formatted VCF file will work with MMSplice; however, the following steps will make the predictions less prone to false positives:

Quality filtering. Low quality variants lead to unreliable predictions.
Avoid presenting multiple variants on one line by splitting multiallelic records into separate lines. Example code to do this:
```
bcftools norm -m-both -o out.vcf in.vcf.gz
```
Left-normalization. For instance, GGCA-->GG is not left-normalized while GCA-->G is. For details on the unified representation of genetic variants, see Tan et al.
```
bcftools norm -f reference.fasta -o out.vcf in.vcf
```
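
To make the GGCA-->GG example concrete, the sketch below trims bases shared by the reference and alternative alleles. It is a simplified illustration, not what bcftools does internally: the function name `trim_alleles` is hypothetical, and the sketch never looks at the reference genome, so it cannot shift an indel to the left within a repeat. For real data, use the `bcftools norm -f` command shown above.

```python
def trim_alleles(pos, ref, alt):
    """Remove shared flanking bases from a REF/ALT pair.

    pos is the 1-based position of the first REF base.
    Both alleles always keep at least one base, as VCF requires.
    """
    # Trim shared trailing bases first.
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    # Then trim shared leading bases, moving the position to the right.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


# The example from the text, placed at an arbitrary position 100:
print(trim_alleles(100, "GGCA", "GG"))  # (101, 'GCA', 'G')
```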

3. Prepare reference genome (fasta) file

A human reference fasta file can be downloaded from Ensembl or GENCODE. Make sure the chromosome names match those in the GTF annotation file you use.
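
One way to check that the names match is to compare the sequence names in both files before running any predictions. The helper names below are hypothetical and the sketch assumes uncompressed, plain-text fasta and GTF input.

```python
def fasta_chromosomes(lines):
    """Collect sequence names from fasta header lines (text after '>')."""
    return {
        line[1:].split()[0]
        for line in lines
        if line.startswith(">") and line[1:].strip()
    }


def gtf_chromosomes(lines):
    """Collect the first column of every non-comment GTF record."""
    return {
        line.split("\t", 1)[0]
        for line in lines
        if line.strip() and not line.startswith("#")
    }


def missing_from_fasta(gtf_lines, fasta_lines):
    """Return GTF chromosome names that have no sequence in the fasta."""
    return sorted(gtf_chromosomes(gtf_lines) - fasta_chromosomes(fasta_lines))
```

Calling `missing_from_fasta(open(gtf_path), open(fasta_path))` returns an empty list when every annotated chromosome has a sequence. A result such as `['chr17']` against a fasta that names the sequence `17` points to a chr prefix mismatch.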

Example code

Check notebooks/example.ipynb

```
# Import
from mmsplice.vcf_dataloader import SplicingVCFDataloader
from mmsplice import MMSplice, predict_all_table
from mmsplice.utils import max_varEff

# Example files
gtf = 'tests/data/test.gtf'
vcf = 'tests/data/test.vcf.gz'
fasta = 'tests/data/hg19.nochr.chr17.fa'
gtfIntervalTree = 'tests/data/test.pkl'  # pickled exon interval tree

# Dataloader to load variants from the VCF
dl = SplicingVCFDataloader(gtf,
                           fasta,
                           vcf,
                           out_file=gtfIntervalTree,  # where to pickle the GTF interval tree
                           split_seq=False)

# Specify model
model = MMSplice(
    exon_cut_l=0,
    exon_cut_r=0,
    acceptor_intron_cut=6,
    donor_intron_cut=6,
    acceptor_intron_len=50,
    acceptor_exon_len=3,
    donor_exon_len=5,
    donor_intron_len=13)

# Do prediction
predictions = predict_all_table(model, dl, batch_size=1024, split_seq=False, assembly=False)

# Summarize with maximum effect size
predictionsMax = max_varEff(predictions)
```
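
The last step reduces the per-exon predictions to one row per variant. The sketch below illustrates that idea on plain dictionaries. It is not the implementation of `max_varEff`: the function name `max_abs_effect` and the keys `variant`, `exon` and `effect` are hypothetical, and picking the largest absolute value is an assumption about what "maximum effect size" means.

```python
def max_abs_effect(records):
    """Keep, for each variant, the record with the largest absolute effect."""
    best = {}
    for rec in records:
        key = rec["variant"]
        if key not in best or abs(rec["effect"]) > abs(best[key]["effect"]):
            best[key] = rec
    return list(best.values())


# Toy input: one variant overlapping two exons, another overlapping one.
toy = [
    {"variant": "17:100:A>G", "exon": "e1", "effect": 0.2},
    {"variant": "17:100:A>G", "exon": "e2", "effect": -1.5},
    {"variant": "17:900:C>T", "exon": "e3", "effect": 0.1},
]
for row in max_abs_effect(toy):
    print(row["variant"], row["exon"], row["effect"])
```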

VEP Plugin

Please see the documentation of the VEP plugin in VEP_plugin/README.md.
