GitHub - ysm0128/stmp


Sequence to Medical Phenotypes (STMP) is a pipeline featuring variant annotation, prioritization, pharmacogenomics, and tools for analyzing genomic trios (mother, father, child).

** Release versions can be downloaded from https://github.com/AshleyLab/stmp/releases **

The toolkit currently uses an SQLite database for added portability.

External Dependencies (to be placed in the "third_party" folder -- see instructions below):
- ANNOVAR version 2015-03-22 15:29:59 (Sun, 22 Mar 2015)
- SnpEff version 4.1e (build 2015-05-02)

Other versions of the above tools may also work but are not currently supported.

Other dependencies (these must be in the user or system PATH before running STMP):
- bcftools version 1.2
- bedtools version 2.17.0

Python dependencies:
- PyYAML version 3.11
- xlwt version 1.0.0 (for exporting results to an Excel file)
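
A quick pre-flight check for the PATH and Python dependencies listed above could look like the sketch below. It is written for Python 3 and only tests for presence, not for the exact versions given above. The function names are hypothetical and not part of STMP.

```python
import importlib.util
import shutil

# Tool and module names taken from the dependency lists above.
REQUIRED_TOOLS = ["bcftools", "bedtools"]
REQUIRED_MODULES = ["yaml", "xlwt"]  # PyYAML is imported under the name "yaml"


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the command-line tools that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def missing_modules(modules=REQUIRED_MODULES):
    """Return the Python modules that cannot be located for import."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    print("Missing tools:", missing_tools() or "none")
    print("Missing modules:", missing_modules() or "none")
```

An empty list from both functions means the tools and modules were found; version checks would still need to be done by hand (e.g. by running each tool's version command).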

-----------------------------------------------------------
Installation Instructions

Downloading software and dependencies
- Download the STMP release from here (https://github.com/AshleyLab/stmp/releases).
- Download ANNOVAR (http://annovar.openbioinformatics.org/en/latest/user-guide/download/) and SnpEff (http://sourceforge.net/projects/snpeff/files/snpEff_latest_core.zip/download), and make sure they are copied or symlinked into folders called "annovar" and "snpeff", respectively, within the third_party folder.
    E.g. ANNOVAR would be linked/copied to third_party/annovar (this folder should contain all files from the ANNOVAR download, including annotate_variation.pl and table_annovar.pl)
    E.g. SnpEff would be linked/copied as third_party/snpeff/snpEff (this folder should include files such as snpEff.jar)
- Ensure PyYAML is installed (e.g. via "pip install pyyaml").
- Ensure bedtools version 2.17.0 and bcftools are installed and in the user/system PATH. These can be either downloaded directly from the corresponding websites or installed via a program such as bcbio.
- Run the appropriate ANNOVAR command to download the datasets specified in Appendix 1 (e.g. "annotate_variation.pl -buildver hg19 -downdb -webfrom annovar refGene humandb/" from within third_party/annovar to download the refGene dataset).

- If you would like to run trio tools: 
   - Copy stable/code/trio/annovar/summarize_annovarRDv2.pl to third_party/annovar
   - Run the appropriate ANNOVAR command to download the datasets specified in Appendix 2 (e.g. "annotate_variation.pl -buildver hg19 -downdb -webfrom annovar refGene humandb/" from within third_party/annovar to download the refGene dataset).
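
The folder layout and the ANNOVAR download command described above can be checked and assembled with a short sketch like the following. The helper names are hypothetical. The file list only covers the files named in these instructions, and the dataset name passed to the command builder must be one that ANNOVAR itself recognizes (only refGene is confirmed by the example above).

```python
import os

# Files the instructions above say should exist after copying/symlinking.
EXPECTED_FILES = [
    os.path.join("annovar", "annotate_variation.pl"),
    os.path.join("annovar", "table_annovar.pl"),
    os.path.join("snpeff", "snpEff", "snpEff.jar"),
]


def missing_third_party_files(third_party_dir):
    """Return the expected files that are absent under third_party_dir."""
    return [
        rel for rel in EXPECTED_FILES
        if not os.path.exists(os.path.join(third_party_dir, rel))
    ]


def downdb_command(db_name, buildver="hg19", out_dir="humandb/"):
    """Build the ANNOVAR download command shown above for one dataset."""
    return [
        "annotate_variation.pl",
        "-buildver", buildver,
        "-downdb",
        "-webfrom", "annovar",
        db_name,
        out_dir,
    ]


if __name__ == "__main__":
    print("Missing:", missing_third_party_files("third_party") or "none")
    print(" ".join(downdb_command("refGene")))
```

The command is only built, not executed; it is meant to be run from within third_party/annovar, as in the example above.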


Setting up STMP
- Run "python stmp.py --db_update". This will create a SQLite database file in the db folder and download and import the core datasets required for annotation and tiering.
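
To confirm that the update step produced a usable database, the tables in the resulting SQLite file can be listed with Python's built-in sqlite3 module. This is a minimal sketch: the database file name is not stated here, so the path must be supplied by the user, and the table names STMP creates are not assumed.

```python
import sqlite3


def list_tables(db_path):
    """Return the names of all tables in the SQLite file at db_path."""
    # Note: sqlite3.connect creates an empty file if db_path does not exist.
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]
```

An empty list suggests the import did not complete or that the path points at the wrong file.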


-----------------------------------------------------------
Running STMP
- To run STMP on an input VCF:
python stmp.py --vcf=(path to input VCF) --output_dir=(output directory)

Example (cd to the unzipped STMP release folder you downloaded):
python stable/code/stmp.py --vcf=sample_input_data/genome_in_a_bottle/subset.rs.vcf --output_dir=sample_outputs/genome_in_a_bottle_output

This will run three different modules: annotation, tiering, and pharmacogenomics (pgx).

1) Annotation
This module annotates the input VCF with information from each of the above datasets. It outputs a TSV (tab-separated values) file with each annotation as a separate column (after the standard VCF columns). Annotation includes point annotation, functional annotation (using ANNOVAR and SnpEff), and region (range) annotation using bedtools. Intermediate outputs of specific annotations (e.g. point annotations) are available in the scratch folder within the output directory. The final output (each of these three annotation types joined into a single file) is written as a .tsv file in the specified output directory.
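
Because the final annotation output is a tab-separated file with one column per annotation, it can be read with Python's csv module. The sketch below uses a made-up two-row sample; the column names (including "example_annotation") are placeholders, and the real header written by STMP may differ.

```python
import csv
import io


def read_annotated_tsv(handle):
    """Yield one dict per variant row, keyed by the TSV header names."""
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        yield row


# Made-up sample; real STMP output has many more columns.
SAMPLE = (
    "CHROM\tPOS\tID\tREF\tALT\texample_annotation\n"
    "1\t12345\t.\tA\tG\tfoo\n"
    "2\t67890\t.\tC\tT\tbar\n"
)

rows = list(read_annotated_tsv(io.StringIO(SAMPLE)))
print(rows[0]["example_annotation"])
```

For a real run, pass an open file handle for the .tsv file in the output directory instead of the in-memory sample.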

2) Tiering
This module takes the annotated TSV from the previous step and prioritizes the variants into different tiers (below). In addition to outputting a text file with tiering metrics (tiering_allvars.metrics), it outputs text files for each tier (tiering_allvars.tier0.txt, tiering_allvars.tier1.txt, etc.).
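
The per-tier files named above can be collected programmatically by matching their file names. This sketch assumes the tier files sit directly in the directory it is given, which is not stated here; the function name is hypothetical.

```python
import os
import re

# Matches tiering_allvars.tier0.txt, tiering_allvars.tier1.txt, etc.
TIER_FILE = re.compile(r"^tiering_allvars\.tier(\d+)\.txt$")


def tier_files(output_dir):
    """Map tier number -> path for each tier file found in output_dir."""
    found = {}
    for name in os.listdir(output_dir):
        match = TIER_FILE.match(name)
        if match:
            found[int(match.group(1))] = os.path.join(output_dir, name)
    # Return the entries ordered by tier number.
    return dict(sorted(found.items()))
```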

Tier 0: Variants classified as pathogenic or likely pathogenic according to ClinVar.

Tier 1: Loss of function variants (splice dinucleotide disrupting, nonsense, nonstop, and frameshift indels).

Tier 2: All rare variants cataloged in HGMD, regardless of functional annotation. Rarity is defined as a minor allele frequency (MAF) no greater than 1% by default, or according to user-defined criteria, in any of the following population genetic surveys: the ethnically-matched population in HapMap 2 and 3, the 1000 Genomes phase 1 data from an ethnically-matched super population and its global allele frequency, the 1000 Genomes pilot 1 project global allele frequency, 69 publicly available genomes released by Complete Genomics, and the NHLBI Grand Opportunity exome sequencing project global allele frequency.

Tier 3: All non-rare missense and non-frameshift indels.

Tier 4: All variants not meeting criteria for tiers 1-3.
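
The rarity criterion used in the Tier 2 definition can be illustrated with a small sketch. This is not STMP's implementation. It reads "in any of" literally (one survey at or below the threshold is enough) and offers a require_all switch for the stricter reading. The survey names and the treatment of variants missing from every survey are assumptions.

```python
def is_rare(mafs, threshold=0.01, require_all=False):
    """Decide rarity from per-survey minor allele frequencies.

    mafs maps a survey name to a MAF, or to None when that survey has no
    record of the variant. The names are placeholders chosen by the caller.
    """
    reported = [maf for maf in mafs.values() if maf is not None]
    if not reported:
        # Assumption: a variant absent from every survey counts as rare.
        return True
    checks = [maf <= threshold for maf in reported]
    return all(checks) if require_all else any(checks)


# Hypothetical frequencies for one variant.
example = {"hapmap": 0.005, "1000g_phase1": 0.2, "cg69": None}
print(is_rare(example), is_rare(example, require_all=True))
```

The threshold argument stands in for the user-defined criteria mentioned above; the default of 0.01 corresponds to the 1% default.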


3) Pharmacogenomics (pgx)
This module takes in a VCF file and outputs several text files summarizing variants with known pharmacogenomic effects. These include effects on drug dosage, efficacy, toxicity, and other interactions, as well as whether any variants in the input file match known "star" alleles associated with clinical drug response for 6 genes (CYP2C19, CYP2C9, CYP2D6, SLCO1B1, TPMT, VKORC1). Each of these files is output in the specified output directory.

For additional options, run "python stmp.py -h". For example, one can use the "--annotate_only" flag to run only the annotation module, the "--tiering_only" flag to run just the tiering module, or the "--pgx_only" flag to run just the pgx module. Note that tiering depends on the annotated output file, so annotation must be run before tiering.
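
When STMP is driven from another script, the command line described above can be assembled as an argument list. The sketch below uses only the flags named in this section; the function name and the "mode" keys are hypothetical conveniences.

```python
# Flags named in the text above; each runs a single module.
MODE_FLAGS = {
    "annotate": "--annotate_only",
    "tiering": "--tiering_only",
    "pgx": "--pgx_only",
}


def stmp_command(vcf, output_dir, mode=None, script="stmp.py"):
    """Build the argument list for one STMP run.

    mode=None runs all modules; otherwise it must be a key of MODE_FLAGS.
    """
    cmd = ["python", script, "--vcf=" + vcf, "--output_dir=" + output_dir]
    if mode is not None:
        cmd.append(MODE_FLAGS[mode])
    return cmd


print(" ".join(stmp_command("input.vcf", "out", mode="annotate")))
```

The resulting list can be passed to subprocess.run(cmd, check=True). Since tiering depends on the annotated output, a run with mode="tiering" should follow a completed annotation run.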


4) Trio (separate script)
This module analyzes genome sequence data from a father, mother, and child. It takes as input a single VCF with different sample IDs for mother, father, and child.

Usage:
python trio/trioPipeline.py input output path_to_annovar path_to_matrix offspringID fatherID motherID

Example:
(Note: as the combined file is large, you must download the HG002, HG003, and HG004 VCFs from this site (ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/analysis/NIST_CallsIn2Technologies_05182015/) and manually combine them into a single VCF file using bcftools or similar. The sample command below assumes you have placed the combined file in the sample_input_data/genome_in_a_bottle/trio directory and called it trio.combined.vcf.)

python stable/code/trio/trioPipeline.py sample_input_data/genome_in_a_bottle/trio/trio.combined.vcf sample_outputs/trio_output/ third_party/annovar/ stable/code/trio/ HG002 HG003 HG004
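
The two steps above (combining the per-sample VCFs, then running the trio pipeline) can be sketched as command builders. The per-sample file names passed to merge_command are placeholders, and bcftools merge expects its inputs to be bgzip-compressed and indexed. The positional order in trio_command follows the usage line above.

```python
def merge_command(sample_vcfs, combined_vcf):
    """Build a bcftools merge command that writes an uncompressed VCF."""
    return ["bcftools", "merge", "-O", "v", "-o", combined_vcf] + list(sample_vcfs)


def trio_command(input_vcf, output_dir, annovar_dir, matrix_dir,
                 offspring_id, father_id, mother_id,
                 script="stable/code/trio/trioPipeline.py"):
    """Build the trioPipeline.py command with arguments in the usage order."""
    return ["python", script, input_vcf, output_dir, annovar_dir, matrix_dir,
            offspring_id, father_id, mother_id]


# Placeholder input names; substitute the files actually downloaded.
print(" ".join(merge_command(
    ["HG002.vcf.gz", "HG003.vcf.gz", "HG004.vcf.gz"], "trio.combined.vcf")))
```

Using keyword arguments for the three sample IDs makes it harder to swap the father and mother IDs by accident, which the purely positional command line does not guard against.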


-----------------------------------------------------------
Customization
- Different datasets can be imported and used for annotation. To import additional datasets, modify stable/code/config/datasets.yml to add information about the desired datasets. It is recommended that you make a backup copy of this file before modifying it. Alternatively, you can copy this file to a different location and use the --config flag to run STMP with it (e.g. python stmp.py --config=(path to YAML file)). For more information and examples regarding how to specify dataset information, see the specification file at stable/code/config/datasets_spec.yml and the existing datasets in stable/code/config/datasets.yml.
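
The copy-then-configure workflow described above can be scripted so that the original datasets.yml is never edited in place. This is a minimal sketch with hypothetical helper names; it does not parse or validate the YAML contents.

```python
import os
import shutil


def copy_config(src, dest_dir):
    """Copy the datasets YAML into dest_dir and return the new path."""
    os.makedirs(dest_dir, exist_ok=True)
    dest = os.path.join(dest_dir, os.path.basename(src))
    shutil.copy2(src, dest)
    return dest


def config_flag(config_path):
    """Return the --config argument described above for a given YAML file."""
    return "--config=" + config_path
```

After editing the copy, the string returned by config_flag can be appended to the usual STMP command line.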


-----------------------------------------------------------
Acknowledgement

When using this tool in published works, please cite the following publication:

Dewey, F., et al. "Sequence to medical phenotypes: a framework for interpretation of human whole genome DNA sequence data." PLOS Genetics, 2015.


-----------------------------------------------------------
Appendices


Appendix 1: List of ANNOVAR datasets to download for functional annotation
GRCh37_MT_ensGeneMrna.fa
GRCh37_MT_ensGene.txt
hg19_example_db_generic.txt
hg19_example_db_gff3.txt
hg19_kgXref.txt
hg19_knownGeneMrna.fa
hg19_knownGene.txt
hg19_MT_ensGeneMrna.fa
hg19_MT_ensGene.txt
hg19_refGeneMrna.fa
hg19_refGene.txt
hg19_wgEncodeGencodeBasicV19Mrna.fa
hg19_wgEncodeGencodeBasicV19.txt



Appendix 2: List of ANNOVAR datasets to download for trio tools
decipher_chr.txt
decipher_copy_edit10.txt
decipher_gff.txt
ex1.human.log
galaxy_gff3.txt
gff3test.txt
gt_gff_test.txt
hapmap_3.3.hg19_all.sites.txt
hg18_cytoBand.txt
hg18_example_db_generic.txt
hg18_example_db_gff3.txt
hg18_refGeneMrna.fa
hg18_refGene.txt
hg18_refLink.txt
hg19_AFR.sites.2012_04.txt
hg19_AFR.sites.2012_04.txt.idx
hg19_ALL.sites.2010_11.txt
hg19_ALL.sites.2011_05.txt
hg19_ALL.sites.2011_05.txt.idx
hg19_ALL.sites.2012_02.txt
hg19_ALL.sites.2012_02.txt.idx
hg19_ALL.sites.2012_04.txt
hg19_ALL.sites.2012_04.txt.idx
hg19_AMR.sites.2012_04.txt
hg19_AMR.sites.2012_04.txt.idx
hg19_ASN.sites.2012_04.txt
hg19_ASN.sites.2012_04.txt.idx
hg19_avsift.txt
hg19_avsift.txt.idx
hg19_cg46.txt
hg19_cg46.txt.idx
hg19_cg69.txt
hg19_cg69.txt.idx
hg19.clinvar.2.18.13.txt
hg19_clinvarRegion.txt
hg19_clinvarUrl.txt
hg19_cosmic61.txt
hg19_cosmic61.txt.idx
hg19_cpgIslandExt.txt
hg19_dgv.txt
hg19_ensemblPseudogene.txt
hg19_ensGeneMrna.fa
hg19_ensGene.txt
hg19_esp5400_aa.txt
hg19_esp5400_aa.txt.idx
hg19_esp5400_all.txt
hg19_esp5400_all.txt.idx
hg19_esp5400_ea.txt
hg19_esp5400_ea.txt.idx
hg19_esp6500_aa.txt
hg19_esp6500_aa.txt.idx
hg19_esp6500_all.txt
hg19_esp6500_all.txt.idx
hg19_esp6500_ea.txt
hg19_esp6500_ea.txt.idx
hg19_esp6500si_aa.txt
hg19_esp6500si_aa.txt.idx
hg19_esp6500si_all.txt
hg19_esp6500si_all.txt.idx
hg19_esp6500si_ea.txt
hg19_esp6500si_ea.txt.idx
hg19_EUR.sites.2012_04.txt
hg19_EUR.sites.2012_04.txt.idx
hg19_evofold.txt
hg19_geneReviews.txt
hg19_genomicSuperDups.txt
hg19_gerp++gt2.txt
hg19_gerp++gt2.txt.idx
hg19_gwasCatalog.txt
hg19.hapmap2and3_ASW.txt
hg19.hapmap2and3_CEU.txt
hg19.hapmap2and3_CHB.txt
hg19.hapmap2and3_CHD.txt
hg19.hapmap2and3_GIH.txt
hg19.hapmap2and3_JPT.txt
hg19.hapmap2and3_LWK.txt
hg19.hapmap2and3_MEX.txt
hg19.hapmap2and3_MKK.txt
hg19.hapmap2and3_TSI.txt
hg19.hapmap2and3_YRI.txt
hg19_kgXref.txt
hg19_knownBiocyc.txt
hg19_knownGeneCEU.fa
hg19_knownGeneMrna.fa
hg19_knownGene.txt
hg19_knownGene.txt.fa
hg19_knownKegg.txt
hg19_ljb_all.txt
hg19_ljb_all.txt.idx
hg19_omimGene.txt
hg19_pgkbAnnot.txt
hg19_pgkbUrl.txt
hg19_phastConsElements46way.txt
hg19_pseudogeneYale70.txt
hg19_refGeneMrna.fa
hg19_refGene.txt
hg19_refLink.txt
hg19.regulome.cat1.txt
hg19_regulomeCat1.txt
hg19_rmsk.txt
hg19_snp130.txt
hg19_snp130.txt.idx
hg19_snp132.txt
hg19_snp135.txt
hg19_snp135.txt.idx
hg19_snp137.txt
hg19_targetScanS.txt
hg19_tfbsConsSites.txt
hg19_ucscGenePfam.txt
hg19_wgEncodeBroadHistoneGm12878H3k27acStdSig.txt
hg19_wgEncodeBroadHistoneGm12878H3k4me1StdSig.txt
hg19_wgEncodeBroadHistoneGm12878H3k4me3StdSig.txt
hg19_wgEncodeBroadHmmGm12878HMM.txt
hg19_wgEncodeBroadHmmH1hescHMM.txt
hg19_wgEncodeBroadHmmHmecHMM.txt
hg19_wgEncodeBroadHmmHsmmHMM.txt
hg19_wgEncodeBroadHmmHuvecHMM.txt
hg19_wgEncodeBroadHmmNhekHMM.txt
hg19_wgEncodeBroadHmmNhlfHMM.txt
hg19_wgEncodeGencodeCompV14Mrna.fa
hg19_wgEncodeGencodeCompV14.txt
hg19_wgEncodeGencodeManualV4.txt
hg19_wgEncodeRegDnaseClustered.txt
hg19_wgEncodeRegDnaseClusteredV2.txt
hg19_wgEncodeRegTfbsClustered.txt
hg19_wgEncodeRegTfbsClusteredV2.txt
hg19_wgRna.txt