forked from AshleyLab/stmp
ysm0128/stmp
Sequence to Medical Phenotypes (STMP) is a pipeline featuring variant annotation, prioritization, pharmacogenomics, and tools for analyzing genomic trios (mother, father, child).

** Release versions can be downloaded from https://github.com/AshleyLab/stmp/releases **

The toolkit currently uses an SQLite database for added portability.

External Dependencies (to be placed in the "third_party" folder -- see instructions below):
- ANNOVAR version 2015-03-22 15:29:59 (Sun, 22 Mar 2015)
- SnpEff version 4.1e (build 2015-05-02)
Other versions of the above tools may also work but are not currently supported.

Other dependencies (these must be in the user or system PATH before running STMP):
- bcftools version 1.2
- bedtools version 2.17.0

Python dependencies:
- PyYAML version 3.11
- xlwt version 1.0.0 (for exporting results to an Excel file)

-----------------------------------------------------------
Installation Instructions

Downloading software and dependencies
- Download the STMP release from here (https://github.com/AshleyLab/stmp/releases).
- Download ANNOVAR (http://annovar.openbioinformatics.org/en/latest/user-guide/download/) and SnpEff (http://sourceforge.net/projects/snpeff/files/snpEff_latest_core.zip/download) and make sure they are copied/symlinked into folders called "annovar" and "snpeff", respectively, within the third_party folder.
  E.g. ANNOVAR would be linked/copied to third_party/annovar (this folder should contain all files from the ANNOVAR download, including annotate_variation.pl and table_annovar.pl)
  E.g. SnpEff would be linked/copied as third_party/snpeff/snpEff (this folder should include files such as snpEff.jar)
- Ensure PyYAML is installed (via pip install, etc.)
- Ensure bedtools version 2.17.0 and bcftools are installed and in the user/system PATH. These can be either downloaded directly from the corresponding websites or installed via a program such as bcbio.
- Run the appropriate ANNOVAR command to download the datasets specified in Appendix 1 (e.g. "annotate_variation.pl -buildver hg19 -downdb -webfrom annovar refGene humandb/" from within third_party/annovar to download the refGene dataset).
- If you would like to run trio tools:
  - Copy stable/code/trio/annovar/summarize_annovarRDv2.pl to third_party/annovar
  - Run the appropriate ANNOVAR command to download the datasets specified in Appendix 2 (e.g. "annotate_variation.pl -buildver hg19 -downdb -webfrom annovar refGene humandb/" from within third_party/annovar to download the refGene dataset).

Setting up STMP
- Run "python stmp.py --db_update". This will create a SQLite database file in the db folder and download and import the core datasets required for annotation and tiering.

-----------------------------------------------------------
Running STMP
- To run STMP on an input VCF:
  python stmp.py --vcf=(path to input VCF) --output_dir=(output directory)

  Example (cd to the unzipped STMP release folder you downloaded):
  python stable/code/stmp.py --vcf=sample_input_data/genome_in_a_bottle/subset.rs.vcf --output_dir=sample_outputs/genome_in_a_bottle_output

This will run three different modules: annotation, tiering, and pharmacogenomics (pgx).

1) Annotation
This module annotates the input VCF with information from each of the above datasets. It outputs a TSV (tab-separated values) file with each annotation as a separate column (after the standard VCF columns). Annotation includes point annotation, functional annotation (using ANNOVAR and SnpEff), and region (range) annotation using bedtools. Intermediate outputs of specific annotations (e.g. point annotations) are available in the scratch folder within the output directory. The final output (each of these three annotation types joined into a single file) is written as a .tsv file in the specified output directory.

2) Tiering
This module takes the annotated TSV from the previous step and prioritizes the variants into different tiers (below). In addition to outputting a text file with tiering metrics (tiering_allvars.metrics), it outputs text files for each tier (tiering_allvars.tier0.txt, tiering_allvars.tier1.txt, etc.).

Tier 0: Variants classified as pathogenic or likely pathogenic according to ClinVar.
Tier 1: Loss of function variants (splice dinucleotide disrupting, nonsense, nonstop, and frameshift indels).
Tier 2: All rare variants cataloged in HGMD, regardless of functional annotation. Rarity is defined as minor allele frequency (MAF) no greater than 1% by default, or according to user-defined criteria, in any of the following population genetic surveys:
  - the ethnically-matched population in HapMap 2 and 3
  - the 1000 Genomes phase 1 data (ethnically-matched super population and global allele frequency)
  - the 1000 Genomes pilot 1 project global allele frequency
  - the 69 publicly available genomes released by Complete Genomics
  - the NHLBI Grand Opportunity exome sequencing project global allele frequency
Tier 3: All non-rare missense and non-frameshift indels.
Tier 4: All variants not meeting criteria for tiers 1-3.

3) Pharmacogenomics (pgx)
This module takes in a VCF file and outputs several text files summarizing variants with known pharmacogenomic effects. These include effects on drug dosage, efficacy, toxicity, and other interactions, as well as whether any variants in the input file match known "star" alleles associated with clinical drug response for 6 genes (CYP2C19, CYP2C9, CYP2D6, SLCO1B1, TPMT, VKORC1). Each of these files is written to the specified output directory.

For additional options, run "python stmp.py -h". For example, one can use the "--annotate_only" flag to run only the annotation module, the "--tiering_only" flag to run just the tiering module, or the "--pgx_only" flag to run just the pgx module. Note that tiering depends on the annotated output file, so annotation must be run before tiering.
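
The module flags described above can also be assembled from a script. The following is a minimal sketch, not part of STMP itself: build_stmp_command and run_stmp are hypothetical helper names, the mode keys ("annotate", "tiering", "pgx") are illustrative, and the default script path is taken from the example command earlier in this README. It passes --vcf and --output_dir on every run, which is an assumption; check "python stmp.py -h" for what each mode actually requires.

```python
import subprocess

# Flags documented in this README; the dictionary keys are illustrative names.
MODE_FLAGS = {
    "annotate": "--annotate_only",
    "tiering": "--tiering_only",
    "pgx": "--pgx_only",
}


def build_stmp_command(vcf, output_dir, mode=None, script="stable/code/stmp.py"):
    """Return the argument list for one STMP run (hypothetical helper).

    mode=None runs all three modules; otherwise one of the MODE_FLAGS keys.
    """
    command = ["python", script, "--vcf=" + vcf, "--output_dir=" + output_dir]
    if mode is not None:
        if mode not in MODE_FLAGS:
            raise ValueError("unknown mode: %r" % (mode,))
        command.append(MODE_FLAGS[mode])
    return command


def run_stmp(vcf, output_dir, mode=None):
    """Run STMP and raise CalledProcessError if it exits with a non-zero status."""
    subprocess.check_call(build_stmp_command(vcf, output_dir, mode))
```

Because tiering reads the annotated output file, a script that splits the work would call run_stmp(..., mode="annotate") before run_stmp(..., mode="tiering") with the same output directory.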
4) Trio (separate script)
This module analyzes genome sequence data from a father, mother, and child. It takes as input a single VCF with different sample IDs for mother, father, and child.

Usage:
  python trio/trioPipeline.py input output path_to_annovar path_to_matrix offspringID fatherID motherID

Example:
(Note: as the combined file is large, you must download the HG002, HG003, and HG004 VCFs from this site (ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/analysis/NIST_CallsIn2Technologies_05182015/) and manually combine them into a single VCF file using bcftools or similar. The sample command below assumes you have placed the combined file in the sample_input_data/genome_in_a_bottle/trio directory and called it trio.combined.vcf.)
  python stable/code/trio/trioPipeline.py sample_input_data/genome_in_a_bottle/trio/trio.combined.vcf sample_outputs/trio_output/ third_party/annovar/ stable/code/trio/ HG002 HG003 HG004

-----------------------------------------------------------
Customization
- Different datasets can be imported and used for annotation. To import additional datasets, modify stable/code/config/datasets.yml to add information about the desired datasets. It is recommended that you make a backup copy of this file before modifying. Alternately, you can copy this file to a different location and use the --config flag to run stmp with it (e.g. python stmp.py --config=(path to YAML file)). For more information and examples regarding how to specify dataset information, see the specification file at stable/code/config/datasets_spec.yml and the existing datasets in stable/code/config/datasets.yml.

-----------------------------------------------------------
Acknowledgement
When using this tool in published works, please cite the below publication:
Dewey, F., et al. "Sequence to medical phenotypes: a framework for interpretation of human whole genome DNA sequence data." PLOS Genetics, 2015.

-----------------------------------------------------------
Appendices

Appendix 1: List of ANNOVAR datasets to download for functional annotation
GRCh37_MT_ensGeneMrna.fa
GRCh37_MT_ensGene.txt
hg19_example_db_generic.txt
hg19_example_db_gff3.txt
hg19_kgXref.txt
hg19_knownGeneMrna.fa
hg19_knownGene.txt
hg19_MT_ensGeneMrna.fa
hg19_MT_ensGene.txt
hg19_refGeneMrna.fa
hg19_refGene.txt
hg19_wgEncodeGencodeBasicV19Mrna.fa
hg19_wgEncodeGencodeBasicV19.txt

Appendix 2: List of ANNOVAR datasets to download for trio tools
decipher_chr.txt
decipher_copy_edit10.txt
decipher_gff.txt
ex1.human.log
galaxy_gff3.txt
gff3test.txt
gt_gff_test.txt
hapmap_3.3.hg19_all.sites.txt
hg18_cytoBand.txt
hg18_example_db_generic.txt
hg18_example_db_gff3.txt
hg18_refGeneMrna.fa
hg18_refGene.txt
hg18_refLink.txt
hg19_AFR.sites.2012_04.txt
hg19_AFR.sites.2012_04.txt.idx
hg19_ALL.sites.2010_11.txt
hg19_ALL.sites.2011_05.txt
hg19_ALL.sites.2011_05.txt.idx
hg19_ALL.sites.2012_02.txt
hg19_ALL.sites.2012_02.txt.idx
hg19_ALL.sites.2012_04.txt
hg19_ALL.sites.2012_04.txt.idx
hg19_AMR.sites.2012_04.txt
hg19_AMR.sites.2012_04.txt.idx
hg19_ASN.sites.2012_04.txt
hg19_ASN.sites.2012_04.txt.idx
hg19_avsift.txt
hg19_avsift.txt.idx
hg19_cg46.txt
hg19_cg46.txt.idx
hg19_cg69.txt
hg19_cg69.txt.idx
hg19.clinvar.2.18.13.txt
hg19_clinvarRegion.txt
hg19_clinvarUrl.txt
hg19_cosmic61.txt
hg19_cosmic61.txt.idx
hg19_cpgIslandExt.txt
hg19_dgv.txt
hg19_ensemblPseudogene.txt
hg19_ensGeneMrna.fa
hg19_ensGene.txt
hg19_esp5400_aa.txt
hg19_esp5400_aa.txt.idx
hg19_esp5400_all.txt
hg19_esp5400_all.txt.idx
hg19_esp5400_ea.txt
hg19_esp5400_ea.txt.idx
hg19_esp6500_aa.txt
hg19_esp6500_aa.txt.idx
hg19_esp6500_all.txt
hg19_esp6500_all.txt.idx
hg19_esp6500_ea.txt
hg19_esp6500_ea.txt.idx
hg19_esp6500si_aa.txt
hg19_esp6500si_aa.txt.idx
hg19_esp6500si_all.txt
hg19_esp6500si_all.txt.idx
hg19_esp6500si_ea.txt
hg19_esp6500si_ea.txt.idx
hg19_EUR.sites.2012_04.txt
hg19_EUR.sites.2012_04.txt.idx
hg19_evofold.txt
hg19_geneReviews.txt
hg19_genomicSuperDups.txt
hg19_gerp++gt2.txt
hg19_gerp++gt2.txt.idx
hg19_gwasCatalog.txt
hg19.hapmap2and3_ASW.txt
hg19.hapmap2and3_CEU.txt
hg19.hapmap2and3_CHB.txt
hg19.hapmap2and3_CHD.txt
hg19.hapmap2and3_GIH.txt
hg19.hapmap2and3_JPT.txt
hg19.hapmap2and3_LWK.txt
hg19.hapmap2and3_MEX.txt
hg19.hapmap2and3_MKK.txt
hg19.hapmap2and3_TSI.txt
hg19.hapmap2and3_YRI.txt
hg19_kgXref.txt
hg19_knownBiocyc.txt
hg19_knownGeneCEU.fa
hg19_knownGeneMrna.fa
hg19_knownGene.txt
hg19_knownGene.txt.fa
hg19_knownKegg.txt
hg19_ljb_all.txt
hg19_ljb_all.txt.idx
hg19_omimGene.txt
hg19_pgkbAnnot.txt
hg19_pgkbUrl.txt
hg19_phastConsElements46way.txt
hg19_pseudogeneYale70.txt
hg19_refGeneMrna.fa
hg19_refGene.txt
hg19_refLink.txt
hg19.regulome.cat1.txt
hg19_regulomeCat1.txt
hg19_rmsk.txt
hg19_snp130.txt
hg19_snp130.txt.idx
hg19_snp132.txt
hg19_snp135.txt
hg19_snp135.txt.idx
hg19_snp137.txt
hg19_targetScanS.txt
hg19_tfbsConsSites.txt
hg19_ucscGenePfam.txt
hg19_wgEncodeBroadHistoneGm12878H3k27acStdSig.txt
hg19_wgEncodeBroadHistoneGm12878H3k4me1StdSig.txt
hg19_wgEncodeBroadHistoneGm12878H3k4me3StdSig.txt
hg19_wgEncodeBroadHmmGm12878HMM.txt
hg19_wgEncodeBroadHmmH1hescHMM.txt
hg19_wgEncodeBroadHmmHmecHMM.txt
hg19_wgEncodeBroadHmmHsmmHMM.txt
hg19_wgEncodeBroadHmmHuvecHMM.txt
hg19_wgEncodeBroadHmmNhekHMM.txt
hg19_wgEncodeBroadHmmNhlfHMM.txt
hg19_wgEncodeGencodeCompV14Mrna.fa
hg19_wgEncodeGencodeCompV14.txt
hg19_wgEncodeGencodeManualV4.txt
hg19_wgEncodeRegDnaseClustered.txt
hg19_wgEncodeRegDnaseClusteredV2.txt
hg19_wgEncodeRegTfbsClustered.txt
hg19_wgEncodeRegTfbsClusteredV2.txt
hg19_wgRna.txt
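
Before running the annotation module, it can help to confirm that the Appendix 1 files are actually present. The following is a minimal sketch, not part of STMP: missing_datasets is a hypothetical helper, and it assumes the files were downloaded into the humandb folder inside third_party/annovar, as in the ANNOVAR example command in the installation instructions. The same function can check the Appendix 2 files by passing a different list.

```python
import os

# File names copied from Appendix 1 of this README.
APPENDIX_1_FILES = [
    "GRCh37_MT_ensGeneMrna.fa",
    "GRCh37_MT_ensGene.txt",
    "hg19_example_db_generic.txt",
    "hg19_example_db_gff3.txt",
    "hg19_kgXref.txt",
    "hg19_knownGeneMrna.fa",
    "hg19_knownGene.txt",
    "hg19_MT_ensGeneMrna.fa",
    "hg19_MT_ensGene.txt",
    "hg19_refGeneMrna.fa",
    "hg19_refGene.txt",
    "hg19_wgEncodeGencodeBasicV19Mrna.fa",
    "hg19_wgEncodeGencodeBasicV19.txt",
]


def missing_datasets(humandb_dir, expected=APPENDIX_1_FILES):
    """Return the expected file names that are not regular files in humandb_dir.

    If humandb_dir itself does not exist, every expected name is reported.
    """
    return [
        name
        for name in expected
        if not os.path.isfile(os.path.join(humandb_dir, name))
    ]
```

For example, missing_datasets(os.path.join("third_party", "annovar", "humandb")) returns an empty list once every Appendix 1 file has been downloaded.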
Languages
- Python 84.2%
- Perl 15.8%