Clinical research pipeline for exploring variants in whole genome (WGS) and exome (WES) sequencing data
crg2 is a research pipeline aimed at discovering clinically relevant variants (SNVs, SVs) in whole genome and exome sequencing data. It aims to provide reproducible results, be computationally efficient, and be transparent in its workflow.
crg2 uses Snakemake and Conda to manage jobs and software dependencies.
- Download and setup Anaconda: https://docs.conda.io/projects/conda/en/latest/user-guide/install/linux.html
- Install Snakemake 5.10.0 via Conda: https://snakemake.readthedocs.io/en/stable/getting_started/installation.html
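For example, a dedicated environment with the pinned version (the environment name matches the one activated later in this guide):
conda create -n snakemake -c bioconda -c conda-forge snakemake=5.10.0
conda activate snakemake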
- Git clone this repo, crg, and cre. crg2 uses various scripts from the other two repos to generate final reports.
- Make a directory for Conda to install all its environments and executables in, for example:
mkdir ~/crg2-conda
- Navigate to the crg2 directory. Install all software dependencies using:
- WGS:
cd crg2
snakemake --use-conda -s Snakefile --conda-prefix ~/crg2-conda --create-envs-only
- WES (this will install additional tools like freebayes, platypus, mosdepth and gatk3):
cd crg2
snakemake --use-conda -s cre.Snakefile --conda-prefix ~/crg2-conda --create-envs-only
Make sure to replace ~/crg2-conda with the path created in step 4. This will take a while.
- Install these plugins for VEP: LoF, MaxEntScan, SpliceRegion. Refer to this page for installation instructions: https://useast.ensembl.org/info/docs/tools/vep/script/vep_plugins.html. The INSTALL.pl script has been renamed to vep_install in VEP's Conda build. It is located in the conda environment directory, under share/ensembl-vep-99.2-0/vep_install. Therefore, your command should be similar to:
fb5f2eb3/share/ensembl-vep-99.2-0/vep_install -a p --PLUGINS LoF,MaxEntScan,SpliceRegion
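If you are unsure of the hashed environment directory name, you can locate the script first; a minimal sketch, assuming the conda prefix from step 4:
# Find the renamed INSTALL.pl (vep_install) under the conda prefix;
# the hashed environment directory name varies per install.
find ~/crg2-conda -name vep_install -path '*ensembl-vep*'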
- Git clone cre to a safe place:
git clone https://github.com/ccmbioinfo/cre
- Replace the VEP paths in crg2/config.yaml with the VEP directory installed in step 6. Replace the cre path in crg2/config.yaml with the one from step 7.
- AnnotSV 2.1 is required for SV report generation.
- Download AnnotSV:
wget https://lbgi.fr/AnnotSV/Sources/AnnotSV_2.1.tar.gz
- Unpack :
tar -xzvf AnnotSV_2.1.tar.gz
- Set the value of $ANNOTSV in your .bashrc:
export ANNOTSV=/path_of_AnnotSV_installation/bin
- Modify AnnotSV_2.1/configfile:
- set -bedtools: bedtools
- set -overlap: 50
- set -reciprocal: yes
- set -svtBEDcol: 4
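These edits can also be scripted; a minimal sketch, assuming each setting occupies its own '-option: value' line in the configfile as listed above:
# Back up the configfile, then apply the four settings above.
cd AnnotSV_2.1
cp configfile configfile.bak
sed -i 's|^-bedtools:.*|-bedtools: bedtools|' configfile
sed -i 's|^-overlap:.*|-overlap: 50|' configfile
sed -i 's|^-reciprocal.*|-reciprocal: yes|' configfile
sed -i 's|^-svtBEDcol:.*|-svtBEDcol: 4|' configfile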
- To generate a gene panel from an HPO text file exported from PhenomeCentral or G4RD, add the HPO filepath to config["run"]["hpo"]. You will also need to generate Ensembl and RefSeq gene files as well as an HGNC gene mapping file.
- Download and unzip Ensembl gtf:
wget -qO- http://ftp.ensembl.org/pub/grch37/current/gtf/homo_sapiens/Homo_sapiens.GRCh37.87.gtf.gz | gunzip -c > Homo_sapiens.GRCh37.87.gtf
- Download and unzip RefSeq gff:
wget -qO- https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh37_latest/refseq_identifiers/GRCh37_latest_genomic.gff.gz | gunzip -c > GRCh37_latest_genomic.gff
- Download RefSeq chromosome mapping file:
wget https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh37_latest/refseq_identifiers/GRCh37_latest_assembly_report.txt
- Run script to parse the above files:
python scripts/clean_gtf.py --ensembl_gtf /path/to/Homo_sapiens.GRCh37.87.gtf --refseq_gff3 /path/to/GRCh37_latest_genomic.gff --refseq_assembly /path/to/GRCh37_latest_assembly_report.txt
- Add the paths to the output files, Homo_sapiens.GRCh37.87.gtf_subset.csv and GRCh37_latest_genomic.gff_subset.csv, to the config["gene"]["ensembl"] and config["gene"]["refseq"] fields.
- You will also need the HGNC alias file: download this from https://www.genenames.org/download/custom/ using the default fields. Add the path to this file to config["gene"]["hgnc"].
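Once these paths are added, the relevant section of config.yaml might look like the following; a sketch only, with illustrative paths and the field names from above:
gene:
  ensembl: "/path/to/Homo_sapiens.GRCh37.87.gtf_subset.csv"
  refseq: "/path/to/GRCh37_latest_genomic.gff_subset.csv"
  hgnc: "/path/to/hgnc_custom_download.txt"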
- You will need to have R in your $PATH to generate the STR (EH and EHDN) reports. You will also need to install the following packages: dbscan, doSNOW.
$ R
> install.packages("dbscan")
> install.packages("doSNOW")
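Alternatively, install them non-interactively (assuming Rscript is on the same $PATH):
Rscript -e 'install.packages(c("dbscan", "doSNOW"), repos = "https://cran.r-project.org")'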
- Make a folder in a directory with sufficient space. Copy over the template files samples.tsv, units.tsv, config.yaml, and pbs_profile/pbs_config.yaml. You may need to re-copy config.yaml and pbs_config.yaml from the repo if they have been updated since your previous crg2 runs. Note that pbs_config.yaml is for submitting each rule as a cluster job, so ignore it if you are not running on a cluster.
mkdir NA12878
cp crg2/samples.tsv crg2/units.tsv crg2/config.yaml crg2/pbs_profile/pbs_config.yaml NA12878
- Reconfigure the 3 files to reflect the project name, sample names, input fastq, bam or cram files, a panel bed file (if any), an HPO file (if any), and a ped file (if any). Inclusion of a panel bed file or HPO file will generate 2 SNV reports with all variants falling within these regions. Inclusion of a ped file with unaffected parents and an affected proband will allow generation of a de novo report. Note that the default input file type specified in the config is fastq; change this to bam or cram if the inputs are bam or cram files. If you are using cram files as input, also make sure that you specify the correct cram references: old_cram_ref refers to the original reference the cram was aligned to, and new_cram_ref refers to the new reference used to convert the cram to fastq. You can get the old_cram_ref from the cram file header by running samtools view -H file_name.cram (see the example below). If there are multiple fastqs per read end, these must be comma-delimited within the units.tsv file. At the moment, the pipeline does not support a mix of fastq, bam and cram files within a project/family.
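A quick way to inspect the header; a minimal sketch (the filename is illustrative):
# Show the header lines carrying reference info (@SQ) and the original
# alignment command line (@PG) for an input cram.
samtools view -H file_name.cram | grep -E '^@(SQ|PG)' | head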
samples.tsv
sample
NA12878
units.tsv
sample platform fq1 fq2 bam cram
NA12878 ILLUMINA /hpf/largeprojects/ccm_dccforge/dccdipg/Common/NA12878/NA12878.bam_1.fq /hpf/largeprojects/ccm_dccforge/dccdipg/Common/NA12878/NA12878.bam_2.fq
config.yaml
run:
project: "NA12878"
samples: samples.tsv
units: units.tsv
ped: "" # leave this line blank if there is no ped
panel: "/hpf/largeprojects/ccmbio/dennis.kao/NA12878/panel.bed" # remove this line entirely if there is no panel bed file
flank: "100000"
input: "fastq" # default fastq; specify bam if input is bam or cram
cre: /hpf/largeprojects/ccm_dccforge/dccdipg/Common/pipelines/cre
ref:
name: GRCh37.75
...
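For reference, a ped file is tab-separated with six columns (family ID, individual ID, father ID, mother ID, sex, phenotype); a hypothetical trio with unaffected parents (phenotype 1) and an affected proband (phenotype 2) might look like:
FAM1 proband01 father01 mother01 2 2
FAM1 father01 0 0 1 1
FAM1 mother01 0 0 2 1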
- Activate the conda environment with Snakemake 5.10.0
(base) [dennis.kao@qlogin5 crg2]$ conda activate snakemake
(snakemake) [dennis.kao@qlogin5 crg2]$ snakemake -v
5.10.0
- Test that the pipeline will run by adding the flag "-n" to the command in dnaseq.pbs.
(snakemake) [dennis.kao@qlogin5 crg2]$ snakemake --use-conda -s /hpf/largeprojects/ccm_dccforge/dccdipg/Common/pipelines/crg2/Snakefile --cores 4 --conda-prefix ~/crg2-conda -n
Building DAG of jobs...
Job counts:
count jobs
1 all
86 call_variants
86 combine_calls
86 genotype_variants
2 hard_filter_calls
1 map_reads
1 merge_calls
1 merge_variants
1 recalibrate_base_qualities
2 select_calls
1 snvreport
1 vcf2db
1 vcfanno
1 vep
1 vt
272
[Tue May 12 10:56:12 2020]
rule map_reads:
input: /hpf/largeprojects/ccm_dccforge/dccdipg/Common/NA12878/NA12878.bam_1.fq, /hpf/largeprojects/ccm_dccforge/dccdipg/Common/NA12878/NA12878.bam_2.fq
output: mapped/NA12878-1.sorted.bam
log: logs/bwa_mem/NA12878-1.log
jobid: 271
wildcards: sample=NA12878, unit=1
threads: 4
[Tue May 12 10:56:12 2020]
rule recalibrate_base_qualities:
input: mapped/NA12878-1.sorted.bam, /hpf/largeprojects/ccm_dccforge/dccdipg/Common/genomes/GRCh37d5/GRCh37d5.fa, /hpf/largeprojects/ccmbio/naumenko/tools/bcbio/gemini_data/dbsnp.b147.20160601.tidy.vcf.gz
output: recal/NA12878-1.bam
log: logs/gatk/bqsr/NA12878-1.log
jobid: 270
wildcards: sample=NA12878, unit=1
[Tue May 12 10:56:12 2020]
rule call_variants:
input: recal/NA12878-1.bam, /hpf/largeprojects/ccm_dccforge/dccdipg/Common/genomes/GRCh37d5/GRCh37d5.fa, /hpf/largeprojects/ccmbio/naumenko/tools/bcbio/gemini_data/dbsnp.b147.20160601.tidy.vcf.gz
output: called/NA12878.GL000204.1.g.vcf.gz
log: logs/gatk/haplotypecaller/NA12878.GL000204.1.log
jobid: 239
wildcards: sample=NA12878, contig=GL000204.1
...
- Run the pipeline. You can invoke the exome or genome pipeline by passing -v pipeline=wes|wgs to the PBS script:
- as a single job using:
qsub dnaseq.pbs -v pipeline=wes
- as multi-node jobs using:
qsub dnaseq_cluster.pbs -v pipeline=wgs
Change the values of the variables defined inside the above file according to your system. Refer to the pbs_profile/cluster.md document for detailed documentation on cluster integration.
The SNV reports can be found in the directories:
- report/coding/{PROJECT_ID}/{PROJECT_ID}.*wes*.csv
- report/panel/{PROJECT_ID}/{PROJECT_ID}.*wgs*.csv
- report/panel-flank/{PROJECT_ID}/{PROJECT_ID}.*wgs*.csv
The SV reports can be found in the directory:
- report/sv/{PROJECT_ID}.wgs.{VER}.{DATE}.tsv
The STR reports can be found in:
- report/str/{PROJECT_ID}.EH.v1.1.{DATE}.xlsx
- report/str/{PROJECT_ID}.EHDN.{DATE}.xlsx
The parser_genomes.py script can be used to automate the above process for a batch of genomes (see the example invocation after the operations list below).
usage: parser_genomes.py [-h] -f FILE -s {fastq,mapped,recal,decoy_rm} -d path
Reads sample info from TSV file (-f) and creates directory (-d) necessary to start crg2 from step (-s) requested.
optional arguments:
-h, --help show this help message and exit
-f FILE, --file FILE Five-column tab-separated sample info file; template sample file: crg2/sample_info.tsv
-s {fastq,mapped,recal,decoy_rm}, --step {fastq,mapped,recal,decoy_rm}
start running from the creation of this folder (step)
-d path, --dir path Absolute path where the crg2 directory structure will be created under familyID as the base directory
The script performs the following operations for each familyID present in the sample info file:
- create run folder: <familyID>
- copy the following files to the run folder and update settings where applicable:
- config.yaml: update run name, input type, panel and ped (if available in /hpf/largeprojects/ccmbio/dennis.kao/gene_data/{HPO,Pedigrees})
- units.tsv: add sample names and input file paths
- samples.tsv: add sample names
- dnaseq_cluster.pbs: rename job (#PBS -N <familyID>)
- pbs_config.yaml
- submit Snakemake job
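A hypothetical invocation, using the template sample info file named in the usage above and an illustrative output path:
# Set up run folders for a batch of genomes, starting from fastq inputs.
python parser_genomes.py -f sample_info.tsv -s fastq -d /path/to/runs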
The parser_exomes.py script can be used to automate the above process for a batch of exomes (see the example invocation after the operations list below).
usage: parser_exomes.py [-h] -f FILE -d path -s {fastq,mapped,recal,decoy_rm}
-b BIOINFOS [BIOINFOS ...]
Reads sample info from csv file (-f) and creates directory (-d) necessary to
start crg2 from step (-s) requested.
optional arguments:
-h, --help show this help message and exit
-f FILE, --file FILE Analyses csv output from STAGER
-d path, --dir path Absolute path where crg2 directory structure will be
created under familyID as base directory
-s {fastq,mapped,recal,decoy_rm}, --step {fastq,mapped,recal,decoy_rm}
                        start running from the creation of this folder (step)
-b BIOINFOS [BIOINFOS ...], --bioinfos BIOINFOS [BIOINFOS ...]
Names of bioinformaticians who will be assigned cases
separated by spaces, e.g. MC NH PX
The script performs the following operations for each familyID present in the analyses request file:
- create run folder: <familyID>
- copy the following files to the run folder and update settings where applicable:
- config.yaml: update run name, input type
- units.tsv: add sample names and input file paths from MinIO uploads or previously run analyses
- samples.tsv: add sample names
- dnaseq_cluster.pbs: rename job (#PBS -N <familyID>)
- pbs_config.yaml
- submit Snakemake job
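A hypothetical invocation, with an illustrative analyses csv, an illustrative output path, and the example initials from the usage above:
# Set up run folders for a batch of exomes from a STAGER analyses csv.
python parser_exomes.py -f analyses.csv -d /path/to/runs -s fastq -b MC NH PX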
The WGS SNV pipeline performs the following steps:
- Map fastqs to the human decoy genome GRCh37d5
- Picard MarkDuplicates, but don't remove reads
- GATK4 base recalibration
- Remove reads mapped to decoy chromosomes
- Call SNVs and generate gVCFs
- Merge gVCFs and perform joint genotyping
- Filter against GATK best practices filters
- Decompose multiallelics, sort and uniq the filtered VCF using vt
- Annotate using vcfanno and VEP
- Generate a gemini db using vcf2db.py
- Generate a cre report using cre.sh
The WES SNV pipeline performs the following steps:
- Call variants using GATK, Freebayes, Platypus, and SAMtools
- Apply caller-specific filters and retain PASS variants
- Decompose multiallelics, sort and uniq the filtered VCF using vt
- Retain variants called by GATK or by 2 other callers; annotate caller info in the VCF with INFO/CALLER and INFO/NUMCALLS
- Annotate using vcfanno and VEP
- Generate a gemini db using vcf2db.py
- Generate a cre report using cre.sh
The SV pipeline performs the following steps:
- Call SVs using Manta, Smoove and Wham
- Merge calls using MetaSV
- Annotate the VCF using snpEff and SVScore
- Split the multi-sample VCF into individual sample VCFs
- Generate an annotated report using crg
The STR pipeline has two components:
A. ExpansionHunter: known repeat locations
- Identify repeat expansions in sample BAM/CRAMs
- Annotate repeats with disease threshold, gene name, and repeat sizes from 1000 Genomes (mean & median)
- Generate a per-family report as an Excel file
B. ExpansionHunterDenovo: de novo repeats
- Identify de novo repeats in sample BAM/CRAMs
- Combine individual JSONs from the current family and 1000 Genomes into a multi-sample TSV
- Run DBSCAN clustering to identify outlier repeats
- Annotate with gnomAD, OMIM, ANNOVAR
- Generate a per-family report as an Excel file
Column descriptions and more info on how variants are filtered can be found here:
SNV: https://docs.google.com/document/d/1zL4QoINtkUd15a0AK4WzxXoTWp2MRcuQ9l_P9-xSlS4
SV: https://docs.google.com/document/d/1o870tr0rcshoae_VkG1ZOoWNSAmorCZlhHDpZuZogYE
The WGS pipeline generates 6 reports:
- wgs.snv - a report on coding SNVs across the entire genome
- wgs.panel.snv - a report on SNVs within the panel bed file
- wgs.panel-flank.snv - a report on SNVs within the panel bed file, with a 100kb flank on each side
- wgs.sv - a report on SVs across the entire genome
- EH - a report on repeat expansions in known locations
- EHDN - a report on de novo repeats filtered from a case-control outlier analysis
The WES pipeline generates 4 reports for SNVs:
- clinical.wes.regular - a report on coding SNVs in exonic regions using clinical filters, as described here
- clinical.wes.synonymous - a report on synonymous SNVs in exonic regions using clinical filters, as described here
- wes.regular - a report on coding SNVs in exonic regions
- wes.synonymous - a report on synonymous SNVs in exonic regions
The following output files are not included in the main Snakefile and can be requested on the snakemake command line.
- HPO annotated reports:
Reports from coding, panel and panel-flank can be annotated with HPO terms whenever an HPO file is available. This is done mainly for monthly GenomeRounds. HPO annotated TSV files are created in the directory report/hpo_annotated using the following command:
snakemake --use-conda -s $SF --conda-prefix $CP --profile ${PBS} -p report/hpo_annotated
The reason this output is not included in the Snakefile is that the output of rule allsnvreport is a directory, and hence snakemake will not check for the creation of the final csv reports. So, users are required to make sure the csv reports are created in the three folders above, and then request the output of rule annotate_hpo separately on the command line (or append it to dnaseq_cluster.pbs).
CNV and SV comparison outputs are not yet part of the pipeline. Please follow steps 7 & 8 in crg to generate the following three TSVs (this is also required for GenomeRounds):
- <FAMILYID>.<DATE>.cnv.withSVoverlaps.tsv
- <FAMILYID>.unfiltered.wgs.sv.<VER>.<DATE>.withCNVoverlaps.tsv
- <FAMILYID>.wgs.sv.<VER>.<DATE>.withCNVoverlaps.tsv
SNV calls from the WES and WGS pipelines can be benchmarked using the GIAB dataset HG001_NA12878 (family_sample) and truth calls from NISTv3.3.2.
- Copy all required files for the run as described here. The inputs in units.tsv are downsampled for testing purposes; edit the tsv to use the inputs from HPF: ccmmarvin_shared/validation/benchmarking/benchmark-datasets.
- Copy crg2/benchmark.tsv to the current directory. Note: benchmark.tsv uses HG001_NA12878 as the family_sample name, so you should edit the "project" name in config.yaml.
- Edit config.yaml to set the "wes" or "wgs" pipeline.
- Edit dnaseq_cluster.pbs to include the target validation/HG001.
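The benchmarking target can also be requested directly on the snakemake command line; a minimal sketch (the Snakefile path is illustrative; drop -n to actually run):
# Dry-run the benchmarking target before submitting to the cluster.
snakemake --use-conda -s /path/to/crg2/Snakefile --conda-prefix ~/crg2-conda -n validation/HG001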