NGS Pipeline Framework for GIS
This folder contains workflows developed by the Research Pipeline Development Team (RPD).
- Cluster specifics are handled internally, i.e. users don't have to worry about scheduler usage.
- Built-in check-pointing: Easy restart and skipping of already completed steps
- Pipelines work out of the box on GIS's aquila (UGE) or the National Supercomputing Centre (NSCC) (PBS Pro) without any changes required by the user
- Pipelines are divided into steps that are automatically run in parallel where possible and each step tries to make optimal use of resources
- Pipelines are organized into directories by category, e.g. `variant-calling`
- Each pipeline has its own subfolder there, and the corresponding wrapper script has the same name (e.g. `variant-calling/gatk/gatk.py`)
- Each pipeline folder contains a README file describing the pipeline (e.g. `variant-calling/gatk/README.md`)
- Furthermore, each pipeline folder contains an example flowchart of the workflow, called `example-dag.pdf` (see e.g. `variant-calling/gatk/example-dag.pdf`)
The following installations are available at different sites (referred to as `RPD_PIPELINES` below):

- GIS: `/mnt/projects/rpd/pipelines/`
- NSCC: `/home/users/astar/gis/gisshared/rpd/pipelines/`

Each of these contains one subfolder per pipeline version, e.g. `$RPD_PIPELINES/pipelines.2017-01` (referred to as `PIPELINE_ROOTDIR` below).
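For illustration, picking an installation in a shell session might look like the following sketch (paths as listed above; the release name is just an example):

```bash
# Hypothetical setup: pick the site-specific prefix and one pipeline release.
RPD_PIPELINES=/mnt/projects/rpd/pipelines                      # GIS
# RPD_PIPELINES=/home/users/astar/gis/gisshared/rpd/pipelines  # NSCC
PIPELINE_ROOTDIR=$RPD_PIPELINES/pipelines.2017-01              # one subfolder per version
ls $PIPELINE_ROOTDIR/variant-calling/gatk/   # e.g. gatk.py, README.md, example-dag.pdf
```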
Much of this framework assumes a certain setup and services to be present, as is the case in GIS / the NSCC. This repository is therefore of limited use to the general public. See INSTALL.md for simplistic installation instructions.
Some pipelines only work at a certain site (due to system or software incompatibilities etc.). Refer to the table of pipelines below for details regarding availability.
There are two ways to invoke a pipeline: either call the convenience wrapper, plainly called `run`, or invoke the pipeline-specific scripts directly:

- Using the convenience wrapper (recommended)
  - The basic usage is `$PIPELINE_ROOTDIR/run name options`, where `name` is a pipeline name and `options` are valid options for this pipeline.
  - An example (GATK) would be `$PIPELINE_ROOTDIR/run gatk --help`
  - Just calling `$PIPELINE_ROOTDIR/run` will print a list of available pipelines and simple usage information
- Direct invocation
  - Directly call the wrapper of the particular pipeline that you want to run, e.g. for GATK: `$PIPELINE_ROOTDIR/variant-calling/gatk/gatk.py`
  - Note, in this case you need to have a Python 3 interpreter in your PATH, which is not needed if you use the convenience wrapper (see above). In either case, you must not prefix the script with `python`.
  - Note, there is no need to submit the script itself, as long as you run it from a cluster node
  - If called correctly, jobs will be run on the cluster automatically
- Use `-h` or `--help` to display usage information
- Use the `-v` option to print more information
- All scripts create an output directory (option `-o`) containing the run environment
- Your results will be saved to a corresponding subdirectory called `./out/`
- Upon completion (success or error) an email will be sent to the user (unless `--no-mail` was specified) pointing to the results. In addition, a file called `report.html` will be generated containing some basic information about the analysis.
- Should a pipeline fail for purely technical reasons (crash of a node, connectivity issues etc.), it can easily be restarted: cd into the output directory and run `qsub run.sh >> logs/submission.log` (for GIS). Upon restart, partially created files will be automatically deleted and the pipeline will skip already completed steps.
- Note that the output directory has to be on a shared filesystem, i.e. directories local to the cluster node like `/tmp` won't work (unless run in local mode)
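As a complete example, a WES GATK run for one sample with two fastq pairs might look as follows (all paths are placeholders):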
```bash
fq1_x=x_R1.fastq.gz
fq2_x=x_R2.fastq.gz
fq1_y=y_R1.fastq.gz
fq2_y=y_R2.fastq.gz
bed=/path/to/SeqCap_EZ_Exome_v3_primary.bed
outdir=/path/to/output-folder-for-this-analysis/
/path/to/pipelines/run gatk -o $outdir -1 $fq1_x $fq1_y -2 $fq2_x $fq2_y -s sample-name -t WES -l $bed
# or
# /path/to/pipelines/variant-calling/gatk/gatk.py -o $outdir -1 $fq1_x $fq1_y -2 $fq2_x $fq2_y -s sample-name -t WES -l $bed
```
Name | Category | Notes | @GIS | @NSCC |
---|---|---|---|---|
bcl2fastq | Production | Not for end-users | Y | Y |
ChIP-seq | Chromatin Profiling | | Y | Y |
SG10K | Custom | Not for end-users | Y | Y |
ViPR | GERMS | | Y | Y |
BWA-MEM | Mapping | | Y | Y |
Shotgun Metagenomics | Metagenomics | | Y | Y |
Essential-Genes | Metagenomics | Requires ref download | Y | Y |
STAR-RSEM | RNA-Seq | | Y | Y |
Fluidigm-HT-C1-RNASeq | RNA-Seq | | Y | N |
LoFreq-Somatic | Somatic | | Y | N |
Mutect | Somatic | | Y | Y |
GATK | Variant-calling | | Y | Y |
Lacer-LoFreq | Variant-calling | | Y | N |
See `example-dag.pdf` in each pipeline's folder for a visual overview of the workflow.
Note that pipelines start with fastq files as input (a few allow injection of BAM files).
- All pipelines are based on Snakemake
- Input will be a single fastq file or a pair of fastq files. Multiple of these can be given. Each pair is treated as one readunit (see also the resulting `conf.yaml` file) and gets its own readgroup assigned where appropriate.
- Software versions are defined in each pipeline's `cfg/modules.yaml` and loaded via dotkit
- Pipeline wrappers create an output directory containing all necessary configuration files, run scripts etc.
- After creation of this folder, the analysis run is automatically submitted to the cluster (unless `--no-run` was used, which gives you a chance to change the config file `conf.yaml`)
- The actual run script is called `run.sh`
- The main log file is `./logs/snakemake.log` (use `tail -f` to follow live progress)
- After a successful run, the last line in the main log file will read: `(100%) done`
- Cluster log files can be found in the respective `./logs/` sub-directory
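Taken together, the output directory of a run might look roughly like this (a sketch using only the names mentioned above; exact contents vary per pipeline):

```bash
cd /path/to/output-folder-for-this-analysis
ls
# conf.yaml  run.sh  logs/  out/      (layout as described above)
tail -f logs/snakemake.log            # follow live progress; ends with "(100%) done" on success
```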
To test or debug a run without executing the full analysis, first call the wrapper in question with `--no-run`, cd into the given outdir and then:

- Check the created `conf.yaml`
- Print the DAG: `rm -f logs/snakemake.log; type=pdf; EXTRA_SNAKEMAKE_ARGS="--dag" bash run.sh; cat logs/snakemake.log | dot -T$type > dag.$type`
- Execute a dryrun: `rm -f logs/snakemake.log; EXTRA_SNAKEMAKE_ARGS="--dryrun" bash run.sh; cat logs/snakemake.log`
- Run locally: `nohup bash run.sh; tail -f logs/snakemake.log`
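As one consolidated sketch of the dry-run route (reusing the variable names from the GATK example above; adapt the pipeline options to your own run):

```bash
# Create the run directory without submitting anything to the cluster.
$PIPELINE_ROOTDIR/run gatk --no-run -o $outdir -1 $fq1_x -2 $fq2_x -s sample-name -t WES -l $bed
cd $outdir
cat conf.yaml                        # check samples and readunits
# Dry run: log what would be executed, without running any jobs.
rm -f logs/snakemake.log
EXTRA_SNAKEMAKE_ARGS="--dryrun" bash run.sh
cat logs/snakemake.log
```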
If you have just one sample to analyze (no matter whether it comes as one or multiple fastq pairs), you will use the options `-s`, `-1` and `-2` most of the time. To provide the pipeline with more information about your fastq files (e.g. run-id etc.) you can create a sample configuration file (see below) and provide it to the wrapper script with `--sample-cfg` (thus replacing `-s`, `-1` and `-2`).
You also need a sample configuration file if you want to analyze many samples identically with just one wrapper call. The easiest way to create such a file is to first create an Excel/CSV sheet listing all samples and fastq files and convert it into a sample config file as described in the following:
- Create an Excel sheet with the following columns:
- sample name (mandatory; can be used repeatedly, e.g. if you have multiple fastqs per sample)
- run id (allowed to be empty)
- flowcell id (allowed to be empty)
- library id (allowed to be empty)
- lane id (allowed to be empty)
- read-group id (allowed to be empty)
- fastq1 (mandatory)
- fastq2 (allowed to be empty)
- Save the Excel sheet as CSV and run the following to convert it to a yaml config file: `tools/sample_conf.py -i <your>.csv -o <your>.yaml`. Depending on how you created the CSV file you might want to set the CSV delimiter with `-d`, e.g. `-d ,`
- Use the created yaml file as input for the pipeline wrapper (option `--sample-cfg your.yaml`); a worked example follows below.
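For illustration, a hypothetical sheet for two samples (all names and paths invented; empty columns simply left blank) could be written and converted like this, assuming the column order listed above and the output flag shown in the conversion step:

```bash
# Hypothetical sample sheet; columns as listed above:
# sample name, run id, flowcell id, library id, lane id, read-group id, fastq1, fastq2
cat > samples.csv <<'EOF'
sampleA,RUN001,FC001,LIB001,1,,/path/to/A_L001_R1.fastq.gz,/path/to/A_L001_R2.fastq.gz
sampleA,RUN001,FC001,LIB001,2,,/path/to/A_L002_R1.fastq.gz,/path/to/A_L002_R2.fastq.gz
sampleB,,,,,,/path/to/B_R1.fastq.gz,/path/to/B_R2.fastq.gz
EOF
# Convert to a yaml sample config and hand it to a pipeline wrapper.
tools/sample_conf.py -i samples.csv -o samples.yaml -d ,
$PIPELINE_ROOTDIR/run gatk --sample-cfg samples.yaml -o $outdir -t WES -l $bed
```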
Please note, not all pipelines support this feature (for example the somatic pipelines don't), but most do, e.g. GATK, Lacer-LoFreq. In some cases multisample processing can lead to very high memory consumption by the snakemake master process itself, a side-effect which is hard to predict.
The above configuration can be used for single-sample processing as well; however, for single samples the corresponding use of the options `-s`, `-1` and `-2` is usually easier and will be sufficient for most cases.
Where do I find my results? In the output directory that you specified with `-o`, under a subdirectory called `out`. Depending on the pipeline, the sample name is added as well.
How do I know the run finished successfully? You should have received an email. To double-check, run `tail logs/snakemake.log` in the output directory. It should either say `Nothing to be done` or `(100%) done`.
How do I submit the jobs to the cluster? You don't. It's taken care of automatically.
Which Python do I need to call the scripts with? None! Call the scripts directly, i.e. without prefixing them with `python`.
My pipeline run failed. What now? First, simply try to restart the pipeline: in your output directory execute `qsub run.sh >> logs/submission.log`. If this still fails, you need to troubleshoot by examining the log files. You can ask us for help (see below).
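Spelled out for GIS (the same restart route as described in the options section above):

```bash
cd /path/to/output-folder-for-this-analysis
# Resubmit; partially created files are cleaned up and completed steps are skipped.
qsub run.sh >> logs/submission.log
```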
Yes. Please email us. A committee will decide on implementation priority.
No. For now you will have to use the commandline. The Datahub team is working on a separate web-interface for running pipelines.
Questions, Bug reports
Contact us: Research Pipeline Development Team (RPD)