NGS Pipeline Framework for GIS
This folder contains workflows developed by the Research Pipeline Development Team (RPD).
- Cluster specifics are handled internally, i.e. users don't have to worry about scheduler usage.
- Built-in check-pointing: Easy restart and skipping of already completed steps
- Pipelines work out of the box on GIS's aquila (UGE) or the National Supercomputing Centre (NSCC) (PBS Pro) without any changes required by the user
- Pipelines are divided into steps that are automatically run in parallel where possible and each step tries to make optimal use of resources
- Pipelines are organized into directories by category, e.g. `variant-calling`
- Each pipeline has its own subfolder there, and the corresponding wrapper script has the same name (e.g. `variant-calling/gatk/gatk.py`)
- Each pipeline folder contains a README file describing the pipeline (e.g. `variant-calling/gatk/README.md`)
- Furthermore, each pipeline folder contains an example flowchart of the workflow, called `example-dag.pdf` (see e.g. `variant-calling/gatk/example-dag.pdf`)
The following installations are available at different sites (referred to as `RPD_PIPELINES` below):

- GIS: `/mnt/projects/rpd/pipelines/`
- NSCC: `/home/users/astar/gis/gisshared/rpd/pipelines/`

Each of these contains one subfolder per pipeline version, e.g. `$RPD_PIPELINES/pipelines.2017-01` (referred to as `PIPELINE_ROOTDIR` below).
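For illustration, picking an installation in a shell session might look like the following sketch (paths as listed above; the release name is just an example):

```bash
# Hypothetical setup: pick the site-specific prefix and one pipeline release.
RPD_PIPELINES=/mnt/projects/rpd/pipelines                      # GIS
# RPD_PIPELINES=/home/users/astar/gis/gisshared/rpd/pipelines  # NSCC
PIPELINE_ROOTDIR=$RPD_PIPELINES/pipelines.2017-01              # one subfolder per version
ls $PIPELINE_ROOTDIR/variant-calling/gatk/   # e.g. gatk.py, README.md, example-dag.pdf
```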
Much of this framework assumes a certain setup and services to be present, as is the case in GIS / the NSCC. This repository is therefore of limited use to the general public. See INSTALL.md for simplistic installation instructions.
Some pipelines only work at a certain site (due to system or software incompatibilities etc.). Refer to the table of pipelines below for details regarding availability.
There are two ways to invoke a pipeline: either call the convenience wrapper, plainly called `run`, or invoke the pipeline-specific scripts directly:

- Using the convenience wrapper (recommended)
  - The basic usage is `$PIPELINE_ROOTDIR/run name options`, where `name` is a pipeline name and `options` are valid options for this pipeline.
  - An example (GATK) would be `$PIPELINE_ROOTDIR/run gatk --help`
  - Just calling `$PIPELINE_ROOTDIR/run` will print a list of available pipelines and simple usage information
- Direct invocation
  - Directly call the wrapper of the particular pipeline that you want to run, e.g. for GATK: `$PIPELINE_ROOTDIR/variant-calling/gatk/gatk.py`
  - Note, in this case you need to have a Python 3 interpreter in your PATH, which is not needed if you use the convenience wrapper (see above). In either case, you must not prefix the script with `python`.
  - Note, there is no need to submit the script itself, as long as you run it from a cluster node
  - If called correctly, jobs will be run on the cluster automatically
- Use `-h` or `--help` to display usage information
- Use the `-v` option to print more information
- All scripts create an output directory (option `-o`) containing the run environment
- Your results will be saved to a corresponding subdirectory called `./out/`
- Upon completion (success or error) an email will be sent to the user (unless `--no-mail` was specified) pointing to the results. In addition, a file called `report.html` will be generated containing some basic information about the analysis.
- Should a pipeline fail for purely technical reasons (crash of a node, connectivity issues etc.), it can easily be restarted: cd into the output directory and run `qsub run.sh >> logs/submission.log` (for GIS). Upon restart, partially created files will be automatically deleted and the pipeline will skip already completed steps.
- Note that the output directory has to be on a shared filesystem, i.e. directories local to the cluster node like `/tmp` won't work (unless run in local mode)
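As a complete example, a WES GATK run for one sample with two fastq pairs might look as follows (all paths are placeholders):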
```bash
fq1_x=x_R1.fastq.gz
fq2_x=x_R2.fastq.gz
fq1_y=y_R1.fastq.gz
fq2_y=y_R2.fastq.gz
bed=/path/to/SeqCap_EZ_Exome_v3_primary.bed
outdir=/path/to/output-folder-for-this-analysis/
/path/to/pipelines/run gatk -o $outdir -1 $fq1_x $fq1_y -2 $fq2_x $fq2_y -s sample-name -t WES -l $bed
# or
# /path/to/pipelines/variant-calling/gatk/gatk.py -o $outdir -1 $fq1_x $fq1_y -2 $fq2_x $fq2_y -s sample-name -t WES -l $bed
```
Name | Category | Notes | @GIS | @NSCC |
---|---|---|---|---|
bcl2fastq | Production | Not for end-users | Y | Y |
ChIP-seq | Chromatin Profiling | | Y | Y |
SG10K | Custom | Not for end-users | Y | Y |
ViPR | GERMS | | Y | Y |
BWA-MEM | Mapping | | Y | Y |
Shotgun Metagenomics | Metagenomics | | Y | Y |
Essential-Genes | Metagenomics | Requires ref download | Y | Y |
STAR-RSEM | RNA-Seq | | Y | Y |
Fluidigm-HT-C1-RNASeq | RNA-Seq | | Y | N |
LoFreq-Somatic | Somatic | | Y | N |
Mutect | Somatic | | Y | Y |
GATK | Variant-calling | | Y | Y |
Lacer-LoFreq | Variant-calling | | Y | N |
See `example-dag.pdf` in each pipeline's folder for a visual overview of the workflow.
Note that pipelines start with fastq files as input (a few allow injection of BAM files).
- All pipelines are based on Snakemake
- Input will be a single fastq file or a pair of fastq files. Multiple of these can be given. Each pair is treated as one readunit (see also the resulting `conf.yaml` file) and gets its own readgroup assigned where appropriate.
- Software versions are defined in each pipeline's `cfg/modules.yaml` and loaded via dotkit
- Pipeline wrappers create an output directory containing all necessary configuration files, run scripts etc.
- After creation of this folder, the analysis run is automatically submitted to the cluster (unless `--no-run` was used, which gives you a chance to change the config file `conf.yaml`)
- The actual run script is called `run.sh`
- The main log file is `./logs/snakemake.log` (use `tail -f` to follow live progress)
- After a successful run, the last line in the main log file will read: `(100%) done`
- Cluster log files can be found in the respective `./logs/` sub-directory
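Taken together, the output directory of a run might look roughly like this (a sketch using only the names mentioned above; exact contents vary per pipeline):

```bash
cd /path/to/output-folder-for-this-analysis
ls
# conf.yaml  run.sh  logs/  out/      (layout as described above)
tail -f logs/snakemake.log            # follow live progress; ends with "(100%) done" on success
```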
To test or debug a run without executing the full analysis, first call the wrapper in question with `--no-run`, cd into the given outdir and then:

- Check the created `conf.yaml`
- Print the DAG: `rm -f logs/snakemake.log; type=pdf; EXTRA_SNAKEMAKE_ARGS="--dag" bash run.sh; cat logs/snakemake.log | dot -T$type > dag.$type`
- Execute a dryrun: `rm -f logs/snakemake.log; EXTRA_SNAKEMAKE_ARGS="--dryrun" bash run.sh; cat logs/snakemake.log`
- Run locally: `nohup bash run.sh; tail -f logs/snakemake.log`
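As one consolidated sketch of the dry-run route (reusing the variable names from the GATK example above; adapt the pipeline options to your own run):

```bash
# Create the run directory without submitting anything to the cluster.
$PIPELINE_ROOTDIR/run gatk --no-run -o $outdir -1 $fq1_x -2 $fq2_x -s sample-name -t WES -l $bed
cd $outdir
cat conf.yaml                        # check samples and readunits
# Dry run: log what would be executed, without running any jobs.
rm -f logs/snakemake.log
EXTRA_SNAKEMAKE_ARGS="--dryrun" bash run.sh
cat logs/snakemake.log
```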
If you have just one sample to analyze (no matter whether it comes as one or multiple fastq pairs), you will use the options `-s`, `-1` and `-2` most of the time. To provide the pipeline with more information about your fastq files (e.g. run-id etc.) you can create a sample configuration file (see below) and provide it to the wrapper script with `--sample-cfg` (thus replacing `-s`, `-1` and `-2`).
You also need a sample configuration file if you want to analyze many samples identically with just one wrapper call. The easiest way to create such a file is to first create an Excel/CSV sheet listing all samples and fastq files and convert it into a sample config file as described in the following:
- Create an Excel sheet with the following columns:
- sample name (mandatory; can be used repeatedly, e.g. if you have multiple fastqs per sample)
- run id (allowed to be empty)
- flowcell id (allowed to be empty)
- library id (allowed to be empty)
- lane id (allowed to be empty)
- read-group id (allowed to be empty)
- fastq1 (mandatory)
- fastq2 (allowed to be empty)
- Save the Excel sheet as CSV and run the following to convert it to a yaml config file: `tools/sample_conf.py -i <your>.csv -o <your>.yaml`. Depending on how you created the CSV file you might want to set the CSV delimiter with `-d`, e.g. `-d ,`
- Use the created yaml file as input for the pipeline wrapper (option `--sample-cfg your.yaml`); a worked example follows below.
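For illustration, a hypothetical sheet for two samples (all names and paths invented; empty columns simply left blank) could be written and converted like this, assuming the column order listed above and the output flag shown in the conversion step:

```bash
# Hypothetical sample sheet; columns as listed above:
# sample name, run id, flowcell id, library id, lane id, read-group id, fastq1, fastq2
cat > samples.csv <<'EOF'
sampleA,RUN001,FC001,LIB001,1,,/path/to/A_L001_R1.fastq.gz,/path/to/A_L001_R2.fastq.gz
sampleA,RUN001,FC001,LIB001,2,,/path/to/A_L002_R1.fastq.gz,/path/to/A_L002_R2.fastq.gz
sampleB,,,,,,/path/to/B_R1.fastq.gz,/path/to/B_R2.fastq.gz
EOF
# Convert to a yaml sample config and hand it to a pipeline wrapper.
tools/sample_conf.py -i samples.csv -o samples.yaml -d ,
$PIPELINE_ROOTDIR/run gatk --sample-cfg samples.yaml -o $outdir -t WES -l $bed
```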
Please note, not all pipelines support this feature (for example the somatic pipelines don't), but most do, e.g. GATK, Lacer-LoFreq. In some cases multisample processing can lead to very high memory consumption by the snakemake master process itself, a side-effect which is hard to predict.
The above configuration can be used for single-sample processing as well; however, for single samples the corresponding use of the options `-s`, `-1` and `-2` is usually easier and will be sufficient for most cases.
Where do I find my results? In the output directory that you specified with `-o`, under a subdirectory called `out`. Depending on the pipeline, the sample name is added as well.
How do I know the run finished successfully? You should have received an email. To double-check, run `tail logs/snakemake.log` in the output directory. It should either say `Nothing to be done` or `(100%) done`.
How do I submit the jobs to the cluster? You don't. It's taken care of automatically.
Which Python do I need to call the scripts with? None! Call the scripts directly, i.e. without prefixing them with `python`.
My pipeline run failed. What now? First, simply try to restart the pipeline: in your output directory execute `qsub run.sh >> logs/submission.log`. If this still fails, you need to troubleshoot by examining the log files. You can ask us for help (see below).
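Spelled out for GIS (the same restart route as described in the options section above):

```bash
cd /path/to/output-folder-for-this-analysis
# Resubmit; partially created files are cleaned up and completed steps are skipped.
qsub run.sh >> logs/submission.log
```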
Yes. Please email us. A committee will decide on implementation priority.
No. For now you will have to use the commandline. The Datahub team is working on a separate web-interface for running pipelines.
Questions, Bug reports
Contact us: Research Pipeline Development Team (RPD)