NGS Pipeline Framework for GIS
This folder contains workflows/pipelines developed and maintained by the Research Pipeline Development Team (RPD).
- Pipelines are divided into steps that are automatically run in parallel whenever possible so that each step makes optimal use of resources
- Pipelines work out of the box on the GIS in-house cluster (aquila) and the National Supercomputing Centre (NSCC) without any changes required by the user
- Cluster specifics are handled internally, i.e. users don't have to worry about scheduler usage details etc.
- Built-in check-pointing: Easy restart and skipping of already completed steps
- Pipelines are organized into directories of a specific category, e.g. `variant-calling`
- Each pipeline has its own subfolder there, and the corresponding starter script has the same name (e.g. `variant-calling/gatk/gatk.py`)
- All wrappers require Python 3. Should it be required, add the RPD Python installation to your PATH: `export PATH=/mnt/projects/rpd/apps/miniconda3/bin/:$PATH`
- Invoking a starter script with `-h` will display its usage information (a usage sketch follows this list)
- Each pipeline folder contains a README file (`README.rst` and/or `README.html`) describing the pipeline (e.g. `variant-calling/gatk/README.rst`)
- All pipelines are designed to work either on aquila (UGE) or the NSCC (PBSPro) out of the box
- Upon completion (success or error) an email will be sent to the user pointing to results (or log files)
- Output: results can be found in the respective `./out/` sub-directory. In addition, a file called `report.html` will be generated containing some basic information about the analysis
- Should a pipeline fail for purely technical reasons (crash of a node, connectivity issues etc.), it can easily be restarted: cd into the output directory and run `qsub run.sh >> logs/submission.log` (see the restart sketch after this list). Upon restart, partially created files will be automatically deleted and the pipeline will skip already completed steps
- In GIS, different pipeline versions are installed at `/mnt/projects/rpd/pipelines/`
- All pipelines are based on Snakemake
- Software versions are defined in each pipeline's `conf.default.yaml` and loaded via dotkit
- Pipeline wrappers create a run directory which contains all necessary configuration files, run scripts etc.
- After creation of this folder, the pipeline is automatically scheduled (unless `--no-run` was used, which gives you a chance to change the config file `conf.yaml` first)
- The main log file is `./logs/snakemake.log` (use `tail -f` to follow live progress)
- Cluster log files can be found in the respective `./logs/` sub-directory
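To tie these points together, here is a sketch of a typical session. `variant-calling/gatk/gatk.py` serves as the example starter, `my-run-dir` is a placeholder for the run directory created by the wrapper, and all pipeline-specific arguments are elided since they differ per pipeline (check `-h`):

```bash
# display the wrapper's usage information
variant-calling/gatk/gatk.py -h

# create the run directory without scheduling anything yet, so that
# conf.yaml can still be edited (pipeline-specific arguments omitted)
variant-calling/gatk/gatk.py --no-run

# submit manually from inside the created run directory (name is a placeholder)
cd my-run-dir
qsub run.sh >> logs/submission.log

# follow live progress through the main log file
tail -f logs/snakemake.log
```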
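Restarting after a purely technical failure follows the same pattern; thanks to the built-in check-pointing, already completed steps are skipped automatically (the directory name is again a placeholder):

```bash
cd my-run-dir                         # the output directory of the failed run
qsub run.sh >> logs/submission.log    # resubmit; completed steps will be skipped
tail -f logs/snakemake.log            # follow progress
```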
The following pipelines are currently available, grouped by category (a sketch of a pipeline folder's layout follows this list):
- bcl2fastq
- custom
  - SG10K
- mapping
  - BWA-MEM
- metagenomics
  - essential-genes
- rnaseq
  - star-rsem
  - fluidigm-ht-c1-rnaseq
- variant-calling
  - gatk
  - lacer-lofreq
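Putting the conventions above together, a pipeline folder can be expected to look roughly like this (illustrated for `variant-calling/gatk`; the file list is not exhaustive):

```
variant-calling/
└── gatk/
    ├── gatk.py            # starter script, named after the pipeline
    ├── README.rst         # pipeline description (and/or README.html)
    └── conf.default.yaml  # default config, incl. software versions
```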
If you want to analyze many samples (identically) with just one wrapper call, you will have to provide per-sample information to the wrapper. The easiest way to achieve this is to create an Excel sheet listing all samples and their fastq file(s) and to convert it into a config file, as described in the following:
- Create an Excel sheet with the following columns:
- sample name (mandatory; can be used repeatedly, e.g. if you have multiple fastqs per sample)
- run id (allowed to be empty)
- flowcell id (allowed to be empty)
- library id (allowed to be empty)
- lane id (allowed to be empty)
- read-group id (allowed to be empty)
- fastq1 (mandatory)
- fastq2 (allowed to be empty)
- Save the Excel sheet as CSV and run the following to convert it to YAML: `tools/sample_conf.py -i <your>.csv -o <your>.yaml`. Depending on how you created the CSV file you might want to set the CSV delimiter with `-d`, e.g. `-d ,`
- Use the created YAML file as input for the pipeline wrapper (usually `-c your.yaml`); a worked example follows below
Contact us: Research Pipeline Development Team (RPD)
Much of this framework assumes a certain setup and certain services to be present, as is the case in GIS. In other words, it might be of limited use to the general public. See INSTALL.md for simplistic installation instructions.