This set of scripts implements a simple read preprocessing and alignment
pipeline using Slurm array jobs. It's just a single Python script,
`sample_dispatch.py`, that you can place in your project. It's (sadly, yet
another) custom Slurm NGS task-processing pipeline -- use the much better
bcbio-nextgen for more serious processing of model-organism data.

Many steps are configurable, but the general pipeline is:
- Join reads into interleaved pairs (to simplify processing).
- Run reads through seqqs, which records metrics on read quality, length,
  and base composition.
- Trim adapter sequences off of reads using scythe.
- Quality-based trimming with seqtk's trimfq.
- Another round of seqqs, which records post-preprocessing read quality
  metrics.
- Align with BWA-MEM. The exact command is:
  `bwa mem -M -t <nthreads> -R <readgroup> -v 1 -p <reference> <in.fq>`
- Convert reads to BAM and sort reads by position with samtools (a sketch of
  these last two steps follows this list).
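For illustration, here is a minimal Python sketch of how the alignment and
sorting steps might be chained for one sample. Only the `bwa mem` flags come
from the description above; the function name and the `samtools` invocations
(legacy 0.1.x-style `sort` arguments) are assumptions.

```python
# Hypothetical sketch of the per-sample align-and-sort step; the actual
# commands are assembled inside sample_dispatch.py.
import subprocess

def align_and_sort(ref, interleaved_fq, readgroup, nthreads, out_prefix):
    """Align interleaved reads with BWA-MEM, then position-sort with samtools."""
    # bwa mem -M -t <nthreads> -R <readgroup> -v 1 -p <reference> <in.fq>
    bwa = subprocess.Popen(
        ["bwa", "mem", "-M", "-t", str(nthreads), "-R", readgroup,
         "-v", "1", "-p", ref, interleaved_fq],
        stdout=subprocess.PIPE)
    # convert SAM to BAM on the fly
    view = subprocess.Popen(["samtools", "view", "-bS", "-"],
                            stdin=bwa.stdout, stdout=subprocess.PIPE)
    # sort by position; writes <out_prefix>.bam (samtools 0.1.x syntax)
    sort = subprocess.Popen(["samtools", "sort", "-", out_prefix],
                            stdin=view.stdout)
    # close parent copies so SIGPIPE propagates if a downstream step exits
    bwa.stdout.close()
    view.stdout.close()
    sort.wait()
```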
This pipeline is just a template; it can be easily adjusted in the code. In
the future, this template may be factored out of the code.
The pipeline runs in two steps (which are hidden from the user):

First, the dispatch step creates a text file of all the samples' run
information (called the `<job>_samples.txt` file). Each line of this file is
a JSON entry, which is passed to a runner; each JSON entry contains all the
data needed by the processing command: reference, parameters, sample
information, etc.
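For example, a single line in `<job>_samples.txt` might look something like
the following (these field names are hypothetical; the real entries contain
whatever dispatch serializes):

```json
{"sample_id": "teo_01", "read_id": "lane1", "reads1": "data/teo_01_R1.fq.gz",
 "reads2": "data/teo_01_R2.fq.gz", "ref": "maize3.fa.gz", "threads": 10}
```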
Then, the dispatch command writes a Slurm batch script in your directory for
this job, and runs it. This batch script calls the runners (using Slurm's
array job infrastructure); a runner is a subcommand of this pipeline that
takes a single line from the `<job>_samples.txt` file and runs it.
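To make the mechanics concrete, here is a rough sketch of how dispatch might
generate the batch script. The `#SBATCH` options, the `runner` subcommand
arguments, and the function name are all illustrative assumptions, not the
script's actual output:

```python
# Hypothetical sketch of generating <job>_batch.sh; options are illustrative.
BATCH_TEMPLATE = """#!/bin/bash
#SBATCH --job-name={job}
#SBATCH --array=1-{nsamples}
#SBATCH --mem={mem}
#SBATCH --cpus-per-task={threads}

# each array task handles one line (one sample) of the samples file
python sample_dispatch.py runner --setup {setup} \\
    --task $SLURM_ARRAY_TASK_ID {job}_samples.txt
"""

def write_batch_script(job, nsamples, mem, threads, setup):
    path = "{}_batch.sh".format(job)
    with open(path, "w") as f:
        f.write(BATCH_TEMPLATE.format(job=job, nsamples=nsamples,
                                      mem=mem, threads=threads, setup=setup))
    return path
```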
All processing is done on a single pair of reads. To run the pipeline, two
configuration files are needed (note you can name these whatever you want;
hypothetical examples of both follow the list below):
- A `setup.json` JSON configuration file, which includes paths to all
  programs.
- A `samples.txt` file, which provides a mapping between read pairs, sample
  IDs (`@RG:SM` in the SAM spec), and read group IDs (`@RG:ID` in the SAM
  spec). The header format is: `sample_id, read_id, reads1, reads2`.
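For illustration, a `setup.json` might look something like this (the key
names and paths here are assumptions, not the script's required schema):

```json
{
    "seqqs": "/usr/local/bin/seqqs",
    "scythe": "/usr/local/bin/scythe",
    "seqtk": "/usr/local/bin/seqtk",
    "bwa": "/usr/local/bin/bwa",
    "samtools": "/usr/local/bin/samtools"
}
```

And a `samples.txt` might look like this (the sample names and paths are
made up; the columns follow the header format above):

```
sample_id, read_id, reads1, reads2
teo_01, lane1, data/teo_01_R1.fq.gz, data/teo_01_R2.fq.gz
teo_02, lane1, data/teo_02_R1.fq.gz, data/teo_02_R2.fq.gz
```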
You'll need to have Python installed with logbook, which is used to log info
and errors. Note that this uses Slurm's logging mechanism; each task ID gets
its own log. We might add ZeroMQ-based messaging for log consolidation. On
Slurm, load Python with:
$ module load python
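A minimal sketch of what the per-task logging might look like with logbook
(the logger name and log file path are hypothetical):

```python
# Hypothetical sketch of per-task logging with logbook; the log directory
# and logger name are illustrative, not sample_dispatch.py's actual setup.
import os
import logbook

# Slurm sets SLURM_ARRAY_TASK_ID for each array task
task_id = os.environ.get("SLURM_ARRAY_TASK_ID", "0")
log = logbook.Logger("sample_dispatch")
handler = logbook.FileHandler("log/task_%s.log" % task_id)

with handler.applicationbound():
    log.info("starting task %s" % task_id)
```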
Then start your run with something like:
$ python sample_dispatch.py dispatch --ref ~/data/refgen3-hapmap/maize3.fa.gz \
  --samples samples.txt --adapter illumina_adapters.fa --log log --stats data/stats/ \
  --mem 1024 --threads 10 --job teoparent --setup setup.json --bam-dir data/aln/ \
  -P bigmem
This will create `<job>_samples.txt` and `<job>_batch.sh`. The first file is
the job sample file; the second is the Slurm batch script.