Woldlab RNA-seq

Introduction

This started as a port of the ENCODE3 bulk RNA-seq pipeline for processing Illumina short read sequences from DNA nexus to HT-Condor's dagman language. as time progressed I've also added in some of our (Wold Lab) quality-control steps after the core RNA-Seq STAR/rsem runs.

More recently I added new pipelines to run STAR on 10x style or split-seq style experiments.

Installation

Required Dependencies

HT-Condor (for bulk pipeline)
snakemake (for single cell pipelines)
python3
numpy
pandas
scipy.stats
jinja2
bokeh
GeorgisScripts
- gene_coverage_wig_gtf.py
- SAM_reads_in_genes3_BAM.py
- SAMstats.py

Obtaining Indexes

One convenient thing on the ENCODE DCC portal provides is previously built indexes and source files for STAR, RSEM, and Tophat.

For example experiments linked to from the ENCODE RNA-Seq evaluation dataset ENCSR000AJW such as ENCSR000AED have a visulaization showing how all the official ENCODE processed files were created. Between the two replicates there are cells indicating the human female genome index files used, which in this case were TopHat ENCFF700TXC STAR ENCFF839KAI and RSEM ENCFF826ONU

(In the following the links are are direct links to download the files, instead of viewing the DCC metadata about the tar file)

The mouse mm10 / M4 male genomic index files used were:

Download	Index type
ENCFF483PAE	STAR
ENCFF064YNQ	RSEM
ENCFF268IGM	TopHat

Late in ENCODE 3 we swiched to GRCh38 / V24 for the human reference genome and GENCODE annotation, The index files are.

Download	Index type
ENCFF742NER	STAR
ENCFF940AZB	RSEM
ENCFF751OWT	Tophat

I included the TopHat link although I'm not sure anyone bothered with TopHat at this point in the project.

Earlier in ENCODE 3 we were using human hg19 reference genome with GENCODE V19. The male genomic index files used were:

Download	Index type
ENCFF746IZL	STAR
ENCFF826ONU	RSEM
ENCFF515BAC	TopHat

All the tar files containing the various indexes genome version, annotation version, and chromosomal sex for the supported software are also available from the following links.

Accession	Description
ENCSR641UDW:	Tophat indexed reference files with ERCC spike-ins
ENCSR314WMD:	STAR indexed reference files with ERCC spike-ins
ENCSR219BJA:	RSEM indexed reference files with ERCC spike-ins

Even though we are currently just running STAR/RSEM the Tophat indexes include the prebuilt GFF file that contains all of tRNA and spikes that we need for Wold Lab QC software.

You can also download the components for the GFF files from GENCODE reference files used in ENCODE pipelines. Though they might still need to be processed to add the tRNA's and spike-ins commonly used.

To prepare an index directory:

Pick a set of TopHat, STAR, and RSEM indexes one needs.
Extract all three ENCFF files into the default out/ directory
Rename the out/ directory to a "Genome Triple" directory name: Genome Build name - Annotation version - Genome sex. For example: mm10-M4-male or hg19-V19-male

NOTE: I discovered that the human and mouse GFF files are named differently. The mouse GFF from the ENCFF268IGM: file was named gencode.vM4-tRNAs-ERCC.gff while the human GFF file from ENCFF515BAC was named annotation.gff. To work around this I needed for the hg19-V19-male directory I needed to do the following in the hg19-V19-male directory: ln -s annotation.gff gencode.vV19-tRNAs-ERCC.gff

Configuration

Currently this pipeline code needs a bit of help to find the condor scripts and the genome indexes.

You need to create a configuration file in your home directory called .htsworkflow.ini with the contents:

[analysis]
condor_script_dir=<path to long-rna-seq-condor repository>
genome_dir=<path to where the root of where the indexes are extracted>
star_dir=<directory containing STAR executable>
rsem_dir=<directory containing rsem-calculate-expression>
georgi_dir=<directory containing "GeorgiScripts" python files>
ucsc_tools_dir=<directory containing at least bedGraphToBigWig from ucsc tools>

Processing

Running

First, the processing scripts need two files to be created, an experiments and libraries definition tables.

The experiment file contains:

experiment: Experiment Name
replicates: Comma seperated list of library ids.

The more library table contains

library_id: Library ID (used by experiment file to group related libraries)
genome: Genome version string. e.g. mm10, hg19
annotation: Annotation version string. e.g. M4 or V19
sex: Sex of genome. male, female, uknown
analysis_dir: Directory relative to the library table where the analysis files should be generated. It must exist before the script is run.
reference_prefix: (Optional) you can use this to override the reference prefix for bedgraph bigwig file generation. It defaults to 'chr' but you might want all of the references '-' or you may have a genome that uses 'Scaffold' for as its referece prefix.
read_1: Comma seperated list of unix filename globs specifying where to find the first read fastq files.
read_2: (optional) Comma seperated list of unix filename glbos specifying where to find the second read (mate pairs) fastq files.

Second, after the definition files are constructed you need to create the analysis directories. You can do that with this command. You need to change the -f 2 to be whatever column you used for analysis_dir. I usually put analysis_dir as the second column, so I used `-f 2.`:

tail -n +2 library.tsv  | cut -f 2 | xargs mkdir

Next you can generate the DagMan script to generate the result files with:

make_dag <list of library.tsv files> > <filename>.dagman
condor_submit_dag <filename>.dagman

TODO Currently the QC summary statistics and report generation are not integrated into the condor pipeline and need to be run manually

NOTE If condor_submit_dag fails it will generate a rescue file <filename>.dagman.rescue<number>. After investigating the log files to find the cause of the error you can do:

condor_submit_dag -autorescue 1 <filename>.dagman

to try to contine.

Fourth, to generate the HDF5 files containing the various pairwise correlation scores one needs to do:

madqc -l <library.tsv> -e <output_experiment_name> <list_of_library ids>

NOTE Yes. That is currently annoying, the make_dag.py is supposed to generate the commands, but it doesn't yet.

Fifth, after all of the experiment correlation scores are generated one can construct a summary report with:

qcreport -l <library.tsv> -e <experiment.tsv> > <html filename>

TODO Implement a way to specify where the Bokeh JavaScript and CSS is.

Lastly, you probably should delete any bam and bedgraph (.bg) files you are not planning on using.

Processing Phases

Steps for our processing pipeline:

align-star-se
sort-samtools
quant-rsem
index-samtools
qc-samstats
bedgraph-star
qc-distribution
qc-coverage
bedgraph2bigwig

Name		Name	Last commit message	Last commit date
Latest commit History 819 Commits
.github/workflows		.github/workflows
containers		containers
docs		docs
woldrnaseq		woldrnaseq
workflow		workflow
.gitattributes		.gitattributes
.gitignore		.gitignore
ChangeLog.rst		ChangeLog.rst
LICENSE.txt		LICENSE.txt
MANIFEST.in		MANIFEST.in
README.rst		README.rst
pyproject.toml		pyproject.toml
setup.cfg		setup.cfg
setup.py		setup.py
tox.ini		tox.ini
versioneer.py		versioneer.py

License

detrout/woldlab-rna-seq

Folders and files

Latest commit

History

Repository files navigation

Woldlab RNA-seq

Table of Contents

Introduction

Installation

Required Dependencies

Obtaining Indexes

Configuration

Processing

Running

Processing Phases

About

Resources

License

Stars

Watchers

Forks

Languages