Manual for running IDP pipeline

IDP is a gene Isoform Detection and Prediction tool from Second Generation Sequencing and PacBio Sequencing (also called Hybrid Sequencing), developed by Prof. Kin Fai Au. It offers very reliable gene isoform identification with high sensitivity. This is a fork of the original IDP whose purpose is to help people run the software. The official distribution is available at: http://augroup.org/software.html

This manual covers more than IDP itself. It describes the whole IDP pipeline, which has four steps: (1) correct errors in long reads using short reads; (2) align the corrected long reads; (3) align the short reads; (4) run the IDP software.

In the following, I will show you how to run the IDP pipeline on example data. Clone this Git repository and work in its example directory, which contains the example data.
$ git clone https://github.com/Dingjie-Wang/Manual-for-running-IDP-pipeline.git
$ cd Manual-for-running-IDP-pipeline/example/
$ for f in data/*.tar.gz; do tar -zxvf "$f"; done
$ ls -lht data

1. Correct errors in long reads using short reads

The first step is to correct errors in the long reads using the short reads. We have included four tools that can accomplish this step: FMLRC, LoRDEC, LSC and ColoRMap. I recommend FMLRC and LoRDEC for their speed and comparable performance on larger datasets. For a comparative evaluation of hybrid error correction methods, see the paper: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-018-1605-z
(1) FMLRC is available at: https://github.com/holtjma/fmlrc
(2) LoRDEC is available at: http://www.atgc-montpellier.fr/lordec/
(3) LSC is available at: https://github.com/RuRuYa/LSC
(4) ColoRMap is available at: https://github.com/sfu-compbio/colormap

As an example, in the following, I will show how to run FMLRC software for error correction in long reads, which includes the following steps:

(1) Installation for FMLRC

To install FMLRC, first fetch the source code from the FMLRC Git repository.
$ git clone --recursive https://github.com/holtjma/fmlrc.git
Run the following commands to build it:
$ cd fmlrc
$ make
Then run the program with the "-h" option to verify that it installed.
$ ./fmlrc -h

(2) Building the short-read BWT by running ropebwt2

Prior to running FMLRC, a BWT of the short-read sequencing data needs to be constructed. Currently, the implementation expects it to be in the Run-Length Encoded (RLE) format of the msbwt python package. We recommend building the BWT using ropebwt2 (https://github.com/lh3/ropebwt2) by following the instructions on Converting to the fmlrc RLE-BWT format (https://github.com/holtjma/fmlrc/wiki/Converting-to-the-fmlrc-RLE-BWT-format). Alternatively, the msbwt package can directly build these BWTs (Constructing the BWT wiki: https://github.com/holtjma/msbwt/wiki/Constructing-the-MSBWT), but it may be slower and less memory efficient.

Run the following commands to install ropebwt2:
$ git clone https://github.com/lh3/ropebwt2.git
$ cd ropebwt2/
$ make

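Then run the following commands to build the short-read BWT. The awk filter assumes four-line FASTQ records; if sr.fa is two-line FASTA (as the -f flag passed to hisat2 in step 3 suggests), use 'NR % 2 == 0' instead so that no reads are skipped. The ropebwt2 and fmlrc-convert executables built above must be on your PATH (or be called by their full paths):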
$ awk 'NR % 4 == 2' sr.fa | sort | gzip > sr.sorted.txt.gz
$ gunzip -c sr.sorted.txt.gz | tr NT TN | ropebwt2 -LR | tr NT TN | fmlrc-convert comp_msbwt.npy
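
The awk filter keeps one sequence line per record, so the right modulus depends on how the input is laid out. As a minimal sketch (the function name and the toy records are hypothetical, and multi-line FASTA is not handled), the same selection and sort can be written in Python:

```python
def sorted_read_sequences(lines, lines_per_record):
    """Return the sorted sequence lines from fixed-size records.

    lines_per_record is 4 for FASTQ (header, sequence, '+', qualities)
    and 2 for single-line FASTA (header, sequence). The sequence is
    always the second line of a record, as in `awk 'NR % 4 == 2'`.
    """
    stripped = [line.rstrip("\n") for line in lines]
    # Index 1 within each record is the sequence line.
    sequences = stripped[1::lines_per_record]
    return sorted(sequences)


# Toy input, not taken from the example data.
fastq = ["@r1", "GGCA", "+", "IIII", "@r2", "ACGT", "+", "IIII"]
fasta = [">r1", "GGCA", ">r2", "ACGT"]

print(sorted_read_sequences(fastq, 4))
print(sorted_read_sequences(fasta, 2))
```

Calling the function with a record size of 4 on the two-line FASTA list returns only one of the two reads, which is the failure mode to watch for when the modulus and the file format disagree.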

(3) Correcting long reads by running FMLRC

Once a short-read BWT is constructed, the execution of FMLRC is relatively simple:
$ ./fmlrc comp_msbwt.npy lr.fa corrected_lr.fa
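
A quick sanity check after correction is to compare the number of records in the input and output files. The sketch below assumes the corrector writes one corrected record per input read (worth confirming for your FMLRC version); the function name and the toy records are hypothetical stand-ins for lr.fa and corrected_lr.fa.

```python
import io


def count_fasta_records(handle):
    """Count FASTA records by counting header lines that start with '>'."""
    return sum(1 for line in handle if line.startswith(">"))


# Toy stand-ins for lr.fa and corrected_lr.fa; in practice pass open(path).
raw = io.StringIO(">read1\nACGTTGCA\n>read2\nTTGACC\n")
corrected = io.StringIO(">read1\nACGTAGCA\n>read2\nTTGACC\n")

n_raw = count_fasta_records(raw)
n_corrected = count_fasta_records(corrected)
print(n_raw, n_corrected, n_raw == n_corrected)
```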

2. Align the corrected long reads

You could let IDP do this for you, but I caution against it. It's a slow process and the aligners can crash sometimes, so it's better to sort this out now rather than deal with it during the IDP run. Here, we align the corrected long reads using GMAP.

(1) Install GMAP software

Download GMAP software at:
http://research-pub.gene.com/gmap/src/gmap-gsnap-2018-07-04.tar.gz

Detailed description for installation is available at:
http://research-pub.gene.com/gmap/src/README

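On Linux, you can run the following commands (the archive is expected to unpack into a gmap-2018-07-04 directory; adjust the cd if yours differs):
$ wget http://research-pub.gene.com/gmap/src/gmap-gsnap-2018-07-04.tar.gz
$ tar -zxvf gmap-gsnap-2018-07-04.tar.gz
$ cd gmap-2018-07-04
$ ./configure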
$ make
$ make check
$ make install

(2) Build a gmap index

On Linux, you can build the gmap index with the following command:
$ gmap_build -D ./ -d gmapindex ./chr20.fa

(3) Align the corrected long reads

$ # Align the corrected long reads (corrected_lr.fa is the FMLRC output from step 1)
$ gmap -D ./ -d gmapindex -t 2 -f 1 -n 1 corrected_lr.fa > corrected_lr.psl
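
To get a feel for the resulting PSL file, it can help to compute how much of each read was aligned. The following is a minimal sketch that assumes the standard 21-column PSL layout; the function name and the example record are hypothetical, and any header lines must be skipped by the caller.

```python
def psl_query_coverage(psl_line):
    """Return (query_name, target_name, coverage) for one PSL data line.

    Coverage is (matches + misMatches + repMatches) / qSize, using the
    standard 21-column PSL layout.
    """
    fields = psl_line.rstrip("\n").split("\t")
    if len(fields) != 21:
        raise ValueError("expected 21 tab-separated PSL columns")
    matches, mismatches, rep_matches = (int(x) for x in fields[0:3])
    query_name, query_size = fields[9], int(fields[10])
    target_name = fields[13]
    aligned = matches + mismatches + rep_matches
    return query_name, target_name, aligned / query_size


# Hypothetical record: a 1000 bp read with 950 aligned bases in two blocks.
example = "\t".join([
    "940", "10", "0", "0",                # matches, misMatches, repMatches, nCount
    "0", "0", "1", "500",                 # qNumInsert, qBaseInsert, tNumInsert, tBaseInsert
    "+", "read1", "1000", "20", "970",    # strand, qName, qSize, qStart, qEnd
    "chr20", "2000000", "100000", "101450",  # tName, tSize, tStart, tEnd
    "2", "400,550,", "20,420,", "100000,100900,",  # blockCount, blockSizes, qStarts, tStarts
])
print(psl_query_coverage(example))
```

Reads with low coverage are candidates for closer inspection before they are passed on to IDP.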

3. Align the short reads

I will use hisat2 to align the short reads, but SpliceMap is also included if you want a more classic approach to the IDP pipeline. I recommend hisat2 for speed and stability, although it requires an additional processing step on our part.

(1) Install hisat2 software

Download hisat2 software at:
http://ccb.jhu.edu/software/hisat2/dl/hisat2-2.1.0-Linux_x86_64.zip

Detailed description for installation is available at:
https://ccb.jhu.edu/software/hisat2/index.shtml

On Linux, you can run the following commands:
$ wget http://ccb.jhu.edu/software/hisat2/dl/hisat2-2.1.0-Linux_x86_64.zip
$ unzip hisat2-2.1.0-Linux_x86_64.zip
$ cd hisat2-2.1.0

(2) Build a hisat2 index

On Linux, you can run the following commands (the output directory must exist before hisat2-build writes the index into it):
$ mkdir -p hisat2
$ hisat2-build chr20.fa hisat2/hisat2index

(3) Align the short reads

We can align the short reads with the following command:
$ hisat2 -x hisat2/hisat2index -U sr.fa -f -S sr.sam
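
Before moving on, it is worth checking how many reads actually aligned. The sketch below counts mapped reads from the SAM FLAG column; the function name and the toy records are hypothetical, and single-end input (as with -U above) is assumed.

```python
def count_mapped(sam_lines):
    """Return (mapped, total) over primary SAM records.

    Header lines start with '@'. FLAG bit 0x4 marks an unmapped read;
    bits 0x100 and 0x800 mark secondary and supplementary records, which
    are skipped so that each read is counted once.
    """
    mapped = total = 0
    for line in sam_lines:
        if line.startswith("@"):
            continue
        flag = int(line.split("\t")[1])
        if flag & 0x900:
            continue
        total += 1
        if not flag & 0x4:
            mapped += 1
    return mapped, total


# Toy records, truncated after the CIGAR column for brevity.
toy_sam = [
    "@SQ\tSN:chr20\tLN:1000",
    "r1\t0\tchr20\t101\t60\t50M",
    "r2\t16\tchr20\t301\t60\t20M100N30M",
    "r2\t272\tchr20\t701\t1\t50M",
    "r3\t4\t*\t0\t0\t*",
]
print(count_mapped(toy_sam))
```

For a real file, pass an open file handle for sr.sam instead of the toy list.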

(4) Get the SAM and BED file for SpliceMap format

Looks good! Unfortunately, IDP needs a different format from the garden-variety SAM that hisat2 produces. To accommodate this, we need to convert sr.sam into a SpliceMap-format SAM and also create a junction file like the one SpliceMap produces. We use helper scripts for this part.
$ # get SpliceMap format sam file (please install python-2.7 and R-3.5)
$ ./Au-public-master/iron/utilities/make_sam_splicemap_like.py sr.sam > sr_trim.sam
$ Rscript ./Au-public-master/iron/utilities/make_sam_splicemap_like.R sr_trim.sam sr.splicemap-like.sam
$ # get SpliceMap format bed file
$ ./Au-public-master/iron/utilities/sam_to_splicemap_junction_bed.py -o sr.splicemap-like.junctions.bed sr.sam chr20.fa
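
To illustrate where junctions come from, the sketch below derives intron intervals from the POS and CIGAR columns of a spliced alignment. It is not the implementation used by the helper scripts above, and it does not reproduce the SpliceMap BED layout; the function name is hypothetical.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def junctions_from_cigar(pos, cigar):
    """Return introns as 0-based, half-open (start, end) reference intervals.

    pos is the 1-based leftmost mapping position from the SAM POS column.
    """
    ref = pos - 1  # convert to a 0-based reference coordinate
    introns = []
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op == "N":
            introns.append((ref, ref + length))
        if op in "MDN=X":  # operations that consume reference bases
            ref += length
    return introns


print(junctions_from_cigar(301, "20M100N30M"))
```

Each N operation in the CIGAR string marks a skipped region on the reference, which is how a spliced aligner such as hisat2 reports an intron.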

4. Run IDP software

The psl option is the most convenient way to run IDP, since it allows you to do your own alignment ahead of time, as we have done here. To make this easier, the IDP/examples folder contains a configuration file that points to the files we've generated in this example. On a normal run you will create your own configuration file to describe the run. In this example we use absolute and fractional RPKM cutoffs rather than an FDR, because the FDR does not perform well on small datasets or non-model organisms.

Once you have prepared the input files and created your own configuration file, you can run IDP with the following command:
$ ./bin/runIDP.py run.cfg 0
All of the output from IDP is automatically copied to the “output” directory, which includes isoform.gpd, isoform_detection.gpd, isoform_prediction.gpd and isoform.exp files.
