GitHub - km1500/ectools: tools for error correction and working with long read data

Branches Tags

Name		Name	Last commit message	Last commit date
Latest commit History 21 Commits
test		test
CA_Makefile		CA_Makefile
README		README
__init__.py		__init__.py
alignment_verify.py		alignment_verify.py
bases_outsideclr.py		bases_outsideclr.py
correct.sh		correct.sh
cov.py		cov.py
filter.py		filter.py
filter_cluster.py		filter_cluster.py
filter_out.py		filter_out.py
fix_delta_header.py		fix_delta_header.py
gc_count.py		gc_count.py
gccontent.py		gccontent.py
group_by.py		group_by.py
io.py		io.py
join.py		join.py
key2linerecord.sh		key2linerecord.sh
layout2frgid.sh		layout2frgid.sh
misc.py		misc.py
nucalign.py		nucalign.py
partition.py		partition.py
pb_correct.py		pb_correct.py
pb_filter_for_shred.py		pb_filter_for_shred.py
pb_sim.py		pb_sim.py
pb_uncovered.py		pb_uncovered.py
pre_delta_filter.py		pre_delta_filter.py
qualgen.py		qualgen.py
seqio.py		seqio.py
split_read_layout.py		split_read_layout.py

Repository files navigation

PB Correction and other PB tools

PBTOOLS_HOME=/path/to/this/directory

Running Correction.

1. Create Celera Unitigs from Illumina (or other high identity) data
   
   Output: organism.utg.fasta


2. Create a new directory to house the correction
   
   $> mkdir organism_correct

3. Simlink your organism.utg.fasta into the correction dir

   $> cd organism_correct 
   $> ln -s /path/to/organism.utg.fasta

3. Select only long Pacbio Reads Short (contained) reads tend to mess
   with the Celera unitigger so we generally discard them even before
   correction.  Generally we only correct reads >3kb. Target coverage
   for good assemblies tends to be ~20x with reads >3kb. If you don't
   have that kind of coverage you can try with reads >1kb.

4. Partition pacbio reads into small batches

   $> python ${PBTOOLS_HOME}/partition.py 20 500 pbreads.legnth_filtered.fa 
   
   Output: Series of directories and a ReadIndex.txt which maps 
   reads to partitions 
   
   *Note: You probably only want a few reads 
   per file (20 is usually good)

5. Copy correction script to working dir
   
   $> cp ${PBTOOLS_HOME}/correct.sh .

6. Modify the global variables a the top of correct.sh

7. Ensure Nucmer suite is in your PATH
   
   $> nucmer

8. Launch Correction The correct.sh script should be run in each of
   the partition directories. The number of files in directory should
   be specified as the array parameter to grid engine.  This simple
   for loop works well:
   
   $> for i in {0001..000N}; do cd $i; qsub -cwd -t 1:${NUM_FILES_PER_PARTITION} ../correct.sh; cd ..; done
   
   Where N is the number of partitions (directories) and
   NUM_FILES_PER_PARTITION is the number of files per partition
   specified in the partitions script (500 in this case)


9. Wait for correction to complete.  Concatenate all corrected output
   files from all parititons into a single fa file:
   
   $> cat ????/*.cor.fa > organism.cor.fa


10. Run Celera on the output file.

   *Notes: 
   Use convert-fasta-to-v2.pl to make celera frg file from
   organism.cor.fa 

   example:
   $> convert-fasta-to-v2.pl -l organism_pbcor -s organism.cor.fa \
      -q <( python ${PBTOOLS_HOME}/qualgen.py organism.cor.fa) \
      > organism.cor.frg
   


   Celera's default parameters work pretty well with
   corrected PB data.  

   Make sure you use OverlapBasedTrimming (it is on by default).