This repository contains scripts for the initial processing of reads from jumping libraries constructed in the style of Talkowski et al.
See http://www.ncbi.nlm.nih.gov/pubmed/21473983 and http://www.ncbi.nlm.nih.gov/pubmed/24789519 for library details.
Required Python modules: genutils, fastqstats, Bio (Biopython)
Other required software: PEAR http://www.ncbi.nlm.nih.gov/pubmed/24142950 http://www.exelixis-lab.org/web/software/pear
The Talkowski jumping library method is based on circularization of long DNA fragments using a biotinylated linker with a pair of EcoP15I recognition sites. EcoP15I cuts 25/27 nucleotides away from its recognition site, resulting in double-stranded fragments that look like:
5' XXXXXXXCTGCTGTACCGTTCTCCGTACAGCAGXXXXXXXX 3'
3' XXXXXXXGACGACATGGCAAGAGGCATGTCGTCXXXXXXXX 5'
Here each run of X represents 27 nucleotides of DNA from one of the two opposite ends of the original fragment. With the 26 bp linker in the middle, we thus expect a fragment 27 + 26 + 27 = 80 bp long. These fragments are typically sequenced from both ends.
Here, we merge overlapping read pairs, search for the linker sequence (CTGCTGTACCGTTCTCCGTACAGCAG or its reverse complement, CTGCTGTACGGAGAACGGTACAGCAG), and write out the resulting paired-end sequences, taking care to reverse complement read 1 to match standard library orientation.
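The linker search step can be illustrated with a minimal sketch. This is not the code in process-jump-fastq.py: the names `split_on_linker`, `reverse_complement`, `LINKER`, and `FLANK_LEN` are hypothetical, and the sketch assumes exact linker matches only (no tolerance for sequencing errors) on a read pair that has already been merged.

```python
# Illustrative sketch only; names and logic are assumptions, not the
# repository's actual implementation.
LINKER = "CTGCTGTACCGTTCTCCGTACAGCAG"  # 26 bp linker, top strand
FLANK_LEN = 27                          # genomic bases expected on each side

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


# The merged read may come from either strand, so check both orientations.
LINKER_RC = reverse_complement(LINKER)


def split_on_linker(merged, flank_len=FLANK_LEN):
    """Return (left_flank, right_flank) around the first exact linker hit.

    Returns None if neither linker orientation is found, or if either
    flank would be empty.
    """
    for linker in (LINKER, LINKER_RC):
        pos = merged.find(linker)
        if pos == -1:
            continue
        end = pos + len(linker)
        left = merged[max(0, pos - flank_len):pos]
        right = merged[end:end + flank_len]
        if left and right:
            return left, right
    return None
```

For an ideal 80 bp merged fragment, `split_on_linker` would return the two 27 bp genomic tags. A real pipeline would likely also need to handle mismatches in the linker and flanks shorter than 27 bp.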
Example invocation:
python process-jump-fastq.py \
--r1fq miseq-runs/150601_M03079_0012_000000000-AG0UJ/Zoey_jump_R1.fastq.gz \
--r2fq miseq-runs/150601_M03079_0012_000000000-AG0UJ/Zoey_jump_R2.fastq.gz \
--sample Zoey_miseq_jump \
--outdir ../results/2015-06-05/