
Flu analysis pipeline: i) Filtering of the sequencing reads by cutadapt and FastQC ii) Find the nearest sequence in NCBI database for each read, iii) Cluster and identify the viral species, iv) Generate consensus genomic sequence iteratively, v) Map the reads to the final consensus genome, vi) Identify SNVs and vii) Visualize the coverage of the…

masauer2/FluLINE

 
 


FluLINE (Influenza Analysis Pipeline)

Update 6 July 2018:

  • Added a custom database option: a user-defined FASTA file of sequences can be used as the database for querying.
  • Extended to analysis of non-segmented viruses.

FluLINE.py is a wrapper script for processing FASTQ sequencing files from IonTorrent or Illumina. The wrapper runs steps (i), (iv), (v), (vi) and (vii) described below; steps (ii) and (iii) are run separately.

The main steps in the pipeline are:

  • i) Filtering of the sequencing reads by cutadapt and FastQC
    -- Reads are filtered with a quality threshold of 20 and a minimum length of 50 bp.
    -- code = bin/run_QC.sh

  • ii) Find the nearest sequence in the NCBI database for each read
    -- Download the NCBI database locally (ftp://ftp.ncbi.nlm.nih.gov/blast/db/). If the nearest reference genome is unknown, use "bin/FindSpeciesInSample.py" to generate an XML file with the BLAST results of all sequence reads against the NCBI database.
    -- code = bin/FindSpeciesInSample.py [This is not included in the pipeline; run it separately]

  • iii) Cluster and identify the viral species
    -- MEGAN5 (http://ab.inf.uni-tuebingen.de/software/megan5/) can be used to view the XML file generated by the BLAST of each read against the NCBI database.

  • iv) Generate consensus genomic sequence iteratively
    -- The VIPR pipeline (https://github.com/CSB5/vipr) is used to build the consensus iteratively. The iterative mapping is done by BWA, and the consensus takes the most frequent nucleotide at each position.
    -- code = bin/GenerateConsensusGenome_withBlast.py

  • v) Map the reads to the final consensus genome
    -- Uses Bowtie2 to map reads to the consensus genome

  • vi) Identify SNVs
    -- Uses Lofreq2 (http://csb5.github.io/lofreq/) to identify the SNVs (a genome position should have at least 100 reads mapped).

  • vii) Visualize the coverage of the genome
    -- A Circos plot is used to visualize the different segments of influenza.
    -- code = bin/createGraphfiles-full.py
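
Steps (iv) and (vi) rest on two per-position rules: the consensus base is the most frequent nucleotide at a position, and SNVs are only considered where at least 100 reads are mapped. The sketch below illustrates those two rules on a toy pileup. It is not the VIPR or Lofreq2 code; the function names, the input layout (one string of observed bases per position), the "N" placeholder and the alphabetical tie-break are assumptions made for this illustration.

```python
from collections import Counter


def consensus_from_pileup(columns, min_depth=1, fill="N"):
    """Build a consensus string from per-position base observations.

    columns: list of strings, one per genome position, each holding the
    bases observed at that position (hypothetical input layout).
    """
    consensus = []
    for bases in columns:
        # Count only unambiguous nucleotides, ignoring case.
        counts = Counter(b for b in bases.upper() if b in "ACGT")
        depth = sum(counts.values())
        if depth == 0 or depth < min_depth:
            consensus.append(fill)
            continue
        # Most frequent base wins; ties resolve alphabetically (assumption).
        best = max(sorted(counts), key=lambda base: counts[base])
        consensus.append(best)
    return "".join(consensus)


def callable_positions(depths, min_depth=100):
    """Return 1-based positions whose read depth is at least min_depth."""
    return [i for i, d in enumerate(depths, start=1) if d >= min_depth]


# Toy example: four positions, then four per-position depths.
pileup = ["AAAC", "GGT", "", "ccCa"]
print(consensus_from_pileup(pileup))
print(callable_positions([250, 99, 100, 12]))
```

In the pipeline itself the per-position counts come from BWA alignments that are refined over several iterations; the toy pileup here only stands in for one such iteration.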

Dependencies

Picard
GATK
Samtools
Python 2.7

Some dependent software binaries (bwa, trim_galore, bedtools, circos, lofreq2) are in the src/ directory.
Please also install blastn and MEGAN5, and download the NCBI BLAST database (nt).
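
Before a run, it can help to confirm that the external command-line tools are reachable. The following is a minimal sketch, not part of FluLINE: it assumes Python 3.3+ (for `shutil.which`, whereas the pipeline itself lists Python 2.7), and the executable names in `REQUIRED_TOOLS` are assumptions about how the tools are installed. Picard and GATK are usually distributed as jar files and, like the binaries bundled in src/, would need a path check instead.

```python
import shutil

# Assumed executable names; adjust to match the local installation.
REQUIRED_TOOLS = ["samtools", "bwa", "bowtie2", "blastn", "cutadapt", "fastqc"]


def find_missing(tools, which=shutil.which):
    """Return the tools that cannot be found on the PATH.

    The lookup function is injectable so the check can be tested
    without the tools being installed.
    """
    return [tool for tool in tools if which(tool) is None]


missing = find_missing(REQUIRED_TOOLS)
if missing:
    print("Not found on PATH: " + ", ".join(missing))
else:
    print("All listed tools were found on PATH.")
```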

Usage

  • Prepare a Sample_info.csv file containing the columns: Sample ID, Sample Name, Sequencing Fastq name (partial) and Reference genome name.
  • Provide a Reference folder with the known or nearest reference genome as a fasta file: ReferenceGenomeName.fa
  • Edit the working directories and the locations of files and software in the FluLINE.py script.
  • Command: python ./FluLINE.py
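
As an illustration of how the sample sheet ties each sample to its reference fasta, here is a minimal sketch of reading it. This is not the parsing code in FluLINE.py: it assumes a header row followed by the four columns in the order listed above, a reference folder named "Reference", and Python 3. The function name, the dictionary keys and the example values are hypothetical.

```python
import csv
import io
import os


def read_sample_info(handle, reference_dir="Reference"):
    """Parse a Sample_info.csv-style sheet into a list of dictionaries."""
    reader = csv.reader(handle)
    next(reader, None)  # skip the header row (assumed to be present)
    samples = []
    for row in reader:
        # Ignore blank or incomplete rows.
        if len(row) < 4 or not row[0].strip():
            continue
        sample_id, name, fastq_prefix, ref_name = [c.strip() for c in row[:4]]
        samples.append({
            "sample_id": sample_id,
            "sample_name": name,
            "fastq_prefix": fastq_prefix,
            # The reference is expected as <Reference genome name>.fa
            "reference_fasta": os.path.join(reference_dir, ref_name + ".fa"),
        })
    return samples


# Made-up sheet contents, for illustration only.
example = io.StringIO(
    "Sample ID,Sample Name,Fastq,Reference\n"
    "S1,patient_A,runA_S1,H3N2_ref\n"
    "\n"
    "S2,patient_B,runA_S2,H1N1_ref\n"
)
for sample in read_sample_info(example):
    print(sample["sample_id"], sample["reference_fasta"])
```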
