FluLINE (Influenza Analysis Pipeline)

Update 6 July 2018:

Added a custom database option: a user-defined FASTA file of sequences can be used as the database for querying.
Extended to non-segmented virus analysis.

FluLINE.py is a wrapper script for processing fastq sequencing files from IonTorrent or Illumina. The wrapper runs steps (i), (iv), (v), (vi) and (vii) described below; steps (ii) and (iii) are run separately.

The main steps in the pipeline are:

i) Filter the sequencing reads with cutadapt and FastQC
-- Quality filter with a quality threshold of 20 and a minimum read length of 50 bp.
-- code = /bin/run_QC.sh
ii) Find the nearest sequence in the NCBI database for each read
-- Download the NCBI database locally (ftp://ftp.ncbi.nlm.nih.gov/blast/db/). If the nearest reference genome is unknown, use "/bin/FindSpeciesInSample.py" to generate an XML file with the BLAST results of all sequence reads against the NCBI database
-- code = bin/FindSpeciesInSample.py [This is not included in the pipeline; run it separately]
iii) Cluster and identify the viral species
-- MEGAN5 (http://ab.inf.uni-tuebingen.de/software/megan5/) can be used to view the XML file generated by the BLAST of each read against the NCBI database
iv) Generate consensus genomic sequence iteratively
-- The VIPR pipeline (https://github.com/CSB5/vipr) is used to build the consensus iteratively. The iterative mapping is done by BWA, and the consensus base at each position is the most frequently occurring nucleotide at that position
-- code = bin/GenerateConsensusGenome_withBlast.py
v) Map the reads to the final consensus genome
-- Uses Bowtie2 to map reads to the consensus genome
vi) Identify SNVs
-- Uses Lofreq2 (http://csb5.github.io/lofreq/) to identify the SNVs (genome positions should have at least 100 reads mapped)
vii) Visualize the coverage of the genome
-- A Circos plot is used to visualize the coverage of the different influenza segments
-- code = bin/createGraphfiles-full.py
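
The consensus rule in step (iv), taking the most frequent nucleotide at each position, can be illustrated with a minimal Python sketch. This is not the VIPR or FluLINE implementation: it assumes the reads are already aligned to the same coordinates and padded to equal length with "-" for uncovered positions. The function name majority_consensus and the parameters min_depth, gap_char and fill are hypothetical and exist only for illustration.

```python
from collections import Counter


def majority_consensus(aligned_reads, min_depth=1, gap_char="-", fill="N"):
    """Build a consensus string from equal-length, pre-aligned reads.

    Hypothetical helper for illustration only; the real pipeline derives
    the consensus from BWA alignments via VIPR.
    """
    if not aligned_reads:
        return ""
    length = len(aligned_reads[0])
    if any(len(read) != length for read in aligned_reads):
        raise ValueError("all aligned reads must have the same length")

    consensus = []
    # zip(*reads) walks the alignment column by column.
    for column in zip(*aligned_reads):
        # Count only real bases; gap characters mark uncovered positions.
        counts = Counter(base for base in column if base != gap_char)
        depth = sum(counts.values())
        if depth < min_depth:
            # Too few reads cover this position to call a base.
            consensus.append(fill)
        else:
            # Most frequent base wins; ties are not resolved in any
            # biologically meaningful way in this sketch.
            consensus.append(counts.most_common(1)[0][0])
    return "".join(consensus)


if __name__ == "__main__":
    reads = ["ACGT-", "ACGA-", "AGGTC", "-CGTC"]
    print(majority_consensus(reads))
    print(majority_consensus(reads, min_depth=3))
```

In the iterative scheme described above, a consensus produced this way would serve as the reference for the next round of mapping until it stops changing.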

Dependencies

Picard
GATK
Samtools
Python 2.7

Some dependent software binaries (bwa, trim_galore, bedtools, circos, lofreq2) are in the /src/ directory.
Please also install blastn and MEGAN5, and download the NCBI BLAST database (nt).
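
Before running the pipeline it can help to confirm that the command-line tools are reachable. The Python 3 sketch below checks a list of executable names against the PATH. The names in REQUIRED_TOOLS are assumptions about how the tools are installed locally (for example, the Lofreq2 executable is assumed to be called "lofreq"), and binaries shipped in /src/ will only be found if that directory is on the PATH. Picard and GATK are usually distributed as jar files and are not covered here.

```python
import shutil

# Executable names are assumptions; adjust them to the local installation.
REQUIRED_TOOLS = [
    "bwa", "bowtie2", "samtools", "blastn", "lofreq",
    "cutadapt", "fastqc", "circos", "bedtools", "trim_galore",
]


def find_missing_tools(tools, which=shutil.which):
    """Return the tools that `which` cannot locate.

    `which` defaults to shutil.which (a PATH lookup) and can be replaced
    by another lookup function, for example in tests.
    """
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = find_missing_tools(REQUIRED_TOOLS)
    if missing:
        print("Missing tools: " + ", ".join(missing))
    else:
        print("All listed tools were found on PATH.")
```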

Usage

Prepare a Sample_info.csv file containing the columns: Sample ID, Sample Name, Sequencing Fastq name (partial) and Reference genome Name.
Provide a Reference folder containing the known or nearest reference genome as a FASTA file named ReferenceGenomeName.fa.
Edit the working directories and the locations of files and software in the FluLINE.py script.
Command: python ./FluLINE.py
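
To show how the inputs fit together, here is a minimal Python sketch that reads the four Sample_info.csv columns and derives the expected reference FASTA path for each sample. It is not taken from FluLINE.py. It assumes the file has a header row, the columns appear in the order listed above, and the reference folder is named "Reference". The names Sample, read_sample_info and reference_path are hypothetical.

```python
import csv
import os
from collections import namedtuple

# Field names are illustrative; they mirror the four columns described above.
Sample = namedtuple(
    "Sample", ["sample_id", "sample_name", "fastq_name", "reference_name"]
)


def read_sample_info(lines, has_header=True):
    """Parse Sample_info.csv rows from an iterable of lines (e.g. an open file)."""
    reader = csv.reader(lines)
    if has_header:
        next(reader, None)  # skip the assumed header row
    samples = []
    for row in reader:
        if not any(cell.strip() for cell in row):
            continue  # ignore blank lines
        if len(row) < 4:
            raise ValueError("expected 4 columns, got %d: %r" % (len(row), row))
        samples.append(Sample(*[cell.strip() for cell in row[:4]]))
    return samples


def reference_path(sample, reference_dir="Reference"):
    """Expected FASTA path: <reference_dir>/<Reference genome Name>.fa"""
    return os.path.join(reference_dir, sample.reference_name + ".fa")


if __name__ == "__main__":
    # Made-up example rows; real sample names will differ.
    example = [
        "Sample ID,Sample Name,Fastq name,Reference genome Name",
        "S1,Flu_A_01,IonXpress_001,H3N2_ref",
        "S2,Flu_B_02,IonXpress_002,FluB_ref",
    ]
    for sample in read_sample_info(example):
        print(sample.sample_name, reference_path(sample))
```

With a real file, pass an open file handle instead of the example list, for example `read_sample_info(open("Sample_info.csv"))`.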
