UFITS

###USEARCH Fungal ITS Clustering:###

UFITS is a series of scripts to process fungal ITS amplicon data using USEARCH and VSEARCH, although it can also be used to process any NGS amplicon data and includes databases set up for analysis of fungal ITS, fungal LSU, bacterial 16S, and insect COI amplicons. It can handle Ion Torrent, MiSeq, and 454 data and is cross-platform (it works on Mac and Linux, and could work on Windows). As of UFITS v0.7.0, at least USEARCH v9.1.13 and VSEARCH v2.2.0 are required.


####Installation:####

####UFITS Wrapper script####

UFITS comes with a wrapper script for ease of use. On UNIX, you can call it by simply typing ufits, while on Windows you may need to type ufits.py (not needed if you add the .py extension to your PATHEXT).

$ ufits
Usage:       ufits <command> <arguments>
version:     0.7.0

Description: UFITS is a package of scripts to process NGS amplicon data.  It uses the UPARSE algorithm for clustering
             and thus USEARCH is a dependency.
    
Process:     ion         pre-process Ion Torrent data (find barcodes, remove primers, trim/pad)
             illumina    pre-process folder of de-multiplexed Illumina data (gunzip, merge PE, remove primers, trim/pad)
             illumina2   pre-process Illumina data from a single file (read structure: <barcode><f_primer>READ<r_primer>)
             454         pre-process Roche 454 (pyrosequencing) data (find barcodes, remove primers, trim/pad)
             show        show number of reads per barcode from de-multiplexed data
             select      select reads (samples) from de-multiplexed data
             remove      remove reads (samples) from de-multiplexed data
             sample      sub-sample (rarefy) de-multiplexed reads per sample
             
Clustering:  cluster     cluster OTUs (using UPARSE algorithm)
             dada2       run dada2 denoising algorithm, produces "inferred sequences" (requires R, dada2, ShortRead)
             unoise2     run UNOISE2 denoising algorithm
             cluster_ref closed/open reference based clustering (EXPERIMENTAL)

Utilities:   filter      OTU table filtering
             taxonomy    Assign taxonomy to OTUs
             summarize   Summarize Taxonomy (create OTU-like tables and/or stacked bar graphs for each level of taxonomy)
             funguild    Run FUNGuild (annotate OTUs with ecological information) 
             meta        pivot OTU table and append to meta data
             heatmap     Create heatmap from OTU table
             SRA         De-multiplex data and create meta data for NCBI SRA submission

Setup:       install     Download/install pre-formatted taxonomy DB (UNITE DB formatted for UFITS). Only need to run once.
             database    Format Reference Databases for Taxonomy

Calling one of these commands then brings up its help menu:

$ ufits cluster
Usage:       ufits cluster <arguments>
version:     0.7.0

Description: Script is a "wrapper" for the UPARSE algorithm. FASTQ quality trimming via expected
             errors and dereplication are run in VSEARCH if installed; otherwise they default to
             Python, which allows the use of datasets larger than 4 GB.
             Chimera filtering and UNOISE are also options.
    
Arguments:   -i, --fastq         Input FASTQ file (Required)
             -o, --out           Output base name. Default: out
             -e, --maxee         Expected error quality trimming. Default: 1.0
             -p, --pct_otu       OTU Clustering Radius (percent). Default: 97
             -m, --minsize       Minimum size to keep (singleton filter). Default: 2
             --uchime_ref        Run Ref Chimera filtering. Default: off [ITS, LSU, COI, 16S, custom path]
             --map_filtered      Map quality filtered reads back to OTUs. Default: off
             --unoise            Run De-noising pre-clustering (UNOISE). Default: off
             --debug             Keep intermediate files.
             -u, --usearch       USEARCH executable. Default: usearch9

####Installing Databases:####

UFITS is pre-configured to deal with amplicons from fungal ITS, fungal LSU, bacterial 16S, and insect COI. After installation of UFITS, you can download and install these databases using the ufits install command (to overwrite existing databases, use the --force option). To install all of the databases, you would type:

#install all databases
ufits install -i ITS LSU 16S COI

#install only ITS databases
ufits install -i ITS

The resulting databases are stored in the DB folder of the ufits directory. These data are used both for reference-based chimera filtering and for assigning taxonomy.

####What processing script do I use?####

You need to be familiar with your read structure! For example, do you have barcodes at the 5' end of your amplicons? Is the primer sequence in your reads, or has it been removed by the sequencing software? I prefer that data be minimally processed upstream - the barcode and primer matching in UFITS then acts largely as a quality-control filter. For example, when using Ion Torrent data (see instructions below) it is beneficial to turn off all filtering of the data on the server; by doing this you ensure that the reads are properly quality filtered and that barcodes/primers are used as an additional quality filter.

If you have PE MiSeq data, it typically comes already demultiplexed, i.e. a folder of PE reads for each sample (_R1.fastq.gz, _R2.fastq.gz) - these data can be used directly with the ufits illumina command. A somewhat common (although I think not ideal) approach is to use a custom sequencing primer with Illumina data, in which case the output from the sequencer is already demultiplexed and the primers are already stripped - here you can still use the ufits illumina command, but make sure to pass the --require_primer off option. I have seen several other formats as well - even Illumina data delivered as a single file, where the ufits illumina2 script is helpful. The bottom line is that if you don't know how your data is structured in terms of barcodes, primers, reads, paired-ends, etc., you need to find out before doing anything else.
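If you are unsure whether primers are still present in your reads, a quick check is to scan the first few thousand reads for the forward primer near the 5' end. Below is a minimal diagnostic sketch (not part of UFITS; the input file name is a placeholder and fITS7 is shown only as an example primer) that expands IUPAC ambiguity codes into a regex and counts hits:

```python
import gzip
import re

# IUPAC ambiguity codes expanded to regex character classes
IUPAC = {"R": "[AG]", "Y": "[CT]", "S": "[GC]", "W": "[AT]", "K": "[GT]",
         "M": "[AC]", "B": "[CGT]", "D": "[AGT]", "H": "[ACT]",
         "V": "[ACG]", "N": "[ACGT]"}

primer = "GTGARTCATCGAATCTTTG"  # fITS7; substitute your own forward primer
pattern = re.compile("".join(IUPAC.get(base, base) for base in primer))

def first_seqs(path, limit=1000):
    """Yield the sequence line of up to `limit` FASTQ records."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        for i, line in enumerate(handle):
            if i >= 4 * limit:
                break
            if i % 4 == 1:  # line 2 of every 4-line record is the sequence
                yield line.strip()

# "sample_R1.fastq.gz" is a hypothetical input file
hits = sum(1 for seq in first_seqs("sample_R1.fastq.gz")
           if pattern.search(seq[:40]))
print(f"{hits} of the first 1000 reads contain the forward primer")
```

If nearly all reads match, your primers are intact; if almost none do, they were likely stripped by the sequencing software.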

####Processing Ion Torrent Data:####

From the Torrent Server, analyze the data using the --disable-all-filters BaseCaller argument. This will leave the adapters/key/barcode sequence intact. You can download either the unaligned BAM file from the server or a FASTQ file. You can then de-multiplex the data as follows:

ufits ion -i data.bam -o data -f fITS7 -r ITS4
ufits ion --barcodes 1,5,24 -i data.fastq -o data -f ITS1-F -r ITS2
ufits ion --barcode_fasta my_barcodes.fa -i data.fastq -o data -f ITS1 -r ITS2
ufits ion -i data.bam -m mapping_file.txt -o data

This will find Ion barcodes (1, 5, and 24) and relabel the read headers with that information (barcodelabel=BC_5;). By default, it will look for all 96 Ion Xpress barcodes; specify the barcodes you used with a comma-separated list. You can also pass in a FASTA file containing your barcode sequences with properly labeled headers. Next the script will find and trim both the forward and reverse primers (default is the ITS2 region: fITS7 & ITS4), and then finally will trim or pad with N's to a set length (default: 250 bp). Trimming to the same length is critically important for USEARCH to cluster correctly, and padding with N's after finding the reverse primer keeps short ITS sequences from being discarded. These options can be customized using --fwd_primer, --rev_primer, --trim_len, etc.
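For illustration, here is a minimal sketch of the trim-or-pad logic described above (this is not the UFITS source; the quality character assigned to padded N's is an assumption):

```python
def trim_or_pad(seq, qual, length=250, pad_q="I"):
    """Truncate or N-pad a read (and its quality string) to exactly `length`."""
    if len(seq) >= length:
        return seq[:length], qual[:length]
    pad = length - len(seq)
    # the quality character used for the padded N's here is arbitrary
    return seq + "N" * pad, qual + pad_q * pad

seq, qual = trim_or_pad("ACGTACGT" * 20, "I" * 160)  # 160 bp read -> padded
assert len(seq) == len(qual) == 250
```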

####Processing Roche 454 Data:####

Data from 454 instruments has the same read structure as Ion Torrent data and thus can be processed very similarly. You just need to provide an SFF file, FASTA + QUAL files, or a FASTQ file, and you need to specify a multi-fasta file containing the barcodes used in the project. The data will then be processed in the same fashion as the Ion Torrent data (see above). For example:

#SFF input
ufits 454 -i data.sff --barcode_fasta my454barcodes.fa -o 454project

#FASTA/QUAL input
ufits 454 -i data.fa -q data.qual --barcode_fasta my454barcodes.fa -o 454project

####Processing Illumina MiSeq PE Data:####

Paired-end MiSeq data is typically delivered already de-multiplexed into separate read files that follow a defined naming structure from Illumina that looks like this:

<sample name>_<barcode sequence>_L<lane (0-padded to 3 digits)>_R<read number>_<set number (0-padded to 3 digits)>.fastq.gz
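As an illustration, a small sketch (not part of UFITS) that parses this naming convention; the example file name is hypothetical:

```python
import re

NAME = re.compile(
    r"(?P<sample>.+)_(?P<barcode>[ACGT+-]+)_L(?P<lane>\d{3})"
    r"_R(?P<read>[12])_(?P<set>\d{3})\.fastq\.gz$"
)

m = NAME.match("Sample1_ACGTACGT_L001_R1_001.fastq.gz")
if m:
    print(m.groupdict())
# {'sample': 'Sample1', 'barcode': 'ACGTACGT', 'lane': '001',
#  'read': '1', 'set': '001'}
```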

You can process a folder of Illumina data like this:

ufits illumina -i folder_name -o miseqData -f fITS7 -r ITS4
ufits illumina -i folder_name -o miseqData -m mapping_file.txt

This will find all files ending with '.fastq.gz' in the input folder, gunzip them, and then sequentially process the paired read files. First it runs usearch -fastq_mergepairs; however, since some ITS sequences are too long for the reads to overlap, you can rescue longer amplicons by recovering the non-merged forward reads with the --rescue_forward argument. Alternatively, you can use only the forward reads (R1) by passing the --reads forward argument. Next the forward and reverse primers are removed and the reads are trimmed/padded to a set length for clustering. Finally, the resulting FASTQ files for each of the processed samples are concatenated into a file called miseqData.demux.fq that will be used for the next clustering step. The script also outputs a text file called miseqData-filenames.txt containing tab-delimited sample names along with the [i5] and [i7] index sequences that were used, and it will produce a folder of the individual de-multiplexed files named from the -o, --out argument.
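To make the --rescue_forward idea concrete, here is a conceptual sketch (an illustration, not the UFITS implementation; file names are hypothetical): after merging, any forward read whose pair failed to merge is written out, so long amplicons that cannot overlap are not discarded.

```python
def fastq_records(path):
    """Yield (header, seq, plus, qual) tuples from a FASTQ file."""
    with open(path) as fh:
        while True:
            header = fh.readline().rstrip()
            if not header:
                return
            seq = fh.readline().rstrip()
            plus = fh.readline().rstrip()
            qual = fh.readline().rstrip()
            yield header, seq, plus, qual

# IDs of reads that merged successfully
merged_ids = {rec[0].split()[0] for rec in fastq_records("merged.fq")}

with open("rescued.fq", "w") as out:
    for header, seq, plus, qual in fastq_records("trimmed_R1.fq"):
        if header.split()[0] not in merged_ids:  # this pair failed to merge
            out.write("\n".join((header, seq, plus, qual)) + "\n")
```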

####OTU Clustering:####

The next step is to run ufits cluster, which expects de-multiplexed FASTQ data as a single file with ;barcodelabel=Sample_name; in each FASTQ header. If your data is in some other format, you can use your own UNIX/Perl/Python script to add the barcodelabel= annotation to each read (see the sketch below) and then cluster with UFITS. Note that reads should be globally trimmed; the pre-processing steps in UFITS ensure that high quality data makes it into the clustering algorithm with minimal sequence loss. Data from any platform (Ion, 454, or Illumina) can now be clustered by running the following:

ufits cluster -i ufits.demux.fq -o ion_output

This script will quality filter the data based on expected errors, remove duplicated sequences (dereplication), sort the output by abundance, and finally cluster with the usearch -cluster_otus command. You can optionally run UCHIME reference chimera filtering by adding the --uchime_ref ITS option, or change the default clustering radius (97%) with the --pct_otu option. Type -h to see all available options.
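If your data was demultiplexed outside of UFITS and lacks the barcodelabel annotation, a minimal relabeling sketch (the sample and file names are placeholders) could look like this:

```python
sample = "Sample_1"  # hypothetical sample name

with open("input.fq") as fh, open("relabeled.fq", "w") as out:
    for i, line in enumerate(fh):
        if i % 4 == 0:  # header line of each 4-line FASTQ record
            # keep only the read ID so the label is not cut off at whitespace
            line = line.split()[0] + ";barcodelabel=" + sample + ";\n"
        out.write(line)
```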

####DADA2 "Clustering":####

Recently a new "OTU picking" algorithm for amplicon datasets called DADA2 has appeared that has sensitivity down to single base pairs (see the DADA2 publication and GitHub repository). The algorithm uses a statistical model to infer the original sequence that each read was derived from, forgoing the need to cluster at a set threshold (i.e. 97%). I've implemented a modified DADA2 pipeline here to work with the current UFITS data structure. A reminder: reads for DADA2 must contain no N's and must all be trimmed to an identical length, so variable-length amplicons will be truncated. This method is therefore perhaps better suited to amplicons like COI or 16S. You can run it as follows:

ufits dada2 -i ufits.demux.fq -o dada2_output -l 200

The script will quality filter your data, trim it for use in DADA2, run the DADA2 algorithm, and then parse the results to output an OTU table and a file containing the inferred sequences (OTUs) in FASTA format. These files can be used in all downstream UFITS scripts, i.e. ufits filter and ufits taxonomy.
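To make the input requirements concrete, here is a sketch of such a pre-filter (illustrative only; UFITS handles this internally), matching the -l 200 example above:

```python
def dada2_prefilter(records, trunc_len=200):
    """Yield (header, seq, qual) with no N's, all truncated to `trunc_len`."""
    for header, seq, qual in records:
        if len(seq) < trunc_len:
            continue  # too short to reach the fixed length
        seq, qual = seq[:trunc_len], qual[:trunc_len]
        if "N" in seq:
            continue  # DADA2 requires reads free of ambiguous bases
        yield header, seq, qual
```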

####OTU Table Filtering####

The data may need some additional filtering if you included a spike-in mock community control. The advantage of a mock community is that you know what should be in that barcode sample, so you can adjust the USEARCH clustering parameters until they give you reasonable results. If you need to trim your OTU table by some threshold - e.g. several low-abundance OTUs showing up in your spike-in control sample that represent contamination or sequencing error - you can set a threshold and filter the OTU table. This is done with the following script:

ufits filter -i test.otu_table.txt -f test.final.otus.fa -b mock3 --mc my_mock_seqs.fa

This will read the OTU table (-i) and the OTUs (-f) from the ufits cluster command. The script applies an index-bleed filter to clean up barcode-switching between samples, which happens at a rate of ~0.2% in Ion Torrent data and as much as 0.3% in MiSeq data. It first normalizes the OTU table to the number of reads in each sample, then (optionally) uses the -b sample to calculate the amount of index-bleed in the OTU table, and finally loops through each OTU and sets to 0 any values that fall below the index-bleed threshold. The script then removes the mock spike-in control sample from your dataset, as it should not be included in downstream processing; you can keep the mock sequences if desired by passing the --keep_mock argument. The output is a filtered OTU table for downstream processing.
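For intuition, here is a simplified pandas sketch of the steps just described (this is not the UFITS code; the mock sample name and expected mock OTU IDs are placeholders, and estimating bleed as the largest relative abundance of an unexpected OTU inside the mock sample is a simplifying assumption):

```python
import pandas as pd

# OTU table: OTUs as rows, samples as columns (an assumed orientation)
otu = pd.read_csv("test.otu_table.txt", sep="\t", index_col=0)
norm = otu / otu.sum(axis=0)           # per-sample relative abundance

mock = "mock3"                         # spike-in control sample
expected = {"OTU_1", "OTU_2"}          # OTUs truly in the mock (placeholder)
bleed = norm.loc[~norm.index.isin(expected), mock].max()

filtered = otu.where(norm >= bleed, 0) # zero entries below the threshold
filtered = filtered.drop(columns=mock) # drop the mock sample
filtered.to_csv("test.filtered.otu_table.txt", sep="\t")
```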

If you do not have a mock community spike-in, you can still run the index-bleed filter (and you probably should, as nearly all NGS data has some degree of barcode switching or index-bleed) by running the command without a -b argument, which applies a default 0.5% filter to the data. Passing the -p 0.005 argument overrides the calculated index-bleed, for example:

ufits filter -i test.otu_table.txt -f test.final.otus.fa -p 0.005

####Assign Taxonomy:####

You can assign taxonomy to your OTUs using UFITS, either using UTAX from USEARCH8.1 or using usearch_global. The databases require some initial setup before you can use the ufits taxonomy command.

Issuing the ufits taxonomy command will inform you which databases have been properly configured as well as usage instructions:

$ ufits taxonomy
Usage:       ufits taxonomy <arguments>
version:     0.6.0

Description: Script maps OTUs to taxonomy information and can append to an OTU table (optional).  By default the script
             uses a hybrid approach, e.g. gets taxonomy information from SINTAX, UTAX, and global alignment hits from the larger
             UNITE-INSD database, and then parses results to extract the most taxonomy information that it can at 
             'trustable' levels. SINTAX/UTAX results are used if BLAST-like search pct identity is less than 97%.  
             If % identity is greater than 97%, the result with most taxonomy levels is retained.
    
Arguments:   -f, --fasta         Input FASTA file (i.e. OTUs from ufits cluster) (Required)
             -i, --otu_table     Input OTU table file (i.e. otu_table from ufits cluster)
             -o, --out           Base name for output file. Default: ufits-taxonomy.<method>.txt
             -d, --db            Select Pre-installed database [ITS1, ITS2, ITS, 16S, LSU, COI]. Default: ITS2
             -m, --mapping_file  QIIME-like mapping file
             --method            Taxonomy method. Default: hybrid [utax, sintax, usearch, hybrid, rdp, blast]
             --fasta_db          Alternative database of fasta sequences to use for global alignment.
             --utax_db           UTAX formatted database. Default: ITS2.udb [See configured DB's below]
             --utax_cutoff       UTAX confidence value threshold. Default: 0.8 [0 to 0.9]
             --usearch_db        USEARCH formatted database. Default: USEARCH.udb
             --usearch_cutoff    USEARCH threshold percent identity. Default 0.7
             --sintax_cutoff     SINTAX confidence value threshold. Default: 0.8 [0 to 0.9]
             -r, --rdp           Path to RDP Classifier. Required if --method rdp
             --rdp_db            RDP Classifier DB set. [fungalits_unite, fungalits_warcup, fungallsu, 16srrna]
             --rdp_cutoff        RDP Classifier confidence value threshold. Default: 0.8 [0 to 1.0]
             --local_blast       Local BLAST database (full path). Default: NCBI remote nt database
             --tax_filter        Remove OTUs from OTU table that do not match filter, i.e. Fungi to keep only fungi.
             -u, --usearch       USEARCH executable. Default: usearch9

And then you can use the ufits taxonomy command to assign taxonomy to your OTUs as well as append them to your OTU table as follows:

#use of hybrid taxonomy approach
ufits taxonomy -f data.filtered.otus.fa -o output -i data.final.csv -m mapping_file.txt -d ITS2

#filter data to only include OTUs identified to Fungi
ufits taxonomy -f data.filtered.otus.fa -o output -i data.final.csv -m mapping_file.txt -d ITS2 --tax_filter Fungi

#use RDP classifier
ufits taxonomy -f data.filtered.otus.fa -o output -i data.final.csv -m mapping_file.txt --method rdp --rdp_db fungalits_unite --rdp /path/to/classifier.jar
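For intuition, the hybrid selection rule described in the help text above can be sketched as follows (a simplification, not the UFITS implementation):

```python
def hybrid_pick(global_hit, classifier_hit, pct_id, cutoff=97.0):
    """Each *_hit is a list of taxonomy levels, e.g. [k, p, c, o, f, g, s]."""
    if pct_id < cutoff:
        return classifier_hit      # trust SINTAX/UTAX below the cutoff
    # at or above the cutoff, keep whichever result resolves more levels
    return max(global_hit, classifier_hit, key=len)

print(hybrid_pick(["Fungi", "Ascomycota"], ["Fungi"], 98.5))
# ['Fungi', 'Ascomycota']
```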

####Summarizing the Taxonomy:####

After taxonomy is appended to your OTU table, you can then generate OTU-like tables for each of your samples at all of the levels of taxonomy (i.e. Kingdom, Phylum, Class, Order, Family, Genus). At the same time, you can create a QIIME-like stacked bar graph from these data.

ufits summarize -i data.taxonomy.otu_table.txt -o data-summary

The optional --graphs argument will create the stacked bar graphs. You can save them in a variety of formats, as well as convert the results to percent of total with the --percent argument.
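The --percent conversion amounts to scaling each sample's counts to its column total; a pandas sketch (the file name and table orientation are assumptions):

```python
import pandas as pd

# hypothetical output from ufits summarize; samples as columns
counts = pd.read_csv("data-summary.genus.txt", sep="\t", index_col=0)
percent = 100 * counts / counts.sum(axis=0)  # percent of each sample's total
print(percent.round(2).head())
```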

####Downstream Processing:####

As of ufits v0.6.0, the ufits taxonomy command will create a BIOM file containing taxonomy that is compatible with QIIME, PHINCH, PhyloSeq, and MetaCoMET. The script will also output a phylogenetic tree built from your OTUs, which is required for some downstream analyses. Moreover, you can pass the -m, --mapping_file option to ufits taxonomy and all of its columns will be incorporated as sample metadata. A mapping file is created automatically for you in the pre-processing steps of ufits, such as ufits ion and ufits illumina. You can easily add your metadata to this file using something like Excel and then save it as a tab-delimited text file.

QIIME and PhyloSeq import instructions

####Dependencies####

  • Python 2
  • Biopython
  • USEARCH8 (to use UTAX you will need at least version 8.1.1756)
  • natsort
  • pandas
  • numpy
  • matplotlib
  • psutil
  • bedtools (only needed if using Ion Torrent BAM file as input)
  • vsearch (version > 1.9.0; optional, but it increases the speed of UFITS and is required for very large datasets. Installed by default via the homebrew installation; newer tools require vsearch.)
  • biom-format (to create biom OTU table)
  • h5py (for biom)
  • R (dada2)
  • dada2, ShortRead (these will be automatically installed on first usage of ufits dada2)

Python and USEARCH need to be accessible in your PATH; alternatively, you can pass -u /path/to/usearch8 to scripts requiring USEARCH.

In order to draw a heatmap or stacked bar graph using ufits heatmap or ufits summarize, you will need the following Python libraries installed: matplotlib, pandas, numpy. They can be installed with pip, i.e. pip install matplotlib pandas numpy.
