AMPtk is a series of scripts to process NGS amplicon data using USEARCH and VSEARCH, it can also be used to process any NGS amplicon data and includes databases setup for analysis of fungal ITS, fungal LSU, bacterial 16S, and insect COI amplicons. It can handle Ion Torrent, MiSeq, and 454 data. At least USEARCH v9.1.13 and VSEARCH v2.2.0 are required as of AMPtk v0.7.0.
- Mac install instructions
- Linux install instructions
- Windows install instuructions - Note use on Windows is not recommended.
AMPtk comes with a wrapper script for ease of use. On UNIX, you can call it by simply typing amptk
, while on windows you may need to type amptk.py
(not needed if you add the .py
extension in your PATHEXT, directions here).
$ amptk
Usage: amptk <command> <arguments>
version: 0.8.9
Description: AMPtk is a package of scripts to process NGS amplicon data.
Dependencies: USEARCH v9.1.13 and VSEARCH v2.2.0
Process: ion pre-process Ion Torrent data
illumina pre-process folder of de-multiplexed Illumina data
illumina2 pre-process PE Illumina data from a single file
454 pre-process Roche 454 (pyrosequencing) data
SRA pre-process singe FASTQ per sample data (i.e. SRA data)
Clustering: cluster cluster OTUs (using UPARSE algorithm)
dada2 dada2 denoising algorithm (requires R, dada2, ShortRead)
unoise2 UNOISE2 denoising algorithm
cluster_ref closed/open reference based clustering (EXPERIMENTAL)
Utilities: filter OTU table filtering
taxonomy Assign taxonomy to OTUs
show show number or reads per barcode from de-multiplexed data
select select reads (samples) from de-multiplexed data
remove remove reads (samples) from de-multiplexed data
sample sub-sample (rarify) de-multiplexed reads per sample
drop Drop OTUs from dataset
summarize Summarize Taxonomy (create OTU-like tables and/or stacked bar graphs)
funguild Run FUNGuild (annotate OTUs with ecological information)
meta pivot OTU table and append to meta data
heatmap Create heatmap from OTU table
SRA-submit De-multiplex data and create meta data for NCBI SRA submission
Setup: install Download/install pre-formatted taxonomy DB. Only need to run once.
database Format Reference Databases for Taxonomy
primers List primers hard-coded in AMPtk. Can use in pre-processing steps.
And then by calling one of the commands, you get a help menu for each:
$ amptk cluster
Usage: amptk cluster <arguments>
version: 0.8.0
Description: Script is a "wrapper" for the UPARSE algorithm. FASTQ quality trimming via expected
errors and Dereplication are run in vsearch if installed otherwise defaults to Python
which allows for the use of datasets larger than 4GB.
Chimera filtering and UNOISE are also options.
Arguments: -i, --fastq Input FASTQ file (Required)
-o, --out Output base name. Default: out
-e, --maxee Expected error quality trimming. Default: 1.0
-p, --pct_otu OTU Clustering Radius (percent). Default: 97
-m, --minsize Minimum size to keep (singleton filter). Default: 2
--uchime_ref Run Ref Chimera filtering. Default: off [ITS, LSU, COI, 16S, custom path]
--map_filtered Map quality filtered reads back to OTUs. Default: off
--unoise Run De-noising pre-clustering (UNOISE). Default: off
--debug Keep intermediate files.
-u, --usearch USEARCH executable. Default: usearch9
AMPtk is pre-configured to deal with amplicons from fungal ITS, fungal LSU, bacterial 16S, and insect COI. You can see how the databases were constructed here. After installation of AMPtk, you can download and install these databases using the amptk install
command (to overwrite existing databases, use the --force
option. To install all of the databases, you would type:
#install all databases
amptk install -i ITS LSU 16S COI
#install only ITS databases
amptk install -i ITS
The resulting databases are stored in the DB
folder of the amptk
directory. These data are used for both Chimera reference filtering and for assigning taxonomy.
You need to be familiar with your read structure! For example, do you have barcodes at the 5' end of your amplicons? Is the primer sequence in your reads or has it been removed by the sequencing software? I prefer that data be minimally processed - this acts largely as a quality control filter. For example when using Ion Torrent data (see instructions below) it is beneficial to turn off all filtering of the data on the server, by doing this you ensure that the reads are properly quality filtered and barcodes/primers are used as an additional quality filter. If you have PE MiSeq data, typically it comes in an already demultiplexed format, i.e. a folder of PE reads for each sample (_R1.fastq.gz, _R2.fastq.gz) - these data can be used directly with the amptk illumina
command. A somewhat common (although I think not ideal) approach is to use a custom sequencing primer with Illumina data - and then the output from the sequencer is already demulitplexed and primers are already stripped - in this case, you can use the amptk illumina
command, but make sure to pass the --require_primer off
option. However, I have seen several other formats of data as well - even with Illumina data being in a single file, where then the amptk illumina2
script is helpful. Bottom line is if you don't know how your data is structured in terms of barcodes, primers, reads, paired-ends, etc - you need to find out before doing anything else....
From the Torrent Server, analyze the data using the --disable-all-filters
BaseCaller argument. This will leave the adapters/key/barcode sequence intact. You can download either the unaligned BAM file from the server or a FASTQ file. You can then de-multiplex the data as follows:
amptk ion -i data.bam -o data -f fITS7 -r ITS4
amptk ion --barcodes 1,5,24 -i data.fastq -o data -f ITS1-F -r ITS2
amptk ion --barcode_fasta my_barcodes.fa -i data.fastq -o data -f ITS1 -r ITS2
amptk ion -i data.bam -m mapping_file.txt -o data
This will find Ion barcodes (1, 5, and 24) and relabel header with that information (barcodelabel=BC_5;). By default, it will look for all 96 Ion Xpress barcodes, specifiy the barcodes you used by a comma separated list. You can also pass in a fasta file containing your barcode sequences with properly labeled headers. Next the script will find and trim both the forward and reverse primers (default is ITS2 region: fITS7 & ITS4), and then finally will trim or pad with N's to a set length (default: 250 bp). Trimming to the same length is critcally important for USEARCH to cluster correctly, padding with N's after finding the reverse primer keeps short ITS sequences from being discarded. These options can be customized using: --fwd_primer
, --rev_primer
, --trim_len
, etc.
Data from 454 instruments has the same read structure as Ion Torrent: Read and thus can be processed very similarly. You just need to provide either an SFF file, FASTA + QUAL, or FASTQ files and then you need to specify a multi-fasta file containing the barcodes used in the project. The data will be processed in the same fashion (see above) as the Ion Torrent Data. For example:
#SFF input
amptk 454 -i data.sff --barcode_fasta my454barcodes.fa -o 454project
#FASTA/QUAL input
amptk 454 -i data.fa -q data.qual --barcode_fasta my454barcodes.fa -o 454project
Paired-end MiSeq data is typically delivered already de-multiplexed into separate read files, that have a defined naming structure from Illumina that looks like this:
<sample name>_<barcode sequence>_L<lane (0-padded to 3 digits)>_R<read number>_<set number (0-padded to 3 digits>.fastq.gz
You can processes a folder of Illumina data like this:
amptk illumina -i folder_name -o miseqData -f fITS7 -r ITS4
amptk illumina -i folder_name -o miseqData -m mapping_file.txt
This will find all files ending with '.fastq.gz' in the input folder, gunzip the files, and then sequentially process the paired read files. First it will run USEARCH8 -fastq_mergepairs
, however, since some ITS sequences are too long to overlap you can rescue longer sequences by recovering the the non-merged forward reads by passing the --rescue_forward
argument. Alternatively, you can only utilize the forward reads (R1), by passing the --reads forward
argument. Next the forward and reverse primers are removed and the reads are trimmed/padded to a set length of clustering. Finally, the resulting FASTQ files for each of the processed samples are concatenated together into a file called miseqData.demux.fq
that will be used for the next clustering step. The script will also output a text file called miseqData-filenames.txt
that contains a tab-delimited output of the sample name as well as [i5] and [i7] index sequences that were used. The script will produce a folder containing the individual de-multiplexed files named from the -o, --out
argment.
The next step is to run amptk cluster
, which expects de-multiplexed FASTQ data as a single file with ;barcodelabel=Sample_name
in the FASTQ header. If your data is in some other format, you can use other UNIX/Perl/Python scripts to add the barcodelabel=
to each read and then cluster your data using AMPtk. Note that reads should be globally trimmed and the pre-processing steps in AMPtk take steps to ensure high quality data makes it into the clustering algorithm with minimal sequence loss. Now the data from either platform (Ion, 454, or Illumina) can be clustered by running the following:
amptk cluster -i amptk.demux.fq -o ion_output
This script wil quality filter the data based on expected errors, then remove duplicated sequences (dereplication), sort the output by abundance, and finally cluster using usearch -cluster_otus
command. You can also optionally run UCHIME Reference filtering by adding the --uchime_ref ITS
option or change the default clustering radius (97%) by passing the --pct_otu
option. Type -h
for all the available options.
Recently there is a new "OTU picking" algorithm for amplicon based datasets called DADA2 that has sensitivity down to single base pairs, see publication here, GitHub here. This algorithm uses a statistical method to infer the original sequence that a read was derived from, foregoing the need to cluster at a set threshold (i.e. 97%). I've implemented a modified DADA2 pipeline here to work with the current AMPtk data structure. A reminder is that reads for DADA2 must have no N's and have to all length trimmed identically, thus variable length amplicons will be truncated down. Thus this method is perhaps more suited to something like COI or 16S amplicons. You can run it as follows:
amptk dada2 -i amptk.demux.fq -o dada2_output -l 200
The script will quality filter your data, trim for use in DADA2, run DADA2 alogrithm, and then parse the results to output an OTU table and a file containing inferred sequences (OTUs) in fasta format. These files can be used in all downstream AMPtk scripts, i.e. amptk filter
and amptk taxonomy
.
The data may need some additional filtering if you included a spike-in control mock community. The advantage is that you know what should be in the spike-in control barcode sample, thus you can modify USEARCH8 clustering parameters that give you reasonable results. If you need to trim your OTU table by some threshold, i.e. several OTUs at low abundance are showing up in your spike-in control sample that represent contamination or sequence error - you can set a threshold and filter the OTU table. This is done with the following script:
amptk filter -i test.otu_table.txt -f test.final.otus.fa -b mock3 --mc my_mock_seqs.fa
This will read the OTU table -i
and the OTUs -f
from the amptk cluster
command. This script will apply an index-bleed filter to clean-up barcode-switching between samples which happens at a rate of ~ 0.2% in Ion Torrent and as much 0.3% in MiSeq data. The script first normalizes the OTU table to the number of reads in each sample, then (optionally) using the -b
sample, it will calculate the amount of index-bleed in the OTU table, finally it will loop through each OTU and change values to 0 that are below the -index_bleed
filter. Finally, this script will remove the mock spike in control sample from your dataset - as it should not be included in downstream processing, you can keep mock sequences if desired by passing the --keep_mock
argument. The output is a filtered OTU table to be used for downstream processing.
If you do not have a mock community spike in, you can still run the index bleed filter (and you probably should as nearly all NGS data has some degree of barcode switching or index-bleed) by just running the command without a -b
argument, such as, which will apply a 0.5% filter on the data. Passing the -p 0.005
argument will over-ride the calculated index-bleed.
amptk filter -i test.otu_table.txt -f test.final.otus.fa -p 0.005
You can assign taxonomy to your OTUs using AMPtk, either using UTAX from USEARCH8.1 or using usearch_global. The databases require some initial setup before you can use the amptk taxonomy
command.
Issuing the amptk taxonomy
command will inform you which databases have been properly configured as well as usage instructions:
$ amptk taxonomy
Usage: amptk taxonomy <arguments>
version: 0.8.0
Description: Script maps OTUs to taxonomy information and can append to an OTU table (optional). By default the script
uses a hybrid approach, e.g. gets taxonomy information from SINTAX, UTAX, and global alignment hits from the larger
UNITE-INSD database, and then parses results to extract the most taxonomy information that it can at
'trustable' levels. SINTAX/UTAX results are used if BLAST-like search pct identity is less than 97%.
If % identity is greater than 97%, the result with most taxonomy levels is retained.
Arguments: -f, --fasta Input FASTA file (i.e. OTUs from amptk cluster) (Required)
-i, --otu_table Input OTU table file (i.e. otu_table from amptk cluster)
-o, --out Base name for output file. Default: amptk-taxonomy.<method>.txt
-d, --db Select Pre-installed database [ITS1, ITS2, ITS, 16S, LSU, COI]. Default: ITS2
-m, --mapping_file QIIME-like mapping file
-t, --taxonomy Taxonomy calculated elsewhere. 2 Column file.
--method Taxonomy method. Default: hybrid [utax, sintax, usearch, hybrid, rdp, blast]
--fasta_db Alternative database of fasta sequenes to use for global alignment.
--utax_db UTAX formatted database. Default: ITS2.udb [See configured DB's below]
--utax_cutoff UTAX confidence value threshold. Default: 0.8 [0 to 0.9]
--usearch_db USEARCH formatted database. Default: USEARCH.udb
--usearch_cutoff USEARCH threshold percent identity. Default 0.7
--sintax_cutoff SINTAX confidence value threshold. Default: 0.8 [0 to 0.9]
-r, --rdp Path to RDP Classifier. Required if --method rdp
--rdp_db RDP Classifer DB set. [fungalits_unite, fungalits_warcup. fungallsu, 16srrna]
--rdp_cutoff RDP Classifer confidence value threshold. Default: 0.8 [0 to 1.0]
--local_blast Local Blast database (full path) Default: NCBI remote nt database
--tax_filter Remove OTUs from OTU table that do not match filter, i.e. Fungi to keep only fungi.
-u, --usearch USEARCH executable. Default: usearch9
And then you can use the amptk taxonomy
command to assign taxonomy to your OTUs as well as append them to your OTU table as follows:
#use of hybrid taxonomy approach
amptk taxonomy -f data.filtered.otus.fa -o output -i data.final.csv -m mapping_file.txt -d ITS2
#filter data to only include OTUs identified to Fungi
amptk taxonomy -f data.filtered.otus.fa -o output -i data.final.csv -m mapping_file.txt -d ITS2 --tax_filter Fungi
#use RDP classifier
amptk taxonomy -f data.filtered.otus.fa -o output -i data.final.csv -m mapping_file.txt --method rdp --rdp_db fungalits_unite -rdp /path/to/classifier.jar
After taxonomy is appended to your OTU table, you can then generate OTU-like tables for each of your samples at all of the levels of taxonomy (i.e. Kingdom, Phylum, Class, Order, Family, Genus). At the same time, you can create a QIIME-like stacked bar graph from these data.
amptk summarize -i data.taxonomy.otu_table.txt -o data-summary
The optional --graphs
argument will create the stacked bar graphs. You can save in a variety of formats as well as convert the result to precent of total with the --percent
argument.
As of amptk v0.6.0
, the output from the amptk taxonomy
command will create a biom file that contains taxonomy compatible with QIIME, PHINCH, PhyloSeq, and MetaCoMET. The script will also output a phylogenetic tree from your OTUs which is required for some downstream analysis. Moreover, you can pass the -m, --mapping_file
option to amptk taxonomy
and all columns will be incorporated as sample metadata. A mapping file is created automatically for you in the pre-processing steps of amptk, such as amptk ion
and amptk illumina
. You can easily add your metadata to this file using something like excel, and then save as a tab delimited text file.
QIIME and PhyloSeq import instructions
- Python 2.7
- Biopython
- USEARCH9
- natsort
- pandas
- numpy
- matplotlib
- psutil
- bedtools (only needed if using Ion Torrent BAM file as input)
- vsearch (version > 1.9.0, this is optional but will increase speed of AMPtk and is required for very large datasets) installed via homebrew installation by default. Newer tools require vsearch.
- biom-format (to create biom OTU table)
- h5py (for biom)
- R (dada2)
- dada2, ShortRead (these will be automatically installed on first usage of
amptk dada2
Python and USEARCH need to accessible in PATH; alternatively you can pass in the variable -u /path/to/usearch8
to scripts requiring USEARCH8.
In order to draw a heatmap or stacked bar graph using amptk.py heatmap
or amptk summarize
you will need to have the following python libraries installed: matplotlib, pandas, numpy
. They can be installed with pip, i.e. pip install matplotlib pandas numpy
.