mortonjt/woltka

 
 

Woltka

Woltka (Web of Life Toolkit App) is a bioinformatics package for shotgun metagenome data analysis. It takes full advantage of, and is not limited by, the WoL reference phylogeny. It bridges first-pass sequence aligners with advanced analytical platforms (such as QIIME 2). Highlights of this program include:

  • OGU: fine-grained community ecology.
  • Tree-based, rank-free classification.
  • Combined taxonomic & functional analysis.

Woltka ships with a QIIME 2 plugin. See here for instructions.

Overview

Where does Woltka fit in a pipeline

Woltka is a classifier. It serves as a middle layer between sequence alignment and community ecology analyses.

What does Woltka do

Woltka processes alignments -- the mappings of query sequences against reference sequences (such as microbial genomes or genes) -- and infers the best placement of the queries in a hierarchical classification system. One query could have simultaneous matches in multiple references. Woltka finds the most suitable classification unit(s) to describe the query according to the criteria specified by the researcher. Woltka generates profiles (feature tables) -- the frequencies (counts) of classification units which describe the composition of samples.
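
To make the idea of a profile concrete, here is a minimal Python sketch. It is not Woltka's code: the function name `build_profile`, the toy data, and the rule of keeping a query only when all of its matches agree on one unit are assumptions chosen for illustration. Woltka itself lets the researcher choose the criteria.

```python
from collections import Counter


def build_profile(alignments, lineage):
    """Count classification units per sample (illustrative only).

    alignments: {sample: {query: set of reference IDs}}
    lineage:    {reference ID: classification unit}
    """
    profile = {}
    for sample, queries in alignments.items():
        counts = Counter()
        for query, refs in queries.items():
            units = {lineage[r] for r in refs if r in lineage}
            # Assumed criterion: count the query only when all of its
            # matches fall into a single unit; otherwise skip it.
            if len(units) == 1:
                counts[units.pop()] += 1
        profile[sample] = dict(counts)
    return profile


# Toy data: reference IDs and unit names are hypothetical.
alignments = {
    "S01": {"r1": {"G1"}, "r2": {"G1", "G2"}, "r3": {"G3"}},
    "S02": {"r1": {"G2"}, "r2": {"G1", "G3"}},
}
lineage = {"G1": "Firmicutes", "G2": "Firmicutes", "G3": "Proteobacteria"}
profile = build_profile(alignments, lineage)
```

In this toy case, query r2 of S01 hits two genomes in the same phylum and is counted once, while r2 of S02 hits genomes in different phyla and is left unassigned.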

What else does Woltka do

Woltka provides several utilities for handling feature tables, including collapsing a table to higher-level features, calculating feature group coverage, filtering features based on per-sample abundance, and merging tables.
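
Two of these operations, collapsing and per-sample abundance filtering, can be sketched in plain Python as follows. The functions `collapse` and `filter_by_fraction`, the dictionary-based table, and the choice to drop unmapped features are assumptions for illustration, not Woltka's interface or behavior.

```python
from collections import defaultdict


def collapse(table, mapping):
    """Sum feature counts into higher-level groups, per sample."""
    collapsed = {}
    for sample, counts in table.items():
        groups = defaultdict(int)
        for feature, count in counts.items():
            # Assumption: features without a mapping are dropped.
            if feature in mapping:
                groups[mapping[feature]] += count
        collapsed[sample] = dict(groups)
    return collapsed


def filter_by_fraction(table, threshold):
    """Keep features whose share of the sample total is >= threshold."""
    filtered = {}
    for sample, counts in table.items():
        total = sum(counts.values())
        filtered[sample] = {
            feature: count
            for feature, count in counts.items()
            if total and count / total >= threshold
        }
    return filtered


# Toy table: {sample: {feature: count}}; all IDs are hypothetical.
table = {"S01": {"G1": 8, "G2": 1, "G3": 1}}
mapping = {"G1": "A", "G2": "A", "G3": "B"}
collapsed = collapse(table, mapping)
filtered = filter_by_fraction(table, 0.2)
```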

What does Woltka not do

Woltka does NOT align sequences. You need to align your FastQ (or Fast5, etc.) files against a reference database (we recommend WoL) using an aligner of your choice (BLAST, Bowtie2, etc.). The resulting alignment files can be fed into Woltka.

Woltka does NOT analyze profiles. We recommend using QIIME 2 for robust downstream analyses of the profiles to decode the relationships among microbial communities and with their environments.

Installation

Requirement: Python 3.6 or above, with Python package biom-format installed.

pip install woltka

After installation, launch the program by executing:

woltka

More details about installation are provided here.

Example usage

Woltka provides several small test datasets under woltka/tests/data. To access them, download this GitHub repo, unzip, and navigate to this directory.

One can execute the following commands to make sure that Woltka functions correctly, and to get an impression of the basic usage of Woltka.

(Note: a more complete list of commands is provided here. Alternatively, you can skip this test dataset and check out the instructions for working with WoL.)

1. OGU table generation (details):

woltka classify -i align/bowtie2 -o ogu.biom

The input path, align/bowtie2, is a directory containing five Bowtie2 alignment files (S01.sam.xz, S02.sam.xz,... S05.sam.xz) (SAM format, xzipped), each representing the mapping of shotgun metagenomic sequences per sample against a reference genome database.

The output file, ogu.biom, is a feature table in BIOM format, which can then be analyzed using various bioinformatics programs such as QIIME 2.

2. Taxonomic profiling at the ranks of phylum, genus and species (details):

woltka classify \
  -i align/bowtie2 \
  --map taxonomy/g2tid.txt \
  --nodes taxonomy/nodes.dmp \
  --names taxonomy/names.dmp \
  --rank phylum,genus,species \
  -o output_dir

The mapping file (g2tid.txt) translates genome IDs to taxonomic IDs, which then allows Woltka to classify query sequences based on the NCBI taxonomy (nodes.dmp and names.dmp).

The output directory (output_dir) will contain three feature tables: phylum.biom, genus.biom and species.biom, each representing a taxonomic profile at one of the three ranks.
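
The chain from genome ID to taxonomic ID to a name at a given rank can be pictured with the sketch below. It is an illustration under stated assumptions, not Woltka's implementation: the function `name_at_rank`, the in-memory dictionaries standing in for g2tid.txt, nodes.dmp and names.dmp, and the toy taxonomic IDs are all hypothetical.

```python
def name_at_rank(taxid, rank, nodes, names):
    """Climb from taxid toward the root until a node of the given rank.

    nodes: {taxid: (parent taxid, rank)}
    names: {taxid: name}
    Returns None if no ancestor (or the node itself) has that rank.
    """
    while True:
        parent, this_rank = nodes[taxid]
        if this_rank == rank:
            return names[taxid]
        if parent == taxid:  # the root is its own parent
            return None
        taxid = parent


# Toy stand-ins for the three input files; IDs are made up.
g2tid = {"G1": "t4"}
nodes = {
    "t1": ("t1", "no rank"),
    "t2": ("t1", "phylum"),
    "t3": ("t2", "genus"),
    "t4": ("t3", "species"),
}
names = {
    "t1": "root",
    "t2": "Proteobacteria",
    "t3": "Escherichia",
    "t4": "Escherichia coli",
}

ranks = ["phylum", "genus", "species"]
assigned = {r: name_at_rank(g2tid["G1"], r, nodes, names) for r in ranks}
```

Counting the resulting names per sample at each rank would give one table per rank, which mirrors the three output files described above.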

3. Functional profiling by UniRef entries then by GO terms (molecular process):

woltka classify \
  -i align/bowtie2 \
  --coords function/coords.txt.xz \
  --map function/uniref.map.xz \
  --map function/go/process.tsv.xz \
  --map-as-rank \
  --rank uniref,process \
  -o output_dir

Here, the input files are still read-to-genome alignments, instead of read-to-gene ones, but Woltka matches reads to genes based on their coordinates on the genomes (as indicated by the file coords.txt). This ensures consistency between taxonomic and functional classifications.
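
Coordinate-based matching of reads to genes can be sketched as an interval overlap test. This is a simplified illustration, not Woltka's algorithm: the function `match_read_to_genes`, the 1-based inclusive coordinates, and the 80% overlap threshold used here are assumptions.

```python
def match_read_to_genes(read, genes, min_fraction=0.8):
    """Return IDs of genes that cover at least min_fraction of the read.

    read:  (start, end) of the read alignment on a genome
    genes: {gene ID: (start, end)} on the same genome
    Coordinates are assumed to be 1-based and inclusive.
    """
    r_start, r_end = read
    r_len = r_end - r_start + 1
    hits = []
    for gene, (g_start, g_end) in genes.items():
        # Length of the shared interval; zero or negative means no overlap.
        overlap = min(r_end, g_end) - max(r_start, g_start) + 1
        if overlap >= r_len * min_fraction:
            hits.append(gene)
    return hits


# Hypothetical gene coordinates on one genome.
genes = {"geneA": (100, 500), "geneB": (450, 900)}

inside_a = match_read_to_genes((401, 500), genes)
inside_b = match_read_to_genes((461, 560), genes)
intergenic = match_read_to_genes((950, 1049), genes)
```

Because the genes are located on the same genomes used for taxonomic classification, a read matched this way carries both a genome assignment and a gene assignment.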

Subsequently, Woltka is able to assign query sequences to functional units, as defined in mapping files (uniref.map and go/process.tsv). As you can see, compressed files are supported and auto-detected.

Similarly, the output files are two functional profiles: uniref.biom and process.biom.

One can also combine taxonomic and functional profiling in a stratification analysis. See details.

Citation

The first manuscript describing Woltka has been preprinted at:

Note: This manuscript focuses on the OGU analysis. Although it does not discuss other functions of Woltka, it is so far the only citable article if you use Woltka in your studies.

Contact

Please forward any questions to the project leader: Dr. Qiyun Zhu (qiyunzhu@gmail.com) or the senior PI: Dr. Rob Knight (robknight@ucsd.edu).
