CompareM

Unsupported: Unfortunately, I no longer have time to continue support for CompareM. The AAI calculator at the Kostas Lab or the EzAAI tool may meet your needs.

Note: There is a known issue with CompareM that can result in no homologs being identified when run on some Linux systems. This is related to different implementations of 'sort'. Titus Brown has suggested a solution that addresses this for Mac OS X.

CompareM is a software toolkit that supports large-scale comparative genomic analyses. It provides statistics across sets of genomes (e.g., amino acid identity) and for individual genomes (e.g., codon usage). Parallelized implementations are provided for computationally intensive tasks to allow scalability to thousands of genomes. Common workflows are provided as single methods to support easy adoption by users, and a more granular interface is provided to allow experienced users to exploit specific functionality. CompareM is open source and released under the GNU General Public License (Version 3).

Comparative genomic statistics:

average amino acid identity (AAI) between genomes
taxonomic classification by calculating AAI between query genomes and a reference database

Genomic usage patterns:

codon usage
amino acid usage
kmer usage for k <= 8 (e.g., tetranucleotide)
stop codon usage

Other:

di-nucleotide and codon usage patterns for identifying LGT
data exploration using dissimilarity matrices, hierarchical clustering trees, and heat maps

Announcements

Ported to Python 3 starting with version 0.1.0

Installation

Install via Conda

CompareM can be installed via Conda using:

> conda install -c bioconda comparem

Install via pip

CompareM can be installed with pip using:

> sudo pip install comparem

You must install Prodigal and DIAMOND independently.

Dependencies

CompareM makes use of the numpy, scipy, matplotlib, and biolib Python packages, and assumes the following third-party dependencies are on your system path:

prodigal >= 2.6.2: Hyatt D, Locascio PF, Hauser LJ, Uberbacher EC. 2012. Gene and translation initiation site prediction in metagenomic sequences. Bioinformatics 28: 2223-2230.
diamond >= 0.9.0: Buchfink B, Xie C, Huson DH. 2015. Fast and sensitive protein alignment using DIAMOND. Nature Methods 12: 59–60 doi:10.1038/nmeth.3176.

Most systems already contain the “SciPy Stack” of numpy, scipy, and matplotlib. However, if you need to install these on your system, instructions can be found at:

http://www.scipy.org/install.html
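
Before running CompareM, it can help to confirm that the third-party executables listed above are visible on the system path. The following is a minimal sketch using only the Python standard library. The function name find_missing_tools is hypothetical and not part of CompareM, and the check only looks for the executables; it does not verify the minimum versions.

```python
import shutil

# Executable names taken from the dependency list above.
REQUIRED_TOOLS = ("prodigal", "diamond")


def find_missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools that cannot be found on the system path.

    Hypothetical helper for illustration; it does not check tool versions.
    """
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All required tools were found on PATH.")
```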

Quick Start

The functionality provided by CompareM can be accessed through the help menu:

> comparem -h

Usage information about specific functions can also be accessed through the help menu, e.g.:

> comparem aa_usage -h

Amino Acid Identity Workflow

The most common task performed with CompareM is the calculation of pairwise amino acid identity (AAI) values between a set of genomes. This can be performed using the aai_wf command:

> comparem aai_wf <input_files> <output_dir>

The <input_files> argument indicates the set of genomes to compare and can be either i) a text file where each line indicates the location of a genome, or ii) a directory containing all genomes to be compared. The genomic nucleotide sequences of genomes must be in FASTA format. The <output_dir> argument indicates the desired directory for all output files. A typical use of this command would be:

> comparem --cpus 32 aai_wf my_genomes aai_output

where the directory my_genomes contains a set of genomes in FASTA format, the results are to be written to a directory called aai_output, and 32 processors should be used to calculate the results.
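
If you prefer to pass a text file rather than a directory, a list with one genome location per line can be generated with a short script. This is a minimal sketch: the helper name write_genome_list and the output file name genome_list.txt are hypothetical, and writing absolute paths is an assumption made for convenience rather than a documented requirement.

```python
from pathlib import Path


def write_genome_list(genome_dir, list_file, file_ext="fna"):
    """Write one genome path per line and return the number of genomes listed.

    Hypothetical helper: collects files in genome_dir with the given
    extension and records their absolute paths (absolute paths are an
    assumption, not a documented CompareM requirement).
    """
    genome_paths = sorted(Path(genome_dir).glob(f"*.{file_ext}"))
    with open(list_file, "w") as handle:
        for path in genome_paths:
            handle.write(f"{path.resolve()}\n")
    return len(genome_paths)
```

After calling write_genome_list("my_genomes", "genome_list.txt"), the resulting file can be given as the first argument:

> comparem aai_wf genome_list.txt aai_output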

A number of optional arguments can also be specified. These include the sequence similarity parameters used to define reciprocal best hits between genomes (i.e., homologs). By default, the e-value (--evalue), percent sequence identity (--per_identity), and percent alignment length (--per_aln_len) parameters are set to 1e-5, 30%, and 70%, respectively. When specifying a directory of genomes to process, CompareM only processes files with the fna extension. This can be changed with the --file_ext argument. In addition, if genomes are already represented by amino acid protein sequences (as opposed to genomic nucleotide sequences), this must be specified with the --proteins flag. Otherwise, genes will be identified de novo using the Prodigal gene caller. The time to compute all pairwise AAI values can be substantially reduced by using multiple processors as specified with the --cpus argument. Other arguments are for specialized uses and are discussed in the User's Guide.
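
To illustrate how the three similarity thresholds interact, the sketch below filters alignment hits and then keeps reciprocal best hits. It is not CompareM's implementation. The Hit record, the function names, the use of bit score to rank hits, the inclusive comparisons, and the choice to measure alignment length relative to the query protein are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """Hypothetical record for one protein alignment hit."""
    query: str
    subject: str
    evalue: float
    per_identity: float  # percent sequence identity (0-100)
    aln_len: int         # alignment length in amino acids
    query_len: int       # length of the query protein
    bitscore: float


def passes_thresholds(hit, evalue=1e-5, per_identity=30.0, per_aln_len=70.0):
    """Apply the default thresholds described above.

    Assumption: percent alignment length is relative to the query length.
    """
    aln_percent = 100.0 * hit.aln_len / hit.query_len
    return (hit.evalue <= evalue
            and hit.per_identity >= per_identity
            and aln_percent >= per_aln_len)


def best_hits(hits, **thresholds):
    """Map each query to its highest-scoring hit that passes the thresholds."""
    best = {}
    for hit in hits:
        if not passes_thresholds(hit, **thresholds):
            continue
        current = best.get(hit.query)
        if current is None or hit.bitscore > current.bitscore:
            best[hit.query] = hit
    return best


def reciprocal_best_hits(hits_a_vs_b, hits_b_vs_a, **thresholds):
    """Return (gene_a, gene_b) pairs that are each other's best hit."""
    best_ab = best_hits(hits_a_vs_b, **thresholds)
    best_ba = best_hits(hits_b_vs_a, **thresholds)
    pairs = []
    for gene_a, hit in best_ab.items():
        back = best_ba.get(hit.subject)
        if back is not None and back.subject == gene_a:
            pairs.append((gene_a, hit.subject))
    return pairs
```

Under these assumptions, a hit with 25% identity is discarded before the reciprocal check, so it cannot contribute a gene pair even if it is the only hit in both directions.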

Pairwise AAI statistics are provided in the output file ./<output_dir>/aai/aai_summary.tsv. This file consists of 8 columns indicating:

Identifier of the first genome
Number of genes in the first genome
Identifier of the second genome
Number of genes in the second genome
Number of orthologous genes identified between the two genomes
The mean amino acid identity (AAI) of orthologous genes
The standard deviation of the AAI across orthologous genes
The orthologous fraction (OF) between the two genomes, defined as the number of orthologous genes divided by the minimum number of genes in either genome
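
The eight columns listed above can be read with the Python standard library. The sketch below is a hedged example rather than part of CompareM: the AaiRecord type and both function names are hypothetical, and it assumes that any header or comment line either starts with '#' or has a non-integer second column. The orthologous_fraction function follows the definition given above; whether the value stored in the file is a fraction or a percentage is not stated here, so check your own output.

```python
import csv
from typing import NamedTuple


class AaiRecord(NamedTuple):
    """One row of aai_summary.tsv, in the column order listed above."""
    genome_a: str
    genes_a: int
    genome_b: str
    genes_b: int
    num_orthologs: int
    mean_aai: float
    std_aai: float
    orthologous_fraction: float


def read_aai_summary(path):
    """Parse the tab-separated summary into a list of AaiRecord tuples.

    Assumption: header or comment lines start with '#' or have a
    non-integer second column; such lines are skipped.
    """
    records = []
    with open(path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if not row or row[0].startswith("#") or not row[1].isdigit():
                continue
            records.append(AaiRecord(
                row[0], int(row[1]), row[2], int(row[3]),
                int(row[4]), float(row[5]), float(row[6]), float(row[7]),
            ))
    return records


def orthologous_fraction(num_orthologs, genes_a, genes_b):
    """Orthologous genes divided by the smaller of the two gene counts."""
    return num_orthologs / min(genes_a, genes_b)
```

For example, two genomes with 2000 and 3000 genes that share 1500 orthologous genes have an OF of 1500 / 2000 = 0.75.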

Other output files produced by this command are described below.

Program Usage

Detailed information regarding the use of CompareM can be found in the User's Guide (users_guide.pdf).

Cite

If you find this package useful, please cite this git repository (https://github.com/dparks1134/CompareM)
