TreeCl - Phylogenetic Tree Clustering

TreeCl is a Python package for clustering gene families by phylogenetic similarity. It takes a collection of alignments, infers a phylogenetic tree for each one, and clusters the trees based on a matrix of between-tree distances. Finally, it calculates a single representative tree for each cluster.

The aim is to establish whether the data contain any underlying structure, i.e. whether the gene families fall into groups with similar trees.
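As an illustration of the clustering step only, the sketch below cuts a single-linkage hierarchy into k clusters from a precomputed distance matrix. It is plain Python and does not use treeCl: the function name single_linkage and the toy matrix are made up for this example, and treeCl's own implementation may differ.

```python
from itertools import combinations


def single_linkage(dist, k):
    """Cut a single-linkage hierarchy into k clusters.

    dist is a symmetric square matrix (list of lists) of pairwise
    distances. Returns one cluster label per item, numbered from 1.
    """
    n = len(dist)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of items")

    # Union-find forest: every item starts in its own cluster.
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Single linkage merges the two clusters joined by the shortest
    # remaining pairwise distance, so visit pairs in increasing order.
    pairs = sorted(combinations(range(n), 2),
                   key=lambda p: dist[p[0]][p[1]])
    clusters = n
    for i, j in pairs:
        if clusters == k:
            break
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_j] = root_i
            clusters -= 1

    # Relabel cluster roots as 1, 2, 3, ... in order of first appearance.
    labels = {}
    return [labels.setdefault(find(i), len(labels) + 1) for i in range(n)]


# Hypothetical distances between four trees: two tight pairs.
toy = [
    [0, 1, 9, 8],
    [1, 0, 7, 9],
    [9, 7, 0, 2],
    [8, 9, 2, 0],
]
print(single_linkage(toy, 2))
```

Single linkage only looks at the closest pair between two clusters, so it tends to chain loosely related items together. This is one reason it can give poor clusterings on noisy distance matrices.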

Installation

Clone the repo with its submodules by running the command below, then add the cloned directory to your $PYTHONPATH:

git clone --recursive git@github.com:kgori/treeCl.git

Dependencies

Python:

The easiest way to install the Python dependencies is with pip. If you don't have pip, it can be installed by typing easy_install pip in a shell. The packages can then be installed by running this command:

pip install numpy scipy dendropy scikit-learn

External:

Other:

Example Analysis

from treeCl.collection import Collection, Scorer
from treeCl.clustering import Clustering, Partition

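# Load the alignments. Add compression='gz' or 'bz2' if the sequence
# alignments are compressed (zip is not supported yet).
c = Collection(input_dir='input_dir', file_format='phylip', datatype='protein')

# Infer a neighbour-joining tree for each alignment.
# Add verbosity=1 or higher to get progress messages.
c.calc_NJ_trees()

# Build the matrix of between-tree distances and cluster it.
dm = c.distance_matrix('euc')
cl = Clustering(dm)
p = cl.hierarchical(4, 'single')  # 4 clusters, single linkage; expect a fairly inaccurate clustering

# Reference partition: four groups of 15 (not used by the scoring below).
true = Partition(tuple([1]*15 + [2]*15 + [3]*15 + [4]*15))

# Score the inferred partition.
sc = Scorer(c.records)
score = sc.score(p)
print(score)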
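The example defines a reference partition, true, but does not go on to use it. One generic way to compare an inferred clustering with a known one is the Rand index: the fraction of item pairs on which the two partitions agree. The sketch below is plain Python and does not use treeCl: the function name rand_index is made up for this example, and it works on plain label sequences rather than on Partition objects.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of item pairs on which two partitions agree.

    A pair counts as an agreement if both partitions put the two items
    in the same cluster, or both put them in different clusters.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("partitions must label the same number of items")
    if len(labels_a) < 2:
        raise ValueError("need at least two items to form a pair")

    agree = 0
    total = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        if same_a == same_b:
            agree += 1
        total += 1
    return agree / float(total)


# Same layout as the reference partition in the example: four groups of 15.
true_labels = [1]*15 + [2]*15 + [3]*15 + [4]*15

# Cluster numbering is arbitrary, so a relabelled copy still scores 1.0.
relabelled = [4]*15 + [3]*15 + [2]*15 + [1]*15
print(rand_index(true_labels, relabelled))
```

A value of 1.0 means the two partitions are identical up to relabelling. The plain Rand index is not corrected for chance, so scores for unrelated partitions are generally well above zero.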
