GitHub - JelleZijlstra/JPP: Phylogenetic inference using maximum parsimony in Python

JelleZijlstra / JPP Public

Notifications You must be signed in to change notification settings
Fork 0
Star 4

Phylogenetic inference using maximum parsimony in Python

Notifications

Name		Name	Last commit message	Last commit date
Latest commit History 67 Commits
html		html
src		src
.gitignore		.gitignore
README		README

Repository files navigation

CS205 Final Project: Computational Phylogenetics
Jelle Zijlstra, November--December 2012

1. Introduction
This project implements phylogenetic inference using maximum parsimony (MP).
Its input is a list of taxa (e.g., animal species) with their states for a
variety of characters and its output is a tree that describes their
relationships. This method is used to reconstruct the evolutionary relationships among different organisms.

1.1. Example of MP in action
For example, suppose we have four species---A, B, C, and D. They vary in two
characters: A has red eyes while B, C, and D have blue eyes; and A and B have a
blue nose while C and D have a red one. We arbitrarily select A as the outgroup
(the first-branching taxon). That means there are three possible trees:

(1)
/-- A
--| /-- B
\--| /-- C
\--|
\-- D

(2)
/-- A
--| /-- C
\--| /-- B
\--|
\-- D

(3)
/-- A
--| /-- D
\--| /-- B
\--|
\-- C

The character matrix (the input to our program) would look like this:

4 2
A 00
B 10
C 11
D 11

When using maximum parsimony, we want to select the tree that requires the
fewest changes in character states. For all three trees above, there is one
change in character 1: from 0 in A to 1 in the branch (clade) formed by B, C,
and D. However, for character 2, trees (2) and (3) require two changes in
state: from state 0 in A, it changed to 1 independently in C and D, or it
changed to 1 in the B-C-D clade, and then back to 0 in B only. In either case,
there are two state changes. In tree (1), on the other hand, the character
changes only once in the branch leading to C and D.

Summing over the two characters, tree (1) has a length of 2, while trees (2)
and (3) have a length of 3. Therefore, (1) is the most parsimonious tree. We
conclude that C and D shared a more recent common ancestor with each other
than either did with B.

1.2. MP on larger datasets
Real-world datasets tend to have more than four taxa and two characters. This
poses a problem, because the number of possible trees increases very fast with
the number of taxa--at O(n!). Therefore, it is impossible in practice to
enumerate all possible trees when there are more than ~10 taxa.

However, there are alternatives. These algorithms rely on the observation
that optimal and near-optimal trees tend to be similar to each other. This is
often visualized as a ``tree landscape'', where similar good trees are
clustered together in ``mountains''. We then need a ``hill-climbing''
algorithm, which can move from a tree to other, similar trees in order to
climb up the tree.

This is accomplished using heuristic algorithms that perform slight
rearrangements on a tree in order to improve it---for example, exchange two
adjacent nodes or branches of the tree. Common heuristic algorithms include:
- Nearest Neighbor Interchange (NNI) -- perform rearrangements of the type
(A,(B,C)) -> (B,(A,C))
- Subtree Pruning Rebranching (SPR) -- prune a random subtree from the tree,
then replant that tree at a random different location in the tree.
- Tree Bissection Reconnection (TBR) -- similar to SPR, but more powerful. The
tree is bissected at a random node, and the two partial trees that result are
both treated as unrooted and reconnected at random spots.

Since these algorithms all rely on random rearrangements of a previous tree,
they do not provide an exact solution. However, in practice they usually come
at least close to finding the best tree.

2. Implementation and usage
In this project, I implemented exhaustive search, branch-and-bound search (a
slightly faster-running exact algorithm), and the heuristic algorithms NNI,
SPR, and TBR. The heuristic algorithms can be run in parallel by multiple
processes using MPI.

The code is written in Python and requires the following external modules:
- mpi4py
- numpy
- argparse
- CProfile (if profiling is enabled)

Type ``python main.py'' for usage notes. Here are some example invocations:

- openmpirun -n 4 python main.py -a tbr -t 60 -i 30 -d data/oryzos.txt
Runs the program on the datafile ``data/oryzos.txt'' using TBR for 60
minutes with 30 s between communications between the 4 MPI processes.
- python main.py -a exhaustive -d data/shortoryzos.txt
Performs an exhaustive search on the datafile ``data/shortoryzos.txt''.

2.1. Organization.
The implementation resides in the src/ directory. The program comprises the
following code files:
- branchnbound.py -- Implementation of branch-and-bound search.
- charmatrix.py -- Class representing a taxon-character matrix.
- exhaustive.py -- Implementation of exhaustive search.
- fuser.py -- Implementation of tree fusion (an operation on two trees to
detect common subtrees and unite the optimal components of both)
- list_tree.py -- Class representing a tree as a list of pointers to other
nodes in the tree; useful for heuristic algorithms like SPR and TBR.
- list_tree_searcher.py -- Abstract class used for search in multiple
processes using a heuristic algorithm.
- main.py -- Entry point of the program.
- mixed.py -- Searcher implementation that lets different processes use
different algorithms.
- nni.py -- Implementation of NNI.
- parallel_search.py -- Class used for communication between different MPI
processes to exchange trees.
- run_tests.py -- Runs unit tests on modules that have them.
- serial_util.py -- Provides a utility function for timing the execution of
pieces of code.
- spr.py -- Implementation of SPR.
- tbr.py -- Implementation of TBR.
- tree.py -- Implementation of trees as immutable, recursive data structures.
- tree_parse.py -- Parser that parses a string of the form "(1,(2,3))", used
for MPI communication, into a tree object.
The data/ directory contains example datasets, and the dev/ directory contains
some code used in development.

3. Performance.
Since this program uses heuristic search and does not calculate an exact
result, it is not meaningful to compute measures like speedup. Instead, we can
look at how short the trees produced by the program are as a function of the
number of processes and the runtime.

I ran the program on the dataset in data/oryzos.txt, with a communication
interval of 15 s, using TBR, under the following three configurations:
- Single process, run for 2 min -- produces a single tree of length 973
- Single process, run for 8 min -- produces a single tree of length 918
- Four processes, run for 2 min -- produce a single tree of length 961
I also ran the program under the "mixed" configuration with four processes for
two minutes with an interval of 15 s. This yielded a tree of length 934.

Thus, the multiple processes together perform better than a single process, but
not nearly as well as a single process run longer. This suggests that the
strategy used for communicating between the processes is not yet optimal.

4. Possible future optimizations.
Profiling (using CProfile) reveals that the TBR algorithm spends about half of
its time in the list_tree.char_length method. This method could probably be
optimized by caching partial results in the tree and invalidating portions of
the cache in the event of rearrangements.

The program also spends a significant amount of time in the MPI recv method. It
may be possible to increase productivity by performing further searches during
recv wait time.

5. Links.
- http://www.github.com/JelleZijlstra/JPP - source code
- http://cloud.cs50.net/~zijlstra/JPP/html/ - project website
- http://youtu.be/0fS0zfdye_k - project video on YouTube