nguyenngochuy91/Ancestral-Blocks-Reconstruction

ROAGUE: Reconstruction of Ancestral Gene Blocks Using Events

Purpose

ROAGUE is a tool to reconstruct ancestors of gene blocks in prokaryotic genomes. Gene blocks are genes co-located on the chromosome. In many cases, gene blocks are conserved between bacterial species, sometimes as operons, when genes are co-transcribed. The conservation is rarely absolute: gene loss, gain, duplication, block splitting and block fusion are frequently observed.

ROAGUE accepts a set of species and a gene block in a reference species. It then finds all gene blocks orthologous to the reference gene block and reconstructs their ancestral states.
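To make the idea of gene loss and duplication concrete, here is a minimal sketch that compares an observed gene block against a reference block. It is an illustration only and not ROAGUE's event model: the function name `compare_to_reference` is hypothetical, the observed gene list is made up, and block splitting and fusion are not modeled.

```python
from collections import Counter

def compare_to_reference(reference_genes, observed_genes):
    """Summarize how an observed gene block differs from the reference.

    Returns the reference genes absent from the observed block (losses)
    and the reference genes present more than once (duplications).
    """
    counts = Counter(observed_genes)
    lost = [g for g in reference_genes if counts[g] == 0]
    duplicated = [g for g in reference_genes if counts[g] > 1]
    return {"lost": lost, "duplicated": duplicated}

reference = ["astA", "astB", "astC", "astD", "astE"]
observed = ["astA", "astC", "astC", "astD"]  # made-up observation
print(compare_to_reference(reference, observed))
# {'lost': ['astB', 'astE'], 'duplicated': ['astC']}
```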

Requirements

Installation

Users can either use the Download button in the GitHub interface or run the following command on the command line:

git clone https://github.com/nguyenngochuy91/Ancestral-Blocks-Reconstruction

Install Miniconda (you can either export the path every time you use ROAGUE or add it to your .bashrc file). The following commands require Wget, so install it first.

wget http://repo.continuum.io/miniconda/Miniconda-latest-Linux-x86_64.sh -O Miniconda-latest-Linux-x86_64.sh
bash Miniconda-latest-Linux-x86_64.sh -b -p ~/anaconda_ete/
export PATH=~/anaconda_ete/bin:$PATH;

Install Biopython and ete3 using conda (installing Biopython with conda is highly recommended)

conda install -c bioconda biopython ete3

Install ete_toolchain for visualization

conda install -c etetoolkit ete_toolchain

Install BLAST, ClustalW, MUSCLE

conda install -c bioconda blast clustalw muscle

For PDA, check installation instructions on this website: PDA
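After installation, it can help to confirm that the external programs are reachable from your shell. The following is a minimal sketch using only the Python standard library. The executable names in the list are assumptions about what the conda packages install, not a list taken from ROAGUE; adjust them to match your setup.

```python
import shutil

def missing_tools(names):
    """Return the names from `names` that are not found on PATH."""
    return [name for name in names if shutil.which(name) is None]

if __name__ == "__main__":
    # Assumed executable names; adjust to match your installation.
    tools = ["blastp", "makeblastdb", "clustalw", "muscle"]
    missing = missing_tools(tools)
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All listed tools were found on PATH.")
```

If a tool is reported as missing, check that the Miniconda `bin` directory from the export step is on your PATH.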

Usage

The easiest way to run the project is to execute the script roague.py, which is inside the directory Ancestral-Blocks-Reconstruction.

Run on example datasets for Tracing the ancestry of operons in bacteria

Users can run this script on the example datasets provided in the directories E_Coli and B_Sub. The following two commands run ROAGUE on these two directories. By default, the final results (PDF files of the ancestral reconstructions) are stored in the directories result/E_Coli/visualization and result/B_Sub/visualization.

E_Coli

./roague.py -g E_Coli/genomes/ -b E_Coli/gene_block_names_and_genes.txt -r NC_000913 -f E_Coli/phylo_order.txt -m global

B_Sub

./roague.py -g B_Sub/genomes/ -b B_Sub/gene_block_names_and_genes.txt -r NC_000964 -f B_Sub/phylo_order.txt -m global

Run on example datasets for Evolutionary analysis of the bacterial gibberellin

Users can go to Gibberellin and follow the instructions in the README there.

Run on users' specific datasets

To run the program on their own datasets, users have to provide the following inputs:

  1. A directory that stores all the genome files to study, in GenBank format
  2. A gene block text file that stores gene blocks in a reference species (the reference genome has to be in the genomes directory). The file is tab-delimited: the first column is the gene block name, followed by the names of its genes. For example, here is the gene_block_names_and_genes.txt file from Escherichia coli K-12 MG1655.
astCADBE	astA	astB	astC	astD	astE
atpIBEFHAGDC	atpI	atpH	atpC	atpB	atpA	atpG	atpF	atpE	atpD
caiTABCDE	caiA	caiE	caiD	caiC	caiB	caiT
casABCDE12	casE	casD	casA	casC	casB	cas1	cas2
chbBCARFG	chbG	chbF	chbC	chbB	chbA	chbR
  3. Run ROAGUE; the output is stored in the directory result.
./roague.py -g genomes_directory -b gene_block_names_and_genes.txt -r ref_accession -m global -o result
usage: roague.py [-h] [--genomes_directory GENOMES_DIRECTORY]
               [--gene_blocks GENE_BLOCKS] [--reference REFERENCE]
               [--filter FILTER] [--method METHOD] [--output OUTPUT]

optional arguments:
-h, --help            show this help message and exit
--genomes_directory GENOMES_DIRECTORY, -g GENOMES_DIRECTORY
                      The directory that store all the genomes file
                      (E_Coli/genomes)
--gene_blocks GENE_BLOCKS, -b GENE_BLOCKS
                      The gene_block_names_and_genes.txt file, this file
                      stores the operon name and its set of genes
--reference REFERENCE, -r REFERENCE
                      The ncbi accession number for the reference genome
                      (NC_000913 for E_Coli and NC_000964 for B_Sub)
--filter FILTER, -f FILTER
                      The filter file for creating the tree
                      (E_Coli/phylo_order.txt for E_Coli or
                      B_Sub/phylo_order.txt for B-Sub)
--method METHOD, -m METHOD
                      The method to reconstruc ancestral gene block, we
                      support either global or local
--output OUTPUT, -o OUTPUT
                      Output directory to store the result
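The tab-delimited gene block format described in step 2 can be read with a few lines of Python. This is a minimal sketch and not ROAGUE's own parser; the function name `parse_gene_blocks` is hypothetical, and the sample text reuses two lines from the E. coli example above.

```python
import csv
import io

def parse_gene_blocks(handle):
    """Map each gene block name to its list of gene names.

    Expects tab-delimited lines: block name first, then the genes.
    """
    blocks = {}
    for row in csv.reader(handle, delimiter="\t"):
        fields = [field.strip() for field in row if field.strip()]
        if not fields:
            continue  # skip blank lines
        name, genes = fields[0], fields[1:]
        blocks[name] = genes
    return blocks

# Two lines taken from the E. coli example above.
example = (
    "astCADBE\tastA\tastB\tastC\tastD\tastE\n"
    "caiTABCDE\tcaiA\tcaiE\tcaiD\tcaiC\tcaiB\tcaiT\n"
)
blocks = parse_gene_blocks(io.StringIO(example))
print(blocks["astCADBE"])  # ['astA', 'astB', 'astC', 'astD', 'astE']
```

A check like this is a quick way to catch lines that were saved with spaces instead of tabs before running ROAGUE on your own data.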

Users can also provide a filter text file, which specifies the species to include in the reconstruction analysis. Some families of species may be over-represented in the genomes directory, which reduces phylogenetic diversity and biases the ancestral reconstruction. Hence, it is recommended to run PDA on the generated tree before proceeding with the analysis. To do so, follow these steps:

  1. Generate a phylogenetic tree from the genomes directory
./create_newick_tree.py -G genomes_directory -o tree_directory -f NONE -r ref_accession
usage: create_newick_tree.py [-h] [-G DIRECTORY] [-o DIRECTORY] [-f FILE]
                          [-m STRING] [-t FILE] [-r REF] [-q]

optional arguments:
-h, --help            show this help message and exit
-G DIRECTORY, --genbank_directory DIRECTORY
                     Folder containing all genbank files for use by the
                     program.
-o DIRECTORY, --outfolder DIRECTORY
                     Directory where the results of this program will be
                     stored.
-f FILE, --filter FILE
                     File restrictiong which accession numbers this script
                     will process. If no file is provided, filtering is not
                     performed.
-r REF, --ref REF     The reference genome number, such as NC_000913 for E_Coli
-q, --quiet           Suppresses most program text outputs.

  2. Download and install PDA, then debias the phylogenetic tree using the PDA program:
./debias.py -i tree_directory/out_tree.nwk -o pda_result.txt -s num -r ref_accession
usage: debias.py [-h] [-i INPUT_TREE] [-o PDA_OUT] [-s TREE_SIZE] [-r REF]


optional arguments:
-h, --help            show this help message and exit
-i INPUT_TREE, --input_tree INPUT_TREE
                     Input tree that we want to debias
-o PDA_OUT, --pda_out PDA_OUT
                     Output of pda to be store.
-s TREE_SIZE, --tree_size TREE_SIZE
                     Reduce the size of the tree to this size
-r REF, --ref REF     Force to include the following species, here I force
                     to include the reference species

  3. Run ROAGUE; the output is stored in the directory result.
./roague.py -g genomes_directory -b gene_block_names_and_genes.txt -r ref_accession -f phylo_order.txt -m global -o result
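The three steps above can be collected in one place. The sketch below only builds and prints the command lines exactly as documented above; it does not run them. The function name `filtered_run_commands` and the tree size of 30 are hypothetical. How the PDA output relates to the filter file passed to `-f` is not stated here, so both are kept as separate parameters.

```python
import shlex

def filtered_run_commands(genomes_dir, gene_blocks, ref, tree_dir,
                          tree_size, pda_out, filter_file, out_dir,
                          method="global"):
    """Return the three documented command lines as argument lists."""
    tree_cmd = ["./create_newick_tree.py", "-G", genomes_dir,
                "-o", tree_dir, "-f", "NONE", "-r", ref]
    debias_cmd = ["./debias.py", "-i", f"{tree_dir}/out_tree.nwk",
                  "-o", pda_out, "-s", str(tree_size), "-r", ref]
    roague_cmd = ["./roague.py", "-g", genomes_dir, "-b", gene_blocks,
                  "-r", ref, "-f", filter_file, "-m", method, "-o", out_dir]
    return [tree_cmd, debias_cmd, roague_cmd]

commands = filtered_run_commands(
    genomes_dir="genomes_directory",
    gene_blocks="gene_block_names_and_genes.txt",
    ref="ref_accession",
    tree_dir="tree_directory",
    tree_size=30,  # hypothetical target size for the debiased tree
    pda_out="pda_result.txt",
    filter_file="phylo_order.txt",
    out_dir="result",
)
for cmd in commands:
    print(shlex.join(cmd))  # shlex.join requires Python 3.8+
```

To execute the steps instead of printing them, each argument list could be passed to `subprocess.run(cmd, check=True)` from the repository directory.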

Examples

Here are the reconstructions of two gene blocks generated by our program.

  1. Gene block paaABCDEFGHIJK:

This gene block encodes proteins involved in the catabolism of phenylacetate; it is not conserved among the studied bacteria.

paaABCDEFGHIJK

  2. Gene block atpIBEFHAGDC:

This gene block encodes ATP synthase, which catalyzes the synthesis of ATP from ADP and inorganic phosphate; it is highly conserved among the studied bacteria.

atpIBEFHAGDC

Credits

  1. http://bioinformatics.oxfordjournals.org/content/early/2015/04/13/bioinformatics.btv128.full