ROAGUE: Reconstruction of Ancestral Gene Blocks Using Events

Purpose

ROAGUE is a tool to reconstruct ancestors of gene blocks in prokaryotic genomes. Gene blocks are genes co-located on the chromosome. In many cases, gene blocks are conserved between bacterial species, sometimes as operons, when genes are co-transcribed. The conservation is rarely absolute: gene loss, gain, duplication, block splitting and block fusion are frequently observed.

ROAGUE accepts a set of species and a gene block in a reference species. It then finds all gene blocks orthologous to the reference gene block and reconstructs their ancestral states.

Requirements

Wget
Conda (package manager so we don't have to use sudo)
Python 3+
Biopython 1.63+
Clustalw
Muscle Alignment
BLAST+
ETE3 (Python framework for trees)
PDA (optional; needed only if you want to debias your tree based on phylogenetic diversity)

Installation

Users can either use the GitHub Download button or run the following command on the command line:

git clone https://github.com/nguyenngochuy91/Ancestral-Blocks-Reconstruction

Install Miniconda (you can either export the path every time you use ROAGUE or add it to your .bashrc file). Wget must be installed before running the following commands.

wget http://repo.continuum.io/miniconda/Miniconda-latest-Linux-x86_64.sh -O Miniconda-latest-Linux-x86_64.sh
bash Miniconda-latest-Linux-x86_64.sh -b -p ~/anaconda_ete/
export PATH=~/anaconda_ete/bin:$PATH;

Install Biopython and ETE3 using conda (installing Biopython with conda is highly recommended)

conda install -c bioconda biopython ete3

Install ete_toolchain for visualization

conda install -c etetoolkit ete_toolchain

Install BLAST, ClustalW, MUSCLE

conda install -c bioconda blast clustalw muscle

For PDA, check installation instructions on this website: PDA
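
After installation, it can help to confirm that the command-line tools are reachable from your PATH before running ROAGUE. The following is a minimal sketch, not part of ROAGUE itself. The executable names in `ASSUMED_TOOLS` are assumptions about how the packages above expose their binaries (for example, `blastp` and `makeblastdb` for BLAST+), so adjust the list to match your installation.

```python
import shutil

# Executable names are assumptions; adjust them to match your installation.
ASSUMED_TOOLS = ["wget", "conda", "blastp", "makeblastdb", "clustalw", "muscle"]


def missing_tools(names):
    """Return the names from `names` that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools(ASSUMED_TOOLS)
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All listed tools were found on PATH.")
```

If a tool is reported as missing, check that `~/anaconda_ete/bin` has been added to your PATH in the current shell.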

Usage

The easiest way to run the project is to execute the script roague.py, which is inside the Ancestral-Blocks-Reconstruction directory.

Run on example datasets for Tracing the ancestry of operons in bacteria

Users can run this script on the example datasets provided in the directories E_Coli and B_Sub. The following two commands run ROAGUE on these two directories. The final results (PDF files of the ancestral reconstructions) are stored in the result/E_Coli/visualization and result/B_Sub/visualization directories by default.

E_Coli

./roague.py -g E_Coli/genomes/ -b E_Coli/gene_block_names_and_genes.txt -r NC_000913 -f E_Coli/phylo_order.txt -m global

B_Sub

./roague.py -g B_Sub/genomes/ -b B_Sub/gene_block_names_and_genes.txt -r NC_000964 -f B_Sub/phylo_order.txt -m global

Run on example datasets for Evolutionary analysis of the bacterial gibberellin

Users can go to this Gibberellin and follow the instructions in the README.

Run on users' specific datasets

If users want to run the program on their own datasets, they have to provide the following inputs:

A directory that stores all the genome files to study, in GenBank format.
A gene block text file that stores the gene blocks of a reference species (this reference has to be in the genomes directory). The file is tab-delimited: the first column is the gene block name, followed by the names of its genes. For example, here is the gene_block_names_and_genes.txt file for Escherichia coli K-12 MG1655.

astCADBE	astA	astB	astC	astD	astE
atpIBEFHAGDC	atpI	atpH	atpC	atpB	atpA	atpG	atpF	atpE	atpD
caiTABCDE	caiA	caiE	caiD	caiC	caiB	caiT
casABCDE12	casE	casD	casA	casC	casB	cas1	cas2
chbBCARFG	chbG	chbF	chbC	chbB	chbA	chbR
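
To check a gene block file before a run, the tab-delimited format above can be read with a few lines of Python. This is a minimal sketch under the format described here (block name in the first column, gene names in the remaining columns); the function name `parse_gene_blocks` is hypothetical and is not part of ROAGUE.

```python
def parse_gene_blocks(lines):
    """Map each gene block name to the list of its gene names.

    `lines` is any iterable of tab-delimited strings, such as an open file.
    """
    blocks = {}
    for line in lines:
        # Split on tabs and drop empty fields left by trailing tabs.
        fields = [field.strip() for field in line.split("\t") if field.strip()]
        if len(fields) < 2:
            continue  # skip blank lines and block names without genes
        blocks[fields[0]] = fields[1:]
    return blocks


# Two lines taken from the example file above.
example = [
    "astCADBE\tastA\tastB\tastC\tastD\tastE\n",
    "chbBCARFG\tchbG\tchbF\tchbC\tchbB\tchbA\tchbR\n",
]
blocks = parse_gene_blocks(example)
print(blocks["astCADBE"])  # ['astA', 'astB', 'astC', 'astD', 'astE']
```

For a file on disk, pass the open file handle directly, for example `parse_gene_blocks(open("E_Coli/gene_block_names_and_genes.txt"))`.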

Run ROAGUE. The output is stored in the directory result.

./roague.py -g genomes_directory -b gene_block_names_and_genes.txt -r ref_accession -m global -o result

usage: roague.py [-h] [--genomes_directory GENOMES_DIRECTORY]
               [--gene_blocks GENE_BLOCKS] [--reference REFERENCE]
               [--filter FILTER] [--method METHOD] [--output OUTPUT]

optional arguments:
-h, --help            show this help message and exit
--genomes_directory GENOMES_DIRECTORY, -g GENOMES_DIRECTORY
                      The directory that store all the genomes file
                      (E_Coli/genomes)
--gene_blocks GENE_BLOCKS, -b GENE_BLOCKS
                      The gene_block_names_and_genes.txt file, this file
                      stores the operon name and its set of genes
--reference REFERENCE, -r REFERENCE
                      The ncbi accession number for the reference genome
                      (NC_000913 for E_Coli and NC_000964 for B_Sub)
--filter FILTER, -f FILTER
                      The filter file for creating the tree
                      (E_Coli/phylo_order.txt for E_Coli or
                      B_Sub/phylo_order.txt for B-Sub)
--method METHOD, -m METHOD
                      The method to reconstruc ancestral gene block, we
                      support either global or local
--output OUTPUT, -o OUTPUT
                      Output directory to store the result
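
When running ROAGUE on several datasets, it may be convenient to assemble the command from Python. The sketch below only uses the flags listed in the usage text above (`-g`, `-b`, `-r`, `-f`, `-m`, `-o`); the helper names `build_roague_command` and `run_roague` are hypothetical, and the defaults chosen here are assumptions rather than ROAGUE's own defaults.

```python
import subprocess


def build_roague_command(genomes_dir, gene_blocks, reference,
                         method="global", output="result", filter_file=None):
    """Build the argument list for one roague.py run."""
    if method not in ("global", "local"):
        raise ValueError("method must be 'global' or 'local'")
    command = ["./roague.py", "-g", genomes_dir, "-b", gene_blocks,
               "-r", reference, "-m", method, "-o", output]
    if filter_file is not None:
        # The filter file is optional, so -f is added only when given.
        command += ["-f", filter_file]
    return command


def run_roague(command):
    """Run the command from the repository root; raises if it fails."""
    subprocess.run(command, check=True)


command = build_roague_command(
    "E_Coli/genomes/",
    "E_Coli/gene_block_names_and_genes.txt",
    "NC_000913",
    filter_file="E_Coli/phylo_order.txt",
)
print(" ".join(command))
```

Passing the arguments as a list avoids shell quoting problems when directory names contain spaces. Call `run_roague(command)` to launch the run.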

Users can also provide a filter text file, which specifies the species to include in the reconstruction analysis. This is useful because some families of species may be overrepresented in the genomes directory, which reduces phylogenetic diversity and biases the ancestral reconstruction. Hence, it is recommended to run PDA on the generated tree before proceeding with the analysis. To do so, follow these instructions:

Generate a phylogenetic tree from the genomes directory

./create_newick_tree.py -G genomes_directory -o tree_directory -f NONE -r ref_accession

usage: create_newick_tree.py [-h] [-G DIRECTORY] [-o DIRECTORY] [-f FILE]
                          [-m STRING] [-t FILE] [-r REF] [-q]

optional arguments:
-h, --help            show this help message and exit
-G DIRECTORY, --genbank_directory DIRECTORY
                     Folder containing all genbank files for use by the
                     program.
-o DIRECTORY, --outfolder DIRECTORY
                     Directory where the results of this program will be
                     stored.
-f FILE, --filter FILE
                     File restrictiong which accession numbers this script
                     will process. If no file is provided, filtering is not
                     performed.
-r REF, --ref REF     The reference genome number, such as NC_000913 for E_Coli
-q, --quiet           Suppresses most program text outputs.

Download and install PDA. Debias the phylogenetic tree using the PDA program:

./debias.py -i tree_directory/out_tree.nwk -o pda_result.txt -s num -r ref_accession

usage: debias.py [-h] [-i INPUT_TREE] [-o PDA_OUT] [-s TREE_SIZE] [-r REF]


optional arguments:
-h, --help            show this help message and exit
-i INPUT_TREE, --input_tree INPUT_TREE
                     Input tree that we want to debias
-o PDA_OUT, --pda_out PDA_OUT
                     Output of pda to be store.
-s TREE_SIZE, --tree_size TREE_SIZE
                     Reduce the size of the tree to this size
-r REF, --ref REF     Force to include the following species, here I force
                     to include the reference species

Run ROAGUE. The output is stored in the directory result.

./roague.py -g genomes_directory -b gene_block_names_and_genes.txt -r ref_accession -f phylo_order.txt -m global -o result

Examples

Here are reconstructions of two gene blocks that were generated by our program.

1. Gene block paaABCDEFGHIJK:

This gene block contains genes involved in the catabolism of phenylacetate, and it is not conserved among the studied bacteria.

2. Gene block atpIBEFHAGDC:

The products of this gene block catalyze the synthesis of ATP from ADP and inorganic phosphate, and the block is highly conserved among the studied bacteria.

Credits

http://bioinformatics.oxfordjournals.org/content/early/2015/04/13/bioinformatics.btv128.full
