SwiftOrtho is orthology analysis tool which identifies orthologs, paralogs and co-orthologs for genomes. It is a graph-based approach.
Make sure that you have the following installed
$ git clone https://github.com/Rinoahu/SwiftOrtho
1. All-to-all homologous search:
$python SwiftOrtho/bin/find_hit.py -i input.fsa -d input.fsa -o input.fsa.sc -e 1e-5 -s 111111
-i|-d: protein sequences in fasta format. The identifier of each protein sequence in intput.fsa should be like this: >xxx|yyyy where xxx is the taxon code and yyyy is a sequence identifier. For example:
>A|a1
MENIHDLWERALAEMEKKVSKPSYETWLKSTKANDIQNDVITITAPNEFARDWLEEHYAG
LTSDTIEHLTGARLTPRFVIPQNELEDDFLIEPPKKKKPVSDNGSQNNGTKTMLNDKYTF
...
>A|b2
MQFTIQRDRFVHDVQNVAKAVSSRTTIPILTGIKIVADHEGVTLTGSDSDVSIETFIPKE
...
-o: output file. A tabular text file which contains 14 columns. The first 12 columns are the same as blastp -m8 format, the last 2 columns are the lengths of the query and target sequences. For example:
A|a1 A|a1 100.00 450 0 0 1 450 1 450 2.88e-261 897 450 450
A|a1 B|b2 53.52 340 158 0 111 450 240 579 1.60e-105 380 450 583
...
-e: expect value.
-s: space seed pattern.
2. Orthology inference:
$python SwiftOrtho/bin/find_orth.py -i input.fsa.sc -c 0.5 -y 0 > input.fsa.sc.orth
-i: input file. It is the output file of step 1 or blast -m8.
-c: threshold of alignment coverage of pairwise sequences.
-y: threshold of alignment identity of pairwise sequences.
>: output file. It is a tabular text file that contains 4 columns in the following format:
OT A|a1 B|b1 1.33510402833
IP A|a1 A|a2 1.23374340949
CO A|a2 B|b2 1.41459539212
...
Col 1: orthology relationship, one of OT(ortholog), CO(co-ortholog) and IP(in-paralog).
Col 2: identifier of gene in species A.
Col 3: identifier of gene in species B.
Col 4: weight of orthology relationship.
3. Cluster orthology relationships into groups:
SwiftOrtho implements two cluster algorithms including Markov Clustering and Affinity Aropagation in python. Here is an example of how to use it.
$python SwiftOrtho/bin/find_cluster.py -i input.fsa.sc.orth -a mcl -I 1.5 > input.fsa.sc.orth.mcl
-i: input file. Input file contains all the orthology relationships. For example:
OT A|a1 B|b1 1.33510402833
IP A|a1 A|a2 1.23374340949
CO A|a2 B|b2 1.41459539212
...
-a: algorithm to cluster. [mcl|apc]
-I: inflation parameter only for mcl.
>: output file. This file contains several rows. Each row stands for an orthologous group. In each row, there are gene identifiers of same or different species. For example:
A|a1 A|a2 B|b1 C|c1 D|d1
A|a3 B|b3 C|c3
A|a4 B|b4
A|a5 A|a6
...
If users prefer MCL, they can follow the steps below:
$cut -f2-4 input.fsa.sc.orth > input.fsa.sc.orth.xyz
$mcl input.fsa.sc.orth.xyz --abc -I 1.5 -o input.fsa.sc.orth.mcl -te 4
Run_all.py in directory scripts automatically implements the following steps:
- all-to-all homologous search.
- orthology inference.
- cluster orthology relationships into orthologous groups.
- perform a pan-genome analysis and estimate main features of pan-genome such as gene numbers of core|shared|specific, core size of pan-genome, openess...
- use conseved proteins to construct species phylogenetic tree.
- perform operonic clustering if the operonic information supplied.[optional]
Requirement:
- FastTree 2
- One of the following multiple sequence alignment tools:
- trimAl
Usage:
$python run_all.py -i test.fsa -p test.fsa.operon -a 4
-i: input file. protein sequences in fasta format. The identifier of each protein sequence in intput.fsa should be like this: >xxx|yyyy where xxx is the taxon code and yyyy is a sequence identifier.
-p: operonic annotation file. The 1st column of this file should be like x0-->x1-->x2-->x3 or x0<--x1<--x2<--x3. x# stand for gene identifiers and <-- or --> stands for strand of gene. For example:
A|a0-->A|a1 unknown-->COG1607 unknown::unknown-->I::Acyl-CoA hydrolase::Lipid transport and metabolism
B|b0<--B|b1 COG4644<--COG1961 X::Transposase and inactivated derivatives, TnpA family::Mobilome: prophages, transposons<--L::Site-specific DNA recombinase related to the DNA invertase Pin::Replication, recombination and repair
-a: number of multi-thread.
Results: Severl files are generated:
-
test.fsa.sc
result of all-vs-all homologous search. see example above in Usage section.
-
test.fsa.aln.trim
concatenation of trimmed aligned protein sequences of conserved genes.
-
test.fsa.nwk
species phylogenetic tree in newick format. the tree is constructed from the aligned protein sequences of the conserved gene.
-
test.fsa.opc
orthology relationships. see an example above in Usage section.
-
test.fsa.mcl
orthologous groups. see an example above in Usage section
-
test.fsa.operon.mcl
grouped operons which reflect conservation of operons across multiple species. This file contains several rows. Each row stands for an conserved operonic group. In each row, there are operonic information from same or different species. For example:
A1-->A2-->A3 B1<--B2<--B3 C1<--C2<--C3 A4<--A5<--A6 B4<--B5<--B6 C4<--C5<--C6 A7-->A8 B7<--B8<--B9 ....
-
test.fsa.pan
main features of pan-genome. For example:
# Statistics and profile of pan-genome: # The methods can be found in Hu X, et al. Trajectory and genomic determinants of fungal-pathogen speciation and host adaptation. # # statistic of core, shared and specific genes: # Feature core shared specific taxon # Number 27 2117 9766 5 # # ω(core size of pan-genome) and 95% confidence interval: # κc τc ω # 18001.747907101293±97986.86937584748 0.4604747552601067±0.5879003578202601 29.071595667457963±45.51565446328978 # # θ(new gene number for everay new genome sequenced) and 95% confidence interval: # κs τs tg(θ) # 1334.0072284367752±2342.5492209911768 2.2743910535524314±9.701652708550565 1952.605831944348±1311.6323603805986 # # κ(size and openess of pan-genome, open if γ > 0) and 95% confidence interval: # κ γ # 2899.5570049130965±179.58438208536737 0.8785342365438822±0.04423040927927408 # # Type and frequency of each gene group in different species: ################################################################################ #family type GCF_000005825.2_ASM582v2 GCF_000006645.1_ASM664v1 GCF_000006605.1_ASM660v1 GCF_000005845.2_ASM584v2 GC F_000006625.1_ASM662v1 group_000000000 Share 0 1 0 1 0 group_000000001 Specific 2 0 0 0 0 group_000000002 Specific 2 0 0 0 0 group_000000003 Specific 2 0 0 0 0 group_000000004 Specific 3 0 0 0 0 ...