Calculate the dissimilarity between sequences

cxialab/SeqDistK-Pyhon

 
 

For Windows Users

For Windows users, we recommend the Windows distribution, which has a convenient graphical user interface:
https://github.com/htczero/SeqDistK. It is C++ based and much faster than this Python distribution.

SeqDistK-Pyhon

Introduction

Phylogenetic tools are fundamental to studies of evolution and taxonomy. In this paper, we present SeqDistK, a novel tool for alignment-free phylogenetic analysis. SeqDistK batch-computes the pairwise distance matrix between biological sequences, using seven popular k-mer based dissimilarity measures. Based on the matrix, SeqDistK constructs a phylogenetic tree using the Unweighted Pair Group Method with Arithmetic Mean (UPGMA) algorithm. Using a gold-standard dataset of 16S rRNA sequences and the associated phylogenetic tree, we benchmarked the accuracy and efficiency of SeqDistK. We found that the measure d2S (k=5, M=2) performed best, correctly clustering and classifying all sequences. Compared to multiple sequence aligners such as MUSCLE, ClustalW2 and MAFFT, SeqDistK was tens to hundreds of times faster, which helps remove the computational limit encountered in large-scale phylogenetic analysis.
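
To make the idea of a k-mer based pairwise distance matrix concrete, here is a minimal pure-Python sketch. It is not SeqDistK's implementation. It assumes that the menu labels Ma, Ch and Eu stand for the Manhattan, Chebyshev and Euclidean distances, and that the distances are taken over relative k-mer frequencies; the reference paper may normalize differently. All function names are hypothetical.

```python
from collections import Counter
from itertools import product
import math


def kmer_frequencies(seq, k):
    """Return the relative frequency of every DNA k-mer over A, C, G, T."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    # k-mers containing other symbols (for example N) are ignored.
    total = sum(counts[w] for w in kmers)
    return [counts[w] / total if total else 0.0 for w in kmers]


def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def chebyshev(x, y):
    return max(abs(a - b) for a, b in zip(x, y))


def distance_matrix(seqs, k, measure):
    """Build the n-by-n matrix of pairwise distances for a list of sequences."""
    freqs = [kmer_frequencies(s, k) for s in seqs]
    n = len(freqs)
    return [[measure(freqs[i], freqs[j]) for j in range(n)] for i in range(n)]


if __name__ == "__main__":
    for row in distance_matrix(["ACGTACGT", "ACGTTTTT", "GGGGCCCC"], 2, euclidean):
        print(row)
```

The measures d2, Hao, d2S and d2Star are left out on purpose: d2S and d2Star also depend on a Markov background model of order M, and their exact definitions should be taken from the reference paper.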

Requirements

numpy 1.16
numba 0.43.1
tqdm
python 3.7

Miniconda is recommended. Anaconda3 also works.

Example

  1. Start the program
python main.py

Suppose you have N input sequence files in a directory whose path is '/home/seqs'.

  2. Input the directory path
Input the directory path of sequences : /home/seqs
  3. Input k, the k-mer size you want to compute (k > 0). See the reference paper for details on how to choose k.

For a single k, input an integer (> 0), such as 4

Input the k : 4

For a range of k, input kmin-kmax-step (without quotation marks). For example, '2-10-2' specifies k = [2, 4, 6, 8, 10]

Input the k : 2-10-2
  4. Choose the dissimilarity measures. See the reference paper for details on how to choose a measure.
0. Ma  
1. Ch  
2. Eu  
3. d2  
4. Hao  
5. d2S  
6. d2Star  
For example(without quotation marks), '1,2,3,4'  
Input the dissimilarities : 0,1,2,4,5
  5. If d2S or d2Star was chosen in step 4, you also need to give M, the order of the Markov background model. See the reference paper for details on how to choose M.

For a single M, input an integer (>= 0)

Input the possibility order : 2

For several values of M, separate them with ',' (without quotation marks). For example, '0, 1, 2, 3'

Input the possibility order : 0,1,2
  6. Input the path where you want to save the results. For example, "/home/save"
Input the path you want to save : /home/save
  7. Confirm that the parameters are correct before submitting the computation.
Check the parameters : 'yes' or 'no'  
yes  # input yes and press enter if the parameters are correct.
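
The input formats used in the prompts above can be illustrated with a short sketch. This is an assumption about how such strings could be parsed, not the code in main.py; the names parse_k, parse_int_list, MEASURES and needs_markov_order are hypothetical. The order of MEASURES follows the menu shown in the measure-selection step.

```python
# Menu order as printed by the program: index -> measure name.
MEASURES = ["Ma", "Ch", "Eu", "d2", "Hao", "d2S", "d2Star"]


def parse_k(text):
    """Parse '4' into [4], or 'kmin-kmax-step' such as '2-10-2' into [2, 4, 6, 8, 10]."""
    parts = [int(p) for p in text.strip().split("-")]
    if len(parts) == 1:
        ks = parts
    elif len(parts) == 3:
        kmin, kmax, step = parts
        if step <= 0 or kmin > kmax:
            raise ValueError("expected kmin <= kmax and step > 0")
        ks = list(range(kmin, kmax + 1, step))
    else:
        raise ValueError("expected a single integer or kmin-kmax-step")
    if any(k <= 0 for k in ks):
        raise ValueError("k must be greater than 0")
    return ks


def parse_int_list(text):
    """Parse a comma-separated list such as '0, 1, 2, 3' into [0, 1, 2, 3]."""
    return [int(p) for p in text.split(",") if p.strip()]


def needs_markov_order(measure_names):
    """Only d2S and d2Star require the Markov order M."""
    return any(name in ("d2S", "d2Star") for name in measure_names)


if __name__ == "__main__":
    chosen = [MEASURES[i] for i in parse_int_list("0,1,2,4,5")]
    print(parse_k("2-10-2"), chosen, needs_markov_order(chosen))
```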

Manuals

Structure of working directory

Single directory
|--dir
    |--seq_1.fasta
    |--seq_2.fasta
    |--seq_3.fasta
    |--seq_4.fasta
    :
    |--seq_n.fasta
For one directory with n sequences, you will obtain a CSV file containing an n-by-n matrix for each of the different conditions.

Multiple directories
|--root
    |--dir_1
    |--dir_2
    |--dir_3
    |--dir_4
    :
    |--dir_n
        |--seq_1.fasta
        |--seq_2.fasta
        |--seq_3.fasta
        |--seq_4.fasta
        :
        |--seq_n.fasta

Each dir_x is treated as a separate single-directory case.
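
The two layouts can be told apart with a few lines of standard-library code. The sketch below is an illustration only and is not taken from SeqDistK; the function name find_cases and the '.fasta' suffix filter are assumptions based on the file names shown in the trees above.

```python
from pathlib import Path


def find_cases(root, suffix=".fasta"):
    """Map each case name to its sorted list of sequence files.

    If root contains sequence files directly, it is one single-directory case.
    Otherwise every sub-directory holding sequence files is its own case.
    """
    root = Path(root)
    direct = sorted(p for p in root.iterdir() if p.is_file() and p.suffix == suffix)
    if direct:
        return {root.name: direct}
    cases = {}
    for sub in sorted(p for p in root.iterdir() if p.is_dir()):
        files = sorted(f for f in sub.iterdir() if f.is_file() and f.suffix == suffix)
        if files:
            cases[sub.name] = files
    return cases
```

Under this sketch, each entry of the returned dictionary would correspond to one set of n-by-n result matrices.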

FAQ

    Q1:  Can I pause the program while it is running?
    A1:  No

    Q2:  What is the range of k?
    A2:  k must be no more than 15 (k <= 15).

    Q3:  What is the difference between Windows version and this?
    A3:  The Windows version is written in C# and has a UI. Furthermore, because it uses multi-threading, the Windows version is faster than the Python version.

    Q4:  Can I use it in MacOS?
    A4:  Of course.
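
Regarding Q2, this README does not say why k is capped at 15. As background only, the sketch below shows how quickly the number of possible DNA k-mers grows with k. The 8 bytes per counter is an assumption made for illustration, not a statement about SeqDistK's internal data structures.

```python
def kmer_space_size(k, alphabet_size=4):
    """Number of distinct k-mers over an alphabet of the given size."""
    return alphabet_size ** k


if __name__ == "__main__":
    for k in (5, 10, 15):
        n = kmer_space_size(k)
        # Assumed 8 bytes per counter, purely for illustration.
        print(f"k={k}: {n} possible k-mers, {n * 8 / 2**30:.4f} GiB for a dense count vector")
```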
