Smith-Waterman Implementation Comparison

This repository contains material for comparing the performance of implementations of the Smith-Waterman algorithm, widely used as a step in genome assembly.

Installation

sudo apt install xsltproc gengetopt libtbb-dev

Compilation

cmake -GNinja -DCMAKE_MODULE_PATH=/usr/share/cmake/Modules -DCMAKE_BUILD_TYPE=Release ..

Selection Criteria

A "good" implementation of the Smith-Waterman algorithm for our purposes must possess the following properties.

Able to run on a GPU.
Suitable for reads of our lengths: ~10K-100K (long reads) and ~100 vs. ~1K-10K (short reads vs. contigs).
Utilizes the CPU as well (discuss this)
Separable: not so deeply integrated with another codebase as to require excessive dependencies or to operate like a black box. X. TODO

Possible performance metrics: GCUPS, PPW (Performance per Watt), an analog of arithmetic intensity using GCUPS in place of FLOPS...
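GCUPS (giga cell updates per second) counts the dynamic-programming cells filled per second, so it follows directly from the sequence lengths and a measured wall time. A minimal sketch, assuming one query is compared against a database with a known total residue count; the function names are illustrative and not part of this repository.

```python
def gcups(query_len: int, db_residues: int, seconds: float) -> float:
    """Giga cell updates per second for one query against a database."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    # One dynamic-programming cell per (query residue, database residue) pair.
    cells = query_len * db_residues
    return cells / seconds / 1e9


def gcups_per_watt(gcups_value: float, watts: float) -> float:
    """Performance per watt, using GCUPS as the performance figure."""
    if watts <= 0:
        raise ValueError("watts must be positive")
    return gcups_value / watts


if __name__ == "__main__":
    # Hypothetical run: 500-residue query, 190M database residues, 1 second.
    print(gcups(500, 190_000_000, 1.0))
```

The power figure for performance per watt has to come from an external measurement; nothing here estimates it.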

Candidate Implementations

Smith-Waterman Comparison Matrix

Key:

R = Paper rating. 0=very, very bad. 9=Quite good, actually.
?:? = Problem being solved. 1:1, 1:Many, Many:1, Many:Many

ID                | R  | Software Name | doi                           | Architecture            | Compiles   | ?:?  | TargetLength BP | CUPS             | Architectural Notes   | Files | Blanks | Comments | Code   | Claims Faster Than | License     | Source dir     | Homepage
Steinfadt2009     |    | SWAMP         | 10.1109/OCCBIO.2009.12        | ASC                     |            |      |                 |                  |                       |       |        |          |        | TODO               |             |                |  | Uncontacted
Steinfadt2013     |    | SWAMP         | 10.1016/j.parco.2013.08.008   | ASC                     |            |      |                 |                  |                       |       |        |          |        | TODO               |             |                |  | Uncontacted
------------------|----|---------------|-------------------------------|-------------------------|------------|------|-----------------|------------------|-----------------------|-------|--------|----------|--------|--------------------|-------------|----------------|---------
Farrar2007        |    |               | 10.1093/bioinformatics/btl582 | CPU-SSE2                |            |      |                 | 3.0G             |                       |       |        |          |        | TODO               | TODO        | --             |
Szalkowski2008    |    | SWPS3         | 10.1186/1756-0500-1-107       | CPU-SSE2                | Yes        |      |                 |                  |                       | 23    | 581    | 1360     | 2696   | TODO               | MIT         | szalkowski2008 | https://lab.dessimoz.org/swps3/
Rumble2009        |    | SHRIMP        | 10.1371/journal.pcbi.1000386  | CPU-SIMD                | Yes        |      |                 |                  |                       |       |        |          |        | TODO               | MIT?        | shrimp         | http://compbio.cs.toronto.edu/shrimp/
David2011         |    | SHRIMP2       | 10.1093/bioinformatics/btr046 | CPU-SIMD                | Yes        |      |                 |                  |                       | 108   | 4347   | 3854     | 24752  | TODO               | MIT?        | shrimp         | http://compbio.cs.toronto.edu/shrimp/
Rognes2011        |    | SWIPE         | 10.1186/1471-2105-12-221      | CPU-SSSE3               | Fixable    |      |                 |                  |                       | 15    | 1889   | 808      | 9899   | Farrar2007         | AGPL-3.0    | rogness2011    |  | Emailed ORNL help staff about getting MPIC++ on Titan.
Rucci2014         |    | SWIMM         | 10.1109/CLUSTER.2014.6968784  | CPU-Xeon Phi            | Error      |      |                 |                  |                       | 16    | 789    | 774      | 3542   | TODO               | Unspecified | rucci2015      |
Zhao2013          |    | SSW           | 10.1371/journal.pone.0082138  | CPU-SIMD                | Yes        |      |                 |                  |                       | 11    | 380    | 694      | 2356   | TODO               | MIT         | zhao2013       |
Rucci2015         |    | SWIMM         | 10.1002/cpe.3598              | CPU-Xeon Phi            | Fixable    |      |                 |                  |                       |       |        |          |        | TODO               | Unspecified | rucci2015      |
Sjolund2016       |    | DiagonalSW    | software-no-paper             | CPU-SSE4/AltiVec        | Yes        |      |                 |                  |                       | 19    | 321    | 72       | 1322   | TODO               | MIT         | sjolund2016    | http://diagonalsw.sourceforge.net/
bowtie2           |    |               |                               |                         |            |      |                 |                  |                       |       |        |          |        |                    |             | bowtie2        | http://bowtie-bio.sourceforge.net/bowtie2/faq.shtml
------------------|----|---------------|-------------------------------|-------------------------|------------|------|-----------------|------------------|-----------------------|-------|--------|----------|--------|--------------------|-------------|----------------|---------
Liu2006           |    |               | 10.1007/11758549_29           | GPU-OpenGL              |            |      |                 |                  |                       |       |        |          |        | TODO               | TODO        | --             |
------------------|----|---------------|-------------------------------|-------------------------|------------|------|-----------------|------------------|-----------------------|-------|--------|----------|--------|--------------------|-------------|----------------|---------
Munekawa2008      | 9  |               | 10.1109/BIBE.2008.4696721     | GPU-CUDA                |            | 1:M  | 63-511v362 90M  | 5.65G            |                       |       |        |          |        |                    |             |                |  | Emailed for source code on 2018-06-19. y-munekw address is dead.
Liu2009           |    |               | 10.1186/1756-0500-2-73        | GPU-CUDA                |            | M:M? |                 |                  |                       |       |        |          |        | TODO               |             |                | http://cudasw.sourceforge.net/homepage.htm#latest | CUDASW++2 and CUDASW++3 likely obviate the need to track down this code.
Akoglu2009        | 9  |               | 10.1007/s10586-009-0089-8     | GPU-CUDA                | Yes        | 1:M  | 64 v 1024       |                  |                       | 3     | 488    | 171      | 445    | TODO               |             | striemer2009   |  | Code likely the same as striemer2009
Ligowski2009      |    |               | 10.1109/IPDPS.2009.5160931    | GPU-CUDA                |            | M1:1 |                 |                  |                       |       |        |          |        | Manavski2008       |             |                |  | Emailed for source code on 2018-06-19. Witold replied 2018-06-19. Sent further request back on 2018-06-19.
Striemer2009      |    | GSW           | 10.1109/IPDPS.2009.5161066    | GPU-CUDA                | Yes        | 1:M  | 64 v 1024       |                  |                       | 3     | 488    | 171      | 445    | TODO               | Custom      | striemer2009   | http://www2.engr.arizona.edu/~rcl/SmithWaterman.html
Ling2009          |    |               | 10.1109/SASP.2009.5226343     | GPU-CUDA                |            | S1:1 |                 |                  |                       |       |        |          |        | TODO               |             |                |
Liu2010           |  T | CUDASW++ 2.0  | 10.1186/1756-0500-3-93        | GPU-CUDA                | Yes        | M:M  |                 |                  |                       | 23    | 1821   | 2356     | 9174   | TODO               | GPLv2       | liu2010        | http://cudasw.sourceforge.net/homepage.htm#latest
Khajeh-Saeed2010  |    |               | 10.1016/j.jcp.2010.02.009     | GPU-CUDA                |            | S1:1 |                 |                  |                       | 28    | 776    | 553      | 3459   | TODO               | Unknown     |                |
Hains2011         | 6  |               |                               | GPU-CUDA                |            | --   |                 |                  |                       |       |        |          |        |                    |             |                |
Klus2012          |  T | BarraCUDA     | 10.1186/1756-0500-5-27        | GPU-CUDA                | Yes        | M:1  | 70 v 102M       | Unlisted         | Tesla M2050,M2090     | 54    | 1953   | 2772     | 12653  | TODO               | MIT/GPLv3   | klus2012       | http://seqbarracuda.sourceforge.net/
Pankaj2012        |  T | SWIFT         |                               | GPU-CUDA                | Yes        | ???  |                 |                  |                       | 121   | 5087   | 9662     | 32724  | TODO               | GPL-2.0     | pankaj2012     |
Venkatachalam2012 | 9  |               |                               | GPU-CUDA                |            | --   |                 |                  |                       |       |        |          |        |                    |             |                |
Dicker2014        | 6  |               |                               | GPU-CUDA                |            | S1:1 |                 |                  | GTX 460               |       |        |          |        | TODO               |             |                |
Okada2015         | 9T | SW#           | 10.1186/s12859-015-0744-4     | GPU-CUDA                | Yes        | M:M  | 5M v 5M         | 66G (1) 202G (2) | ????                  | 65    | 6537   | 3914     | 17665  | TODO               |             | okada2015      | http://www-hagi.ist.osaka-u.ac.jp/research/code/
Huang2015         | 9  |               | 10.1155/2015/185179           | GPU-CUDA                |            | 1:M  |                 |                  | Tesla C1060, K20      |       |        |          |        | TODO               |             |                |  | TODO: Should contact
nvbio_sw          |    | nvbio         | github.com/NVlabs/nvbio       | GPU-CUDA                | Yes        |      |                 |                  |                       | 712   | 31494  | 55870    | 144472 | TODO               | BSD-3       | nvbio_sw       | https://nvlabs.github.io/nvbio/
ugene             |    | ugene         |                               | GPU-CUDA                | Error      |      |                 |                  |                       | 6064  | 168501 | 220213   | 929208 | TODO               | GPLv2       | ugene          | http://ugene.net/download.html
------------------|----|---------------|-------------------------------|-------------------------|------------|------|-----------------|------------------|-----------------------|-------|--------|----------|--------|--------------------|-------------|----------------|---------
Manavski2008      | 7  | SWCUDA        | 10.1186/1471-2105-9-S2-S10    | GPU-CUDA + CPU-SSE      | RequiresQt | M1:1 |                 |                  |                       | 68    | 3974   | 2861     | 8715   | TODO               | TODO        | manavski2008   | http://bioinformatics.cribi.unipd.it/cuda/swcuda.html
Liu2013           | 9T | CUDASW++ 3.0  | 10.1186/1471-2105-14-117      | GPU-CUDA + CPU-SSE      | Yes        | M:M  | 5k v 35k: 190M  | 119G (1) 186G(2) | GeForce GTX 680, 690  | 21    | 642    | 568      | 4476   | TODO               | GPLv2       | liu2013        | http://cudasw.sourceforge.net/homepage.htm#latest
Luo2013           |    | SOAP3         | 10.1371/journal.pone.0065632  | GPU-CUDA + CPU          | Yes        | 1:M? |                 |                  | TesC2070,M2050;GTX680 | 215   | 14057  | 16852    | 74183  | TODO               | GPLv2+      | luo2013        | http://www.cs.hku.hk/2bwt-tools/soap3-dp/
Marcos2014        |    |               |                               | GPU-CUDA + CPU          |            | ??   |                 |                  |                       |       |        |          |        | TODO               |             |                |
Warris2018        |    | pyPaSWAS      | 10.1371/journal.pone.0190279  | GPU-CUDA + CPU + Python |            | M:M  |                 |                  |                       | 39    | 1120   | 1437     | 4766   | TODO               | MIT         | warris2018     |

TODO:

Liu2014           |   | GSWABE
Liu2014b          |   | CUSHAW2-GPU
Ren2019           |   |
Muller2019        |   | AnySeq

Ruled out:

Warris2015        |   | PaSWAS           | 10.1371/journal.pone.0122524  | GPU-CUDA                | Yes      | M:M |              |                  |                     | 19    | 1239   | 652      | 5128   | TODO               | MIT         | warris2015     |
Sandes2010        |   | MASA             | 10.1145/1693453.1693473       | GPU-CUDA                |          | S1:1|              |                  |                     |       |        |          |        | TODO               |             |                | https://github.com/edanssandes/MASA-Core/wiki         | There are *many* papers from this group.
Sandes2011        |   | MASA             | 10.1109/IPDPS.2011.114        | GPU-CUDA                |          | S1:1|              |                  |                     |       |        |          |        | TODO               |             |                | https://github.com/edanssandes/MASA-Core/wiki         | There are *many* papers from this group.
Sandes2013        |   | CUDAlign2.1      | 10.1109/TPDS.2012.194         | GPU-CUDA                | Yes (3.9)| S1:1| 162kBP-59MBP |                  |                     |       |        |          |        |                    | GPLv3       |                |                                                       | edans@cic.unb.br email is dead.
Sandes2014_hetero |   | MASA             | 10.1145/2555243.2555280       | GPU-CUDA                |          | S1:1|              |                  |                     |       |        |          |        |                    | GPLv3       |                |                                                       |
Sandes2014        |   | MASA-CUDAlign3.0 | 10.1109/CCGrid.2014.18        | GPU-CUDA                | Yes (3.9)| S1:1|       228MBP |                  |                     |       |        |          |        |                    | GPLv3       |                |                                                       |
Sandes2016_masa   |   | MASA             | 10.1145/2858656               | GPU-CUDA                |          | S1:1|              |                  |                     |       |        |          |        |                    | GPLv3       |                |                                                       |
Sandes2016        | 9 | MASA-CUDAlign4.0 | 10.1109/TPDS.2016.2515597     | GPU-CUDA                | NoSource | S1:1|       249MBP |  10.37T (384)    |                     |       |        |          |        |                    | GPLv3       |                |                                                       |

Sandes* only aligns two very long sequences.

Reviews:

Muhammadzadeh2014 |   | 
Pandey2015        | 1 | 10.9790/0661-17264852
Liu2013_review    |   | 10.5220/0004191202680271

Other methods:

Myers1986
Aluru2002:     parallel prefix computation
Rajko2004:     Improves on the techniques from Aluru2002
Boukerche2007: MPI-based method
Zhang2000:     Greedy algorithm

Background:

Gotoh1982
Hirschberg1975

Summary of Algorithmic Tricks/Improvements

Search space reduction
- Zhang2000: Greedy algorithm for sequences with low error rates
- Boukerche2007: Block pruning
- Sandes2013: block pruning
- Okada2015: Banded
- Okada2015: "interpair pruning"
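
The banded entry above restricts computation to cells near the main diagonal. A generic sketch of a banded Smith-Waterman score with a linear gap penalty follows; it is written for illustration and is not Okada2015's implementation. The scoring values and the `band` parameter are assumptions.

```python
def banded_sw_score(a, b, band, match=2, mismatch=-1, gap=-2):
    """Best local alignment score using only cells with |i - j| <= band."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        # Cells outside the band stay at 0, which local alignment allows anyway.
        cur = [0] * (len(b) + 1)
        lo = max(1, i - band)
        hi = min(len(b), i + band)
        for j in range(lo, hi + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            cur[j] = max(0, prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best


if __name__ == "__main__":
    # The match sits two columns off the diagonal, so band=0 misses it.
    print(banded_sw_score("ACGT", "TTACGT", band=0))
    print(banded_sw_score("ACGT", "TTACGT", band=2))
```

The banded score is a lower bound on the full Smith-Waterman score; it equals the full score once the band covers the optimal alignment.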
Query profile (uses texture cache):
- Farrar2007: Variant-striped
- Manavski2008: Uses it in a standard way. Has a decent diagram.
- Akoglu2009: Criticizes Manavski2008 usage. Query profile too large for texture cache, leads to cache misses.
- Liu2010: (discusses sequential vs striped)
- Hains2011
- Rognes2011: Variant-sequential
- Venkatachalam2012: Query profile reduces random access to substitution matrix with sequential profile access
- Ling2009
- Striemer2009
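
The query profile idea in the entries above can be sketched as follows: substitution scores are precomputed per (residue, query position), so the inner loop reads one profile row sequentially instead of indexing the substitution matrix at random. This is a scalar illustration only; `simple_score`, the alphabet and the gap penalty are assumptions, and no texture cache or SIMD striping is modelled.

```python
def simple_score(x, y):
    """Toy substitution score (assumption): +2 match, -1 mismatch."""
    return 2 if x == y else -1


def build_query_profile(query, alphabet, score=simple_score):
    """profile[r][i] is the score of aligning residue r with query[i]."""
    return {r: [score(r, q) for q in query] for r in alphabet}


def sw_score_with_profile(query, subject, alphabet="ACGT", gap=-2):
    profile = build_query_profile(query, alphabet)
    prev = [0] * (len(query) + 1)
    best = 0
    for residue in subject:
        row = profile[residue]  # one lookup per subject residue
        cur = [0] * (len(query) + 1)
        for j in range(1, len(query) + 1):
            # row[j - 1] replaces a substitution-matrix lookup per cell.
            cur[j] = max(0, prev[j - 1] + row[j - 1], prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best


if __name__ == "__main__":
    print(build_query_profile("ACG", "ACGT")["A"])
    print(sw_score_with_profile("ACGT", "TTACGT"))
```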
Data layouts:
- Munekawa2008: Notes that local memory (LM) cannot be accessed in a coalesced manner but is the implicit fallback when too few registers are available, so it is better to use global memory (GM) explicitly than to let LM be used implicitly.
- Munekawa2008: Sort sequences by length
- Liu2009: Sort sequences by length
- Liu2009: Achieves coalesced memory access by arranging subject sequences so their elements are vertical in a matrix and the subjects are ordered from left to right in order of length
- Liu2009: Coalesced global memory access
- Liu2009: Divides matrix into cell blocks which reduces load/store counts. Not too well explained
- Munekawa2008: Stores (k-1) antidiagonal in shared memory (multiple threads access it) and (k-2) and current antidiagonal in registers (only accessed by a single thread)
- Munekawa2008: Stores query sequence in constant memory, since all threads refer to it
- Munekawa2008: Stores database sequences in texture memory, possibly only because they take a lot of memory. Not a clear rationale.
- Manavski2008: Pack char data into integers (4 per int) to make efficient use of local memory accesses.
- Akoglu2009: Puts both query sequence and substitution matrix in constant memory because: "reading from the constant cache is as fast as reading from a register if all threads read the same address, which is the case when reading values from the query sequence"
- Akoglu2009: Rearranges the substitution matrix for efficient access
- Liu2010: Packed data format to better leverage query profile
- Liu2013: Sorting the database and queries by length
- Huang2015: Interleaving sequences in memory for coalesced access
- Ligowski2009: Storing scores and backtracking data both in 4-byte integers
- Khajeh2010: Reformulates the antidiagonal as a row, allowing for coalesced memory access. Gaps are implemented using a parallel prefix scan.
- Ling2009: Improves over Munekawa and Manavski by separating computation of the alignment matrix into multiple parts if the number of threads and size of local memory are not sufficient, allocating resources to each submatrix in turn
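
Two of the layout ideas above, sorting subjects by length and interleaving them so that neighbouring threads read neighbouring addresses, can be sketched on the host side as follows. This is an illustration of the layout only, not code from Liu2009 or Huang2015; the padding character and function name are assumptions.

```python
def interleave_sorted(subjects, pad="*"):
    """Sort subjects by length, then store residue k of every sequence contiguously.

    Returns (flat, n): residue k of sequence t is at flat[k * n + t], so
    threads t and t + 1 reading position k touch adjacent elements.
    """
    ordered = sorted(subjects, key=len)
    n = len(ordered)
    width = len(ordered[-1]) if ordered else 0
    padded = [s.ljust(width, pad) for s in ordered]
    # zip(*padded) yields, for each position k, the k-th residue of every sequence.
    flat = "".join("".join(column) for column in zip(*padded))
    return flat, n


if __name__ == "__main__":
    flat, n = interleave_sorted(["ACGT", "AC", "GGG"])
    print(flat, n)
```

Sorting first keeps sequences of similar length together, which limits the padding a group needs.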
Input-size dependent choice of algorithms:
- Hains2011: Switching between interthread and intrathread parallelism as sequence size changes
- Dicker2014: Parallel prefix versus diagonal wavefront
- Luo2013: If all sequences are within 1% of each other's lengths, sequences are allocated statically. Otherwise an atomic increment is used to reallocate sequences to processors as processing completes.
- Liu2009: Switches between interthread and intrathread parallelism
Speculation:
- Liu2010: Speculative calculation of H scores before F dependencies available (CUDASW++2.0)
- Farrar2007: For most cells in alignment matrix, F remains at zero and does not contribute to H. Only when H is greater than Ginit+Gext will F start to influence the value of H. So F is not considered initially. If required, a second step tries to correct the introduced errors. Manavski2008 claim their solution, which doesn't use this optimization, runs faster than Farrar2007.
- Ligowski2009: Only store score information and only as a single byte. Reprocess those sequences which were sufficiently high-scoring using a full algorithm.
Storage reduction:
- Munekawa2008: Stores only three anti-diagonals
- Munekawa2008: Packs sequences into vector data formatted in type char4. Four succeeding columns are assigned to each thread.
- Manavski2008: Pack bytes into integers; integer types had just become available
- Sandes2013: Using Myers-Miller for linear space
- Huang2015: Saving only the most recent rows/columns/diagonals rather than the whole dynamic programming matrix
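
Keeping only three anti-diagonals works because a cell on anti-diagonal d depends only on anti-diagonals d-1 and d-2. A minimal score-only sketch with a linear gap penalty follows; it illustrates the storage scheme rather than any paper's CUDA kernel, and the scoring values are assumptions.

```python
def sw_score_antidiagonal(a, b, match=2, mismatch=-1, gap=-2):
    """Smith-Waterman score storing only three anti-diagonals (indexed by row i)."""
    n, m = len(a), len(b)
    prev2 = [0] * (n + 1)  # anti-diagonal d - 2
    prev1 = [0] * (n + 1)  # anti-diagonal d - 1
    best = 0
    for d in range(2, n + m + 1):  # d = i + j with 1-based i, j
        cur = [0] * (n + 1)
        # All cells on one anti-diagonal are independent of each other,
        # which is what lets a GPU compute them in parallel.
        for i in range(max(1, d - m), min(n, d - 1) + 1):
            j = d - i
            s = match if a[i - 1] == b[j - 1] else mismatch
            cur[i] = max(
                0,
                prev2[i - 1] + s,   # H[i-1][j-1]
                prev1[i - 1] + gap, # H[i-1][j]
                prev1[i] + gap,     # H[i][j-1]
            )
            best = max(best, cur[i])
        prev2, prev1 = prev1, cur
    return best


if __name__ == "__main__":
    print(sw_score_antidiagonal("ACGT", "TTACGT"))
```

Only the score is recovered this way; a traceback needs extra stored information or a linear-space scheme such as Myers-Miller.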
Processing order:
- Guan1994: Divide-and-conquer for Myers-Miller
- Hains2011: Filling matrix in columns to increase utilization and decrease global memory accesses
- Venkatachalam2012: Briefly mentions that assigning multiple rows per thread reduces synchronization costs
Fine-tuning block/thread counts:
- Sandes2013
Available as a library:
- Okada2015: Example code included
SIMD instructions:
- Liu2013: Four adjacent subject sequences from pre-sorted list are assigned to a single thread, each vector lane corresponds to a sequence. Two-dimensional sequence profile is created.
- Venkatachalam2012: Short vectors can be used to read and manipulate four values at once, rather than using one thread per cell
Use of CPU and GPU:
- Liu2013
- Luo2013
- Warris2018
- Marcos2014
Use of local memory:
- Luo2013: 512kB per-thread local memory is used to store one row for matrices H and E.
Multi-GPU:
- Sandes2014: Splits data into short-phase and long-phase to minimize time spent waiting by downstream GPUs for communication from upstream
Calculation time prediction equation:
- Sandes2014:
Pipelining:
- Venkatachalam2012: Data can be loaded to GPU while other alignments are happening
Tricks:
- Using the modulus operator is extremely inefficient on CUDA
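
A common workaround, not taken from any of the papers above, is to make buffer sizes a power of two so that the modulus becomes a bit mask. The Python below only demonstrates the arithmetic identity for non-negative indices; the performance claim concerns CUDA and is not measured here. The function names are illustrative.

```python
def wrap_mod(i: int, size: int) -> int:
    """Ring-buffer index using the modulus operator."""
    return i % size


def wrap_mask(i: int, size: int) -> int:
    """Same index using a bit mask; valid only when size is a power of two."""
    if size <= 0 or size & (size - 1):
        raise ValueError("size must be a positive power of two")
    return i & (size - 1)


if __name__ == "__main__":
    print(all(wrap_mod(i, 8) == wrap_mask(i, 8) for i in range(100)))
```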
Recompile GPU code on the fly:
- Warris2018
Use of BWT:
- Klus2012:
Ignored:
- Liu2006: Implemented in OpenGL, so the techniques are no longer really relevant
- Pankaj2012: Only a PowerPoint presentation is available
- Sandes2014: Targets only very long sequences
Misc:
- Sandes2013: Myers-Miller used to find midpoint of LCS
- Parallel (prefix?) scan
- Tiling
- Blazewicz boolean matrices
- Block pruning
- Burrows-Wheeler transform? (Klus2012)

Summaries of papers and implementation notes

bowtie2

module load tbb
export CPATH=/lustre/atlas/sw/tbb/43/sles11.3_gnu4.8.2/source/include
make -j 8

Szalkowski2008 SWPS3 - fast multi-threaded vectorized Smith-Waterman for IBM Cell/B.E. and x86/SSE2

mkdir build
cd build
cmake ..
make

Striemer2009

module load cudatoolkit
qsub -I -A CSC261 -l nodes=1,walltime=30:00
nvcc   -I. -Iinc *cu *cpp inc/*cpp -L${CRAY_LD_LIBRARY_PATH}  -lcudart

SmithWaterman_kernel.cu needs to be edited to hold a query sequence

Manavski2008

#Acquire CUDA 6.5

    wget http://developer.download.nvidia.com/compute/cuda/6_5/rel/installers/cuda_6.5.14_linux_64.run

#Install it to (you may need to `mkdir -p` this directory)

    $HOME/os/cuda-6.5/

#Try compiling:

    module unload pgi
    module remove cudatoolkit
    module load cmake
    module load gcc/4.8.2
    export PATH="$HOME/os/cuda-6.5/bin:$PATH"
    export LIBRARY_PATH="$HOME/os/cuda-6.5/lib64"
    ./comp_cu.sh

Liu2006 GPU Accelerated Smith-Waterman

Farrar2007 Striped Smith-Waterman speeds database searches six times over other SIMD implementations

Manavski2008 CUDA compatible GPU cards as efficient hardware accelerators for Smith-Waterman sequence alignment

Rumble2009

make -f Makefile

David2011

make -f Makefile

Liu2010

Compilation succeeded with

module load cudatoolkit/9.1.85_3.10-1.0502.df1cc54.3.1
make

Rognes2011 Faster Smith-Waterman database searches with inter-sequence SIMD parallelisation

Klus2012

Compilation succeeded with

module load cudatoolkit/9.1.85_3.10-1.0502.df1cc54.3.1
#Minor makefile adjustment to NVCC path
make

Set MAX_SEQUENCE_LENGTH in barracuda.h

Requires preprocessing the reference sequence using a Burrows-Wheeler transform.
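
For orientation, the transform itself can be shown with a naive sorted-rotations construction. This sketch only illustrates what the preprocessing computes; it is far too slow for a real reference, and it is not how the tool builds its index. The `$` sentinel is an assumption.

```python
def bwt(text: str, sentinel: str = "$") -> str:
    """Naive Burrows-Wheeler transform via sorted rotations."""
    if sentinel in text:
        raise ValueError("sentinel must not occur in the text")
    s = text + sentinel  # the sentinel marks the end and sorts before A, C, G, T
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    # The transform is the last column of the sorted rotation matrix.
    return "".join(rotation[-1] for rotation in rotations)


if __name__ == "__main__":
    print(bwt("ACGT"))
```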

Pankaj2012 Swift: A GPU-based Smith-Waterman Sequence Alignment Program

Video: http://on-demand.gputechconf.com/gtc/2012/video/S0083-Swift-GPU-Based-Smith-Waterman-Sequence-Alignment-Program.flv

Compilation successful.

module load cudatoolkit/7.0.28-1.0502.10280.4.1
make

All query sequences must be the same length, but they can be padded with N.
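
A small helper for the padding step, assuming the queries are already in memory as strings; the function name is illustrative and not part of the tool.

```python
def pad_queries(queries, pad="N"):
    """Right-pad every query with `pad` to the length of the longest one."""
    width = max((len(q) for q in queries), default=0)
    return [q.ljust(width, pad) for q in queries]


if __name__ == "__main__":
    print(pad_queries(["ACGT", "AC", "GGG"]))
```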

Rucci2014

make

Sandes2014

Code compiles on Titan using the build.titan script in implementations/masa/masa-cudalign/.

Only aligns two very long sequences.

Sandes2016

Code for 4.0 doesn't seem to be available. TODO: email authors.

Only aligns two very long sequences.

Warris2015

PaSWAS, from Warris2015, needs to be compiled from source with the parameters of the input sequences. If the sequences are of different lengths, it would need to be compiled with the length of the longest one. Since the Antarctic data contains a sequence of length 5,279 this means that only a single sequence can fit on the GPU at a time.

Compiled with modifications to Makefile and inclusion of CUDA-deprecated header files.

cd PaSWAS/onGPU
module load cudatoolkit

Summit required minor modifications to the makefiles to point to the correct library paths, also:

module load cuda/9.0.69
module load gcc/6.4.0

Luo2013

Compilation succeeded with

module load cudatoolkit/9.1.85_3.10-1.0502.df1cc54.3.1
#Several fixes to the code and makefile
make

Requires preprocessing the reference sequence with a Burrows-Wheeler transform

Liu2013

For 50 sequences, it runs. For 100 sequences it fails, and keeps failing (presumably), until I load 2,698 sequences, and then everything's fine again.

module load cudatoolkit/9.1.85_3.10-1.0502.df1cc54.3.1
module load gcc/6.3.0
module unload pgi

Zhao2013

Compilation succeeded. Straightforward.

make

Okada2015: Titan

Seems to just work.

Compilation successful. Minor alterations of makefile required.

module load cudatoolkit/9.1.85_3.10-1.0502.df1cc54.3.1
module unload pgi
module load gcc/6.3.0
make

Available as a library.

The protein alignment option has to be used to get M:M alignment; otherwise only a single 1:1 alignment is performed.

Sjolund2016

wget ftp://ftp.gnu.org/gnu/gengetopt/gengetopt-2.22.tar.gz
tar xvzf gengetopt-2.22.tar.gz
cd gengetopt-2.22/
./configure --prefix=$HOME/os
#Add `#include <string.h>` to the top of `src/fileutils.cpp`
make -j 10
make install
export PATH="$HOME/os/bin:$PATH"

module load tbb
echo $TBB_COMPILE_FLAGS #Get path to TBB
export LIBRARY_PATH="/lustre/atlas/sw/tbb/43/sles11.3_gnu4.8.2/source/build/linux_intel64_gcc_cc4.8.2_libc2.11.3_kernel3.0.101_release/:$LIBRARY_PATH"

mkdir build
cd build
cmake ..
make -j 10

#Executable is in: build/src/c

Warris2018

pip3 install pycuda --user
pip3 install BioPython==1.71 --user
pip3 install numpy==1.14.3
pip3 install pyopencl= --user
#Anaconda Python 3.5.5
#Cuda 9.0.69

Build process seems to require Spack. Might be easier to use Docker. That is, this is likely to be forever a troublesome dependency.

nvbio

Fork says to use flag -DGPU_ARCHITECTURE=sm_XX with cmake. (Link)

nvbio repo says that support is for GCC 4.8 with CUDA 6.5 (Link).

An alternative repo at https://github.com/ngstools/nvbio doesn't exist any more.

#Acquire CUDA 6.5

    wget http://developer.download.nvidia.com/compute/cuda/6_5/rel/installers/cuda_6.5.14_linux_64.run

#Install it to (you may need to `mkdir -p` this directory)

    $HOME/os/cuda-6.5/

#Try compiling:

    module unload pgi
    module remove cudatoolkit
    module load cmake
    module load gcc/4.8.2
    export PATH="$HOME/os/cuda-6.5/bin:$PATH"
    export LIBRARY_PATH="$HOME/os/cuda-6.5/lib64"
    mkdir build
    cd build
    CXX=g++ CC=gcc cmake .. -DGPU_ARCHITECTURE=sm_35 -DCMAKE_INSTALL_PREFIX:PATH=$HOME/os
    make -j 10
    cd ..
    mkdir debug
    cd debug
    CXX=g++ CC=gcc cmake .. -DGPU_ARCHITECTURE=sm_35 -DCMAKE_INSTALL_PREFIX:PATH=$HOME/os -DCMAKE_BUILD_TYPE=Debug
    make -j 10

ugene

Seems to require Ubuntu or Fedora. Complicated build process, but a cool idea (generating a deb package on the fly).

Sites examined

All material on these sites has been examined and linked references downloaded.

Recordings of talks:

Test Data

Bulk Download

All the test files can be acquired quickly using the following commands:

wget https://svwh.dl.sourceforge.net/project/cudasw/data/simdb.fasta.gz -P data/
wget https://iweb.dl.sourceforge.net/project/cudasw/data/Queries.zip    -P data/
wget ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz -P data/
cat data/pacbio_human54x_files | xargs -n 1 -P 4 wget --continue -P data/

Downloads

The test data comes from the following sources:

http://cudasw.sourceforge.net/homepage.htm#installation : CUDASW++ search for "Example query sequences". Download the files simdb.fasta.gz and Queries.zip.
http://sourceforge.net/projects/cudasw/files/data : Same as above, but a more direct link.
The PacBIO Human54x files are drawn from here and linked to from here.
UniProt's Swiss-Prot database of proteins here.

Using the Test Data

CUDASW++ (2.0)

An example of running CUDASW++ (2.0) with an arbitrary query from Queries/ against the simdb.fasta database with all the default parameter values:

./cudasw -query Queries/P01008.fasta -db simdb.fasta

The example assumes CUDASW++ (2.0) is compiled as the executable "cudasw", cudasw is in $PATH, and Queries/ and simdb.fasta are in the current working directory (simply provide the absolute path if not).

See here for additional instructions and options for CUDASW++.

Generating synthetic data

Might be possible to use: https://github.com/seqan/seqan/tree/master/apps/mason2
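
Until a proper read simulator is set up, uniformly random reads are enough for smoke tests and timing runs. The sketch below is not Mason and has no error or genome model; the function names, the seed and the FASTA header prefix are assumptions.

```python
import random


def random_reads(count, min_len, max_len, seed=0, alphabet="ACGT"):
    """Generate `count` uniformly random reads with lengths in [min_len, max_len]."""
    rng = random.Random(seed)  # fixed seed keeps benchmark inputs reproducible
    reads = []
    for _ in range(count):
        length = rng.randint(min_len, max_len)
        reads.append("".join(rng.choices(alphabet, k=length)))
    return reads


def to_fasta(reads, prefix="synthetic"):
    """Format reads as FASTA text with numbered headers."""
    return "".join(f">{prefix}_{i}\n{seq}\n" for i, seq in enumerate(reads))


if __name__ == "__main__":
    # Long-read lengths from the selection criteria: roughly 10K to 100K.
    long_reads = random_reads(2, 10_000, 100_000, seed=42)
    print([len(r) for r in long_reads])
```

Random sequences share almost no similarity, so scores on such data are not representative; they only exercise the code paths.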

Misc

Omitted repos:

https://github.com/vgteam/gssw conflicts with Zhao2013 and is a generalization, so probably not needed

Reading sequence data:

https://bitbucket.org/aydozz/longreads/src/master/kmercode/fq_reader.c
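
For quick scripting around the FASTA test data, a minimal reader is often enough. This is a separate Python sketch, not a port of the linked file, and it handles plain FASTA only (no FASTQ, no compression).

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        elif name is not None:
            chunks.append(line)  # sequences may span several lines
    if name is not None:
        yield name, "".join(chunks)


if __name__ == "__main__":
    example = [">a desc", "AC", "GT", "", ">b", "TT"]
    for header, seq in read_fasta(example):
        print(header, len(seq))
```

An open file object can be passed directly, for example `read_fasta(open(path))`, since files iterate line by line.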

Running on Titan

To run on Titan, you'll need to first compile your code. The following, for example, shows how to compile Striemer2009.

module load cudatoolkit
nvcc   -I. -Iinc *cu *cpp inc/*cpp -L${CRAY_LD_LIBRARY_PATH}  -lcudart

You'll then need to either make a batch script or start an interactive batch job:

qsub -I -X -A CSC261 -q debug -l nodes=1,walltime=30:00

The only way to access compute nodes is via the aprun command, and this command can only be run from somewhere on the Lustre file system. Get there using (for example):

cd $MEMBERWORK/csc261
cd /lustre/atlas/scratch/spinyfan/csc261/

Finally, use aprun to run the program:

aprun ~/crd-swgpu/implementations/striemer2009/SmithWaterman/a.out

r-barnes/sw_comparison
