This repository contains material for comparing the performance of implementations of the Smith-Waterman algorithm, widely used as a step in genome assembly.
sudo apt install xsltproc gengetopt libtbb-dev
Compilation
cmake -GNinja -DCMAKE_MODULE_PATH=/usr/share/cmake/Modules -DCMAKE_BUILD_TYPE=Release ..
A "good" implementation of the Smith-Waterman algorithm for our purposes must possess the following properties.
- Able to run on a GPU.
- Suitable for reads of our lengths: ~10K-100K (long reads) and ~100 vs. ~1K-10K (short reads vs. contigs).
- Utilizes the CPU as well (discuss this)
- Separable. Not so deeply integrated with another codebase as to require excessive dependencies or to operate like a black box. X. TODO
Possible performance metrics: GCUPS, PPW (Performance per Watt), an analog of arithmetic intensity using GCUPS in place of FLOPS...
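GCUPS (giga cell updates per second) counts how many dynamic-programming matrix cells are filled per second. The following is a minimal sketch of how the metric and a per-watt variant could be computed, assuming a single pairwise alignment whose matrix has `query_len * target_len` cells. The function names and the use of average power draw in watts are illustrative assumptions, not code from this repository.

```python
def gcups(query_len, target_len, seconds):
    """Giga cell updates per second for one pairwise alignment."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    cells = query_len * target_len  # one DP cell per (query, target) residue pair
    return cells / seconds / 1e9


def gcups_per_watt(gcups_value, average_watts):
    """Performance per watt, using GCUPS as the performance figure (assumed definition)."""
    if average_watts <= 0:
        raise ValueError("average_watts must be positive")
    return gcups_value / average_watts
```

For a 1:Many or Many:Many run, the cell count would be the sum of the per-pair products rather than a single product.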
Key:
- R = Paper rating. 0=very, very bad. 9=Quite good, actually.
- ?:? = Problem being solved. 1:1, 1:Many, Many:1, Many:Many
ID | R | Software Name | doi | Architecture | Compiles | ?:? | TargetLength BP | CUPS | Architectural Notes | Files | Blanks | Comments | Code | Claims Faster Than | License | Source dir | Homepage
Steinfadt2009 | | SWAMP | 10.1109/OCCBIO.2009.12 | ASC | | | | | | | | | | TODO | | | | Uncontacted
Steinfadt2013 | | SWAMP | 10.1016/j.parco.2013.08.008 | ASC | | | | | | | | | | TODO | | | | Uncontacted
------------------|---|------------------|-------------------------------|-------------------------|----------|-----|--------------|------------------|---------------------|-------|--------|----------|--------|--------------------|-------------|----------------|----------------------------------------------------------------------------------
Farrar2007 | | | 10.1093/bioinformatics/btl582 | CPU-SSE2 | | | | 3.0G | | | | | | TODO | TODO | -- |
Szalkowski2008 | | SWPS3 | 10.1186/1756-0500-1-107 | CPU-SSE2 | Yes | | | | | 23 | 581 | 1360 | 2696 | TODO | MIT | szalkowski2008 | https://lab.dessimoz.org/swps3/
Rumble2009 | | SHRIMP | 10.1371/journal.pcbi.1000386 | CPU-SIMD | Yes | | | | | | | | | TODO | MIT? | shrimp | http://compbio.cs.toronto.edu/shrimp/
David2011 | | SHRIMP2 | 10.1093/bioinformatics/btr046 | CPU-SIMD | Yes | | | | | 108 | 4347 | 3854 | 24752 | TODO | MIT? | shrimp | http://compbio.cs.toronto.edu/shrimp/
Rognes2011 | | SWIPE | 10.1186/1471-2105-12-221 | CPU-SSSE3 | Fixable | | | | | 15 | 1889 | 808 | 9899 | Farrar2007 | AGPL-3.0 | rogness2011 | | Emailed ORNL help staff about getting MPIC++ on Titan.
Rucci2014 | | SWIMM | 10.1109/CLUSTER.2014.6968784 | CPU-Xeon Phi | Error | | | | | 16 | 789 | 774 | 3542 | TODO | Unspecified | rucci2015 |
Zhao2013 | | SSW | 10.1371/journal.pone.0082138 | CPU-SIMD | Yes | | | | | 11 | 380 | 694 | 2356 | TODO | MIT | zhao2013 |
Rucci2015 | | SWIMM | 10.1002/cpe.3598 | CPU-Xeon Phi | Fixable | | | | | | | | | TODO | Unspecified | rucci2015 |
Sjolund2016 | | DiagonalSW | software-no-paper | CPU-SSE4/AltiVec | Yes | | | | | 19 | 321 | 72 | 1322 | TODO | MIT | sjolund2016 | http://diagonalsw.sourceforge.net/
bowtie2 | | | | | | | | | | | | | | | | bowtie2 | http://bowtie-bio.sourceforge.net/bowtie2/faq.shtml
------------------|---|------------------|-------------------------------|-------------------------|----------|-----|--------------|------------------|---------------------|-------|--------|----------|--------|--------------------|-------------|----------------|----------------------------------------------------------------------------------
Liu2006 | | | 10.1007/11758549_29 | GPU-OpenGL | | | | | | | | | | TODO | TODO | -- |
------------------|---|------------------|-------------------------------|-------------------------|----------|-----|--------------|------------------|---------------------|-------|--------|----------|--------|--------------------|-------------|----------------|----------------------------------------------------------------------------------
Munekawa2008 | 9 | | 10.1109/BIBE.2008.4696721 | GPU-CUDA | | 1:M | 63-511v362 90M | 5.65G | | | | | | | | | | Emailed for source code on 2018-06-19. y-munekw address is dead.
Liu2009 | | | 10.1186/1756-0500-2-73 | GPU-CUDA | | M:M? | | | | | | | | TODO | | | http://cudasw.sourceforge.net/homepage.htm#latest | CUDASW++2 and CUDASW++3 likely obviate the need to track down this code.
Akoglu2009 | 9 | | 10.1007/s10586-009-0089-8 | GPU-CUDA | Yes | 1:M | 64 v 1024 | | | 3 | 488 | 171 | 445 | TODO | | striemer2009 | | Code likely the same as striemer2009
Ligowski2009 | | | 10.1109/IPDPS.2009.5160931 | GPU-CUDA | | M1:1 | | | | | | | | Manavski2008 | | | | Emailed for source code on 2018-06-19. Witold replied 2018-06-19. Sent further request back on 2018-06-19.
Striemer2009 | | GSW | 10.1109/IPDPS.2009.5161066 | GPU-CUDA | Yes | 1:M | 64 v 1024 | | | 3 | 488 | 171 | 445 | TODO | Custom | striemer2009 | http://www2.engr.arizona.edu/~rcl/SmithWaterman.html
Ling2009 | | | 10.1109/SASP.2009.5226343 | GPU-CUDA | | S1:1 | | | | | | | | TODO | | | |
Liu2010 | T | CUDASW++ 2.0 | 10.1186/1756-0500-3-93 | GPU-CUDA | Yes | M:M | | | | 23 | 1821 | 2356 | 9174 | TODO | GPLv2 | liu2010 | http://cudasw.sourceforge.net/homepage.htm#latest
Khajeh-Saeed2010 | | | 10.1016/j.jcp.2010.02.009 | GPU-CUDA | | S1:1 | | | | 28 | 776 | 553 | 3459 | TODO | Unknown | |
Hains2011 | 6 | | | GPU-CUDA | | -- | | | | | | | | | | |
Klus2012 | T | BarraCUDA | 10.1186/1756-0500-5-27 | GPU-CUDA | Yes | M:1 | 70 v 102M | Unlisted | Tesla M2050,M2090 | 54 | 1953 | 2772 | 12653 | TODO | MIT/GPLv3 | klus2012 | http://seqbarracuda.sourceforge.net/
Pankaj2012 | T | SWIFT | | GPU-CUDA | Yes | ??? | | | | 121 | 5087 | 9662 | 32724 | TODO | GPL-2.0 | pankaj2012 |
Venkatachalam2012 | 9 | | | GPU-CUDA | | -- | | | | | | | | | | |
Dicker2014 | 6 | | | GPU-CUDA | | S1:1 | | | GTX 460 | | | | | TODO | | | |
Okada2015 | 9T | SW# | 10.1186/s12859-015-0744-4 | GPU-CUDA | Yes | M:M | 5M v 5M | 66G (1) 202G (2) | ???? | 65 | 6537 | 3914 | 17665 | TODO | | okada2015 | http://www-hagi.ist.osaka-u.ac.jp/research/code/
Huang2015 | 9 | | 10.1155/2015/185179 | GPU-CUDA | | 1:M | | | Tesla C1060, K20 | | | | | TODO | | | | TODO: Should contact
nvbio_sw | | nvbio | github.com/NVlabs/nvbio | GPU-CUDA | Yes | | | | | 712 | 31494 | 55870 | 144472 | TODO | BSD-3 | nvbio_sw | https://nvlabs.github.io/nvbio/
ugene | | ugene | | GPU-CUDA | Error | | | | | 6064 | 168501 | 220213 | 929208 | TODO | GPLv2 | ugene | http://ugene.net/download.html
------------------|---|------------------|-------------------------------|-------------------------|----------|-----|--------------|------------------|---------------------|-------|--------|----------|--------|--------------------|-------------|----------------|----------------------------------------------------------------------------------
Manavski2008 | 7 | SWCUDA | 10.1186/1471-2105-9-S2-S10 | GPU-CUDA + CPU-SSE | RequiresQt | M1:1 | | | | 68 | 3974 | 2861 | 8715 | TODO | TODO | manavski2008 | http://bioinformatics.cribi.unipd.it/cuda/swcuda.html |
Liu2013 | 9T | CUDASW++ 3.0 | 10.1186/1471-2105-14-117 | GPU-CUDA + CPU-SSE | Yes | M:M | 5k v 35k: 190M | 119G (1) 186G(2) | GeForce GTX 680, 690 | 21 | 642 | 568 | 4476 | TODO | GPLv2 | liu2013 | http://cudasw.sourceforge.net/homepage.htm#latest
Luo2013 | | SOAP3 | 10.1371/journal.pone.0065632 | GPU-CUDA + CPU | Yes | 1:M? | | | TesC2070,M2050;GTX680 | 215 | 14057 | 16852 | 74183 | TODO | GPLv2+ | luo2013 | http://www.cs.hku.hk/2bwt-tools/soap3-dp/ |
Marcos2014 | | | | GPU-CUDA + CPU | | ?? | | | | | | | | TODO | | | |
Warris2018 | | pyPaSWAS | 10.1371/journal.pone.0190279 | GPU-CUDA + CPU + Python | | M:M | | | | 39 | 1120 | 1437 | 4766 | TODO | MIT | warris2018 |
TODO:
Liu2014 | | GSWABE
Liu2014b | | CUSHAW2-GPU
Ren2019 | |
Muller2019 | | AnySeq
Ruled out:
Warris2015 | | PaSWAS | 10.1371/journal.pone.0122524 | GPU-CUDA | Yes | M:M | | | | 19 | 1239 | 652 | 5128 | TODO | MIT | warris2015 |
Sandes2010 | | MASA | 10.1145/1693453.1693473 | GPU-CUDA | | S1:1| | | | | | | | TODO | | | https://github.com/edanssandes/MASA-Core/wiki | There are *many* papers from this group.
Sandes2011 | | MASA | 10.1109/IPDPS.2011.114 | GPU-CUDA | | S1:1| | | | | | | | TODO | | | https://github.com/edanssandes/MASA-Core/wiki | There are *many* papers from this group.
Sandes2013 | | CUDAlign2.1 | 10.1109/TPDS.2012.194 | GPU-CUDA | Yes (3.9)| S1:1| 162kBP-59MBP | | | | | | | | GPLv3 | | | edans@cic.unb.br email is dead.
Sandes2014_hetero | | MASA | 10.1145/2555243.2555280 | GPU-CUDA | | S1:1| | | | | | | | | GPLv3 | | |
Sandes2014 | | MASA-CUDAlign3.0 | 10.1109/CCGrid.2014.18 | GPU-CUDA | Yes (3.9)| S1:1| 228MBP | | | | | | | | GPLv3 | | |
Sandes2016_masa | | MASA | 10.1145/2858656 | GPU-CUDA | | S1:1| | | | | | | | | GPLv3 | | |
Sandes2016 | 9 | MASA-CUDAlign4.0 | 10.1109/TPDS.2016.2515597 | GPU-CUDA | NoSource | S1:1| 249MBP | 10.37T (384) | | | | | | | GPLv3 | | |
Sandes*: only aligns two very long sequences.
Reviews:
Muhammadzadeh2014 | |
Pandey2015 | 1 | 10.9790/0661-17264852
Liu2013_review | | 10.5220/0004191202680271
Other methods:
Myers1986
Aluru2002: parallel prefix computation
Rajko2004: Improves on the techniques from Aluru2002
Boukerche2007: MPI-based method
Zhang2000: Greedy algorithm
Background:
Gotoh1982
Hirschberg1975
- Search space reduction
  - Zhang2000: Greedy algorithm for sequences with low error rates
  - Boukerche2007: Block pruning
  - Sandes2013: Block pruning
  - Okada2015: Banded
  - Okada2015: "interpair pruning"
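As an illustration of the banded idea, the sketch below scores a local alignment while only filling cells within `band` of the main diagonal. Cells outside the band are treated as zero. This is a generic banded Smith-Waterman with a linear gap penalty, not the scheme from Okada2015. The scoring values and the function name are assumptions chosen for the example.

```python
def sw_score_banded(a, b, band, match=2, mismatch=-1, gap=-2):
    """Best local alignment score using only cells with |i - j| <= band."""
    m, n = len(a), len(b)
    prev = [0] * (n + 1)  # previous row; out-of-band cells stay 0
    best = 0
    for i in range(1, m + 1):
        curr = [0] * (n + 1)
        # Only columns inside the band around the diagonal are computed.
        for j in range(max(1, i - band), min(n, i + band) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(0, prev[j - 1] + sub, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best
```

A band that is too narrow can miss the optimal alignment. For example, with `a = "TTACG"` and `b = "ACG"` the match lies two cells off the diagonal, so `band=0` finds nothing while `band=2` finds the full match.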
- Query profile (uses texture cache):
  - Farrar2007: Variant-striped
  - Manavski2008: Uses it in a standard way. Has a decent diagram.
  - Akoglu2009: Criticizes Manavski2008 usage. Query profile too large for texture cache, leads to cache misses.
  - Liu2010: (discusses sequential vs striped)
  - Hains2011
  - Rognes2011: Variant-sequential
  - Venkatachalam2012: Query profile reduces random access to substitution matrix with sequential profile access
  - Ling2009
  - Striemer2009
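A query profile precomputes, for each alphabet symbol, the substitution score against every query position. The inner loop can then read scores sequentially instead of indexing the substitution matrix at random. The sketch below builds the plain sequential layout. The match/mismatch scoring function and all names are assumptions for illustration, and the striped variants from the papers above are not shown.

```python
def build_query_profile(query, alphabet, score):
    """Return profile[symbol][i] = score(symbol, query[i]) for every symbol."""
    return {symbol: [score(symbol, q) for q in query] for symbol in alphabet}


def simple_score(x, y, match=2, mismatch=-1):
    """Toy substitution function standing in for a real substitution matrix."""
    return match if x == y else mismatch


profile = build_query_profile("ACGT", "ACGT", simple_score)
# While processing database residue d, the scores for the whole query are the
# contiguous list profile[d], read left to right.
```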
- Data layouts:
  - Munekawa2008: Notes that local memory (LM) cannot be used in a coalesced manner, but that it is the fallback if there are too few registers available, so it is better to explicitly use global memory (GM) than to implicitly allow LM to be used.
  - Munekawa2008: Sort sequences by length
  - Liu2009: Sort sequences by length
  - Liu2009: Achieves coalesced memory access by arranging subject sequences so their elements are vertical in a matrix and the subjects are ordered from left to right in order of length
  - Liu2009: Coalesced global memory access
  - Liu2009: Divides matrix into cell blocks, which reduces load/store counts. Not too well explained.
  - Munekawa2008: Stores (k-1) antidiagonal in shared memory (multiple threads access it) and (k-2) and current antidiagonal in registers (only accessed by a single thread)
  - Munekawa2008: Stores query sequence in constant memory, since all threads refer to it
  - Munekawa2008: Stores database sequences in texture memory, possibly only because they take a lot of memory. Not a clear rationale.
  - Manavski2008: Pack char data into integers (4 per int) to make efficient use of local memory accesses.
  - Akoglu2009: Puts both query sequence and substitution matrix in constant memory because: "reading from the constant cache is as fast as reading from a register if all threads read the same address, which is the case when reading values from the query sequence"
  - Akoglu2009: Rearranges the substitution matrix for efficient access
  - Liu2010: Packed data format to better leverage query profile
  - Liu2013: Sorting the database and queries by length
  - Huang2015: Interleaving sequences in memory for coalesced access
  - Ligowski2009: Storing scores and backtracking data both in 4-byte integers
  - Khajeh2010: Reformulates the antidiagonal as a row, allowing for coalesced memory access. Gaps are implemented using a parallel prefix scan.
  - Ling2009: Improves over Munekawa and Manavski by separating computation of the alignment matrix into multiple parts if the number of threads and size of local memory are not sufficient, allocating resources to each submatrix in turn
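The length-sorted, column-wise arrangement described for Liu2009 and Huang2015 can be sketched on the host side as follows. Sequences are sorted by length and then transposed so that residue `j` of every sequence sits in one contiguous row, which is the pattern that lets neighbouring GPU threads issue coalesced reads. The padding character `N` and the function name are assumptions for this example.

```python
def interleave_sequences(seqs, pad="N"):
    """Sort sequences by length and lay them out position-major.

    Returns (ordered, rows), where rows[j] holds residue j of each sequence
    in `ordered`, padded where a sequence is shorter than the longest one.
    """
    ordered = sorted(seqs, key=len)
    width = len(ordered[-1]) if ordered else 0
    rows = []
    for j in range(width):
        rows.append("".join(s[j] if j < len(s) else pad for s in ordered))
    return ordered, rows


ordered, rows = interleave_sequences(["ACGT", "AC", "GGG"])
```

Sorting first keeps sequences of similar length next to each other, so threads working on adjacent columns finish at about the same time and little padding is processed.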
- Input-size dependent choice of algorithms:
  - Hains2011: Switching between interthread and intrathread parallelism as sequence size changes
  - Dicker2014: Parallel prefix versus diagonal wavefront
  - Luo2013: If all sequences are within 1% of each other's lengths, sequences are allocated statically. Otherwise an atomic increment is used to reallocate sequences to processors as processing completes.
  - Liu2009: Switches between interthread and intrathread parallelism
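The Luo2013 rule can be pictured as a small host-side decision. The sketch below interprets "within 1% of each other's lengths" as the spread between the shortest and longest sequence relative to the longest. That interpretation, the function name, and the returned labels are assumptions and are not taken from the SOAP3 source.

```python
def choose_allocation(lengths, tolerance=0.01):
    """Pick "static" when all lengths are nearly equal, otherwise "dynamic"."""
    if not lengths:
        return "static"
    longest, shortest = max(lengths), min(lengths)
    if longest == 0:
        return "static"
    spread = (longest - shortest) / longest  # relative length difference
    return "static" if spread <= tolerance else "dynamic"
```

In the dynamic case, each worker would claim the next unprocessed sequence by atomically incrementing a shared counter, so workers that finish early pick up more work.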
- Speculation:
  - Liu2010: Speculative calculation of H scores before F dependencies available (CUDASW++2.0)
  - Farrar2007: For most cells in the alignment matrix, F remains at zero and does not contribute to H. Only when H is greater than Ginit+Gext will F start to influence the value of H. So F is not considered initially. If required, a second step tries to correct the introduced errors. Manavski2008 claims that its solution, which doesn't use this optimization, runs faster than Farrar2007.
  - Ligowski2009: Only store score information and only as a single byte. Reprocess those sequences which were sufficiently high-scoring using a full algorithm.
- Storage reduction:
  - Munekawa2008: Stores only three anti-diagonals
  - Munekawa2008: Packs sequences into vector data formatted in type char4. Four succeeding columns are assigned to each thread.
  - Manavski2008: Pack bytes into integers; integer types had just become available
  - Sandes2013: Using Myers-Miller for linear space
  - Huang2015: Saving only the most recent rows/columns/diagonals rather than the whole dynamic programming matrix
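The three-anti-diagonal idea can be sketched as below. Every cell on anti-diagonal `d = i + j` depends only on anti-diagonals `d - 1` and `d - 2`, so the score can be computed while keeping just three arrays. On a GPU, the cells of one anti-diagonal are independent and can be processed in parallel. This serial Python version uses a linear gap penalty with assumed scoring values, and the function name is hypothetical. It returns only the best score, not the alignment.

```python
def sw_score_antidiagonal(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score, storing only three anti-diagonals."""
    m, n = len(a), len(b)
    prev2 = [0] * (m + 1)  # anti-diagonal d-2, indexed by row i
    prev1 = [0] * (m + 1)  # anti-diagonal d-1, indexed by row i
    best = 0
    for d in range(2, m + n + 1):
        curr = [0] * (m + 1)
        # Rows i for which column j = d - i lies inside the matrix.
        for i in range(max(1, d - n), min(m, d - 1) + 1):
            j = d - i
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr[i] = max(
                0,
                prev2[i - 1] + sub,  # cell (i-1, j-1)
                prev1[i - 1] + gap,  # cell (i-1, j)
                prev1[i] + gap,      # cell (i, j-1)
            )
            best = max(best, curr[i])
        prev2, prev1 = prev1, curr
    return best
```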
- Processing order:
  - Guan1994: Divide-and-conquer for Myers-Miller
  - Hains2011: Filling matrix in columns to increase utilization and decrease global memory accesses
  - Venkatachalam2012: Briefly mentions that assigning multiple rows per thread reduces synchronization costs
- Fine-tuning block/thread counts:
  - Sandes2013
- Available as a library:
  - Okada2015: Example code included
- SIMD instructions
  - Liu2013: Four adjacent subject sequences from pre-sorted list are assigned to a single thread, each vector lane corresponds to a sequence. Two-dimensional sequence profile is created.
  - Venkatachalam2012: Short vectors can be used to read and manipulate four values at once, rather than using one thread per cell
- Use of CPU and GPU:
  - Liu2013
  - Luo2013
  - Warris2018
  - Marcos2014
- Use of local memory:
  - Luo2013: 512kB per-thread local memory is used to store one row for matrices H and E.
- Multi-GPU:
  - Sandes2014: Splits data into short-phase and long-phase to minimize time spent waiting by downstream GPUs for communication from upstream
- Calculation time prediction equation:
  - Sandes2014:
- Pipelining:
  - Venkatachalam2012: Data can be loaded to GPU while other alignments are happening
- Tricks:
  - Using the modulus operator is extremely inefficient on CUDA
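A common way to avoid the modulus operator, when the divisor is a power of two, is to replace `i % n` with a bitwise AND against `n - 1`. The sketch below shows the identity in Python for non-negative integers. The same expression can be written in CUDA C for unsigned integers. The function name is hypothetical, and whether this matters for a given kernel would need to be measured.

```python
def mod_pow2(i, n):
    """Compute i % n for non-negative i when n is a power of two."""
    if n <= 0 or n & (n - 1):
        raise ValueError("n must be a positive power of two")
    if i < 0:
        raise ValueError("i must be non-negative")
    return i & (n - 1)  # keeps only the low log2(n) bits of i
```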
- Recompile GPU code on the fly:
  - Warris2018
- Use of BWT:
  - Klus2012:
- IGNORED
  - Liu2006: Because it is in OpenGL, so the techniques are no longer really relevant
  - Pankaj2012: Only have a PowerPoint.
  - Sandes2014: Such long sequences
- Misc:
  - Sandes2013: Myers-Miller used to find midpoint of LCS
- Parallel (prefix?) scan
- Tiling
- Blazewicz boolean matrices
- Block pruning
- Burrows-Wheeler transform? (Klus2012)
module load tbb
export CPATH=/lustre/atlas/sw/tbb/43/sles11.3_gnu4.8.2/source/include
make -j 8
Szalkowski2008 SWPS3 - fast multi-threaded vectorized Smith-Waterman for IBM Cell/B.E. and x86/SSE2
mkdir build
cd build
cmake ..
make
module load cudatoolkit
qsub -I -A CSC261 -l nodes=1,walltime=30:00
nvcc -I. -Iinc *cu *cpp inc/*cpp -L${CRAY_LD_LIBRARY_PATH} -lcudart
SmithWaterman_kernel.cu needs to be edited to hold a query sequence.
#Acquire CUDA 6.5
wget http://developer.download.nvidia.com/compute/cuda/6_5/rel/installers/cuda_6.5.14_linux_64.run
#Install it to (you may need to `mkdir -p` this directory)
$HOME/os/cuda-6.5/
#Try compiling:
module unload pgi
module remove cudatoolkit
module load cmake
module load gcc/4.8.2
export PATH="$HOME/os/cuda-6.5/bin:$PATH"
export LIBRARY_PATH="$HOME/os/cuda-6.5/lib64"
./comp_cu.sh
Farrar2007 Striped Smith-Waterman speeds database searches six times over other SIMD implementations
Manavski2008 CUDA compatible GPU cards as efficient hardware accelerators for Smith-Waterman sequence alignment
make -f Makefile
make -f Makefile
Compilation succeeded with
module load cudatoolkit/9.1.85_3.10-1.0502.df1cc54.3.1
make
Compilation succeeded with
module load cudatoolkit/9.1.85_3.10-1.0502.df1cc54.3.1
#Minor makefile adjustment to NVCC path
make
Set MAX_SEQUENCE_LENGTH in barracuda.h.
Requires preprocessing the reference sequence using a Burrows-Wheeler transform.
Compilation successful.
module load cudatoolkit/7.0.28-1.0502.10280.4.1
make
All query sequences must be the same length, but they can be padded with N.
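A minimal sketch of that padding step, assuming the queries are held as plain strings and that trailing `N` characters are acceptable to the aligner (the function name is hypothetical):

```python
def pad_to_common_length(seqs, pad="N"):
    """Right-pad every sequence with `pad` up to the longest length."""
    width = max((len(s) for s in seqs), default=0)
    return [s.ljust(width, pad) for s in seqs]
```

For example, `pad_to_common_length(["ACGT", "AC"])` gives `["ACGT", "ACNN"]`.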
make
Code compiles on Titan per the build.titan script in implementations/masa/masa-cudalign/.
Only aligns two very long sequences.
Code for 4.0 doesn't seem to be available. TODO: email authors.
Only aligns two very long sequences.
PaSWAS, from Warris2015, needs to be compiled from source with the parameters of the input sequences. If the sequences are of different lengths, it must be compiled with the length of the longest one. Since the Antarctic data contains a sequence of length 5,279, this means that only a single sequence can fit on the GPU at a time.
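To find the length to compile with, the longest record in a FASTA input can be measured first. The helper below is a small sketch that assumes standard FASTA text, with header lines starting with `>` and sequence data possibly wrapped over several lines. It is not part of PaSWAS, and the function name is hypothetical.

```python
def longest_record_length(fasta_text):
    """Return the length of the longest sequence in FASTA-formatted text."""
    longest = 0
    current = 0
    for line in fasta_text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            # A new record starts: close out the previous one.
            longest = max(longest, current)
            current = 0
        else:
            current += len(line)
    return max(longest, current)
```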
Compiled with modifications to Makefile and inclusion of CUDA-deprecated header files.
cd PaSWAS/onGPU
module load cudatoolkit
Summit required minor modifications to the makefiles to point to the correct library paths, also:
module load cuda/9.0.69
module load gcc/6.4.0
Compilation succeeded with
module load cudatoolkit/9.1.85_3.10-1.0502.df1cc54.3.1
#Several fixes to the code and makefile
make
Requires preprocessing the reference sequence with a Burrows-Wheeler transform
For 50 sequences, it runs. For 100 sequences it fails, and keeps failing (presumably), until I load 2,698 sequences, and then everything's fine again.
module load cudatoolkit/9.1.85_3.10-1.0502.df1cc54.3.1
module load gcc/6.3.0
module unload pgi
Compilation succeeded. Straightforward.
make
Seems to just work.
Compilation successful. Minor alterations of makefile required.
module load cudatoolkit/9.1.85_3.10-1.0502.df1cc54.3.1
module unload pgi
module load gcc/6.3.0
make
Available as a library.
The protein alignment functionality has to be used to get M:M; otherwise it performs a single 1:1 alignment.
wget ftp://ftp.gnu.org/gnu/gengetopt/gengetopt-2.22.tar.gz
tar xvzf gengetopt-2.22.tar.gz
cd gengetopt-2.22/
./configure --prefix=$HOME/os
#Add `#include <string.h>` to the top of `src/fileutils.cpp`
make -j 10
make install
export PATH="$HOME/os/bin:$PATH"
module load tbb
echo $TBB_COMPILE_FLAGS #Get path to TBB
export LIBRARY_PATH="/lustre/atlas/sw/tbb/43/sles11.3_gnu4.8.2/source/build/linux_intel64_gcc_cc4.8.2_libc2.11.3_kernel3.0.101_release/:$LIBRARY_PATH"
mkdir build
cd build
cmake ..
make -j 10
#Executable is in: build/src/c
pip3 install pycuda --user
pip3 install BioPython==1.71 --user
pip3 install numpy==1.14.3
pip3 install pyopencl= --user
#Anaconda Python 3.5.5
#Cuda 9.0.69
Build process seems to require Spack. Might be easier to use Docker. That is, this is likely to be forever a troublesome dependency.
Fork says to use flag -DGPU_ARCHITECTURE=sm_XX with cmake. (Link)
nvbio repo says that support is for GCC 4.8 with CUDA 6.5 (Link).
An alternative repo at https://github.com/ngstools/nvbio doesn't exist any more.
#Acquire CUDA 6.5
wget http://developer.download.nvidia.com/compute/cuda/6_5/rel/installers/cuda_6.5.14_linux_64.run
#Install it to (you may need to `mkdir -p` this directory)
$HOME/os/cuda-6.5/
#Try compiling:
module unload pgi
module remove cudatoolkit
module load cmake
module load gcc/4.8.2
export PATH="$HOME/os/cuda-6.5/bin:$PATH"
export LIBRARY_PATH="$HOME/os/cuda-6.5/lib64"
mkdir build
cd build
CXX=g++ CC=gcc cmake .. -DGPU_ARCHITECTURE=sm_35 -DCMAKE_INSTALL_PREFIX:PATH=$HOME/os
make -j 10
cd ..
mkdir debug
cd debug
CXX=g++ CC=gcc cmake .. -DGPU_ARCHITECTURE=sm_35 -DCMAKE_INSTALL_PREFIX:PATH=$HOME/os -DCMAKE_BUILD_TYPE=Debug
make -j 10
Seems to require Ubuntu or Fedora. Complicated build process, but a cool idea (generating a deb package on the fly).
All material on these sites has been examined and linked references downloaded.
- http://www.nvidia.com/object/cuda_showcase_html.html
- http://www.nvidia.com/object/bio_info_life_sciences.html
Recordings of talks:
- https://www.youtube.com/watch?v=dTjvJmOpbM4
- http://on-demand.gputechconf.com/gtc/2012/video/S0083-Swift-GPU-Based-Smith-Waterman-Sequence-Alignment-Program.flv
All the test files can be acquired quickly using the following commands:
wget https://svwh.dl.sourceforge.net/project/cudasw/data/simdb.fasta.gz -P data/
wget https://iweb.dl.sourceforge.net/project/cudasw/data/Queries.zip -P data/
wget ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz -P data/
cat data/pacbio_human54x_files | xargs -n 1 -P 4 wget --continue -P data/
The test data comes from the following sources:
- http://cudasw.sourceforge.net/homepage.htm#installation : CUDASW++; search for "Example query sequences". Download the files simdb.fasta.gz and Queries.zip.
- http://sourceforge.net/projects/cudasw/files/data : Same as above, but a more direct link.
- The PacBIO Human54x files are drawn from here and linked to from here.
- UniProt's Swiss-Prot database of proteins here.
An example of running CUDASW++ (2.0) with an arbitrary query from Queries/ against the simdb.fasta database with all the default parameter values:
./cudasw -query Queries/P01008.fasta -db simdb.fasta
The example assumes CUDASW++ (2.0) is compiled as the executable "cudasw", cudasw is in $PATH, and Queries/ and simdb.fasta are in the current working directory (simply provide the absolute path if not).
See here for additional instructions and options for CUDASW++.
- Might be possible to use: https://github.com/seqan/seqan/tree/master/apps/mason2
Omitted repos:
- https://github.com/vgteam/gssw conflicts with Zhao2013 and is a generalization, so probably not needed
Reading sequence data:
To run on Titan, you'll need to first compile your code. The following, for example, shows how to compile Striemer2009.
module load cudatoolkit
nvcc -I. -Iinc *cu *cpp inc/*cpp -L${CRAY_LD_LIBRARY_PATH} -lcudart
You'll then need to either make a batch script or start an interactive batch job:
qsub -I -X -A CSC261 -q debug -l nodes=1,walltime=30:00
The only way to access compute nodes is via the aprun command, but this command can only be run from somewhere on the Lustre file system. Get there using (for example):
cd $MEMBERWORK/csc261
cd /lustre/atlas/scratch/spinyfan/csc261/
Finally, use aprun to run the program:
aprun ~/crd-swgpu/implementations/striemer2009/SmithWaterman/a.out