Smith Waterman algorithm
Encyclopedia
The Smith–Waterman algorithm is a well-known algorithm for performing local sequence alignment
; that is, for determining similar regions between two nucleotide or protein sequences. Instead of looking at the total sequence, the Smith–Waterman algorithm compares segments of all possible lengths and optimizes the similarity measure.
and Michael S. Waterman in 1981. Like the Needleman-Wunsch algorithm
, of which it is a variation, Smith-Waterman is a dynamic programming
algorithm. As such, it has the desirable property that it is guaranteed to find the optimal local alignment with respect to the scoring system being used (which includes the substitution matrix
and the gap-scoring
scheme). The main difference to the Needleman-Wunsch algorithm
is that negative scoring matrix cells are set to zero, which renders the (thus positively scoring) local alignments visible. Backtracking starts at the highest scoring matrix cell and proceeds until a cell with score zero is encountered, yielding the highest scoring local alignment. One does not actually implement the algorithm as described because improved alternatives are now available that have better scaling (Gotoh, 1982) and are more accurate (Altschul and Erickson, 1986).
Where:
To obtain the optimum local alignment, we start with the highest value in the matrix (i,j). Then, we go backwards to one of positions (i-1,j), (i,j-1), and (i-1,j-1) depending on the direction of movement used to construct the matrix. We keep the process until we reach a matrix cell with zero value, or the value in position (0,0).
In the example, the highest value corresponds to the cell in position (8,8). The walk back corresponds to (8,8), (7,7), (7,6), (6,5), (5,4), (4,3), (3,2), (2,1), (1,1), and (0,0),
Once we've finished, we reconstruct the alignment as follows: Starting with the last value, we reach (i,j) using the previously-calculated path. A diagonal jump implies there is an alignment (either a match or a mismatch). A top-down jump implies there is a deletion. A left-right jump implies there is an insertion.
For the example, we get:
Sequence 1 = A-CACACTA
Sequence 2 = AGCACAC-A
and gap penalties) would yield for a random sequence.
Another motivation for using local alignments is that there is a reliable statistical model (developed by Karlin and Altschul) for optimal local alignments. The alignment of unrelated sequences tends to produce optimal local alignment scores which follow an extreme value distribution. This property allows programs to produce an expectation value for the optimal local alignment of two sequences, which is a measure of how often two unrelated sequences would produce an optimal local alignment whose score is greater than or equal to the observed score. Very low expectation values indicate that the two sequences in question might be homologous, meaning they might share a common ancestor.
The Smith–Waterman algorithm is fairly demanding of time: To align two sequences of lengths m and n, O
(mn) time is required. Smith-Waterman local similarity scores can be calculated in O(m) (linear) space, but naive algorithms to produce the alignment require O(mn) space. A linear space alignment strategy has been described. BLAST
and FASTA
reduce the amount of time required by identifying conserved regions using rapid lookup strategies.
An implementation of the Smith-Waterman Algorithm, SSEARCH, is available in the FASTA
sequence analysis package from http://fasta.bioch.virginia.edu/fasta_www2/fasta_down.shtml. This implementation includes Altivec
accelerated code for PowerPC
G4 and G5 processors that speeds up comparisons 10 - 20-fold, using a modification of the Wozniak, 1997 approach, and an SSE2 vectorization developed by Farrar making optimal protein sequence database
searches quite practical.
demonstrated acceleration of the Smith–Waterman algorithm using a reconfigurable computing
platform based on FPGA chips, with results showing up to 28x speed-up over standard microprocessor-based solutions. Another FPGA based version of the Smith–Waterman algorithm shows FPGA (Virtex-4) speedups up to 100x over a 2.2 GHz Opteron processor. The TimeLogic DeCypher and CodeQuest systems also accelerate Smith-Waterman and Framesearch using PCIe FPGA cards.
Convey Computer
Corporation's HC-1, an example of hybrid-core computing
, has demonstrated a Smith-Waterman implementation that is 172x
faster than SSEARCH in FASTA on a 3.0 GHz Intel Nehalem core with Farrar's implementation using the SSE2 instruction set. Over a more traditional software implementation it is a couple of thousand times faster. The Convey result was obtained on a single Convey HC-1 server.
SciEngines' RIVYERA S6-LX150 provides more than 6000 GCUPS per single computer and implements Smith-Waterman more than 1000x faster than SSEARCH in FASTA on a 3.0 GHz Intel Nehalem core.
A recently published research thesis, Genetic sequence alignment on a supercomputing platform from the Computer Engineering department at Delft University shows an analysis on FPGA based Smith-Waterman acceleration.
and the US Department of Energy's Joint Genome Institute
accelerates Smith-Waterman local sequence alignment searches using graphics processing units (GPUs) with preliminary results showing a 2x speed-up over software implementations. A similar method has already been implemented in the Biofacet software since 1997, with the same speed-up factor.
Several GPU implementations of the algorithm in NVIDIA
's CUDA
C platform are also available. When compared to the best known CPU implementation (using SIMD instructions on the x86 architecture), by Farrar, the performance tests of this solution using a single NVidia GeForce 8800 GTX
card show a slight increase in performance for smaller sequences, but a slight decrease in performance for larger ones. However the same tests running on dual NVidia GeForce 8800 GTX
cards are almost twice as fast as the Farrar implementation for all sequence sizes tested.
A newer GPU CUDA implementation of SW is now available that is faster than previous versions and also removes limitations on query lengths. See CUDASW++.
Eleven different SW implementations on CUDA have been reported, three of which report speedups of 30X.
A SSE2
vectorization of the algorithm (Farrar, 2007) is now available providing an 8-16-fold speedup on Intel/AMD processors with SSE2 extensions. When running on Intel processor using the Core microarchitecture the SSE2 implementation achieves a 20-fold increase. Farrar's SSE2 implementation is available as the SSEARCH program in the FASTA
sequence comparison package. The SSEARCH is included in the European Bioinformatics Institute
's suite of similarity searching programs.
Danish bioinformatics company CLC bio
has achieved speed-ups of close to 200 over standard software implementations with SSE2 on a Intel 2.17 GHz Core 2 Duo CPU, according to a publicly available white paper.
Accelerated version of the Smith–Waterman algorithm, on Intel and AMD based Linux servers, is supported by the GenCore 6 package, offered by Biocceleration. Performance benchmarks of this software package show up to 10 fold speed acceleration relative to standard software implementation on the same processor.
Currently the only company in bioinformatics to offer both SSE and FPGA solutions accelerating Smith-Waterman, CLC bio
has achieved speed-ups of more than 110 over standard software implementations with CLC Bioinformatics Cube
The fastest implementation of the algorithm on CPUs with SSSE3
can be found the SWIPE software (Rognes, 2011), which is available under the GNU Affero General Public License. In parallel, this software compares residues from sixteen different database sequences to one query residue. Using a 375 residue query sequence a speed of 106 billion cell updates per second (GCUPS) was achieved on a dual Intel Xeon
X5650 six-core processor system, which is over six times more rapid than software based on Farrar's 'striped' approach. It is faster than BLAST
when using the BLOSUM50 matrix.
, respectively.
Sequence alignment
In bioinformatics, a sequence alignment is a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences. Aligned sequences of nucleotide or amino acid residues are...
; that is, for determining similar regions between two nucleotide or protein sequences. Instead of looking at the total sequence, the Smith–Waterman algorithm compares segments of all possible lengths and optimizes the similarity measure.
Background
The algorithm was first proposed by Temple F. SmithTemple F. Smith
Temple F. Smith is a university professor in biomedical engineering who helped to develop the Smith-Waterman algorithm developed with Michael Waterman in 1981. The Smith-Waterman algorithm serves as the basis for multi sequence comparisons, identifying the segment with the maximum local sequence...
and Michael S. Waterman in 1981. Like the Needleman-Wunsch algorithm
Needleman-Wunsch algorithm
The Needleman–Wunsch algorithm performs a global alignment on two sequences . It is commonly used in bioinformatics to align protein or nucleotide sequences. The algorithm was published in 1970 by Saul B. Needleman and Christian D...
, of which it is a variation, Smith-Waterman is a dynamic programming
Dynamic programming
In mathematics and computer science, dynamic programming is a method for solving complex problems by breaking them down into simpler subproblems. It is applicable to problems exhibiting the properties of overlapping subproblems which are only slightly smaller and optimal substructure...
algorithm. As such, it has the desirable property that it is guaranteed to find the optimal local alignment with respect to the scoring system being used (which includes the substitution matrix
Substitution matrix
In bioinformatics and evolutionary biology, a substitution matrix describes the rate at which one character in a sequence changes to other character states over time...
and the gap-scoring
Gap penalty
Gap penalties are used during sequence alignment. Gap penalties contribute to the overall score of alignments, and therefore, the size of the gap penalty relative to the entries in the similarity matrix affects the alignment that is finally selected...
scheme). The main difference to the Needleman-Wunsch algorithm
Needleman-Wunsch algorithm
The Needleman–Wunsch algorithm performs a global alignment on two sequences . It is commonly used in bioinformatics to align protein or nucleotide sequences. The algorithm was published in 1970 by Saul B. Needleman and Christian D...
is that negative scoring matrix cells are set to zero, which renders the (thus positively scoring) local alignments visible. Backtracking starts at the highest scoring matrix cell and proceeds until a cell with score zero is encountered, yielding the highest scoring local alignment. One does not actually implement the algorithm as described because improved alternatives are now available that have better scaling (Gotoh, 1982) and are more accurate (Altschul and Erickson, 1986).
Algorithm Explanation
A matrix is built as follows:Where:
- = Strings over the AlphabetAlphabetAn alphabet is a standard set of letters—basic written symbols or graphemes—each of which represents a phoneme in a spoken language, either as it exists now or as it was in the past. There are other systems, such as logographies, in which each character represents a word, morpheme, or semantic...
- - is the maximum Similarity-Score between a suffix of a[1...i] and a suffix of b[1...j]
- , '-' is the gap-scoringGap penaltyGap penalties are used during sequence alignment. Gap penalties contribute to the overall score of alignments, and therefore, the size of the gap penalty relative to the entries in the similarity matrix affects the alignment that is finally selected...
scheme
Example
- Sequence 1 = ACACACTA
- Sequence 2 = AGCACACA
- w(gapGap penaltyGap penalties are used during sequence alignment. Gap penalties contribute to the overall score of alignments, and therefore, the size of the gap penalty relative to the entries in the similarity matrix affects the alignment that is finally selected...
) = 0 - w(match) = +2
To obtain the optimum local alignment, we start with the highest value in the matrix (i,j). Then, we go backwards to one of positions (i-1,j), (i,j-1), and (i-1,j-1) depending on the direction of movement used to construct the matrix. We keep the process until we reach a matrix cell with zero value, or the value in position (0,0).
In the example, the highest value corresponds to the cell in position (8,8). The walk back corresponds to (8,8), (7,7), (7,6), (6,5), (5,4), (4,3), (3,2), (2,1), (1,1), and (0,0),
Once we've finished, we reconstruct the alignment as follows: Starting with the last value, we reach (i,j) using the previously-calculated path. A diagonal jump implies there is an alignment (either a match or a mismatch). A top-down jump implies there is a deletion. A left-right jump implies there is an insertion.
For the example, we get:
Sequence 1 = A-CACACTA
Sequence 2 = AGCACAC-A
Motivation
One motivation for local alignment is the difficulty of obtaining correct alignments in regions of low similarity between distantly related biological sequences, because mutations have added too much 'noise' over evolutionary time to allow for a meaningful comparison of those regions. Local alignment avoids such regions altogether and focuses on those with a positive score, i.e. those with an evolutionary conserved signal of similarity. A prerequisite for local alignment is a negative expectation score. The expectation score is defined as the average score that the scoring system (substitution matrixSubstitution matrix
In bioinformatics and evolutionary biology, a substitution matrix describes the rate at which one character in a sequence changes to other character states over time...
and gap penalties) would yield for a random sequence.
Another motivation for using local alignments is that there is a reliable statistical model (developed by Karlin and Altschul) for optimal local alignments. The alignment of unrelated sequences tends to produce optimal local alignment scores which follow an extreme value distribution. This property allows programs to produce an expectation value for the optimal local alignment of two sequences, which is a measure of how often two unrelated sequences would produce an optimal local alignment whose score is greater than or equal to the observed score. Very low expectation values indicate that the two sequences in question might be homologous, meaning they might share a common ancestor.
The Smith–Waterman algorithm is fairly demanding of time: To align two sequences of lengths m and n, O
Big O notation
In mathematics, big O notation is used to describe the limiting behavior of a function when the argument tends towards a particular value or infinity, usually in terms of simpler functions. It is a member of a larger family of notations that is called Landau notation, Bachmann-Landau notation, or...
(mn) time is required. Smith-Waterman local similarity scores can be calculated in O(m) (linear) space, but naive algorithms to produce the alignment require O(mn) space. A linear space alignment strategy has been described. BLAST
BLAST
In bioinformatics, Basic Local Alignment Search Tool, or BLAST, is an algorithm for comparing primary biological sequence information, such as the amino-acid sequences of different proteins or the nucleotides of DNA sequences...
and FASTA
FASTA
FASTA is a DNA and protein sequence alignment software package first described by David J. Lipman and William R. Pearson in 1985. Its legacy is the FASTA format which is now ubiquitous in bioinformatics.- History :...
reduce the amount of time required by identifying conserved regions using rapid lookup strategies.
An implementation of the Smith-Waterman Algorithm, SSEARCH, is available in the FASTA
FASTA
FASTA is a DNA and protein sequence alignment software package first described by David J. Lipman and William R. Pearson in 1985. Its legacy is the FASTA format which is now ubiquitous in bioinformatics.- History :...
sequence analysis package from http://fasta.bioch.virginia.edu/fasta_www2/fasta_down.shtml. This implementation includes Altivec
AltiVec
AltiVec is a floating point and integer SIMD instruction set designed and owned by Apple, IBM and Freescale Semiconductor, formerly the Semiconductor Products Sector of Motorola, , and implemented on versions of the PowerPC including Motorola's G4, IBM's G5 and POWER6 processors, and P.A. Semi's...
accelerated code for PowerPC
PowerPC
PowerPC is a RISC architecture created by the 1991 Apple–IBM–Motorola alliance, known as AIM...
G4 and G5 processors that speeds up comparisons 10 - 20-fold, using a modification of the Wozniak, 1997 approach, and an SSE2 vectorization developed by Farrar making optimal protein sequence database
Sequence database
In the field of bioinformatics, a sequence database is a large collection of computerized nucleic acid sequences, protein sequences, or other sequences stored on a computer...
searches quite practical.
FPGA
CrayCray
Cray Inc. is an American supercomputer manufacturer based in Seattle, Washington. The company's predecessor, Cray Research, Inc. , was founded in 1972 by computer designer Seymour Cray. Seymour Cray went on to form the spin-off Cray Computer Corporation , in 1989, which went bankrupt in 1995,...
demonstrated acceleration of the Smith–Waterman algorithm using a reconfigurable computing
Reconfigurable computing
Reconfigurable computing is a computer architecture combining some of the flexibility of software with the high performance of hardware by processing with very flexible high speed computing fabrics like field-programmable gate arrays...
platform based on FPGA chips, with results showing up to 28x speed-up over standard microprocessor-based solutions. Another FPGA based version of the Smith–Waterman algorithm shows FPGA (Virtex-4) speedups up to 100x over a 2.2 GHz Opteron processor. The TimeLogic DeCypher and CodeQuest systems also accelerate Smith-Waterman and Framesearch using PCIe FPGA cards.
Convey Computer
Convey Computer
Convey Computer Corporation is a privately owned company, established in December 2006 and is based in Richardson, Texas. Convey has developed a specific form of heterogeneous computing they call hybrid-core computing...
Corporation's HC-1, an example of hybrid-core computing
Hybrid-core computing
Hybrid-core computing is the technique of extending a commodity instruction set architecture with application-specific instructions to accelerate application performance...
, has demonstrated a Smith-Waterman implementation that is 172x
faster than SSEARCH in FASTA on a 3.0 GHz Intel Nehalem core with Farrar's implementation using the SSE2 instruction set. Over a more traditional software implementation it is a couple of thousand times faster. The Convey result was obtained on a single Convey HC-1 server.
SciEngines' RIVYERA S6-LX150 provides more than 6000 GCUPS per single computer and implements Smith-Waterman more than 1000x faster than SSEARCH in FASTA on a 3.0 GHz Intel Nehalem core.
A recently published research thesis, Genetic sequence alignment on a supercomputing platform from the Computer Engineering department at Delft University shows an analysis on FPGA based Smith-Waterman acceleration.
GPU
Recent work developed at Lawrence Livermore National LaboratoryLawrence Livermore National Laboratory
The Lawrence Livermore National Laboratory , just outside Livermore, California, is a Federally Funded Research and Development Center founded by the University of California in 1952...
and the US Department of Energy's Joint Genome Institute
Joint Genome Institute
The U.S. Department of Energy Joint Genome Institute was created in 1997 to unite the expertise and resources in genome mapping, DNA sequencing, technology development, and information sciences pioneered at the DOE genome centers at Lawrence Berkeley National Laboratory , Lawrence Livermore...
accelerates Smith-Waterman local sequence alignment searches using graphics processing units (GPUs) with preliminary results showing a 2x speed-up over software implementations. A similar method has already been implemented in the Biofacet software since 1997, with the same speed-up factor.
Several GPU implementations of the algorithm in NVIDIA
NVIDIA
Nvidia is an American global technology company based in Santa Clara, California. Nvidia is best known for its graphics processors . Nvidia and chief rival AMD Graphics Techonologies have dominated the high performance GPU market, pushing other manufacturers to smaller, niche roles...
's CUDA
CUDA
CUDA or Compute Unified Device Architecture is a parallel computing architecture developed by Nvidia. CUDA is the computing engine in Nvidia graphics processing units that is accessible to software developers through variants of industry standard programming languages...
C platform are also available. When compared to the best known CPU implementation (using SIMD instructions on the x86 architecture), by Farrar, the performance tests of this solution using a single NVidia GeForce 8800 GTX
GeForce 8 Series
The GeForce 8 Series, is the eighth generation of NVIDIA's GeForce line of graphics processing units. The third major GPU architecture developed at NVIDIA, the GeForce 8 represents the company's first unified shader architecture.-Naming:...
card show a slight increase in performance for smaller sequences, but a slight decrease in performance for larger ones. However the same tests running on dual NVidia GeForce 8800 GTX
GeForce 8 Series
The GeForce 8 Series, is the eighth generation of NVIDIA's GeForce line of graphics processing units. The third major GPU architecture developed at NVIDIA, the GeForce 8 represents the company's first unified shader architecture.-Naming:...
cards are almost twice as fast as the Farrar implementation for all sequence sizes tested.
A newer GPU CUDA implementation of SW is now available that is faster than previous versions and also removes limitations on query lengths. See CUDASW++.
Eleven different SW implementations on CUDA have been reported, three of which report speedups of 30X.
SSE
In 2000, a fast implementation of the Smith–Waterman algorithm using the SIMD technology available in Intel Pentium MMX processors and similar technology was described in a publication by Rognes and Seeberg. In contrast to the Wozniak (1997) approach, the new implementation was based on vectors parallel with the query sequence, not diagonal vectors. The company Sencel Bioinformatics has applied for a patent covering this approach. Sencel is developing the software further and provides executables for academic use free of charge.A SSE2
SSE2
SSE2, Streaming SIMD Extensions 2, is one of the Intel SIMD processor supplementary instruction sets first introduced by Intel with the initial version of the Pentium 4 in 2001. It extends the earlier SSE instruction set, and is intended to fully supplant MMX. Intel extended SSE2 to create SSE3...
vectorization of the algorithm (Farrar, 2007) is now available providing an 8-16-fold speedup on Intel/AMD processors with SSE2 extensions. When running on Intel processor using the Core microarchitecture the SSE2 implementation achieves a 20-fold increase. Farrar's SSE2 implementation is available as the SSEARCH program in the FASTA
FASTA
FASTA is a DNA and protein sequence alignment software package first described by David J. Lipman and William R. Pearson in 1985. Its legacy is the FASTA format which is now ubiquitous in bioinformatics.- History :...
sequence comparison package. The SSEARCH is included in the European Bioinformatics Institute
European Bioinformatics Institute
The European Bioinformatics Institute is a centre for research and services in bioinformatics, and is part of European Molecular Biology Laboratory...
's suite of similarity searching programs.
Danish bioinformatics company CLC bio
CLC bio
CLC bio is a bioinformatics solution provider based in Aarhus, Denmark. CLC bio's software workbenches have more than 75,000 users in more than 100 countries around the globe.-Software:...
has achieved speed-ups of close to 200 over standard software implementations with SSE2 on a Intel 2.17 GHz Core 2 Duo CPU, according to a publicly available white paper.
Accelerated version of the Smith–Waterman algorithm, on Intel and AMD based Linux servers, is supported by the GenCore 6 package, offered by Biocceleration. Performance benchmarks of this software package show up to 10 fold speed acceleration relative to standard software implementation on the same processor.
Currently the only company in bioinformatics to offer both SSE and FPGA solutions accelerating Smith-Waterman, CLC bio
CLC bio
CLC bio is a bioinformatics solution provider based in Aarhus, Denmark. CLC bio's software workbenches have more than 75,000 users in more than 100 countries around the globe.-Software:...
has achieved speed-ups of more than 110 over standard software implementations with CLC Bioinformatics Cube
The fastest implementation of the algorithm on CPUs with SSSE3
SSSE3
Supplemental Streaming SIMD Extensions 3 is a SIMD instruction set created by Intel and is the fourth iteration of the SSE technology.- History :...
can be found the SWIPE software (Rognes, 2011), which is available under the GNU Affero General Public License. In parallel, this software compares residues from sixteen different database sequences to one query residue. Using a 375 residue query sequence a speed of 106 billion cell updates per second (GCUPS) was achieved on a dual Intel Xeon
Xeon
The Xeon is a brand of multiprocessing- or multi-socket-capable x86 microprocessors from Intel Corporation targeted at the non-consumer server, workstation and embedded system markets.-Overview:...
X5650 six-core processor system, which is over six times more rapid than software based on Farrar's 'striped' approach. It is faster than BLAST
BLAST
In bioinformatics, Basic Local Alignment Search Tool, or BLAST, is an algorithm for comparing primary biological sequence information, such as the amino-acid sequences of different proteins or the nucleotides of DNA sequences...
when using the BLOSUM50 matrix.
Cell Broadband Engine
In 2008, Farrar described a port of the Striped Smith-Waterman to the Cell Broadband Engine and reported speeds of 32 and 12 GCUPS on an IBM QS20 blade and a Sony PlayStation 3PlayStation 3
The is the third home video game console produced by Sony Computer Entertainment and the successor to the PlayStation 2 as part of the PlayStation series. The PlayStation 3 competes with Microsoft's Xbox 360 and Nintendo's Wii as part of the seventh generation of video game consoles...
, respectively.
External links
- JAligner — an open source Java implementation of the Smith–Waterman algorithm
- B.A.B.A. — an applet (with source) which visually explains the algorithm.
- FASTA/SSEARCH at the EBI'sEuropean Bioinformatics InstituteThe European Bioinformatics Institute is a centre for research and services in bioinformatics, and is part of European Molecular Biology Laboratory...
FASTA/SSEARCH services page. - UGENE Smith-Waterman plugin — An open source SSEARCH compatible implementation of the algorithm with graphical interface written in C++.
- OPAL JavaScript implementation of algorithms: Needleman-Wunsch, Needleman-Wunsch-Sellers and Smith-Waterman