Hirschberg's algorithm
Encyclopedia
Hirschberg's algorithm, named after its inventor, Dan Hirschberg
, is a dynamic programming
algorithm
that finds the least cost sequence alignment
between two string
s, where cost is measured as Levenshtein distance
, defined to be the sum of the costs of insertions, replacements, deletions, and null actions needed to change one string to the other. Hirschberg's algorithm is simply described as a divide and conquer
version of the Needleman-Wunsch algorithm
. Hirschberg's algorithm is commonly used in computational biology
to find maximal global alignments of DNA
and protein
sequences.
and FASTA
are suboptimal heuristics. If x and y are strings, where length(x) = n and length(y) = m, the Needleman-Wunsch algorithm
finds an optimal alignment in O
(nm) time, using O(nm) space. Hirschberg's algorithm is a clever modification of the Needleman-Wunsch Algorithm which still takes O(nm) time, but needs only O(min{n,m}) space.
One application of the algorithm is finding sequence alignments of DNA or protein sequences. It is also a space-efficient way to calculate the longest common subsequence
between two sets of data such as with the common diff
tool.
inserting a symbol a into a string, Sub(a,b) for the cost of substituting a symbol b for a in a string, and Del(a) for the cost of deleting a symbol a from a string. The "standard" choice of those costs is Ins(a) = Del(a) = 1 for each symbol a, Sub(a,a) = 0, and Sub(a,b) = 1 if a is not equal to b.
The Needleman-Wunsch algorithm computes Edit(x,y) by computing Edit(Prefix[x,i],Prefix[y,j]) for all i and j dynamically, where Prefix[x,i] denotes the prefix of x of length i. That algorithm requires O(nm) time and O(nm) space, where n = length(x) and m = length(y).
that edit distances can be computed using linear space.
What we call the "forward subprogram" computes the values of
Edit(Prefix[x,i],Prefix[y,j]) for all i and j, just as the Needleman-Wunsch
and returns the array {Edit(x,Prefix[y,j])}0 ≤ j ≤ m. The
"backward subprogram" is similar, except that the dynamic program
is done in the opposite direction, i.e., starting from the right ends
of the strings. It returns the array {Edit(x,Suffix[y,j])}0 ≤ j ≤ m,
where Suffix[y,j] is the suffix of y of length j.
The forward and backward subprograms compute values in the matrix by using
previously computed values, but save space by erasing values that will no
longer be needed during that execution of the subprogram. Unfortunately,
the erased values will need to be recomputed later; thus, Hirschberg's
algorithm takes roughly twice as much time as Needleman-Wunsch.
High-level description of the backwards subprogram
High level description of Hirschberg's algorithm:
Dan Hirschberg
Daniel S. Hirschberg is a full professor in Computer Science at University of California, Irvine. His research interests are in the theory of design and analysis of algorithms....
, is a dynamic programming
Dynamic programming
In mathematics and computer science, dynamic programming is a method for solving complex problems by breaking them down into simpler subproblems. It is applicable to problems exhibiting the properties of overlapping subproblems which are only slightly smaller and optimal substructure...
algorithm
Algorithm
In mathematics and computer science, an algorithm is an effective method expressed as a finite list of well-defined instructions for calculating a function. Algorithms are used for calculation, data processing, and automated reasoning...
that finds the least cost sequence alignment
Sequence alignment
In bioinformatics, a sequence alignment is a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences. Aligned sequences of nucleotide or amino acid residues are...
between two string
String (computer science)
In formal languages, which are used in mathematical logic and theoretical computer science, a string is a finite sequence of symbols that are chosen from a set or alphabet....
s, where cost is measured as Levenshtein distance
Levenshtein distance
In information theory and computer science, the Levenshtein distance is a string metric for measuring the amount of difference between two sequences...
, defined to be the sum of the costs of insertions, replacements, deletions, and null actions needed to change one string to the other. Hirschberg's algorithm is simply described as a divide and conquer
Divide and conquer algorithm
In computer science, divide and conquer is an important algorithm design paradigm based on multi-branched recursion. A divide and conquer algorithm works by recursively breaking down a problem into two or more sub-problems of the same type, until these become simple enough to be solved directly...
version of the Needleman-Wunsch algorithm
Needleman-Wunsch algorithm
The Needleman–Wunsch algorithm performs a global alignment on two sequences . It is commonly used in bioinformatics to align protein or nucleotide sequences. The algorithm was published in 1970 by Saul B. Needleman and Christian D...
. Hirschberg's algorithm is commonly used in computational biology
Computational biology
Computational biology involves the development and application of data-analytical and theoretical methods, mathematical modeling and computational simulation techniques to the study of biological, behavioral, and social systems...
to find maximal global alignments of DNA
DNA
Deoxyribonucleic acid is a nucleic acid that contains the genetic instructions used in the development and functioning of all known living organisms . The DNA segments that carry this genetic information are called genes, but other DNA sequences have structural purposes, or are involved in...
and protein
Protein
Proteins are biochemical compounds consisting of one or more polypeptides typically folded into a globular or fibrous form, facilitating a biological function. A polypeptide is a single linear polymer chain of amino acids bonded together by peptide bonds between the carboxyl and amino groups of...
sequences.
Algorithm information
Hirschberg's algorithm is a generally applicable algorithm for finding an optimal sequence alignment. BLASTBLAST
In bioinformatics, Basic Local Alignment Search Tool, or BLAST, is an algorithm for comparing primary biological sequence information, such as the amino-acid sequences of different proteins or the nucleotides of DNA sequences...
and FASTA
FASTA
FASTA is a DNA and protein sequence alignment software package first described by David J. Lipman and William R. Pearson in 1985. Its legacy is the FASTA format which is now ubiquitous in bioinformatics.- History :...
are suboptimal heuristics. If x and y are strings, where length(x) = n and length(y) = m, the Needleman-Wunsch algorithm
Needleman-Wunsch algorithm
The Needleman–Wunsch algorithm performs a global alignment on two sequences . It is commonly used in bioinformatics to align protein or nucleotide sequences. The algorithm was published in 1970 by Saul B. Needleman and Christian D...
finds an optimal alignment in O
Big O notation
In mathematics, big O notation is used to describe the limiting behavior of a function when the argument tends towards a particular value or infinity, usually in terms of simpler functions. It is a member of a larger family of notations that is called Landau notation, Bachmann-Landau notation, or...
(nm) time, using O(nm) space. Hirschberg's algorithm is a clever modification of the Needleman-Wunsch Algorithm which still takes O(nm) time, but needs only O(min{n,m}) space.
One application of the algorithm is finding sequence alignments of DNA or protein sequences. It is also a space-efficient way to calculate the longest common subsequence
Longest common subsequence problem
The longest common subsequence problem is to find the longest subsequence common to all sequences in a set of sequences . Note that subsequence is different from a substring, see substring vs. subsequence...
between two sets of data such as with the common diff
Diff
In computing, diff is a file comparison utility that outputs the differences between two files. It is typically used to show the changes between one version of a file and a former version of the same file. Diff displays the changes made per line for text files. Modern implementations also...
tool.
Computation of the Levenshtein edit distance in linear space
The edit distance Edit(x,y) is the least cost of changing x into y by using the operations Insert, Substitute, and Delete, where the cost of each kind of operation is given. We write Ins(a) for the cost ofinserting a symbol a into a string, Sub(a,b) for the cost of substituting a symbol b for a in a string, and Del(a) for the cost of deleting a symbol a from a string. The "standard" choice of those costs is Ins(a) = Del(a) = 1 for each symbol a, Sub(a,a) = 0, and Sub(a,b) = 1 if a is not equal to b.
The Needleman-Wunsch algorithm computes Edit(x,y) by computing Edit(Prefix[x,i],Prefix[y,j]) for all i and j dynamically, where Prefix[x,i] denotes the prefix of x of length i. That algorithm requires O(nm) time and O(nm) space, where n = length(x) and m = length(y).
Algorithm organization
To understand Hirschberg's algorithm, it is important to first understandthat edit distances can be computed using linear space.
What we call the "forward subprogram" computes the values of
Edit(Prefix[x,i],Prefix[y,j]) for all i and j, just as the Needleman-Wunsch
and returns the array {Edit(x,Prefix[y,j])}0 ≤ j ≤ m. The
"backward subprogram" is similar, except that the dynamic program
is done in the opposite direction, i.e., starting from the right ends
of the strings. It returns the array {Edit(x,Suffix[y,j])}0 ≤ j ≤ m,
where Suffix[y,j] is the suffix of y of length j.
The forward and backward subprograms compute values in the matrix by using
previously computed values, but save space by erasing values that will no
longer be needed during that execution of the subprogram. Unfortunately,
the erased values will need to be recomputed later; thus, Hirschberg's
algorithm takes roughly twice as much time as Needleman-Wunsch.
Pseudocode
High-level description of the forwards subprogramHigh-level description of the backwards subprogram
High level description of Hirschberg's algorithm:
See also
- Needleman-Wunsch algorithmNeedleman-Wunsch algorithmThe Needleman–Wunsch algorithm performs a global alignment on two sequences . It is commonly used in bioinformatics to align protein or nucleotide sequences. The algorithm was published in 1970 by Saul B. Needleman and Christian D...
- Smith Waterman algorithmSmith Waterman algorithmThe Smith–Waterman algorithm is a well-known algorithm for performing local sequence alignment; that is, for determining similar regions between two nucleotide or protein sequences...
- Levenshtein distanceLevenshtein distanceIn information theory and computer science, the Levenshtein distance is a string metric for measuring the amount of difference between two sequences...
- Longest Common SubsequenceLongest common subsequence problemThe longest common subsequence problem is to find the longest subsequence common to all sequences in a set of sequences . Note that subsequence is different from a substring, see substring vs. subsequence...