Suffix array
Encyclopedia
In computer science
, a suffix array is an array of integers giving the starting positions of suffixes of a string
in lexicographical order
.
of length 12, that ends with a sentinel letter $, appearing only once, and less (in lexicographical order) than any other letter in the string.
It has twelve suffixes: "abracadabra$", "bracadabra$", "racadabra$", and so on down to "a$" and "$" that can be sorted
into lexicographical order to obtain:
If the original string is available, each suffix can be completely specified by the index of its first character. The suffix array is the array of the indices of suffixes sorted in lexicographical order. For the string "abracadabra$", using one-based indexing, the suffix array is {12,11,8,1,4,6,9,2,5,7,10,3}, because the suffix "$" begins at position 12, "a$" begins at position 11, "abra$" begins at position 8, and so forth.
The longest common prefix is also shown above as lcp. This value, stored alongside the list of prefix indices, indicates how many characters a particular suffix has in common with the suffix directly above it, starting at the beginning of both suffixes. The lcp is useful in making some string operations more efficient. For example, it can be used to avoid comparing characters that are already known to be the same when searching through the list of suffixes. The fact that the minimum lcp value belonging to a consecutive set of sorted suffixes gives the longest common prefix among all of those suffixes can also be useful.
algorithm. This requires suffix comparisons, but a suffix comparison requires time, so the overall runtime of this approach is . More sophisticated algorithms improve this to by exploiting the results of partial sorts to avoid redundant comparisons. Several algorithms (of Pang Ko and Srinivas Aluru, Juha Kärkkäinen and Peter Sanders, etc.) have also been developed which provide faster construction and have space usage of with low constants. These latter are derived from the suffix tree
construction algorithm of Farach.
Recent work by Salson et al. proposes an algorithm for updating the suffix array of a text that has been edited instead of rebuilding a new suffix array from scratch. Even if the theoretical worst-case time complexity is , it appears to perform well in practice: experimental results from the authors showed that their implementation of dynamic suffix arrays is generally more efficient than rebuilding when considering the insertion of a reasonable number of letters in the original text (Léonard, Mouchard and Salson).
to quickly locate every occurrence of a substring within the string. Finding every occurrence of the substring is equivalent to finding every suffix that begins with the substring. Thanks to the lexicographical ordering, these suffixes will be grouped together in the suffix array, and can be found efficiently with a binary search. If implemented straightforwardly, this binary search takes time, where is the length of the substring W. The following pseudo-code from Manber and Myers shows how to find W (or the suffix lexicographically immediately before W if W is not present) in a suffix array with indices stored in pos, starting with 1 as the first index.
To avoid redoing comparisons, extra data structures giving information about the longest common prefixes (LCPs) of suffixes are constructed, giving search time.
Suffix sorting algorithms can be used to perform the Burrows–Wheeler transform (BWT). Technically the BWT requires sorting cyclic permutation
s of a string, not suffixes. We can fix this by appending to the string a special end-of-string character which sorts lexicographically before every other character. Sorting cyclic permutations is then equivalent to sorting suffixes.
Suffix arrays are used to look up substrings in Example-Based Machine Translation
, demanding much less storage than a full phrase table as used in Statistical machine translation
.
to reduce memory consumption compared to a suffix tree
. This began the trend towards compressed suffix array
s and BWT-based compressed full-text indices.
Computer science
Computer science or computing science is the study of the theoretical foundations of information and computation and of practical techniques for their implementation and application in computer systems...
, a suffix array is an array of integers giving the starting positions of suffixes of a string
String (computer science)
In formal languages, which are used in mathematical logic and theoretical computer science, a string is a finite sequence of symbols that are chosen from a set or alphabet....
in lexicographical order
Lexicographical order
In mathematics, the lexicographic or lexicographical order, , is a generalization of the way the alphabetical order of words is based on the alphabetical order of letters.-Definition:Given two partially ordered sets A and B, the lexicographical order on...
.
Details
Consider the string1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
---|---|---|---|---|---|---|---|---|---|---|---|
a | b | r | a | c | a | d | a | b | r | a | $ |
of length 12, that ends with a sentinel letter $, appearing only once, and less (in lexicographical order) than any other letter in the string.
It has twelve suffixes: "abracadabra$", "bracadabra$", "racadabra$", and so on down to "a$" and "$" that can be sorted
into lexicographical order to obtain:
index | sorted suffix | lcp |
---|---|---|
12 | $ | 0 |
11 | a$ | 0 |
8 | abra$ | 1 |
1 | abracadabra$ | 4 |
4 | acadabra$ | 1 |
6 | adabra$ | 1 |
9 | bra$ | 0 |
2 | bracadabra$ | 3 |
5 | cadabra$ | 0 |
7 | dabra$ | 0 |
10 | ra$ | 0 |
3 | racadabra$ | 2 |
If the original string is available, each suffix can be completely specified by the index of its first character. The suffix array is the array of the indices of suffixes sorted in lexicographical order. For the string "abracadabra$", using one-based indexing, the suffix array is {12,11,8,1,4,6,9,2,5,7,10,3}, because the suffix "$" begins at position 12, "a$" begins at position 11, "abra$" begins at position 8, and so forth.
The longest common prefix is also shown above as lcp. This value, stored alongside the list of prefix indices, indicates how many characters a particular suffix has in common with the suffix directly above it, starting at the beginning of both suffixes. The lcp is useful in making some string operations more efficient. For example, it can be used to avoid comparing characters that are already known to be the same when searching through the list of suffixes. The fact that the minimum lcp value belonging to a consecutive set of sorted suffixes gives the longest common prefix among all of those suffixes can also be useful.
Algorithms
The easiest way to construct a suffix array is to use an efficient comparison sortComparison sort
A comparison sort is a type of sorting algorithm that only reads the list elements through a single abstract comparison operation that determines which of two elements should occur first in the final sorted list...
algorithm. This requires suffix comparisons, but a suffix comparison requires time, so the overall runtime of this approach is . More sophisticated algorithms improve this to by exploiting the results of partial sorts to avoid redundant comparisons. Several algorithms (of Pang Ko and Srinivas Aluru, Juha Kärkkäinen and Peter Sanders, etc.) have also been developed which provide faster construction and have space usage of with low constants. These latter are derived from the suffix tree
Suffix tree
In computer science, a suffix tree is a data structure that presents the suffixes of a given string in a way that allows for a particularly fast implementation of many important string operations.The suffix tree for a string S is a tree whose edges are labeled with strings, such that each suffix...
construction algorithm of Farach.
Recent work by Salson et al. proposes an algorithm for updating the suffix array of a text that has been edited instead of rebuilding a new suffix array from scratch. Even if the theoretical worst-case time complexity is , it appears to perform well in practice: experimental results from the authors showed that their implementation of dynamic suffix arrays is generally more efficient than rebuilding when considering the insertion of a reasonable number of letters in the original text (Léonard, Mouchard and Salson).
Applications
The suffix array of a string can be used as an indexIndex (information technology)
In computer science, an index can be:# an integer that identifies an array element# a data structure that enables sublinear-time lookup -Array element identifier:...
to quickly locate every occurrence of a substring within the string. Finding every occurrence of the substring is equivalent to finding every suffix that begins with the substring. Thanks to the lexicographical ordering, these suffixes will be grouped together in the suffix array, and can be found efficiently with a binary search. If implemented straightforwardly, this binary search takes time, where is the length of the substring W. The following pseudo-code from Manber and Myers shows how to find W (or the suffix lexicographically immediately before W if W is not present) in a suffix array with indices stored in pos, starting with 1 as the first index.
if W <= suffixAt(pos[1]) then
ans = 1
else if W > suffixAt(pos[n]) then
ans = n
else
{
L = 1, R = n
while R-L > 1 do
{
M = (L + R)/2
if W <= suffixAt(pos[M]) then
R = M
else
L = M
}
ans = R
}
To avoid redoing comparisons, extra data structures giving information about the longest common prefixes (LCPs) of suffixes are constructed, giving search time.
Suffix sorting algorithms can be used to perform the Burrows–Wheeler transform (BWT). Technically the BWT requires sorting cyclic permutation
Cyclic permutation
A cyclic permutation or circular permutation is a permutation built from one or more sets of elements in cyclic order.The notion "cyclic permutation" is used in different, but related ways:- Definition 1 :right|mapping of permutation...
s of a string, not suffixes. We can fix this by appending to the string a special end-of-string character which sorts lexicographically before every other character. Sorting cyclic permutations is then equivalent to sorting suffixes.
Suffix arrays are used to look up substrings in Example-Based Machine Translation
Example-based machine translation
The example-based machine translation approach to machine translation is often characterized by its use of a bilingual corpus with parallel texts as its main knowledge base, at run-time...
, demanding much less storage than a full phrase table as used in Statistical machine translation
Statistical machine translation
Statistical machine translation is a machine translation paradigm where translations are generated on the basis of statistical models whose parameters are derived from the analysis of bilingual text corpora...
.
History
Suffix arrays were originally developed by Gene Myers and Udi ManberUdi Manber
Udi Manber is an Israeli computer scientist. He is one of the authors of agrep and GLIMPSE. As of April 2008, he is employed by Google as one of their vice presidents of engineering.-Biography:...
to reduce memory consumption compared to a suffix tree
Suffix tree
In computer science, a suffix tree is a data structure that presents the suffixes of a given string in a way that allows for a particularly fast implementation of many important string operations.The suffix tree for a string S is a tree whose edges are labeled with strings, such that each suffix...
. This began the trend towards compressed suffix array
Compressed suffix array
The Compressed Suffix Array is a compressed data structure for pattern matching. Given a text T of n characters from an alphabet Σ, the compressed suffix array support searching for arbitrary patterns in T...
s and BWT-based compressed full-text indices.
External links
- Various algorithms for constructing Suffix Arrays in Java, with performance tests
- Suffix sorting module for BWT in C code
- Suffix Array Implementation in Ruby
- Suffix array library and tools
- Project containing Suffix Array Implementation in Java
- Project containing various Suffix Array c/c++ Implementations with a unified interface
- A fast, lightweight, and robust C API library to construct the suffix array