Boyer–Moore string search algorithm
The Boyer–Moore string search algorithm is a particularly efficient string searching algorithm, and it has been the standard benchmark for the practical string search literature. It was developed by Bob Boyer and J Strother Moore in 1977. The algorithm preprocesses the target string (key) that is being searched for, but not the string being searched in (unlike some algorithms that preprocess the string to be searched and can then amortize the expense of the preprocessing by searching repeatedly). The execution time of the Boyer-Moore algorithm, while still linear in the size of the string being searched, can have a significantly lower constant factor than many other search algorithms: it doesn't need to check every character of the string to be searched, but rather skips over some of them. Generally the algorithm gets faster as the key being searched for becomes longer. Its efficiency derives from the fact that with each unsuccessful attempt to find a match between the search string and the text it is searching, it uses the information gained from that attempt to rule out as many positions of the text as possible where the string cannot match.
How the algorithm works
What people frequently find surprising about the Boyer-Moore algorithm, when they first encounter it, is that its verifications (its attempts to check whether a match exists at a particular position) work backwards. If it starts a search at the beginning of a text for the word "ANPANMAN", for instance, it checks the eighth position of the text to see if it contains an "N". If it finds the "N", it moves to the seventh position to see if that contains the last "A" of the word, and so on until it checks the first position of the text for an "A".
Why Boyer-Moore takes this backward approach is clearer when we consider what happens if the verification fails, for instance if instead of an "N" in the eighth position we find an "X". The "X" doesn't appear anywhere in "ANPANMAN", so there is no match for the search string at the very start of the text, nor at the next seven positions following it, since those alignments would all fall across the "X" as well. Having inspected just one character, the "X", we can rule out all eight of those alignments, skip ahead, and start looking for a match ending at the sixteenth position of the text.
This explains why the best-case performance of the algorithm, for a text of length $n$ and a fixed pattern of length $m$, is $O(n/m)$: in the best case, only one in $m$ characters needs to be checked. This also explains the somewhat counter-intuitive result that the longer the pattern we are looking for, the faster the algorithm will usually be able to find it.
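The best case can be made concrete with a small counting experiment. The sketch below is not the full algorithm: it is a simplified backward scan that uses only one skipping rule (a mismatched text character that never occurs in the pattern lets the window jump past it) and otherwise shifts by one. The function name `count_inspections` is made up for this illustration.

```python
def count_inspections(text, pattern):
    """Count text characters inspected by a simplified backward scan.

    Assumption for illustration: only the "character absent from the
    pattern" rule is used for skipping; every other mismatch falls back
    to a shift of one position.
    """
    m, n = len(pattern), len(text)
    alphabet = set(pattern)
    inspected, start, hits = 0, 0, []
    while start <= n - m:
        j = m - 1
        # Verify the current alignment from right to left.
        while j >= 0:
            inspected += 1
            if text[start + j] != pattern[j]:
                break
            j -= 1
        if j < 0:
            hits.append(start)
            start += 1
        elif text[start + j] not in alphabet:
            # No alignment covering this character can match: jump past it.
            start += j + 1
        else:
            start += 1
    return inspected, hits


print(count_inspections("X" * 64, "ANPANMAN"))  # (8, [])
```

With a text of 64 characters and a pattern of 8, only 64 / 8 = 8 characters are inspected, one per window.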
The algorithm precomputes two tables to process the information it obtains in each failed verification: one table calculates how many positions ahead to start the next search based on the value of the character that caused the mismatch; the other makes a similar calculation based on how many characters were matched successfully before the match attempt failed. (Because these two tables return results indicating how far ahead in the text to "jump", they are sometimes called "jump tables", which should not be confused with the more common meaning of jump tables in computer science.) When a mismatch occurs, the algorithm shifts by the larger of the two jump values.
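As a rough sketch of how the two tables might be combined, the function below verifies each alignment backwards and, on a mismatch, shifts by the larger of the two jump values. The names `boyer_moore_search`, `FIRST` and `SECOND` are invented for this example, and the tables are passed in ready-made. `FIRST` is indexed by the number of characters matched before the mismatch; `SECOND` maps a character to its distance from the end of the pattern, so the number of characters already matched is subtracted from it. The shift of one after a full match is a conservative simplification chosen here, not something the article prescribes.

```python
def boyer_moore_search(text, pattern, first_table, second_table):
    """Return the start index of every occurrence of pattern in text."""
    m, n = len(pattern), len(text)
    if m == 0:
        return []
    hits, start = [], 0
    while start <= n - m:
        # Backward verification: count characters matched from the right.
        matched = 0
        while (matched < m
               and text[start + m - 1 - matched] == pattern[m - 1 - matched]):
            matched += 1
        if matched == m:
            hits.append(start)
            start += 1  # conservative shift after a full match (assumption)
            continue
        bad = text[start + m - 1 - matched]
        # Characters missing from the second table get the pattern length.
        by_character = second_table.get(bad, m) - matched
        by_matched = first_table[matched]  # always at least 1
        start += max(by_matched, by_character)
    return hits


# Tables for ANPANMAN, worked out from the rules given in the
# "first table" and "second table" sections.
FIRST = [1, 8, 3, 6, 6, 6, 6, 6]
SECOND = {"A": 1, "M": 2, "N": 3, "P": 5}

print(boyer_moore_search("ANPANMANPANMAN", "ANPANMAN", FIRST, SECOND))  # [0, 6]
```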
The first table
Populate the first table as follows. For each i less than the length of the search string, construct the pattern consisting of the last i characters of the string preceded by a mismatched character, right-align the pattern and string, and record the fewest characters the pattern must shift left for a match. For instance, for the search string ANPANMAN, the table would be as follows:
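([^N]MAN signifies a substring consisting of a character that is not 'N' followed by the characters 'MAN'.)
i | Pattern | Left Shift |
---|---|---|
0 | [^N] | 1 |
1 | [^A]N | 8 |
2 | [^M]AN | 3 |
3 | [^N]MAN | 6 |
4 | [^A]NMAN | 6 |
5 | [^P]ANMAN | 6 |
6 | [^N]PANMAN | 6 |
7 | [^A]NPANMAN | 6 |
For i = 0, the next letter to the left in 'ANPANMAN' is not N (it is A), therefore the pattern [^N] matches after a shift of 1. For i = 1, every N in 'ANPANMAN' is preceded by an A, so the pattern [^A]N must shift the full 8 positions.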
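The construction rule for the first table can be sketched by brute force: for each i, try shifts 1, 2, 3, ... until the shifted sub-pattern is compatible with the string. This is a direct, unoptimised reading of the rule, intended only as an illustration; practical implementations build the table in linear time. The names `first_table` and `fits` are invented for this example, and positions that fall off the left end of the string are assumed to match anything.

```python
def first_table(pattern):
    """Shift for each count i of trailing characters already matched."""
    m = len(pattern)

    def fits(i, k):
        # The last i characters, moved left by k, must agree with the string
        # wherever they still overlap it.
        for pos in range(m - i, m):
            if pos - k >= 0 and pattern[pos - k] != pattern[pos]:
                return False
        # The character in front of them must differ from the original one
        # (that is the "mismatched character"), unless it falls off the left.
        q = m - i - 1 - k
        return q < 0 or pattern[q] != pattern[m - i - 1]

    # A shift of m always fits, so the search for k always terminates.
    return [next(k for k in range(1, m + 1) if fits(i, k)) for i in range(m)]


print(first_table("ANPANMAN"))  # [1, 8, 3, 6, 6, 6, 6, 6]
```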
The second table
The second table is easier to calculate: start at the second-to-last character of the sought string and move towards the first character. Each time you move left, if the character you are on is not in the table already, add it; its shift value is its distance from the rightmost character. All other characters receive a shift equal to the length of the search string.
Example: For the string ANPANMAN, the second table would be as shown (for clarity, entries are shown in the order they would be added to the table):
(The final N, whose distance would be zero, is skipped; the entry for N is based on the second N from the right, because only the first occurrence of each letter found while moving left is recorded.)
Character | Shift |
---|---|
A | 1 |
M | 2 |
N | 3 |
P | 5 |
all other characters | 8 |
The amount of shift calculated by the second table is sometimes called the "bad character shift".
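The second table is simple enough to sketch in a few lines. The version below follows the description above under the assumption that the final character of the pattern is skipped, which is what the example table for ANPANMAN implies; the function names are invented for this illustration.

```python
def second_table(pattern):
    """Map each character to its distance from the rightmost position.

    The final character is skipped, so the walk starts one position to
    its left and only the first sighting of each character is recorded.
    """
    last = len(pattern) - 1
    table = {}
    for i in range(last - 1, -1, -1):
        table.setdefault(pattern[i], last - i)
    return table


def lookup(table, pattern, character):
    """Characters absent from the table get the full pattern length."""
    return table.get(character, len(pattern))


table = second_table("ANPANMAN")
print(table)                         # {'A': 1, 'M': 2, 'N': 3, 'P': 5}
print(lookup(table, "ANPANMAN", "X"))  # 8
```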
Performance of the Boyer-Moore string search algorithm
In the worst case, finding all occurrences in an aperiodic text needs approximately $3n$ comparisons, where $n$ is the length of the text; hence the complexity is $O(n)$, regardless of whether the text contains a match or not. This bound took some years to establish. In 1977, the year the algorithm was devised, a weaker linear upper bound on the number of comparisons was shown; in 1980 a tighter one followed, and that stood until Cole's result in September 1991.
Variants
The Turbo Boyer-Moore algorithm takes an additional constant amount of space to complete a search within $2n$ comparisons (as opposed to $3n$ for Boyer-Moore), where $n$ is the number of characters in the text to be searched.
The Boyer-Moore-Horspool algorithm is a simplification of the Boyer-Moore algorithm that leaves out the "first table". The Boyer-Moore-Horspool algorithm requires (in the worst case) $mn$ comparisons, where $m$ is the length of the pattern, while the Boyer-Moore algorithm requires (in the worst case) only $3n$ comparisons.
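A minimal sketch of the Horspool simplification, assuming the usual formulation in which the shift is looked up from the text character aligned with the last position of the pattern. The function name `horspool_search` is invented here, and the window comparison uses a plain slice for brevity rather than an explicit backward loop.

```python
def horspool_search(text, pattern):
    """Return the start index of every occurrence of pattern in text."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    # Distance from the last position for every character except the final
    # one; later (more rightward) occurrences overwrite earlier ones.
    shift = {}
    for i in range(m - 1):
        shift[pattern[i]] = m - 1 - i
    hits, start = [], 0
    while start <= n - m:
        if text[start:start + m] == pattern:
            hits.append(start)
        # Only the character under the pattern's last position decides the jump.
        start += shift.get(text[start + m - 1], m)
    return hits


print(horspool_search("XXANPANMANXXANPANMAN", "ANPANMAN"))  # [2, 12]
```

Because only one table is consulted, the preprocessing is cheaper, at the cost of the weaker worst-case bound mentioned above.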
External links
- String Searching Applet animation
- Original article
- An example of the Boyer-Moore algorithm from the homepage of J Strother Moore, co-inventor of the algorithm
- An explanation of the algorithm (with sample C code)
- Cole et al., Tighter lower bounds on the exact complexity of string matching
- An implementation of the algorithm in Ruby
- Scala functional implementation with source code