Aho–Corasick string matching algorithm

The Aho–Corasick string matching algorithm is a string searching algorithm

String searching algorithm

String searching algorithms, sometimes called string matching algorithms, are an important class of string algorithms that try to find a place where one or several strings are found within a larger string or text....

invented by Alfred V. Aho and Margaret J. Corasick. It is a kind of dictionary-matching algorithm that locates elements of a finite set of strings (the "dictionary") within an input text. It matches all patterns simultaneously. The complexity

Computational complexity theory

Computational complexity theory is a branch of the theory of computation in theoretical computer science and mathematics that focuses on classifying computational problems according to their inherent difficulty, and relating those classes to each other...

of the algorithm is linear in the length of the patterns plus the length of the searched text plus the number of output matches. Note that because all matches are found, there can be a quadratic number of matches if every substring matches (e.g. dictionary = a, aa, aaa, aaaa and input string is aaaa).

Informally, the algorithm constructs a finite state machine

Finite state machine

A finite-state machine or finite-state automaton , or simply a state machine, is a mathematical model used to design computer programs and digital logic circuits. It is conceived as an abstract machine that can be in one of a finite number of states...

that resembles a trie

Trie

In computer science, a trie, or prefix tree, is an ordered tree data structure that is used to store an associative array where the keys are usually strings. Unlike a binary search tree, no node in the tree stores the key associated with that node; instead, its position in the tree defines the...

with additional links between the various internal nodes. These extra internal links allow fast transitions between failed pattern matches (e.g. a search for cat in a trie that does not contain cat, but contains cart, and thus would fail at the node prefixed by ca), to other branches of the trie that share a common prefix (e.g., in the previous case, a branch for attribute might be the best lateral transition). This allows the automaton to transition between pattern matches without the need for backtracking.

When the pattern dictionary is known in advance (e.g. a computer virus

Computer virus

A computer virus is a computer program that can replicate itself and spread from one computer to another. The term "virus" is also commonly but erroneously used to refer to other types of malware, including but not limited to adware and spyware programs that do not have the reproductive ability...

database), the construction of the automaton can be performed once off-line and the compiled automaton stored for later use. In this case, its run time is linear in the length of the input plus the number of matched entries.

The Aho–Corasick string matching algorithm formed the basis of the original Unix command fgrep.

The following is the Aho–Corasick data structure constructed from the specified dictionary, with each row in the table representing a node in the trie, with the column path indicating the (unique) sequence of characters from the root to the node.

Dictionary {a, ab, bc, bca, c, caa}
Path	In Dictionary	Suffix Link	Dict Suffix Link
	-
(a)	+
(ab)	+	(b)
(b)	-
(bc)	+	(c)	(c)
(bca)	+	(ca)	(a)
(c)	+
(ca)	-	(a)	(a)
(caa)	+	(a)	(a)

At each step, the current node is extended by finding its child,
and if that doesn't exist, finding its suffix's child, and if
that doesn't work, finding its suffix's suffix's child, finally
ending in the root node if nothing's seen before.

Execution on input string abccab yields the following steps:

Analysis of input string `abccab`
Node	Remaining String	Output:End Position	Transition	Output
	abccab		start at root
(a)	bccab	a:1	to child (a)	Current node
(ab)	ccab	ab:2	(a) to child (ab)	Current node
(bc)	cab	bc:3, c:3	(ab) to suffix (b) to child (bc)	Current Node, Dict suffix node
(c)	ab	c:4	(bc) to suffix (c) to suffix to child (c)	Current node
(ca)	b	a:5	(c) to child (ca)	Dict suffix node
(ab)		ab:6	(ca) to suffix (a) to child (ab)	Current node

In general, more than one dictionary suffix link may need to
be followed, as more than one dictionary entry may end at a
given character in the input.

External links

Set Matching and Aho–Corasick Algorithm, lecture slides by Pekka Kilpeläinen
Aho–Corasick entry in NIST's Dictionary of Algorithms and Data Structures
Dictionary of Algorithms and Data Structures
The Dictionary of Algorithms and Data Structures is a dictionary style reference for many of the algorithms, algorithmic techniques, archetypal problems and data structures found in the field of computer science. The dictionary is maintained by Paul E...
Aho–Corasick string matching in C#, a CodeProject tutorial by Tomáš Petříček
Aho-Corasick string matching in PHP, optimized for terms-filtering
Aho-Corasick C implementation
Aho-Corasick Python implementation
Aho-Corasick Java implementation

The source of this article is wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.