Damerau-Levenshtein distance - AbsoluteAstronomy.com

Information theory

Information theory is a branch of applied mathematics and electrical engineering involving the quantification of information. Information theory was developed by Claude E. Shannon to find fundamental limits on signal processing operations such as compressing data and on reliably storing and...

and computer science

Computer science

Computer science or computing science is the study of the theoretical foundations of information and computation and of practical techniques for their implementation and application in computer systems...

, the Damerau–Levenshtein distance (named after Frederick J. Damerau

Frederick J. Damerau

Frederick J. Damerau was a pioneer of research on natural language processing and data mining.After earning his B.A. from Cornell University in 1953, he spent most of his career at IBM, in the Thomas J. Watson Research Center....

and Vladimir I. Levenshtein) is a "distance" (string metric

String metric

In mathematics, string metrics are a class of metric that measure similarity or dissimilarity between two text strings for approximate string matching or comparison and in fuzzy string searching. For example the strings "Sam" and "Samuel" can be considered to be similar...

) between two strings

String (computer science)

In formal languages, which are used in mathematical logic and theoretical computer science, a string is a finite sequence of symbols that are chosen from a set or alphabet....

, i.e., finite sequence of symbols, given by counting the minimum number of operations needed to transform one string into the other, where an operation is defined as an insertion, deletion, or substitution of a single character, or a transposition of two adjacent characters. In his seminal paper, Damerau not only distinguished these four edit operations but also stated that they correspond to more than 80% of all human misspellings. Damerau's paper considered only misspellings that could be corrected with at most one edit operation. The corresponding edit distance

Edit distance

In information theory and computer science, the edit distance between two strings of characters generally refers to the Levenshtein distance. However, according to Nico Jacobs, “The term ‘edit distance’ is sometimes used to refer to the distance in which insertions and deletions have equal cost and...

, i.e., dealing with multiple edit operations, known as the Levenshtein distance

Levenshtein distance

In information theory and computer science, the Levenshtein distance is a string metric for measuring the amount of difference between two sequences...

, was introduced by Levenshtein, but it did not include transpositions in the set of basic operations. The name Damerau–Levenshtein distance is used to refer to the edit distance that allows multiple edit operations including transpositions, although it is not clear whether the term Damerau–Levenshtein distance is sometimes used in some sources as to take into account non-adjacent transpositions or not.

While the original motivation was to measure distance between human misspellings to improve applications such as spell checker

Spell checker

In computing, a spell checker is an application program that flags words in a document that may not be spelled correctly. Spell checkers may be stand-alone capable of operating on a block of text, or as part of a larger application, such as a word processor, email client, electronic dictionary,...

s, Damerau–Levenshtein distance has also seen uses in biology to measure the variation between DNA

DNA

Deoxyribonucleic acid is a nucleic acid that contains the genetic instructions used in the development and functioning of all known living organisms . The DNA segments that carry this genetic information are called genes, but other DNA sequences have structural purposes, or are involved in...

Algorithm

Adding transpositions sounds simple, but in reality there is a serious complication. Presented here are two algorithms: the first, simpler one, computes what is known as the optimal string alignment (sometimes called the restricted edit distance), while the second one computes the Damerau–Levenshtein distance with adjacent transpositions. The difference between the two algorithms consists in that the optimal string alignment algorithm computes the number of edit operations needed to make the strings equal under the condition that no substring is edited more than once, whereas the second one presents no such restriction.

Take for example the edit distance between CA and ABC. The Damerau-Levenshtein distance LD(CA,ABC) = 2 because CA -> AC -> ABC, but the optimal string alignment distance OSA(CA,ABC) = 3 because if the operation CA -> AC is used, it is not possible to use AC -> ABC because that would require the substring to be edited more than once, which is not allowed in OSA, and therefore the shortest sequence of operations is CA -> A -> AB -> ABC. Note that for the optimal string alignment distance, the triangle inequality

Triangle inequality

In mathematics, the triangle inequality states that for any triangle, the sum of the lengths of any two sides must be greater than or equal to the length of the remaining side ....

does not hold: OSA(CA,AC) + OSA(AC,ABC) < OSA(CA,ABC), and so it is not a true metric.

Firstly, let us consider a direct extension of the formula used to calculate Levenshtein distance

Levenshtein distance

In information theory and computer science, the Levenshtein distance is a string metric for measuring the amount of difference between two sequences...

. Below is pseudocode

Pseudocode

In computer science and numerical computation, pseudocode is a compact and informal high-level description of the operating principle of a computer program or other algorithm. It uses the structural conventions of a programming language, but is intended for human reading rather than machine reading...

for a function OptimalStringAlignmentDistance that takes two strings, str1 of length lenStr1, and str2 of length lenStr2, and computes the optimal string alignment distance between them:

int OptimalStringAlignmentDistance(char str1[1..lenStr1], char str2[1..lenStr2])
// d is a table with lenStr1+1 rows and lenStr2+1 columns
declare int d[0..lenStr1, 0..lenStr2]
// i and j are used to iterate over str1 and str2
declare int i, j, cost
//for loop is inclusive, need table 1 row/column larger than string length.
for i from 0 to lenStr1
d[i, 0] := i
for j from 1 to lenStr2
d[0, j] := j
//Pseudo-code assumes string indices start at 1, not 0.
//If implemented, make sure to start comparing at 1st letter of strings.
for i from 1 to lenStr1
for j from 1 to lenStr2
if str1[i] = str2[j] then cost := 0
else cost := 1
d[i, j] := minimum(
d[i-1, j ] + 1, // deletion
d[i , j-1] + 1, // insertion
d[i-1, j-1] + cost // substitution
)
if(i > 1 and j > 1 and str1[i] = str2[j-1] and str1[i-1] = str2[j]) then
d[i, j] := minimum(
d[i, j],
d[i-2, j-2] + cost // transposition
)

return d[lenStr1, lenStr2]

Basically this is the algorithm to compute Levenshtein distance

Levenshtein distance

In information theory and computer science, the Levenshtein distance is a string metric for measuring the amount of difference between two sequences...

with one additional recurrence:

if(i > 1 and j > 1 and str1[i] = str2[j-1] and str1[i-1] = str2[j]) then
d[i, j] := minimum(
d[i, j],
d[i-2, j-2] + cost // transposition
)

Here is the second algorithm that computes the true Damerau-Levenshtein distance with adjacent transpositions (ActionScript 3.0); this function requires as an additional parameter the size of the alphabet (C), so that all entries of the arrays are in 0..(C−1):

static public function damerauLevenshteinDistance(a:Array, b:Array, C:uint):uint
{
var INF:uint = a.length + b.length;
var H:matrix = new matrix(a.length+2,b.length+2);
H[0][0] = INF;
for(var i:uint = 0; i<=a.length; ++i) {H[i+1][1] = i; H[i+1][0] = INF;}
for(var j:uint = 0; j<=b.length; ++j) {H[1][j+1] = j; H[0][j+1] = INF;}
var DA:Array = new Array(C);
for(var d:uint = 0; d for(var i:uint = 1; i<=a.length; ++i)
{
var DB:uint = 0;
for(var j:uint = 1; j<=b.length; ++j)
{
var i1:uint = DA[b[j-1]];
var j1:uint = DB;
var d:uint = ((a[i-1]

b[j-1])?0:1);
if(d

0) DB = j;
H[i+1][j+1] = Math.min(H[i][j]+d, H[i+1][j] + 1, H[i][j+1]+1,
H[i1][j1] + (i-i1-1) + 1 + (j-j1-1));
}
DA[a[i-1]] = i;
}
return H[a.length+1][b.length+1];
}

public class matrix extends Array
{
public var rows:uint, cols:uint;
public function matrix(nrows:int=0, ncols:int=-1)
{
super(nrows);
if(ncols -1) ncols = nrows;
for(var i:uint = 0; i {
this[i] = new Array(ncols);
}
rows = nrows; cols = ncols;
}
}

using the C# language computes the true Damerau-Levenshtein distance with adjacent transpositions.
public static Int32 DamerauLevenshteinDistance(String source, String target)
{
if (String.IsNullOrEmpty(source))
{
if (String.IsNullOrEmpty(target))
{
return 0;
}
else
{
return target.Length;
}
}
else if (String.IsNullOrEmpty(target))
{
return source.Length;
}

Int32 m = source.Length;
Int32 n = target.Length;
Int32[,] H = new Int32[m + 2, n + 2];

Int32 INF = m + n;
H[0, 0] = INF;
for (Int32 i = 0; i <= m; i++) { H[i + 1, 1] = i; H[i + 1, 0] = INF; }
for (Int32 j = 0; j <= n; j++) { H[1, j + 1] = j; H[0, j + 1] = INF; }

SortedDictionary sd = new SortedDictionary;
foreach (Char Letter in (source + target))
{
if (!sd.ContainsKey(Letter))
sd.Add(Letter, 0);
}

for (Int32 i = 1; i <= m; i++)
{
Int32 DB = 0;
for (Int32 j = 1; j <= n; j++)
{
Int32 i1 = sd[target[j - 1]];
Int32 j1 = DB;

if (source[i - 1] target[j - 1])
{
H[i + 1, j + 1] = H[i, j];
DB = j;
}
else
{
H[i + 1, j + 1] = Math.Min(H[i, j], Math.Min(H[i + 1, j], H[i, j + 1])) + 1;
}

H[i + 1, j + 1] = Math.Min(H[i + 1, j + 1], H[i1, j1] + (i - i1 - 1) + 1 + (j - j1 - 1));
}

sd[source[i - 1]] = i;
}

return H[m + 1, n + 1];
}

(Note: the algorithm given in the paper uses alphabet 1..C rather than 0..C−1, and indexes arrays differently: H[−1..|A|,−1..|B|] rather than H[0..|A|+1,0..|B|+1], DA[1..C] rather than DA[0..C−1]; the paper seems to be missing the necessary line H[−1,−1] = INF)

To devise a proper algorithm to calculate unrestricted Damerau–Levenshtein distance note that there always exists an optimal sequence of edit operations, where once-transposed letters are never modified afterwards. Thus, we need to consider only two symmetric ways of modifying a substring more than once: (1) transpose letters and insert an arbitrary number of characters between them, or (2) delete a sequence of characters and transpose letters that become adjacent after deletion. The straightforward implementation of this idea gives an algorithm of cubic complexity:

, where M and N are string lengths. Using the ideas of Lowrance and Wagner, this naive algorithm can be improved to be

in the worst case.

It is interesting that the bitap algorithm

Bitap algorithm

The bitap algorithm is an approximate string matching algorithm...

can be modified to process transposition. See the information retrieval section of for an example of such an adaptation.

Applications

The first algorithm computes only the restricted edit distance. Damerau–Levenshtein distance plays an important role in natural language processing

Natural language processing

Natural language processing is a field of computer science and linguistics concerned with the interactions between computers and human languages; it began as a branch of artificial intelligence....

. In natural languages, strings are short and the number of errors (misspellings) rarely exceeds 2. In such circumstances, restricted and real edit distance differ very rarely. That is why this limitation is not very important. However, one must remember that restricted edit distance does not always satisfy the triangle inequality

Triangle inequality

In mathematics, the triangle inequality states that for any triangle, the sum of the lengths of any two sides must be greater than or equal to the length of the remaining side ....

and, thus, cannot be used with metric trees.

DNA

Since DNA

DNA

frequently undergoes insertions, deletions, substitutions, and transpositions, and each of these operations occurs on approximately the same timescale, the Damerau–Levenshtein distance is an appropriate metric of the variation between two strands of DNA. More common in DNA, protein, and other bioinformatics related alignment tasks is the use of closely related algorithms such as Needleman-Wunsch or Smith-Waterman.

Fraud detection

The algorithm can be used with any set of words, like vendor names. Since entry is manual by nature there is a risk of entering false vendor. A fraudster employee may enter one real vendor such as "Rich Heir Estate Services" versus a false vendor "Rich Hier State Services". The fraudster would then create a false bank account and have the company route checks to the real vendor and false vendor. The Damerau–Levenshtein algorithm will detect the transposed and dropped letter and bring attention of the items to a fraud examiner.

External links

The source of this article is wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.

Algorithm

b[j-1])?0:1); if(d

Applications

DNA

Fraud detection

External links

b[j-1])?0:1);
if(d