Hamming distance
In information theory, the Hamming distance between two strings of equal length is the number of positions at which the corresponding symbols are different. Put another way, it measures the minimum number of substitutions required to change one string into the other, or the minimum number of errors that could have transformed one string into the other.
Examples
The Hamming distance between:
- "toned" and "roses" is 3.
- 1011101 and 1001001 is 2.
- 2173896 and 2233796 is 3.
Special properties
For a fixed length n, the Hamming distance is a metric on the set of words of that length: it fulfills the conditions of non-negativity, identity of indiscernibles and symmetry directly from the definition, and induction on the word length shows that it satisfies the triangle inequality as well. The Hamming distance between two words a and b can also be seen as the Hamming weight of a−b for an appropriate choice of the − operator.
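As an illustration of the last remark, the sketch below assumes the symbols are integers modulo q and takes − to be componentwise subtraction modulo q. That choice of operator and the helper names `hamming_weight` and `difference_mod_q` are assumptions made for this example only.

```python
def hamming_weight(word, zero=0):
    """Count the symbols of `word` that differ from the zero symbol."""
    return sum(1 for symbol in word if symbol != zero)


def difference_mod_q(a, b, q):
    """Componentwise a - b over the integers modulo q (assumed alphabet)."""
    if len(a) != len(b):
        raise ValueError("words must have equal length")
    return [(x - y) % q for x, y in zip(a, b)]


# The digit strings 2173896 and 2233796, read as words over the integers mod 10.
a = [2, 1, 7, 3, 8, 9, 6]
b = [2, 2, 3, 3, 7, 9, 6]
weight = hamming_weight(difference_mod_q(a, b, 10))
print(weight)  # 3, the number of positions where a and b differ
```

A position contributes to the weight exactly when the two symbols differ, because x − y is zero modulo q only when x and y are the same symbol.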
For binary strings a and b the Hamming distance is equal to the number of ones (population count) in a XOR b. The metric space of length-n binary strings, with the Hamming distance, is known as the Hamming cube; it is equivalent as a metric space to the set of vertices of a hypercube graph under shortest-path distance. One can also view a binary string of length n as a vector in R^n by treating each symbol in the string as a real coordinate; with this embedding, the strings form the vertices of an n-dimensional hypercube, and the Hamming distance of the strings is equivalent to the Manhattan distance between the vertices.
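Both views of the binary case can be checked on a small example. The following is a minimal sketch; the function names are illustrative, and the integers are assumed to be non-negative.

```python
def hamming_distance_xor(a: int, b: int) -> int:
    """Hamming distance of two non-negative integers read as bit strings."""
    return bin(a ^ b).count("1")  # population count of a XOR b


def manhattan_distance(u, v):
    """Sum of absolute coordinate differences between two vectors."""
    return sum(abs(x - y) for x, y in zip(u, v))


s, t = "1011101", "1001001"
via_xor = hamming_distance_xor(int(s, 2), int(t, 2))
via_manhattan = manhattan_distance([int(c) for c in s], [int(c) for c in t])
print(via_xor, via_manhattan)  # 2 2
```

The two results agree because each coordinate is 0 or 1, so an absolute difference of 1 marks exactly the positions where the bits differ.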
History and applications
The Hamming distance is named after Richard Hamming, who introduced it in his fundamental paper on Hamming codes, "Error detecting and error correcting codes", in 1950. It is used in telecommunication to count the number of flipped bits in a fixed-length binary word as an estimate of error, and therefore is sometimes called the signal distance. Hamming weight analysis of bits is used in several disciplines including information theory, coding theory, and cryptography. However, for comparing strings of different lengths, or strings where not just substitutions but also insertions or deletions have to be expected, a more sophisticated metric like the Levenshtein distance is more appropriate.
For q-ary strings over an alphabet of size q ≥ 2 the Hamming distance is applied in case of orthogonal modulation, while the Lee distance is used for phase modulation. If q = 2 or q = 3 both distances coincide.
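The coincidence for q = 2 and q = 3 can be checked exhaustively on short words. The sketch below assumes the usual definition of the Lee distance on symbols 0..q−1, namely the sum over positions of min(|x − y|, q − |x − y|). The function names and the word length n = 2 are choices made for this example.

```python
from itertools import product


def hamming_distance(a, b):
    """Number of positions at which two equal-length words differ."""
    if len(a) != len(b):
        raise ValueError("words must have equal length")
    return sum(x != y for x, y in zip(a, b))


def lee_distance(a, b, q):
    """Lee distance of two equal-length words with symbols in 0..q-1."""
    if len(a) != len(b):
        raise ValueError("words must have equal length")
    total = 0
    for x, y in zip(a, b):
        d = abs(x - y)
        total += min(d, q - d)  # shorter way around the cycle of q symbols
    return total


def distances_coincide(q, n=2):
    """Compare both distances on every pair of length-n words over 0..q-1."""
    words = list(product(range(q), repeat=n))
    return all(
        hamming_distance(a, b) == lee_distance(a, b, q)
        for a in words
        for b in words
    )


for q in (2, 3, 4):
    print(q, distances_coincide(q))  # True for 2 and 3, False for 4
```

For q = 4 the symbols 0 and 2 are at Lee distance 2 but Hamming distance 1, which is where the two distances first separate.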
The Hamming distance is also used in systematics as a measure of genetic distance.
On a grid (such as a chessboard), the points at a Lee distance of 1 from a given point constitute the von Neumann neighborhood of that point.
Algorithm example
The Python function hamming_distance computes the Hamming distance between two strings (or other iterable objects) of equal length, by creating a sequence of one and zero values indicating mismatches and matches, respectively, between corresponding positions in the two inputs, and then summing the sequence.
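A minimal sketch consistent with this description might look as follows. Only the name `hamming_distance` comes from the text; the parameter names and the use of `ValueError` for unequal lengths are assumptions.

```python
def hamming_distance(s1, s2):
    """Return the number of positions at which s1 and s2 differ."""
    if len(s1) != len(s2):
        raise ValueError("inputs must have equal length")
    # Each comparison yields True (1) for a mismatch and False (0) for a match.
    return sum(ch1 != ch2 for ch1, ch2 in zip(s1, s2))


print(hamming_distance("toned", "roses"))      # 3
print(hamming_distance("1011101", "1001001"))  # 2
print(hamming_distance("2173896", "2233796"))  # 3
```

The length check matters because `zip` stops at the shorter input, so without it strings of different lengths would be compared silently on a prefix only.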
A C function can compute the Hamming distance of two integers (considered as binary values, that is, as sequences of bits) in a running time proportional to the Hamming distance rather than to the number of bits in the inputs. It computes the bitwise exclusive or of the two inputs, and then finds the Hamming weight of the result (the number of nonzero bits) using an algorithm that repeatedly finds and clears the lowest-order nonzero bit.
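The bit-clearing technique can be sketched in Python as shown below. This is not the C listing the text refers to. The function name `hamming_distance_bits` is hypothetical, and the restriction to non-negative inputs is an assumption needed because Python integers are unbounded.

```python
def hamming_distance_bits(x: int, y: int) -> int:
    """Hamming distance of two non-negative integers viewed as bit strings."""
    if x < 0 or y < 0:
        raise ValueError("inputs must be non-negative")
    diff = x ^ y          # set bits mark the positions where x and y differ
    count = 0
    while diff:
        diff &= diff - 1  # clear the lowest-order set bit
        count += 1
    return count


print(hamming_distance_bits(0b1011101, 0b1001001))  # 2
```

The loop body runs once per set bit of `x ^ y`, which is why the number of iterations equals the Hamming distance rather than the word size.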
See also
- Damerau–Levenshtein distance
- Euclidean distance
- Jaccard index
- Similarity (mathematics)
- Sørensen similarity index
- Word golf
External links
- Hamming Code Tool: a tool to generate Hamming codes.
- set_matcher: a tool to match two families of sets from the same base population using Hamming distance.