Hash function
Encyclopedia
A hash function is any algorithm
or subroutine
that maps large data sets
to smaller data sets, called keys. For example, a single integer
can serve as an index to an array
(cf.
associative array). The values returned by a hash function are called hash values, hash codes, hash sums, checksums or simply hashes.
Hash functions are mostly used to accelerate table lookup or data comparison tasks such as finding items in a database
, detecting duplicated or similar records in a large file, finding similar stretches in DNA
sequences, and so on.
A hash function should be referentially transparent, i.e. if called twice on input that is "equal" (e.g. strings that consist of the same sequence of characters), it should give the same result. This is a contract in many programming languages that allow the user to override equality and hash functions for an object, that if two objects are equal their hash codes must be the same. This is important in order for it to be possible to find an element in a hash table quickly since two of the same element would both hash to the same slot.
Some hash functions may map two or more keys to the same hash value, causing a collision. Such hash functions try to map the keys to the hash values as evenly as possible because, as Hash Tables fill up, collisions become more frequent. Thus single digit hash values are frequently restricted to 80% of the size of the Table. Depending on the algorithm used, other properties may be required as well, such as double Hashing and Linear Probing. Although the idea was conceived in the 1950s, the design of good hash functions is still a topic of active research.
Hash functions are related to (and often confused with) checksum
s, check digit
s, fingerprint
s, randomization function
s, error correcting codes, and cryptographic hash function
s. Although these concepts overlap to some extent, each has its own uses and requirements and is designed and optimized differently. The HashKeeper
database maintained by the American National Drug Intelligence Center
, for instance, is more aptly described as a catalog of file fingerprints than of hash values.
s, to quickly locate a data record (for example, a dictionary
definition) given its search key (the headword). Specifically, the hash function is used to map the search key to the hash. The index gives the place where the corresponding record should be stored. Hash tables, in turn, are used to implement associative arrays and dynamic sets.
In general, a hashing function may map several different keys to the same index. Therefore, each slot of a hash table is associated with (implicitly or explicitly) a set of records, rather than a single record. For this reason, each slot of a hash table is often called a bucket, and hash values are also called bucket indices.
Thus, the hash function only hints at the record's location—it tells where one should start looking for it. Still, in a half-full table, a good hash function will typically narrow the search down to only one or two entries.
, a compact data structure that provides an enclosing approximation to a set of them.
This class includes the so-called acoustic fingerprint
algorithms, that are used to locate similar-sounding entries in large collection of audio files. For this application, the hash function must be as insensitive as possible to data capture or transmission errors, and to "trivial" changes such as timing and volume changes, compression, etc.
. In this case, the input strings are broken into many small pieces, and a hash function is used to detect potentially equal pieces, as above.
The Rabin-Karp algorithm
is a relatively fast string searching algorithm
that works in O(n)
time on average. It is based on the use of hashing to compare strings.
, computational geometry
and many other disciplines, to solve many proximity problems in the plane or in three-dimensional space, such as finding closest pairs in a set of points, similar shapes in a list of shapes, similar image
s in an image database
, and so on. In these applications, the set of all inputs is some sort of metric space
, and the hashing function can be interpreted as a partition of that space into a grid of cells. The table is often an array with two or more indices (called a grid file
, grid index, bucket grid, and similar names), and the hash function returns an index tuple. This special case of hashing is known as geometric hashing
or the grid method. Geometric hashing is also used in telecommunication
s (usually under the name vector quantization
) to encode and compress
multi-dimensional signals.
s, checksums, etc.).
solution will be more efficient than a self-balancing binary tree if the number of items is large and the hash function produces few collisions and less efficient if the number of items is small and the hash function is complex.
—meaning that for a given input value it must always generate the same hash value. In other words, it must be a function
of the hashed data, in the mathematical sense of the term. This requirement excludes hash functions that depend on external variable parameters, such as pseudo-random number generators or the time of day. It also excludes functions that depend on the memory address of the object being hashed, because that address may change during execution (as may happen on systems that use certain methods of garbage collection
), although sometimes rehashing of the item is possible).
. The reason for this last requirement is that the cost of hashing-based methods goes up sharply as the number of collisions—pairs of inputs that are mapped to the same hash value—increases. Basically, if some hash values are more likely to occur than others, a larger fraction of the lookup operations will have to search through a larger set of colliding table entries.
Note that this criterion only requires the value to be uniformly distributed, not random in any sense. A good randomizing function is (barring computational efficiency concerns) generally a good choice as a hash function, but the converse need not be true.
Hash tables often contain only a small subset of the valid inputs. For instance, a club membership list may contain only a hundred or so member names, out of the very large set of all possible names. In these cases, the uniformity criterion should hold for almost all typical subsets of entries that may be found in the table, not just for the global set of all possible entries.
In other words, if a typical set of m records is hashed to n table slots, the probability of a bucket receiving many more than m/n records should be vanishingly small. In particular, if m is less than n, very few buckets should have more than one or two records. (In an ideal "perfect hash function
", no bucket should have more than one record; but a small number of collisions is virtually inevitable, even if n is much larger than m -- see the birthday paradox
).
When testing a hash function, the uniformity of the distribution of hash values can be evaluated by the chi-squared test.
A common solution is to compute a fixed hash function with a very large range (say, 0 to 232−1), divide the result by n, and use the division's remainder
. If n is itself a power of 2, this can be done by bit masking and bit shifting. When this approach is used, the hash function must be chosen so that the result has fairly uniform distribution between 0 and n−1, for any n that may occur in the application. Depending on the function, the remainder may be uniform only for certain n, e.g. odd or prime number
s.
It is possible to relax the restriction of the table size being a power of 2 and not having to perform any modulo, remainder or division operation -as these operation are considered computational costly in some contexts. For example, when n is significantly less than 2b begin with a pseudo random number generator (PRNG) function P(key), uniform on the interval [0, 2b−1]. Consider the ratio q = 2b / n. Now the hash function can be seen as the value of P(key) / q. Rearranging the calculation and replacing the 2b-division by bit shifting right (>>) b times you end up with hash function n * P(key) >> b.
A hash function that will relocate the minimum number of records when the table is resized is desirable.
What is needed is a hash function H(z,n) – where z is the key being hashed and n is the number of allowed hash values – such that H(z,n+1) = H(z,n) with probability close to n/(n+1).
Linear hashing and spiral storage are examples of dynamic hash functions that execute in constant time but relax the property of uniformity to achieve the minimal movement property.
Extendible hashing
uses a dynamic hash function that requires space proportional to n to compute the hash function, and it becomes a function of the previous keys that have been inserted.
Several algorithms that preserve the uniformity property but require time proportional to n to compute the value of H(z,n) have been invented.
criterion being used: that is, any two inputs that are considered equivalent must yield the same hash value. This can be accomplished by normalizing the input before hashing it, as by upper-casing all letters.
as possible; two inputs that differ by a little should be mapped to equal or nearly equal hash values.
Note that continuity is usually considered a fatal flaw for checksums, cryptographic hash function
s, and other related concepts. Continuity is desirable for hash functions only in some applications, such as hash tables that use linear search
.
in the intended application.
) hash function is effectively zero. This hash function is perfect
, as it maps each input to a distinct hash value.
The meaning of "small enough" depends on the size of the type that is used as the hashed value. For example, in Java
, the hash code is a 32-bit integer. Thus the 32-bit integer
Other types of data can also use this perfect hashing scheme. For example, when mapping character strings between upper and lower case, one can use the binary encoding of each character, interpreted as an integer, to index a table that gives the alternative form of that character ("A" for "a", "8" for "8", etc.). If each character is stored in 8 bits (as in ASCII
or ISO Latin 1), the table has only 28 = 256 entries; in the case of Unicode
characters, the table would have 17×216 = 1114112 entries.
The same technique can be used to map two-letter country codes
like "us" or "za" to country names (262=676 table entries), 5-digit zip codes like 13083 to city names (100000 entries), etc. Invalid data values (such as the country code "xx" or the zip code 00000) may be left undefined in the table, or mapped to some appropriate "null" value.
—that is, maps each valid input to a different hash value—is said to be perfect
. With such a function one can directly locate the desired entry in a hash table, without any additional searching.
(such as telephone
numbers, car license plate
s, invoice
numbers, etc.), and each input may independently
occur with uniform probability, then a hash function need only map roughly the same number of inputs to each hash value. For instance, suppose that each input is an integer z in the range 0 to N−1, and the output must be an integer h in the range 0 to n−1, where N is much larger than n. Then the hash function could be h = z mod n (the remainder of z divided by n), or h = (z × n) ÷ N (the value z scaled down by n/N and truncated to an integer), or many other formulas.
Warning: h = z mod n was used in many of the original random number generators, but was found to have a number of issues. One of which is that as n approaches N, this function becomes less and less uniform.
will live in the same geographic area, so their telephone numbers are likely to begin with the same 3 to 4 digits. In that case, if n is 10000 or so, the division formula (z × n) ÷ N, which depends mainly on the leading digits, will generate a lot of collisions; whereas the remainder formula z mod n, which is quite sensitive to the trailing digits, may still yield a fairly even distribution.
has highly non-uniform distributions of character
s, and character pairs, very characteristic of the language. For such data, it is prudent to use a hash function that depends on all characters of the string—and depends on each character in a different way.
In cryptographic hash functions, a Merkle–Damgård construction is usually used. In general, the scheme for hashing such data is to break the input into a sequence of small units (bit
s, byte
s, words, etc.) and combine all the units b[1], b[2], ..., b[m] sequentially, as follows
This schema is also used in many text checksum and fingerprint algorithms. The state variable S may be a 32- or 64-bit unsigned integer; in that case, S0 can be 0, and G(S,n) can be just S mod n. The best choice of F is a complex issue and depends on the nature of the data. If the units b[k] are single bits, then F(S,b) could be, for instance
Here highbit(S) denotes the most significant bit of S; the '*' operator denotes unsigned integer multiplication with lost overflow
; '^' is the bitwise exclusive or operation applied to words; and P is a suitable fixed word.
, one must compute a hash function h for every k-character substring
of a given n-character string t; where k is a fixed integer, and n is k. The straightforward solution, which is to extract every such substring s of t and compute h(s) separately, requires a number of operations proportional to k·n. However, with the proper choice of h, one can use the technique of rolling hash to compute all those hashes with an effort proportional to k+n.
scheme is a randomized algorithm
that selects a hashing function h among a family of such functions, in such a way that the probability of a collision of any two distinct keys is 1/n, where n is the number of distinct hash values desired—independently of the two keys. Universal hashing ensures (in a probabilistic sense) that the hash function application will behave as well as if it were using a random function, for any distribution of the input data. It will however have more collisions than perfect hashing, and may require more operations than a special-purpose hash function.
This method may produce a sufficiently uniform distribution of hash values, as long as the hash range size n is small compared to the range of the checksum or fingerprint function. However, some checksums fare poorly in the avalanche test
, which may be a concern in some applications. In particular, the popular CRC32 checksum provides only 16 bits (the higher half of the result) that are usable for hashing. Moreover, each bit of the input has a deterministic effect on each bit of the CRC32, that is one can tell without looking at the rest of the input, which bits of the output will flip if the input bit is flipped; so care must be taken to use all 32 bits when computing the hash from the checksum.
s, such as SHA-1, have even stronger uniformity guarantees than checksums or fingerprints, and thus can provide very good general-purpose hashing functions.
In ordinary applications, this advantage may be too small to offset their much higher cost. However, this method can provide uniformly distributed hashes even when the keys are chosen by a malicious agent. This feature may help protect services against denial of service attacks.
operation, "chop" the input domain into many sub-domains that get "mixed" into the output range to improve the uniformity of the key distribution.
Donald Knuth
notes that Hans Peter Luhn
of IBM
appears to have been the first to use the concept, in a memo dated January 1953, and that Robert Morris
used the term in a survey paper in CACM
which elevated the term from technical jargon to formal terminology.
Algorithm
In mathematics and computer science, an algorithm is an effective method expressed as a finite list of well-defined instructions for calculating a function. Algorithms are used for calculation, data processing, and automated reasoning...
or subroutine
Subroutine
In computer science, a subroutine is a portion of code within a larger program that performs a specific task and is relatively independent of the remaining code....
that maps large data sets
Data set
A data set is a collection of data, usually presented in tabular form. Each column represents a particular variable. Each row corresponds to a given member of the data set in question. Its values for each of the variables, such as height and weight of an object or values of random numbers. Each...
to smaller data sets, called keys. For example, a single integer
Integer
The integers are formed by the natural numbers together with the negatives of the non-zero natural numbers .They are known as Positive and Negative Integers respectively...
can serve as an index to an array
Array data type
In computer science, an array type is a data type that is meant to describe a collection of elements , each selected by one or more indices that can be computed at run time by the program. Such a collection is usually called an array variable, array value, or simply array...
(cf.
Cf.
cf., an abbreviation for the Latin word confer , literally meaning "bring together", is used to refer to other material or ideas which may provide similar or different information or arguments. It is mainly used in scholarly contexts, such as in academic or legal texts...
associative array). The values returned by a hash function are called hash values, hash codes, hash sums, checksums or simply hashes.
Hash functions are mostly used to accelerate table lookup or data comparison tasks such as finding items in a database
Database
A database is an organized collection of data for one or more purposes, usually in digital form. The data are typically organized to model relevant aspects of reality , in a way that supports processes requiring this information...
, detecting duplicated or similar records in a large file, finding similar stretches in DNA
Nucleic acid
Nucleic acids are biological molecules essential for life, and include DNA and RNA . Together with proteins, nucleic acids make up the most important macromolecules; each is found in abundance in all living things, where they function in encoding, transmitting and expressing genetic information...
sequences, and so on.
A hash function should be referentially transparent, i.e. if called twice on input that is "equal" (e.g. strings that consist of the same sequence of characters), it should give the same result. This is a contract in many programming languages that allow the user to override equality and hash functions for an object, that if two objects are equal their hash codes must be the same. This is important in order for it to be possible to find an element in a hash table quickly since two of the same element would both hash to the same slot.
Some hash functions may map two or more keys to the same hash value, causing a collision. Such hash functions try to map the keys to the hash values as evenly as possible because, as Hash Tables fill up, collisions become more frequent. Thus single digit hash values are frequently restricted to 80% of the size of the Table. Depending on the algorithm used, other properties may be required as well, such as double Hashing and Linear Probing. Although the idea was conceived in the 1950s, the design of good hash functions is still a topic of active research.
Hash functions are related to (and often confused with) checksum
Checksum
A checksum or hash sum is a fixed-size datum computed from an arbitrary block of digital data for the purpose of detecting accidental errors that may have been introduced during its transmission or storage. The integrity of the data can be checked at any later time by recomputing the checksum and...
s, check digit
Check digit
A check digit is a form of redundancy check used for error detection, the decimal equivalent of a binary checksum. It consists of a single digit computed from the other digits in the message....
s, fingerprint
Fingerprint (computing)
In computer science, a fingerprinting algorithm is a procedure that maps an arbitrarily large data item to a much shorter bit string, its fingerprint, that uniquely identifies the original data for all practical purposes just as human fingerprints uniquely identify people for practical purposes...
s, randomization function
Randomization function
In computer science, a randomization function or randomizing function is an algorithm or procedure that implements a randomly chosen function between two specific sets, suitable for use in a randomized algorithm....
s, error correcting codes, and cryptographic hash function
Cryptographic hash function
A cryptographic hash function is a deterministic procedure that takes an arbitrary block of data and returns a fixed-size bit string, the hash value, such that an accidental or intentional change to the data will change the hash value...
s. Although these concepts overlap to some extent, each has its own uses and requirements and is designed and optimized differently. The HashKeeper
HashKeeper
HashKeeper is a database application of value primarily to those conducting forensic examinations of computers on a somewhat regular basis.- Overview :...
database maintained by the American National Drug Intelligence Center
National Drug Intelligence Center
The U.S. National Drug Intelligence Center , established in 1993, is a component of the U.S. Department of Justice and a member of the Intelligence Community...
, for instance, is more aptly described as a catalog of file fingerprints than of hash values.
Hash tables
Hash functions are primarily used in hash tableHash table
In computer science, a hash table or hash map is a data structure that uses a hash function to map identifying values, known as keys , to their associated values . Thus, a hash table implements an associative array...
s, to quickly locate a data record (for example, a dictionary
Dictionary
A dictionary is a collection of words in one or more specific languages, often listed alphabetically, with usage information, definitions, etymologies, phonetics, pronunciations, and other information; or a book of words in one language with their equivalents in another, also known as a lexicon...
definition) given its search key (the headword). Specifically, the hash function is used to map the search key to the hash. The index gives the place where the corresponding record should be stored. Hash tables, in turn, are used to implement associative arrays and dynamic sets.
In general, a hashing function may map several different keys to the same index. Therefore, each slot of a hash table is associated with (implicitly or explicitly) a set of records, rather than a single record. For this reason, each slot of a hash table is often called a bucket, and hash values are also called bucket indices.
Thus, the hash function only hints at the record's location—it tells where one should start looking for it. Still, in a half-full table, a good hash function will typically narrow the search down to only one or two entries.
Caches
Hash functions are also used to build caches for large data sets stored in slow media. A cache is generally simpler than a hashed search table, since any collision can be resolved by discarding or writing back the older of the two colliding items.Bloom filters
Hash functions are an essential ingredient of the Bloom filterBloom filter
A Bloom filter, conceived by Burton Howard Bloom in 1970, is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set. False positives are possible, but false negatives are not; i.e. a query returns either "inside set " or "definitely not in set"...
, a compact data structure that provides an enclosing approximation to a set of them.
Finding duplicate records
When storing records in a large unsorted file, one may use a hash function to map each record to an index into a table T, and collect in each bucket T[i] a list of the numbers of all records with the same hash value i. Once the table is complete, any two duplicate records will end up in the same bucket. The duplicates can then be found by scanning every bucket T[i] which contains two or more members, fetching those records, and comparing them. With a table of appropriate size, this method is likely to be much faster than any alternative approach (such as sorting the file and comparing all consecutive pairs).Finding similar records
Hash functions can also be used to locate table records whose key is similar, but not identical, to a given key; or pairs of records in a large file which have similar keys. For that purpose, one needs a hash function that maps similar keys to hash values that differ by at most m, where m is a small integer (say, 1 or 2). If one builds a table T of all record numbers, using such a hash function, then similar records will end up in the same bucket, or in nearby buckets. Then one need only check the records in each bucket T[i] against those in buckets T[i+k] where k ranges between -m and m.This class includes the so-called acoustic fingerprint
Acoustic fingerprint
An acoustic fingerprint is a condensed digital summary, deterministically generated from an audio signal, that can be used to identify an audio sample or quickly locate similar items in an audio database....
algorithms, that are used to locate similar-sounding entries in large collection of audio files. For this application, the hash function must be as insensitive as possible to data capture or transmission errors, and to "trivial" changes such as timing and volume changes, compression, etc.
Finding similar substrings
The same techniques can be used to find equal or similar stretches in a large collection of strings, such as a document repository or a genomic databaseBiological database
Biological databases are libraries of life sciences information, collected from scientific experiments, published literature, high-throughput experiment technology, and computational analyses. They contain information from research areas including genomics, proteomics, metabolomics, microarray...
. In this case, the input strings are broken into many small pieces, and a hash function is used to detect potentially equal pieces, as above.
The Rabin-Karp algorithm
Rabin-Karp string search algorithm
The Rabin–Karp algorithm is a string searching algorithm created by Michael O. Rabin and Richard M. Karp in 1987 that uses hashing to find any one of a set of pattern strings in a text. For text of length n and p patterns of combined length m, its average and best case running time is O in space O,...
is a relatively fast string searching algorithm
String searching algorithm
String searching algorithms, sometimes called string matching algorithms, are an important class of string algorithms that try to find a place where one or several strings are found within a larger string or text....
that works in O(n)
Big O notation
In mathematics, big O notation is used to describe the limiting behavior of a function when the argument tends towards a particular value or infinity, usually in terms of simpler functions. It is a member of a larger family of notations that is called Landau notation, Bachmann-Landau notation, or...
time on average. It is based on the use of hashing to compare strings.
Geometric hashing
This principle is widely used in computer graphicsComputer graphics
Computer graphics are graphics created using computers and, more generally, the representation and manipulation of image data by a computer with help from specialized software and hardware....
, computational geometry
Computational geometry
Computational geometry is a branch of computer science devoted to the study of algorithms which can be stated in terms of geometry. Some purely geometrical problems arise out of the study of computational geometric algorithms, and such problems are also considered to be part of computational...
and many other disciplines, to solve many proximity problems in the plane or in three-dimensional space, such as finding closest pairs in a set of points, similar shapes in a list of shapes, similar image
Image processing
In electrical engineering and computer science, image processing is any form of signal processing for which the input is an image, such as a photograph or video frame; the output of image processing may be either an image or, a set of characteristics or parameters related to the image...
s in an image database
Image retrieval
An image retrieval system is a computer system for browsing, searching and retrieving images from a large database of digital images. Most traditional and common methods of image retrieval utilize some method of adding metadata such as captioning, keywords, or descriptions to the images so that...
, and so on. In these applications, the set of all inputs is some sort of metric space
Metric space
In mathematics, a metric space is a set where a notion of distance between elements of the set is defined.The metric space which most closely corresponds to our intuitive understanding of space is the 3-dimensional Euclidean space...
, and the hashing function can be interpreted as a partition of that space into a grid of cells. The table is often an array with two or more indices (called a grid file
Grid file
In computer science, a grid file or bucket grid is a point access method which splits a space into a non-periodic grid where one or more cells of the grid refer to a small set of points...
, grid index, bucket grid, and similar names), and the hash function returns an index tuple. This special case of hashing is known as geometric hashing
Geometric hashing
In computer science, geometric hashing is originally a method for efficiently finding two-dimensional objects represented by discrete points that have undergone an affine transformation, though extensions exist to some other object representations and transformations. In an off-line step, the...
or the grid method. Geometric hashing is also used in telecommunication
Telecommunication
Telecommunication is the transmission of information over significant distances to communicate. In earlier times, telecommunications involved the use of visual signals, such as beacons, smoke signals, semaphore telegraphs, signal flags, and optical heliographs, or audio messages via coded...
s (usually under the name vector quantization
Vector quantization
Vector quantization is a classical quantization technique from signal processing which allows the modeling of probability density functions by the distribution of prototype vectors. It was originally used for data compression. It works by dividing a large set of points into groups having...
) to encode and compress
Data compression
In computer science and information theory, data compression, source coding or bit-rate reduction is the process of encoding information using fewer bits than the original representation would use....
multi-dimensional signals.
Properties
Good hash functions, in the original sense of the term, are usually required to satisfy certain properties listed below. Note that different requirements apply to the other related concepts (cryptographic hash functionCryptographic hash function
A cryptographic hash function is a deterministic procedure that takes an arbitrary block of data and returns a fixed-size bit string, the hash value, such that an accidental or intentional change to the data will change the hash value...
s, checksums, etc.).
Low cost
The cost of computing a hash function must be small enough to make a hashing-based solution more efficient than alternative approaches. For instance, a self-balancing binary tree can locate an item in a sorted table of n items with O(log n) key comparisons. Therefore, a hash tableHash table
In computer science, a hash table or hash map is a data structure that uses a hash function to map identifying values, known as keys , to their associated values . Thus, a hash table implements an associative array...
solution will be more efficient than a self-balancing binary tree if the number of items is large and the hash function produces few collisions and less efficient if the number of items is small and the hash function is complex.
Determinism
A hash procedure must be deterministicDeterministic algorithm
In computer science, a deterministic algorithm is an algorithm which, in informal terms, behaves predictably. Given a particular input, it will always produce the same output, and the underlying machine will always pass through the same sequence of states...
—meaning that for a given input value it must always generate the same hash value. In other words, it must be a function
Function (mathematics)
In mathematics, a function associates one quantity, the argument of the function, also known as the input, with another quantity, the value of the function, also known as the output. A function assigns exactly one output to each input. The argument and the value may be real numbers, but they can...
of the hashed data, in the mathematical sense of the term. This requirement excludes hash functions that depend on external variable parameters, such as pseudo-random number generators or the time of day. It also excludes functions that depend on the memory address of the object being hashed, because that address may change during execution (as may happen on systems that use certain methods of garbage collection
Garbage collection (computer science)
In computer science, garbage collection is a form of automatic memory management. The garbage collector, or just collector, attempts to reclaim garbage, or memory occupied by objects that are no longer in use by the program...
), although sometimes rehashing of the item is possible).
Uniformity
A good hash function should map the expected inputs as evenly as possible over its output range. That is, every hash value in the output range should be generated with roughly the same probabilityProbability
Probability is ordinarily used to describe an attitude of mind towards some proposition of whose truth we arenot certain. The proposition of interest is usually of the form "Will a specific event occur?" The attitude of mind is of the form "How certain are we that the event will occur?" The...
. The reason for this last requirement is that the cost of hashing-based methods goes up sharply as the number of collisions—pairs of inputs that are mapped to the same hash value—increases. Basically, if some hash values are more likely to occur than others, a larger fraction of the lookup operations will have to search through a larger set of colliding table entries.
Note that this criterion only requires the value to be uniformly distributed, not random in any sense. A good randomizing function is (barring computational efficiency concerns) generally a good choice as a hash function, but the converse need not be true.
Hash tables often contain only a small subset of the valid inputs. For instance, a club membership list may contain only a hundred or so member names, out of the very large set of all possible names. In these cases, the uniformity criterion should hold for almost all typical subsets of entries that may be found in the table, not just for the global set of all possible entries.
In other words, if a typical set of m records is hashed to n table slots, the probability of a bucket receiving many more than m/n records should be vanishingly small. In particular, if m is less than n, very few buckets should have more than one or two records. (In an ideal "perfect hash function
Perfect hash function
A perfect hash function for a set S is a hash function that maps distinct elements in S to a set of integers, with no collisions. A perfect hash function has many of the same applications as other hash functions, but with the advantage that no collision resolution has to be implemented.- Properties...
", no bucket should have more than one record; but a small number of collisions is virtually inevitable, even if n is much larger than m -- see the birthday paradox
Birthday paradox
In probability theory, the birthday problem or birthday paradox pertains to the probability that, in a set of n randomly chosen people, some pair of them will have the same birthday. By the pigeonhole principle, the probability reaches 100% when the number of people reaches 366. However, 99%...
).
When testing a hash function, the uniformity of the distribution of hash values can be evaluated by the chi-squared test.
Variable range
In many applications, the range of hash values may be different for each run of the program, or may change along the same run (for instance, when a hash table needs to be expanded). In those situations, one needs a hash function which takes two parameters—the input data z, and the number n of allowed hash values.A common solution is to compute a fixed hash function with a very large range (say, 0 to 232−1), divide the result by n, and use the division's remainder
Modulo operation
In computing, the modulo operation finds the remainder of division of one number by another.Given two positive numbers, and , a modulo n can be thought of as the remainder, on division of a by n...
. If n is itself a power of 2, this can be done by bit masking and bit shifting. When this approach is used, the hash function must be chosen so that the result has fairly uniform distribution between 0 and n−1, for any n that may occur in the application. Depending on the function, the remainder may be uniform only for certain n, e.g. odd or prime number
Prime number
A prime number is a natural number greater than 1 that has no positive divisors other than 1 and itself. A natural number greater than 1 that is not a prime number is called a composite number. For example 5 is prime, as only 1 and 5 divide it, whereas 6 is composite, since it has the divisors 2...
s.
It is possible to relax the restriction of the table size being a power of 2 and not having to perform any modulo, remainder or division operation -as these operation are considered computational costly in some contexts. For example, when n is significantly less than 2b begin with a pseudo random number generator (PRNG) function P(key), uniform on the interval [0, 2b−1]. Consider the ratio q = 2b / n. Now the hash function can be seen as the value of P(key) / q. Rearranging the calculation and replacing the 2b-division by bit shifting right (>>) b times you end up with hash function n * P(key) >> b.
Variable range with minimal movement (dynamic hash function)
When the hash function is used to store values in a hash table that outlives the run of the program, and the hash table needs to be expanded or shrunk, the hash table is referred to as a dynamic hash table.A hash function that will relocate the minimum number of records when the table is resized is desirable.
What is needed is a hash function H(z,n) – where z is the key being hashed and n is the number of allowed hash values – such that H(z,n+1) = H(z,n) with probability close to n/(n+1).
Linear hashing and spiral storage are examples of dynamic hash functions that execute in constant time but relax the property of uniformity to achieve the minimal movement property.
Extendible hashing
Extendible hashing
Extendible hashing is a type of hash system which treats a hash as a bit string, and uses a trie for bucket lookup. Because of the hierarchical nature of the system, re-hashing is an incremental operation...
uses a dynamic hash function that requires space proportional to n to compute the hash function, and it becomes a function of the previous keys that have been inserted.
Several algorithms that preserve the uniformity property but require time proportional to n to compute the value of H(z,n) have been invented.
Data normalization
In some applications, the input data may contain features that are irrelevant for comparison purposes. For example, when looking up a personal name, it may be desirable to ignore the distinction between upper and lower case letters. For such data, one must use a hash function that is compatible with the data equivalenceEquivalence relation
In mathematics, an equivalence relation is a relation that, loosely speaking, partitions a set so that every element of the set is a member of one and only one cell of the partition. Two elements of the set are considered equivalent if and only if they are elements of the same cell...
criterion being used: that is, any two inputs that are considered equivalent must yield the same hash value. This can be accomplished by normalizing the input before hashing it, as by upper-casing all letters.
Continuity
A hash function that is used to search for similar (as opposed to equivalent) data must be as continuousContinuous function
In mathematics, a continuous function is a function for which, intuitively, "small" changes in the input result in "small" changes in the output. Otherwise, a function is said to be "discontinuous". A continuous function with a continuous inverse function is called "bicontinuous".Continuity of...
as possible; two inputs that differ by a little should be mapped to equal or nearly equal hash values.
Note that continuity is usually considered a fatal flaw for checksums, cryptographic hash function
Cryptographic hash function
A cryptographic hash function is a deterministic procedure that takes an arbitrary block of data and returns a fixed-size bit string, the hash value, such that an accidental or intentional change to the data will change the hash value...
s, and other related concepts. Continuity is desirable for hash functions only in some applications, such as hash tables that use linear search
Linear search
In computer science, linear search or sequential search is a method for finding a particular value in a list, that consists of checking every one of its elements, one at a time and in sequence, until the desired one is found....
.
Hash function algorithms
For most types of hashing functions the choice of the function depends strongly on the nature of the input data, and their probability distributionProbability distribution
In probability theory, a probability mass, probability density, or probability distribution is a function that describes the probability of a random variable taking certain values....
in the intended application.
Trivial hash function
If the datum to be hashed is small enough, one can use the datum itself (reinterpreted as an integer in binary notation) as the hashed value. The cost of computing this "trivial" (identityIdentity function
In mathematics, an identity function, also called identity map or identity transformation, is a function that always returns the same value that was used as its argument...
) hash function is effectively zero. This hash function is perfect
Perfect hash function
A perfect hash function for a set S is a hash function that maps distinct elements in S to a set of integers, with no collisions. A perfect hash function has many of the same applications as other hash functions, but with the advantage that no collision resolution has to be implemented.- Properties...
, as it maps each input to a distinct hash value.
The meaning of "small enough" depends on the size of the type that is used as the hashed value. For example, in Java
Java (programming language)
Java is a programming language originally developed by James Gosling at Sun Microsystems and released in 1995 as a core component of Sun Microsystems' Java platform. The language derives much of its syntax from C and C++ but has a simpler object model and fewer low-level facilities...
, the hash code is a 32-bit integer. Thus the 32-bit integer
Integer
and 32-bit floating-point Float
objects can simply use the value directly; whereas the 64-bit integer Long
and 64-bit floating-point Double
cannot use this method.Other types of data can also use this perfect hashing scheme. For example, when mapping character strings between upper and lower case, one can use the binary encoding of each character, interpreted as an integer, to index a table that gives the alternative form of that character ("A" for "a", "8" for "8", etc.). If each character is stored in 8 bits (as in ASCII
ASCII
The American Standard Code for Information Interchange is a character-encoding scheme based on the ordering of the English alphabet. ASCII codes represent text in computers, communications equipment, and other devices that use text...
or ISO Latin 1), the table has only 28 = 256 entries; in the case of Unicode
Unicode
Unicode is a computing industry standard for the consistent encoding, representation and handling of text expressed in most of the world's writing systems...
characters, the table would have 17×216 = 1114112 entries.
The same technique can be used to map two-letter country codes
ISO 3166-1 alpha-2
ISO 3166-1 alpha-2 codes are two-letter country codes defined in ISO 3166-1, part of the ISO 3166 standard published by the International Organization for Standardization , to represent countries, dependent territories, and special areas of geographical interest...
like "us" or "za" to country names (262=676 table entries), 5-digit zip codes like 13083 to city names (100000 entries), etc. Invalid data values (such as the country code "xx" or the zip code 00000) may be left undefined in the table, or mapped to some appropriate "null" value.
Perfect hashing
A hash function that is injectiveInjective function
In mathematics, an injective function is a function that preserves distinctness: it never maps distinct elements of its domain to the same element of its codomain. In other words, every element of the function's codomain is mapped to by at most one element of its domain...
—that is, maps each valid input to a different hash value—is said to be perfect
Perfect hash function
A perfect hash function for a set S is a hash function that maps distinct elements in S to a set of integers, with no collisions. A perfect hash function has many of the same applications as other hash functions, but with the advantage that no collision resolution has to be implemented.- Properties...
. With such a function one can directly locate the desired entry in a hash table, without any additional searching.
Minimal perfect hashing
A perfect hash function for n keys is said to be minimal if its range consists of n consecutive integers, usually from 0 to n−1. Besides providing single-step lookup, a minimal perfect hash function also yields a compact hash table, without any vacant slots. Minimal perfect hash functions are much harder to find than perfect ones with a wider range.Hashing uniformly distributed data
If the inputs are bounded-length stringsString (computer science)
In formal languages, which are used in mathematical logic and theoretical computer science, a string is a finite sequence of symbols that are chosen from a set or alphabet....
(such as telephone
Telephone
The telephone , colloquially referred to as a phone, is a telecommunications device that transmits and receives sounds, usually the human voice. Telephones are a point-to-point communication system whose most basic function is to allow two people separated by large distances to talk to each other...
numbers, car license plate
Vehicle registration plate
A vehicle registration plate is a metal or plastic plate attached to a motor vehicle or trailer for official identification purposes. The registration identifier is a numeric or alphanumeric code that uniquely identifies the vehicle within the issuing region's database...
s, invoice
Invoice
An invoice or bill is a commercial document issued by a seller to the buyer, indicating the products, quantities, and agreed prices for products or services the seller has provided the buyer. An invoice indicates the buyer must pay the seller, according to the payment terms...
numbers, etc.), and each input may independently
Statistical independence
In probability theory, to say that two events are independent intuitively means that the occurrence of one event makes it neither more nor less probable that the other occurs...
occur with uniform probability, then a hash function need only map roughly the same number of inputs to each hash value. For instance, suppose that each input is an integer z in the range 0 to N−1, and the output must be an integer h in the range 0 to n−1, where N is much larger than n. Then the hash function could be h = z mod n (the remainder of z divided by n), or h = (z × n) ÷ N (the value z scaled down by n/N and truncated to an integer), or many other formulas.
Warning: h = z mod n was used in many of the original random number generators, but was found to have a number of issues. One of which is that as n approaches N, this function becomes less and less uniform.
Hashing data with other distributions
These simple formulas will not do if the input values are not equally likely, or are not independent. For instance, most patrons of a supermarketSupermarket
A supermarket, a form of grocery store, is a self-service store offering a wide variety of food and household merchandise, organized into departments...
will live in the same geographic area, so their telephone numbers are likely to begin with the same 3 to 4 digits. In that case, if n is 10000 or so, the division formula (z × n) ÷ N, which depends mainly on the leading digits, will generate a lot of collisions; whereas the remainder formula z mod n, which is quite sensitive to the trailing digits, may still yield a fairly even distribution.
Hashing variable-length data
When the data values are long (or variable-length) character strings—such as personal names, web page addresses, or mail messages—their distribution is usually very uneven, with complicated dependencies. For example, text in any natural languageNatural language
In the philosophy of language, a natural language is any language which arises in an unpremeditated fashion as the result of the innate facility for language possessed by the human intellect. A natural language is typically used for communication, and may be spoken, signed, or written...
has highly non-uniform distributions of character
Character (computing)
In computer and machine-based telecommunications terminology, a character is a unit of information that roughly corresponds to a grapheme, grapheme-like unit, or symbol, such as in an alphabet or syllabary in the written form of a natural language....
s, and character pairs, very characteristic of the language. For such data, it is prudent to use a hash function that depends on all characters of the string—and depends on each character in a different way.
In cryptographic hash functions, a Merkle–Damgård construction is usually used. In general, the scheme for hashing such data is to break the input into a sequence of small units (bit
Bit
A bit is the basic unit of information in computing and telecommunications; it is the amount of information stored by a digital device or other physical system that exists in one of two possible distinct states...
s, byte
Byte
The byte is a unit of digital information in computing and telecommunications that most commonly consists of eight bits. Historically, a byte was the number of bits used to encode a single character of text in a computer and for this reason it is the basic addressable element in many computer...
s, words, etc.) and combine all the units b[1], b[2], ..., b[m] sequentially, as follows
S ← S0; // Initialize the state.
for k in 1, 2, ..., m do // Scan the input data units:
S ← F(S, b[k]); // Combine data unit k into the state.
return G(S, n) // Extract the hash value from the state.
This schema is also used in many text checksum and fingerprint algorithms. The state variable S may be a 32- or 64-bit unsigned integer; in that case, S0 can be 0, and G(S,n) can be just S mod n. The best choice of F is a complex issue and depends on the nature of the data. If the units b[k] are single bits, then F(S,b) could be, for instance
if highbit(S) = 0 then
return 2 * S + b
else
return (2 * S + b) ^ P
Here highbit(S) denotes the most significant bit of S; the '*' operator denotes unsigned integer multiplication with lost overflow
Overflow (software)
OVERFLOW - the OVERset grid FLOW solver - is a software package for simulating fluid flow around solid bodies using computational fluid dynamics...
; '^' is the bitwise exclusive or operation applied to words; and P is a suitable fixed word.
Special-purpose hash functions
In many cases, one can design a special-purpose (heuristic) hash function that yields many fewer collisions than a good general-purpose hash function. For example, suppose that the input data are file names such as FILE0000.CHK, FILE0001.CHK, FILE0002.CHK, etc., with mostly sequential numbers. For such data, a function that extracts the numeric part k of the file name and returns k mod n would be nearly optimal. Needless to say, a function that is exceptionally good for a specific kind of data may have dismal performance on data with different distribution.Rolling hash
In some applications, such as substring searchString searching algorithm
String searching algorithms, sometimes called string matching algorithms, are an important class of string algorithms that try to find a place where one or several strings are found within a larger string or text....
, one must compute a hash function h for every k-character substring
Substring
A subsequence, substring, prefix or suffix of a string is a subset of the symbols in a string, where the order of the elements is preserved...
of a given n-character string t; where k is a fixed integer, and n is k. The straightforward solution, which is to extract every such substring s of t and compute h(s) separately, requires a number of operations proportional to k·n. However, with the proper choice of h, one can use the technique of rolling hash to compute all those hashes with an effort proportional to k+n.
Universal hashing
A universal hashingUniversal hashing
Using universal hashing refers to selecting a hash function at random from a family of hash functions with a certain mathematical property . This guarantees a low number of collisions in expectation, even if the data is chosen by an adversary...
scheme is a randomized algorithm
Randomized algorithm
A randomized algorithm is an algorithm which employs a degree of randomness as part of its logic. The algorithm typically uses uniformly random bits as an auxiliary input to guide its behavior, in the hope of achieving good performance in the "average case" over all possible choices of random bits...
that selects a hashing function h among a family of such functions, in such a way that the probability of a collision of any two distinct keys is 1/n, where n is the number of distinct hash values desired—independently of the two keys. Universal hashing ensures (in a probabilistic sense) that the hash function application will behave as well as if it were using a random function, for any distribution of the input data. It will however have more collisions than perfect hashing, and may require more operations than a special-purpose hash function.
Hashing with checksum functions
One can adapt certain checksum or fingerprinting algorithms for use as hash functions. Some of those algorithms will map arbitrary long string data z, with any typical real-world distribution—no matter how non-uniform and dependent—to a 32-bit or 64-bit string, from which one can extract a hash value in 0 through n−1.This method may produce a sufficiently uniform distribution of hash values, as long as the hash range size n is small compared to the range of the checksum or fingerprint function. However, some checksums fare poorly in the avalanche test
Avalanche effect
In cryptography, the avalanche effect refers to a desirable property of cryptographic algorithms, typically block ciphers and cryptographic hash functions. The avalanche effect is evident if, when an input is changed slightly the output changes significantly...
, which may be a concern in some applications. In particular, the popular CRC32 checksum provides only 16 bits (the higher half of the result) that are usable for hashing. Moreover, each bit of the input has a deterministic effect on each bit of the CRC32, that is one can tell without looking at the rest of the input, which bits of the output will flip if the input bit is flipped; so care must be taken to use all 32 bits when computing the hash from the checksum.
Hashing with cryptographic hash functions
Some cryptographic hash functionCryptographic hash function
A cryptographic hash function is a deterministic procedure that takes an arbitrary block of data and returns a fixed-size bit string, the hash value, such that an accidental or intentional change to the data will change the hash value...
s, such as SHA-1, have even stronger uniformity guarantees than checksums or fingerprints, and thus can provide very good general-purpose hashing functions.
In ordinary applications, this advantage may be too small to offset their much higher cost. However, this method can provide uniformly distributed hashes even when the keys are chosen by a malicious agent. This feature may help protect services against denial of service attacks.
Origins of the term
The term "hash" comes by way of analogy with its non-technical meaning, to "chop and mix". Indeed, typical hash functions, like the modModular arithmetic
In mathematics, modular arithmetic is a system of arithmetic for integers, where numbers "wrap around" after they reach a certain value—the modulus....
operation, "chop" the input domain into many sub-domains that get "mixed" into the output range to improve the uniformity of the key distribution.
Donald Knuth
Donald Knuth
Donald Ervin Knuth is a computer scientist and Professor Emeritus at Stanford University.He is the author of the seminal multi-volume work The Art of Computer Programming. Knuth has been called the "father" of the analysis of algorithms...
notes that Hans Peter Luhn
Hans Peter Luhn
Hans Peter Luhn was a computer scientist for IBM, and creator of the Luhn algorithm and KWIC indexing. He was awarded over 80 patents....
of IBM
IBM
International Business Machines Corporation or IBM is an American multinational technology and consulting corporation headquartered in Armonk, New York, United States. IBM manufactures and sells computer hardware and software, and it offers infrastructure, hosting and consulting services in areas...
appears to have been the first to use the concept, in a memo dated January 1953, and that Robert Morris
Robert Morris (cryptographer)
Robert Morris , was an American cryptographer and computer scientist. -Family and Education:Morris was born in Boston, Massachusetts. His parents were Walter W. Morris, a salesman, and Helen Kelly Morris...
used the term in a survey paper in CACM
Communications of the ACM
Communications of the ACM is the flagship monthly journal of the Association for Computing Machinery . First published in 1957, CACM is sent to all ACM members, currently numbering about 80,000. The articles are intended for readers with backgrounds in all areas of computer science and information...
which elevated the term from technical jargon to formal terminology.
List of hash functions
- Bernstein hash
- Fowler-Noll-Vo hash function (32, 64, 128, 256, 512, or 1024 bits)
- Jenkins hash functionJenkins hash functionThe Jenkins hash functions are a collection of hash functions for multi-byte keys designed by Bob Jenkins. They can be used also as checksums to detect accidental data corruption or detect identical records in a database...
(32 bits) - Pearson hashingPearson hashingPearson hashing is a hash function designed for fast execution on processors with 8-bit registers. Given an input consisting of any number of bytes, it produces as output a single byte that is strongly dependent on every byte of the input...
(8 bits) - Zobrist hashingZobrist hashingZobrist hashing is a hash function construction used in computer programs that play abstract board games, such as chess and Go, to implement transposition tables, a special kind of hash table that is indexed by a board position and used to avoid analyzing the same position more than once...
See also
- Bloom filterBloom filterA Bloom filter, conceived by Burton Howard Bloom in 1970, is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set. False positives are possible, but false negatives are not; i.e. a query returns either "inside set " or "definitely not in set"...
- Coalesced hashingCoalesced hashingCoalesced hashing, also called coalesced chaining, is a strategy of collision resolution in a hash table that forms a hybrid of separate chaining and open addressing. In a separate chaining hash table, items that hash to the same address are placed on a list at that address...
- Cuckoo hashingCuckoo hashingCuckoo hashing is a scheme in computer programming for resolving hash collisions of values of hash functions in a table. Cuckoo hashing was first described by Rasmus Pagh and Flemming Friche Rodler in 2001...
- Cryptographic hash functionCryptographic hash functionA cryptographic hash function is a deterministic procedure that takes an arbitrary block of data and returns a fixed-size bit string, the hash value, such that an accidental or intentional change to the data will change the hash value...
- Distributed hash tableDistributed hash tableA distributed hash table is a class of a decentralized distributed system that provides a lookup service similar to a hash table; pairs are stored in a DHT, and any participating node can efficiently retrieve the value associated with a given key...
- Geometric hashingGeometric hashingIn computer science, geometric hashing is originally a method for efficiently finding two-dimensional objects represented by discrete points that have undergone an affine transformation, though extensions exist to some other object representations and transformations. In an off-line step, the...
- Hash tableHash tableIn computer science, a hash table or hash map is a data structure that uses a hash function to map identifying values, known as keys , to their associated values . Thus, a hash table implements an associative array...
- HMACHMACIn cryptography, HMAC is a specific construction for calculating a message authentication code involving a cryptographic hash function in combination with a secret key. As with any MAC, it may be used to simultaneously verify both the data integrity and the authenticity of a message...
- IdenticonIdenticonAn Identicon is a visual representation of a hash value, usually of the IP address, serving to identify a user of a computer system; compare avatars. The original Identicon is a 9-block graphic, which has been extended to other graphic forms by third parties, some of whom have used MD5 instead of...
- Linear hashLinear hashLinear hashing is a dynamic hash table algorithm invented by Witold Litwin , and later popularized by Paul Larson. Linear hashing allows for the expansion of the hash table one slot at a time....
- List of hash functions
- Locality sensitive hashingLocality sensitive hashingLocality-sensitive hashing is a method of performing probabilistic dimension reduction of high-dimensional data. The basic idea is to hash the input items so that similar items are mapped to the same buckets with high probability .-Definition:An LSH family \mathcal F is defined fora...
- Perfect hash functionPerfect hash functionA perfect hash function for a set S is a hash function that maps distinct elements in S to a set of integers, with no collisions. A perfect hash function has many of the same applications as other hash functions, but with the advantage that no collision resolution has to be implemented.- Properties...
- Rabin-Karp string search algorithmRabin-Karp string search algorithmThe Rabin–Karp algorithm is a string searching algorithm created by Michael O. Rabin and Richard M. Karp in 1987 that uses hashing to find any one of a set of pattern strings in a text. For text of length n and p patterns of combined length m, its average and best case running time is O in space O,...
- Rolling hashRolling hashA rolling hash is a hash function where the input is hashed in a window that moves through the input.A few hash functions allow a rolling hash to be computed very quickly -- the new hash value is rapidly calculated given only the old hash value, the old value removed from the window, and the new...
- Transposition tableTransposition tableIn computer chess and other computer games, transposition tables are used to speed up the search of the game tree. Transposition tables are primarily useful in perfect information games, meaning the entire state of the game is known to all players at all times....
- Universal hashingUniversal hashingUsing universal hashing refers to selecting a hash function at random from a family of hash functions with a certain mathematical property . This guarantees a low number of collisions in expectation, even if the data is chosen by an adversary...
External links
- General purpose hash function algorithms (C/C++/Pascal/Java/Python/Ruby)
- Hash Functions and Block Ciphers by Bob Jenkins
- The Goulburn Hashing Function (PDFPortable Document FormatPortable Document Format is an open standard for document exchange. This file format, created by Adobe Systems in 1993, is used for representing documents in a manner independent of application software, hardware, and operating systems....
) by Mayur Patel - MIT's Introduction to Algorithms: Hashing 1 MIT OCW lecture Video
- MIT's Introduction to Algorithms: Hashing 2 MIT OCW lecture Video
- Hash Fuction Construction for Textual and Geometrical Data Retrieval Latest Trends on Computers, Vol.2, pp.483-489, CSCC conference, Corfu, 2010