Jaccard index
Encyclopedia
The Jaccard index, also known as the Jaccard similarity coefficient (originally coined coefficient de communauté by Paul Jaccard
), is a statistic
used for comparing the similarity
and diversity of sample
sets.
The Jaccard coefficient measures similarity between sample sets, and is defined as the size of the intersection
divided by the size of the union
of the sample sets:
The MinHash
min-wise independent permutations locality sensitive hashing
scheme may be used to efficiently compute an accurate estimate of the Jaccard similarity coefficient of pairs of sets, where each set is represented by a constant-sized signature derived from the minimum values of a hash function
.
The Jaccard distance, which measures dissimilarity between sample sets, is complementary to the Jaccard coefficient and is obtained by subtracting the Jaccard coefficient from 1, or, equivalently, by dividing the difference of the sizes of the union and the intersection of two sets by the size of the union:
This distance is a proper metric.
attributes, the Jaccard coefficient is a useful measure of the overlap that A and B share with their attributes. Each attribute of A and B can either be 0 or 1. The total number of each combination of attributes for both A and B are specified as follows: represents the total number of attributes where A and B both have a value of 1. represents the total number of attributes where the attribute of A is 0 and the attribute of B is 1. represents the total number of attributes where the attribute of A is 1 and the attribute of B is 0. represents the total number of attributes where A and B both have a value of 0.
Each attribute must fall into one of these four categories, meaning that
The Jaccard similarity coefficient, J, is given as
The Jaccard distance, J, is given as
In "A Computer Program for Classifying Plants", published in October 1960, a method of classification based on a similarity ratio, and a derived distance function, is given. It seems that this is the most authoritative source for the meaning of the terms "Tanimoto Similarity" and "Tanimoto Distance". The similarity ratio is equivalent to Jaccard similarity, but the distance function is not the same as Jaccard Distance.
Presented in mathematical terms, if samples X and Y are bitmaps, is the ith bit of X, and are bitwise and, or operators respectively, then the similarity ratio is
If each sample is modelled instead as a set of attributes, this value is equal to the Jaccard Coefficient of the two sets. Jaccard is not cited in the paper, and it seems likely that the authors were not aware of it.
Tanimoto goes on to define a distance coefficient based on this ratio, defined over values with non-zero similarity:
This coefficient is, deliberately, not a distance metric. It is chosen to allow the possibility of two specimens, which are quite different to each other, to both be similar to a third. It is easy to construct an example which disproves the property of triangle inequality.
If Jaccard or Tanimoto Similarity is expressed over a bit vector, then it can be written as
where the same calculation is expressed in terms of vector scalar product and magnitude. This representation relies on the fact that, for a bit vector (where the value of each dimension is either 0 or 1) then and .
This is a potentially confusing representation, because the function as expressed over vectors is more general, unless its domain is explicitly restricted. Properties of do not necessarily extend to . In particular, the difference function does not preserve triangle inequality, and is not therefore a proper distance metric, whereas is.
There is a real danger that the combination of "Tanimoto Distance" being defined using this formula, along with the statement "Tanimoto Distance is a proper distance metric" will lead to the false conclusion that the function is in fact a distance metric over vectors or multisets in general, whereas its use in similarity search or clustering algorithms may fail to produce correct results.
Lipkus uses a definition of Tanimoto similarity which is equivalent to , and refers to Tanimoto distance as the function . It is however made clear within the paper that the context is restricted by the use of a (positive) weighting vector such that, for any vector A being considered, . Under these circumstances, the function is a proper distance metric, and so a set of vectors governed by such a weighting vector forms a metric space under this function.
Paul Jaccard
Paul Jaccard was a professor of botany and plant physiology at the ETH Zurich. He studied at the University of Lausanne and ETH Zurich...
), is a statistic
Statistic
A statistic is a single measure of some attribute of a sample . It is calculated by applying a function to the values of the items comprising the sample which are known together as a set of data.More formally, statistical theory defines a statistic as a function of a sample where the function...
used for comparing the similarity
Similarity
-Specific definitions:Different fields provide differing definitions of similarity:-In computer science:* string metric, aka string similarity* semantic similarity in computational linguistics-In other fields:...
and diversity of sample
Sample (statistics)
In statistics, a sample is a subset of a population. Typically, the population is very large, making a census or a complete enumeration of all the values in the population impractical or impossible. The sample represents a subset of manageable size...
sets.
The Jaccard coefficient measures similarity between sample sets, and is defined as the size of the intersection
Intersection (set theory)
In mathematics, the intersection of two sets A and B is the set that contains all elements of A that also belong to B , but no other elements....
divided by the size of the union
Union (set theory)
In set theory, the union of a collection of sets is the set of all distinct elements in the collection. The union of a collection of sets S_1, S_2, S_3, \dots , S_n\,\! gives a set S_1 \cup S_2 \cup S_3 \cup \dots \cup S_n.- Definition :...
of the sample sets:
The MinHash
MinHash
In computer science, MinHash is a technique for quickly estimating how similar two sets are...
min-wise independent permutations locality sensitive hashing
Locality sensitive hashing
Locality-sensitive hashing is a method of performing probabilistic dimension reduction of high-dimensional data. The basic idea is to hash the input items so that similar items are mapped to the same buckets with high probability .-Definition:An LSH family \mathcal F is defined fora...
scheme may be used to efficiently compute an accurate estimate of the Jaccard similarity coefficient of pairs of sets, where each set is represented by a constant-sized signature derived from the minimum values of a hash function
Hash function
A hash function is any algorithm or subroutine that maps large data sets to smaller data sets, called keys. For example, a single integer can serve as an index to an array...
.
The Jaccard distance, which measures dissimilarity between sample sets, is complementary to the Jaccard coefficient and is obtained by subtracting the Jaccard coefficient from 1, or, equivalently, by dividing the difference of the sizes of the union and the intersection of two sets by the size of the union:
This distance is a proper metric.
Similarity of asymmetric binary attributes
Given two objects, A and B, each with n binaryBinary numeral system
The binary numeral system, or base-2 number system, represents numeric values using two symbols, 0 and 1. More specifically, the usual base-2 system is a positional notation with a radix of 2...
attributes, the Jaccard coefficient is a useful measure of the overlap that A and B share with their attributes. Each attribute of A and B can either be 0 or 1. The total number of each combination of attributes for both A and B are specified as follows: represents the total number of attributes where A and B both have a value of 1. represents the total number of attributes where the attribute of A is 0 and the attribute of B is 1. represents the total number of attributes where the attribute of A is 1 and the attribute of B is 0. represents the total number of attributes where A and B both have a value of 0.
Each attribute must fall into one of these four categories, meaning that
The Jaccard similarity coefficient, J, is given as
The Jaccard distance, J, is given as
Tanimoto Similarity and Distance
Various forms of functions described as Tanimoto Similarity and Tanimoto Distance occur in the literature and on the Internet. Most of these are synonyms for Jaccard Similarity and Jaccard Distance, but some are mathematically different. Many sources cite an unavailable IBM Technical Report as the seminal reference.In "A Computer Program for Classifying Plants", published in October 1960, a method of classification based on a similarity ratio, and a derived distance function, is given. It seems that this is the most authoritative source for the meaning of the terms "Tanimoto Similarity" and "Tanimoto Distance". The similarity ratio is equivalent to Jaccard similarity, but the distance function is not the same as Jaccard Distance.
Tanimoto's Definitions of Similarity and Distance
In that paper, a "similarity ratio" is given over bitmaps, where each bit of a fixed-size array represents the presence or absence of a characteristic in the plant being modelled. The definition of the ratio is the number of common bits, divided by the number of bits set in either sample.Presented in mathematical terms, if samples X and Y are bitmaps, is the ith bit of X, and are bitwise and, or operators respectively, then the similarity ratio is
If each sample is modelled instead as a set of attributes, this value is equal to the Jaccard Coefficient of the two sets. Jaccard is not cited in the paper, and it seems likely that the authors were not aware of it.
Tanimoto goes on to define a distance coefficient based on this ratio, defined over values with non-zero similarity:
This coefficient is, deliberately, not a distance metric. It is chosen to allow the possibility of two specimens, which are quite different to each other, to both be similar to a third. It is easy to construct an example which disproves the property of triangle inequality.
Other Definitions of Tanimoto Distance
Tanimoto Distance is often referred to, erroneously, as a synonym for Jaccard Distance (). This function is a proper distance metric. "Tanimoto Distance" is often stated as being a proper distance metric, probably because of its confusion with Jaccard Distance.If Jaccard or Tanimoto Similarity is expressed over a bit vector, then it can be written as
where the same calculation is expressed in terms of vector scalar product and magnitude. This representation relies on the fact that, for a bit vector (where the value of each dimension is either 0 or 1) then and .
This is a potentially confusing representation, because the function as expressed over vectors is more general, unless its domain is explicitly restricted. Properties of do not necessarily extend to . In particular, the difference function does not preserve triangle inequality, and is not therefore a proper distance metric, whereas is.
There is a real danger that the combination of "Tanimoto Distance" being defined using this formula, along with the statement "Tanimoto Distance is a proper distance metric" will lead to the false conclusion that the function is in fact a distance metric over vectors or multisets in general, whereas its use in similarity search or clustering algorithms may fail to produce correct results.
Lipkus uses a definition of Tanimoto similarity which is equivalent to , and refers to Tanimoto distance as the function . It is however made clear within the paper that the context is restricted by the use of a (positive) weighting vector such that, for any vector A being considered, . Under these circumstances, the function is a proper distance metric, and so a set of vectors governed by such a weighting vector forms a metric space under this function.
See also
- Sørensen's quotient of similaritySørensen similarity indexThe Sørensen index, also known as Sørensen’s similarity coefficient, is a statistic used for comparing the similarity of two samples. It was developed by the botanist Thorvald Sørensen and published in 1948....
- Mountford's index of similarity
- Hamming distanceHamming distanceIn information theory, the Hamming distance between two strings of equal length is the number of positions at which the corresponding symbols are different...
- Dice's coefficient, which is equivalent: and
- Tversky indexTversky indexThe Tversky index, named after Amos Tversky, is an asymmetric similarity measure that compares a variant to a prototype. The Tversky index can be seen as a generalization of Dice's coefficient and Tanimoto coefficient....
- CorrelationCorrelationIn statistics, dependence refers to any statistical relationship between two random variables or two sets of data. Correlation refers to any of a broad class of statistical relationships involving dependence....
- Mutual informationMutual informationIn probability theory and information theory, the mutual information of two random variables is a quantity that measures the mutual dependence of the two random variables...
, a normalized metricated variant of which is an entropic Jaccard distance.
External links
- Jaccard's index and species diversity
- Introduction to Data Mining lecture notes from Tan, Steinbach, Kumar
- http://sourceforge.net/projects/simmetrics/ SimMetrics a sourceforge implementation of Jaccard index and many other similarity metrics
- Web based tool for comparing texts using Jaccard coefficient