Mahalanobis distance
In statistics, Mahalanobis distance is a distance measure introduced by P. C. Mahalanobis in 1936. It is based on correlations between variables by which different patterns can be identified and analyzed. It gauges similarity of an unknown sample set to a known one. It differs from Euclidean distance in that it takes into account the correlations of the data set and is scale-invariant. In other words, it is a multivariate effect size.
Definition
Formally, the Mahalanobis distance of a multivariate vector x = (x_1, x_2, ..., x_N)^T from a group of values with mean \mu = (\mu_1, \mu_2, ..., \mu_N)^T and covariance matrix S is defined as:

D_M(x) = \sqrt{(x - \mu)^T S^{-1} (x - \mu)}.

Mahalanobis distance (or "generalized squared interpoint distance" for its squared value) can also be defined as a dissimilarity measure between two random vectors x and y of the same distribution with the covariance matrix S:

d(x, y) = \sqrt{(x - y)^T S^{-1} (x - y)}.

If the covariance matrix is the identity matrix, the Mahalanobis distance reduces to the Euclidean distance. If the covariance matrix is diagonal, then the resulting distance measure is called the normalized Euclidean distance:

d(x, y) = \sqrt{\sum_{i=1}^{N} \frac{(x_i - y_i)^2}{s_i^2}},

where s_i is the standard deviation of the x_i and y_i over the sample set.
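As a concrete illustration of these formulas, the following is a minimal Python/NumPy sketch; the sample data and variable names are purely illustrative, and SciPy's scipy.spatial.distance.mahalanobis (which takes the inverse covariance matrix) could be used instead of the helper defined here.

import numpy as np

def mahalanobis(x, mu, S):
    """Mahalanobis distance of x from a distribution with mean mu and covariance S:
    sqrt((x - mu)^T S^{-1} (x - mu))."""
    d = np.asarray(x, dtype=float) - np.asarray(mu, dtype=float)
    return float(np.sqrt(d @ np.linalg.solve(S, d)))

# Illustrative data: estimate mu and S from a sample, then measure a test point.
rng = np.random.default_rng(0)
sample = rng.multivariate_normal([0.0, 0.0], [[2.0, 0.8], [0.8, 1.0]], size=500)
mu = sample.mean(axis=0)
S = np.cov(sample, rowvar=False)
print(mahalanobis([1.0, 1.0], mu, S))

# With S equal to the identity matrix the same helper returns the Euclidean
# distance, and with a diagonal S the normalized Euclidean distance above.
print(mahalanobis([1.0, 1.0], [0.0, 0.0], np.eye(2)))  # sqrt(2)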
Intuitive explanation
Consider the problem of estimating the probability that a test point in N-dimensional Euclidean space
belongs to a set, where we are given sample points that definitely belong to that set. Our first step would be to find the average or center of mass of the sample points. Intuitively, the closer the point in question is to this center of mass, the more likely it is to belong to the set.
However, we also need to know if the set is spread out over a large range or a small range, so that we can decide whether a given distance from the center is noteworthy or not.
The simplistic approach is to estimate the standard deviation
of the distances of the sample points from the center of mass. If the distance between the test point and the center of mass is less than one standard deviation, then we might conclude that it is highly probable that the test point belongs to the set. The further away it is, the more likely that the test point should not be classified as belonging to the set.
This intuitive approach can be made quantitative by defining the normalized distance between the test point and the set to be \lVert x - \mu \rVert / \sigma, that is, the distance of the test point x from the center of mass \mu measured in units of the standard deviation \sigma.
By plugging this into the normal distribution we can derive the probability of the test point belonging to the set.
The drawback of the above approach was that we assumed that the sample points are distributed about the center of mass
in a spherical manner. Were the distribution to be decidedly non-spherical, for instance ellipsoidal, then we would expect the probability of the test point belonging to the set to depend not only on the distance from the center of mass, but also on the direction. In those directions where the ellipsoid has a short axis the test point must be closer, while in those where the axis is long the test point can be further away from the center.
Putting this on a mathematical basis, the ellipsoid that best represents the set's probability distribution can be estimated by building the covariance matrix of the samples. The Mahalanobis distance is simply the distance of the test point from the center of mass divided by the width of the ellipsoid in the direction of the test point.
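A small numerical sketch of this directional dependence, assuming a made-up elongated covariance matrix: two test points at the same Euclidean distance from the center of mass have very different Mahalanobis distances depending on whether they lie along the long or the short axis of the ellipsoid.

import numpy as np

# Elongated cloud: large spread along the first axis, small along the second.
S = np.array([[9.0, 0.0],
              [0.0, 1.0]])
mu = np.zeros(2)
S_inv = np.linalg.inv(S)

def d_m(x):
    d = x - mu
    return float(np.sqrt(d @ S_inv @ d))

# Both points are at Euclidean distance 3 from the center:
along_long_axis  = np.array([3.0, 0.0])  # direction in which the ellipsoid is wide
along_short_axis = np.array([0.0, 3.0])  # direction in which the ellipsoid is narrow
print(d_m(along_long_axis))   # 1.0 -> unremarkable for this distribution
print(d_m(along_short_axis))  # 3.0 -> much farther in Mahalanobis terms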
Relationship to leverage
Mahalanobis distance is closely related to the leverage statistic h, but has a different scale: squared Mahalanobis distance = (N − 1)(h − 1/N).
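This identity can be checked numerically. The sketch below assumes h is the leverage (the hat-matrix diagonal) of a design matrix that includes an intercept column; the random design itself is arbitrary and only for illustration.

import numpy as np

rng = np.random.default_rng(0)
N, p = 50, 3
X = rng.normal(size=(N, p))                      # illustrative predictor values

# Leverage: diagonal of the hat matrix for a design with an intercept column.
X1 = np.column_stack([np.ones(N), X])
h = np.diag(X1 @ np.linalg.inv(X1.T @ X1) @ X1.T)

# Squared Mahalanobis distance of each row from the column means,
# using the sample covariance (denominator N - 1).
diff = X - X.mean(axis=0)
d2 = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(np.cov(X, rowvar=False)), diff)

print(np.allclose(d2, (N - 1) * (h - 1.0 / N)))  # True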
Applications
Mahalanobis' discovery was prompted by the problem of identifying the similarities of skulls based on measurements in 1927. Mahalanobis distance is widely used in cluster analysis and classification techniques. It is closely related to Hotelling's T-square distribution used for multivariate statistical testing and to Fisher's linear discriminant analysis that is used for supervised classification.
In order to use the Mahalanobis distance to classify a test point as belonging to one of N classes, one first estimates the covariance matrix of each class, usually based on samples known to belong to each class. Then, given a test sample, one computes the Mahalanobis distance to each class, and classifies the test point as belonging to that class for which the Mahalanobis distance is minimal.
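A minimal sketch of this minimum-distance classification rule in Python/NumPy; the class labels, training samples, and helper functions are invented for illustration, and each class covariance matrix is assumed to be invertible.

import numpy as np

def mahalanobis_sq(x, mean, cov_inv):
    """Squared Mahalanobis distance of x from a class with the given mean
    and inverse covariance matrix."""
    d = x - mean
    return float(d @ cov_inv @ d)

def fit_class(samples):
    """Estimate the mean and inverse covariance from samples known to belong to one class."""
    samples = np.asarray(samples, dtype=float)
    return samples.mean(axis=0), np.linalg.inv(np.cov(samples, rowvar=False))

def classify(x, classes):
    """Assign x to the class with the smallest Mahalanobis distance."""
    return min(classes, key=lambda label: mahalanobis_sq(x, *classes[label]))

# Hypothetical two-class example with made-up training data.
rng = np.random.default_rng(1)
classes = {
    "A": fit_class(rng.normal([0, 0], [1.0, 0.2], size=(100, 2))),
    "B": fit_class(rng.normal([3, 3], [0.5, 1.5], size=(100, 2))),
}
print(classify(np.array([2.5, 2.0]), classes))

If all classes are assumed to share a common (pooled) covariance matrix, this rule is closely related to the linear discriminant analysis mentioned above.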
Mahalanobis distance and leverage are often used to detect outliers, especially in the development of linear regression models. A point that has a greater Mahalanobis distance from the rest of the sample population of points is said to have higher leverage since it has a greater influence on the slope or coefficients of the regression equation. Mahalanobis distance is also used to determine multivariate outliers. Regression techniques can be used to determine if a specific case within a sample population is an outlier via the combination of two or more variable scores. A point can be a multivariate outlier even if it is not a univariate outlier on any variable.
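A common convention, assumed in the sketch below, is to flag points whose squared Mahalanobis distance from the sample mean exceeds a chi-squared quantile; this relies on approximate multivariate normality, and the cutoff level and function names are illustrative choices.

import numpy as np
from scipy import stats

def mahalanobis_outliers(X, alpha=0.001):
    """Flag rows of X whose squared Mahalanobis distance from the sample mean
    exceeds the chi-squared(1 - alpha, df = number of columns) cutoff."""
    X = np.asarray(X, dtype=float)
    diff = X - X.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
    d2 = np.einsum('ij,jk,ik->i', diff, cov_inv, diff)
    return d2 > stats.chi2.ppf(1 - alpha, df=X.shape[1])

# A point can be a multivariate outlier without being extreme on either variable:
rng = np.random.default_rng(2)
z = rng.normal(size=200)
X = np.column_stack([z, z + 0.1 * rng.normal(size=200)])  # strongly correlated pair
X = np.vstack([X, [1.0, -1.0]])                           # unremarkable marginally, far off the trend
print(mahalanobis_outliers(X)[-1])                        # flagged as an outlier

Because the classical mean and covariance are themselves distorted by outliers, robust estimates (for example the minimum covariance determinant) are often substituted in practice.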
Mahalanobis distance has also been widely used in biology, for example in predicting protein structural class, membrane protein type, and protein subcellular localization, as well as many other attributes of proteins through their pseudo amino acid composition.
See also
- Bregman divergence (the Mahalanobis distance is an example of a Bregman divergence)
- Bhattacharyya distance, related, for measuring similarity between data sets (and not between a point and a data set)
- Hellinger distance, also a measure of distance between data sets
External links
- Mahalanobis distance tutorial — interactive online program and spreadsheet computation
- Mahalanobis distance — intuitive, illustrated explanation, from AIAccess.net
- Mahalanobis distance (Nov-17-2006) Overview of Mahalanobis distance, including MATLAB code