Maximum a posteriori
In Bayesian statistics, a maximum a posteriori probability (MAP) estimate is a mode of the posterior distribution. The MAP can be used to obtain a point estimate of an unobserved quantity on the basis of empirical data. It is closely related to Fisher's method of maximum likelihood (ML), but employs an augmented optimization objective which incorporates a prior distribution over the quantity one wants to estimate. MAP estimation can therefore be seen as a regularization of ML estimation.
Description
Assume that we want to estimate an unobserved population parameter $\theta$ on the basis of observations $x$. Let $f$ be the sampling distribution of $x$, so that $f(x \mid \theta)$ is the probability of $x$ when the underlying population parameter is $\theta$. Then the function

$$\theta \mapsto f(x \mid \theta)$$

is known as the likelihood function and the estimate

$$\hat{\theta}_{\mathrm{MLE}}(x) = \underset{\theta}{\operatorname{arg\,max}}\, f(x \mid \theta)$$

is the maximum likelihood estimate of $\theta$.
Now assume that a prior distribution $g$ over $\theta$ exists. This allows us to treat $\theta$ as a random variable as in Bayesian statistics. Then the posterior distribution of $\theta$ is as follows:

$$\theta \mapsto f(\theta \mid x) = \frac{f(x \mid \theta)\, g(\theta)}{\int_{\Theta} f(x \mid \vartheta)\, g(\vartheta)\, d\vartheta},$$

where $g$ is the density function of $\theta$ and $\Theta$ is the domain of $g$. This is a straightforward application of Bayes' theorem.
The method of maximum a posteriori estimation then estimates $\theta$ as the mode of the posterior distribution of this random variable:

$$\hat{\theta}_{\mathrm{MAP}}(x) = \underset{\theta}{\operatorname{arg\,max}}\, f(\theta \mid x) = \underset{\theta}{\operatorname{arg\,max}}\, \frac{f(x \mid \theta)\, g(\theta)}{\int_{\Theta} f(x \mid \vartheta)\, g(\vartheta)\, d\vartheta} = \underset{\theta}{\operatorname{arg\,max}}\, f(x \mid \theta)\, g(\theta).$$

The denominator of the posterior distribution (the so-called partition function) does not depend on $\theta$ and therefore plays no role in the optimization. Observe that the MAP estimate of $\theta$ coincides with the ML estimate when the prior $g$ is uniform (that is, a constant function). The MAP estimate is a limit of Bayes estimators under a sequence of 0-1 loss functions, but it is generally not a Bayes estimator per se, unless $\theta$ is discrete.
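As a concrete illustration of the definition above, the following sketch contrasts the ML and MAP estimates of a coin-flip probability on a grid. The data (7 heads in 10 tosses) and the Beta(2, 2) prior are assumptions chosen here purely for illustration; they are not part of the original text.

import numpy as np

# Hypothetical data: 7 heads in n = 10 tosses; theta is the probability of heads.
heads, n = 7, 10
theta = np.linspace(1e-3, 1 - 1e-3, 10_000)

# log f(x | theta) for a binomial observation (constant binomial coefficient omitted).
log_likelihood = heads * np.log(theta) + (n - heads) * np.log(1 - theta)

# log g(theta) for the assumed Beta(2, 2) prior (normalizing constant omitted).
log_prior = np.log(theta) + np.log(1 - theta)

theta_mle = theta[np.argmax(log_likelihood)]               # arg max of f(x | theta)
theta_map = theta[np.argmax(log_likelihood + log_prior)]   # arg max of f(x | theta) g(theta)
print(theta_mle)   # about 0.700 = 7/10
print(theta_map)   # about 0.667 = (7 + 1)/(10 + 2), the Beta(9, 5) posterior mode

With a uniform (constant) prior the two arg max values coincide, matching the remark above.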
Computing
MAP estimates can be computed in several ways:
- Analytically, when the mode(s) of the posterior distribution can be given in closed form. This is the case when conjugate priors are used.
- Via numerical optimization such as the conjugate gradient method or Newton's method. This usually requires first or second derivatives, which have to be evaluated analytically or numerically (see the sketch after this list).
- Via a modification of an expectation-maximization algorithm. This does not require derivatives of the posterior density.
- Via a Monte Carlo method using simulated annealing.
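The numerical-optimization route can be sketched as follows. The data, the normal model, and the prior parameters below are illustrative assumptions rather than anything prescribed by the text, and any general-purpose optimizer could be substituted for scipy.optimize.minimize.

import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
mu0, sigma_m = 0.0, 0.5                  # assumed N(mu0, sigma_m^2) prior on mu
sigma_v = 1.0                            # assumed known observation standard deviation
x = rng.normal(2.0, sigma_v, size=20)    # simulated data for the sketch

def neg_log_posterior(params):
    # Unnormalized negative log posterior: -log f(x | mu) - log g(mu).
    # The evidence (the denominator) is constant in mu and can be ignored.
    mu = params[0]
    return -(norm.logpdf(x, loc=mu, scale=sigma_v).sum()
             + norm.logpdf(mu, loc=mu0, scale=sigma_m))

result = minimize(neg_log_posterior, x0=[0.0])
print(result.x[0])   # numerical MAP estimate of mu

Gradient-based methods such as Newton's method only require the corresponding first or second derivatives of this same objective.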
Criticism
While MAP estimation is a limit of Bayes estimators (under the 0-1 loss function), it is not very representative of Bayesian methods in general. This is because MAP estimates are point estimates, whereas Bayesian methods are characterized by the use of distributions to summarize data and draw inferences: thus, Bayesian methods tend to report the posterior mean or median instead, together with credible intervals. This is both because these estimators are optimal under squared-error and linear-error loss respectively, which are more representative of typical loss functions, and because the posterior distribution may not have a simple analytic form: in this case, the distribution can be simulated using Markov chain Monte Carlo techniques, while optimization to find its mode(s) may be difficult or impossible.
In many types of models, such as mixture models, the posterior may be multi-modal. In such a case, the usual recommendation is that one should choose the highest mode: this is not always feasible (global optimization is a difficult problem), nor in some cases even possible (such as when identifiability issues arise). Furthermore, the highest mode may be uncharacteristic of the majority of the posterior.
Finally, unlike ML estimators, the MAP estimate is not invariant under reparameterization. Switching from one parameterization to another involves introducing a Jacobian that changes the location of the maximum.
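A small worked example, not taken from the original text, makes this concrete. Suppose the posterior density of a parameter $\theta > 0$ is $p(\theta) = e^{-\theta}$, whose mode is at $\theta = 0$. Under the monotone reparameterization $\varphi = \log \theta$, the density of $\varphi$ is

$$q(\varphi) = p(e^{\varphi})\left|\frac{d\theta}{d\varphi}\right| = e^{\varphi - e^{\varphi}},$$

which is maximized at $\varphi = 0$, corresponding to $\theta = 1$ rather than $\theta = 0$: the Jacobian factor $e^{\varphi}$ has moved the mode. Posterior quantiles, and hence the median, transform consistently under such monotone reparameterizations.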
As an example of the difference between the Bayes estimators mentioned above (mean and median estimators) and using an MAP estimate, consider the case where there is a need to classify inputs as either positive or negative (for example, loans as risky or safe). Suppose there are just three possible hypotheses about the correct method of classification, $h_1$, $h_2$ and $h_3$, with posteriors 0.4, 0.3 and 0.3 respectively. Suppose that, given a new instance $x$, $h_1$ classifies it as positive, whereas the other two classify it as negative. Using the MAP estimate for the correct classifier, $h_1$, $x$ is classified as positive, whereas the Bayes estimators would average over all hypotheses and classify $x$ as negative.
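The same comparison can be written out as a short sketch; the hypothesis names and the dictionary layout are implementation choices made here for illustration.

# Posterior probabilities of the three hypotheses from the example above.
posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}
# Whether each hypothesis classifies the new instance x as positive.
predicts_positive = {"h1": True, "h2": False, "h3": False}

# MAP: commit to the single most probable hypothesis and use its prediction.
map_hypothesis = max(posteriors, key=posteriors.get)
map_label = "positive" if predicts_positive[map_hypothesis] else "negative"

# Bayesian averaging: weight each hypothesis's prediction by its posterior.
p_positive = sum(p for h, p in posteriors.items() if predicts_positive[h])
avg_label = "positive" if p_positive > 0.5 else "negative"

print(map_label)   # positive, since h1 alone decides
print(avg_label)   # negative, since P(positive) = 0.4 < 0.5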
Example
Suppose that we are given a sequence $(x_1, \dots, x_n)$ of IID $N(\mu, \sigma_v^2)$ random variables and an a priori distribution of $\mu$ given by $N(\mu_0, \sigma_m^2)$. We wish to find the MAP estimate of $\mu$.

The function to be maximized is then given by

$$g(\mu)\, f(x \mid \mu) = \frac{1}{\sqrt{2\pi}\,\sigma_m} \exp\left(-\frac{(\mu - \mu_0)^2}{2\sigma_m^2}\right) \prod_{j=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma_v} \exp\left(-\frac{(x_j - \mu)^2}{2\sigma_v^2}\right),$$

which is equivalent to minimizing the following function of $\mu$:

$$\sum_{j=1}^{n} \frac{(x_j - \mu)^2}{2\sigma_v^2} + \frac{(\mu - \mu_0)^2}{2\sigma_m^2}.$$

Thus, we see that the MAP estimator for $\mu$ is given by

$$\hat{\mu}_{\mathrm{MAP}} = \frac{n\,\sigma_m^2}{n\,\sigma_m^2 + \sigma_v^2}\,\bar{x} + \frac{\sigma_v^2}{n\,\sigma_m^2 + \sigma_v^2}\,\mu_0 = \frac{\sigma_m^2 \sum_{j=1}^{n} x_j + \sigma_v^2\,\mu_0}{n\,\sigma_m^2 + \sigma_v^2},$$

which turns out to be a linear interpolation between the prior mean and the sample mean weighted by their respective covariances.

The case of $\sigma_m \to \infty$ is called a non-informative prior and leads to an ill-defined a priori probability distribution; in this case $\hat{\mu}_{\mathrm{MAP}} \to \hat{\mu}_{\mathrm{MLE}} = \bar{x}$.
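A minimal numerical check of the closed-form estimator above; the specific data, prior parameters, and grid-search comparison are assumptions made for this sketch only.

import numpy as np

rng = np.random.default_rng(1)
mu0, sigma_m = 0.0, 2.0                  # assumed prior N(mu0, sigma_m^2)
sigma_v = 1.0                            # assumed known observation standard deviation
x = rng.normal(1.5, sigma_v, size=50)    # simulated observations
n = len(x)

# Closed-form MAP estimate: interpolation between the prior mean and the sample mean.
mu_map = (sigma_m**2 * x.sum() + sigma_v**2 * mu0) / (n * sigma_m**2 + sigma_v**2)

# Brute-force check: maximize log prior + log likelihood over a grid of mu values.
grid = np.linspace(-1.0, 3.0, 100_001)
log_post = (-(grid - mu0)**2 / (2 * sigma_m**2)
            - ((x[:, None] - grid)**2).sum(axis=0) / (2 * sigma_v**2))
print(mu_map, grid[np.argmax(log_post)])   # the two values should agree closely

As $\sigma_m$ grows, mu_map approaches the sample mean, matching the non-informative-prior remark above.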