Conjugate prior
In Bayesian probability theory, if the posterior distributions p(θ|x) are in the same family as the prior probability distribution p(θ), the prior and posterior are then called conjugate distributions, and the prior is called a conjugate prior for the likelihood. For example, the Gaussian family is conjugate to itself (or self-conjugate) with respect to a Gaussian likelihood function: if the likelihood function is Gaussian, choosing a Gaussian prior over the mean will ensure that the posterior distribution is also Gaussian. The concept, as well as the term "conjugate prior", was introduced by Howard Raiffa and Robert Schlaifer in their work on Bayesian decision theory. A similar concept had been discovered independently by George Alfred Barnard.
Consider the general problem of inferring a distribution for a parameter θ given some datum or data x. From Bayes' theorem, the posterior distribution is equal to the product of the likelihood function p(x|θ) and the prior p(θ), normalized (divided) by the probability of the data p(x):

p(θ|x) = p(x|θ) p(θ) / p(x),  where  p(x) = ∫ p(x|θ) p(θ) dθ.
Let the likelihood function be considered fixed; the likelihood function is usually well-determined from a statement of the data-generating process. It is clear that different choices of the prior distribution p(θ) may make the integral more or less difficult to calculate, and the product p(x|θ) × p(θ) may take one algebraic form or another. For certain choices of the prior, the posterior has the same algebraic form as the prior (generally with different parameter values). Such a choice is a conjugate prior.
A conjugate prior is an algebraic convenience, giving a closed-form expression for the posterior; otherwise a difficult numerical integration may be necessary. Further, conjugate priors may give intuition by more transparently showing how a likelihood function updates a prior distribution.
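To see the convenience concretely, the following minimal Python sketch (the data, prior settings, and variable names are invented for illustration) treats the self-conjugate Gaussian case from the introduction: the closed-form conjugate update for the mean is compared against the grid integration one would otherwise have to perform.

import numpy as np
from scipy import stats

# Gaussian likelihood with known sigma; Gaussian prior over the unknown mean mu.
# All numbers here are illustrative, not from the article.
sigma = 2.0                      # known observation noise
mu0, tau0 = 0.0, 5.0             # prior mean and prior standard deviation
x = np.array([1.2, 2.3, 1.8, 2.9])
n = len(x)

# Closed-form conjugate update: the posterior over mu is again Gaussian.
post_prec = 1.0 / tau0**2 + n / sigma**2
post_mean = (mu0 / tau0**2 + x.sum() / sigma**2) / post_prec
post_sd = post_prec ** -0.5

# Numerical check: normalize likelihood * prior on a grid, which is in effect
# what one would have to do without conjugacy.
grid = np.linspace(-10, 10, 20001)
unnorm = stats.norm.pdf(x[:, None], loc=grid, scale=sigma).prod(axis=0) \
         * stats.norm.pdf(grid, loc=mu0, scale=tau0)
density = unnorm / np.trapz(unnorm, grid)
grid_mean = np.trapz(grid * density, grid)

print(post_mean, post_sd)   # closed form
print(grid_mean)            # numerical approximation; should agree closely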
All members of the exponential family have conjugate priors. See Gelman et al. for a catalog.
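In outline, this is the standard general result (the notation here is introduced only for this statement and is not used elsewhere in the article): if the likelihood has the exponential-family form

p(x|θ) = h(x) exp(η(θ) · T(x) − A(θ)),

then the family of priors

p(θ|χ, ν) ∝ exp(η(θ) · χ − ν A(θ))

is conjugate: after observing x₁, …, xₙ, the posterior has the same form with χ replaced by χ + Σᵢ T(xᵢ) and ν replaced by ν + n.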
Example
The form of the conjugate prior can generally be determined by inspection of the probability density or probability mass function of a distribution. For example, consider a random variable which is a Bernoulli trial with unknown probability of success q in [0,1]. The probability density function has the form

p(x|q) = q^x (1 − q)^(1−x),  for x ∈ {0, 1}.
Expressed as a function of q, this has the form

q^a (1 − q)^b

for some constants a and b. Generally, this functional form will have an additional multiplicative factor (the normalizing constant) ensuring that the function is a probability distribution, i.e. the integral over the entire range is 1. This factor will often be a function of a and b, but never of q.
In fact, the usual conjugate prior is the beta distribution with parameters α and β:

p(q) = q^(α−1) (1 − q)^(β−1) / Β(α, β),

where α and β are chosen to reflect any existing belief or information (α = 1 and β = 1 would give a uniform distribution) and Β(α, β) is the Beta function acting as a normalising constant.
In this context, α and β are called hyperparameters (parameters of the prior), to distinguish them from parameters of the underlying model (here q). It is a typical characteristic of conjugate priors that the dimensionality of the hyperparameters is one greater than that of the parameters of the original distribution. If all parameters are scalar values, then this means that there will be one more hyperparameter than parameter; but this also applies to vector-valued and matrix-valued parameters. (See the general article on the exponential family, and consider also the Wishart distribution, conjugate prior of the covariance matrix of a multivariate normal distribution, for an example where a large dimensionality is involved.)
If we then sample this random variable and get s successes and f failures, we have

p(q|s, f) ∝ q^s (1 − q)^f × q^(α−1) (1 − q)^(β−1) = q^(s+α−1) (1 − q)^(f+β−1),

which is another beta distribution, with parameters (α + s, β + f). This posterior can then be used as the prior if more samples arrive, with the hyperparameters simply accumulating each extra piece of information as it comes.
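Concretely, the conjugate update reduces to adding counts to the hyperparameters, as in this minimal Python sketch (the prior settings and observed counts are illustrative):

from scipy import stats

# Beta-Bernoulli conjugate update; all numbers here are made up for illustration.
alpha, beta = 2.0, 2.0       # Beta prior hyperparameters
s, f = 7, 3                  # observed successes and failures

# Conjugate update: the posterior is Beta(alpha + s, beta + f) -- no integration needed.
posterior = stats.beta(alpha + s, beta + f)

print(posterior.mean())          # posterior mean of q: (alpha+s)/(alpha+beta+s+f)
print(posterior.interval(0.95))  # a 95% credible interval for q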
Table of conjugate distributions
For a sample of n observations x₁, …, xₙ, the conjugate updates are:

Likelihood                       | Model parameter              | Conjugate prior | Prior hyperparameters | Posterior hyperparameters
Gamma with known shape α         | β (inverse scale)            | Gamma           | α₀, β₀                | α₀ + nα, β₀ + Σᵢ xᵢ
Inverse Gamma with known shape α | β (inverse scale)            | Gamma           | α₀, β₀                | α₀ + nα, β₀ + Σᵢ 1/xᵢ
Gamma                            | α (shape), β (inverse scale) | —               | —                     | —
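As a quick numerical check of the first row, the following Python sketch (with invented data and hyperparameter values) applies the same add-the-sufficient-statistics pattern as in the Bernoulli example:

import numpy as np
from scipy import stats

# Conjugate update for a Gamma likelihood with known shape (first table row).
rng = np.random.default_rng(0)
shape_known = 3.0                                      # known shape alpha of the likelihood
true_rate = 2.0                                        # rate beta we pretend is unknown
x = rng.gamma(shape_known, 1.0 / true_rate, size=50)   # numpy parameterizes by scale = 1/rate

a0, b0 = 1.0, 1.0                     # Gamma(a0, b0) prior on the rate beta
a_post = a0 + len(x) * shape_known    # alpha_0 + n * alpha
b_post = b0 + x.sum()                 # beta_0 + sum of observations

posterior = stats.gamma(a_post, scale=1.0 / b_post)    # scipy uses scale = 1/rate
print(posterior.mean())               # should be near true_rate = 2.0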