Hyperprior
In Bayesian statistics, a hyperprior is a prior distribution on a hyperparameter, that is, on a parameter of a prior distribution.
As with the term hyperparameter, the prefix hyper distinguishes it from a prior distribution of a parameter of the model for the underlying system. Hyperpriors arise particularly in the use of conjugate priors.
For example, if one is using a beta distribution to model the distribution of the parameter p of a Bernoulli distribution, then:
- The Bernoulli distribution (with parameter p) is the model of the underlying system;
- p is a parameter of the underlying system (Bernoulli distribution);
- The beta distribution (with parameters α and β) is the prior distribution of p;
- α and β are parameters of the prior distribution (beta distribution), hence hyperparameters;
- A prior distribution of α and β is thus a hyperprior.
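The three levels listed above can be sketched as a generative procedure. This is a minimal illustration only: the choice of independent Gamma(2, 1) hyperpriors on α and β, and the function name `sample_hierarchy`, are assumptions made for the example and are not prescribed by the article.

```python
import random


def sample_hierarchy(n_obs, seed=0):
    """Draw once from hyperprior -> prior -> Bernoulli model.

    Returns (alpha, beta, p, data).
    """
    rng = random.Random(seed)

    # Hyperprior (an assumption for this sketch): alpha and beta are
    # drawn independently from a Gamma distribution with shape 2, scale 1.
    alpha = rng.gammavariate(2.0, 1.0)
    beta = rng.gammavariate(2.0, 1.0)

    # Prior: p ~ Beta(alpha, beta); alpha and beta are the hyperparameters.
    p = rng.betavariate(alpha, beta)

    # Model of the underlying system: each observation ~ Bernoulli(p).
    data = [1 if rng.random() < p else 0 for _ in range(n_obs)]
    return alpha, beta, p, data


if __name__ == "__main__":
    alpha, beta, p, data = sample_hierarchy(10)
    print("hyperparameters:", alpha, beta)
    print("parameter p:", p)
    print("observations:", data)
```

Reading the function from bottom to top recovers the list: the data come from the Bernoulli model, p comes from the beta prior, and α and β come from the hyperprior.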
In principle, one can iterate the above: if the hyperprior itself has parameters, these may be called hyperhyperparameters, and so forth.
One can analogously call the posterior distribution on the hyperparameter the hyperposterior, and, if these are in the same family, call them conjugate hyperdistributions or a conjugate hyperprior. However, this rapidly becomes very abstract and removed from the original problem.
Purpose
Hyperpriors, like conjugate priors, are a computational convenience – they do not change the process of Bayesian inference, but simply allow one to more easily describe and compute with the prior.
Uncertainty
Firstly, use of a hyperprior allows one to express uncertainty in a hyperparameter. Taking a fixed prior is an assumption; varying a hyperparameter of the prior allows one to do sensitivity analysis on this assumption; and taking a distribution on this hyperparameter allows one to express uncertainty in it: "assume that the prior is of this form (this parametric family), but that we are uncertain as to precisely what the values of the parameters should be".
Mixture distribution
More abstractly, if one uses a hyperprior, then the prior distribution (on the parameter of the underlying model) is itself a mixture density: it is the weighted average of the various prior distributions (over different hyperparameters), with the hyperprior being the weighting. This adds additional possible distributions (beyond the parametric family one is using), because parametric families of distributions are generally not convex sets – as a mixture density is a convex combination of distributions, it will in general lie outside the family.
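The weighted average can be written out as follows. The symbols are assumptions introduced here for illustration: θ denotes the parameter of the underlying model, φ the hyperparameter, p(θ | φ) the prior for a given hyperparameter value, and p(φ) the hyperprior.

```latex
p(\theta) = \int p(\theta \mid \phi)\, p(\phi)\, d\phi
```

When the hyperprior puts weights w_1, ..., w_K on finitely many hyperparameter values φ_1, ..., φ_K, the integral reduces to the finite convex combination

```latex
p(\theta) = \sum_{k=1}^{K} w_k \, p(\theta \mid \phi_k),
\qquad w_k \ge 0, \quad \sum_{k=1}^{K} w_k = 1 .
```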
For instance, the mixture of two normal distributions is not a normal distribution: if one takes two normal distributions with sufficiently distant means (for equal variances, more than two standard deviations apart) and mixes 50% of each, one obtains a bimodal distribution, which is thus not normal. In fact, the convex hull of normal distributions is dense in all distributions, so in some cases one can approximate a given prior arbitrarily closely by using a family with a suitable hyperprior.
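The bimodality claim can be checked numerically. The sketch below counts local maxima of a 50/50 mixture density on a grid; the helper names, the equal standard deviations, and the grid resolution are all choices made for this illustration.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and sd."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def mixture_pdf(x, mean_a, mean_b, sd=1.0):
    """Density of a 50/50 mixture of N(mean_a, sd^2) and N(mean_b, sd^2)."""
    return 0.5 * normal_pdf(x, mean_a, sd) + 0.5 * normal_pdf(x, mean_b, sd)


def count_modes(mean_a, mean_b, sd=1.0, steps=2001):
    """Count strict local maxima of the mixture density on an evenly spaced grid."""
    lo = min(mean_a, mean_b) - 4.0 * sd
    hi = max(mean_a, mean_b) + 4.0 * sd
    xs = [lo + (hi - lo) * i / (steps - 1) for i in range(steps)]
    ys = [mixture_pdf(x, mean_a, mean_b, sd) for x in xs]
    return sum(1 for i in range(1, steps - 1) if ys[i - 1] < ys[i] > ys[i + 1])


if __name__ == "__main__":
    print("means -3 and 3:", count_modes(-3.0, 3.0), "mode(s)")
    print("means -0.5 and 0.5:", count_modes(-0.5, 0.5), "mode(s)")
```

With well-separated means the grid search finds two peaks, while with close means it finds one. The close-means mixture is still not normal; it is simply unimodal.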
This approach is particularly useful with conjugate priors: each individual conjugate prior has an easily computed posterior, and the posterior of a mixture of conjugate priors is a mixture of the corresponding posteriors, with each weight rescaled by how well that component predicts the data (its marginal likelihood). Beyond this reweighting, one only needs to know how each conjugate prior changes.
Using a single conjugate prior may be too restrictive, but using a mixture of conjugate priors may give one the desired distribution in a form that is easy to compute with.
This is similar to decomposing a function in terms of eigenfunctions – see Conjugate prior: Analogy with eigenfunctions.
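As a sketch of updating a mixture of conjugate priors, consider a finite mixture of beta priors on the parameter p of a Bernoulli model. The function names and the representation of a mixture as a list of `(weight, a, b)` tuples are assumptions made for this example. Each component is updated by the usual beta rule, and each weight is multiplied by that component's marginal likelihood, B(a + s, b + f) / B(a, b), before renormalizing.

```python
import math


def log_beta_fn(a, b):
    """Logarithm of the beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def update_beta_mixture(components, successes, failures):
    """Posterior of a mixture of Beta priors after Bernoulli data.

    components: list of (weight, a, b) with positive weights summing to 1.
    Returns a list of (weight, a, b) describing the posterior mixture.
    """
    updated = []
    for w, a, b in components:
        # Conjugate update of the component itself.
        a_post, b_post = a + successes, b + failures
        # Log marginal likelihood of the data under this component
        # (up to a factor shared by all components).
        log_evidence = log_beta_fn(a_post, b_post) - log_beta_fn(a, b)
        updated.append((math.log(w) + log_evidence, a_post, b_post))

    # Normalize the weights in a numerically stable way.
    max_log = max(lw for lw, _, _ in updated)
    unnorm = [(math.exp(lw - max_log), a, b) for lw, a, b in updated]
    total = sum(u for u, _, _ in unnorm)
    return [(u / total, a, b) for u, a, b in unnorm]


if __name__ == "__main__":
    # Two-component prior: one favouring small p, one favouring large p.
    prior = [(0.5, 2.0, 8.0), (0.5, 8.0, 2.0)]
    for weight, a, b in update_beta_mixture(prior, successes=7, failures=3):
        print(f"weight={weight:.3f}  Beta({a}, {b})")
```

After observing mostly successes, the weight shifts toward the component that favoured large p, while each component's parameters change exactly as they would for a single conjugate prior.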
Dynamical system
A hyperprior is a distribution on the space of possible hyperparameters. If one is using conjugate priors, then this space is preserved by moving to posteriors: as data arrives, the distribution changes but remains on this space. The distribution thus evolves as a dynamical system (each point of hyperparameter space evolving to the updated hyperparameters), over time converging, just as the prior itself converges.
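The evolution of a single point of hyperparameter space can be traced for the beta-Bernoulli case used earlier. This is a sketch under stated assumptions: the data are simulated from a fixed `true_p`, the starting point is Beta(1, 1), and the function name is hypothetical. It follows one point only, not a full hyperprior over the space.

```python
import random


def hyperparameter_trajectory(true_p, n_steps, alpha=1.0, beta=1.0, seed=0):
    """Trace (alpha, beta) as Bernoulli observations arrive one at a time.

    Each observation x maps the point (alpha, beta) to
    (alpha + x, beta + 1 - x), the conjugate update for a Beta prior.
    """
    rng = random.Random(seed)
    path = [(alpha, beta)]
    for _ in range(n_steps):
        x = 1 if rng.random() < true_p else 0  # simulated observation
        alpha, beta = alpha + x, beta + (1 - x)
        path.append((alpha, beta))
    return path


if __name__ == "__main__":
    path = hyperparameter_trajectory(true_p=0.7, n_steps=2000)
    for step in (0, 10, 100, 2000):
        a, b = path[step]
        print(step, (a, b), "posterior mean:", a / (a + b))
```

The point itself moves off to infinity as α + β grows by one per observation, but the distribution it indexes concentrates: the mean α / (α + β) settles near the value generating the data.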