Empirical Bayes method
Empirical Bayes methods are procedures for statistical inference in which the prior distribution is estimated from the data. This approach stands in contrast to standard Bayesian methods, for which the prior distribution is fixed before any data are observed. Despite this difference in perspective, empirical Bayes may be viewed as an approximation to a fully Bayesian treatment of a hierarchical model wherein the parameters at the highest level of the hierarchy are set to their most likely values, instead of being integrated out. Empirical Bayes, also known as maximum marginal likelihood, represents one approach for setting hyperparameters.

Introduction

Empirical Bayes methods can be seen as an approximation to a fully Bayesian treatment of a hierarchical Bayes model.

In a two-stage hierarchical Bayes model, for example, observed data y = {y_1, y_2, ..., y_n} are assumed to be generated from an unobserved set of parameters θ = {θ_1, θ_2, ..., θ_n} according to a probability distribution p(y | θ). In turn, the parameters θ can be considered samples drawn from a population characterised by hyperparameters η according to a probability distribution p(θ | η). In the hierarchical Bayes model, though not in the empirical Bayes approximation, the hyperparameters η are considered to be drawn from an unparameterized distribution p(η).

Information about a particular quantity of interest θ_i therefore comes not only from the properties of those data which directly depend on it, but also from the properties of the population of parameters θ as a whole, inferred from the data as a whole and summarised by the hyperparameters η.
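To make this two-stage structure concrete, here is a minimal generative sketch in Python; the Gamma/Poisson choice and all numbers are illustrative assumptions (anticipating the Poisson-Gamma model discussed later), not part of the original article:

    import numpy as np

    rng = np.random.default_rng(0)

    # Hyperparameters eta = (alpha, beta) characterising the population of
    # parameters (hypothetical values, chosen only for illustration).
    alpha, beta = 2.0, 1.5

    # Stage 1: draw the unobserved parameters theta_i from p(theta | eta).
    theta = rng.gamma(shape=alpha, scale=beta, size=1000)

    # Stage 2: draw the observed data y_i from p(y | theta).
    y = rng.poisson(theta)

    # Empirical Bayes works in the reverse direction: it uses all of y to
    # estimate eta, then uses the estimated eta when inferring each theta_i.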

Using Bayes' theorem,

$$p(\theta \mid y) = \frac{p(y \mid \theta)\, p(\theta)}{p(y)} = \frac{p(y \mid \theta)}{p(y)} \int p(\theta \mid \eta)\, p(\eta)\, d\eta .$$
In general, this integral will not be tractable analytically and must be evaluated by numerical methods, typically stochastic approximations such as Markov chain Monte Carlo sampling or deterministic approximations such as quadrature.
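For instance, when η is one-dimensional the marginalisation can be carried out by simple deterministic quadrature on a grid. The toy Gaussian model below is entirely hypothetical (chosen because the answer is known in closed form) and is only a sketch of that idea:

    import numpy as np
    from scipy.stats import norm

    # Toy model: theta | eta ~ N(eta, 1) and eta ~ N(0, 2^2), so the marginal
    # p(theta) = integral of p(theta | eta) p(eta) d(eta) is N(0, 1 + 4).
    eta_grid = np.linspace(-12.0, 12.0, 2001)
    d_eta = eta_grid[1] - eta_grid[0]
    p_eta = norm.pdf(eta_grid, loc=0.0, scale=2.0)

    theta = 1.3
    integrand = norm.pdf(theta, loc=eta_grid, scale=1.0) * p_eta
    p_theta = np.sum(integrand) * d_eta              # simple grid quadrature

    print(p_theta)                                    # numerical value
    print(norm.pdf(theta, loc=0.0, scale=5.0 ** 0.5))  # closed form, should agree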

Alternatively, the expression can be written as

$$p(\theta \mid y) = \int p(\theta \mid \eta, y)\, p(\eta \mid y)\, d\eta$$

and we can expand

$$p(\eta \mid y) = \int p(\eta \mid \theta)\, p(\theta \mid y)\, d\theta .$$

These suggest an iterative scheme, qualitatively similar in structure to a Gibbs sampler, to evolve successively improved approximations to p(θ | y) and p(η | y). First, calculate an initial approximation to p(θ | y) ignoring the η dependence completely; then calculate an approximation to p(η | y) based upon the initial approximate distribution of p(θ | y); then use this p(η | y) to update the approximation for p(θ | y); then update p(η | y); and so on.

When the true distribution p(η | y) is sharply peaked, the integral determining p(θ | y) may be not much changed by replacing the probability distribution over η with a point estimate η* representing the distribution's peak (or, alternatively, its mean),

$$p(\theta \mid y) \simeq p(\theta \mid \eta^{*}, y) .$$

With this approximation, the above iterative scheme becomes the EM algorithm.
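As an illustration of this point-estimate scheme (a sketch under assumed Gaussian forms, not part of the original article), consider the model y_i | θ_i ~ N(θ_i, σ²) with σ² known and θ_i ~ N(μ, τ²). The hyperparameters η = (μ, τ²) are repeatedly re-estimated from the current approximate posterior over the θ_i:

    import numpy as np

    def em_gaussian_gaussian(y, sigma2, n_iter=100):
        """EM iteration for the hyperparameters (mu, tau2) in the model
        y_i | theta_i ~ N(theta_i, sigma2),  theta_i ~ N(mu, tau2)."""
        y = np.asarray(y, dtype=float)
        mu, tau2 = y.mean(), max(y.var() - sigma2, 1e-8)   # crude starting values
        for _ in range(n_iter):
            # "E-step": posterior of each theta_i given the current (mu, tau2).
            v = 1.0 / (1.0 / sigma2 + 1.0 / tau2)           # posterior variance
            m = v * (y / sigma2 + mu / tau2)                # posterior means
            # "M-step": update the hyperparameters from those posteriors.
            mu = m.mean()
            tau2 = np.mean((m - mu) ** 2 + v)
        return mu, tau2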

"Empirical Bayes" as a label can cover a wide variety of methods, but most can be regarded as an early truncation of either the above scheme or something quite like it. Point estimates, rather than the whole distribution, are typically used for the parameter(s) ; the estimates for are typically made from the first approximation to without subsequent refinement; these estimates for are usually made without considering an appropriate prior distribution for .

Robbins' method (1956): non-parametric empirical Bayes (NPEB)

We consider a case of compound sampling, where the probability for each y_i (conditional on θ_i) is specified by a Poisson distribution,

$$p(y_i \mid \theta_i) = \frac{\theta_i^{\,y_i}\, e^{-\theta_i}}{y_i!} ,$$

while the prior on θ is unspecified except that the θ_i are i.i.d. from an unknown distribution with cumulative distribution function G(θ). Compound sampling arises in a variety of statistical estimation problems, such as accident rates and clinical trials. We simply seek a point prediction of θ_i given all the observed data. Because the prior is unspecified, we seek to do this without knowledge of G (see Carlin and Louis, Sec. 3.2 and Appendix B).

Under squared error loss (SEL), the conditional expectation E(θ_i | Y_i = y_i) is a reasonable quantity to use for prediction. For the Poisson compound sampling model, this quantity is

$$E(\theta_i \mid y_i) = \frac{\int \bigl(\theta^{\,y_i + 1}\, e^{-\theta} / y_i!\bigr)\, dG(\theta)}{\int \bigl(\theta^{\,y_i}\, e^{-\theta} / y_i!\bigr)\, dG(\theta)} .$$

This can be simplified by expressing the numerator and denominator in terms of the marginal distribution p_G, yielding

$$E(\theta_i \mid y_i) = \frac{(y_i + 1)\, p_G(y_i + 1)}{p_G(y_i)} ,$$

where p_G(y) = ∫ (θ^y e^{-θ} / y!) dG(θ) is the marginal distribution obtained by integrating out θ over G.

To take advantage of this, Robbins (1956) suggested estimating the marginals with their empirical frequencies, yielding the fully non-parametric estimate

$$E(\theta_i \mid y_i) \approx (y_i + 1)\, \frac{\#\{Y_j = y_i + 1\}}{\#\{Y_j = y_i\}} ,$$

where # denotes the number of observations in the sample taking the given value (see also Good–Turing frequency estimation).
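A minimal sketch of this estimator in Python (the function name and interface are illustrative, not a standard library API):

    from collections import Counter

    def robbins_estimate(y):
        """Robbins' non-parametric empirical Bayes estimate of E(theta_i | y_i)
        for Poisson counts y, replacing the marginals p_G by empirical frequencies."""
        counts = Counter(y)
        return [(yi + 1) * counts.get(yi + 1, 0) / counts[yi] for yi in y]

Note that the raw estimate is zero whenever no observation in the sample equals y_i + 1 (in particular at the largest observed count), which is one reason smoothed versions of the marginal frequencies are often used in practice.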

Example: Accident rates

Suppose each customer of an insurance company has an "accident rate" Θ and is insured against accidents; the probability distribution of Θ is the underlying distribution, and it is unknown. The number of accidents suffered by each customer in a specified time period has a Poisson distribution with expected value equal to the particular customer's accident rate. The actual number of accidents experienced by a customer is the observable quantity. A crude way to estimate the underlying probability distribution of the accident rate Θ is to estimate the proportion of members of the whole population suffering 0, 1, 2, 3, ... accidents during the specified time period as the corresponding proportion in the observed random sample. Having done so, it is then desired to predict the accident rate of each customer in the sample. As above, one may use the conditional expected value of the accident rate Θ given the observed number of accidents during the baseline period. Thus, if a customer suffers six accidents during the baseline period, that customer's estimated accident rate is 7 × [the proportion of the sample who suffered 7 accidents] / [the proportion of the sample who suffered 6 accidents]. Note that if the proportion of people suffering k accidents is a decreasing function of k, the customer's predicted accident rate will often be lower than their observed number of accidents. This shrinkage effect is typical of empirical Bayes analyses.
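As a purely hypothetical numerical illustration of that calculation (the sample counts below are invented):

    # Invented sample: out of 1000 customers, 20 suffered exactly 6 accidents
    # and 9 suffered exactly 7 accidents during the baseline period.
    n, n_six, n_seven = 1000, 20, 9

    # Estimated rate for a customer observed with 6 accidents:
    # (6 + 1) * [proportion with 7 accidents] / [proportion with 6 accidents]
    rate_hat = 7 * (n_seven / n) / (n_six / n)
    print(rate_hat)   # 3.15 -- well below the observed count of 6 (shrinkage)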

Parametric empirical Bayes

If the likelihood and its prior take on simple parametric forms (such as 1- or 2-dimensional likelihood functions with simple conjugate priors), then the empirical Bayes problem is only to estimate the marginal m(y | η) and the hyperparameters η using the complete set of empirical measurements. For example, one common approach, called parametric empirical Bayes point estimation, is to approximate the marginal using the maximum likelihood estimate (MLE), or a moment expansion, which allows one to express the hyperparameters η in terms of the empirical mean and variance. This simplified marginal allows one to plug the empirical averages into a point estimate for the prior θ. The resulting equation for the prior θ is greatly simplified, as shown below.

There are several common parametric empirical Bayes models, including the Poisson-Gamma model (below), the Beta-binomial model, the Gaussian-Gaussian model, and the multinomial-Dirichlet (or multivariate Pólya) model, as well as specific models for Bayesian linear regression and Bayesian multivariate linear regression. More advanced approaches include hierarchical Bayes models and Bayesian mixture models.

Poisson-Gamma model

Continuing the accident-rate example above, let the likelihood be a Poisson distribution, and let the prior now be specified by the conjugate prior, which is a Gamma distribution G(α, β) (where η = (α, β)):

$$\rho(\theta \mid \alpha, \beta) = \frac{\theta^{\,\alpha - 1}\, e^{-\theta / \beta}}{\beta^{\alpha}\, \Gamma(\alpha)} \quad \text{for } \theta > 0,\ \alpha > 0,\ \beta > 0 .$$
It is straightforward to show that the posterior is also a Gamma distribution. Write

$$\rho(\theta \mid y) \propto \rho(y \mid \theta)\, \rho(\theta \mid \alpha, \beta) ,$$

where the marginal has been omitted since it does not depend explicitly on θ. Expanding the terms which do depend on θ gives the posterior as

$$\rho(\theta \mid y) \propto \theta^{\,y + \alpha - 1}\, e^{-\theta\,(1 + 1/\beta)} .$$
So we see that the posterior density is also a Gamma distribution G(α′, β′), where α′ = y + α and β′ = (1 + 1/β)^{-1}. Also notice that the marginal is simply the integral of the likelihood times the prior over all θ, which turns out to be a negative binomial distribution.

To apply empirical Bayes, we will approximate the marginal using the maximum likelihood estimate (MLE). But since the posterior is a Gamma distribution, the MLE of the marginal turns out to be just the mean of the posterior, which is the point estimate E(θ | y) we need. Recalling that the mean of a Gamma distribution G(α′, β′) is simply α′β′, we have

$$E(\theta \mid y) = \alpha' \beta' = \frac{y + \alpha}{1 + 1/\beta} = \frac{\beta}{1 + \beta}\, y + \frac{1}{1 + \beta}\, (\alpha \beta) .$$
To obtain the values of α and β, empirical Bayes prescribes estimating the prior mean αβ and variance αβ² from the complete set of empirical data (for example, by the method of moments).

The resulting point estimate E(θ | y) is therefore like a weighted average of the sample value y and the prior mean μ = αβ. This turns out to be a general feature of empirical Bayes: the point estimates for the prior (i.e. the mean) will look like weighted averages of the sample estimate and the prior estimate (and likewise for estimates of the variance).
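A minimal sketch of this parametric empirical Bayes recipe in Python (the function name is illustrative; it assumes one Poisson count per unit and a shape-scale Gamma prior, and matches the marginal mean αβ and marginal variance αβ + αβ² to the empirical moments):

    import numpy as np

    def poisson_gamma_eb(y):
        """Parametric empirical Bayes point estimates for the Poisson-Gamma model,
        with the hyperparameters (alpha, beta) set by the method of moments."""
        y = np.asarray(y, dtype=float)
        m, v = y.mean(), y.var(ddof=1)
        beta_hat = max((v - m) / m, 1e-8)   # excess of variance over the Poisson part
        alpha_hat = m / beta_hat
        # Posterior-mean point estimate for each unit: a weighted average of y_i
        # and the estimated prior mean alpha_hat * beta_hat.
        w = beta_hat / (1.0 + beta_hat)
        return w * y + (1.0 - w) * (alpha_hat * beta_hat)

The weight β̂/(1 + β̂) plays the role of the shrinkage factor in the weighted-average formula above.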

See also

  • Bayes estimator
  • Bayes' theorem
  • Bayesian brain
  • Bayesian probability
  • Best linear unbiased prediction
  • Conditional probability
  • Monty Hall problem
  • Posterior probability
  • Bayesian coding hypothesis
