Overdispersion
In statistics, overdispersion is the presence of greater variability (statistical dispersion) in a data set than would be expected based on a given simple statistical model.
A common task in applied statistics is choosing a parametric model to fit a given set of empirical observations. This necessitates an assessment of the fit of the chosen model. It is usually possible to choose the model parameters in such a way that the theoretical population mean of the model is approximately equal to the sample mean. However, especially for simple models with few parameters, theoretical predictions may not match empirical observations for higher moments. When the observed variance is higher than the variance of a theoretical model, overdispersion has occurred. Conversely, underdispersion means that there was less variation in the data than predicted. Overdispersion is a very common feature in applied data analysis because, in practice, populations are frequently heterogeneous, contrary to the assumptions implicit within widely used simple parametric models.
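The comparison of observed and model variance can be sketched in a few lines of Python. This is a minimal illustration, assuming count data and a one-parameter model whose variance equals its mean (as is the case for the Poisson distribution). The data values and the function name dispersion_ratio are hypothetical, chosen only for illustration.

```python
from statistics import mean, variance

def dispersion_ratio(counts):
    """Sample variance divided by sample mean.

    Under a model whose variance equals its mean (e.g. Poisson), this
    ratio should be close to 1; values well above 1 suggest overdispersion,
    values well below 1 suggest underdispersion.
    """
    return variance(counts) / mean(counts)

# Hypothetical counts, made up for illustration only.
counts = [0, 0, 1, 0, 7, 2, 0, 12, 1, 0, 3, 10]
ratio = dispersion_ratio(counts)
print(f"mean = {mean(counts):.2f}, variance = {variance(counts):.2f}, ratio = {ratio:.2f}")
```

A ratio computed from a small sample is noisy, so in practice it would be read alongside a formal goodness-of-fit assessment rather than on its own.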
Poisson
Overdispersion is often encountered when fitting very simple parametric models, such as those based on the Poisson distribution. The Poisson distribution has one free parameter, and its variance is equal to its mean, so the variance cannot be adjusted independently of the mean. The choice of a distribution from the Poisson family is often dictated by the nature of the empirical data. For example, Poisson regression analysis is commonly used to model count data. If overdispersion is present, an alternative model with additional free parameters may provide a better fit. For count data, a Poisson mixture model such as the negative binomial distribution can be used instead: the mean of the Poisson distribution is itself treated as a random variable, drawn in this case from a gamma distribution, which introduces an additional free parameter (the resulting negative binomial distribution has two parameters).
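The effect of mixing the Poisson mean over a gamma distribution can be illustrated by simulation. The sketch below assumes hypothetical values for the mean (mu) and the gamma shape (shape), and uses a simple textbook Poisson sampler (Knuth's multiplication method) because the Python standard library has none. By the law of total variance, the mixture has variance mu + mu^2 / shape, so its variance-to-mean ratio should be near 1 + mu / shape rather than 1.

```python
import math
import random
from statistics import mean, variance

def poisson_draw(lam, rng):
    """Draw one Poisson(lam) variate using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k

rng = random.Random(0)   # fixed seed so the run is repeatable
mu, shape = 5.0, 2.0     # hypothetical mean and gamma shape, for illustration
n = 20000

# Plain Poisson: every observation shares the same mean.
plain = [poisson_draw(mu, rng) for _ in range(n)]

# Gamma-Poisson mixture: each observation has its own mean, drawn from a
# gamma distribution with mean mu (shape * scale = mu).
mixed = [poisson_draw(rng.gammavariate(shape, mu / shape), rng) for _ in range(n)]

plain_ratio = variance(plain) / mean(plain)
mixed_ratio = variance(mixed) / mean(mixed)
print(f"Poisson        variance/mean = {plain_ratio:.2f}")
print(f"gamma-Poisson  variance/mean = {mixed_ratio:.2f}")
```

Both samples have roughly the same mean; only the mixture shows the extra variance, which is what the second parameter of the negative binomial distribution is there to absorb.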
Binomial
As a more concrete example, it has been observed that the number of boys born to each family does not, as might be expected, conform faithfully to a binomial distribution. Instead, each family seems to skew the sex ratio of its children in favor of either boys or girls (see, for example, the Trivers–Willard hypothesis for one possible explanation). That is, there are too many all-boy families, too many all-girl families, and not enough families close to the population mean boy-to-girl ratio of 51:49, which yields an estimated variance larger than that predicted by the binomial model.
In this case, the beta-binomial model is a popular and analytically tractable alternative to the binomial that captures the overdispersion absent from the binomial model, thereby providing a better fit to the observed data. To capture the heterogeneity of the families, one can think of the p parameter (proportion of boys) in the binomial model as itself a random variable (i.e. a random effects model) drawn for each family from a beta distribution, which acts as the mixing distribution. The resulting compound distribution (beta-binomial) has an additional free parameter.
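The family example can be mimicked with a small simulation. The sketch below is illustrative only: the family size and the beta parameters a and b are hypothetical values chosen so that the mean proportion of boys is 0.51, and they are not estimates from any real data set. For a beta-binomial count with n trials, the variance is n p (1 - p) [1 + (n - 1) / (a + b + 1)] with p = a / (a + b), which exceeds the binomial variance n p (1 - p) whenever n > 1.

```python
import random
from statistics import mean, variance

rng = random.Random(1)    # fixed seed so the run is repeatable
n_children = 4            # hypothetical family size
a, b = 5.1, 4.9           # hypothetical beta parameters; mean p = 0.51
n_families = 20000

def boys_in_family(p):
    """Number of boys among n_children independent births with probability p."""
    return sum(rng.random() < p for _ in range(n_children))

# Binomial: every family shares the same p.
p_bar = a / (a + b)
binom = [boys_in_family(p_bar) for _ in range(n_families)]

# Beta-binomial: each family draws its own p from Beta(a, b).
beta_binom = [boys_in_family(rng.betavariate(a, b)) for _ in range(n_families)]

binom_var_theory = n_children * p_bar * (1 - p_bar)
bb_var_theory = binom_var_theory * (1 + (n_children - 1) / (a + b + 1))

print(f"binomial:      sample var = {variance(binom):.3f}, theory = {binom_var_theory:.3f}")
print(f"beta-binomial: sample var = {variance(beta_binom):.3f}, theory = {bb_var_theory:.3f}")
```

The two simulated populations have the same mean number of boys per family, but the beta-binomial one has more all-boy and all-girl families, which is the pattern described above.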
Another common model for overdispersion, when some of the observations are not Bernoulli, arises from introducing a normal random variable into a logistic model. Software is widely available for fitting this type of multilevel model. In this case, if the variance of the normal variable is zero, the model reduces to the classical (undispersed) logistic regression. This model has one additional free parameter, namely the variance of the normal variable.
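A simulation can show how the normal random effect inflates the variance of grouped binary outcomes. The sketch below is a data-generating illustration only, not a fitting procedure; the function name simulate_counts, the group size, the intercept, and the value of sigma are all hypothetical. Setting sigma to zero recovers ordinary binomial counts.

```python
import math
import random
from statistics import variance

def simulate_counts(sigma, n_groups=5000, n_trials=10, intercept=0.0, seed=2):
    """Simulate success counts per group under a logistic-normal model.

    Each group gets its own random effect u ~ Normal(0, sigma^2) on the
    logit scale, so its success probability is expit(intercept + u).
    """
    rng = random.Random(seed)
    counts = []
    for _ in range(n_groups):
        u = sigma * rng.gauss(0.0, 1.0)          # group-level random effect
        p = 1.0 / (1.0 + math.exp(-(intercept + u)))
        counts.append(sum(rng.random() < p for _ in range(n_trials)))
    return counts

var_plain = variance(simulate_counts(sigma=0.0))   # classical logistic model
var_disp = variance(simulate_counts(sigma=1.0))    # with normal random effect

print(f"sigma = 0: variance of counts = {var_plain:.2f}")
print(f"sigma = 1: variance of counts = {var_disp:.2f}")
```

With an intercept of zero and ten trials per group, the binomial variance is 10 x 0.5 x 0.5 = 2.5; the run with sigma = 1 should show a clearly larger value.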
For binomial random variables, the concept of overdispersion only makes sense if n > 1; overdispersion is not meaningful for Bernoulli random variables, whose variance is fully determined by the mean.
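The reason can be made explicit with a short calculation. The notation is assumed here for illustration: X is a single 0/1 observation, Y is a count of n trials that are independent given a shared success probability p, and p is random with mean \pi.

```latex
% Bernoulli case (n = 1): mixing over p leaves the variance unchanged.
X \mid p \sim \mathrm{Bernoulli}(p), \quad \mathbb{E}[p] = \pi
\;\Longrightarrow\;
\Pr(X = 1) = \pi, \quad \operatorname{Var}(X) = \pi(1 - \pi).

% Binomial case (n > 1): by the law of total variance,
\operatorname{Var}(Y)
  = \mathbb{E}\bigl[n p (1 - p)\bigr] + \operatorname{Var}(n p)
  = n \pi (1 - \pi) + n (n - 1) \operatorname{Var}(p).
```

The extra term n(n - 1) Var(p) vanishes when n = 1, so heterogeneity in p cannot be detected from individual Bernoulli outcomes alone.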
Differences in terminology between disciplines
Over- and underdispersion are terms which have been adopted in branches of the biological sciences. In parasitology, the term 'overdispersion' is generally used as defined here, meaning a distribution with a higher than expected variance.
In some areas of ecology, however, meanings have been transposed, so that overdispersion is actually taken to mean more even (lower variance) than expected. This confusion has caused some ecologists to suggest that the terms 'aggregated' or 'contagious' would be better used in ecology for 'overdispersed'. Such preferences are creeping into parasitology too. Generally this suggestion has not been heeded, and confusion persists in the literature.
Furthermore, in demography, overdispersion is often evident in the analysis of death count data, but demographers prefer the term 'unobserved heterogeneity'.
External links
- "Overdispersion" at Planet Math