Count data
In statistics, count data is data in which the observations can take only the non-negative integer values {0, 1, 2, 3, ...}, and where these integers arise from counting rather than ranking. The statistical treatment of count data is distinct from that of binary data, in which the observations can take only two values, usually represented by 0 and 1.
Introduction
Statistical analyses involving count data can take several forms depending on the context in which the data arise:
- simple counts, such as the number of occurrences of thunderstorms in a calendar year, observed for several years;
- categorical data, in which the counts represent the numbers of items falling into each of several categories.
The latter are treated separately, as different methodologies apply, and the following applies to simple counts.
Analysing simple count data alone
The Poisson, binomial and negative binomial distributions are commonly used to represent the distributions of count data when these are treated as random variables.
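As a rough illustration of how these three distributions differ, one common heuristic compares the sample variance with the sample mean: the two are equal in expectation under a Poisson model, the variance is smaller under a binomial model, and it is larger under a negative binomial model. The sketch below computes that ratio using only the Python standard library. The function name `dispersion_index` and the thunderstorm counts are hypothetical, invented for illustration, and the ratio is a rough guide only, not a formal test.

```python
from statistics import mean, variance


def dispersion_index(counts):
    """Return the sample variance divided by the sample mean."""
    m = mean(counts)
    if m == 0:
        raise ValueError("dispersion index is undefined when every count is zero")
    return variance(counts) / m


# Hypothetical yearly thunderstorm counts, invented for illustration only.
storms = [3, 5, 2, 4, 6, 3, 4, 5, 1, 4]
index = dispersion_index(storms)

# Values near 1 are compatible with a Poisson model; values well below 1
# point towards a binomial model and values well above 1 towards a
# negative binomial model.
print(f"variance / mean = {index:.3f}")
```

With so few observations the ratio is noisy, so it is better read as a prompt for further checks than as a decision rule.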
Graphical examination of count data may be aided by the use of data transformations chosen to have the property of stabilising the sample variance. In particular, the square root transformation might be used when data can be approximated by a Poisson distribution (although other transformations have modestly improved properties), while an inverse sine transformation is available when a binomial distribution is preferred.
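The variance-stabilising effect of the square root can be checked numerically. For a Poisson variable the variance equals the rate, so it grows with the mean, whereas a first-order (delta method) argument gives a variance of about 1/4 for the square root, whatever the rate. The sketch below evaluates both variances exactly by summing over the Poisson probability mass function. The helper names, the truncation rule and the chosen rates are assumptions made for this illustration.

```python
import math


def poisson_pmf(k, lam):
    """Probability that a Poisson(lam) variable equals k, computed in log space."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))


def variance_of_transformed(lam, transform, tail=12.0):
    """Variance of transform(X) for X ~ Poisson(lam), by direct summation.

    The sum is truncated well above the mean, where the remaining
    probability mass is negligible for the rates used here.
    """
    upper = int(lam + tail * math.sqrt(lam)) + 20
    first = second = 0.0
    for k in range(upper + 1):
        p = poisson_pmf(k, lam)
        y = transform(k)
        first += p * y
        second += p * y * y
    return second - first * first


rates = [5.0, 20.0, 80.0]
raw_variances = [variance_of_transformed(lam, float) for lam in rates]
sqrt_variances = [variance_of_transformed(lam, math.sqrt) for lam in rates]

for lam, raw, root in zip(rates, raw_variances, sqrt_variances):
    print(f"rate {lam:5.1f}: Var(X) = {raw:7.3f}, Var(sqrt(X)) = {root:.3f}")
```

The raw variances track the rate, while the transformed variances stay close to one quarter, which is why plots of square-rooted counts show a more even spread across groups with different means.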
Relating count data to other variables
Here the count data would be treated as a dependent variable. Statistical methods such as least squares and analysis of variance are designed to deal with continuous dependent variables. These can be adapted to deal with count data by using data transformations such as the square root transformation, but such methods have several drawbacks; they are approximate at best and estimate parameters that are often hard to interpret.
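The transformation approach described above can be sketched as an ordinary least-squares fit on the square-root scale. The data, the single covariate and the function name `least_squares_line` are hypothetical, chosen only to show where the interpretation problem comes from: the fitted slope is a change in the square root of the count per unit of the covariate, not a change in the count itself.

```python
import math


def least_squares_line(x, y):
    """Ordinary least-squares intercept and slope for a single predictor."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Hypothetical counts observed at increasing values of a covariate.
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
counts = [1, 2, 2, 4, 5, 9, 12, 20]

# Fit the line on the square-root scale rather than on the counts themselves.
roots = [math.sqrt(c) for c in counts]
intercept, slope = least_squares_line(x, roots)

# Squaring maps the fitted values back to the count scale, but the square
# of a fitted mean root is not the same thing as a fitted mean count.
fitted = [(intercept + slope * xi) ** 2 for xi in x]
```

The back-transformed values are usable for plotting, but they are not estimates of the expected count, which is one concrete form of the approximation the text warns about.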
The Poisson distribution can form the basis for some analyses of count data and in this case Poisson regression may be used. This is a special case of the class of generalized linear models which also contains specific forms of model capable of using the binomial distribution (binomial regression, logistic regression) or the negative binomial distribution where the assumptions of the Poisson model are violated, in particular when the range of count values is limited or when overdispersion is present.
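A minimal sketch of Poisson regression with a log link and a single covariate follows, fitted by Newton-Raphson on the Poisson log-likelihood and using only the standard library. The data and both function names are hypothetical. In practice a statistics package would normally be used, and this sketch omits convergence checks and standard errors. The second function computes the Pearson chi-square statistic divided by its residual degrees of freedom, a commonly used informal check for overdispersion.

```python
import math


def fit_poisson_regression(x, y, iterations=25):
    """Fit log E[y] = b0 + b1 * x by Newton-Raphson on the Poisson log-likelihood."""
    # Start at the intercept-only fit, which keeps the first steps stable.
    b0, b1 = math.log(sum(y) / len(y)), 0.0
    for _ in range(iterations):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        # Score vector (gradient of the log-likelihood).
        g0 = sum(yi - mi for yi, mi in zip(y, mu))
        g1 = sum((yi - mi) * xi for xi, yi, mi in zip(x, y, mu))
        # Entries of the 2x2 Fisher information matrix.
        h00 = sum(mu)
        h01 = sum(mi * xi for xi, mi in zip(x, mu))
        h11 = sum(mi * xi * xi for xi, mi in zip(x, mu))
        det = h00 * h11 - h01 * h01
        # Newton step: solve information * step = score for the two coefficients.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


def pearson_dispersion(x, y, b0, b1):
    """Pearson chi-square statistic divided by residual degrees of freedom."""
    mu = [math.exp(b0 + b1 * xi) for xi in x]
    chi2 = sum((yi - mi) ** 2 / mi for yi, mi in zip(y, mu))
    return chi2 / (len(y) - 2)


# Hypothetical data, invented for illustration: counts that tend to rise with x.
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
y = [1, 2, 2, 4, 5, 9, 12, 20]

b0, b1 = fit_poisson_regression(x, y)
dispersion = pearson_dispersion(x, y, b0, b1)
fitted_means = [math.exp(b0 + b1 * xi) for xi in x]
```

Unlike the transformation approach, the slope here has a direct reading: each unit increase in x multiplies the expected count by exp(b1). A dispersion value well above 1 would suggest that the Poisson variance assumption is too restrictive, which is the situation in which the text points to the negative binomial model.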
Further reading
- Cameron, A.C. and P.K. Trivedi (1998). Regression analysis of count data, Cambridge University Press. ISBN 0-521-63201-3