Diversity index
A diversity index is a statistic
which is intended to measure the diversity among the members of a set consisting of various types of objects. Diversity indices can be used in many fields of study to assess the diversity of any population in which each member belongs to a unique group, type or species. For instance, they are used in ecology to measure biodiversity in an ecosystem, in demography to measure the distribution of the population across demographic groups, in economics to measure the distribution of economic activity over sectors in a region, and in information science to describe the complexity of a set of information.
In measuring human diversity, a diversity index of this kind gives the probability that two residents, chosen at random, are of different ethnicities. If all residents belong to the same ethnic group, the index is 0. If half belong to one group and half to another, it is 0.50.
Below, a series of diversity indices is discussed.
Species richness
The species richness $S$ is simply the number of species present in an ecosystem. This index makes no use of relative abundances. In practice, measuring the total species richness in an ecosystem is impossible, except in very depauperate systems. The observed number of species in the system is a biased estimator of the true species richness, and it increases non-linearly with sampling effort. Thus $S$, when it denotes the observed species richness in an ecosystem, is usually referred to as species density.
Species evenness
The species evenness describes how evenly individuals are distributed among the species, that is, how close the relative abundances of the species are to one another.
Concentration ratio
The concentration ratio is a crude indicator of the extent to which a few groups, such as species, demographic groups or companies, dominate an environment: it is the total share taken by the top n species or firms. By itself, however, the concentration ratio does not indicate how that share is divided among those top n firms or species.
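As a minimal sketch of this definition, the following Python snippet sums the n largest shares. The function name `concentration_ratio` and the two market share lists are hypothetical, chosen only to illustrate that two very different splits at the top can produce the same ratio.

```python
def concentration_ratio(shares, n):
    """Total share held by the n largest groups.

    Assumes `shares` are fractions of the whole that sum to 1.
    """
    return sum(sorted(shares, reverse=True)[:n])


# Two hypothetical markets: the top two firms hold the same total share,
# but that share is split very differently between them.
market_a = [0.40, 0.40, 0.10, 0.10]
market_b = [0.70, 0.10, 0.10, 0.10]

cr2_a = concentration_ratio(market_a, 2)
cr2_b = concentration_ratio(market_b, 2)
print(cr2_a, cr2_b)  # both are 0.8 up to floating-point rounding
```

The ratio cannot tell these two markets apart, which is why indices based on all the shares are often preferred.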
Simpson's diversity index
If $p_i$ is the fraction of all organisms which belong to the i-th species, then Simpson's diversity index is most commonly defined as the statistic
$$D = \sum_{i=1}^{S} p_i^2.$$
This quantity was introduced by Edward Hugh Simpson in 1949. The Herfindahl index
in competition economics is essentially the same.
If $n_i$ is the number of individuals of species $i$ which are counted, and $n$ is the total number of all individuals counted, then
$$\hat{D} = \frac{\sum_{i=1}^{S} n_i (n_i - 1)}{n (n - 1)}$$
is an estimator for Simpson's index for sampling without replacement.
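The two quantities can be sketched in Python as follows, assuming the usual forms: the index is the sum of squared proportions, and the estimator is the sum of count × (count − 1) over all species, divided by total × (total − 1). The function names and the sample counts are illustrative assumptions, not taken from any particular library or data set.

```python
def simpson_index(proportions):
    """Simpson's index: sum of squared proportions."""
    return sum(p * p for p in proportions)


def simpson_estimator(counts):
    """Estimator from raw counts for sampling without replacement."""
    total = sum(counts)
    if total < 2:
        raise ValueError("need at least two individuals")
    return sum(c * (c - 1) for c in counts) / (total * (total - 1))


# Hypothetical sample: 10 individuals from three species.
counts = [6, 3, 1]
total = sum(counts)

plug_in = simpson_index([c / total for c in counts])  # about 0.46
estimate = simpson_estimator(counts)                  # 36 / 90 = 0.4
print(plug_in, estimate)
```

For small samples the two values differ noticeably; the gap shrinks as the number of individuals counted grows.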
Note that Simpson's index $D$ satisfies $0 \le D \le 1$, with values near zero corresponding to highly diverse or heterogeneous ecosystems and values near one corresponding to more homogeneous ecosystems. Biologists who find this confusing sometimes use $1/D$ instead; confusingly, this reciprocal quantity is also called Simpson's index. Another response is to redefine Simpson's index as
$$\tilde{D} = 1 - \sum_{i=1}^{S} p_i^2.$$
Statisticians call this quantity the index of diversity.
In sociology, psychology and management studies the index is often known as Blau's Index, as it was introduced into the literature by the sociologist Peter Blau.
In economics essentially the same quantity is called the Hirschman-Herfindahl index (HHI), defined as the sum of the squares of the shares in the population across groups (with $E_i$ as the size of group $i$, that is, its number of employees or specimens):
$$\mathrm{HHI} = \sum_{i} \left( \frac{E_i}{\sum_{j} E_j} \right)^2.$$
Note that the HHI is also used within sectors, to measure competition.
The index of diversity (also referred to as the Index of Variability) is a measure commonly used in demographic research to determine the variation in categorical data.
Gibbs and Martin defined Simpson's diversity index for use in sociology as:
$$D = 1 - \sum_{i=1}^{N} p_i^2$$
where
- $p_i$ = proportion of individuals or objects in category $i$
- $N$ = number of categories.
A perfectly homogeneous population would have a diversity index score of 0. A perfectly heterogeneous population would have a score approaching 1, a value reached only in the limit of infinitely many categories with equal representation in each. With $N$ equally represented categories the maximum score is $1 - 1/N$, so the maximum increases with the number of categories (e.g., 4 categories at 25% each give 0.75, and 5 categories at 20% each give 0.8).
An example of the use of the index of diversity would be a measure of racial diversity in a city. Thus, if Sunflower City was 85% white and 15% black, the index of diversity would be $1 - (0.85^2 + 0.15^2) = 0.255$.
The interpretation of the diversity index score would be that the population of Sunflower City is not very heterogeneous but is also not homogeneous.
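The arithmetic behind these figures can be checked with a short Python sketch. It assumes the index is one minus the sum of squared category proportions; the function name `index_of_diversity` is a hypothetical label for illustration.

```python
def index_of_diversity(proportions):
    """One minus the sum of squared category proportions."""
    return 1.0 - sum(p * p for p in proportions)


sunflower_city = index_of_diversity([0.85, 0.15])  # approximately 0.255
max_four = index_of_diversity([0.25] * 4)          # approximately 0.75
max_five = index_of_diversity([0.20] * 5)          # approximately 0.8
homogeneous = index_of_diversity([1.0])            # 0.0

print(sunflower_city, max_four, max_five, homogeneous)
```

The score for Sunflower City sits roughly halfway between 0 and the two-category maximum of 0.5, which matches the reading that the city is neither very heterogeneous nor homogeneous.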
Shannon's diversity index
Shannon's diversity index is simply the ecologist's name for the communication entropy introduced by Claude Shannon:
$$H' = -\sum_{i=1}^{S} p_i \log p_i$$
where $p_i$ is the fraction of individuals belonging to the i-th species. This is by far the most widely used diversity index. The intuitive significance of this index can be described as follows. Suppose we devise binary codewords for each species in our ecosystem, with short codewords used for the most abundant species, and longer codewords for rare species. As we walk around and observe individual organisms, we call out the corresponding codeword. This gives a binary sequence. If we have used an efficient code, we will be able to save some breath by calling out a shorter sequence than would otherwise be the case. If so, the average codeword length we call out as we wander around will be close to the Shannon diversity index (measured in bits, that is, with the logarithm taken to base 2).
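The codeword picture can be illustrated with a small Python sketch. It assumes the index is minus the sum of p × log p with the logarithm taken to base 2; the three-species community and its codewords are hypothetical values chosen so that the code is exactly efficient.

```python
import math


def shannon_index(proportions, base=2.0):
    """Shannon entropy of a list of proportions (bits when base is 2)."""
    return -sum(p * math.log(p, base) for p in proportions if p > 0)


# Hypothetical community: one common species and two rarer ones.
proportions = [0.5, 0.25, 0.25]
codewords = ["0", "10", "11"]  # shortest codeword for the commonest species

entropy_bits = shannon_index(proportions)
average_length = sum(p * len(c) for p, c in zip(proportions, codewords))

print(entropy_bits, average_length)  # both 1.5, up to floating-point rounding
```

Here the average codeword length equals the index because every proportion is a power of one half; for other proportions an efficient code only comes close to it.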
It is possible to write down estimators which attempt to correct for bias in finite sample sizes, but this would be misleading since communication entropy does not really fit expectations based upon parametric statistics. Differences arising from using two different estimators are likely to be overwhelmed by errors arising from other sources. Current best practice tends to use bootstrapping
procedures to estimate communication entropy.
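One way such a procedure might look is sketched below in Python. This is only an illustration under stated assumptions: it uses a simple percentile bootstrap that resamples the observed individuals with replacement, and the function names, the number of resamples, the 95% interval and the sample of tree species are all hypothetical choices rather than an established recipe.

```python
import math
import random
from collections import Counter


def shannon_bits(labels):
    """Plug-in Shannon entropy (in bits) of a list of species labels."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def bootstrap_shannon(labels, n_resamples=1000, seed=0):
    """Approximate 95% percentile interval from resampling with replacement."""
    rng = random.Random(seed)
    estimates = sorted(
        shannon_bits(rng.choices(labels, k=len(labels)))
        for _ in range(n_resamples)
    )
    low = estimates[int(0.025 * n_resamples)]
    high = estimates[int(0.975 * n_resamples) - 1]
    return low, high


# Hypothetical sample of 100 observed individuals.
sample = ["oak"] * 50 + ["pine"] * 30 + ["birch"] * 15 + ["yew"] * 5

point = shannon_bits(sample)
low, high = bootstrap_shannon(sample)
print(point, low, high)
```

The width of the interval gives a rough sense of how much the estimate depends on the particular individuals that happened to be sampled.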
Shannon himself showed that his communication entropy enjoys some powerful formal properties, and furthermore, it is the unique quantity which does so. These observations are the foundation of its interpretation as a measure of statistical diversity (or "surprise", in the arena of communications). The applications of this quantity go far beyond the one discussed here; see the textbook cited below for an elementary survey of the extraordinary richness of modern information theory.
Berger-Parker index
The Berger-Parker diversity index is simply
$$\max_{i} p_i,$$
the proportion of individuals belonging to the most abundant species. This is an example of an index which uses only partial information about the relative abundances of the various species in its definition.
Rényi entropy
Species richness, the Shannon index, Simpson's index, and the Berger-Parker index can all be identified as particular examples of quantities bearing a simple relation to the Rényi entropy,
$$H_\alpha = \frac{1}{1-\alpha} \log \sum_{i=1}^{S} p_i^\alpha,$$
for $\alpha$ approaching $0$, $1$, $2$ and $\infty$ respectively.
Unfortunately, the powerful formal properties of communication entropy do not generalize to Rényi entropy, which largely explains the much greater power and popularity of Shannon's index with respect to its competitors.
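The relation between the Rényi entropy and the four indices can be checked numerically with the Python sketch below. It assumes the standard definition, log of the sum of p to the power α, divided by (1 − α), with natural logarithms, and it treats α = 1 and α = ∞ as the corresponding limits. The function name and the three-species proportions are hypothetical.

```python
import math


def renyi_entropy(proportions, alpha):
    """Rényi entropy of order alpha (natural logarithm)."""
    ps = [p for p in proportions if p > 0]
    if alpha == 1:
        # Limit alpha -> 1: the Shannon entropy.
        return -sum(p * math.log(p) for p in ps)
    if math.isinf(alpha):
        # Limit alpha -> infinity: minus the log of the largest proportion.
        return -math.log(max(ps))
    return math.log(sum(p ** alpha for p in ps)) / (1 - alpha)


ps = [0.5, 0.3, 0.2]  # hypothetical relative abundances

richness = math.exp(renyi_entropy(ps, 0))                # about 3 species
shannon = renyi_entropy(ps, 1)                           # Shannon index
simpson = math.exp(-renyi_entropy(ps, 2))                # about 0.38
berger_parker = math.exp(-renyi_entropy(ps, math.inf))   # about 0.5

print(richness, shannon, simpson, berger_parker)
```

Each index is thus a simple transformation (an exponential, possibly with a sign change) of the Rényi entropy at one particular order.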
Income inequality
Related to diversity indices are many income inequality indices, such as the Gini index and the Theil index. Generally these measure a lack of diversity, so they differ from the measures mentioned above essentially by a minus sign.
The Theil index in particular is the maximum possible diversity $\log N$, where $N$ is the number of categories, minus Shannon's diversity index; that is, it is the maximum possible entropy of the data minus the observed entropy. The Theil index is called redundancy in information theory.
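Taking the description above at face value, the redundancy can be sketched in Python as the log of the number of categories minus the observed entropy. The function name `theil_redundancy` and the example shares are assumptions for illustration; applications to income data involve further conventions that are not covered here.

```python
import math


def theil_redundancy(proportions):
    """Maximum possible entropy log(N) minus the observed Shannon entropy."""
    n_categories = len(proportions)
    observed = -sum(p * math.log(p) for p in proportions if p > 0)
    return math.log(n_categories) - observed


even = theil_redundancy([0.25, 0.25, 0.25, 0.25])     # about 0: maximal diversity
skewed = theil_redundancy([0.85, 0.05, 0.05, 0.05])   # between 0 and log(4)
single = theil_redundancy([1.0, 0.0, 0.0, 0.0])       # log(4): no diversity

print(even, skewed, single)
```

The value grows as the distribution moves away from equal shares, which is the sense in which the index measures a lack of diversity.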
See also
- Alpha diversity
- Qualitative variation
- Shannon index
- Isolation index
Further reading
See chapter 5 for an elaboration of coding procedures described informally above.
- Chao, A.; Shen, T-J. (2003) "Nonparametric estimation of Shannon's index of diversity when there are unseen species in sample", Environmental and Ecological Statistics, 10 (4), 429-443
External links
- Simpson's Diversity index
- Diversity indices gives some examples of estimates of Simpson's index for real ecosystems.