Compositional data
In statistics, compositional data are quantitative descriptions of the parts of some whole, conveying exclusively relative information.
This definition, given by John Aitchison (1986), has several consequences:
- A compositional data point, or composition for short, can be represented by a positive real vector with one component for each part considered. Sometimes, if the total amount is fixed and known, one component of the vector can be omitted.
- As compositions carry only relative information, all of that information is contained in the ratios between components. Consequently, a composition multiplied by any positive constant contains the same information as the original. Therefore, proportional positive vectors are equivalent when considered as compositions.
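- As usual in mathematics, equivalence classes are represented by some element of the class, called a representative. Thus, equivalent compositions can be represented by positive vectors whose components add to a given constant $\kappa$. The vector operation assigning the constant-sum representative is called closure and is denoted by $\mathcal{C}[\cdot]$:
$$\mathcal{C}[x_1, x_2, \dots, x_D] = \left[ \frac{\kappa \, x_1}{\sum_{i=1}^{D} x_i}, \frac{\kappa \, x_2}{\sum_{i=1}^{D} x_i}, \dots, \frac{\kappa \, x_D}{\sum_{i=1}^{D} x_i} \right],$$
where D is the number of parts (components) and $[\cdot]$ denotes a row vector.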
- Compositional data can be represented by constant-sum real vectors with positive components, and these vectors span a simplex, defined as
$$\mathcal{S}^D = \left\{ \mathbf{x} = [x_1, x_2, \dots, x_D] \in \mathbb{R}^D \;\middle|\; x_i > 0,\ i = 1, 2, \dots, D;\ \sum_{i=1}^{D} x_i = \kappa \right\}.$$
This is the reason why $\mathcal{S}^D$ is considered to be the sample space of compositional data. The positive constant $\kappa$ is arbitrary. Frequent values for $\kappa$ are 1 (per unit), 100 (percent, %), 1000, $10^6$ (ppm), $10^9$ (ppb), ...
- In statistics, compositional data is frequently considered to be data in which each data point is a D-tuple of nonnegative numbers whose sum is 1. Typically each of the D components $x_i$ of a data point $[x_1, \dots, x_D]$ says what proportion (or "percentage") of a statistical unit falls into the ith category in a list of D categories. Very often ternary plots are used in the analysis of compositional data to represent a three-part composition.
- An alternative name for compositional analysis is simplicial analysis, motivated by the concept of simplicial sets.
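The closure operation and the equivalence of proportional vectors described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `closure` and the parameter name `kappa` are choices made here, and the input values reuse the rock example given later in this article.

```python
from typing import List, Sequence


def closure(x: Sequence[float], kappa: float = 1.0) -> List[float]:
    """Rescale a strictly positive vector so its components sum to kappa."""
    if any(v <= 0 for v in x):
        raise ValueError("all components must be strictly positive")
    total = sum(x)
    return [kappa * v / total for v in x]


# Proportional vectors describe the same composition,
# so closure maps them to the same representative.
a = closure([1.0, 3.0, 6.0])
b = closure([10.0, 30.0, 60.0])

# The same composition expressed in percent (kappa = 100).
percent = closure([1.0, 3.0, 6.0], kappa=100.0)
```

Here `a` and `b` agree up to floating-point rounding, which is the sense in which the two input vectors are equivalent as compositions. Changing `kappa` changes only the units of the representative, not the ratios between its components.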
Remarks on the definition of the simplex:
- In mathematical frameworks, the superscript of the simplex symbol $\mathcal{S}^D$, which counts the number of parts, is often changed to D − 1, describing the dimension.
- The components of the vector are assumed to be positive. However, in some definitions of the simplex, non-negative components are admitted. Here zero components are excluded, because ratios involving a zero component are meaningless.
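The membership condition for the simplex, with strictly positive components as in the remarks above, can be checked as follows. This is a sketch under stated assumptions: the name `in_simplex` and the tolerance `tol` are hypothetical, and the tolerance exists only to absorb floating-point rounding in the sum.

```python
import math
from typing import Sequence


def in_simplex(x: Sequence[float], kappa: float = 1.0, tol: float = 1e-9) -> bool:
    """True if every component is strictly positive and the sum equals kappa."""
    return all(v > 0 for v in x) and math.isclose(sum(x), kappa, rel_tol=tol)


rock = [0.1, 0.3, 0.6]        # strictly positive, sums to 1
with_zero = [0.5, 0.5, 0.0]   # sums to 1 but has a zero component

rock_ok = in_simplex(rock)            # True
with_zero_ok = in_simplex(with_zero)  # False under the strict definition
```

A definition that admits non-negative components would accept `with_zero`. The strict version rejects it because the ratio of any part to the third part is then undefined.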
Examples
- Each data point may correspond to a rock composed of three different minerals; a rock of which 10% is the first mineral, 30% is the second, and the remaining 60% is the third would correspond to the triple [0.1, 0.3, 0.6]; a data set would contain one such triple for each rock in a sample of rocks.
- Each data point may correspond to a town; a town in which 35% of the people are Christians, 55% are Muslims, 6% are Jews, and the remaining 4% are others would correspond to the quadruple [0.35, 0.55, 0.06, 0.04]; a data set would correspond to a list of towns.
- In chemistry, compositions can be expressed as molar concentrations of each component. As the sum of all concentrations is not determined, the whole composition of D parts is needed and is thus expressed as a vector of D molar concentrations. These compositions can be translated into weight percent by multiplying each component by the appropriate constant (its molar mass) and then closing the result to 100.
- In a survey, the proportions of respondents answering positively to different items can be expressed as percentages. As the total amount is fixed at 100, the compositional vector of D components can be defined using only D − 1 components, assuming that the remaining component is the percentage needed for the whole vector to add to 100.
- In probability and statistics, a partition of the sample space into disjoint events is described by the probabilities assigned to those events. The vector of D probabilities can be considered as a composition of D parts. As they add to one, one probability can be omitted and the composition is still completely determined.
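Two of the examples above lend themselves to a short Python sketch: recovering an omitted component when the total is known, and converting molar amounts to weight percent. The function names `complete_composition` and `molar_to_weight_percent` are hypothetical, and the molar masses used for water and ethanol are approximate values chosen purely for illustration.

```python
from typing import List, Sequence


def complete_composition(partial: Sequence[float], kappa: float = 1.0) -> List[float]:
    """Append the omitted last part so that the full vector sums to kappa."""
    last = kappa - sum(partial)
    if last <= 0 or any(v <= 0 for v in partial):
        raise ValueError("parts must be positive and sum to less than kappa")
    return list(partial) + [last]


def molar_to_weight_percent(
    molar: Sequence[float], molar_masses: Sequence[float]
) -> List[float]:
    """Scale each molar amount by its molar mass, then close to 100."""
    masses = [c * m for c, m in zip(molar, molar_masses)]
    total = sum(masses)
    return [100.0 * w / total for w in masses]


# Town example: three percentages are given, the fourth is implied.
town = complete_composition([35.0, 55.0, 6.0], kappa=100.0)
# town == [35.0, 55.0, 6.0, 4.0]

# Equimolar water/ethanol mixture; molar masses in g/mol are approximate.
weight_percent = molar_to_weight_percent([1.0, 1.0], [18.015, 46.069])
```

The second function shows why all D molar concentrations are needed: the per-component scaling changes the total, so the result has to be closed again before it can be read as percentages.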