Quantities of information
The mathematical theory of information is based on probability theory and statistics, and measures information with several quantities of information. The choice of logarithmic base in the following formulae determines the unit of information entropy that is used. The most common unit of information is the bit, based on the binary logarithm. Other units include the nat, based on the natural logarithm, and the hartley, based on the base-10 or common logarithm.
In what follows, an expression of the form $p \log p$ is considered by convention to be equal to zero whenever $p$ is zero. This is justified because $\lim_{p \rightarrow 0^+} p \log p = 0$ for any logarithmic base.
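To make the choice of unit concrete, here is a minimal Python sketch (an illustration added here, not part of the original article) showing that the same information content expressed in bits, nats, and hartleys differs only by a constant factor determined by the logarithm base:

import math

p = 0.125  # probability of some hypothetical message

bits = -math.log2(p)       # binary logarithm  -> bits
nats = -math.log(p)        # natural logarithm -> nats
hartleys = -math.log10(p)  # common logarithm  -> hartleys

print(bits, nats, hartleys)  # 3.0, about 2.079, about 0.903
print(nats / bits)           # ln(2), about 0.693: the constant conversion factor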
or "surprisal" of a message m:
where is the probability that message m is chosen from all possible choices in the message space . The base of the logarithm only affects a scaling factor and, consequently, the units in which the measured information content is expressed. If the logarithm is base 2, the measure of information is expressed in units of bit
s.
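As a minimal numerical sketch (assuming base-2 logarithms, so results are in bits), the self-information of a message can be computed directly from its probability:

import math

def self_information(p, base=2):
    """Self-information I(m) = -log(p(m)); in bits by default."""
    if p <= 0 or p > 1:
        raise ValueError("p must lie in (0, 1]")
    return -math.log(p, base)

# A certain message carries no information; a rare one carries more.
print(self_information(1.0))   # 0.0 bits
print(self_information(0.5))   # 1.0 bit
print(self_information(0.01))  # about 6.64 bits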
Information is transferred from a source to a recipient only if the recipient did not already have the information to begin with. Messages that convey information that is certain to happen and already known by the recipient contain no real information. Infrequently occurring messages contain more information than more frequently occurring messages. This fact is reflected in the above equation: a certain message, i.e. one of probability 1, has an information measure of zero. In addition, a compound message of two (or more) unrelated (or mutually independent) messages has a quantity of information that is the sum of the measures of information of each message individually. That fact is also reflected in the above equation, supporting the validity of its derivation.
An example: The weather forecast broadcast is: "Tonight's forecast: Dark. Continued darkness until widely scattered light in the morning." This message contains almost no information. However, a forecast of a snowstorm would certainly contain information, since snowstorms do not happen every evening. There would be an even greater amount of information in an accurate forecast of snow for a warm location, such as Miami.
Entropy
The entropy of a discrete message space $M$ is a measure of the amount of uncertainty one has about which message will be chosen. It is defined as the average self-information of a message $m$ from that message space:
$$H(M) = \mathbb{E}\left[I(M)\right] = \sum_{m \in M} p(m)\, I(m) = -\sum_{m \in M} p(m) \log p(m)$$
where $\mathbb{E}[\cdot]$ denotes the expected value operation.
An important property of entropy is that it is maximized when all the messages in the message space are equiprobable (e.g. $p(m) = 1/|M|$). In this case $H(M) = \log |M|$.
Sometimes the function $H$ is expressed in terms of the probabilities of the distribution:
$$H(p_1, p_2, \ldots, p_k) = -\sum_{i=1}^{k} p_i \log p_i,$$
where each $p_i \geq 0$ and $\sum_{i=1}^{k} p_i = 1$.
An important special case of this is the binary entropy function:
$$H_\mathrm{b}(p) = H(p, 1-p) = -p \log p - (1-p) \log (1-p)$$
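For illustration, a short Python sketch (assuming base-2 logarithms and the 0 log 0 = 0 convention stated above) that computes the entropy of a discrete distribution and the binary entropy function:

import math

def entropy(probs, base=2):
    """Shannon entropy H(p) = -sum(p_i log p_i), with 0 log 0 taken as 0."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

def binary_entropy(p, base=2):
    """Binary entropy function H_b(p) = H(p, 1 - p)."""
    return entropy([p, 1.0 - p], base)

# Entropy is maximized by the uniform distribution: here H = log2(4) = 2 bits.
print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0
print(entropy([0.7, 0.1, 0.1, 0.1]))      # about 1.357
print(binary_entropy(0.5))                # 1.0, the maximum of the binary entropy function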
Joint entropy
The joint entropy of two discrete random variables $X$ and $Y$ is defined as the entropy of the joint distribution of $X$ and $Y$:
$$H(X, Y) = \mathbb{E}_{X,Y}\left[-\log p(x, y)\right] = -\sum_{x, y} p(x, y) \log p(x, y)$$
If $X$ and $Y$ are independent, then the joint entropy is simply the sum of their individual entropies.
(Note: The joint entropy should not be confused with the cross entropy, despite similar notations.)
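A small sketch (with a hypothetical joint table and base-2 logarithms) of the joint entropy of two discrete variables, illustrating the additivity for independent variables stated above:

import math

def joint_entropy(joint):
    """H(X, Y) = -sum over (x, y) of p(x, y) log2 p(x, y), for a dict of joint probabilities."""
    return -sum(p * math.log2(p) for p in joint.values() if p > 0)

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Independent X and Y with marginals (0.5, 0.5) and (0.25, 0.75):
joint = {(x, y): px * py
         for x, px in enumerate([0.5, 0.5])
         for y, py in enumerate([0.25, 0.75])}
print(joint_entropy(joint))                         # about 1.811
print(entropy([0.5, 0.5]) + entropy([0.25, 0.75]))  # the same: H(X) + H(Y)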
Conditional entropy (equivocation)
Given a particular value $y$ of a random variable $Y$, the conditional entropy of $X$ given $Y = y$ is defined as:
$$H(X|y) = \mathbb{E}_{X|Y}\left[-\log p(x|y)\right] = -\sum_{x} p(x|y) \log p(x|y)$$
where $p(x|y) = \frac{p(x, y)}{p(y)}$ is the conditional probability of $x$ given $y$.
The conditional entropy of $X$ given $Y$, also called the equivocation of $X$ about $Y$, is then given by:
$$H(X|Y) = \mathbb{E}_Y\left[H(X|y)\right] = -\sum_{y} p(y) \sum_{x} p(x|y) \log p(x|y) = \sum_{x, y} p(x, y) \log \frac{p(y)}{p(x, y)}$$
A basic property of the conditional entropy is that:
$$H(X|Y) = H(X, Y) - H(Y).$$
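The property above can be checked numerically; this sketch (again with a hypothetical joint table and base-2 logarithms) computes H(X|Y) as H(X, Y) - H(Y):

import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

def conditional_entropy(joint):
    """H(X|Y) = H(X, Y) - H(Y) for a dict {(x, y): p(x, y)}."""
    h_xy = entropy(joint.values())
    p_y = {}
    for (x, y), p in joint.items():
        p_y[y] = p_y.get(y, 0.0) + p
    return h_xy - entropy(p_y.values())

# A strongly dependent pair: knowing Y leaves little uncertainty about X.
joint = {(0, 0): 0.45, (1, 0): 0.05, (0, 1): 0.05, (1, 1): 0.45}
print(conditional_entropy(joint))  # about 0.47 bits, well below H(X) = 1 bit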
Kullback–Leibler divergence (information gain)
The Kullback–Leibler divergence (or information divergence, information gain, or relative entropy) is a way of comparing two distributions: a "true" probability distribution $p$, and an arbitrary probability distribution $q$. If we compress data in a manner that assumes $q$ is the distribution underlying some data, when, in reality, $p$ is the correct distribution, the Kullback–Leibler divergence is the average number of additional bits per datum necessary for compression, or, mathematically,
$$D_{\mathrm{KL}}(p \,\|\, q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)}.$$
It is in some sense the "distance" from $q$ to $p$, although it is not a true metric because it is not symmetric.
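A brief Python sketch (hypothetical distributions, base-2 logarithms) of the Kullback–Leibler divergence, showing the coding-penalty interpretation and the lack of symmetry:

import math

def kl_divergence(p, q):
    """D_KL(p || q) = sum(p_i log2(p_i / q_i)) in bits; assumes q_i > 0 wherever p_i > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.25]        # the "true" distribution
q = [1/3, 1/3, 1/3]          # the distribution assumed when coding
print(kl_divergence(p, q))   # about 0.085: extra bits per datum paid for assuming q
print(kl_divergence(p, p))   # 0.0: no penalty when the assumed distribution is correct
print(kl_divergence(q, p))   # about 0.082: differs from D_KL(p || q), so not symmetric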
Mutual information (transinformation)
It turns out that one of the most useful and important measures of information is the mutual information, or transinformation. This is a measure of how much information can be obtained about one random variable by observing another. The mutual information of $X$ relative to $Y$ (which represents conceptually the average amount of information about $X$ that can be gained by observing $Y$) is given by:
$$I(X; Y) = \sum_{x, y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}$$
A basic property of the mutual information is that:
$$I(X; Y) = H(X) - H(X|Y).$$
That is, knowing $Y$, we can save an average of $I(X; Y)$ bits in encoding $X$ compared to not knowing $Y$. Mutual information is symmetric:
$$I(X; Y) = I(Y; X) = H(X) + H(Y) - H(X, Y).$$
Mutual information can be expressed as the average Kullback–Leibler divergence (information gain) of the posterior probability distribution of $X$ given the value of $Y$, relative to the prior distribution on $X$:
$$I(X; Y) = \mathbb{E}_{p(y)}\left[D_{\mathrm{KL}}\big(p(X|Y=y)\,\|\,p(X)\big)\right]$$
In other words, this is a measure of how much, on average, the probability distribution on $X$ will change if we are given the value of $Y$. This is often recalculated as the divergence from the product of the marginal distributions to the actual joint distribution:
$$I(X; Y) = D_{\mathrm{KL}}\big(p(X, Y)\,\|\,p(X)\,p(Y)\big).$$
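A minimal sketch (hypothetical joint table, base-2 logarithms) computing mutual information directly as the divergence of the joint distribution from the product of its marginals, as in the last expression above:

import math

def mutual_information(joint):
    """I(X; Y) = sum over (x, y) of p(x, y) log2( p(x, y) / (p(x) p(y)) )."""
    p_x, p_y = {}, {}
    for (x, y), p in joint.items():
        p_x[x] = p_x.get(x, 0.0) + p
        p_y[y] = p_y.get(y, 0.0) + p
    return sum(p * math.log2(p / (p_x[x] * p_y[y]))
               for (x, y), p in joint.items() if p > 0)

# Dependent variables share information; an independent pair would give exactly 0.
joint = {(0, 0): 0.45, (1, 0): 0.05, (0, 1): 0.05, (1, 1): 0.45}
print(mutual_information(joint))  # about 0.53 bits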
Mutual information is closely related to the log-likelihood ratio test in the context of contingency tables and the multinomial distribution, and to Pearson's χ² test: mutual information can be considered a statistic for assessing independence between a pair of variables, and has a well-specified asymptotic distribution.
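As a hedged illustration of that relationship (using a hypothetical 2×2 contingency table of counts), the log-likelihood ratio statistic G equals 2N times the mutual information of the empirical joint distribution measured in nats:

import math

counts = {(0, 0): 45, (1, 0): 5, (0, 1): 5, (1, 1): 45}  # hypothetical contingency table
n = sum(counts.values())
joint = {cell: c / n for cell, c in counts.items()}      # empirical joint distribution

p_x, p_y = {}, {}
for (x, y), p in joint.items():
    p_x[x] = p_x.get(x, 0.0) + p
    p_y[y] = p_y.get(y, 0.0) + p

# Mutual information in nats (natural logarithm).
mi_nats = sum(p * math.log(p / (p_x[x] * p_y[y]))
              for (x, y), p in joint.items() if p > 0)

# Log-likelihood ratio statistic G = 2 * sum(observed * ln(observed / expected)).
g = 2 * sum(c * math.log(c / (n * p_x[x] * p_y[y])) for (x, y), c in counts.items() if c > 0)

print(g, 2 * n * mi_nats)  # identical up to floating-point error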
Differential entropy
See main article: Differential entropy.
The basic measures of discrete entropy have been extended by analogy to continuous spaces by replacing sums with integrals and probability mass functions with probability density functions. Although, in both cases, mutual information expresses the number of bits of information common to the two sources in question, the analogy does not imply identical properties; for example, differential entropy may be negative.
The differential analogues of entropy, joint entropy, conditional entropy, and mutual information are defined as follows:
$$h(X) = -\int_X f(x) \log f(x)\, dx$$
$$h(X, Y) = -\int_Y \int_X f(x, y) \log f(x, y)\, dx\, dy$$
$$h(X|Y) = -\int_Y \int_X f(x, y) \log f(x|y)\, dx\, dy$$
$$I(X; Y) = \int_Y \int_X f(x, y) \log \frac{f(x, y)}{f(x)\, f(y)}\, dx\, dy$$
where $f(x, y)$ is the joint density function, $f(x)$ and $f(y)$ are the marginal distributions, and $f(x|y)$ is the conditional distribution.
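As a final sketch (a numerical check added here for illustration), the differential entropy of a continuous density can be approximated by numerical integration; for a standard normal density it matches the closed form 0.5 log2(2 pi e), about 2.05 bits, and for a sufficiently narrow density it becomes negative, as noted above:

import math

def gaussian_pdf(x, mu=0.0, sigma=1.0):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def differential_entropy(pdf, lo, hi, n=200000):
    """h(X) = -integral of f(x) log2 f(x) dx, approximated with a midpoint Riemann sum."""
    dx = (hi - lo) / n
    total = 0.0
    for i in range(n):
        f = pdf(lo + (i + 0.5) * dx)
        if f > 0:
            total -= f * math.log2(f) * dx
    return total

print(differential_entropy(gaussian_pdf, -10, 10))                        # about 2.047 bits
print(0.5 * math.log2(2 * math.pi * math.e))                              # closed form, about 2.047
print(differential_entropy(lambda x: gaussian_pdf(x, sigma=0.1), -5, 5))  # negative, about -1.27 bits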