Entropy encoding
In information theory, an entropy encoding is a lossless data compression scheme that is independent of the specific characteristics of the medium.
One of the main types of entropy coding creates and assigns a unique prefix-free code to each unique symbol that occurs in the input. These entropy encoders then compress data by replacing each fixed-length input symbol with the corresponding variable-length prefix-free output codeword. The length of each codeword is approximately proportional to the negative logarithm of the symbol's probability. Therefore, the most common symbols use the shortest codes.
According to Shannon's source coding theorem, the optimal code length for a symbol is −log_b(P), where b is the number of symbols used to make output codes and P is the probability of the input symbol.
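As a concrete illustration (not part of the original article; the probabilities below are hypothetical), the optimal lengths can be computed directly:

```python
import math

# Hypothetical symbol probabilities chosen for illustration.
probabilities = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}

b = 2  # binary output alphabet, so lengths are in bits
for symbol, p in probabilities.items():
    length = -math.log(p, b)  # Shannon's optimal code length, -log_b(P)
    print(f"{symbol}: P = {p}, optimal length = {length:.2f} bits")
```

Because these probabilities are all powers of 1/2, a prefix-free code can meet the optimum exactly (1, 2, 3, and 3 bits); for other distributions, practical codes such as Huffman's come within one bit of it per symbol.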
Two of the most common entropy encoding techniques are Huffman coding and arithmetic coding.
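As a sketch of the first technique (illustrative Python, not taken from the article), Huffman's algorithm builds a prefix-free code bottom-up by repeatedly merging the two least frequent subtrees, so frequent symbols end up near the root with short codewords:

```python
import heapq

def huffman_code(freqs):
    """Build a prefix-free Huffman code from a {symbol: frequency} dict."""
    # Each heap entry: (frequency, tie_breaker, {symbol: partial codeword}).
    heap = [(f, i, {sym: ""}) for i, (sym, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        f1, _, codes1 = heapq.heappop(heap)  # least frequent subtree
        f2, _, codes2 = heapq.heappop(heap)  # second least frequent
        # Prefix '0' onto one subtree's codewords and '1' onto the other's.
        merged = {s: "0" + c for s, c in codes1.items()}
        merged.update({s: "1" + c for s, c in codes2.items()})
        heapq.heappush(heap, (f1 + f2, count, merged))
        count += 1
    return heap[0][2]

# Example frequencies (hypothetical): the most frequent symbol, 'a',
# receives a 1-bit codeword; the rarest, 'f', a 4-bit one.
print(huffman_code({"a": 45, "b": 13, "c": 12, "d": 16, "e": 9, "f": 5}))
```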
If the approximate entropy characteristics of a data stream are known in advance (especially for signal compression), a simpler static code may be useful. These static codes include universal codes (such as Elias gamma coding or Fibonacci coding) and Golomb codes (such as unary coding or Rice coding).
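To show how simple such static codes can be, here is an illustrative Python sketch of unary and Elias gamma encoding for positive integers (the exact variants are assumptions; unary in particular is sometimes defined with n ones rather than n − 1):

```python
def unary(n):
    """Unary code: n - 1 ones followed by a terminating zero (n >= 1)."""
    return "1" * (n - 1) + "0"

def elias_gamma(n):
    """Elias gamma code: n in binary, preceded by as many zeros as there
    are bits after the leading 1 (n >= 1)."""
    binary = bin(n)[2:]  # binary representation without the '0b' prefix
    return "0" * (len(binary) - 1) + binary

for n in (1, 2, 3, 4, 9):
    print(n, unary(n), elias_gamma(n))
```

Both codes favor small integers with short codewords, which is why they suit streams whose values follow a known, monotonically decreasing distribution.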
Entropy as a measure of similarity
Besides using entropy encoding as a way to compress digital data, an entropy encoder can also be used to measure the amount of similarity between streams of data. This is done by generating an entropy coder/compressor for each class of data; unknown data is then classified by feeding the uncompressed data to each compressor and seeing which compressor yields the highest compression. The coder with the best compression is probably the coder trained on the data that was most similar to the unknown data.
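A rough sketch of this classification-by-compression idea, assuming zlib as a stand-in for a per-class trained entropy coder (the article does not prescribe an implementation, and zlib is a dictionary-plus-entropy coder rather than a pure entropy coder): concatenating the unknown data onto a class sample and measuring the extra compressed bytes approximates how well a coder "trained" on that class handles the unknown data.

```python
import zlib

def conditional_size(training, unknown):
    """Extra bytes needed to compress `unknown` given `training` as context.
    A smaller value means `unknown` resembles the training data more."""
    baseline = len(zlib.compress(training))
    combined = len(zlib.compress(training + unknown))
    return combined - baseline

# Hypothetical class samples for illustration.
classes = {
    "english": b"the quick brown fox jumps over the lazy dog " * 20,
    "digits":  b"3141592653589793238462643383279502884197 " * 20,
}
unknown = b"pack my box with five dozen liquor jugs"

best = min(classes, key=lambda c: conditional_size(classes[c], unknown))
print("Most similar class:", best)  # expected: 'english'
```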
External links
- On-line textbook: Information Theory, Inference, and Learning Algorithms, by David MacKay, gives an accessible introduction to Shannon theory and data compression, including Huffman coding and arithmetic coding.