Huffman coding
Example: character frequencies and the corresponding Huffman codewords for the phrase "this is an example of a huffman tree".

| Char | Freq | Code |
|---|---|---|
| space | 7 | 111 |
| a | 4 | 010 |
| e | 4 | 000 |
| f | 3 | 1101 |
| h | 2 | 1010 |
| i | 2 | 1000 |
| m | 2 | 0111 |
| n | 2 | 0010 |
| s | 2 | 1011 |
| t | 2 | 0110 |
| l | 1 | 11001 |
| o | 1 | 00110 |
| p | 1 | 10011 |
| r | 1 | 11000 |
| u | 1 | 00111 |
| x | 1 | 10010 |
In computer science and information theory, Huffman coding is an entropy encoding algorithm used for lossless data compression. The term refers to the use of a variable-length code table for encoding a source symbol (such as a character in a file) where the variable-length code table has been derived in a particular way based on the estimated probability of occurrence for each possible value of the source symbol. It was developed by David A. Huffman while he was a Ph.D. student at MIT, and published in the 1952 paper "A Method for the Construction of Minimum-Redundancy Codes".
Huffman coding uses a specific method for choosing the representation for each symbol, resulting in a prefix code (sometimes called a "prefix-free code"; that is, the bit string representing some particular symbol is never a prefix of the bit string representing any other symbol) that expresses the most common source symbols using shorter strings of bits than are used for less common source symbols. Huffman was able to design the most efficient compression method of this type: no other mapping of individual source symbols to unique strings of bits will produce a smaller average output size when the actual symbol frequencies agree with those used to create the code. A method was later found to design a Huffman code in linear time if the input probabilities (also known as weights) are sorted.
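As a quick, illustrative check of the prefix property (this helper is not part of the original article), one can verify that no codeword in a table is a prefix of another:

```python
def is_prefix_free(codewords):
    """Return True if no codeword is a prefix of any other codeword."""
    for a in codewords:
        for b in codewords:
            if a != b and b.startswith(a):
                return False
    return True

# The example code table above is prefix-free, and complete: its Kraft sum is 1.
codes = ["111", "010", "000", "1101", "1010", "1000", "0111", "0010",
         "1011", "0110", "11001", "00110", "10011", "11000", "00111", "10010"]
print(is_prefix_free(codes))              # True
print(sum(2 ** -len(c) for c in codes))   # 1.0
```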
For a set of symbols with a uniform probability distribution and a number of members which is a power of two, Huffman coding is equivalent to simple binary block encoding, e.g., ASCII coding. Huffman coding is such a widespread method for creating prefix codes that the term "Huffman code" is widely used as a synonym for "prefix code" even when such a code is not produced by Huffman's algorithm.
Although Huffman's original algorithm is optimal for a symbol-by-symbol coding (i.e. a stream of unrelated symbols) with a known input probability distribution, it is not optimal when the symbol-by-symbol restriction is dropped, or when the probability mass functions are unknown, not identically distributed, or not independent (e.g., "cat" is more common than "cta"). Other methods such as arithmetic coding and LZW coding often have better compression capability: both of these methods can combine an arbitrary number of symbols for more efficient coding, and generally adapt to the actual input statistics, the latter of which is useful when input probabilities are not precisely known or vary significantly within the stream. However, the limitations of Huffman coding should not be overstated; it can be used adaptively, accommodating unknown, changing, or context-dependent probabilities. In the case of known independent and identically distributed random variables, combining symbols together reduces inefficiency in a way that approaches optimality as the number of symbols combined increases.
History
In 1951, David A. Huffman and his MIT information theory classmates were given the choice of a term paper or a final exam. The professor, Robert M. Fano, assigned a term paper on the problem of finding the most efficient binary code. Huffman, unable to prove any codes were the most efficient, was about to give up and start studying for the final when he hit upon the idea of using a frequency-sorted binary tree and quickly proved this method the most efficient.
In doing so, the student outdid his professor, who had worked with information theory inventor Claude Shannon to develop a similar code. Huffman avoided the major flaw of the suboptimal Shannon-Fano coding by building the tree from the bottom up instead of from the top down.
Informal description
Given: A set of symbols and their weights (usually proportional to probabilities).
Find: A prefix-free binary code (a set of codewords) with minimum expected codeword length (equivalently, a tree with minimum weighted path length from the root).
Formalized description
Input. Alphabet A = (a1, a2, ..., an), which is the symbol alphabet of size n.
Set W = (w1, w2, ..., wn), which is the set of the (positive) symbol weights (usually proportional to probabilities), i.e. wi = weight(ai) for 1 ≤ i ≤ n.
Output.
Code C(A, W) = (c1, c2, ..., cn), which is the set of (binary) codewords, where ci is the codeword for ai, 1 ≤ i ≤ n.
Goal.
Let L(C) = Σi wi · length(ci) be the weighted path length of code C. Condition: L(C) ≤ L(T) for any code T(A, W).
Samples
| Input (A, W) | Symbol (ai) | a | b | c | d | e | Sum |
|---|---|---|---|---|---|---|---|
| | Weights (wi) | 0.10 | 0.15 | 0.30 | 0.16 | 0.29 | = 1 |
| Output C | Codewords (ci) | 010 | 011 | 11 | 00 | 10 | |
| | Codeword length in bits (li) | 3 | 3 | 2 | 2 | 2 | |
| | Weighted path length contribution (li wi) | 0.30 | 0.45 | 0.60 | 0.32 | 0.58 | L(C) = 2.25 |
| Optimality | Probability budget (2^−li) | 1/8 | 1/8 | 1/4 | 1/4 | 1/4 | = 1.00 |
| | Information content in bits (−log2 wi) ≈ | 3.32 | 2.74 | 1.74 | 2.64 | 1.79 | |
| | Entropy contribution (−wi log2 wi) | 0.332 | 0.411 | 0.521 | 0.423 | 0.518 | H(A) = 2.205 |
For any code that is biunique, meaning that the code is uniquely decodeable, the sum of the probability budgets across all symbols is always less than or equal to one. In this example, the sum is strictly equal to one; as a result, the code is termed a complete code. If this is not the case, you can always derive an equivalent code by adding extra symbols (with associated null probabilities), to make the code complete while keeping it biunique.
As defined by Shannon (1948) in "A Mathematical Theory of Communication", the information content h (in bits) of each symbol ai with non-null probability wi is

h(ai) = log2(1/wi) = −log2 wi.
The entropy H (in bits) is the weighted sum, across all symbols ai with non-zero probability wi, of the information content of each symbol:

H(A) = Σi wi · h(ai) = −Σi wi log2 wi.
(Note: A symbol with zero probability has zero contribution to the entropy, since w log2 w → 0 as w → 0+. So for simplicity, symbols with zero probability can be left out of the formula above.)
As a consequence of Shannon's source coding theorem, the entropy is a measure of the smallest codeword length that is theoretically possible for the given alphabet with associated weights. In this example, the weighted average codeword length is 2.25 bits per symbol, only slightly larger than the calculated entropy of 2.205 bits per symbol. So not only is this code optimal in the sense that no other feasible code performs better, but it is very close to the theoretical limit established by Shannon.
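A small illustrative sketch reproducing the quantities in the table above from the weights and codewords:

```python
import math

weights   = {"a": 0.10, "b": 0.15, "c": 0.30, "d": 0.16, "e": 0.29}
codewords = {"a": "010", "b": "011", "c": "11", "d": "00", "e": "10"}

# Expected (weighted) codeword length L(C)
L = sum(w * len(codewords[s]) for s, w in weights.items())

# Shannon entropy H(A) of the source
H = -sum(w * math.log2(w) for w in weights.values())

# Kraft sum: <= 1 for any uniquely decodable code, = 1 for a complete code
kraft = sum(2 ** -len(c) for c in codewords.values())

print(f"L(C) = {L:.3f} bits/symbol")   # 2.250
print(f"H(A) = {H:.3f} bits/symbol")   # ~2.205
print(f"Kraft sum = {kraft}")          # 1.0
```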
Note that, in general, a Huffman code need not be unique, but it is always one of the codes minimizing L(C).
Compression
The technique works by creating a binary tree of nodes. These can be stored in a regular array, the size of which depends on the number of symbols, n. A node can be either a leaf node or an internal node. Initially, all nodes are leaf nodes, which contain the symbol itself, the weight (frequency of appearance) of the symbol and, optionally, a link to a parent node, which makes it easy to read the code (in reverse) starting from a leaf node. Internal nodes contain a weight, links to two child nodes and the optional link to a parent node. As a common convention, bit '0' represents following the left child and bit '1' represents following the right child. A finished tree has up to n leaf nodes and n − 1 internal nodes. A Huffman tree that omits unused symbols produces the optimal code lengths.
The process essentially begins with the leaf nodes containing the probabilities of the symbols they represent. A new node is then created whose children are the two nodes with the smallest probability, such that the new node's probability is equal to the sum of its children's probabilities. With the previous two nodes merged into one (and thus no longer considered), and with the new node now under consideration, the procedure is repeated until only one node remains: the root of the Huffman tree.
The simplest construction algorithm uses a priority queue where the node with lowest probability is given highest priority:
- Create a leaf node for each symbol and add it to the priority queue.
- While there is more than one node in the queue:
- Remove the two nodes of highest priority (lowest probability) from the queue
- Create a new internal node with these two nodes as children and with probability equal to the sum of the two nodes' probabilities.
- Add the new node to the queue.
- The remaining node is the root node and the tree is complete.
Since efficient priority queue data structures require O(log n) time per insertion, and a tree with n leaves has 2n−1 nodes, this algorithm operates in O(n log n) time, where n is the number of symbols.
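The following is a minimal sketch of the priority-queue construction described above (the names `Node`, `build_tree` and `codebook` are illustrative, not from any particular library):

```python
import heapq
import itertools

class Node:
    def __init__(self, weight, symbol=None, left=None, right=None):
        self.weight = weight    # frequency or probability
        self.symbol = symbol    # None for internal nodes
        self.left = left
        self.right = right

def build_tree(frequencies):
    """Build a Huffman tree from a {symbol: weight} mapping."""
    counter = itertools.count()          # tie-breaker so Nodes are never compared
    heap = [(w, next(counter), Node(w, s)) for s, w in frequencies.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, n1 = heapq.heappop(heap)  # two lowest-weight nodes
        w2, _, n2 = heapq.heappop(heap)
        merged = Node(w1 + w2, left=n1, right=n2)
        heapq.heappush(heap, (w1 + w2, next(counter), merged))
    return heap[0][2]                    # the remaining node is the root

def codebook(node, prefix="", table=None):
    """Walk the tree: '0' for the left child, '1' for the right child."""
    if table is None:
        table = {}
    if node.symbol is not None:
        table[node.symbol] = prefix or "0"   # single-symbol edge case
        return table
    codebook(node.left, prefix + "0", table)
    codebook(node.right, prefix + "1", table)
    return table

freqs = {"a": 4, "b": 2, "c": 1, "d": 1}
print(codebook(build_tree(freqs)))   # e.g. {'a': '0', 'b': '10', 'c': '110', 'd': '111'}
```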
If the symbols are sorted by probability, there is a linear-time (O(n)) method to create a Huffman tree using two queues, the first one containing the initial weights (along with pointers to the associated leaves), and combined weights (along with pointers to the trees) being put in the back of the second queue. This assures that the lowest weight is always kept at the front of one of the two queues:
- Start with as many leaves as there are symbols.
- Enqueue all leaf nodes into the first queue (by probability in increasing order so that the least likely item is in the head of the queue).
- While there is more than one node in the queues:
- Dequeue the two nodes with the lowest weight by examining the fronts of both queues.
- Create a new internal node, with the two just-removed nodes as children (either node can be either child) and the sum of their weights as the new weight.
- Enqueue the new node into the rear of the second queue.
- The remaining node is the root node; the tree has now been generated.
Although this algorithm may appear "faster" complexity-wise than the previous algorithm using a priority queue, this is not actually the case because the symbols need to be sorted by probability beforehand, a process that takes O(n log n) time in itself.
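A sketch of the two-queue method, assuming the leaves arrive already sorted by weight (names and the nested-tuple tree representation are illustrative):

```python
from collections import deque

def build_tree_sorted(sorted_items):
    """sorted_items: list of (weight, symbol) pairs in increasing weight order.
    Returns the Huffman tree as nested tuples: a leaf is (weight, symbol),
    an internal node is (weight, left, right)."""
    leaves = deque(sorted_items)   # first queue: original leaves
    internal = deque()             # second queue: merged subtrees

    def pop_lowest():
        # The lowest remaining weight is always at the front of one of the queues.
        if not internal or (leaves and leaves[0][0] <= internal[0][0]):
            return leaves.popleft()
        return internal.popleft()

    if len(leaves) == 1:
        return leaves[0]
    while len(leaves) + len(internal) > 1:
        a = pop_lowest()
        b = pop_lowest()
        internal.append((a[0] + b[0], a, b))   # new node goes to the back
    return internal[0]

tree = build_tree_sorted([(1, "c"), (1, "d"), (2, "b"), (4, "a")])
```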
In many cases, time complexity is not very important in the choice of algorithm here, since n here is the number of symbols in the alphabet, which is typically a very small number (compared to the length of the message to be encoded); whereas complexity analysis concerns the behavior when n grows to be very large.
It is generally beneficial to minimize the variance of codeword length. For example, a communication buffer receiving Huffman-encoded data may need to be larger to deal with especially long symbols if the tree is especially unbalanced. To minimize variance, simply break ties between queues by choosing the item in the first queue. This modification will retain the mathematical optimality of the Huffman coding while both minimizing variance and minimizing the length of the longest character code.
Here's an example using the French subject string "j'aime aller sur le bord de l'eau les jeudis ou les jours impairs":
Decompression
Generally speaking, the process of decompression is simply a matter of translating the stream of prefix codes to individual byte values, usually by traversing the Huffman tree node by node as each bit is read from the input stream (reaching a leaf node necessarily terminates the search for that particular byte value). Before this can take place, however, the Huffman tree must be somehow reconstructed. In the simplest case, where character frequencies are fairly predictable, the tree can be preconstructed (and even statistically adjusted on each compression cycle) and thus reused every time, at the expense of at least some measure of compression efficiency. Otherwise, the information to reconstruct the tree must be sent a priori. A naive approach might be to prepend the frequency count of each character to the compression stream. Unfortunately, the overhead in such a case could amount to several kilobytes, so this method has little practical use. If the data is compressed using canonical encoding, the compression model can be precisely reconstructed with just B·2^B bits of information (where B is the number of bits per symbol). Another method is to simply prepend the Huffman tree, bit by bit, to the output stream. For example, assuming that the value of 0 represents a parent node and 1 a leaf node, whenever the latter is encountered the tree-building routine simply reads the next 8 bits to determine the character value of that particular leaf. The process continues recursively until the last leaf node is reached; at that point, the Huffman tree will thus be faithfully reconstructed. The overhead using such a method ranges from roughly 2 to 320 bytes (assuming an 8-bit alphabet). Many other techniques are possible as well. In any case, since the compressed data can include unused "trailing bits", the decompressor must be able to determine when to stop producing output. This can be accomplished by either transmitting the length of the decompressed data along with the compression model or by defining a special code symbol to signify the end of input (the latter method can adversely affect code-length optimality, however).
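A minimal sketch of the bit-by-bit decoding described above, using the nested-tuple tree representation from the earlier sketch (a leaf is (weight, symbol), an internal node is (weight, left, right)); this is an illustration, not the article's reference implementation:

```python
def decode(bits, root):
    """Translate a string of '0'/'1' characters back into symbols by walking
    the tree from the root; reaching a leaf emits a symbol and restarts."""
    out = []
    node = root
    for bit in bits:
        node = node[1] if bit == "0" else node[2]   # left child on 0, right on 1
        if len(node) == 2:                          # (weight, symbol) -> leaf
            out.append(node[1])
            node = root
    return "".join(out)

# Example tree: a='0', b='10', c='110', d='111'
tree = (8, (4, "a"), (4, (2, "b"), (2, (1, "c"), (1, "d"))))
print(decode("010110111", tree))   # -> "abcd"
```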
Main properties
The probabilities used can be generic ones for the application domain that are based on average experience, or they can be the actual frequencies found in the text being compressed. (This variation requires that a frequency table or other hint as to the encoding be stored with the compressed text; implementations employ various tricks to store such tables efficiently.)
Huffman coding is optimal when the probability of each input symbol is a negative power of two. Prefix codes tend to have inefficiency on small alphabets, where probabilities often fall between these optimal points. "Blocking", or expanding the alphabet size by grouping multiple symbols into "words" of fixed or variable length before Huffman coding, helps both to reduce that inefficiency and to take advantage of statistical dependencies between input symbols within the group (as in the case of natural language text). The worst case for Huffman coding can happen when the probability of a symbol exceeds 2^−1 = 0.5, making the upper limit of inefficiency unbounded. These situations often respond well to a form of blocking called run-length encoding; for the simple case of Bernoulli processes, Golomb coding is a provably optimal run-length code.
Arithmetic coding produces some gains over Huffman coding, although arithmetic coding has higher computational complexity. Also, arithmetic coding was historically a subject of some concern over patent issues. However, as of mid-2010, various well-known effective techniques for arithmetic coding have passed into the public domain as the early patents have expired.
Variations
Many variations of Huffman coding exist, some of which use a Huffman-like algorithm, and others of which find optimal prefix codes (while, for example, putting different restrictions on the output). Note that, in the latter case, the method need not be Huffman-like, and, indeed, need not even be polynomial time. An exhaustive list of papers on Huffman coding and its variations is given by "Code and Parse Trees for Lossless Source Encoding" (http://scholar.google.com/scholar?hl=en&lr=&cluster=6556734736002074338).
n-ary Huffman coding
The n-ary Huffman algorithm uses the {0, 1, ..., n − 1} alphabet to encode messages and build an n-ary tree. This approach was considered by Huffman in his original paper. The same algorithm applies as for binary (n = 2) codes, except that the n least probable symbols are taken together, instead of just the 2 least probable. Note that for n greater than 2, not all sets of source words can properly form an n-ary tree for Huffman coding. In this case, additional 0-probability placeholders must be added. This is because the tree must form an n-to-1 contractor; for binary coding, this is a 2-to-1 contractor, and any sized set can form such a contractor. If the number of source words is congruent to 1 modulo n − 1, then the set of source words will form a proper Huffman tree.
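A small sketch of the placeholder count implied above, assuming the only requirement is that the padded symbol count be congruent to 1 modulo n − 1 (the helper name is hypothetical):

```python
def num_placeholders(num_symbols, n):
    """Zero-probability placeholders needed so that (count - 1) is a
    multiple of (n - 1), i.e. the symbols can form a full n-ary tree."""
    if num_symbols <= 1:
        return 0
    remainder = (num_symbols - 1) % (n - 1)
    return 0 if remainder == 0 else (n - 1) - remainder

print(num_placeholders(6, 3))   # 1: pad 6 symbols to 7 for a ternary tree
print(num_placeholders(7, 3))   # 0: 7 symbols already satisfy 7 ≡ 1 (mod 2)
```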
Adaptive Huffman coding
A variation called adaptive Huffman coding involves calculating the probabilities dynamically based on recent actual frequencies in the sequence of source symbols, and changing the coding tree structure to match the updated probability estimates.
Huffman template algorithm
Most often, the weights used in implementations of Huffman coding represent numeric probabilities, but the algorithm given above does not require this; it requires only that the weights form a totally ordered commutative monoid, meaning a way to order weights and to add them. The Huffman template algorithm enables one to use any kind of weights (costs, frequencies, pairs of weights, non-numerical weights) and one of many combining methods (not just addition). Such algorithms can solve other minimization problems, such as minimizing maxi[wi + length(ci)], a problem first applied to circuit design (http://citeseer.ist.psu.edu/context/665634/0).
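A sketch of the template idea in code: the same merge loop, parameterized by an arbitrary combining operation rather than plain addition. The names are illustrative, and the second example assumes the combining rule max(a, b) + 1 for the maxi[wi + length(ci)] problem:

```python
import heapq
import itertools

def huffman_template(weights, combine):
    """weights: list of initial weights (any totally ordered values).
    combine: function merging the two smallest weights into a new weight.
    Returns the final combined weight at the root."""
    counter = itertools.count()          # tie-breaker for equal weights
    heap = [(w, next(counter)) for w in weights]
    heapq.heapify(heap)
    while len(heap) > 1:
        (w1, _), (w2, _) = heapq.heappop(heap), heapq.heappop(heap)
        heapq.heappush(heap, (combine(w1, w2), next(counter)))
    return heap[0][0]

# Ordinary Huffman: combine = addition; the root carries the total weight.
print(huffman_template([4, 2, 1, 1], lambda a, b: a + b))        # 8

# A different combining rule, merging the two smallest as max(a, b) + 1.
print(huffman_template([3, 2, 4, 4], lambda a, b: max(a, b) + 1))  # 6
```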
Length-limited Huffman coding
Length-limited Huffman coding is a variant where the goal is still to achieve a minimum weighted path length, but there is an additional restriction that the length of each codeword must be less than a given constant. The package-merge algorithm solves this problem with a simple greedy approach very similar to that used by Huffman's algorithm. Its time complexity is O(nL), where L is the maximum length of a codeword. No algorithm is known to solve this problem in linear or linearithmic time, unlike the presorted and unsorted conventional Huffman problems, respectively.
Huffman coding with unequal letter costs
In the standard Huffman coding problem, it is assumed that each symbol in the set that the code words are constructed from has an equal cost to transmit: a code word whose length is N digits will always have a cost of N, no matter how many of those digits are 0s, how many are 1s, etc. When working under this assumption, minimizing the total cost of the message and minimizing the total number of digits are the same thing.
Huffman coding with unequal letter costs is the generalization in which this assumption is no longer true: the letters of the encoding alphabet may have non-uniform lengths, due to characteristics of the transmission medium. An example is the encoding alphabet of Morse code, where a 'dash' takes longer to send than a 'dot', and therefore the cost of a dash in transmission time is higher. The goal is still to minimize the weighted average codeword length, but it is no longer sufficient just to minimize the number of symbols used by the message. No algorithm is known to solve this in the same manner or with the same efficiency as conventional Huffman coding.
Optimal alphabetic binary trees (Hu-Tucker coding)
In the standard Huffman coding problem, it is assumed that any codeword can correspond to any input symbol. In the alphabetic version, the alphabetic order of inputs and outputs must be identical. Thus, for example, the symbols a, b, c (in that order) could not be assigned the codewords {00, 1, 01}, but could instead be assigned either {00, 01, 1} or {0, 10, 11}. This is also known as the Hu-Tucker problem, after the authors of the paper presenting the first linearithmic solution to this optimal binary alphabetic problem, which has some similarities to Huffman's algorithm but is not a variation of it. These optimal alphabetic binary trees are often used as binary search trees.
The canonical Huffman code
If weights corresponding to the alphabetically ordered inputs are in numerical order, the Huffman code has the same lengths as the optimal alphabetic code, which can be found by calculating these lengths, rendering Hu-Tucker coding unnecessary. The code resulting from numerically (re-)ordered input is sometimes called the canonical Huffman code and is often the code used in practice, due to ease of encoding/decoding. The technique for finding this code is sometimes called Huffman-Shannon-Fano coding, since it is optimal like Huffman coding, but alphabetic in weight probability, like Shannon-Fano coding. The Huffman-Shannon-Fano code corresponding to the example is {000, 001, 01, 10, 11}, which, having the same codeword lengths as the original solution, is also optimal.
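A sketch of one common way to assign canonical codewords from a set of code lengths (shortest codes first, in the style used by formats such as DEFLATE); note that this ordering differs from the Huffman-Shannon-Fano assignment described above, and the function name is illustrative:

```python
def canonical_codes(lengths):
    """lengths: {symbol: codeword length}. Assigns codewords so that symbols
    are ordered first by length, then alphabetically, and each codeword is
    the previous one plus one (shifted left when the length increases)."""
    code = 0
    prev_len = 0
    table = {}
    for symbol, length in sorted(lengths.items(), key=lambda kv: (kv[1], kv[0])):
        code <<= (length - prev_len)      # append zeros when the length grows
        table[symbol] = format(code, f"0{length}b")
        code += 1
        prev_len = length
    return table

# Code lengths from the sample above (a, b: 3 bits; c, d, e: 2 bits)
print(canonical_codes({"a": 3, "b": 3, "c": 2, "d": 2, "e": 2}))
# {'c': '00', 'd': '01', 'e': '10', 'a': '110', 'b': '111'}
```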
Applications
Arithmetic coding can be viewed as a generalization of Huffman coding, in the sense that they produce the same output when every symbol has a probability of the form 1/2^k; in particular it tends to offer significantly better compression for small alphabet sizes. Huffman coding nevertheless remains in wide use because of its simplicity and high speed. Intuitively, arithmetic coding can offer better compression than Huffman coding because its "code words" can have effectively non-integer bit lengths, whereas code words in Huffman coding can only have an integer number of bits. Therefore, there is an inefficiency in Huffman coding where a code word of length k only optimally matches a symbol of probability 1/2^k and other probabilities are not represented as optimally; whereas the code word length in arithmetic coding can be made to exactly match the true probability of the symbol.
Huffman coding today is often used as a "back-end" to some other compression methods.
DEFLATE (PKZIP's algorithm) and multimedia codecs such as JPEG and MP3 have a front-end model and quantization followed by Huffman coding (or variable-length prefix-free codes with a similar structure, although perhaps not necessarily designed by using Huffman's algorithm).
See also
- Adaptive Huffman coding
- Canonical Huffman code
- Huffyuv
- Modified Huffman coding - used in fax machines
- Shannon-Fano coding
- Data compression
- Lempel–Ziv–Welch
- Varicode
External links
- Huffman Encoding process animation
- Huffman Encoding & Decoding Animation
- Program for explaining the Huffman Coding procedure.
- n-ary Huffman Template Algorithm
- Huffman Tree visual graph generator
- Sloane A098950 Minimizing k-ordered sequences of maximum height Huffman tree
- Mordecai J. Golin, Claire Kenyon, Neal E. Young "Huffman coding with unequal letter costs" (PDF), STOC 2002: 785-791
- Huffman Coding: A CS2 Assignment a good introduction to Huffman coding
- A quick tutorial on generating a Huffman tree
- Pointers to Huffman coding visualizations
- Huffman in C
- Description of an implementation in Python
- Explanation of Huffman coding with examples in several languages
- Interactive Huffman Tree Construction
- A C program doing basic Huffman coding on binary and text files
- Efficient implementation of Huffman codes for blocks of binary sequences