Canonical Huffman code
A canonical Huffman code is a particular type of Huffman code with unique properties which allow it to be described in a very compact manner.
Data compressors generally work in one of two ways. Either the decompressor can infer what codebook the compressor has used from previous context, or the compressor must tell the decompressor what the codebook is. Since a canonical Huffman codebook can be stored especially efficiently, most compressors start by generating a "normal" Huffman codebook, and then convert it to canonical Huffman before using it.
Algorithm
The normal Huffman coding algorithm assigns a variable-length code to every symbol in the alphabet. More frequently used symbols will be assigned a shorter code. For example, suppose we have the following non-canonical codebook:
A = 11
B = 0
C = 101
D = 100
Here the letter A has been assigned 2 bits, B has 1 bit, and C and D both have 3 bits. To make the code a canonical Huffman code, the codes are renumbered. The bit lengths stay the same, with the code book sorted first by codeword length and secondly by alphabetical value:
B = 0
A = 11
C = 101
D = 100
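This ordering amounts to sorting on the pair (codeword length, symbol). As a minimal sketch in Python (the codebook dict is just for illustration):

codebook = {'A': '11', 'B': '0', 'C': '101', 'D': '100'}
# Sort by codeword length first, then alphabetically by symbol.
ordered = sorted(codebook.items(), key=lambda item: (len(item[1]), item[0]))
print(ordered)  # [('B', '0'), ('A', '11'), ('C', '101'), ('D', '100')]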
Each of the existing codes is replaced with a new one of the same length, using the following algorithm:
- The first symbol in the list gets assigned a codeword which is the same length as the symbol's original codeword but all zeros. This will often be a single zero ('0').
- Each subsequent symbol is assigned the next binary number in sequence, ensuring that following codes are always higher in value.
- When you reach a longer codeword, then after incrementing, append zeros until the length of the new codeword is equal to the length of that symbol's original codeword. This can be thought of as a left shift.
By following these three rules, the canonical version of the code book produced will be:
B = 0
A = 10
C = 110
D = 111
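These three rules can be sketched in a few lines of Python. The function name canonicalize and the dict-based codebook are assumptions for illustration, not part of the original algorithm description:

def canonicalize(codebook):
    """Renumber an existing codebook into canonical form.

    codebook maps symbol -> codeword string, e.g. {'A': '11', ...}.
    Only the bit lengths of the original codewords are used.
    """
    # Sort symbols by original codeword length, then alphabetically.
    ordered = sorted(codebook, key=lambda s: (len(codebook[s]), s))
    canonical = {}
    code = 0
    prev_length = len(codebook[ordered[0]])
    for symbol in ordered:
        length = len(codebook[symbol])
        code <<= length - prev_length  # longer codeword: append zeros (left shift)
        canonical[symbol] = format(code, '0{}b'.format(length))
        code += 1  # next binary number in sequence
        prev_length = length
    return canonical

print(canonicalize({'A': '11', 'B': '0', 'C': '101', 'D': '100'}))
# {'B': '0', 'A': '10', 'C': '110', 'D': '111'}

Note that only the lengths of the original codewords matter; the original bit patterns themselves are discarded.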
Encoding the Codebook
The whole advantage of a canonical Huffman tree is that one can encode the description (the codebook) in fewer bits than a fully described tree.

Let us take our original Huffman codebook:
A = 11
B = 0
C = 101
D = 100
There are several ways we could encode this Huffman tree. For example, we could write each symbol followed by the number of bits and code:
('A',2,11), ('B',1,0), ('C',3,101), ('D',3,100)
Since we are listing the symbols in sequential alphabetical order, we can omit the symbols themselves, listing just the number of bits and code:
(2,11), (1,0), (3,101), (3,100)
With our canonical version we have the knowledge that the symbols are in sequential alphabetical order and that a later code will always be higher in value than an earlier one. The only parts left to transmit are the bit-lengths (number of bits) for each symbol. Note that our canonical Huffman tree always has higher values for longer bit lengths and that any symbols of the same bit length (C and D) have higher code values for higher symbols:
A = 10 (code value: 2 decimal, bits: 2)
B = 0 (code value: 0 decimal, bits: 1)
C = 110 (code value: 6 decimal, bits: 3)
D = 111 (code value: 7 decimal, bits: 3)
Since two-thirds of the constraints are known, only the number of bits for each symbol need be transmitted:
2, 1, 3, 3
With knowledge of the canonical Huffman algorithm, it is then possible to recreate the entire table (symbol and code values) from just the bit-lengths. Unused symbols are normally transmitted as having zero bit length.
Pseudo code
Given a list of symbols sorted by bit-length, the following pseudo code will print a canonical Huffman code book:

code = 0
while more symbols:
    print symbol, code
    code = (code + 1) << ((next bit length) - (current bit length))
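As a concrete check, here is a runnable Python translation of this pseudo code; the list of (symbol, bit-length) pairs is an assumed input format:

# Symbols with their bit lengths, sorted by bit length and then by symbol.
lengths = [('B', 1), ('A', 2), ('C', 3), ('D', 3)]
code = 0
for i, (symbol, length) in enumerate(lengths):
    # Zero-pad the code to the symbol's bit length before printing.
    print(symbol, format(code, '0{}b'.format(length)))
    if i + 1 < len(lengths):
        # Increment, then left-shift if the next bit length is longer.
        code = (code + 1) << (lengths[i + 1][1] - length)

Running this prints B 0, A 10, C 110 and D 111, matching the canonical codebook above.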
Huffman's original algorithm
The algorithm described in "A Method for the Construction of Minimum-Redundancy Codes" (David A. Huffman, Proceedings of the I.R.E., 1952) is:
compute huffman code:
    input: message ensemble (set of (message, probability)).
    base D.
    output: code ensemble (set of (message, code)).
algorithm:
    1- sort the message ensemble by decreasing probability.
    2- N is the cardinality of the message ensemble (the number of different messages).
    3- compute the integer n_0 such that 2 <= n_0 <= D and (N - n_0)/(D - 1) is an integer.
    4- select the n_0 least probable messages, and assign them each a digit code.
    5- substitute the selected messages by a composite message summing their probability, and re-order the ensemble.
    6- while there remains more than one message, do steps 7 thru 8.
    7- select the D least probable messages, and assign them each a digit code.
    8- substitute the selected messages by a composite message summing their probability, and re-order the ensemble.
    9- the code of each message is given by the concatenation of the code digits of the aggregates it has been put in.
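For the common binary case (D = 2, so n_0 = 2), the procedure can be sketched in Python with a priority queue standing in for the repeated re-ordering; the function name and input format are illustrative assumptions, not from the paper:

import heapq
import itertools

def huffman_code(ensemble):
    """Binary Huffman coding (D = 2) over a dict mapping message -> probability."""
    tiebreak = itertools.count()  # prevents comparing dicts on equal probabilities
    # Each heap entry is (probability, tiebreaker, {message: partial code}).
    heap = [(p, next(tiebreak), {m: ''}) for m, p in ensemble.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Steps 7-8: merge the two least probable messages into a composite,
        # prepending one code digit to every message inside each aggregate.
        p0, _, group0 = heapq.heappop(heap)
        p1, _, group1 = heapq.heappop(heap)
        merged = {m: '0' + c for m, c in group0.items()}
        merged.update({m: '1' + c for m, c in group1.items()})
        heapq.heappush(heap, (p0 + p1, next(tiebreak), merged))
    return heap[0][2]

print(huffman_code({'A': 0.3, 'B': 0.4, 'C': 0.15, 'D': 0.15}))

The resulting code is not necessarily canonical; it would still need the renumbering step described above to become a canonical Huffman code.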