Package-merge algorithm
Encyclopedia
The package-merge algorithm is an O
(nL)-time algorithm for finding an optimal length-limited Huffman code for a given distribution on a given alphabet of size n, where no code word
is longer than L. It is a greedy algorithm
, and a generalization of Huffman's original algorithm
. Package-merge works by reducing the code construction problem to the binary coin collector's problem.
. The coin collector has run out of money and needs to use some of his coin collection to buy something of cost N. He wishes to select a subset of coins from his collection of minimum numismatic value whose denominations total N.
The binary version of this problem is that all denominations are powers of 2, that is, 1, 1/2, 1/4, etc. dollars.
Finally, there is a list of items, each of which is a 1 dollar coin or a package consisting of two or more smaller coins whose denominations total 1 dollar. They are also sorted in order of numismatic value. The coin collector then selects the least value N of them.
Note that the time of the algorithm is linear in the number of coins.
Let p1, …, pn be the frequencies of the
symbols of the alphabet to be encoded. We first sort the symbols so that pi ≤ pi+1. Create L coins for each symbol, of denominations 2−1, …, 2−L, each of numismatic value pi. Use the package-merge algorithm to select the set of coins of minimum numismatic value whose denominations total n − 1. Let hi be the number of coins of numismatic value pi selected.
The optimal length-limited Huffman code will encode symbol i with a bit string of length hi. The canonical Huffman code
can easily be constructed by a simple bottom-up greedy method, given that the hi are known, and this can be the basis for fast data compression
.
Many other improvements have been made to the Package-Merge Algorithm to reduce the multiplicative constant
and to make it faster in special cases, such as those problems having repeated pis. The Package-Merge approach has also been adapted to related problems such as alphabetic coding.
Methods involving graph theory
have been shown to have better asymptotic space complexity than the Package-Merge Algorithm, but these have not seen as much practical application.
Big O notation
In mathematics, big O notation is used to describe the limiting behavior of a function when the argument tends towards a particular value or infinity, usually in terms of simpler functions. It is a member of a larger family of notations that is called Landau notation, Bachmann-Landau notation, or...
(nL)-time algorithm for finding an optimal length-limited Huffman code for a given distribution on a given alphabet of size n, where no code word
Code word
In communication, a code word is an element of a standardized code or protocol. Each code word is assembled in accordance with the specific rules of the code and assigned a unique meaning...
is longer than L. It is a greedy algorithm
Greedy algorithm
A greedy algorithm is any algorithm that follows the problem solving heuristic of making the locally optimal choice at each stagewith the hope of finding the global optimum....
, and a generalization of Huffman's original algorithm
Huffman coding
In computer science and information theory, Huffman coding is an entropy encoding algorithm used for lossless data compression. The term refers to the use of a variable-length code table for encoding a source symbol where the variable-length code table has been derived in a particular way based on...
. Package-merge works by reducing the code construction problem to the binary coin collector's problem.
The coin collector's problem
Suppose a coin collector has a number of coins of various denominations, each of which has a numismatic valueNumismatics
Numismatics is the study or collection of currency, including coins, tokens, paper money, and related objects. While numismatists are often characterized as students or collectors of coins, the discipline also includes the broader study of money and other payment media used to resolve debts and the...
. The coin collector has run out of money and needs to use some of his coin collection to buy something of cost N. He wishes to select a subset of coins from his collection of minimum numismatic value whose denominations total N.
The binary version of this problem is that all denominations are powers of 2, that is, 1, 1/2, 1/4, etc. dollars.
Description of the package-merge algorithm
Assume that the largest denomination is 1 dollar, and that N is an integer. (The algorithm works even if these assumptions do not hold, by trivial modifications.) The coin collector first separates his coins into lists, one for each denomination, sorted by numismatic value. He then packages the smallest denomination coins in pairs, starting from the pair of smallest total numismatic value. If there is one coin left over, it will be the coin of highest numismatic value of that denomination, and it is set aside and ignored henceforth. These packages are then merged into the list of coins of the next smallest denomination, again in order of numismatic value. The items in that list are then packaged in pairs, and merged into the next smallest list, and so forth.Finally, there is a list of items, each of which is a 1 dollar coin or a package consisting of two or more smaller coins whose denominations total 1 dollar. They are also sorted in order of numismatic value. The coin collector then selects the least value N of them.
Note that the time of the algorithm is linear in the number of coins.
Reduction of length-limited Huffman coding to the coin collector's problem
Let L be the maximum length any code word is permitted to have.Let p1, …, pn be the frequencies of the
symbols of the alphabet to be encoded. We first sort the symbols so that pi ≤ pi+1. Create L coins for each symbol, of denominations 2−1, …, 2−L, each of numismatic value pi. Use the package-merge algorithm to select the set of coins of minimum numismatic value whose denominations total n − 1. Let hi be the number of coins of numismatic value pi selected.
The optimal length-limited Huffman code will encode symbol i with a bit string of length hi. The canonical Huffman code
Canonical Huffman code
A canonical Huffman code is a particular type of Huffman code with unique properties which allow it to be described in a very compact manner....
can easily be constructed by a simple bottom-up greedy method, given that the hi are known, and this can be the basis for fast data compression
Data compression
In computer science and information theory, data compression, source coding or bit-rate reduction is the process of encoding information using fewer bits than the original representation would use....
.
Performance improvements and generalizations
With this reduction, the algorithm is O(nL)-time and O(nL)-space. However, the original paper, "A fast algorithm for optimal length-limited Huffman codes," shows how this can be improved to O(nL)-time and O(n)-space. The idea is to run the algorithm a first time, only keeping enough data to be able to determine two equivalent subproblems that sum to half the size of the original problem. This is done recursively, resulting in an algorithm that takes about twice as long but requires only linear space.Many other improvements have been made to the Package-Merge Algorithm to reduce the multiplicative constant
Big O notation
In mathematics, big O notation is used to describe the limiting behavior of a function when the argument tends towards a particular value or infinity, usually in terms of simpler functions. It is a member of a larger family of notations that is called Landau notation, Bachmann-Landau notation, or...
and to make it faster in special cases, such as those problems having repeated pis. The Package-Merge approach has also been adapted to related problems such as alphabetic coding.
Methods involving graph theory
Graph theory
In mathematics and computer science, graph theory is the study of graphs, mathematical structures used to model pairwise relations between objects from a certain collection. A "graph" in this context refers to a collection of vertices or 'nodes' and a collection of edges that connect pairs of...
have been shown to have better asymptotic space complexity than the Package-Merge Algorithm, but these have not seen as much practical application.
External links
- Michael B. Baer "Twenty (or so) Questions: $D$-ary Length-Bounded Prefix Coding." (PDF)
- Alistair Moffat and Andrew Turpin "Space-Efficient Construction of Optimal Prefix Codes."
- An implementation of the package-merge algorithm "http://www.koders.com/delphi/fid65738C1383AB6F1E0D556659A91CB5F85B09C941.aspx?s=algorithm"