Incremental encoding
Encyclopedia
Incremental encoding, also known as front compression, back compression, or front coding, is a type of delta encoding
compression algorithm whereby common prefixes or suffixes
and their lengths are recorded so that they need not be duplicated. This algorithm is particularly well-suited for compressing sorted
data, e.g., a list of word
s from a dictionary
.
For example:
The encoding used to store the common prefix length itself varies from application to application. Typical techniques are storing the value as a single byte; delta encoding
, which store only the change in the common prefix length; and various universal codes
. It may be combined with other general lossless data compression
techniques such as entropy encoding
and dictionary coder
s to compress the remaining suffixes.
; these list all the words found in all the documents and a pointer for each one to a list of locations. Typically, it compresses these indexes by about 40%.
As one example, incremental encoding is used as a starting point by the GNU locate
utility, in an index of filenames and directories. The GNU locate
utility further uses bigram
encoding to further shorten popular filepath prefixes.
Delta encoding
Delta encoding is a way of storing or transmitting data in the form of differences between sequential data rather than complete files; more generally this is known as data differencing...
compression algorithm whereby common prefixes or suffixes
Affix
An affix is a morpheme that is attached to a word stem to form a new word. Affixes may be derivational, like English -ness and pre-, or inflectional, like English plural -s and past tense -ed. They are bound morphemes by definition; prefixes and suffixes may be separable affixes...
and their lengths are recorded so that they need not be duplicated. This algorithm is particularly well-suited for compressing sorted
Sorting
Sorting is any process of arranging items in some sequence and/or in different sets, and accordingly, it has two common, yet distinct meanings:# ordering: arranging items of the same kind, class, nature, etc...
data, e.g., a list of word
Word
In language, a word is the smallest free form that may be uttered in isolation with semantic or pragmatic content . This contrasts with a morpheme, which is the smallest unit of meaning but will not necessarily stand on its own...
s from a dictionary
Dictionary
A dictionary is a collection of words in one or more specific languages, often listed alphabetically, with usage information, definitions, etymologies, phonetics, pronunciations, and other information; or a book of words in one language with their equivalents in another, also known as a lexicon...
.
For example:
Input | Common prefix | Compressed output |
---|---|---|
|
|
|
64 bytes | 46 bytes | |
The encoding used to store the common prefix length itself varies from application to application. Typical techniques are storing the value as a single byte; delta encoding
Delta encoding
Delta encoding is a way of storing or transmitting data in the form of differences between sequential data rather than complete files; more generally this is known as data differencing...
, which store only the change in the common prefix length; and various universal codes
Universal code (data compression)
In data compression, a universal code for integers is a prefix code that maps the positive integers onto binary codewords, with the additional property that whatever the true probability distribution on integers, as long as the distribution is monotonic , the expected lengths of the codewords are...
. It may be combined with other general lossless data compression
Lossless data compression
Lossless data compression is a class of data compression algorithms that allows the exact original data to be reconstructed from the compressed data. The term lossless is in contrast to lossy data compression, which only allows an approximation of the original data to be reconstructed, in exchange...
techniques such as entropy encoding
Entropy encoding
In information theory an entropy encoding is a lossless data compression scheme that is independent of the specific characteristics of the medium....
and dictionary coder
Dictionary coder
A dictionary coder, also sometimes known as a substitution coder, is a class of lossless data compression algorithms which operate by searching for matches between the text to be compressed and a set of strings contained in a data structure maintained by the encoder...
s to compress the remaining suffixes.
Applications
Incremental encoding is widely used in information retrieval to compress the lexicons used in search indexesIndex (search engine)
Search engine indexing collects, parses, and stores data to facilitate fast and accurate information retrieval. Index design incorporates interdisciplinary concepts from linguistics, cognitive psychology, mathematics, informatics, physics, and computer science...
; these list all the words found in all the documents and a pointer for each one to a list of locations. Typically, it compresses these indexes by about 40%.
As one example, incremental encoding is used as a starting point by the GNU locate
GNU locate
locate is a Unix utility first created in 1983 used to find files on filesystems. It searches through a prebuilt database of files generated by updatedb or a daemon and compressed using incremental encoding...
utility, in an index of filenames and directories. The GNU locate
GNU locate
locate is a Unix utility first created in 1983 used to find files on filesystems. It searches through a prebuilt database of files generated by updatedb or a daemon and compressed using incremental encoding...
utility further uses bigram
Bigram
Bigrams or digrams are groups of two written letters, two syllables, or two words, and are very commonly used as the basis for simple statistical analysis of text. They are used in one of the most successful language models for speech recognition...
encoding to further shorten popular filepath prefixes.