Incremental encoding - AbsoluteAstronomy.com

Incremental encoding, also known as front compression, back compression, or front coding, is a type of delta encoding

Delta encoding

Delta encoding is a way of storing or transmitting data in the form of differences between sequential data rather than complete files; more generally this is known as data differencing...

compression algorithm whereby common prefixes or suffixes

Affix

An affix is a morpheme that is attached to a word stem to form a new word. Affixes may be derivational, like English -ness and pre-, or inflectional, like English plural -s and past tense -ed. They are bound morphemes by definition; prefixes and suffixes may be separable affixes...

and their lengths are recorded so that they need not be duplicated. This algorithm is particularly well-suited for compressing sorted

Sorting

Sorting is any process of arranging items in some sequence and/or in different sets, and accordingly, it has two common, yet distinct meanings:# ordering: arranging items of the same kind, class, nature, etc...

data, e.g., a list of word

Word

In language, a word is the smallest free form that may be uttered in isolation with semantic or pragmatic content . This contrasts with a morpheme, which is the smallest unit of meaning but will not necessarily stand on its own...

s from a dictionary

Dictionary

A dictionary is a collection of words in one or more specific languages, often listed alphabetically, with usage information, definitions, etymologies, phonetics, pronunciations, and other information; or a book of words in one language with their equivalents in another, also known as a lexicon...

.

For example:

Input	Common prefix	Compressed output
myxa myxophyta myxopod nab nabbed nabbing nabit nabk nabob nacarat nacelle	no preceding word 'myx' 'myxop' no common prefix 'nab' 'nabb' 'nab' 'nab' 'nab' 'na' 'nac'	0 myxa 3 ophyta 5 od 0 nab 3 bed 4 ing 3 it 3 k 3 ob 2 carat 3 elle
64 bytes		46 bytes

Input

Common prefix

Compressed output



myxa

myxophyta

myxopod

nab

nabbed

nabbing

nabit

nabk

nabob

nacarat

nacelle



no preceding word

'myx'

'myxop'

no common prefix

'nab'

'nabb'

'nab'

'nab'

'nab'

'na'

'nac'



0 myxa

3 ophyta

5 od

0 nab

3 bed

4 ing

3 it

3 k

3 ob

2 carat

3 elle

64 bytes

46 bytes

The encoding used to store the common prefix length itself varies from application to application. Typical techniques are storing the value as a single byte; delta encoding

Delta encoding

Delta encoding is a way of storing or transmitting data in the form of differences between sequential data rather than complete files; more generally this is known as data differencing...

, which store only the change in the common prefix length; and various universal codes

Universal code (data compression)

In data compression, a universal code for integers is a prefix code that maps the positive integers onto binary codewords, with the additional property that whatever the true probability distribution on integers, as long as the distribution is monotonic , the expected lengths of the codewords are...

. It may be combined with other general lossless data compression

Lossless data compression

Lossless data compression is a class of data compression algorithms that allows the exact original data to be reconstructed from the compressed data. The term lossless is in contrast to lossy data compression, which only allows an approximation of the original data to be reconstructed, in exchange...

techniques such as entropy encoding

Entropy encoding

In information theory an entropy encoding is a lossless data compression scheme that is independent of the specific characteristics of the medium....

and dictionary coder

Dictionary coder

A dictionary coder, also sometimes known as a substitution coder, is a class of lossless data compression algorithms which operate by searching for matches between the text to be compressed and a set of strings contained in a data structure maintained by the encoder...

s to compress the remaining suffixes.

Applications

Incremental encoding is widely used in information retrieval to compress the lexicons used in search indexes

Index (search engine)

Search engine indexing collects, parses, and stores data to facilitate fast and accurate information retrieval. Index design incorporates interdisciplinary concepts from linguistics, cognitive psychology, mathematics, informatics, physics, and computer science...

; these list all the words found in all the documents and a pointer for each one to a list of locations. Typically, it compresses these indexes by about 40%.

As one example, incremental encoding is used as a starting point by the GNU locate

GNU locate

locate is a Unix utility first created in 1983 used to find files on filesystems. It searches through a prebuilt database of files generated by updatedb or a daemon and compressed using incremental encoding...

utility, in an index of filenames and directories. The GNU locate

GNU locate

utility further uses bigram

Bigram

Bigrams or digrams are groups of two written letters, two syllables, or two words, and are very commonly used as the basis for simple statistical analysis of text. They are used in one of the most successful language models for speech recognition...

encoding to further shorten popular filepath prefixes.

The source of this article is wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.