Rzip
rzip is a huge-scale data compression program designed around initial LZ77-style string matching on a 900 MB dictionary window, followed by bzip2-based Burrows–Wheeler transform (BWT) and entropy coding (Huffman) on 900 kB output chunks.
Compression algorithm
rzip operates in two stages. The first stage finds and encodes large chunks of duplicated data over potentially very long distances (900 MB) in the input file. The second stage uses a standard compression algorithm (bzip2) to compress the output of the first stage.
It is common to need to compress files that contain long-distance redundancy. For example, when compressing a set of home directories, several users might have copies of the same file, or of quite similar files. It is also common for a single file to contain large duplicated chunks over long distances, such as a PDF file containing repeated copies of the same image. Most compression programs cannot take advantage of this redundancy, and thus achieve a much lower compression ratio than rzip can.
The intermediate interface between the two stages is a byte-aligned data stream with two commands: a literal ("add") with length and data:
type:8 = 0 => literal/add range of count bytes
count:16 = 1..65535
data:8..∞ = literal data to be inserted (n whole bytes)
and a match ("copy") with length and offset parameters:
type:8 = 1 => match/copy range of count bytes
count:16 = 31..65535
offset:32 = offset to position to be copied from
Literal or match/copy lengths greater than 65535 bytes are split into multiple instructions. End-of-stream is indicated by a zero-length literal/add command (type=0, count=0), immediately followed by a 32-bit CRC checksum.
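To make the stream format concrete, here is a minimal decoder sketch in C. The little-endian field order, the reading of offset as an absolute position in the decompressed output, and all function names are assumptions made for illustration; rzip's actual source differs in detail.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Read exactly n bytes; returns 0 on success, -1 on truncation. */
static int read_bytes(FILE *f, void *buf, size_t n) {
    return fread(buf, 1, n, f) == n ? 0 : -1;
}

static uint16_t get_u16(const uint8_t *p) {
    return (uint16_t)(p[0] | (p[1] << 8));
}

static uint32_t get_u32(const uint8_t *p) {
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Decode the two-command stream into out; returns the number of
   decoded bytes, or -1 on malformed or truncated input. */
static long decode_stream(FILE *in, uint8_t *out, size_t out_cap) {
    size_t pos = 0;
    for (;;) {
        uint8_t hdr[3];
        if (read_bytes(in, hdr, 3)) return -1;
        uint8_t type = hdr[0];
        uint16_t count = get_u16(hdr + 1);

        if (type == 0 && count == 0) {      /* end of stream */
            uint8_t crc[4];
            if (read_bytes(in, crc, 4)) return -1;
            /* A real decoder would verify get_u32(crc) against the
               CRC of out[0..pos). */
            return (long)pos;
        }
        if (pos + count > out_cap) return -1;

        if (type == 0) {                    /* literal/add: raw bytes follow */
            if (read_bytes(in, out + pos, count)) return -1;
        } else if (type == 1) {             /* match/copy: 32-bit offset follows */
            uint8_t off[4];
            if (read_bytes(in, off, 4)) return -1;
            uint32_t offset = get_u32(off);
            if ((size_t)offset + count > pos) return -1;  /* must reference prior data */
            memcpy(out + pos, out + offset, count);
        } else {
            return -1;                      /* unknown command type */
        }
        pos += count;
    }
}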
Reference implementation
A rolling-checksum algorithm based on the one in rsync is used to locate potential matches in such a large dataset. As the hash buckets fill up, older hashes ("tags") are discarded, in such a manner as to provide fairly good coverage with gradually decreasing match granularity as the distance increases. This implementation does not search for match lengths of fewer than 31 consecutive bytes.
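The point of the rolling checksum is that the hash of a fixed-size window can be updated in constant time as the window slides forward one byte, so a candidate match can be looked up at every offset of a 900 MB file at acceptable cost. Below is a sketch of the rsync-style scheme in C; the two 16-bit component sums mirror rsync's weak checksum, but the names and layout are illustrative rather than rzip's actual code.

#include <stdint.h>
#include <stddef.h>

typedef struct {
    uint16_t a;   /* sum of the window's bytes, mod 2^16 */
    uint16_t b;   /* position-weighted sum of the bytes, mod 2^16 */
    size_t   n;   /* window length */
} rollsum;

static void rollsum_init(rollsum *rs, const uint8_t *win, size_t n) {
    uint32_t a = 0, b = 0;
    for (size_t i = 0; i < n; i++) {
        a += win[i];
        b += (uint32_t)(n - i) * win[i];  /* earlier bytes weigh more */
    }
    rs->a = (uint16_t)a;
    rs->b = (uint16_t)b;
    rs->n = n;
}

/* Slide the window one byte, dropping `out` and taking in `in`, in O(1). */
static void rollsum_roll(rollsum *rs, uint8_t out, uint8_t in) {
    rs->a = (uint16_t)(rs->a - out + in);
    rs->b = (uint16_t)(rs->b - (uint32_t)rs->n * out + rs->a);
}

/* The 32-bit digest used to index the hash buckets ("tags"). */
static uint32_t rollsum_digest(const rollsum *rs) {
    return ((uint32_t)rs->b << 16) | rs->a;
}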
Advantages
The key difference between rzip and other well-known compression algorithms is its ability to take advantage of very long-distance redundancy. The well-known DEFLATE algorithm used in gzip uses a maximum history buffer of 32 KiB. The Burrows–Wheeler transform (BWT) block-sorting algorithm used in bzip2 is limited to 900 KiB of history. The history buffer in rzip can be up to 900 MiB long, several orders of magnitude larger than that of gzip or bzip2. rzip is often much faster than bzip2, despite using the bzip2 library as a backend, because rzip feeds bzip2 already-shrunken data, so bzip2 has less work to do. A simple comparison (although too small to be an authoritative benchmark) can be found at http://www.linux-mag.com/content/view/2528/0/1/0/ (log-in required); another can be found on rzip's web page.
Disadvantages
rzip is not suited for every purpose. Its two biggest disadvantages are that it cannot be pipelined (so it cannot read from standard input or write to standard output), and that it uses a large amount of memory: a typical compression run on a large file might use hundreds of megabytes of RAM. If there is plenty of RAM to spare and a very high compression ratio is required, rzip should be used; if these conditions are not satisfied, alternative compression methods such as gzip and bzip2, which are less memory-intensive, should be used instead. There is at least one patch to enable pipelining (http://www.rachinsky.de/nicolas/misc.shtml).
Lrzip
Lrzip is an improved version of rzip. Its file format is incompatible with rzip's. It has the following improvements:
- LZMA, LZO, DEFLATE, bzip2, and ZPAQ compression (as opposed to bzip2 only)
- No dictionary limit, not even limited by available RAM
- Ability to test data for compressibility before compressing, preventing the computer from wasting time by trying to compress incompressible data
- Ability to be pipelined from standard input / standard output (with a loss in compression ratio)
- Ability to disable last-stage compression for use with another compressor
- Optional AES-128 encryption
rzip64
rzip64 is an extension of rzip for very large files that can utilize multiple CPU cores in parallel. Most importantly, rzip64 can be interrupted at any time, so that a running compression task survives a system maintenance reboot without losing already completed work.
REP
REP is an alternative implementation of the rzip algorithm by Bulat Ziganshin, used in his FreeArc archiver as a preprocessor for the LZMA/Tornado compression algorithms. In FreeArc, REP finds large-distance matches and then LZMA compresses the remaining data. For example, on a computer with 2 GB of RAM, REP finds matches that are at least 512 bytes long at distances of up to 1 GB, and LZMA then finds any remaining matches at distances of up to 128 MB. Working together, they provide the best compression possible on a 2 GB RAM budget.
Being optimized for stream decompression and for collaboration with LZMA, REP has some differences from the original rzip implementation. First, by default it finds only matches that are at least 512 bytes long, since benchmarking showed this to be the optimal setting for overall REP+LZMA compression. Second, it uses a sliding dictionary about half the size of RAM, so decompression does not need to re-read data from the decompressed file. REP's advantage is its multiplicative rolling hash, which is both quick to compute and has a near-ideal distribution.
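A multiplicative rolling hash treats the window as a polynomial in a constant multiplier, which can likewise be slid one byte in constant time while mixing bits better than a plain sum. The sketch below shows the general technique in C; the multiplier and structure are illustrative assumptions, not REP's actual constants.

#include <stdint.h>
#include <stddef.h>

#define MULT 0x9E3779B1u   /* illustrative odd 32-bit multiplier */

typedef struct {
    uint32_t h;        /* current hash of the window */
    uint32_t mult_n;   /* MULT^n, used to remove the outgoing byte */
    size_t   n;        /* window length */
} mhash;

static void mhash_init(mhash *m, const uint8_t *win, size_t n) {
    m->h = 0;
    m->mult_n = 1;
    m->n = n;
    for (size_t i = 0; i < n; i++) {
        m->h = m->h * MULT + win[i];   /* h = sum of win[i] * MULT^(n-1-i) */
        m->mult_n *= MULT;             /* accumulate MULT^n */
    }
}

/* Slide the window one byte in O(1): remove `out`, append `in`.
   All arithmetic is modulo 2^32 via unsigned overflow. */
static void mhash_roll(mhash *m, uint8_t out, uint8_t in) {
    m->h = m->h * MULT + in - m->mult_n * out;
}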
The larger minimum match length (512 bytes, compared to 32 bytes in rzip) allowed additional speed optimizations, so that REP provides very fast compression (about 200 MB/s on an Intel i3-2100).
SREP
SREP (SuperREP) is an implementation of Tridgell's idea of an LZ compressor that does not store its dictionary in RAM, instead using SHA-1 hashes of processed blocks to compare their contents. This allows compressing files that are about 10 times larger than the available RAM. Decompression is performed either by reading data back from the already-decompressed part of the file, or by holding future matches in memory (the future-LZ compression algorithm). Future-LZ compression requires two passes over the input file, but decompression then needs very little memory: in one experiment, a 22 GB file compressed with a minimum match length of 512 bytes and a full 22 GB dictionary required just 2 GB of RAM for decompression.
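The sketch below illustrates the core idea in C: index each block only by its 20-byte SHA-1 digest and offset, and treat a digest match as a content match, so no block data has to be kept in RAM. The table size, layout, and names are illustrative assumptions, not SREP's actual code; OpenSSL's SHA1() is used here for brevity (link with -lcrypto).

#include <stdint.h>
#include <string.h>
#include <openssl/sha.h>

#define BLOCK 512            /* minimum match length, as in the text above */
#define SLOTS (1u << 20)     /* hash-table size: an illustrative choice */

typedef struct {
    unsigned char digest[SHA_DIGEST_LENGTH]; /* 20-byte SHA-1 of the block */
    uint64_t offset;                         /* where the block occurred */
    int used;
} slot;

static slot table[SLOTS];    /* tens of MB of digests, far less than the data */

static slot *find_slot(const unsigned char *d) {
    uint32_t idx;
    memcpy(&idx, d, sizeof idx);             /* first digest bytes as index */
    return &table[idx % SLOTS];
}

/* Index a block: keep only its digest and position, never its bytes. */
static void remember(const uint8_t *blk, uint64_t off) {
    unsigned char d[SHA_DIGEST_LENGTH];
    SHA1(blk, BLOCK, d);
    slot *s = find_slot(d);
    memcpy(s->digest, d, sizeof d);
    s->offset = off;
    s->used = 1;
}

/* Report a previous occurrence of this block, trusting a digest match
   as a content match, so earlier data need not be kept or re-read. */
static int lookup(const uint8_t *blk, uint64_t *off_out) {
    unsigned char d[SHA_DIGEST_LENGTH];
    SHA1(blk, BLOCK, d);
    slot *s = find_slot(d);
    if (s->used && memcmp(s->digest, d, sizeof d) == 0) {
        *off_out = s->offset;
        return 1;
    }
    return 0;
}

See also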
- List of archive formats
- List of file archivers
- Comparison of file archivers
External links
- rzip
- rsync, by the same author (Andrew Tridgell), containing a matching/search algorithm similar to the first stage of rzip
- lrzip, an improvement to rzip allowing the second-stage bzip2 reduction to be replaced by LZMA, LZO, or no second stage (raw, dictionary-only compression). The author is Con Kolivas, who states that 'lrzip' stands for 'Long Range ZIP'.
- rzip64, a parallel improvement to rzip with stop-and-go mode, from Kay Gorontzi
- REP, an improved RZIP implementation optimized for use together with LZMA
- SREP, the first LZ compressor that uses less RAM than its dictionary size
- DataCompression.info – LZ77/LZSS and derivatives