Snappy
Encyclopedia
Snappy is a fast compression
and decompression library based on LZ77's ideas developed by Google
. It was designed to be very fast and stable, but not to achieve a high compression ratio. Compression speed is 250 MB/s and decompression speed is 500 MB/s using a single threaded, 64-bit Core i7 processor. The compression ratio is 20-100% lower than gzip
.
Snappy is widely used in Google projects like BigTable, MapReduce, and compression of RPC. Decompression is tested to detect any error in the compressed stream. Snappy does not use inline assembler and is portable, but optimized for 64-bit, little-endian architectures, primary x86_64.
, like huffman tree or arithmetic encoder.
The first bytes of the stream are the length of uncompressed data, stored as a little-endian varint which allows for variable-length encoding. The lower 7 bits of each byte are used for data and the first bit is a flag which tells if the next byte is used for the same integer.
The remaining bytes in the stream are encoded using one of four element types. The element type is encoded in the first byte (tag byte) of the element. The two lower bits of this byte is the type code:
The copy is referring to the dictionary (or just decompressed data). The offset is the shift from the current position back to the already decompressed stream. The length is the number of bytes to copy from dictionary. The size of the dictionary is limited by the current Snappy compressor to 32768 bytes.
will be compressed to
0000000: ca02 f042 5769 6b69 7065 6469 6120 6973 ...BWikipedia is
Length of data is 0x02ca(varint) = 0x014a = 330 bytes; 0xf042 = literal of 66+1 bytes follows
0000010: 2061 2066 7265 652c 2077 6562 2d62 6173 a free, web-bas
0000020: 6564 2c20 636f 6c6c 6162 6f72 6174 6976 ed, collaborativ
0000030: 652c 206d 756c 7469 6c69 6e67 7561 6c20 e, multilingual
0000040: 656e 6379 636c 6f09 3ff0 8170 726f 6a65 encyclo.?..proje
0x09 is tag-byte of type 01 with length of 0b10= 2 +4 = 6 and offset = 0x03f = 63 or "pedia ";
0xf081 is a literal with length of 129+1 bytes
0000050: 6374 2073 7570 706f 7274 6564 2062 7920 ct supported by
0000060: 7468 6520 6e6f 6e2d 7072 6f66 6974 2057 the non-profit W
0000070: 696b 696d 6564 6961 2046 6f75 6e64 6174 ikimedia Foundat
0000080: 696f 6e2e 2049 7473 2031 3920 6d69 6c6c ion. Its 19 mill
0000090: 696f 6e20 6172 7469 636c 6573 2028 6f76 ion articles (ov
00000a0: 6572 2033 2e36 206d 696c 6c69 6f6e 2069 er 3.6 million i
00000b0: 6e20 456e 676c 6973 6829 2068 6176 6520 n English) have
00000c0: 6265 656e 2077 7269 7474 656e 2032 ab00 been written 2..
0x32 is tag-byte of type 10 with length of 12 = and offset of 0x00ab = 171 bytes or " collaborative";
00000d0: 046c 7901 8030 766f 6c75 6e74 6565 7273 .ly..0volunteers
0x04 is literal's tag-byte with length of 1+1 = 2 or just "ly" suffix;
0x0180 is type 01 with length of 0+4 and offset of 0x080 = 128 bytes or " by "
0x30 is literal of length 12+1
00000e0: 2061 7201 7305 9268 776f 726c 642c 2061 ar.s..hworld, a
0x0173 is copy-01 with length of 0+4 and offset of 0x073 = 115 or "ound"
0x0592 is type 01 with length of 1+4 and offset of 0x092 = 146 or " the "
0x68 is literal of length 26+1
00000f0: 6e64 2061 6c6d 6f73 7420 616c 6c20 6f66 nd almost all of
0000100: 2069 7401 2804 7469 057f 2463 616e 2062 it.(.ti..$can b
0x0128 is copy-01 with length of 0+4 and offset of 0x028 = 40 or "s ar"
0x04 is a literal of length 2 or "ti"
0x057f is copy-01 with length of 0+4 and offset of 0x07f = 127 or "cles "
0x24 is literal of 9+1
0000110: 6520 6564 690d cd78 616e 796f 6e65 2077 e edi..xanyone w
0x0dcd is copy-01 with length of 0x3 + 4 = 7 and offset of 0x0cd=205 or "ted by "
0x78 is literal of 0x1e = 30 chars
0000120: 6974 6820 6163 6365 7373 2074 6f20 7468 ith access to th
0000130: 6520 7369 7465 2e e site.
With this example, 330 bytes were compressed to 311 bytes with all common substrings with 4 or more chars eliminated by the compression process. More common compressors can compress this better. With bzip2, the data will be compressed to 261 bytes by using a combination of entropy enconding and compression of the limited alphabet into the bit-stream.
Data compression
In computer science and information theory, data compression, source coding or bit-rate reduction is the process of encoding information using fewer bits than the original representation would use....
and decompression library based on LZ77's ideas developed by Google
Google
Google Inc. is an American multinational public corporation invested in Internet search, cloud computing, and advertising technologies. Google hosts and develops a number of Internet-based services and products, and generates profit primarily from advertising through its AdWords program...
. It was designed to be very fast and stable, but not to achieve a high compression ratio. Compression speed is 250 MB/s and decompression speed is 500 MB/s using a single threaded, 64-bit Core i7 processor. The compression ratio is 20-100% lower than gzip
Gzip
Gzip is any of several software applications used for file compression and decompression. The term usually refers to the GNU Project's implementation, "gzip" standing for GNU zip. It is based on the DEFLATE algorithm, which is a combination of Lempel-Ziv and Huffman coding...
.
Snappy is widely used in Google projects like BigTable, MapReduce, and compression of RPC. Decompression is tested to detect any error in the compressed stream. Snappy does not use inline assembler and is portable, but optimized for 64-bit, little-endian architectures, primary x86_64.
Stream format
Snappy encoding is not bit-oriented, but byte-oriented (only whole bytes are emitted or consumed from a stream). The format uses no entropy encoderEntropy encoding
In information theory an entropy encoding is a lossless data compression scheme that is independent of the specific characteristics of the medium....
, like huffman tree or arithmetic encoder.
The first bytes of the stream are the length of uncompressed data, stored as a little-endian varint which allows for variable-length encoding. The lower 7 bits of each byte are used for data and the first bit is a flag which tells if the next byte is used for the same integer.
The remaining bytes in the stream are encoded using one of four element types. The element type is encoded in the first byte (tag byte) of the element. The two lower bits of this byte is the type code:
- 00 - Literal - uncompressed data; upper 6 bits are used to store length of data; if the length of data is more 60 bytes, additional variable-length encoding is added
- 01 - Copy with length stored as 3 bits and offset stored as 11 bits; one byte after tag byte is used for part of offset;
- 10 - Copy with length stored as 6 bits of tag byte and offset stored as two-byte integer after the tag byte;
- 11 - Copy with length stored as 6 bits of tag byte and offset stored as four-byte little-endian integer after the tag byte;
The copy is referring to the dictionary (or just decompressed data). The offset is the shift from the current position back to the already decompressed stream. The length is the number of bytes to copy from dictionary. The size of the dictionary is limited by the current Snappy compressor to 32768 bytes.
Example of compressed stream
Wikipedia is a free, web-based, collaborative, multilingual encyclopedia project supported by the non-profit Wikimedia Foundation. Its 19 million articles (over 3.6 million in English) have been written collaboratively by volunteers around the world, and almost all of its articles can be edited by anyone with access to the site.
will be compressed to
0000000: ca02 f042 5769 6b69 7065 6469 6120 6973 ...BWikipedia is
Length of data is 0x02ca(varint) = 0x014a = 330 bytes; 0xf042 = literal of 66+1 bytes follows
0000010: 2061 2066 7265 652c 2077 6562 2d62 6173 a free, web-bas
0000020: 6564 2c20 636f 6c6c 6162 6f72 6174 6976 ed, collaborativ
0000030: 652c 206d 756c 7469 6c69 6e67 7561 6c20 e, multilingual
0000040: 656e 6379 636c 6f09 3ff0 8170 726f 6a65 encyclo.?..proje
0x09 is tag-byte of type 01 with length of 0b10= 2 +4 = 6 and offset = 0x03f = 63 or "pedia ";
0xf081 is a literal with length of 129+1 bytes
0000050: 6374 2073 7570 706f 7274 6564 2062 7920 ct supported by
0000060: 7468 6520 6e6f 6e2d 7072 6f66 6974 2057 the non-profit W
0000070: 696b 696d 6564 6961 2046 6f75 6e64 6174 ikimedia Foundat
0000080: 696f 6e2e 2049 7473 2031 3920 6d69 6c6c ion. Its 19 mill
0000090: 696f 6e20 6172 7469 636c 6573 2028 6f76 ion articles (ov
00000a0: 6572 2033 2e36 206d 696c 6c69 6f6e 2069 er 3.6 million i
00000b0: 6e20 456e 676c 6973 6829 2068 6176 6520 n English) have
00000c0: 6265 656e 2077 7269 7474 656e 2032 ab00 been written 2..
0x32 is tag-byte of type 10 with length of 12 = and offset of 0x00ab = 171 bytes or " collaborative";
00000d0: 046c 7901 8030 766f 6c75 6e74 6565 7273 .ly..0volunteers
0x04 is literal's tag-byte with length of 1+1 = 2 or just "ly" suffix;
0x0180 is type 01 with length of 0+4 and offset of 0x080 = 128 bytes or " by "
0x30 is literal of length 12+1
00000e0: 2061 7201 7305 9268 776f 726c 642c 2061 ar.s..hworld, a
0x0173 is copy-01 with length of 0+4 and offset of 0x073 = 115 or "ound"
0x0592 is type 01 with length of 1+4 and offset of 0x092 = 146 or " the "
0x68 is literal of length 26+1
00000f0: 6e64 2061 6c6d 6f73 7420 616c 6c20 6f66 nd almost all of
0000100: 2069 7401 2804 7469 057f 2463 616e 2062 it.(.ti..$can b
0x0128 is copy-01 with length of 0+4 and offset of 0x028 = 40 or "s ar"
0x04 is a literal of length 2 or "ti"
0x057f is copy-01 with length of 0+4 and offset of 0x07f = 127 or "cles "
0x24 is literal of 9+1
0000110: 6520 6564 690d cd78 616e 796f 6e65 2077 e edi..xanyone w
0x0dcd is copy-01 with length of 0x3 + 4 = 7 and offset of 0x0cd=205 or "ted by "
0x78 is literal of 0x1e = 30 chars
0000120: 6974 6820 6163 6365 7373 2074 6f20 7468 ith access to th
0000130: 6520 7369 7465 2e e site.
With this example, 330 bytes were compressed to 311 bytes with all common substrings with 4 or more chars eliminated by the compression process. More common compressors can compress this better. With bzip2, the data will be compressed to 261 bytes by using a combination of entropy enconding and compression of the limited alphabet into the bit-stream.