Calgary Corpus
The Calgary Corpus is a collection of text and binary data files, commonly used for comparing data compression algorithms. It was created by Ian Witten, Tim Bell and John Cleary at the University of Calgary in 1987 and was widely used through the 1990s. In 1997 it was superseded by the Canterbury Corpus, but the Calgary Corpus remains available for comparison and is still useful for its original purpose.
Contents
In its most commonly used form, the corpus consists of 14 files totaling 3,141,622 bytes, as follows.

Size (bytes) | File name | Description |
---|---|---|
111,261 | BIB | ASCII text in UNIX "refer" format - 725 bibliographic references. |
768,771 | BOOK1 | unformatted ASCII text - Thomas Hardy: Far from the Madding Crowd. |
610,856 | BOOK2 | ASCII text in UNIX "troff" format - Witten: Principles of Computer Speech. |
102,400 | GEO | 32 bit numbers in IBM floating point format - seismic data. |
377,109 | NEWS | ASCII text - USENET batch file on a variety of topics. |
21,504 | OBJ1 | VAX executable program - compilation of PROGP. |
246,814 | OBJ2 | Macintosh executable program - "Knowledge Support System". |
53,161 | PAPER1 | UNIX "troff" format - Witten, Neal, Cleary: Arithmetic Coding for Data Compression. |
82,199 | PAPER2 | UNIX "troff" format - Witten: Computer (in)security. |
513,216 | PIC | 1728 x 2376 bitmap image (MSB first): text in French and line diagrams. |
39,611 | PROGC | Source code in C - UNIX compress v4.0. |
71,646 | PROGL | Source code in Lisp - system software. |
49,379 | PROGP | Source code in Pascal - program to evaluate PPM compression. |
93,695 | TRANS | ASCII and control characters - transcript of a terminal session. |
There is also a less commonly used 18 file version, which adds four more text files in UNIX "troff" format, PAPER3 through PAPER6.
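For illustration, the following minimal sketch checks a local copy of the 14-file corpus against the sizes listed above. The directory name "calgary" and the lowercase file names are assumptions about how a particular copy was obtained, not part of the corpus itself.

```python
# Minimal sketch: verify a local copy of the 14-file corpus against the sizes
# listed in the table above. Directory name and lowercase file names are assumptions.
from pathlib import Path

expected = {
    "bib": 111_261, "book1": 768_771, "book2": 610_856, "geo": 102_400,
    "news": 377_109, "obj1": 21_504, "obj2": 246_814, "paper1": 53_161,
    "paper2": 82_199, "pic": 513_216, "progc": 39_611, "progl": 71_646,
    "progp": 49_379, "trans": 93_695,
}
assert sum(expected.values()) == 3_141_622  # total stated above

for name, size in expected.items():
    actual = (Path("calgary") / name).stat().st_size
    status = "ok" if actual == size else f"mismatch ({actual:,d} bytes)"
    print(f"{name:8s} {size:9,d}  {status}")
```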
Benchmarks
The Calgary corpus was a commonly used benchmark for data compression in the 1990s. Results were most commonly listed in bits per byte (bpb) for each file and then summarized by averaging. More recently, it has been common to simply add the compressed sizes of all of the files. This is called a weighted average because it is equivalent to weighting the compression ratios by the original file sizes. The UCLC benchmark by Johan de Bock uses this method.

For some data compressors it is possible to compress the corpus smaller by combining the inputs into an uncompressed archive (such as a tar
file) before compression because of mutual information
between the text files. In other cases, the compression is worse because the compressor handles nonuniform statistics poorly. This method was used in a benchmark in the online book Data Compression Explained by Matt Mahoney http://mattmahoney.net/dc/dce.html#Section_214.
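As a minimal illustration of the weighted-average point above, summing compressed sizes gives the same overall ratio as weighting each file's ratio by its share of the original bytes. In this sketch only the original sizes come from the corpus; the compressed sizes are made up.

```python
# Made-up compressed sizes for three corpus files; only the original sizes are real.
orig = {"BOOK1": 768_771, "GEO": 102_400, "PIC": 513_216}
comp = {"BOOK1": 230_000, "GEO": 68_000, "PIC": 52_000}

total_orig, total_comp = sum(orig.values()), sum(comp.values())

# Overall ratio obtained by simply adding the compressed sizes.
overall = total_comp / total_orig

# The same number, written as per-file ratios weighted by original file size.
weighted = sum((orig[f] / total_orig) * (comp[f] / orig[f]) for f in orig)

assert abs(overall - weighted) < 1e-12
print(f"{overall:.4f} = {weighted:.4f}  ({8 * overall:.3f} bits per byte)")
```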
The table below shows the compressed sizes of the 14-file Calgary corpus using both methods for some popular compression programs. Where options are shown, they select maximum compression. For more complete results, see the benchmarks mentioned above.
Compressor | Options | As 14 separate files | As a tar file |
---|---|---|---|
Uncompressed | | 3,141,622 | 3,152,896 |
compress | | 1,272,772 | 1,319,521 |
Info-ZIP 2.32 | -9 | 1,020,781 | 1,023,042 |
gzip 1.3.5 | -9 | 1,017,624 | 1,022,810 |
bzip2 1.0.3 | -9 | 828,347 | 860,097 |
7-zip 9.12b | | 848,687 | 824,573 |
ppmd Jr1 | -m256 -o16 | 740,737 | 754,243 |
ppmonstr J | | 675,485 | 669,497 |
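The two measurement methods can be sketched as follows. This is only an illustration: the directory name "calgary" and the choice of Python's bz2 module are assumptions, and its output is not expected to reproduce the bzip2 1.0.3 figures above exactly.

```python
# Minimal sketch of the two benchmark methods in the table above: compressing the
# 14 files separately versus packing them into an uncompressed tar file first.
import bz2
import io
import tarfile
from pathlib import Path

files = sorted(p for p in Path("calgary").iterdir() if p.is_file())

# Method 1: compress each file on its own and add up the compressed sizes.
separate = sum(len(bz2.compress(p.read_bytes(), 9)) for p in files)

# Method 2: build an uncompressed tar archive in memory, then compress it once,
# letting the compressor exploit mutual information between the files.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    for p in files:
        tar.add(p, arcname=p.name)
combined = len(bz2.compress(buf.getvalue(), 9))

print(f"as separate files: {separate:,} bytes")
print(f"as a tar file:     {combined:,} bytes")
```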
Compression Challenge
The Calgary corpus Compression and SHA-1 crack Challenge http://mailcom.com/challenge/ is a contest started by Leonid A. Broukhis on May 21, 1996 to compress the 14-file version of the Calgary corpus. The contest offers a small cash prize which has varied over time. Currently the prize is US$1 per 111-byte improvement over the previous result.

According to the rules of the contest, an entry must consist of both the compressed data and the decompression program packed into one of several standard archive formats. Time and memory limits, archive formats, and decompression languages have been relaxed over time. Currently the program must run within 24 hours on a 2000 MIPS machine under Windows or Linux
and use less than 800 MB of memory. A SHA-1 challenge was later added: it allows the decompression program to output files that differ from the Calgary corpus, as long as they hash to the same values as the original files. So far, that part of the challenge has not been met.
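The SHA-1 acceptance test can be pictured with a small sketch: an entry's output files are compared to the originals only by their SHA-1 digests, not byte for byte. The directory names here are hypothetical placeholders, not part of the contest.

```python
# Minimal sketch of the SHA-1 acceptance test: output files pass as long as their
# SHA-1 digests match those of the original corpus files.
import hashlib
from pathlib import Path

def sha1(path: Path) -> str:
    return hashlib.sha1(path.read_bytes()).hexdigest()

originals = sorted(Path("calgary").iterdir())
outputs = sorted(Path("output").iterdir())

ok = len(originals) == len(outputs) and all(
    sha1(a) == sha1(b) for a, b in zip(originals, outputs)
)
print("accepted" if ok else "rejected")
```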
The first entry received was 759,881 bytes in September 1997 by Malcolm Taylor (author of RK and WinRK). The most recent entry was 580,170 bytes by Alexander Ratushnyak on July 2, 2010. That entry consists of a compressed file of 572,465 bytes and a decompression program written in C++ and compressed to 7,700 bytes as a PPMd var. I archive, plus 5 bytes for the compressed file name and size. The history is as follows.
Size (bytes) | Month/year | Author |
---|---|---|
759,881 | 09/1997 | Malcolm Taylor |
692,154 | 08/2001 | Maxim Smirnov |
680,558 | 09/2001 | Maxim Smirnov |
653,720 | 11/2002 | Serge Voskoboynikov |
645,667 | 01/2004 | Matt Mahoney |
637,116 | 04/2004 | Alexander Ratushnyak |
608,980 | 12/2004 | Alexander Ratushnyak |
603,416 | 04/2005 | Przemysław Skibiński |
596,314 | 10/2005 | Alexander Ratushnyak |
593,620 | 12/2005 | Alexander Ratushnyak |
589,863 | 05/2006 | Alexander Ratushnyak |
580,170 | 07/2010 | Alexander Ratushnyak |
Since 2004, all submissions have been variants of PAQ and are submitted as source code licensed under the GPL.