The Canterbury Corpus
The Canterbury corpus is a set of files named after the University of Canterbury in New Zealand, where it was developed. The files are designed for testing lossless compression algorithms against a standardized set of data. Comparing algorithms on the same data provides a way to determine which one performs better on criteria such as speed or compression ratio. The corpus is only useful, however, if algorithm designers do not use it during development and do not tune their algorithms to perform best on it.
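As a rough illustration of such a comparison, the sketch below measures compression ratio and compression time for one input. It is a minimal example under stated assumptions, not an official benchmark harness: the three codecs are simply the ones shipped with the Python standard library, and `CODECS` and `benchmark` are names made up for this example.

```python
import bz2
import lzma
import time
import zlib

# Standard-library codecs used as stand-ins for the "algorithms under test".
CODECS = {
    "zlib": (zlib.compress, zlib.decompress),
    "bz2": (bz2.compress, bz2.decompress),
    "lzma": (lzma.compress, lzma.decompress),
}


def benchmark(data: bytes) -> dict:
    """Return {codec name: (compression ratio, seconds spent compressing)}."""
    results = {}
    for name, (compress, decompress) in CODECS.items():
        start = time.perf_counter()
        packed = compress(data)
        elapsed = time.perf_counter() - start
        # Lossless check: the round trip must reproduce the input exactly.
        assert decompress(packed) == data
        # Ratio is original size divided by compressed size (higher is better).
        results[name] = (len(data) / len(packed), elapsed)
    return results


if __name__ == "__main__":
    # Placeholder input; replace with the bytes of a corpus file.
    sample = b"the quick brown fox jumps over the lazy dog\n" * 2000
    for name, (ratio, seconds) in benchmark(sample).items():
        print(f"{name:5s} ratio={ratio:8.2f} time={seconds * 1000:7.2f} ms")
```

To compare algorithms on the corpus itself, read each downloaded file as bytes and pass it to `benchmark`. Timings depend on the machine and should only be compared within a single run.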
There are five Canterbury corpus sets. Here is some information about each one, with a link for downloading the complete set:
Main Canterbury corpus
The main set of files, developed in 1997. The files were chosen because they produced typical results with the algorithms available at the time. The expectation was that future algorithms would also yield typical results on these files.
Download the full set of Canterbury corpus files
Artificial corpus
This file set was designed to test algorithms under worst-case, extreme conditions. Results on the artificial corpus say little about how an algorithm performs on realistic data; the set exists mainly to expose how an algorithm behaves on pathological input.
Download the artificial corpus files
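The kind of extreme behavior this set is meant to expose can be sketched with two synthetic inputs. These are stand-ins chosen for this example, not the corpus files themselves: one buffer holds a single repeated byte, the other holds random bytes. `ratio` is a made-up helper name, and zlib is used only because it ships with Python.

```python
import os
import zlib


def ratio(data: bytes) -> float:
    """Original size divided by zlib-compressed size at the highest level."""
    return len(data) / len(zlib.compress(data, 9))


# Highly repetitive input: one symbol repeated many times.
repetitive = b"a" * 100_000
# Random input: effectively incompressible.
random_like = os.urandom(100_000)

print(f"repetitive:  {ratio(repetitive):.1f}")
print(f"random-like: {ratio(random_like):.3f}")
```

The repetitive buffer compresses by a very large factor, while the random buffer comes out slightly larger than the input. Neither number describes everyday performance, which is why such files are kept separate from the main set.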
Large corpus
A set of relatively large files intended for compression algorithms that use a very large dictionary or otherwise need a large amount of data before they reach their best compression ratio.
Download the large corpus files
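The ramp-up effect can be illustrated by compressing progressively longer prefixes of one input and watching the ratio change. This is a sketch under stated assumptions: `make_text` and `ratio_by_prefix` are hypothetical helpers, the pseudo-text is generated from a made-up 2000-word vocabulary rather than taken from the corpus, and `lzma` stands in for a dictionary-based algorithm.

```python
import lzma
import random


def make_text(n_bytes: int, seed: int = 0) -> bytes:
    """Build deterministic pseudo-text from a fixed, made-up vocabulary."""
    rng = random.Random(seed)
    letters = "abcdefghijklmnopqrstuvwxyz"
    vocab = [
        "".join(rng.choice(letters) for _ in range(rng.randint(3, 9)))
        for _ in range(2000)
    ]
    words, size = [], 0
    # Collect words until the joined text is at least n_bytes long.
    while size <= n_bytes:
        word = rng.choice(vocab)
        words.append(word)
        size += len(word) + 1
    return " ".join(words).encode("ascii")[:n_bytes]


def ratio_by_prefix(data: bytes, sizes: list) -> dict:
    """Compression ratio of the first n bytes of data, for each n in sizes."""
    return {n: n / len(lzma.compress(data[:n])) for n in sizes}


if __name__ == "__main__":
    text = make_text(200_000)
    for n, r in ratio_by_prefix(text, [500, 5_000, 50_000, 200_000]).items():
        print(f"{n:7d} bytes  ratio={r:.2f}")
```

Short prefixes give the compressor little history to draw on and are dominated by fixed container overhead, so the ratio is lower than on the full input. Small test files can therefore understate what such an algorithm achieves on large data.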
Miscellaneous corpus
A set of files added over the years by algorithm creators and researchers.
Download the miscellaneous corpus files
Calgary corpus
An older set of files, developed in the late 1980s, that was long the de facto standard for comparing compression algorithms. The Calgary corpus has since been superseded by the Canterbury corpus.
Download the Calgary corpus files